Table of Contents
Transformers
Architectural Design
Complete Model Architecture
Self-Attention (Mask optional)
- Dot-product attention is similar to additive attention in theoretical complexity, but it is much faster and more space-efficient in practice because it can be implemented with highly optimized matrix-multiplication code (a minimal sketch follows).
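A minimal NumPy sketch of scaled dot-product attention with an optional mask, added here for illustration; the function name, tensor shapes, and the mask convention (True = blocked position) are my own assumptions, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v).
    mask: optional boolean array broadcastable to (..., seq_q, seq_k);
          True marks positions that should be blocked (assumption).
    """
    d_k = Q.shape[-1]
    # Similarity scores, scaled by sqrt(d_k) to keep their variance near 1.
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a large negative score so softmax gives them ~0 weight.
        scores = np.where(mask, -1e9, scores)
    # Softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```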
Multi-Head Attention
- Also, for large $d_k$ the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients. We scale the scores specifically by $\sqrt{d_k}$ because, if the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$, their dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean $0$ and variance exactly $d_k$; dividing by $\sqrt{d_k}$ brings the variance back to $1$, which is why no other constant is used (checked empirically in the sketch below).
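A quick NumPy check of that variance claim; this is an illustrative experiment added here, not code from the paper, and the dimension and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
n_samples = 100_000

# Components of q and k drawn i.i.d. with mean 0 and variance 1.
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

dots = np.sum(q * k, axis=-1)      # raw dot products
scaled = dots / np.sqrt(d_k)       # scaled as in the attention scores

print(f"var(q.k)             ~ {dots.var():.1f}   (expected ~{d_k})")
print(f"var(q.k / sqrt(d_k)) ~ {scaled.var():.2f}  (expected ~1)")
```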
Normalization
- Batch Normalization
- A method that normalizes activations across a mini-batch of fixed size, then applies a learned per-feature scale $\gamma$ and shift $\beta$.
- [Problem 1] → If the batch size is too small, the batch statistics become noisy. Also, in distributed training the per-device batch size needs to stay the same (or the statistics must be synchronized), otherwise each device normalizes with different mean and variance estimates.
- [Problem 2] → Does not work well for RNNs, since there is a separate activation at every time-step and sequence lengths vary, so per-time-step batch statistics are hard to maintain.
- Weight Normalization
- It normalizes the weights of a layer instead of the activations, reparameterizing each weight vector into a direction and a magnitude.
- A good practice is to combine mean-only Batch-Norm with Weight-Norm.
- Layer Normalization — used in this paper
- Performs normalization across the feature dimension instead of across the mini-batch.
- It works especially well for RNNs, where batch-norm struggles (see the sketch after this list contrasting the three approaches).
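To make the axis difference concrete, here is a minimal NumPy sketch of the three normalizations discussed above, using training-time statistics only; the function names, the `mean_only` flag, and the epsilon value are my own illustrative choices.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5, mean_only=False):
    """Normalize over the batch axis (axis 0). x: (batch, features)."""
    mu = x.mean(axis=0, keepdims=True)
    if mean_only:
        # Mean-only Batch-Norm: subtract the batch mean, skip the variance term.
        x_hat = x - mu
    else:
        var = x.var(axis=0, keepdims=True)
        x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # learned per-feature scale and shift

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature axis (axis -1), independently per example."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def weight_norm(v, g):
    """Reparameterize a weight matrix: w = g * v / ||v||, per output unit.
    v: (out_features, in_features), g: (out_features, 1)."""
    return g * v / np.linalg.norm(v, axis=1, keepdims=True)

# Layer-norm statistics depend only on each example's own features,
# so they are unaffected by batch size or by the other examples in the batch.
x = np.random.randn(4, 8)
print(layer_norm(x, np.ones(8), np.zeros(8)).std(axis=-1))  # ~1 per row
```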
Positional Encodings
$$
PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)
$$
- The sinusoidal encoding (above) performs nearly identically to learned positional embeddings, but it may extrapolate better to sequence lengths longer than those seen during training (a small sketch follows).
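A small NumPy sketch that builds the sinusoidal table directly from the two formulas above; the function name and the `(max_len, d_model)` output shape are my own conventions, and `d_model` is assumed even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table with
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```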
Self-Attention