Table of Contents
Transformers
Architectural Design
Complete Model Architecture
Self-Attention (Mask optional)
- Dot-product attention is similar to additive attention in theoretical complexity, but it is much faster and more space-efficient in practice because it can be implemented with highly optimized matrix-multiplication code (a minimal sketch follows).
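A minimal NumPy sketch of scaled dot-product attention with an optional mask, added here for illustration; the function name, tensor shapes, and the mask convention (True = blocked position) are my own assumptions, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v).
    mask: optional boolean array broadcastable to (..., seq_q, seq_k);
          True marks positions that should be blocked (assumption).
    """
    d_k = Q.shape[-1]
    # Similarity scores, scaled by sqrt(d_k) to keep their variance near 1.
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a large negative score so softmax gives them ~0 weight.
        scores = np.where(mask, -1e9, scores)
    # Softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```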
Multi-Head Attention
- Also, for large $d_k$ the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients. We scale the scores specifically by $\sqrt{d_k}$ because, if the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$, their dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean $0$ and variance exactly $d_k$; dividing by $\sqrt{d_k}$ brings the variance back to $1$, which is why no other constant is used (checked empirically in the sketch below).
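A quick NumPy check of that variance claim; this is an illustrative experiment added here, not code from the paper, and the dimension and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
n_samples = 100_000

# Components of q and k drawn i.i.d. with mean 0 and variance 1.
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

dots = np.sum(q * k, axis=-1)      # raw dot products
scaled = dots / np.sqrt(d_k)       # scaled as in the attention scores

print(f"var(q.k)             ~ {dots.var():.1f}   (expected ~{d_k})")
print(f"var(q.k / sqrt(d_k)) ~ {scaled.var():.2f}  (expected ~1)")
```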
Normalization
- Batch Normalization
- A method that normalizes activations across a mini-batch of fixed size, then applies a learned per-feature scale $\gamma$ and shift $\beta$.
- [Problem 1] → If the batch size is too small, the batch statistics become noisy. Also, in distributed training the per-device batch size needs to stay the same (or the statistics must be synchronized), otherwise each device normalizes with different mean and variance estimates.
- [Problem 2] → Does not work well for RNNs, since there is a separate activation at every time-step and sequence lengths vary, so per-time-step batch statistics are hard to maintain.
- Weight Normalization
- It normalizes the weights of a layer instead of the activations, reparameterizing each weight vector into a direction and a magnitude.
- A good practice is to combine mean-only Batch-Norm with Weight-Norm.
- Layer Normalization — used in this paper
- Performs normalization across the feature dimension instead of across the mini-batch.
- It works especially well for RNNs, where batch-norm struggles (see the sketch after this list contrasting the three approaches).
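To make the axis difference concrete, here is a minimal NumPy sketch of the three normalizations discussed above, using training-time statistics only; the function names, the `mean_only` flag, and the epsilon value are my own illustrative choices.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5, mean_only=False):
    """Normalize over the batch axis (axis 0). x: (batch, features)."""
    mu = x.mean(axis=0, keepdims=True)
    if mean_only:
        # Mean-only Batch-Norm: subtract the batch mean, skip the variance term.
        x_hat = x - mu
    else:
        var = x.var(axis=0, keepdims=True)
        x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # learned per-feature scale and shift

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature axis (axis -1), independently per example."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def weight_norm(v, g):
    """Reparameterize a weight matrix: w = g * v / ||v||, per output unit.
    v: (out_features, in_features), g: (out_features, 1)."""
    return g * v / np.linalg.norm(v, axis=1, keepdims=True)

# Layer-norm statistics depend only on each example's own features,
# so they are unaffected by batch size or by the other examples in the batch.
x = np.random.randn(4, 8)
print(layer_norm(x, np.ones(8), np.zeros(8)).std(axis=-1))  # ~1 per row
```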
Positional Encodings
$$
PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)
$$
- The sinusoidal encoding (above) performs nearly identically to learned positional embeddings, but it may extrapolate better to sequence lengths longer than those seen during training (a small sketch follows).
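A small NumPy sketch that builds the sinusoidal table directly from the two formulas above; the function name and the `(max_len, d_model)` output shape are my own conventions, and `d_model` is assumed even.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table with
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```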
Self-Attention