Table of Contents


Transformers

Architectural Design

Complete Model Architecture

Self-Attention (with optional mask)

Multi-Head Attention

Normalization

Positional Encodings

$$ PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right) $$

$$ PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right) $$
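The following is a minimal sketch of the sinusoidal encoding defined by the two formulas above; the function name and the parameters `max_len` and `d_model` are illustrative choices (assuming `d_model` is even), not taken from the document.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]        # positions 0..max_len-1, shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # dimension index i, shape (1, d_model // 2)
    angle = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dims: sin(pos / 10000^(2i/d_model))
    pe[:, 1::2] = np.cos(angle)              # odd dims:  cos(pos / 10000^(2i/d_model))
    return pe

# Example: encodings for 50 positions in a 512-dimensional model
pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```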

Self-Attention