Transformers

Architectural Design

Complete Model Architecture

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/ba73e421-c5d1-4a7c-ac6e-8a892b41cf19/Untitled.png

Self-Attention (Mask optional)

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/9bf5277c-6516-44da-8b11-9a9354c11e5a/Untitled.png
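
The figure isn't reproduced in text, but the operation it depicts is the standard scaled dot-product attention, $\mathrm{softmax}(QK^T/\sqrt{d_k})V$, with an optional mask that blocks certain positions (e.g. future tokens in the decoder). Below is a minimal NumPy sketch of that computation; the function name, shapes, and the `-1e9` masking constant are illustrative assumptions, not taken from the figure.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K: arrays of shape (seq_len, d_k); V: (seq_len, d_v).
    mask: optional boolean array of shape (seq_len, seq_len); positions that
          are False are blocked (e.g. future tokens in the decoder).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len) similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Example: causal (look-ahead) mask for a decoder of length 4
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
causal_mask = np.tril(np.ones((4, 4), dtype=bool))
out, attn = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```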

Multi-Head Attention

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/c7e92a8d-90ea-4147-887f-24d139398e82/Untitled.png
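
As a rough sketch of what the figure shows: the input is projected into several heads, each head runs scaled dot-product attention independently, and the heads are concatenated and projected back to `d_model`. The weight matrices and names below are illustrative stand-ins for learned parameters, assuming a single sequence and no mask.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head attention over a single sequence x of shape (seq_len, d_model).

    Wq, Wk, Wv, Wo: (d_model, d_model) projection matrices.
    Each head attends over a d_model // num_heads slice of the projections.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Linear projections, then split the feature dimension into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Scaled dot-product attention independently in every head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ V                            # (heads, seq, d_head)

    # Concatenate heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Example with random weights standing in for learned parameters
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads).shape)  # (5, 16)
```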

Normalization
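
Assuming this heading refers to the layer normalization used in the Transformer's "Add & Norm" sub-layers, here is a minimal sketch: each position is normalized across its feature dimension, then scaled and shifted by learned parameters. The names `gamma` and `beta` are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Layer normalization over the feature (last) dimension.

    Each position is normalized to zero mean and unit variance across its
    d_model features, then scaled and shifted by learned gamma and beta.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# In the Transformer this appears in the residual "Add & Norm" step:
#   out = layer_norm(x + sublayer(x), gamma, beta)
d_model = 8
x = np.random.default_rng(0).normal(size=(4, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)
print(layer_norm(x, gamma, beta).mean(axis=-1))  # ~0 for every position
```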

Positional Encodings

$$ PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right) $$

$$ PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right) $$
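
These two formulas can be implemented directly: even feature indices get the sine term and odd indices the cosine term, with the angle frequency decreasing geometrically across dimensions. A minimal NumPy sketch (assuming an even `d_model`):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings of shape (max_len, d_model).

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# The encodings are added to the token embeddings before the first layer.
pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```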

Self-Attention

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/72e3148e-7648-4750-9988-260679f79021/Untitled.png