Writing down some notes while trying to understand these 2 things.
Some prerequisites
Optional:
Neural Networks Visualized Video
A blog post about LSTM
These two cover earlier ideas: what a neural network is and how an LSTM improves on it.
Not directly related to Attention and Transformer.
Required:
Word embeddings. No need to go deep. Just know that each word’s meaning is represented as a vector in a vector space (an embedding matrix maps words to those vectors).
When 2 words are similar, we expect them to be close in the vector space.
Take 10 minutes to watch the word embedding section: https://www.youtube.com/watch?v=wjZofJX0v4M
Attention
I watched this Attention video: https://www.youtube.com/watch?v=eMlx5fFNoYc
To put it simply, attention is a mechanism that updates the embedding of each word (token).
The input to attention is a sequence of tokens with their embeddings, which represent their meanings.
Attention then runs a process to update those embeddings so they carry rich contextual information.
Each attention has a few pre-trained parameter matrices: the Query matrix (Wq), the Key matrix (Wk), and the Value matrix (Wv).
When we want to update token xi’s embedding, the process is (a small numpy sketch follows the list):
- xi · Wq to generate a query Qi
- for every token xj, calculate xj · Wk to generate the key Kj
- find the relevance (weight) between xi and xj by calculating the dot product of the two vectors Qi and Kj
- use softmax to normalize the values
- xj · Wv = Vj. The sum of the Vj, weighted by the normalized weights (0~1), is what we add back to xi as the update
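To make this concrete, here is a minimal numpy sketch of the update described above. All sizes and matrices are random placeholders (in a real model Wq, Wk, Wv are learned during training), and the value dimension is kept equal to the embedding dimension so the result can be added straight back to xi.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy sizes and random matrices, just to show the shapes.
n_tokens, d_model, d_head = 4, 8, 4
rng = np.random.default_rng(0)

X  = rng.normal(size=(n_tokens, d_model))    # token embeddings, one row per token
Wq = rng.normal(size=(d_model, d_head))      # Query matrix (learned in a real model)
Wk = rng.normal(size=(d_model, d_head))      # Key matrix
Wv = rng.normal(size=(d_model, d_model))     # Value matrix (kept at d_model so Vj adds to xi)

Q = X @ Wq                       # Qi: what does token i need from the others?
K = X @ Wk                       # Kj: how relevant is token j to a query?
V = X @ Wv                       # Vj: what does token j give to tokens that attend to it?

scores  = Q @ K.T / np.sqrt(d_head)   # relevance of every token pair (scaling by sqrt(d) is standard)
weights = softmax(scores, axis=-1)    # normalize each row to sum to 1
update  = weights @ V                 # weighted sum of the Vj, one row per token

X_updated = X + update           # add the update back to each xi
print(X_updated.shape)           # (4, 8)
```

Real implementations usually project the values down and back up (as in the multi-head case later), but this keeps the idea simple.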
Semantic meaning
Qi (Query) → To update token xi, what information do we need from other tokens?
Kj (Key) → When other tokens ask me to evaluate how relevant I am to them, how do I respond?
xj · Wv = Vj (Value) → When other tokens want to update themselves using my embedding, what do I give them?
Wq, Wk, Wv are pretrained parameters and are the same for updating all tokens

[screenshot from 3Blue1Brown]
This is roughly how attention works.
Note that the above is called self-attention (in a Transformer, most of the attention is self-attention).
Let’s say we have 2 token sequences (a1~a3, b1~b3) with their own Wq, Wk, Wv.
If we calculate the relevance between queries from the a-tokens (a · Wq) and keys from the b-tokens (b · Wk), this is cross-attention.
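A rough numpy sketch of the difference (toy sizes and random matrices again): the queries come from the a-tokens, while the keys and values come from the b-tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4                      # toy sizes, same as the self-attention sketch
A = rng.normal(size=(3, d_model))           # tokens a1~a3
B = rng.normal(size=(3, d_model))           # tokens b1~b3
Wq = rng.normal(size=(d_model, d_head))     # learned in a real model
Wk = rng.normal(size=(d_model, d_head))
Wv = rng.normal(size=(d_model, d_model))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

Q = A @ Wq                                  # queries come from sequence a
K = B @ Wk                                  # keys come from sequence b
V = B @ Wv                                  # values come from sequence b
weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
A_updated = A + weights @ V                 # a-tokens pull information from b-tokens
```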
Transformer
Input: Embedding + Positional encoding
Transformer = (Self-Attention + Feed-Forward Network) × N
N stacked layers
A Transformer layer:
4 things
1. Multi-Head Self-Attention
Each head has its own Query, Key, and Value matrices.
Assume the input embedding has 100 dimensions and there are 5 heads.
The three matrices in each head project the 100 dimensions down to 20.
Each head does what attention does to a token xᵢ:
Use Wq, Wk, Wv to calculate the updated xi (20 dimensions)
Please check Attention section for details
Concatenate the results from the different heads, then apply a transformation to project them back to the original dimension (100).
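A sketch of the shape bookkeeping with the 100-dimension / 5-head example (random matrices stand in for the learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 6, 100, 5
d_head = d_model // n_heads                      # 100 / 5 = 20 dims per head

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

X = rng.normal(size=(n_tokens, d_model))         # input embeddings

head_outputs = []
for _ in range(n_heads):
    # Each head has its own Wq, Wk, Wv projecting 100 dims down to 20.
    Wq = rng.normal(size=(d_model, d_head))
    Wk = rng.normal(size=(d_model, d_head))
    Wv = rng.normal(size=(d_model, d_head))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
    head_outputs.append(weights @ V)             # (n_tokens, 20) per head

concat = np.concatenate(head_outputs, axis=-1)   # (n_tokens, 100) after concatenation
Wo = rng.normal(size=(d_model, d_model))         # output projection back to 100 dims
out = concat @ Wo
print(out.shape)                                 # (6, 100)
```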
2. Add & Norm
- The output from step 1 + the original xᵢ
- Do layer normalization (subtract the mean, divide by the std)
- This keeps the numbers from overflowing; each layer adjusts the representation bit by bit.
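In numpy, "Add & Norm" looks roughly like this (the learned scale and shift parameters of layer normalization are omitted for simplicity, and the inputs are random placeholders):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization: subtract the mean, divide by the std
    # (learned scale/shift parameters omitted for simplicity).
    mean = x.mean(axis=-1, keepdims=True)
    std  = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

x        = np.random.randn(6, 100)    # original token embeddings (toy)
sublayer = np.random.randn(6, 100)    # output of the attention sub-layer (toy)
out      = layer_norm(x + sublayer)   # residual connection + normalization = "Add & Norm"
```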
3. Feed-Forward Network (FFN)
Think of it this way
- Attention: updates information using context from the other tokens
- FFN: unlike attention, it does not take context into consideration. It is just a "business logic" transformation applied to each token independently. Something like FFN(x) = W2·σ(W1·x + b1) + b2
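Writing out that formula with numpy, using ReLU as σ and made-up sizes (the hidden layer is typically wider than the model dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 100, 400            # toy sizes; the hidden layer is usually wider
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

def ffn(x):
    # FFN(x) = W2 * sigma(W1 x + b1) + b2, applied to each token independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

tokens = rng.normal(size=(6, d_model))  # 6 tokens, 100 dims each
out = ffn(tokens)                       # same shape as the input: (6, 100)
```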
4. Another Add & Norm
Stable!
Why multiple layers
Multiple layers let the model think multiple times about the same sentence.
Simple example:
The book that the student recommended is expensive.
One layer can:
- See that “recommended” relates to “book”
- Mix information across tokens
Two+ layers can:
- Resolve who recommended what
- Ignore the distracting phrase “that the student…”
- Correctly link “is expensive” → “book”
Each layer refines the understanding.
Encoder / Decoder?
Encoder vs Decoder: can it peek at the future?
The visibility of attention
Encoder (BERT / Embedding)
- When updating a token, use context from everywhere
- To understand
- search / RAG
Decoder (GPT)
- When updating a token, use context only from tokens before it
- To generate the next token
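The difference is just a mask on the attention weights: the decoder sets the scores for future positions to -inf before the softmax, so their weights become 0. A toy numpy sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n_tokens = 4
scores = np.random.randn(n_tokens, n_tokens)   # raw query-key scores (toy)

# Decoder (causal) mask: token i may only attend to tokens j <= i.
causal_mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
masked_scores = np.where(causal_mask, -np.inf, scores)

print(softmax(scores))         # encoder-style: every token sees every token
print(softmax(masked_scores))  # decoder-style: weights on future tokens become 0
```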
How to use the output
GPT (Decoder-only)
- GPT updates tokens using only past context, then takes the final token’s representation. That last token goes through a linear layer and softmax to predict the next token.
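A sketch of that last step with a made-up vocabulary size and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 100, 1000         # toy sizes
hidden = rng.normal(size=(6, d_model))  # final-layer representations for 6 tokens

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

W_out = rng.normal(size=(d_model, vocab_size))  # linear layer ("unembedding")
logits = hidden[-1] @ W_out                     # only the last token's representation is used
probs = softmax(logits)                         # probability of each vocabulary item being next
next_token_id = probs.argmax()                  # greedy choice (real models often sample instead)
```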
BERT (Encoder-only)
- BERT updates every token by looking at the full input context, so each token encodes the complete meaning of the sentence.
- These contextualized representations are ideal for understanding tasks like classification, search, and embeddings.
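For example, one common (though not the only) way to turn these token representations into a single sentence embedding for search is mean pooling; the sizes below are placeholders:

```python
import numpy as np

token_reprs = np.random.randn(12, 768)         # encoder output: 12 tokens x 768 dims (toy)
sentence_embedding = token_reprs.mean(axis=0)  # mean pooling over tokens
# Cosine similarity between two such embeddings can then be used for search.
```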