AI Notes: LLM Inference
Here is a copy of my notes from when I was learning about LLM structure. They're pretty raw, but I look back at them from time to time, so I've posted them here in the hope that other people might find them useful. No attempt has been made to make them accessible to the lay person.
- Words are converted to tokens
- Tokens are analogous to Unicode code points (lookup table)
- Just integers
- Could look up the next step (embeddings) directly, but storage would be unnecessarily large
- Taken from a dictionary of the total vocabulary
- Can be organized by root words, affixes, punctuation and related elements
- Some systems use statistically determined elements
- Dictionaries can be custom
- Can use common/popular existing vocabs
- Model families often share the same tokenizer and codepoints
- Can also fall back to UTF-8 encodings for unknown words
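The lookup-table idea with a UTF-8 byte fallback can be sketched as follows. The vocabulary here is a hypothetical three-word toy, not any real tokenizer's dictionary:

```python
# Toy tokenizer: known words map to integer ids; unknown words
# fall back to UTF-8 bytes, offset past the end of the word vocabulary.
VOCAB = {"the": 0, "cat": 1, "sat": 2}
BYTE_OFFSET = len(VOCAB)  # byte tokens live above the word tokens

def encode(text):
    tokens = []
    for word in text.split(" "):
        if word in VOCAB:
            tokens.append(VOCAB[word])
        else:
            # UTF-8 fallback: one token per byte of the unknown word
            tokens.extend(BYTE_OFFSET + b for b in word.encode("utf-8"))
    return tokens

print(encode("the cat"))  # known words -> single ids: [0, 1]
print(encode("the dog"))  # "dog" falls back to three byte tokens
```

Real tokenizers (BPE and friends) merge statistically common byte sequences rather than splitting on spaces, but the output is the same kind of thing: a list of integers.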
- Tokens are used to create embeddings
- Embeddings are the direct representation of tokens in residual latent space at the lowest level of abstraction (lookup table)
- The token to embedding dictionary is created during training and is specific to the model (as is the latent space)
- Semantic vector (what an embedding is) is a point in n-dimensional latent space, sometimes referred to as a direction
- Vector dimensionality in smaller models is ~768; in large ones, more like ~16,000
- Multi-head attention is principally responsible for larger model dimensions; more space is given so each head can have ~768 dimensions to work with
- In some models, absolute token position is added here (not used in this example)
- The final input tensor is typically [batch, seq_len, d_model]
- batch is which conversation is being processed
- A model can process numerous conversations simultaneously via this structure
- seq_len is the length of the token sequence in the conversation
- d_model denotes a semantic vector (the label shows that this vector is the full model size; ~768, etc)
- This structure is appended to both by the user(s) and the output of each inference invocation
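The embedding lookup and the resulting input tensor shape can be sketched with NumPy. The sizes are tiny stand-ins for the real ones (vocab_size ~50k, d_model ~768+):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8  # toy stand-ins for the real sizes

# The token-to-embedding dictionary, created during training
embedding_table = rng.normal(size=(vocab_size, d_model))

# Two conversations (batch=2), each a sequence of 3 token ids
token_ids = np.array([[4, 7, 2],
                      [9, 9, 1]])

# A plain lookup, no math: each id selects a row of the table
x = embedding_table[token_ids]
print(x.shape)  # [batch, seq_len, d_model] -> (2, 3, 8)
```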
- Multi-head self-attention
- Input tensor is projected into three different coordinate spaces using matrices created during training
- Q: (Query) An ambiguity map: what needs to be understood about the input token
- (-Color, -Size, -Metallic)
- Can be of different (usually smaller) dimensions than latent space, but must be a multiple of the number of heads
- K: (Key) A certainty map: what is concretely understood about the input token
- (+Color, +Size, +Metallic)
- Basically metadata about the input vector
- Same dimension rules as Q
- V: (Value) The actual payload
- (Red, Small, Yes)
- In practice, should be same dimensions as model; it is not the job of the MHSA to do compression or expansion
- Kind of like positively/negatively charged particles, the model wants Q+K=0 or >0 (the needs are met by supply)
- If there is a match, the output value will be high (meaning that these inputs should attend to one another; opposites attract)
- Q, K and V are reshaped to [batch, num_heads, seq_len, d_head]
- d_head is a fragment of the whole semantic vector (size is d_model / num_heads)
- Each head will consider the data from separate perspectives (grammar, syntax, meaning, etc)
- What is being considered will change as the model moves up through levels of abstraction
- Input vector positions are used to lookup and add learned per-head positional bias
- The model produces these lookup tables as part of training
- Best to do it here so there is no risk of positional data being lost when the vector is split
- Many models do not use this approach
- Q and K (which are segmented by head) are multiplied to produce a matrix that encodes the relationships between all input vectors
- Encodes what input vectors can supply what information to other input vectors that need that data
- The shape of the matrix is a square: seq_len x seq_len. The complete structure is: [batch, num_heads, seq_len, seq_len]
- The result is scaled by the square root of the head width to compress the output variance such that softmax does not produce vanishing gradients
- The results are softmaxed row-wise to normalize the values
- The resulting structure is the attention_matrix
- The attention matrix is multiplied by the values to create the un-projected output matrix
- This step discards the explicit per-head relationships between inputs (the attention matrix itself); graph models keep it, and it is useful in debugging
- The results are then concatenated back into the model’s dimensions and then projected back into the model’s latent space via a learned output matrix
- The values here also encode, pretty strongly, how important each input is to processing (an importance heat map)
- The output of this block (a kind of semantic diff) is normalized and then added to the input to produce the final value
- This diff is how the meaning of the input vector changes in light of other input vectors
- Some models normalize before the block
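The whole MHSA block can be sketched in NumPy. This is a minimal version of the steps above, under stated simplifications: no causal masking, no positional bias, and the normalization step is omitted. Weight matrices are random stand-ins for learned ones:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, Wq, Wk, Wv, Wo, num_heads):
    """Minimal multi-head self-attention sketch (no mask, bias, or norm)."""
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project into the three coordinate spaces, then split into heads:
    # [batch, seq_len, d_model] -> [batch, num_heads, seq_len, d_head]
    def split(t):
        return t.reshape(batch, seq_len, num_heads, d_head).transpose(0, 2, 1, 3)
    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)

    # Q.K^T gives a seq_len x seq_len relevance matrix per head,
    # scaled by sqrt(d_head) so softmax does not saturate
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)  # row-wise normalization

    out = attn @ v                   # weighted mix of the value payloads
    # Concatenate heads back to d_model, project, and add to the input
    out = out.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
    return x + out @ Wo              # residual add: input + semantic diff

rng = np.random.default_rng(0)
d_model = 8
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
x = rng.normal(size=(2, 3, d_model))          # [batch, seq_len, d_model]
y = mhsa(x, Wq, Wk, Wv, Wo, num_heads=2)
print(y.shape)                                 # shape preserved: (2, 3, 8)
```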
- Multi-layer perceptron/Feed forward block
- The input vectors are multiplied by a set of weights in preparation for activation
- These weights identify important combinations of features (feature detectors)
- Each layer has its own latent space with an expanded dimension (usually 4x)
- The vector is destructively projected into the new combinatorial latent space
- Bias is added and the layer is activated (element-wise across the tensor)
- The vectors are then multiplied by a second set of weights
- These weights provide instructions on how to modify the features of the input vectors based on identified meta-features
- Compresses the representation back down, combining meta-features into d_model
- Crucially, the output redefines some or all of the features to have a new meaning and the values provide a diff to change the old values to the new representation
- This change in definition raises the model’s level of abstraction
- The weights of these layers contain the crystallized intelligence of the model
- “Paris is the capital of France” is not stored directly in any way
- “Paris is the capital of France” is an emergent property of the interactions of a subset of weights
- Another bias is applied; not for activation but as this layer’s contribution to the residual stream
- The output of this block is normalized and then added to the input to produce the final value
- Some models do not normalize here, but rely on normalization before the MHSA block
- The resultant vector is the same dimensionality as the input, but has a redefined latent space at a higher level of abstraction
- The input is now encoded in terms of this new latent space definition
- Moving the input upwards in terms of abstraction is the purpose of the MLP
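The feed-forward block reduces to a few lines: expand 4x, activate, compress back, add to the residual stream. The GELU activation is a common choice but an assumption here, and the weights are random stand-ins for learned ones:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed activation; models vary)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x, W1, b1, W2, b2):
    """Feed-forward sketch: up-project, activate, down-project, residual add."""
    h = gelu(x @ W1 + b1)      # project into the 4x combinatorial space
    return x + (h @ W2 + b2)   # compress back to d_model; residual add

rng = np.random.default_rng(0)
d_model = 8
d_ff = 4 * d_model             # the usual 4x expansion
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

x = rng.normal(size=(2, 3, d_model))
y = mlp_block(x, W1, b1, W2, b2)
print(y.shape)                 # same shape in and out: (2, 3, 8)
```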
- Repeated invocations
- The input is passed from layer to layer, each of which contains an MHSA and an MLP block (on the order of 75 layers in larger cases)
- Early layers focus on the meaning of the input
- Middle layers focus on what the model knows about the input
- Later layers focus on what the model thinks about the input
- These are not cleanly delineated, but more of an uneven gradient of increasing abstraction
- Within the structure of the model, the program (weights) emerges naturally from the data based on the loss function
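The layer stack itself is just a loop. A schematic, with the two blocks replaced by stand-in functions that return their residual contribution (here zero), since only the structure is the point:

```python
import numpy as np

def attention_diff(x):
    return np.zeros_like(x)  # stand-in for the MHSA block's semantic diff

def mlp_diff(x):
    return np.zeros_like(x)  # stand-in for the MLP block's contribution

def run_stack(x, num_layers=75):
    # Each layer adds its attention and MLP diffs to the residual
    # stream; abstraction rises gradually across the layers.
    for _ in range(num_layers):
        x = x + attention_diff(x)
        x = x + mlp_diff(x)
    return x

x = np.ones((1, 3, 8))
y = run_stack(x)
print(y.shape)  # (1, 3, 8) -- shape is preserved through the whole stack
```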
- Token selection
- The output of the final layer is projected through an output matrix into vocabulary space, producing a tensor of shape [batch, seq_len, vocab_size]
- The result is called a logit (from “log-odds”, not “logic”), which is an unnormalized probability score
- The logits are converted into normalized probabilities using softmax
- A selection process is used to determine which token is selected:
- Top-k (one is randomly selected from the k most probable tokens; e.g. one of the top 10)
- Top-p (one is randomly selected from the smallest set of tokens whose cumulative probability exceeds p; e.g. the most probable tokens summing to 0.9)
- Both can include temperature scaling (logit/temperature); high temperature flattens the distribution, creating more viable candidates
- Greedy: Just picks the highest one - deterministic
- And many more
- Top-p with temperature is popular
- The process is repeated until the model outputs an EOS token (end-of-sequence) or the system limit is reached
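Top-p sampling with temperature, the popular combination, fits in a short function. A sketch over a toy 4-token vocabulary; the logit values are invented for illustration:

```python
import numpy as np

def sample_top_p(logits, p=0.9, temperature=1.0, rng=None):
    """Top-p (nucleus) sampling with temperature: keep the smallest set
    of tokens whose cumulative probability exceeds p, sample from it."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits / temperature)   # temperature scaling, then
    probs /= probs.sum()                   # softmax normalization

    order = np.argsort(probs)[::-1]        # most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # smallest nucleus covering p
    nucleus = order[:cutoff]

    keep = probs[nucleus] / probs[nucleus].sum()  # renormalize the nucleus
    return int(rng.choice(nucleus, p=keep))

logits = np.array([4.0, 3.0, 0.1, 0.0])    # made-up logits for 4 tokens
token = sample_top_p(logits, p=0.9, temperature=0.7)
print(token)  # always 0 or 1: tokens 2 and 3 fall outside the nucleus
```

Setting `temperature` above 1 flattens the distribution and widens the nucleus; greedy selection is the limit where only the single top logit survives.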