AI Notes: LLM Inference

Here is a copy of my notes from when I was learning about LLM structure. They're pretty raw, but I look back at them from time to time, so I've posted them here in the hope that other people might find them useful. No attempt has been made to make this accessible to the layperson.

  • Words are converted to tokens
    • Tokens are analogous to Unicode code points (lookup table)
      • Just integers
      • Could look up the next step (embeddings) directly, but the storage would be unnecessarily large
    • Taken from a dictionary of the total vocabulary
      • Can be organized by root words, affixes, punctuation and related elements
        • Some systems use statistically determined elements (e.g. byte-pair encoding)
      • Dictionaries can be custom
      • Can use common/popular existing vocabs
      • Model families often share the same tokenizer and token IDs
      • Can also fall back to UTF-8 byte encodings for unknown words
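The lookup-plus-byte-fallback idea can be sketched in a few lines. The vocabulary, the IDs, and the greedy longest-match strategy below are all invented for illustration; real tokenizers (e.g. BPE) build their vocabularies statistically and use more sophisticated matching.

```python
# Toy tokenizer sketch: greedy longest-match against a tiny hand-made
# vocabulary, falling back to UTF-8 bytes for anything unknown.
VOCAB = {"un": 0, "break": 1, "able": 2, " ": 3, "cat": 4}
BYTE_OFFSET = len(VOCAB)  # byte-fallback tokens occupy IDs 5..260

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Find the longest vocabulary entry that matches at position i
        match = None
        for piece, tid in VOCAB.items():
            if text.startswith(piece, i) and (match is None or len(piece) > len(match[0])):
                match = (piece, tid)
        if match:
            ids.append(match[1])
            i += len(match[0])
        else:
            # Unknown character: fall back to its UTF-8 bytes
            for b in text[i].encode("utf-8"):
                ids.append(BYTE_OFFSET + b)
            i += 1
    return ids

print(tokenize("unbreakable cat"))  # [0, 1, 2, 3, 4]
```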
  • Tokens are used to create embeddings
    • Embeddings are the direct representation of tokens in residual latent space at the lowest level of abstraction (lookup table)
      • The token to embedding dictionary is created during training and is specific to the model (as is the latent space)
      • A semantic vector (what an embedding is) is a point in n-dimensional latent space, sometimes referred to as a direction
      • The vector dimension in smaller models is ~768; in large ones, more like ~16,000
        • Multi-head is principally responsible for larger model dimensions; more space is given so each head can have ~768 dimensions to work with
    • In some models, absolute token position is added here (not used in this example)
    • The final input tensor is typically [batch, seq_len, d_model]
      • batch is which conversation is being processed
        • A model can process numerous conversations simultaneously via this structure
      • seq_len is the length of the token sequence in the conversation
      • d_model denotes a semantic vector (the label shows that this vector is the full model size; ~768, etc)
      • This structure is appended to by both the user(s) and the output of each inference invocation
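The embedding step is just fancy indexing into a learned table. A minimal sketch with toy sizes (the table values are random stand-ins for learned embeddings; real models use d_model of ~768–16,000 and vocabularies of tens of thousands of tokens):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8                    # toy sizes
embedding_table = rng.standard_normal((vocab_size, d_model))

token_ids = np.array([[3, 14, 15], [9, 2, 6]])  # [batch=2, seq_len=3]
x = embedding_table[token_ids]                  # table lookup via indexing

print(x.shape)  # (2, 3, 8) -> [batch, seq_len, d_model]
```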
  • Multi-head self-attention
    • Input tensor is projected into three different coordinate spaces using matrices created during training
      • Q: (Query) An ambiguity map: what needs to be understood about the input token
        • (-Color, -Size, -Metallic)
        • Can be a different (usually smaller) dimension than the latent space, but must be a multiple of the number of heads
      • K: (Key) A certainty map: what is concretely understood about the input token
        • (+Color, +Size, +Metallic)
        • Basically metadata about the input vector
        • Same dimension rules as Q
      • V: (Value) The actual payload
        • (Red, Small, Yes)
        • In practice, should be same dimensions as model; it is not the job of the MHSA to do compression or expansion
      • Kind of like positive/negative charged particles, the model wants Q+K=0 or >0 (the needs are met by supply)
        • If there is a match, the output value will be high (meaning that these inputs should attend to one another; opposites attract)
    • Q, K and V are reshaped to [batch, num_heads, seq_len, d_head]
      • d_head is a fragment of the whole semantic vector (size is d_model / num_heads)
      • Each head will consider the data from separate perspectives (grammar, syntax, meaning, etc)
      • What is being considered will change as the model moves up through levels of abstraction
    • Input vector positions are used to lookup and add learned per-head positional bias
      • The model produces these lookup tables as part of training
      • Best to do it here so there is no risk of positional data being lost when the vector is split
      • Many models do not use this approach
    • Q and K (which are segmented by head) are multiplied (Q·Kᵀ) to produce a matrix which encodes the relationships between all input vectors
      • Encodes what input vectors can supply what information to other input vectors that need that data
      • The shape of the matrix is a square: seq_len x seq_len. The complete structure is: [batch, num_heads, seq_len, seq_len]
    • The result is scaled by the square root of the head width (√d_head) to compress the output variance so that softmax does not saturate and produce vanishing gradients
    • The results are softmaxed row-wise to normalize the values
    • The resulting structure is the attention_matrix
    • The attention matrix is multiplied by the values to create the un-projected output matrix
      • This step discards the explicit relationships between inputs (the attention matrix) from each head’s perspective (kept in graph models and used in debugging)
    • The results are then concatenated back into the model’s dimensions and then projected back into the model’s latent space via a learned output matrix
      • The values here also encode, and pretty strongly, how important each input is to processing (an importance heat map)
    • The output of this block (a kind of semantic diff) is added to the input via the residual connection and then normalized to produce the final value
      • This diff is how the meaning of the input vector changes in light of other input vectors
      • Some models normalize the input before the block instead (pre-norm)
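The whole attention block can be sketched with NumPy. This is a minimal sketch: no positional bias, no causal masking, normalization omitted, and the weight matrices are random stand-ins for learned parameters. The shapes follow the notes above, with d_head = d_model / num_heads.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model, num_heads = 2, 4, 8, 2
d_head = d_model // num_heads  # each head gets a fragment of the vector

# Random stand-ins for the learned projection matrices
W_q, W_k, W_v, W_o = [rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4)]

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(t):
    # [batch, seq_len, d_model] -> [batch, num_heads, seq_len, d_head]
    b, s, _ = t.shape
    return t.reshape(b, s, num_heads, d_head).transpose(0, 2, 1, 3)

x = rng.standard_normal((batch, seq_len, d_model))  # stand-in input tensor

q = split_heads(x @ W_q)
k = split_heads(x @ W_k)
v = split_heads(x @ W_v)

scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)  # [b, h, s, s]
attn = softmax(scores)             # row-wise softmax -> attention matrix
out = attn @ v                     # un-projected output, [b, h, s, d_head]
out = out.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)  # concat heads
y = x + out @ W_o                  # project back, then residual add

print(attn.shape, y.shape)  # (2, 2, 4, 4) (2, 4, 8)
```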
  • Multi-layer perceptron/Feed forward block
    • The input vectors are multiplied by a set of weights in preparation for activation
      • These weights identify important combinations of features (feature detectors)
      • Each layer has its own latent space with an expanded dimension (usually 4x)
      • The vector is destructively projected into the new combinatorial latent space
    • Bias is added and the layer is activated (element-wise across the tensor)
    • The vectors are then multiplied by a second set of weights
      • These weights provide instructions on how to modify the features of the input vectors based on identified meta-features
        • Compresses the representation back down, combining meta-features into d_model
      • Crucially, the output redefines some or all of the features to have a new meaning and the values provide a diff to change the old values to the new representation
      • This change in definition raises the model’s level of abstraction
    • The weights of these layers contain the crystallized intelligence of the model
      • “Paris is the capital of France” is not stored directly in any way
      • “Paris is the capital of France” is an emergent property of the interactions of a subset of weights
    • Another bias is applied; not for activation but as this layer’s contribution to the residual stream
    • The output of this block is added to the input via the residual connection and then normalized to produce the final value
      • Some models do not normalize here, but rely on normalization before the MHSA block
      • The resultant vector is the same dimensionality as the input, but has a redefined latent space at a higher level of abstraction
      • The input is now encoded in terms of this new latent space definition
      • Moving the input upwards in terms of abstraction is the purpose of the MLP
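The MLP block is the simpler of the two: expand to 4x d_model, activate element-wise, compress back down, add the residual. A minimal sketch assuming a plain ReLU activation (many models use GELU/SiLU variants) and random stand-in weights, with normalization omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 4, 8
d_ff = 4 * d_model                  # the usual 4x expansion

W1 = rng.standard_normal((d_model, d_ff)) * 0.1   # feature detectors
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1   # recombination weights
b2 = np.zeros(d_model)

x = rng.standard_normal((batch, seq_len, d_model))
h = np.maximum(x @ W1 + b1, 0.0)    # project up + element-wise ReLU
out = h @ W2 + b2                   # compress meta-features back to d_model
y = x + out                         # residual add

print(y.shape)  # (2, 4, 8) -- same shape as the input
```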
  • Repeated invocations
    • The input is passed from layer to layer, each of which contains an MHSA and an MLP block (on the order of 75 layers in larger cases)
    • Early layers focus on the meaning of the input
    • Middle layers focus on what the model knows about the input
    • Later layers focus on what the model thinks about the input
    • These are not cleanly delineated, but more of an uneven gradient of increasing abstraction
    • Within the structure of the model, the program (the weights) emerges naturally from the data based on the loss function
  • Token selection
    • The output of the final layer is projected through an output matrix into vocabulary space, producing a [batch, seq_len, vocab_size] tensor
      • The result is called a logit (from “log-odds”, not “logic”), which is an unnormalized probability score
    • The logits are converted into normalized probabilities using softmax
    • A selection process is used to determine which token is selected:
      • Top-k (one token is sampled from the top k candidates, weighted by probability; e.g. one of the top 10)
      • Top-p (one token is sampled from the smallest set of candidates whose cumulative probability exceeds p; e.g. p = 0.9)
      • Both can include temperature scaling (logit/temperature); high temperature flattens the distribution, creating more viable candidates
      • Greedy: Just picks the highest one - deterministic
      • And many more
      • Top-p with temperature is popular
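Top-p with temperature can be sketched over toy logits. The logit values here are invented; in a real model they come from the final projection into vocabulary space.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    a = a - a.max()  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def sample_top_p(logits, p=0.9, temperature=1.0):
    # Temperature scaling: divide logits before softmax
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    order = np.argsort(probs)[::-1]              # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest set with cum. prob >= p
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum()     # renormalize the nucleus
    return int(rng.choice(keep, p=renorm))

logits = [4.0, 3.5, 1.0, 0.2]
token = sample_top_p(logits, p=0.9)
print(token)  # 0 or 1: the nucleus for p=0.9 is just the top two tokens
```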
  • The process is repeated until the model outputs an EOS token (end-of-sequence) or the system limit is reached
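The outer autoregressive loop can be sketched with a stub in place of the real forward pass; `next_token_id` below is an invented stand-in for running the full model and sampling.

```python
EOS = 0          # end-of-sequence token ID (varies by tokenizer)
MAX_TOKENS = 16  # the system limit

def next_token_id(ids: list[int]) -> int:
    # Stand-in "model": counts down from the last token, then emits EOS
    return max(ids[-1] - 1, EOS)

def generate(prompt_ids: list[int]) -> list[int]:
    ids = list(prompt_ids)
    while len(ids) < MAX_TOKENS:
        tok = next_token_id(ids)
        ids.append(tok)          # the output is appended to the sequence
        if tok == EOS:           # stop once the model signals completion
            break
    return ids

print(generate([5, 3]))  # [5, 3, 2, 1, 0]
```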