AI Notes: LLM Inference

Here is a copy of my notes from when I was learning about LLM structure. They're pretty raw, but I look back at them from time to time, so I've posted them here in the hope that other people might find them useful. No attempt has been made to make this accessible to the layperson.

  • Words are converted to tokens
    • Tokens are analogous to Unicode code points (lookup table)
      • Just integers
      • Could look up the next step (embeddings) directly, but the storage would be unnecessarily large
    • Taken from a dictionary of the total vocabulary
      • Can be organized by root words, affixes, punctuation and related elements
        • Some systems use statistically determined elements (e.g. byte-pair encoding)
      • Dictionaries can be custom
      • Can use common/popular existing vocabs
      • Model families often share the same tokenizer and token IDs
      • Can also fall back to UTF-8 byte encodings for unknown words
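The lookup-plus-byte-fallback idea can be sketched in a few lines. The vocabulary, the IDs, and the greedy longest-match strategy below are all invented for illustration; real tokenizers (e.g. BPE) build their vocabularies statistically and use more sophisticated matching.

```python
# Toy tokenizer sketch: greedy longest-match against a tiny hand-made
# vocabulary, falling back to UTF-8 bytes for anything unknown.
VOCAB = {"un": 0, "break": 1, "able": 2, " ": 3, "cat": 4}
BYTE_OFFSET = len(VOCAB)  # byte-fallback tokens occupy IDs 5..260

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Find the longest vocabulary entry that matches at position i
        match = None
        for piece, tid in VOCAB.items():
            if text.startswith(piece, i) and (match is None or len(piece) > len(match[0])):
                match = (piece, tid)
        if match:
            ids.append(match[1])
            i += len(match[0])
        else:
            # Unknown character: fall back to its UTF-8 bytes
            for b in text[i].encode("utf-8"):
                ids.append(BYTE_OFFSET + b)
            i += 1
    return ids

print(tokenize("unbreakable cat"))  # [0, 1, 2, 3, 4]
```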
  • Tokens are used to create embeddings
    • Embeddings are the direct representation of tokens in residual latent space at the lowest level of abstraction (lookup table)
      • The token to embedding dictionary is created during training and is specific to the model (as is the latent space)
      • A semantic vector (what an embedding is) is a point in n-dimensional latent space, sometimes referred to as a direction
      • The vector dimension in smaller models is ~768; in large ones, more like ~16,000
        • Multi-head is principally responsible for larger model dimensions; more space is given so each head can have ~768 dimensions to work with
    • In some models, absolute token position is added here (not used in this example)
    • The final input tensor is typically [batch, seq_len, d_model]
      • batch is which conversation is being processed
        • A model can process numerous conversations simultaneously via this structure
      • seq_len is the length of the token sequence in the conversation
      • d_model denotes a semantic vector (the label shows that this vector is the full model size; ~768, etc)
      • This structure is appended to by both the user(s) and the output of each inference invocation
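The embedding step is just fancy indexing into a learned table. A minimal sketch with toy sizes (the table values are random stand-ins for learned embeddings; real models use d_model of ~768–16,000 and vocabularies of tens of thousands of tokens):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8                    # toy sizes
embedding_table = rng.standard_normal((vocab_size, d_model))

token_ids = np.array([[3, 14, 15], [9, 2, 6]])  # [batch=2, seq_len=3]
x = embedding_table[token_ids]                  # table lookup via indexing

print(x.shape)  # (2, 3, 8) -> [batch, seq_len, d_model]
```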
  • Multi-head self-attention
    • Input tensor is projected into three different coordinate spaces using matrices created during training
      • Q: (Query) An ambiguity map: what needs to be understood about the input token
        • (-Color, -Size, -Metallic)
        • Can be a different (usually smaller) dimension than the latent space, but must be a multiple of the number of heads
      • K: (Key) A certainty map: what is concretely understood about the input token
        • (+Color, +Size, +Metallic)
        • Basically metadata about the input vector
        • Same dimension rules as Q
      • V: (Value) The actual payload
        • (Red, Small, Yes)
        • In practice, should be same dimensions as model; it is not the job of the MHSA to do compression or expansion
      • Kind of like positive/negative charged particles, the model wants Q+K=0 or >0 (the needs are met by supply)
        • If there is a match, the output value will be high (meaning that these inputs should attend to one another; opposites attract)
    • Q, K and V are reshaped to [batch, num_heads, seq_len, d_head]
      • d_head is a fragment of the whole semantic vector (size is d_model / num_heads)
      • Each head will consider the data from separate perspectives (grammar, syntax, meaning, etc)
      • What is being considered will change as the model moves up through levels of abstraction
    • Input vector positions are used to lookup and add learned per-head positional bias
      • The model produces these lookup tables as part of training
      • Best to do it here so there is no risk of positional data being lost when the vector is split
      • Many models do not use this approach
    • Q and K (which are segmented by head) are multiplied (Q·Kᵀ) to produce a matrix which encodes the relationships between all input vectors
      • Encodes what input vectors can supply what information to other input vectors that need that data
      • The shape of the matrix is a square: seq_len x seq_len. The complete structure is: [batch, num_heads, seq_len, seq_len]
    • The result is scaled by the square root of the head width (√d_head) to compress the output variance so that softmax does not saturate and produce vanishing gradients
    • The results are softmaxed row-wise to normalize the values
    • The resulting structure is the attention_matrix
    • The attention matrix is multiplied by the values to create the un-projected output matrix
      • This step discards the explicit relationships between inputs (the attention matrix) from each head’s perspective (kept in graph models and used in debugging)
    • The results are then concatenated back into the model’s dimensions and then projected back into the model’s latent space via a learned output matrix
      • The values here also encode, and pretty strongly, how important each input is to processing (an importance heat map)
    • The output of this block (a kind of semantic diff) is added to the input via the residual connection and then normalized to produce the final value
      • This diff is how the meaning of the input vector changes in light of other input vectors
      • Some models normalize the input before the block instead (pre-norm)
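The whole attention block can be sketched with NumPy. This is a minimal sketch: no positional bias, no causal masking, normalization omitted, and the weight matrices are random stand-ins for learned parameters. The shapes follow the notes above, with d_head = d_model / num_heads.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model, num_heads = 2, 4, 8, 2
d_head = d_model // num_heads  # each head gets a fragment of the vector

# Random stand-ins for the learned projection matrices
W_q, W_k, W_v, W_o = [rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4)]

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(t):
    # [batch, seq_len, d_model] -> [batch, num_heads, seq_len, d_head]
    b, s, _ = t.shape
    return t.reshape(b, s, num_heads, d_head).transpose(0, 2, 1, 3)

x = rng.standard_normal((batch, seq_len, d_model))  # stand-in input tensor

q = split_heads(x @ W_q)
k = split_heads(x @ W_k)
v = split_heads(x @ W_v)

scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)  # [b, h, s, s]
attn = softmax(scores)             # row-wise softmax -> attention matrix
out = attn @ v                     # un-projected output, [b, h, s, d_head]
out = out.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)  # concat heads
y = x + out @ W_o                  # project back, then residual add

print(attn.shape, y.shape)  # (2, 2, 4, 4) (2, 4, 8)
```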
  • Multi-layer perceptron/Feed forward block
    • The input vectors are multiplied by a set of weights in preparation for activation
      • These weights identify important combinations of features (feature detectors)
      • Each layer has its own latent space with an expanded dimension (usually 4x)
      • The vector is destructively projected into the new combinatorial latent space
    • Bias is added and the layer is activated (element-wise across the tensor)
    • The vectors are then multiplied by a second set of weights
      • These weights provide instructions on how to modify the features of the input vectors based on identified meta-features
        • Compresses the representation back down, combining meta-features into d_model
      • Crucially, the output redefines some or all of the features to have a new meaning and the values provide a diff to change the old values to the new representation
      • This change in definition raises the model’s level of abstraction
    • The weights of these layers contain the crystallized intelligence of the model
      • “Paris is the capital of France” is not stored directly in any way
      • “Paris is the capital of France” is an emergent property of the interactions of a subset of weights
    • Another bias is applied; not for activation but as this layer’s contribution to the residual stream
    • The output of this block is added to the input via the residual connection and then normalized to produce the final value
      • Some models do not normalize here, but rely on normalization before the MHSA block
      • The resultant vector is the same dimensionality as the input, but has a redefined latent space at a higher level of abstraction
      • The input is now encoded in terms of this new latent space definition
      • Moving the input upwards in terms of abstraction is the purpose of the MLP
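The MLP block is the simpler of the two: expand to 4x d_model, activate element-wise, compress back down, add the residual. A minimal sketch assuming a plain ReLU activation (many models use GELU/SiLU variants) and random stand-in weights, with normalization omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 4, 8
d_ff = 4 * d_model                  # the usual 4x expansion

W1 = rng.standard_normal((d_model, d_ff)) * 0.1   # feature detectors
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1   # recombination weights
b2 = np.zeros(d_model)

x = rng.standard_normal((batch, seq_len, d_model))
h = np.maximum(x @ W1 + b1, 0.0)    # project up + element-wise ReLU
out = h @ W2 + b2                   # compress meta-features back to d_model
y = x + out                         # residual add

print(y.shape)  # (2, 4, 8) -- same shape as the input
```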
  • Repeated invocations
    • The input is passed from layer to layer, each of which contains an MHSA and an MLP block (on the order of 75 layers in larger cases)
    • Early layers focus on the meaning of the input
    • Middle layers focus on what the model knows about the input
    • Later layers focus on what the model thinks about the input
    • These are not cleanly delineated, but more of an uneven gradient of increasing abstraction
    • Within the structure of the model, the program (the weights) emerges naturally from the data based on the loss function
  • Token selection
    • The output of the final layer is projected through an output matrix into vocabulary space, producing a [batch, seq_len, vocab_size] tensor
      • The result is called a logit (from “log-odds”, not “logic”), which is an unnormalized probability score
    • The logits are converted into normalized probabilities using softmax
    • A selection process is used to determine which token is selected:
      • Top-k (one token is sampled from the top k candidates, weighted by probability; e.g. one of the top 10)
      • Top-p (one token is sampled from the smallest set of candidates whose cumulative probability exceeds p; e.g. p = 0.9)
      • Both can include temperature scaling (logit/temperature); high temperature flattens the distribution, creating more viable candidates
      • Greedy: Just picks the highest one - deterministic
      • And many more
      • Top-p with temperature is popular
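Top-p with temperature can be sketched over toy logits. The logit values here are invented; in a real model they come from the final projection into vocabulary space.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    a = a - a.max()  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def sample_top_p(logits, p=0.9, temperature=1.0):
    # Temperature scaling: divide logits before softmax
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    order = np.argsort(probs)[::-1]              # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest set with cum. prob >= p
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum()     # renormalize the nucleus
    return int(rng.choice(keep, p=renorm))

logits = [4.0, 3.5, 1.0, 0.2]
token = sample_top_p(logits, p=0.9)
print(token)  # 0 or 1: the nucleus for p=0.9 is just the top two tokens
```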
  • The process is repeated until the model outputs an EOS token (end-of-sequence) or the system limit is reached
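The outer autoregressive loop can be sketched with a stub in place of the real forward pass; `next_token_id` below is an invented stand-in for running the full model and sampling.

```python
EOS = 0          # end-of-sequence token ID (varies by tokenizer)
MAX_TOKENS = 16  # the system limit

def next_token_id(ids: list[int]) -> int:
    # Stand-in "model": counts down from the last token, then emits EOS
    return max(ids[-1] - 1, EOS)

def generate(prompt_ids: list[int]) -> list[int]:
    ids = list(prompt_ids)
    while len(ids) < MAX_TOKENS:
        tok = next_token_id(ids)
        ids.append(tok)          # the output is appended to the sequence
        if tok == EOS:           # stop once the model signals completion
            break
    return ids

print(generate([5, 3]))  # [5, 3, 2, 1, 0]
```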