AI Part 4: Large Language Models

Since LLMs like ChatGPT are the subject of most of the current hype and misinformation, we should examine what they are and what they are not. LLMs are an incomplete implementation of the heuristic/mid-brain; they work on the basis of lossy data storage and pattern matching.

Lossy is a technical term from the world of data compression: you can use compression to take a large file and make it smaller, that is, store it more efficiently. Lossless compression lets you get back exactly what you put in, because no information is thrown away. Lossy compression throws out information deemed unimportant by some metric; you can't get back exactly what you put in, only a reasonable facsimile.

Movies are a good example of lossy compression: we get a great end product by throwing away information we don't think the viewer will notice, resulting in video files that are 200x smaller and therefore 200x cheaper to store and distribute.
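A minimal sketch of the difference in Python, using nothing beyond the standard library: the lossless half round-trips exactly via zlib, while the lossy half is a toy quantization (not any real codec) whose numbers are purely illustrative.

```python
import zlib

original = b"the quick brown fox jumps over the lazy dog " * 100

# Lossless: decompression returns exactly the bytes we put in.
packed = zlib.compress(original)
assert zlib.decompress(packed) == original
print(f"lossless: {len(original)} bytes -> {len(packed)} bytes")

# Lossy (toy quantization, not a real codec): detail deemed
# unimportant is discarded, so we get back a facsimile, not the input.
samples = [0.127, 0.943, 0.561, 0.288]
reconstructed = [round(s, 1) for s in samples]
print(samples, "->", reconstructed)   # [0.1, 0.9, 0.6, 0.3]
```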

LLMs store a huge number of concepts in an extremely small space, in part by throwing away a huge amount of "superfluous" data and trying to reconstruct what is missing when they need it.

A top-shelf model, as commonly used, is on the order of 800GB in size. That's roughly ten 4K movies, or about three times all the text on Wikipedia across every language.

Since such models are trained on data on the order of 1,000,000GB (1PB), that suggests a compression ratio of 1250:1 (1,000,000 ÷ 800), using a simple but misleading calculation.
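The naive arithmetic, using the order-of-magnitude sizes above:

```python
training_data_gb = 1_000_000   # ~1 PB of training data (order of magnitude)
model_size_gb = 800            # weights of a large, top-shelf model

print(f"{training_data_gb / model_size_gb:.0f}:1")   # 1250:1
```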

Neural AI stores much more about the relationships and patterns within the data, i.e. metadata, than about the data itself.

As such, the real compression ratio of factual data, and by extension how much detail is thrown away, is much higher; perhaps 10,000:1 or more. A smaller model, like the ones many people use for free, has a ratio perhaps 100 times higher still.

Humans do the same thing; we might store enough in our minds to recognize the Coca-Cola logo, but not enough to reproduce it from memory. We might remember the significance of a spreadsheet without memorizing all the numbers in it.

If we try to recall specific numbers from that spreadsheet, we may reconstruct the ones we can't remember, and if the reconstruction is wrong, we'll say something incorrect. LLMs make the same kind of mistake, and this is one of the sources of hallucinations.

I stated that LLMs are an "incomplete implementation of the heuristic/mid-brain". One of the missing pieces is the ability to recognize when confidence in an output is low, stop, and then check for better information or switch to deliberate reasoning. The most recent models have made progress on checking for better information and citing their sources.

Reasoning, on the other hand, is not so clear-cut. Many AI companies advertise their models as "reasoning" models, but would be hard-pressed to identify exactly where in the model such abstract reasoning happens.

Systems in this category use a number of techniques to improve the output of their mid-brain: pulling in extra data when confidence is low, running a prompt through the system multiple times and taking a consensus of the outputs, or using non-neural AI techniques built into the surrounding scaffolding. None of these approaches, however, switches the model itself out of its intuitive mode of operation.
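The second of those techniques, often called self-consistency sampling, is simple enough to sketch. The generate() function below is a hypothetical stand-in; a real system would call a model API, while this one just simulates a noisy, mostly-correct model.

```python
import random
from collections import Counter

def generate(prompt: str) -> str:
    # Hypothetical stand-in for one sampled model completion. A real
    # system would call an LLM API here; this simulates a noisy model
    # that answers correctly most of the time.
    return random.choices(["42", "41", "24"], weights=[6, 2, 2])[0]

def self_consistent_answer(prompt: str, n_samples: int = 7) -> str:
    # Run the same prompt several times and keep the most common answer.
    # The model never leaves its intuitive mode; the scaffolding simply
    # votes on its outputs.
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistent_answer("What is 6 * 7?"))  # usually "42"
```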

LLMs are currently missing many other elements of human mid-brain cognition, known and unknown, such as:

  • Temporal reasoning or a direct sense of time: LLMs struggle with "what happened first" or "how long ago" questions
  • A clear separation between itself, the user and other sources of data; many models will regard the user's input as something the model itself said or did, or worse, treat something read online as something the user said (this is the security risk behind prompt injection attacks)
  • Knowing when they don’t know (measuring the likely accuracy of a response)
  • Displaying their mid-brain thinking accurately
  • Social intuition
  • In Freudian terms:
    • Lack of super-ego: No separate conscience that limits behavior; ethical limits can be subverted
    • Lack of id: No drives/needs to fulfill (sometimes a positive trait, sometimes negative)
  • A sufficiently advanced long-term memory: This is the subject of intense research
  • A Default Mode Network: the part of our brains which feeds us simulations and what-ifs when we are not engaged in a task; it supports learning, long-term goals and prediction accuracy
  • There are numerous well-known oddities:
    • Not knowing how many "r"s there are in "strawberry" (a side effect of tokenization; see the sketch after this list)
    • Failing simple mathematical calculations (just like humans)
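
The strawberry oddity falls straight out of tokenization: models read token IDs, not letters. A quick sketch using OpenAI's open-source tiktoken library (assuming it is installed; the exact token split shown is illustrative):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)                             # e.g. [496, 675, 15717]
print([enc.decode([t]) for t in tokens])  # e.g. ['str', 'aw', 'berry']

# The model sees opaque IDs rather than characters, so "how many r's
# are in strawberry?" asks about letters it never directly observes.
```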

This is in addition to lacking some or all fore-brain functions, such as abstract reasoning or self-awareness. There is also an argument that "embodied experience", including physical sensations, spatial awareness and bodily intuition, is important because such factors play a large role in human decision making.

None of this is to suggest that the situation cannot change in the future; after all, human reasoning capability is implemented in terms of neurons. It is also not to detract from how impressive LLMs are as a technical milestone.

Language and Cognition

Understanding the connection between language and thought helps explain how and why LLMs work. It is not by accident that the kind of thinking displayed by the likes of Claude or ChatGPT emerged from the development of Large Language Models.

In fact, the groundbreaking 2017 paper from Google (noted in Part 1 of this series), whose technology underpins all modern LLMs, was created in an effort to solve problems for Google Translate by making the service understand the meaning and context of words.

Prior to that, language models could not differentiate between the sentences "dog bites man" and "man bites dog". It's difficult to make quality translations without even that level of understanding, especially between dissimilar languages like English and Japanese.

In psychology and other adjacent disciplines, language has been very closely related to cognition for decades; a phenomenon somewhat opaquely known as "linguistic relativity".

It's been well established that language is tightly intertwined with thought, which makes sense because both rational thought and language need small, well-defined bundles of meaning; the former to build and manipulate ideas, the latter to transmit them to others socially.

These little bundles are called "symbols". We have an idea of a cat, which encodes many connected ideas: four legs, fur, purring, aloofness, eats fish, pointy ears, etc. When we say the word "cat" to someone else, we expect that a roughly equivalent idea will form in the listener's mind.

If we add the word "Cheshire" to "cat", we then also encode big grins, pink and purple stripes, yellow eyes, and the world of Alice in Wonderland (at least if the cultural definers of "Cheshire cat" are Lewis Carroll and Disney).

By the same token, "woman", "elderly woman" and "grandmother" each encode a series of ideas and societal expectations that are related, but different from each other.
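
LLMs store their version of these bundles as embeddings: lists of numbers arranged so that related symbols sit near one another. The vectors below are hand-made toys purely for illustration (real models learn hundreds or thousands of dimensions), though measuring relatedness with cosine similarity is the genuine technique.

```python
import math

# Hand-made toy "embeddings"; the three dimensions are invented purely
# for illustration. Real models learn thousands of dimensions.
#                [feline, domestic, elderly]
vectors = {
    "cat":         [0.9, 0.8, 0.1],
    "woman":       [0.1, 0.7, 0.3],
    "grandmother": [0.1, 0.9, 0.9],
}

def cosine(a, b):
    # 1.0 means identical direction; closer to 0 means less related.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine(vectors["woman"], vectors["grandmother"]))  # high: ~0.93
print(cosine(vectors["cat"], vectors["grandmother"]))    # lower: ~0.58
```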

The Sapir-Whorf Hypothesis suggests that these cultural/linguistic encodings have a significant impact on how people think: on how they process symbols.

For example, in Japanese, you address strangers not with something generic like "sir" or "miss", but by what their relation to you would be if they were family: "sister", "mother", "grandmother". This ideation is deeply buried in the culture.

A common phrase is "It takes a village to raise a child", which indirectly encodes the idea that the whole village acts as a family to the benefit of each other; not such a stretch given how feudal Japanese society was organized, with "family" or "clan" as a placeholder for "nation".

Thinking through a situation using "granny" instead of "ma'am", with everything attached to those words, can't help but drive thinking in different directions because the emotional context and societal expectations of those words are sufficiently different.

A colleague and good friend of mine embodies this idea well. He is trilingual (German, English, Japanese), and during a series of conversations we had on morality and the nature of knowledge (epistemology), he described taking the same set of facts and arriving at three very different conclusions that all felt intuitively right, depending on which language he thought in.

So, do LLMs think?

In the last part of this series, we explored the idea that human and machine cognition progresses from simple to increasingly sophisticated representations. We can arrange our understanding of language similarly:

  • Syntax: The rules for combining words/symbols. "The cat sat on the mat" is legal, while "The sat the on cat mat" is not. Note that this is independent of actual meaning: "The furb urkled on the sliz" is valid, even if we don't know what a "furb" or "sliz" is or what "urkling" involves; we do know it was the "furb" that "urkled" (see the grammar sketch below).
  • Semantics: The meanings of the symbols themselves, and how they interact. "The cat sat on the mat" and "The mat sat on the cat" are both syntactically correct, but have different semantics (meaning).
  • Pragmatics: The actual, intended meaning of the sentence. "Can you pass the salt?" is not a question of "Do you have the capability to pass the salt?" but a request for action.

The listener then forms an idea of how they feel about the request, and decides whether or not and how to respond based on the relationship dynamic, personal inclinations and cultural expectations (by passing the salt and saying "here you go", ideally).
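
To make the syntax level concrete: a toy grammar can judge "legal" word order while knowing nothing about meaning, which is why the nonsense sentence passes. A minimal sketch using the NLTK library (assuming it is installed):

```python
import nltk  # pip install nltk

# A toy grammar: syntax cares about word categories, not meaning.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V PP
    PP -> P NP
    Det -> 'the'
    N  -> 'cat' | 'mat' | 'furb' | 'sliz'
    V  -> 'sat' | 'urkled'
    P  -> 'on'
""")
parser = nltk.ChartParser(grammar)

for words in (["the", "cat", "sat", "on", "the", "mat"],
              ["the", "furb", "urkled", "on", "the", "sliz"],  # nonsense, but legal
              ["the", "sat", "the", "on", "cat", "mat"]):      # illegal word order
    trees = list(parser.parse(words))
    print(" ".join(words), "->", "valid" if trees else "invalid")
```

A grammar like this captures only the first level; semantics and pragmatics are exactly where rule-based checks run out.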

Each level implies a higher level of understanding about what is being requested, until we arrive at a given outcome. Given the structural progression we see in both human and artificial neural networks, there is a certain inevitability to the emergence of thinking, or at least behaviour that often seems indistinguishable from thinking.

Common framing of what is happening tends to fall into two camps: "It's just statistics" and "It's a complete intelligence (AGI)". I think the truth is a bit more nuanced.

My opinion is that the thinking LLMs do is genuine. Inefficient, incomplete, error-prone and limited to heuristic/mid-brain work, but genuine nonetheless.

There are also a number of promising-looking paths forward to improve on that thinking, even if the fruits of such inquiry might not be realized immediately.