Emergent Abilities in LLMs
After a year of ChatGPT, we have come to take for granted the ability of LLMs to generate text that is indistinguishable from human-written text. But as Chomsky says, good science begins with the willingness to be puzzled by the things we take for granted. Besides, this capability used to blow our minds before we got used to it.
Okay, let us go deeper. What is the model doing when it produces an answer? It is producing a probability distribution, conditioned on the prompt, over the entire vocabulary and picking the most likely token. It does this autoregressively until some stopping criterion is met. But how did it learn to do that? It can do that accurately because of the parameters it learned (including the embedding matrix) at training time. It learns those parameters when modelling the entire distribution of words and their usage in the English language (or any language available on the internet). The training process of LLMs includes two steps: pretraining and fine-tuning.
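To make that loop concrete, here is a minimal sketch of autoregressive decoding over a toy vocabulary. The `next_token_distribution` function is a hypothetical stand-in for a trained model's forward pass, not any real LLM API; everything else (the vocabulary, the prompt) is invented for illustration.

```python
# A minimal sketch of autoregressive decoding over a toy vocabulary.
# `next_token_distribution` is a hypothetical stand-in for a trained model's
# forward pass; a real LLM computes this distribution from learned parameters.
import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]

def next_token_distribution(context: list[str]) -> np.ndarray:
    # Toy "model": deterministic pseudo-random scores, then a softmax
    # over the entire vocabulary, conditioned (here, only trivially) on the context.
    rng = np.random.default_rng(len(context))
    logits = rng.normal(size=len(VOCAB))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(prompt: list[str], max_new_tokens: int = 10) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)    # distribution over the whole vocabulary
        next_token = VOCAB[int(np.argmax(probs))]  # pick the most likely token
        tokens.append(next_token)
        if next_token == "<eos>":                  # stopping criterion
            break
    return tokens

print(generate(["the", "cat"]))
```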
1. Pretraining involves starting with randomly initialized parameters and testing the model against the training objective of language modelling (a.k.a. next-word prediction). The predicted distribution is compared with the actual next word in the training data and a loss value (cross-entropy) is calculated from the probability the model assigned to that word (the loss approaches zero only when the model puts essentially all of its probability on the correct word). Then all the parameters are updated with respect to the loss using backprop (see the sketch after this list). This process is repeated over a vast amount of text taken from the internet. The number of parameters is also huge in the case of LLMs (putting the “large” in “LLM”).
2. Fine-tuning involves aligning the model’s representation of language to a specific domain (e.g. question answering). This is done using methods like instruction tuning.
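As a rough sketch of step 1, here is what a single pretraining update looks like in PyTorch. The tiny model, the token ids, and the learning rate are invented for illustration; a real LLM is a transformer trained on internet-scale text, but the cross-entropy loss on the next word and the backprop update have the same shape.

```python
# One hypothetical pretraining step: predict the next word, compute the loss,
# backpropagate, and update the parameters.
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 100, 32, 4

model = nn.Sequential(                       # stand-in for a transformer
    nn.Embedding(vocab_size, embed_dim),     # the embedding matrix is learned too
    nn.Flatten(),
    nn.Linear(embed_dim * context_len, vocab_size),  # scores over the whole vocabulary
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

context = torch.tensor([[5, 17, 42, 8]])     # ids of the four preceding words
target = torch.tensor([23])                  # id of the actual next word in the data

logits = model(context)                      # predicted scores over the vocabulary
loss = nn.functional.cross_entropy(logits, target)  # near zero only if almost all
                                                    # probability lands on id 23
loss.backward()                              # gradients of the loss w.r.t. all parameters
optimizer.step()                             # update the parameters
optimizer.zero_grad()
```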
But how do these training objectives lead to an ability as abstract as reasoning? How does the simple objective of next-word prediction lead to human-like natural language use?
There have been many attempts at explaining this. But let us define emergence first: here, it means the ability of the models to do well on tasks they were not explicitly trained for. You will see other definitions of emergence out there, like “a drastic and discontinuous change in performance on non-objective tasks as some parameter (e.g. scale) changes continuously and steadily”. Some works that adopt this definition focus on why the capability of the models to perform well on non-objective tasks only comes about past a certain threshold in scale. But they don’t actually discuss why or how these non-objective capabilities come about in the first place (at the scale at which they do occur).
I will use the first definition of emergence in this article. What is it about language and/or the algorithm we’re using that enables a model that approximates the distribution of words (how words appear with respect to every other word in the training data) to also approximate higher-order uses of language like reasoning? Why are these non-objective abilities (e.g. reasoning) piggybacking on the objective task (language modelling)?
I think the answer is that verbal reasoning and other functions that we see as external to language are just parts of language. Wittgenstein said that the meaning of a word is just its usage, which is uncontroversial when we are considering a word/token item in an embedding matrix. Say [4, 0, 0, 3, 0, 8, 1, 0] is the vector for the word “cat” in the embedding matrix; what is the meaning of that vector alone? Absolutely nothing. It’s only meaningful in the context of its relation to other items (its proximity to some and remoteness from others) in the entire embedding matrix. If word-meaning is part of the language games, then maybe all these higher-order functions are also a part of the language games. So if you are able to train a model to learn the language game in a certain way (next-word prediction in this case), then the capability to play all kinds of other language games will also be acquired.
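A toy illustration of this relational view of meaning (the 4-dimensional embeddings below are made up, not taken from any real model): each row means nothing on its own, but the pairwise similarities encode which words are used alike.

```python
# Made-up word vectors: meaning lives in the relations between rows,
# not in any single row taken by itself.
import numpy as np

embeddings = {
    "cat":  np.array([0.9, 0.1, 0.3, 0.0]),
    "dog":  np.array([0.8, 0.2, 0.4, 0.1]),
    "bond": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["cat"], embeddings["dog"]))   # high: similar usage, "close" in the matrix
print(cosine(embeddings["cat"], embeddings["bond"]))  # low: dissimilar usage, "remote"
```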
We modeled language and all that comes with it, but have we modeled thought? Of course not. We humans choose our next words because they correspond to our thoughts the most (i.e. make the most sense). LLMs choose their next words because they are the most likely, given the preceding text, the distribution of words on the internet, and the distribution of words in the fine-tuning dataset. Human language-use is anchored to thought; it is one of the ways we articulate our rich ideas. But if language-use is not thought itself, only its interface, then what is an LLM’s language-use anchored to? It is this: the recorded usage of language on the internet, further shaped by downstream preference optimization steps. No thought and no intentionality; it’s just turtles (language-use) all the way down.
(references will be added soon)