Transformers are a type of deep neural network that has radically advanced AI capabilities in recent years. First described by a team of Google researchers in 2017, transformers are now essential to large language models such as ChatGPT, to translation software, and, increasingly, to vision and multimodal processing tools. How do transformers work, and why have they so transformed the AI field?
Effectively, transformers are a smart way of introducing context into deep neural networks as they analyse data. To take language models as an example, consider the following sentence:
“The small dog carried the bone”
The first step for a deep learning system analysing this sentence is to “tokenize” it: that is, to chunk it into relevant parts (let’s assume each word becomes one token):
“The - small - dog - carried - the - bone”
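As a rough sketch of this step (real systems use sub-word tokenizers learned from data, but lowercasing and splitting on whitespace is enough to show the idea):

```python
# A minimal tokenizer sketch: real systems use learned sub-word tokenizers,
# but lowercasing and splitting on whitespace is enough to show the idea.
def tokenize(sentence: str) -> list[str]:
    return sentence.lower().split()

print(tokenize("The small dog carried the bone"))
# ['the', 'small', 'dog', 'carried', 'the', 'bone']
```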
The next step is to “embed” each token, giving it a code that corresponds to its meaning. This code is usually a vector: a list of numbers in which each number represents a different aspect of the word’s meaning. For example, the words King and Queen might be represented by very similar vectors, with only a few numbers relating to gender differing between them.
King = <0, 17, 16, 18, 14, …, 12, 2>
Queen = <0, 17, 16, 18, 14, …, 15, 7>
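In code, embedding amounts to a lookup table from tokens to vectors. The tiny table below is purely illustrative: the words, numbers, and four dimensions are invented for this example, whereas real models learn vectors with hundreds or thousands of dimensions:

```python
# An illustrative embedding table: the entries and the four dimensions are
# invented for this example; real models learn far longer vectors from data.
embeddings = {
    "king":  [0.0, 1.7, 1.6, 1.2],
    "queen": [0.0, 1.7, 1.6, 1.5],
    "dog":   [0.9, 0.2, 0.1, 0.4],
}

def embed(tokens):
    return [embeddings[token] for token in tokens]

print(embed(["king", "queen"]))   # two very similar vectors, differing mainly in the last number
```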
Large language models use these vectors to translate text or to predict what would come next in a natural conversation. For them to do this accurately, the encoding needs to capture how context affects meaning.
Consider how early versions of Google Translate used to struggle to use context to pick out the correct meaning of a word with multiple definitions. To use a humorous example, compare “I went swimming at the bank” with “I deposited my money at the bank”:
Whilst we intuitively understand which kind of bank each sentence refers to, a computer program has no automatic way of telling which meaning is intended.
Context refers to the way the meaning of a word varies from case to case. It is present in more subtle ways too: when we think about bones we might picture human skeletons, but in the sentence above we picture the chew toys given to dogs. If an AI algorithm bases its predictions or translations on encoded meanings, we would like those encodings to be updated to reflect this context. Context-specific encodings allow it to make more fine-grained predictions.
Transformer modules can be trained to learn which words are relevant to the meaning of other words, so that when we input a novel sentence, the transformer adjusts the meaning of each word accordingly. Effectively, they learn how to change the vectors representing words so that those vectors capture each word’s meaning in context more closely.
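Concretely, this updating is done with an attention calculation: each word’s vector is compared against every other word’s vector to produce relevance weights, and each word’s new vector is a weighted blend of the others. The sketch below uses NumPy, with random matrices standing in for the projections a real model would learn:

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    # Scaled dot-product attention over a sequence of word vectors.
    # X has one row per token; Wq, Wk, Wv stand in for matrices a model would learn.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how relevant is each word to each other word?
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row of weights sums to 1
    return weights @ V                               # each output row is a context-aware blend

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                          # 6 tokens ("the small dog carried the bone"), 8 numbers each
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)                # (6, 8): one updated vector per token
```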
When multiple transformer modules run in parallel, each module can learn an independent aspect of contextual relevance. For example, one might look at how the subject of a verb impacts the object (how dog affects the meaning of bone), while another might look at how adjectives affect nouns (how small affects our idea of what kind of dog it is).
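In a full transformer these parallel modules are called attention heads, and their outputs are concatenated back together. A minimal sketch, repeating the attention function from above and again using random stand-ins for learned weights:

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    # Same scaled dot-product attention as in the sketch above.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, heads):
    # Run each head independently and concatenate the results; a real model
    # would also pass the concatenation through a final learned projection.
    return np.concatenate([attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads], axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                                   # 6 tokens, 8-dimensional vectors
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
print(multi_head_attention(X, heads).shape)                   # (6, 8): two 4-dimensional heads side by side
```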
The advantage of transformers is that these attention processes can be parallelised and can consider context over longer ranges. Earlier models of attention used recurrent connections: essentially, feedback from earlier words shapes the analysis of later words in a sentence. The challenge was that information from early in the sentence tended to fade after a few more words.
In response to this challenge, long short-term memory (LSTM) circuits were invented, which include a special mechanism for maintaining information from further back in processing. Transformers still have a huge advantage, however, in the degree to which they can be parallelised.
Parallelisation refers to splitting a computational task into parts that can be completed simultaneously, speeding up the overall task. Suppose a satnav needed to check traffic, weather conditions, and speed limits before predicting a travel time. As each check is independent of the others, they could be carried out simultaneously by different processing units. Little stands to be gained in such a small example, but in AI systems with massive computational demands the advantage of widespread parallelisation is obvious.
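As a toy illustration of the idea, the three checking functions below are hypothetical placeholders, with short sleeps standing in for real work, and a thread pool runs them at the same time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, independent sub-tasks for the satnav example; the sleeps
# stand in for real work such as querying a traffic or weather service.
def check_traffic():
    time.sleep(1)
    return "traffic: light"

def check_weather():
    time.sleep(1)
    return "weather: clear"

def check_speed_limits():
    time.sleep(1)
    return "speed limit: 60 mph"

start = time.time()
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(task) for task in (check_traffic, check_weather, check_speed_limits)]
    results = [future.result() for future in futures]
print(results, f"- finished in {time.time() - start:.1f}s")   # roughly 1 second rather than 3
```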
Both recurrent networks and LSTM circuits require the system to wait until earlier words have been processed before analysing later ones, preventing parallelisation. Transformers, on the other hand, consider the role of every word simultaneously. They can therefore quickly handle large numbers of tokens and types of relevance, cutting the time needed for both training and use. This is especially important given the quantity of training these systems require in order to perform their task successfully.
Transformer models have been put to a variety of uses. LLMs are the most notable, with the T in ChatGPT standing for transformer. They have, however, begun to be applied in other contexts, such as image recognition, by breaking images down into patches analogous to the tokens described earlier. The transformer modules then allow the different patches to influence one another. For example, an eye-like shape near patches resembling noses and ears is most likely part of a face, whereas an eye-like shape near car door mirrors is more likely to be a headlight.
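A minimal sketch of the patching step, with sizes chosen arbitrarily for illustration: a square image is cut into a grid of fixed-size patches, and each patch is flattened into a vector so it can be handled like a word token:

```python
import numpy as np

def image_to_patches(image, patch_size):
    # Cut a square (height, width, channels) image into non-overlapping patches
    # and flatten each patch into a vector, so it can be handled like a token.
    h, w, c = image.shape
    return (image
            .reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(-1, patch_size * patch_size * c))

image = np.random.rand(224, 224, 3)        # an arbitrary example image
print(image_to_patches(image, 16).shape)   # (196, 768): a 14x14 grid of patches, each a 768-number vector
```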
This approach has been shown to generalise beyond vision. Researchers have had success in speech recognition by first converting the audio input into a visual format such as a spectrogram, then cutting that image into patches and feeding them into a transformer to analyse the overall shape of the audio snippet.
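A rough sketch of that first conversion step, leaving out the windowing and mel scaling a real system would use: the waveform is sliced into overlapping frames, and each frame’s frequency content becomes one row of a two-dimensional “image”:

```python
import numpy as np

def spectrogram(waveform, frame_size=256, hop=128):
    # A very rough spectrogram: slice the waveform into overlapping frames and
    # take the magnitude of each frame's Fourier transform. Real systems add
    # windowing and a mel scale, but the result is the same kind of 2-D "image".
    frames = [waveform[i:i + frame_size]
              for i in range(0, len(waveform) - frame_size, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=-1))

waveform = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # one second of a 440 Hz tone
image = spectrogram(waveform)
print(image.shape)   # (time frames, frequency bins): ready to be cut into patches as above
```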
Transformers have dramatically changed the AI landscape, though there is some concern over the cost of training these models. Compared with alternative approaches, such as convolutional neural networks in visual processing, training costs are very high. Even so, companies are willing to spend a great deal on training to reap the benefits of transformer systems.
The hope is that, in future, more refined algorithms will improve training efficiency so that these models can continue to scale and improve without losing ground to alternative AI approaches. In any case, in the absence of competitive alternatives, transformers are a vital part of cutting-edge deep learning, and essential to understand if you want to stay up to date with AI technology today.