- What are large language models (LLMs) and why are they considered to be the closest thing to Artificial General Intelligence currently available?
- Why are LLMs capable of performing intellectual tasks at a similar level to humans?
- An overview of LLMs and how to utilize and benefit from them.
Can you guess the missing word in the following sentence:
“Most newborn kittens open their eyes between the ages of 2 and 16 ___”?
I’m sure you’ve guessed it correctly based on your biology knowledge and understanding of the English language.
You have just solved the problem that autoregressive language models are meant to solve. An autoregressive language model receives a text - or the “prompt” - and chooses the token that most naturally comes after the prompt.
Although this predict-the-next-token interface may seem simple, it opens the possibility to solve basically all kinds of problems in natural language processing - and today we’ll show you how.
You can also see it in action in this video:
An autoregressive language model can be implemented in a variety of ways. Today, we’re going to talk about models based on the transformer neural network architecture, which was first introduced in 2017 by specialists from Google Research and Google Brain. Since then, transformers have not only completely taken over the NLP field but have also beaten convolutional neural networks at some Computer Vision problems, and they play an important role in text-to-image solutions. Transformer models that are practical to use are very large, with up to hundreds of billions of parameters (neuron connections), so the subject of this article is often called “large language models” (LLMs).
The performance of LLMs largely depends on their size: the more parameters a model has, the more naturally it will complete the prompt. OpenAI GPT3 has 175 billion parameters and it’s accessible without any special knowledge: here is the GPT3 playground that we will be using later on. However, OpenAI charges money for the usage of GPT3, and users also have to comply with its strict usage policies. There are several alternatives to GPT3 that are free to download and run locally or host on your own server. One of the alternatives, the largest version of BLOOM, has 176 billion parameters, which is slightly more than GPT3. However, running a model that large requires a ton of computational resources (so it isn’t practical on a home computer), whereas smaller free models yield worse results. There is no solution that fits all cases, so users have to choose the model according to their problem.
Tokenization and Context Window
Let’s clarify the difference between a token and a word. Language models work with tokens and require a tokenizer to convert a text to the sequence of tokens the model can handle. Sometimes one token corresponds to a word, sometimes it doesn’t – it depends on the inner workings of the tokenizer.
Let’s take a look at the GPT3 tokenizer.
When it comes to English, one token usually corresponds to a word or a punctuation sign. You can also notice that the token includes a space character that precedes the word, so for GPT3 it’s better not to have a space at the end of the prompt.
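To make the leading-space behavior concrete, here is a minimal sketch of a toy tokenizer. It is not the GPT3 tokenizer (which is based on byte-pair encoding): the vocabulary `VOCAB` and the greedy longest-match rule are assumptions made up purely for illustration.

```python
# Toy greedy longest-match tokenizer. The vocabulary is hypothetical and only
# imitates the convention that the space before a word belongs to its token.
VOCAB = {"Most", " newborn", " kittens", " open", " their", " eyes", ".", " "}


def tokenize(text, vocab=VOCAB):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("Most newborn kittens"))  # ['Most', ' newborn', ' kittens']
print(tokenize("Most newborn "))         # ['Most', ' newborn', ' ']
```

With this toy vocabulary, a prompt that ends with a space produces a lone space token at the end, while the next word would normally arrive as a token that already contains its own leading space.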
In other languages that use the Latin script, the situation is worse – typically, a token consists of 1-4 characters. However, Polish has some special characters that require 2 tokens.
The worst tokenization is with languages that have a different writing system: in Japanese, for example, kana characters and common kanji correspond to one token, but more advanced kanji are represented with several tokens.
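One way to get a feeling for why non-English text costs more tokens: the GPT3 tokenizer works on UTF-8 bytes, so a character it has no learned token for falls back to one token per byte. The sketch below only counts bytes, which gives an upper bound on tokens per character under that assumption; it does not run the real tokenizer, and the sample characters are our own choice.

```python
def utf8_byte_count(ch):
    """Number of bytes a character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters chosen for illustration only.
for ch in ["a", "ł", "あ", "猫"]:
    print(repr(ch), utf8_byte_count(ch))
# 'a' 1
# 'ł' 2
# 'あ' 3
# '猫' 3
```

Frequent characters usually get their own single token during tokenizer training, which is consistent with kana and common kanji taking one token while rarer kanji take several.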
If you build your own model, you can shape the tokenizer any way you want. Tokens don’t even have to correspond to parts of natural language text. For example, if you’re building a model that generates piano music, each token can encode a set of piano keys being pressed at a certain time.
LLMs have a limited context window size, which means that the number of tokens in the prompt and the completion combined can’t exceed a certain limit. The context size of GPT3 175B is 4096 tokens; for most other models it’s currently either 2048 or 4096 tokens.
How do we generate a long text using an autoregressive model? First, we come up with the beginning of our text and use it as a prompt. Then, we take the prompt and generate the next token. Next, we append this token to the prompt and generate the next token again. We repeat this operation again and again, producing a long text.
When users interact with a language model through the API or through some UI, it always follows this interface: the user specifies the prompt and the number of tokens that the model should iteratively append to it.
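In practice, the limit means that a long prompt leaves less room for the completion. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the default of 4096 is simply the figure quoted above.

```python
def max_completion_tokens(prompt_tokens, context_window=4096):
    """Tokens left for the completion once the prompt is in the context window."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # A prompt longer than the window leaves no room at all.
    return max(context_window - prompt_tokens, 0)


print(max_completion_tokens(4000))        # 96
print(max_completion_tokens(1000, 2048))  # 1048
```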
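The loop described above can be sketched with a tiny stand-in model. A real LLM is a transformer, not a bigram counter; the bigram "model", the toy corpus, and all function names below are assumptions used only to show the generate-append-repeat cycle.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count which token follows which (a stand-in for a real language model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token(model, prompt):
    """Return the most frequent follower of the last prompt token, or None."""
    followers = model.get(prompt[-1])
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(model, prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(model, tokens)
        if token is None:  # the toy model has nothing to say
            break
        tokens.append(token)  # the completion becomes part of the next prompt
    return tokens


corpus = "the cat sat on the mat and the cat sat on the sofa".split()
model = train_bigram(corpus)
print(generate(model, ["the"], 4))  # ['the', 'cat', 'sat', 'on', 'the']
```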
Let’s take a look at some examples using the GPT3 playground.
Let’s assume that we need to write a recommendation letter for an employee. We put the beginning of the letter as a prompt, and GPT3 will complete our text.
We can then guide the language generation in different directions by changing the prompt. Here is an example of a recommendation letter created with a longer prompt.
Generating text is cool, but how do we solve other NLP problems?
The model chooses the most natural way to complete the prompt. What if we build our prompt in such a way that the most natural way to complete the prompt will lead to the solution to our problem?
Let’s consider an example where we need to make a machine translator from French to English. We can build the prompt in the following way:
We fill the prompt with French/English parallel sentences at different lines separated by an empty line, then put the French sentence we want to translate at the end followed by another line break. This way, the logical continuation of our prompt seems to be the English translation of the target sentence. Small note: we can optionally put the line “Translation from French to English” at the beginning of the prompt for even more context.
Does it all work? It actually does! The language model is smart enough to understand what we want from it by examples (as a human would) and it will output the English translation.
This process is called “few-shot learning”: by engineering the prompt in the right way, we managed to teach the model to solve our NLP problem from a relatively small number of examples, without changing the weights of the model.
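Here is a minimal sketch of how such a few-shot translation prompt could be assembled in code, following the layout described above (sentence pairs on separate lines, pairs separated by an empty line, the target sentence last). The function name and the example sentences are our own assumptions.

```python
def build_translation_prompt(examples, source_sentence, header=None):
    """Assemble a few-shot prompt from (french, english) example pairs."""
    parts = []
    if header:
        parts.append(header)  # optional line such as a task description
    for french, english in examples:
        parts.append(f"{french}\n{english}")
    # The prompt ends with the sentence to translate and a line break,
    # so the natural continuation is its English translation.
    parts.append(f"{source_sentence}\n")
    return "\n\n".join(parts)


examples = [
    ("Bonjour.", "Hello."),
    ("Merci beaucoup.", "Thank you very much."),
]
prompt = build_translation_prompt(
    examples,
    "Où est la gare ?",
    header="Translation from French to English",
)
print(prompt)
```

The returned string is what would be sent to the model as the prompt; the completion is then read up to the next line break.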
Let’s make a chatbot!
For this purpose, we’re going to fill the prompt in the following way: first, we introduce the description of our bot, then put several phrases alternating between the user and the bot preceded by the speaker name, and then add the name of the bot at the end. This way, the natural continuation of the prompt is the phrase that the bot is going to say:
In an actual user-bot conversation scenario, we would receive an answer from the user, then add both bot and user phrases to our prompt and generate the bot reply again. This way, the bot remembers what happened before in the conversation.
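The chatbot prompt can be built the same way. Below is a minimal sketch under our own assumptions: the speaker names, the bot description, and the function name are hypothetical, and the call to the model itself is left out.

```python
def build_chat_prompt(description, history, bot_name="Bot"):
    """Build a prompt from a bot description and (speaker, phrase) pairs."""
    lines = [description, ""]
    for speaker, phrase in history:
        lines.append(f"{speaker}: {phrase}")
    # End with the bot's name so the natural continuation is the bot's reply.
    # No trailing space is added after the colon.
    lines.append(f"{bot_name}:")
    return "\n".join(lines)


history = [
    ("User", "Hi!"),
    ("Bot", "Hello! How can I help you?"),
    ("User", "What is a token?"),
]
prompt = build_chat_prompt(
    "The following is a conversation between a user and a helpful bot.",
    history,
)
print(prompt)
```

After each generated reply, both the user phrase and the bot reply are appended to `history`, and the prompt is rebuilt for the next turn. That growing history is what gives the bot its memory.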
So far, the LLM behavior has been deterministic: given the same prompt, it generates the same completion. However, can we introduce some variability into the LLM output?
How does an LLM choose a token to complete the prompt with? For each token in its vocabulary, the LLM predicts the probability that this token comes after the prompt, and then it picks the token with the highest probability.
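A minimal sketch of this choice: the model produces a raw score (logit) for each token, a softmax turns the scores into probabilities, and the token with the highest probability wins. The three-token vocabulary and its scores are made up for illustration; a real vocabulary has tens of thousands of entries.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits.values())
    # Subtracting the maximum keeps exp() numerically stable.
    exps = {tok: math.exp(score - highest) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def greedy_choice(logits):
    """Pick the single most probable token."""
    probs = softmax(logits)
    return max(probs, key=probs.get)


# Hypothetical scores for the next token.
logits = {" movie": 2.0, " film": 1.5, " picture": -1.0}
print(greedy_choice(logits))  # ' movie'
```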
Let’s take a look at our French-to-English machine translator example again.
First, we will select “Show probabilities” -> “Full spectrum”. Then, the color of each token in the completion will indicate the probability of generating this specific token at its step.
If we hover over individual words, we can see alternative options with their probabilities.
Here, an alternative to “movie” was “film”:
In this context, the competitor of “among” is “with”:
“the” has just a slightly higher probability than its closest competitor. An alternative to generating “the” would be just skipping the article:
“young” is competing with “youth” and “youngsters”:
If at one generation step, we choose the second best option, the generation will then follow a completely different path.
LLM usage interfaces provide you with a way to control the degree of randomness in model generation by the temperature parameter. The higher the temperature, the more often the model will generate tokens lower in the probability ranking.
Let’s change the temperature from 0 to 1. Also, make sure that top_p is set to >0 (1 is recommended) if you want to generate with temperature.
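To see what these two parameters do, here is a minimal sketch of temperature plus top_p (nucleus) sampling. It follows the common textbook formulation and is not OpenAI’s actual implementation; the function name and the token scores are assumptions.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token using temperature scaling and nucleus (top_p) filtering."""
    if temperature <= 0:
        # Temperature 0 degenerates to the deterministic, greedy choice.
        return max(logits, key=logits.get)
    highest = max(logits.values())
    # Higher temperature flattens the distribution, lower sharpens it.
    weights = {t: math.exp((s - highest) / temperature) for t, s in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, t) for t, w in weights.items()), reverse=True)
    # Keep the smallest set of top tokens whose probabilities add up to top_p.
    kept, cumulative = [], 0.0
    for prob, token in ranked:
        kept.append((prob, token))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens = [token for _, token in kept]
    probs = [prob for prob, _ in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]


logits = {" movie": 2.0, " film": 1.5, " picture": -1.0}  # hypothetical scores
rng = random.Random(0)
print([sample_token(logits, temperature=1.0, top_p=1.0, rng=rng) for _ in range(5)])
```

In this sketch, top_p = 0 keeps only the single best token, so the output stays deterministic whatever the temperature is, which is why top_p has to be above 0 for the temperature to have any effect.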
This way, we can generate different translations of the same French phrase, yay!
GPT3 is available through the simple API that users can call from Python or from CLI (there are also wrappers for other programming languages). However, OpenAI charges a small amount of money for each API request (it also applies to Playground).
OpenAI also has smaller versions of GPT3, and calling them is cheaper than the 175B one (see the pricing page). In money calculations, it counts the number of prompt and completion tokens combined. However, please keep in mind that the usage of models that are fine-tuned on your own data (we will talk about fine-tuning later) is more expensive (but you can still use the few-shot learning technique on default models).
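Because billing counts prompt and completion tokens together, the cost of a request is easy to estimate. The sketch below assumes a flat price per 1,000 tokens; the price used in the example is a placeholder, not an actual OpenAI rate, so check the pricing page for real numbers.

```python
def request_cost(prompt_tokens, completion_tokens, price_per_1k_tokens):
    """Estimated cost of one request, billed on prompt + completion tokens."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * price_per_1k_tokens


# Placeholder price, for illustration only.
print(round(request_cost(1500, 500, price_per_1k_tokens=0.02), 4))  # 0.04
```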
Smaller versions of GPT3 include:
- Curie
GPT3 stops generating not only when it reaches one of the stop sequences or has generated the requested number of tokens, but also when it decides that the document has ended. In this example, the model thought the document ended there.
HuggingFace has all kinds of models, not only GPT3-like ones. To search for LLMs, users can filter by the text generation tag. Each model page also contains a detailed technical specification of the model.
Let’s take a look at some options.
BLOOM is a 176B LLM trained on multilingual data. Here are examples of its generations in HuggingFace UI:
OPT is an open-source alternative to GPT3 available in different sizes: facebook/opt-125m, facebook/opt-350m, facebook/opt-1.3b, facebook/opt-2.7b, facebook/opt-6.7b, facebook/opt-30b, facebook/opt-66b.
GPT-J 6B by EleutherAI has around 6 billion parameters.
The previous generation of GPT, GPT2 with 1.5 billion parameters, is also available for use.
An LLM’s ability to handle tasks in a particular language depends on how often texts in that language appear in the training data and on how well its tokenizer handles texts in that language.
GPT3 was trained on multilingual data. The more GPT3 was exposed to a language, the more familiar it is with that language. In theory, GPT3 can handle tasks in any language. It performs better with languages that are popular and those that use the Latin script (because of the tokenization issues we mentioned earlier).
LLMs can also be trained on non-natural-language data. For example, a model that is trained on source code is able to complete programming code (BLOOM can do that too). One of the tools that does that is GitHub Copilot, which is based on Codex (which is basically GPT3 for programming code).
Since programming code often contains natural language text in code comments, the Codex model performs reasonably well on natural language. It means that users can also utilize GitHub Copilot for natural language completion.
With few-shot learning, the more examples are given to the model, the better it will solve the user’s problem. However, there are some reasons why users can’t put as many examples as they want:
- Context window has a limited size
- If you use some commercial API, the large prompt will lead to a higher cost
These problems can be solved with the fine-tuning technique.
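The first limitation can be made concrete with a small sketch that packs as many examples as fit into the context window. Counting whitespace-separated words is a crude stand-in for a real tokenizer, and all names and numbers here are assumptions chosen for illustration.

```python
def rough_token_count(text):
    # Crude stand-in for a tokenizer: counts whitespace-separated words.
    return len(text.split())


def select_examples(examples, query, context_window, completion_budget,
                    count_tokens=rough_token_count):
    """Keep examples, in order, while prompt + completion still fit the window."""
    budget = context_window - completion_budget - count_tokens(query)
    chosen = []
    for example in examples:
        cost = count_tokens(example)
        if cost > budget:
            break  # the next example would overflow the context window
        chosen.append(example)
        budget -= cost
    return chosen


examples = ["a b c", "d e f", "g h i"]
print(select_examples(examples, "q r", context_window=10, completion_budget=2))
# ['a b c', 'd e f']
```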
With fine-tuning, users create a copy of the LLM and then train it on data that is specific to their problem, changing the model weights. Users can fine-tune both GPT3 and HuggingFace models (here is the guide for GPT3). If you fine-tune an LLM locally (for example, a HuggingFace one), please keep in mind that fine-tuning a model on a GPU requires much more memory than just running it. Training LLMs on a CPU is technically possible, but it will take ages.
Let’s take a look at some examples of what fine-tuned models can do.
Source Code Commenting
We have fine-tuned a model based on GPT3 Curie that writes a docstring comment for a Java method (our training set included about 2000 method+comment pairs).
Another example is InstructGPT, developed by OpenAI. This model is trained to treat the prompt as an instruction to execute. With InstructGPT, we can build a French->English machine translator much more easily:
ChatGPT is also a fine-tuned version of GPT3. It was fine-tuned on conversational data, so it can hold a conversation much like a human.
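For reference, a training set like this is typically stored as JSON Lines with one prompt/completion record per pair. The sketch below is not our actual training pipeline: it assumes the prompt/completion layout, and the separator, the stop marker, and the sample Java method are hypothetical.

```python
import json


def to_jsonl(pairs, separator="\n\n###\n\n", stop=" END"):
    """Serialize (method_source, docstring) pairs as prompt/completion records."""
    lines = []
    for method_source, docstring in pairs:
        record = {
            # The separator marks where the prompt ends.
            "prompt": method_source + separator,
            # The stop marker tells the model where the comment ends.
            "completion": " " + docstring + stop,
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


pairs = [
    ("int add(int a, int b) { return a + b; }",
     "/** Returns the sum of two integers. */"),
]
print(to_jsonl(pairs))
```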
ChatGPT was trained to act as a virtual assistant: the training data included a lot of conversations with correct responses to various human requests.
Let’s ask it to write some code.
You can ask ChatGPT follow-up questions because it remembers what was said earlier in the conversation.
The main benefit of large language models is that they are capable of handling many intellectual tasks on par with humans.
Large language models can solve various business problems such as:
- Content creation: document generation, suggesting text completions, and audio generation
- Document processing: text correction, document formatting, text summarization, information extraction, searching documents, and text classification
- Conversational AI: chatbots or virtual assistants that can interact with users in a natural way and provide assistance or answer questions
- Language tasks: machine translation and text stylization
- Source code-related tasks: generation, completion, correction, commenting, and search
Overall, the potential applications for large language models in business are vast and are able to bring numerous benefits to the companies that utilize them.
While AI and machine learning have been an important part of Akvelon’s R&D for years, our team of highly-skilled software engineers has deep expertise in the field of large language model development and deployment, with extensive technical knowledge and experience in building and optimizing these models for various business applications.
We can help your business harness the power of these advanced technologies to drive growth and success. Let us put our expertise to work for you.
Contact us at firstname.lastname@example.org to discuss your plans for implementing the latest technological advancements further.
This article was written by
Data Scientist at Akvelon