"A **bigram language model**, for instance, takes the immediately preceding word to determine the probabilities for the next word. We use `<s>` to indicate the start token. Modify this code snippet so that each `prev` term (key) has a dictionary consisting of the number of times the `curr` term appears afterwards."
> Inspired by Jeff Dean's talk, [Exciting Trends in Machine Learning](https://youtu.be/oSCRZkSQ1CE).
In this lesson, we'll apply what we know about neural networks toward the study of **language models**: probabilistic models of a natural (human) language, such as English. The goal of today is to become more conversational in language modeling techniques by studying recent natural language processing methods (rather than applications or implications). By the end of this lesson, students will be able to:
- Explain how unigram and bigram language models determine the probability of the next word.
- Explain how embedding spaces can provide a semantic, distributed representation for concepts.
- Explain the benefits of self-attention mechanisms and hidden states in an RNN.
In its simplest form, a language model is similar to the `Document` class that we defined in the *Search* assessment, which consists of a term-frequency dictionary. A **unigram language model** guesses the next word based on the term-frequency dictionary alone. In the document `doggos/doc1.txt`, each unique word appears once, so each word has a term frequency of 1.
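To make this concrete, here's a minimal sketch of a unigram model (not the assessment's `Document` class); the short string below is a made-up stand-in for the contents of a file like `doggos/doc1.txt`.

```python
import random
from collections import Counter

# The string below is a stand-in for the contents of a file like doggos/doc1.txt.
text = "dogs are the best and everyone loves them"
counts = Counter(text.lower().split())

words = list(counts)
weights = [counts[w] for w in words]

# Each word is sampled in proportion to its term frequency, independent of any
# context: the model has no idea which word came before.
next_word = random.choices(words, weights=weights)[0]
print(next_word)
```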
Unigram models are simple but not particularly useful. For one, there's no notion of context: each word is sampled entirely independently of every other word. Large language models like ChatGPT learn from internet-scale training datasets and consider the preceding words (tokens) when determining the probability of the next word.
A **bigram language model**, for instance, takes the immediately preceding word into account to determine the probabilities for the next word. We use `<s>` to indicate the start token. Modify this code snippet so that each `prev` term (key) maps to a dictionary counting the number of times each `curr` term appears immediately afterwards.
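Since the snippet itself isn't reproduced here, the following is a minimal sketch of the intended result, using a made-up token sequence; `counts`, `prev`, and `curr` follow the naming in the prompt.

```python
from collections import defaultdict

# counts[prev][curr] is the number of times curr appears immediately after prev.
# "<s>" marks the start of the sequence; the sentence itself is made up.
tokens = "<s> the dog chased the cat".split()

counts = defaultdict(lambda: defaultdict(int))
for prev, curr in zip(tokens, tokens[1:]):
    counts[prev][curr] += 1

# To predict the word after "the", normalize counts["the"] into probabilities.
print(dict(counts["the"]))  # {'dog': 1, 'cat': 1}
```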
Unigram and bigram language models are more generally known as **n-gram language models**. With larger context windows (larger n), n-gram models can produce more coherent results. But the approach has two fundamental limitations: it has no notion of what a word means, and it's sensitive to the exact words and sequences that appeared in the training set.
To address the first problem of learning word meaning, how might a computer even learn the meaning of a word? Strings in Python are sequences of characters, where each character is just a number. Or, if you recall how we handled the city names "NY" and "SF" to predict the location of a home in model evaluation, strings could also be represented as "dummy variables" or boolean categories.
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>beds</th>
<th>bath</th>
<th>year_built</th>
<th>sqft</th>
<th>price_per_sqft</th>
<th>elevation</th>
<th>city_NY</th>
<th>city_SF</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2.0</td>
<td>1.0</td>
<td>1960</td>
<td>1000</td>
<td>999</td>
<td>10</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>1</th>
<td>2.0</td>
<td>2.0</td>
<td>2006</td>
<td>1418</td>
<td>1939</td>
<td>0</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>2</th>
<td>2.0</td>
<td>2.0</td>
<td>1900</td>
<td>2150</td>
<td>628</td>
<td>9</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>3</th>
<td>1.0</td>
<td>1.0</td>
<td>1903</td>
<td>500</td>
<td>1258</td>
<td>9</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>4</th>
<td>0.0</td>
<td>1.0</td>
<td>1930</td>
<td>500</td>
<td>878</td>
<td>10</td>
<td>True</td>
<td>False</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>487</th>
<td>5.0</td>
<td>2.5</td>
<td>1890</td>
<td>3073</td>
<td>586</td>
<td>76</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>488</th>
<td>2.0</td>
<td>1.0</td>
<td>1923</td>
<td>1045</td>
<td>665</td>
<td>106</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>489</th>
<td>3.0</td>
<td>2.0</td>
<td>1922</td>
<td>1483</td>
<td>1113</td>
<td>106</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>490</th>
<td>1.0</td>
<td>1.0</td>
<td>1983</td>
<td>850</td>
<td>764</td>
<td>163</td>
<td>False</td>
<td>True</td>
</tr>
<tr>
<th>491</th>
<td>3.0</td>
<td>2.0</td>
<td>1956</td>
<td>1305</td>
<td>762</td>
<td>216</td>
<td>False</td>
<td>True</td>
</tr>
</tbody>
</table>
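As a reminder, these one-hot "dummy" columns can be produced with `pd.get_dummies`; here's a small sketch with made-up values (the column names match the table above, but this is not the assessment's exact code).

```python
import pandas as pd

# Made-up rows: the "city" column holds strings like "NY" and "SF".
homes = pd.DataFrame({
    "sqft": [1000, 1418, 3073],
    "city": ["NY", "NY", "SF"],
})

# get_dummies replaces the string column with one boolean column per category,
# producing the city_NY / city_SF columns shown in the table above.
print(pd.get_dummies(homes, columns=["city"]))
```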
Word2vec refers to a technique that can learn a word embedding (semantic representation) by guessing how to fill in the blank from the immediate context. Words that tend to appear in similar contexts probably have similar meanings, so an algorithm can learn the meaning of human words by examining a word's favorite neighbors. Words with similar neighbors should have similar representations. What word could appear in the following blank?
> The city of _________ has an oceanic climate.
This enables us to find synonyms as the [TensorFlow Embedding Projector](https://projector.tensorflow.org/) shows. But the authors also pointed out some interesting [examples of the learned relationships](https://arxiv.org/pdf/1301.3781.pdf#page=10), such as how, in the embedding space, `Paris - France + Italy = Rome`.
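We can try this vector arithmetic ourselves. Here's a minimal sketch using the `gensim` library (not used elsewhere in this course) with small pretrained GloVe vectors, which are likewise learned from which words appear near each other; the first call downloads the vectors.

```python
import gensim.downloader as api

# Download a small set of pretrained word vectors (GloVe rather than word2vec,
# but the same idea of learning representations from neighboring words).
wv = api.load("glove-wiki-gigaword-50")

# Vector arithmetic in the embedding space: paris - france + italy ≈ rome.
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```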
Although word embeddings provide a semantic representation for the meaning of individual words, there still remains the challenge of how to combine these word embeddings to form meaningful sentences. How might we use machine learning to combine word embeddings in a way that is sensitive to context?
Earlier, we learned two ways that neural networks could be used to classify handwritten digit images. But both approaches (scikit-learn `MLPClassifier` and Keras `Conv2D`) involved learning weights and biases only from the input pixel values. In language modeling, where the possible sequences of words are effectively infinite, it becomes much harder to train neural network weights and biases to directly handle every possibility.
**Recurrent neural networks** (RNNs) represent a different way of organizing a neural network: the output of a neuron is calculated taking into account not only the inputs but also a *hidden state* that represents previously-generated outputs. Unlike *hidden layers* in a neural network, **hidden states** carry information about previously-generated steps forward to the current step. In other words, an RNN learns to predict the next word based on the current input as well as information obtained from previous words.
For example, we can use recurrent neural networks for language modeling by considering how they might generate a response sequence. Given an input `X` of one or more words, a recurrent neural network generates an output `O` one word at a time while considering the hidden state `V`.
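To see what a hidden state looks like in code, here's a minimal sketch of a single "vanilla" RNN step in NumPy, with random weights standing in for learned ones; the variable names are illustrative, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_x = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)

def rnn_step(x, h_prev):
    """One RNN step: the new hidden state mixes the current input with the
    previous hidden state, h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x + W_h @ h_prev + b)

h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):  # a sequence of 5 input vectors
    h = rnn_step(x, h)                      # the hidden state carries information forward
print(h)
```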
The **seq2seq** framework uses two RNNs to implement "sequence-to-sequence" tasks:
- An **encoder** that learns a machine representation for an input sequence, or *context*.
- A **decoder** that can take a machine representation (context) and decode it to a target sequence.
Originally, these models were used for machine translation, so the problem was framed as [reading an input sentence "ABC" and producing "WXYZ" as output](https://arxiv.org/pdf/1409.3215.pdf#page=2). By changing the training set to give the encoder context and have the decoder predict the expected reply, [the seq2seq framework can be used to model conversations](https://arxiv.org/pdf/1506.05869.pdf#page=2). How does this framework differ from using a single recurrent neural network for language modeling?
Example code using Keras: [character-level sequence-to-sequence modeling](https://keras.io/examples/nlp/lstm_seq2seq/). For word-level tasks, we can use word embeddings to provide more information to the model about the meaning of each word.
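Here's a stripped-down sketch of the encoder-decoder wiring from that example, with made-up sizes for `num_tokens` and `latent_dim`; the full example also covers preparing the training data and running the decoding loop.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_tokens, latent_dim = 71, 256  # made-up vocabulary size and state size

# Encoder: read the input sequence and keep only its final LSTM states (the context).
encoder_inputs = keras.Input(shape=(None, num_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: generate the target sequence, starting from the encoder's states.
decoder_inputs = keras.Input(shape=(None, num_tokens))
decoder_outputs = layers.LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c]
)
probs = layers.Dense(num_tokens, activation="softmax")(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], probs)
model.summary()
```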
With more data, the seq2seq approach can produce good results, but at a high computational cost because the hidden state needs to be updated on each step. Recurrent neural networks end up spending significant amounts of resources computing hidden states sequentially.
The **transformer architecture** addresses this by throwing out the RNN and instead utilizing a mechanism called *self-attention* to learn context without relying on hidden states. The goal of **self-attention** is to identify associations or relationships between different words in a sequence. Consider the sentence:
> The animal didn't cross the street because it was too tired
What does "it" refer to—the animal or the street? Let's read about why this makes a difference in translation on the [Google Research blog](https://blog.research.google/2017/08/transformer-novel-neural-network.html) on the seminal work ["Attention is all you need"](https://arxiv.org/abs/1706.03762) and study the [Tensor2Tensor Intro notebook](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb) to see for ourselves how attention heads learn different relationships between words.
Although the transformer architecture was not the first to introduce the idea of an attention mechanism, it has since become the most popular way to not only use attention but to define language models in general.
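Before moving on, here's a minimal sketch of the scaled dot-product self-attention computation at the heart of the transformer, with random matrices standing in for the learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 4, 8
X = rng.normal(size=(seq_len, d))  # stand-in embeddings for a 4-word sequence

# Learned projections would map X to queries, keys, and values; here they're random.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Each word attends to every word in the sequence: softmax(Q K^T / sqrt(d)) V.
weights = softmax(Q @ K.T / np.sqrt(d))  # (seq_len, seq_len) attention weights
output = weights @ V                     # context-sensitive word representations
print(weights.round(2))
```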
In this lesson, we'll learn about the newest advances in **large language models** since the advent of the transformer architecture. By the end of this lesson, students will be able to:
- Become familiar with some techniques for improving large language models.
- Discuss the potential impact of model size (number of parameters) on model performance, as well as other kinds of impact (e.g. environmental and financial costs).
## How do LLMs work?
Two relevant videos can be found on Prof. Steve Seitz's YouTube channel, ["Graphics in 5 Minutes"](https://www.youtube.com/@g5min).
Like the other ML models we've studied, LLMs need to be trained and can then be used to predict outputs. [GPT in 60 Lines of NumPy](https://jaykmody.com/blog/gpt-from-scratch/) by Jay Mody shows how one can replicate a tiny version of GPT-2 in just 60 lines of Python code. Let's play with this [demo](https://colab.research.google.com/drive/1IZTR5OM3AuB5yU6TY_5GX7Ug0pVSVTi2?usp=sharing) for a bit.
### Techniques for Improving LLMs
Now let's look at techniques that help train LLMs better or improve their performance during inference.
#### Instruction fine-tuning
In the world of neural networks, fine-tuning means taking an already-trained network and running additional training iterations on a second dataset, updating the trained network's parameters so that the fine-tuned version performs better on tasks related to that second dataset. Similar ideas work well for transformers, too.
[Fine-tune Gemma models in Keras using LoRA (colab notebook)](https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemma/docs/lora_tuning.ipynb?utm_source=agd&utm_medium=referral&utm_campaign=open-in-colab&hl=de) or the [doc version](https://ai.google.dev/gemma/docs/lora_tuning)
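For a rough sense of what fine-tuning looks like in code (not the LoRA approach from the linked notebook), here's a hypothetical Keras sketch that freezes a pretrained image model and trains only a new output layer on a second dataset.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Reuse a pretrained model and freeze its weights; only the new "head" is trained.
base = keras.applications.MobileNetV2(include_top=False, pooling="avg")
base.trainable = False

model = keras.Sequential([
    base,
    layers.Dense(5, activation="softmax"),  # hypothetical new task with 5 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(new_task_images, new_task_labels, epochs=3)  # the second dataset (not shown)
```

LoRA, by contrast, keeps the pretrained weights frozen and learns small low-rank update matrices, which is far cheaper for LLM-sized models.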
#### Reinforcement learning from human feedback
Reinforcement learning refers to a kind of machine learning in which an agent learns a policy for choosing actions in response to different states so as to maximize its accumulated reward. The "reinforcement" refers to the fact that the agent's behavior is guided by reward: rewards for desired behavior reinforce the agent's learned policy, so the agent tends to choose actions that lead to greater rewards. For LLMs, feedback from humans in the loop can provide this reward signal.
[RLHF Wikipedia page](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback)
#### Few-shot learning
Unlike the previous two techniques, which improve LLM training, few-shot learning does not change the model parameters at all: it simply employs a prompting strategy that provides examples in the prompt so that the LLM can learn through its context window.
[Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165): *"While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches."*
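For example, a few-shot prompt might look like the following sketch (the translation examples are made up); no parameters are updated, and the examples only pass through the model's context window.

```python
# A few-shot prompt: the examples in the prompt are the only "training" the model sees.
examples = [
    ("cheese", "fromage"),
    ("bread", "pain"),
    ("apple", "pomme"),
]
query = "house"

prompt = "Translate English to French.\n"
for english, french in examples:
    prompt += f"{english} => {french}\n"
prompt += f"{query} =>"

print(prompt)
# Sending this prompt to an LLM should yield the completion "maison".
```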
#### Chain-of-thought prompting
Another interesting strategy is to include the intermediate reasoning steps in the prompt's examples, a kind of few-shot learning taken one step further.
[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903): *"We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier."*
[Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601): *"ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices."*
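As a sketch, a chain-of-thought prompt includes the intermediate reasoning in its exemplar, echoing the math word problems used in the paper.

```python
# The worked example shows its reasoning, encouraging the model to "think" step
# by step on the new question rather than jumping straight to an answer.
prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. "
    "The answer is 11.\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)
print(prompt)
# A model prompted this way tends to produce the intermediate steps
# (23 - 20 = 3, then 3 + 6 = 9) before giving the final answer, 9.
```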
### Model Parameters
[Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556): *"By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled."*
[Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155): *"Making language models bigger does not inherently make them better at following a user's intent....In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.... Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent."*
[Energy and Policy Considerations for Deep Learning in NLP](https://arxiv.org/abs/1906.02243)
[On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜](https://dl.acm.org/doi/10.1145/3442188.3445922): *"In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models."*