%% Cell type:markdown id:841b9a1f-16ff-4880-bed7-48e99b96cc3f tags:
# Language Models
> Inspired by Jeff Dean's talk, [Exciting Trends in Machine Learning](https://youtu.be/oSCRZkSQ1CE).
In this lesson, we'll apply what we know about neural networks toward the study of **language models**: probabilistic models of a natural (human) language, such as English. The goal of today is to become more conversational in language modeling techniques by studying recent natural language processing methods (rather than applications or implications). By the end of this lesson, students will be able to:
- Explain how unigram and bigram language models determine the probability of the next word.
- Explain how embedding spaces can provide a semantic, distributed representation for concepts.
- Explain the benefits of self-attention mechanisms and hidden states in an RNN.
%% Cell type:code id:45302cff-d6c7-4f56-b6a2-4f550d3da667 tags:
``` python
import os
import random
import re
from collections import Counter
from typing import Any


def clean(token: str, pattern: re.Pattern[str] = re.compile(r"\W+")) -> str:
    """
    Returns all the characters in the token lowercased and without matches to the given pattern.

    >>> clean("Hello!")
    'hello'
    """
    return pattern.sub("", token.lower())


def sample(frequencies: dict[Any, float], k: int = 1) -> list[Any]:
    """
    Returns a list of k randomly sampled keys from the given frequencies with replacement.

    >>> sample({"test": 1})
    ['test']
    >>> sample({"test": 1}, k=3)
    ['test', 'test', 'test']
    """
    return random.choices(list(frequencies), weights=frequencies.values(), k=k)
```
%% Cell type:markdown id:3f775cfe-5d86-4093-b3ed-56e2a3e22c93 tags:
## Statistical models
In its simplest form, a language model is similar to the `Document` class that we defined in the *Search* assessment, which consists of a term-frequency dictionary. A **unigram language model** guesses the next word based on the term-frequency dictionary alone. In the document `doggos/doc1.txt`, each unique word appears once, so each word has a term frequency of 1.
%% Cell type:code id:838336d8-0e16-4c15-a28c-77e01946afd1 tags:
``` python
terms = {
    "dogs": 1,
    "are": 1,
    "the": 1,
    "greatest": 1,
    "pets": 1,
}
sample(terms, 20)
```
%% Cell type:markdown id:83adf77a-8767-45d9-a210-af95de4913eb tags:
Unigram models are simple but not particularly useful. For one, there's no notion of context: each word is sampled entirely independently from every other word. Large language models like ChatGPT learn from internet-scale training datasets and consider the preceding words (tokens) when determining the probability of the next word.
A **bigram language model**, for instance, takes the immediately preceding word into account to determine the probabilities for the next word. We use `<s>` to indicate the start token. Modify this code snippet so that each `prev` term (key) maps to a dictionary counting the number of times each `curr` term appears immediately afterwards; one possible completion appears after this cell.
%% Cell type:code id:71a7ec72 tags:
``` python
os.chdir("assessments")
```
%% Cell type:code id:7e9ee841-8af5-46d3-a31b-8521643d2686 tags:
``` python
terms = {}
for filename in os.listdir("small_wiki"):
    if filename.endswith(".html"):
        with open(os.path.join("small_wiki", filename)) as f:
            words = ["<s>"] + [clean(word) for word in f.read().split() if clean(word)]
            for prev, curr in zip(words, words[1:]):
                terms[prev] = curr

terms["<s>"]  # Should be {"metadata": 70} to indicate all 70 documents start with "metadata"
```
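%% Cell type:markdown id: tags:
One possible completion of the exercise, as a sketch: each `prev` key maps to a nested dictionary of `curr` counts, so `terms[prev]` can be passed directly to `sample`.
%% Cell type:code id: tags:
``` python
terms = {}
for filename in os.listdir("small_wiki"):
    if filename.endswith(".html"):
        with open(os.path.join("small_wiki", filename)) as f:
            words = ["<s>"] + [clean(word) for word in f.read().split() if clean(word)]
            for prev, curr in zip(words, words[1:]):
                # Count how many times curr immediately follows prev
                if prev not in terms:
                    terms[prev] = {}
                terms[prev][curr] = terms[prev].get(curr, 0) + 1

terms["<s>"]
```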
%% Cell type:markdown id:0d222eac-cf09-4508-8e88-3e97d9958734 tags:
To generate a sequence of a given length, repeatedly use the `last` word to sample the next word and append it to the result.
%% Cell type:code id:8febb176-d612-4681-a051-4e629b3b4726 tags:
``` python
n_words = 20
result = "what is the best".split()
for _ in range(n_words):
    last = result[-1]
    result += sample(terms[last])
result
```
%% Cell type:markdown id:642ccc52-7fe9-4c85-bf14-7d83a18c56f4 tags:
## Word embeddings
Unigram and bigram language models are more generally known as **n-gram language models**. With larger context windows (n), n-gram models can produce more understandable results. But the approach has a fundamental limitation: it's sensitive to the exact words and the sequences in which they appeared in the training set.
To address the first problem, learning word meaning, how might a computer even learn the meaning of a word? Strings in Python are sequences of characters, where each character is just a number. Or, if you recall how we handled the city names "NY" and "SF" to predict the location of a home in model evaluation, strings could also be represented as "dummy variables" or boolean categories.
| | beds | bath | year_built | sqft | price_per_sqft | elevation | city_NY | city_SF |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1.0 | 1960 | 1000 | 999 | 10 | True | False |
| 1 | 2.0 | 2.0 | 2006 | 1418 | 1939 | 0 | True | False |
| 2 | 2.0 | 2.0 | 1900 | 2150 | 628 | 9 | True | False |
| 3 | 1.0 | 1.0 | 1903 | 500 | 1258 | 9 | True | False |
| 4 | 0.0 | 1.0 | 1930 | 500 | 878 | 10 | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 487 | 5.0 | 2.5 | 1890 | 3073 | 586 | 76 | False | True |
| 488 | 2.0 | 1.0 | 1923 | 1045 | 665 | 106 | False | True |
| 489 | 3.0 | 2.0 | 1922 | 1483 | 1113 | 106 | False | True |
| 490 | 1.0 | 1.0 | 1983 | 850 | 764 | 163 | False | True |
| 491 | 3.0 | 2.0 | 1956 | 1305 | 762 | 216 | False | True |
Word2vec refers to a technique that can learn a word embedding (semantic representation) by guessing how to fill in the blank from the immediate context. Words that tend to appear in similar contexts probably have similar meanings, so an algorithm can learn the meaning of human words by examining a word's favorite neighbors. Words with similar neighbors should have similar representations. What word could appear in the following blank?
> The city of _________ has an oceanic climate.
This enables us to find synonyms as the [TensorFlow Embedding Projector](https://projector.tensorflow.org/) shows. But the authors also pointed out some interesting [examples of the learned relationships](https://arxiv.org/pdf/1301.3781.pdf#page=10), such as how, in the embedding space, `Paris - France + Italy = Rome`.
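As a sketch of what this looks like in code (assuming the `gensim` library and its hosted `word2vec-google-news-300` pretrained vectors, which are a large download on first use):
%% Cell type:code id: tags:
``` python
import gensim.downloader

# Load pretrained word2vec embeddings trained on Google News text
vectors = gensim.downloader.load("word2vec-google-news-300")

# Words that appear in similar contexts have nearby embeddings
vectors.most_similar("ocean")

# The learned relationship: Paris - France + Italy should be close to Rome
vectors.most_similar(positive=["Paris", "Italy"], negative=["France"])
```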
%% Cell type:markdown id:00430943-4ff0-4cb5-81f0-71609f12e7ab tags:
## Recurrent neural networks
Although word embeddings produce a semantic representation for the meaning of individual words, a challenge remains: how might we use machine learning to combine these word embeddings to form meaningful sentences in a way that is sensitive to context?
Earlier, we learned two ways that neural networks could be used to classify handwritten digit images. But both approaches (scikit-learn `MLPClassifier` and Keras `Conv2D`) involved learning weights and biases only from the input pixel values. In language modeling, where the possible sequences of words are infinite, it becomes much harder to train neural network weights and biases to directly handle every possibility.
**Recurrent neural networks** (RNNs) represent a different way of organizing a neural network by calculating the output of a neuron taking into account not only the inputs but also a *hidden state* that represents previously-generated outputs. Unlike *hidden layers* in a neural network, **hidden states** provide additional information about previously-generated steps to the current step. In other words, an RNN learns to predict the next word based on the current input as well as information obtained from previous words.
For example, we can use recurrent neural networks for language modeling by considering how they might generate a response sequence. Given an input `x` of one or more words, a recurrent neural network generates an output `o` one word at a time while carrying forward the hidden state `h` (in the diagram below, `U`, `V`, and `W` are learned weight matrices).
![Recurrent neural network](https://upload.wikimedia.org/wikipedia/commons/b/b5/Recurrent_neural_network_unfold.svg)
Example code using keras: [character-level text generation with LSTM](https://keras.io/examples/generative/lstm_character_level_text_generation/)
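To make the idea concrete, here is a minimal sketch of a character-level RNN language model in Keras (the vocabulary size and context length are hypothetical placeholders, and the data preparation shown in the linked example is omitted):
%% Cell type:code id: tags:
``` python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 128  # number of distinct characters (placeholder)
seq_len = 40      # length of the input context window (placeholder)

model = keras.Sequential([
    layers.Input(shape=(seq_len, vocab_size)),       # one-hot encoded characters
    layers.SimpleRNN(128),                           # hidden state carries context across steps
    layers.Dense(vocab_size, activation="softmax"),  # probability of each next character
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```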
%% Cell type:markdown id:8c1dfd87-cf1d-4706-98ed-cba9c25cab38 tags:
### Sequence-to-sequence framework
The **seq2seq** framework utilizes two RNNs to implement "sequence-to-sequence" tasks:
- An **encoder** that learns a machine representation for an input sequence, or *context*.
- A **decoder** that can take a machine representation (context) and decode it to a target sequence.
Originally, these models were used for machine translation, so the problem was framed as [reading an input sentence "ABC" and producing "WXYZ" as output](https://arxiv.org/pdf/1409.3215.pdf#page=2). By changing the training set to give the encoder context and have the decoder predict the expected reply, [the seq2seq framework can be used to model conversations](https://arxiv.org/pdf/1506.05869.pdf#page=2). How does this framework differ from using a single recurrent neural network for language modeling?
Example code using keras: [character-level sequence-to-sequence modeling](https://keras.io/examples/nlp/lstm_seq2seq/). For word-level tasks, we can use word embeddings to provide more information to the model about the meaning of each word.
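A minimal sketch of the encoder-decoder wiring from that example (the dimensions are placeholders, and the training and inference loops are omitted):
%% Cell type:code id: tags:
``` python
from tensorflow import keras
from tensorflow.keras import layers

num_tokens = 256  # vocabulary size (placeholder)
latent_dim = 128  # size of the machine representation, or context (placeholder)

# Encoder: read the input sequence and keep only its final hidden states
encoder_inputs = keras.Input(shape=(None, num_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: generate the target sequence, starting from the encoder's context
decoder_inputs = keras.Input(shape=(None, num_tokens))
decoder_outputs = layers.LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c]
)
decoder_outputs = layers.Dense(num_tokens, activation="softmax")(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
```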
%% Cell type:markdown id:dcc7e430-b1ac-4681-bd67-f0e0636e7e4a tags:
## Transformer architecture
With more data, the seq2seq approach can produce good results, but at a high computational cost because the hidden state needs to be updated on each step. Recurrent neural networks end up spending significant amounts of resources computing hidden states sequentially.
The **transformer architecture** addresses this by throwing out the RNN and instead utilizing a mechanism called *self-attention* to learn context without relying on hidden states. The goal of **self-attention** is to identify associations or relationships between different words in a sequence. Consider the sentence:
> The animal didn't cross the street because it was too tired
What does "it" refer to—the animal or the street? Let's read about why this makes a difference in translation on the [Google Research blog](https://blog.research.google/2017/08/transformer-novel-neural-network.html) on the seminal work ["Attention is all you need"](https://arxiv.org/abs/1706.03762) and study the [Tensor2Tensor Intro notebook](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb) to see for ourselves how attention heads learn different relationships between words.
Although the transformer architecture was not the first to introduce the idea of an attention mechanism, it has since become the most popular way to not only use attention but to define language models in general.
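To see the core computation, here is a toy sketch of single-head self-attention in NumPy: each word is re-expressed as a weighted average of all words in the sequence, where the weights measure how much each word attends to every other word (the matrices would be learned in practice; here they are random placeholders).
%% Cell type:code id: tags:
``` python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.random.rand(5, 8)  # 5 words, each with an 8-dimensional embedding (placeholder)
Wq, Wk, Wv = (np.random.rand(8, 8) for _ in range(3))  # learned in practice

Q, K, V = X @ Wq, X @ Wk, X @ Wv
weights = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # how much each word attends to each other word
output = weights @ V  # each row: a context-aware representation of one word
```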
......
%% Cell type:markdown id: tags:
# Large Language Models
In this lesson, we'll learn about the newest advances in **large language models** since the advent of the transformer architecture. By the end of this lesson, students will be able to:
- Get familiar with some techniques for improving large language models.
- Discuss the potential impact of model parameters on model performance and other kinds of impact (e.g. environmental, financial).
%% Cell type:markdown id: tags:
## How do LLMs work?
Below are two videos from Prof. Steve Seitz's YouTube Channel ["Graphics in 5 Minutes"](https://www.youtube.com/@g5min):
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
```
%%html
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/lnA9DMvHtfI?si=QJRk0fEHZNbuKf8b" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
```
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
```
%%html
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/YDiSFS-yHwk?si=KY34lWBCaoIzNCEW" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
```
%% Cell type:markdown id: tags:
### LLMs are neural networks
Similar to any ML model we've learned about, LLMs also need to be trained and can then be used to predict outputs. [GPT in 60 Lines of Numpy](https://jaykmody.com/blog/gpt-from-scratch/) by Jay Mody shows how one can replicate a tiny version of GPT-2 in just 60 lines of Python code. Let's play with this [demo](https://colab.research.google.com/drive/1IZTR5OM3AuB5yU6TY_5GX7Ug0pVSVTi2?usp=sharing) for a bit.
%% Cell type:markdown id: tags:
### Techniques for Improving LLMs
Now let's look at techniques that help train LLMs better or improve their performance during inference.
#### Instruction fine-tuning
In the world of neural networks, fine-tuning means taking an already-trained network and running additional training iterations on a second dataset, updating the trained network's parameters so that the fine-tuned version performs better on tasks related to that second dataset. Similar ideas work well for transformers too.
[Fine-tune Gemma models in Keras using LoRA (colab notebook)](https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemma/docs/lora_tuning.ipynb?utm_source=agd&utm_medium=referral&utm_campaign=open-in-colab&hl=de) or the [doc version](https://ai.google.dev/gemma/docs/lora_tuning)
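For a rough sense of the workflow, here is a sketch following the linked tutorial (it assumes `keras_nlp` is installed and Kaggle credentials for the Gemma weights are configured; the tiny inline dataset is a placeholder):
%% Cell type:code id: tags:
``` python
import keras_nlp

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
gemma_lm.backbone.enable_lora(rank=4)  # train only small low-rank adapter matrices

# Placeholder fine-tuning data in the tutorial's instruction/response format
data = ["Instruction:\nWhat is the capital of France?\n\nResponse:\nParis."]
gemma_lm.fit(data, epochs=1, batch_size=1)
```
%% Cell type:markdown id: tags: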
#### Reinforcement learning from human feedback
Reinforcement learning refers to a kind of machine learning where an agent learns a policy for choosing actions in response to different states so as to maximize its accumulated reward. The "reinforcement" aspect refers to the fact that the agent's behavior is guided by the reward: rewards for desired behavior reinforce the agent's learned policy so that it tends to choose actions that lead to greater rewards. For LLMs, human-in-the-loop feedback can provide exactly this kind of reward signal.
[RLHF Wikipedia page](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback)
#### Few-shot learning
Unlike the previous two techniques, which improve LLM training, few-shot learning does not change model parameters: it simply employs a prompting strategy that provides examples in the prompt so that the LLM can learn through its context window.
[Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165): *"While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches."*
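For example, a few-shot prompt might look like the following (translation exemplars adapted from the paper's illustration; the model completes the last line without any parameter updates):
%% Cell type:code id: tags:
``` python
# The exemplars in the context window define the task for the model
prompt = """Translate English to French.
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>"""
```
%% Cell type:markdown id: tags: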
#### Chain-of-thought prompting
Another interesting strategy is to also include the intermediate reasoning steps in the prompt's examples, a kind of few-shot learning that takes things one step further.
[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903): *"We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier."*
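For example, a chain-of-thought exemplar (adapted from the paper) spells out the intermediate reasoning steps rather than just the final answer:
%% Cell type:code id: tags:
``` python
# The exemplar answer demonstrates the reasoning steps, not just "11"
prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6
more, how many apples do they have?
A:"""
```
%% Cell type:markdown id: tags: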
[Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601): *"ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices."*
%% Cell type:markdown id: tags:
### Model Parameters
%% Cell type:markdown id: tags:
[Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556): *"By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled."*
[Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155): *"Making language models bigger does not inherently make them better at following a user's intent....In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters.... Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent."*
[Energy and Policy Considerations for Deep Learning in NLP](https://arxiv.org/abs/1906.02243)
| Model | Date of original paper | Energy consumption (kWh) | Carbon footprint (lbs of CO2e) | Cloud compute cost (USD) |
|---|---|---|---|---|
| GPT-2 | Feb, 2019 | - | - | $12,902-$43,008 |
| Transformer (213M parameters) | Jun, 2017 | 201 | 192 | $289-$981 |
| BERT (110M parameters) | Oct, 2018 | 1,507 | 1,438 | $3,751-$12,571 |
| Transformer (65M parameters) | Jun, 2017 | 27 | 26 | $41-$140 |
| ELMo | Feb, 2018 | 275 | 262 | $433-$1,472 |
| Transformer (213M parameters) w/ neural architecture search | Jan, 2019 | 656,347 | 626,155 | $942,973-$3,201,722 |
[On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜](https://dl.acm.org/doi/10.1145/3442188.3445922): *"In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models."*
%% Cell type:markdown id: tags:
## Looking ahead...
%% Cell type:markdown id: tags:
Now that we have access to LLMs, what's next?
......