# TinyStories: Language Model Experimentation

## Project Overview
This project aims to reproduce and extend the experiments from the paper "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?" by Eldan and Li (2023). Our goal is to investigate whether a language model with only 8M parameters, built on the GPT-Neo architecture, can be trained to produce coherent and fluent text in a specific domain (short children's stories).
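For orientation, the sketch below shows roughly what a model in this size class looks like when configured with the Hugging Face `transformers` library. The hyperparameters are illustrative assumptions, not this project's actual configuration; see the training code for the real settings.

```python
# Illustrative only: a small GPT-Neo configuration roughly in the size
# class of the paper's 8M model (counting non-embedding parameters).
from transformers import GPTNeoConfig, GPTNeoForCausalLM

config = GPTNeoConfig(
    vocab_size=50257,             # GPT-2/GPT-Neo tokenizer vocabulary
    max_position_embeddings=512,  # a short context suffices for short stories
    hidden_size=256,
    num_layers=8,
    num_heads=16,
    attention_types=[[["global", "local"], 4]],  # alternating global/local attention, 8 layers total
)
model = GPTNeoForCausalLM(config)

# Count parameters excluding the token and position embedding tables.
non_embedding = sum(
    p.numel()
    for name, p in model.named_parameters()
    if "wte" not in name and "wpe" not in name
)
print(f"non-embedding parameters: {non_embedding / 1e6:.1f}M")
```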
## Team Members
- Collin Dang (cdang38)
- Seulchan Han (paulh27)
- Xumin Li (xumin)
- Jingwei Ma (jingweim)
## Paper Citation
Eldan, R., & Li, Y. (2023, May 24). TinyStories: How small can language models be and still speak coherent English?. arXiv.org. https://arxiv.org/abs/2305.07759
## Data Source
The TinyStories dataset used in this project is available on Hugging Face: https://huggingface.co/datasets/roneneldan/TinyStories.
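As a quick sanity check, the dataset can be loaded directly with the Hugging Face `datasets` library (assuming it is installed):

```python
from datasets import load_dataset

# Downloads (and caches) the TinyStories dataset from the Hugging Face Hub.
dataset = load_dataset("roneneldan/TinyStories")
print(dataset)                            # available splits and their sizes
print(dataset["train"][0]["text"][:200])  # preview the start of the first story
```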
The dataset snapshot used in this project was downloaded on March 5, 2024, and can be found here: https://drive.google.com/file/d/1achP4CYEjp1nN2UCMuTiMpO4COIhck_F/view?usp=drive_link
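If you prefer to fetch the snapshot from a script, something like the following should work, assuming the `gdown` package is installed; the output filename here is a hypothetical placeholder:

```python
import gdown

# Hypothetical helper: downloads the Google Drive snapshot by file id.
# Adjust the output path to wherever the project expects the data.
file_id = "1achP4CYEjp1nN2UCMuTiMpO4COIhck_F"
gdown.download(f"https://drive.google.com/uc?id={file_id}", "dataset_snapshot", quiet=False)
```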
## Setup and Usage

### Cloning the Repository
```bash
git clone --recurse-submodules git@gitlab.cs.washington.edu:cdang38/nlp-final-project.git
cd nlp-final-project
```
This may take a while because the TinyStories submodule is massive.
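If you cloned without `--recurse-submodules`, you can fetch the submodule afterwards with `git submodule update --init`.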
### Setting Up the Virtual Environment

```bash
python -m venv venv
source venv/Scripts/activate
```

On macOS/Linux, the activation script lives at `venv/bin/activate` instead.
### Installing Dependencies

```bash
make install-deps
```
### Freezing Dependencies

```bash
make freeze-deps
```
### Formatting and Linting the Code

```bash
make py-fmt
make py-lint
```
### Running the Project

Ensure you have a `.env` file containing your `OPENAI_API_KEY`, e.g.:

```
OPENAI_API_KEY=XXXXXXXXXXX
```
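For reference, here is a minimal sketch of how the key can be read at runtime, assuming the project uses the `python-dotenv` package (this README does not confirm how `main.py` actually loads it):

```python
import os

from dotenv import load_dotenv

load_dotenv()                           # read variables from .env into the environment
api_key = os.environ["OPENAI_API_KEY"]  # raises KeyError if the key is missing
```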
Then, run the main script:

```bash
python main.py
```