
TinyStories: Language Model Experimentation

Project Overview

This project aims to reproduce and extend the experiments from the paper "TinyStories: How small can language models be and still speak coherent English?" by Eldan, R., & Li, Y. (2023). Our goal is to investigate the feasibility of training a small language model with only 8M parameters to produce coherent and fluent text in a specific domain, using the GPT-Neo architecture as a reference.

Team Members

  • Collin Dang (cdang38)
  • Seulchan Han (paulh27)
  • Xumin Li (xumin)
  • Jingwei Ma (jingweim)

Paper Citation

Eldan, R., & Li, Y. (2023, May 24). TinyStories: How small can language models be and still speak coherent English? arXiv. https://arxiv.org/abs/2305.07759

Data Source

The dataset used in this project, TinyStories, is accessible through Hugging Face at https://huggingface.co/datasets/roneneldan/TinyStories.

The dataset snapshot used in this project was downloaded on March 5th, 2024 and can be found here: https://drive.google.com/file/d/1achP4CYEjp1nN2UCMuTiMpO4COIhck_F/view?usp=drive_link

Setup and Usage

Cloning the Repository

git clone --recurse-submodules git@gitlab.cs.washington.edu:cdang38/nlp-final-project.git
cd nlp-final-project

This may take a while because the TinyStories submodule is massive.

Setting Up the Virtual Environment

python -m venv venv
source venv/bin/activate      # Linux/macOS; on Windows use venv/Scripts/activate

Installing Dependencies

make install-deps

Freezing Dependencies

make freeze-deps

Formatting and Linting the Code

make py-fmt
make py-lint

Running the Project

Ensure you have a .env file containing your OPENAI_API_KEY, e.g.

OPENAI_API_KEY=XXXXXXXXXXX
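If you prefer not to rely on a loader library, reading the key from .env only takes a few lines of standard-library Python. The load_env helper below is a hypothetical sketch, not part of this repo:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: put KEY=VALUE lines into os.environ.

    Skips blank lines and comments; existing environment variables
    are not overwritten (setdefault).
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

After calling load_env(), the key is available as os.environ["OPENAI_API_KEY"].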

Then, run the main script.

python main.py