Commit c8161bce authored by jingweim

added arxiv dataset

parent 354c690b
@@ -2,4 +2,5 @@ venv/
 __pycache__/
 .env
 openai-env/
-.ipynb_checkpoints
\ No newline at end of file
+.ipynb_checkpoints
+data_20240305/
@@ -19,6 +19,8 @@ Eldan, R., & Li, Y. (2023, May 24). TinyStories: How small can language models b
 The dataset used in this project, TinyDataset, is accessible through Hugging Face at [https://huggingface.co/datasets/roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories).
+The arxiv dataset snapshot used in this project is downloaded on Mar.5th 2024 and can be found here: https://drive.google.com/file/d/1achP4CYEjp1nN2UCMuTiMpO4COIhck_F/view?usp=drive_link
 ## Setup and Usage
 ### Cloning the Repository
@@ -374,7 +374,7 @@ if __name__ == "__main__":
 if dataset_name == 'arxiv':
 # the full arxiv dataset
-dataset = load_dataset('arxiv_dataset', ignore_verifications=True, data_dir=f'{root}/data_20240305')
+dataset = load_dataset('arxiv_dataset', ignore_verifications=True, data_dir=f'{root}/src/data_20240305')
 dataset = preprocess_data(dataset['train'])
 random.seed(seed)
 random.shuffle(dataset)
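
Review note: the hunk above seeds Python's global `random` module and then shuffles the preprocessed data in place. A minimal sketch of the same idea with a local generator follows. It assumes `preprocess_data` returns a plain Python list, which the diff does not show. `seeded_shuffle` and `records` are hypothetical names used only for illustration and are not part of this repository.

```python
import random


def seeded_shuffle(records, seed):
    """Return a shuffled copy of `records`; the same seed yields the same order."""
    # A local generator keeps the global `random` state untouched.
    rng = random.Random(seed)
    shuffled = list(records)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical stand-in for the list returned by preprocess_data(dataset['train']).
records = [f"abstract-{i}" for i in range(10)]

first = seeded_shuffle(records, seed=42)
second = seeded_shuffle(records, seed=42)

assert first == second                    # same seed, same order
assert sorted(first) == sorted(records)   # same elements, only reordered
```

Using a dedicated `random.Random(seed)` instance is one possible design choice. It keeps the shuffle reproducible even if other code draws from the global generator between the `seed` and `shuffle` calls.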