EleutherAI, updated 2022-07-16 15:59:31

OpenWebText2

This project is part of EleutherAI's quest to create a massive repository of high-quality text data for training language models.

Very briefly, OpenWebText2 is a large, filtered dataset of text documents scraped from URLs found in Reddit submissions.

The plug and play version of OpenWebText2 contains:
- 17,103,059 documents
- 65.86GB uncompressed text

Download Dataset / Documentation

For further information please visit our documentation.

Acknowledgements

researcher2 wrote much of this code, with inspiration and some straight copying of the scraping code found here.
sdtblck kindly put together the Colab notebook, and performed a chunk of the scraping.
leogao2 provided overall design guidance, lm_dataformat, and performed another chunk of scraping.
Colaboratory VMs helped us with about 10% of our overall scraping.
The Eye hosts our processed datasets.
Read The Docs hosts our documentation.

Issues

Both Download links are not working

opened on 2023-01-26 00:42:54 by yj-lee0503

Hello, I have been trying to download the datasets, but both links are not working. Could someone please take a look at the downloadable links and implement a fix for them? Thank you!

[Link Failed] The link of openwebtext2 seems failed to open for download

opened on 2023-01-13 08:10:23 by ZHUI

The openwebtext2 link seems to fail to open for download. Can someone help check it?

page: https://openwebtext2.readthedocs.io/en/latest/index.html#download-plug-and-play-version
link: https://mystic.the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar (failed to open)

Fixing an issue with sha256 checking

opened on 2022-07-16 15:46:41 by ardacihaner

The pushshift.pushshift_to_sqlite method passes its arguments to best_download.download_file in the wrong order, so the code crashes. As a result, the dataset is not reproducible without this modification.
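
For readers unfamiliar with this class of bug, here is a minimal sketch of a checksum check that cannot be called with swapped positional arguments. It does not reproduce the best_download API or the actual fix; the helper names sha256_of_file and verify_download, and their parameters, are hypothetical and use only Python's standard hashlib module.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large dump files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(*, local_path, expected_sha256):
    """Hypothetical helper: True if the file matches the expected checksum.

    The bare `*` makes both parameters keyword-only, so the path and the
    checksum cannot be swapped silently by a positional call.
    """
    return sha256_of_file(local_path) == expected_sha256.lower()
```

With keyword-only parameters, a call such as verify_download(some_path, some_checksum) raises a TypeError immediately instead of comparing the wrong values later on.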

Releases

v1.0 2021-06-11 18:05:27

Initial Release.