Txt - Download 20220209corps Mix10k
This text file is a subset or processed version of the Pile-CC (Common Crawl) or OpenWebText2 components. The "mix10k" in the name usually signifies a sample of 10,000 documents or lines, used for benchmarking, validation, or measuring the perplexity of models such as GPT-Neo or GPT-J.
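Since samples like this are typically used to measure perplexity, a minimal sketch of that calculation may help. It assumes per-token natural-log probabilities have already been obtained from the model under evaluation. The function name `perplexity` and the example values are illustrative and do not come from any particular evaluation harness.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities for a four-token document; a real run
# would take these from the model being evaluated on the sample.
example_logprobs = [math.log(0.25)] * 4
print(perplexity(example_logprobs))  # close to 4: each token had probability 0.25
```

For a whole sample, the log-probabilities of all tokens across all documents are usually pooled before averaging, so that long documents are weighted by their token count.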
The full dataset and its components can be explored at pile.eleuther.ai.
You can find the parent dataset under the EleutherAI/pile identifier.
While the .txt slice itself is often hosted on private servers or shared through GitHub repositories for reproduction, the source data it is derived from is publicly available and is described by Gao et al. (2020).
Context and Usage