Text Mining Datasets

This page contains links to information retrieval and text mining datasets and corpora.

  1. WordSimilarity-353 Test Collection link
  2. TREC Repository (Text) link
  3. TechTC Repository (Text) link
  4. Information Extraction Repository link
  5. Reuters-21578 (Reuters News Corpus) link
  6. OntoNotes Project (Various Text Corpora) link
  7. PennBioIE (Medical Text Corpus) link
  8. Enron (Email Corpus) link
  9. Andrew McCallum Dataset Collection link
  10. Google Books Ngram Datasets link