28 Best Free NLP Datasets for Machine Learning
Alexandra Quinn | September 29, 2023
NLP is a field of AI that enables machines to understand, interpret, and generate human language in a way that is both meaningful and contextually relevant. Recently, ChatGPT and similar applications have created a surge in consumer and business interest in NLP. Now, many organizations are trying to incorporate NLP into their offerings.
To help with these efforts, we’ve compiled a list of top NLP datasets that data scientists and other data professionals can use to train their models. Treat it as a starting point for your own NLP projects.
The list is divided into a number of groups and types:
- Q&A
- Reviews and Ratings
- Sentiment Analysis
- Synonyms
- Emails
- Long-Form Content
- Audio
- Words & Definitions
- Dialogue and Engagement
You can use these datasets for a number of use cases, like creating personal assistants, automating customer service, language translation, and more. The sky's the limit!
When planning how to train your NLP models with these datasets, it's important to start with the end in mind: deployment. To learn more about how to build and scale your NLP pipelines, click here.
Now let’s dive into the list:
Q&A
1. Stanford Question Answering Dataset (SQuAD)
A reading comprehension dataset, comprising pairs of questions and answers based on Wikipedia articles.
Get the dataset here.
2. Jeopardy Questions
A JSON file with 216,930 Jeopardy questions, answers and additional data like the air date.
Get the dataset here.
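As a rough illustration of working with a JSON file like this one, the sketch below parses a small inline sample and filters it by air date. The field names (`question`, `answer`, `air_date`), the date format and the sample records are assumptions made for the example, not a confirmed schema, so check the downloaded file before relying on them.

```python
import json
from datetime import date

# Hypothetical sample: field names and records are illustrative assumptions,
# not copied from the real dataset.
SAMPLE_JSON = """
[
  {"question": "This planet is known as the Red Planet", "answer": "Mars", "air_date": "2004-12-31"},
  {"question": "The chemical symbol for gold", "answer": "Au", "air_date": "2010-07-06"}
]
"""


def load_questions(raw_json):
    """Parse a JSON array of question records and convert air_date to a date."""
    records = json.loads(raw_json)
    for record in records:
        record["air_date"] = date.fromisoformat(record["air_date"])
    return records


def aired_after(records, cutoff):
    """Return the records that aired strictly after the cutoff date."""
    return [record for record in records if record["air_date"] > cutoff]


questions = load_questions(SAMPLE_JSON)
recent = aired_after(questions, date(2008, 1, 1))
print([record["answer"] for record in recent])
```

To use the real file, read its contents and pass them to `load_questions` in place of the inline sample, adjusting the field names as needed.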
3. The WikiQA Corpus
Question and answer pairs that link to Wikipedia pages with the answer. The dataset comprises 3,047 questions and 29,258 sentences, of which 1,473 were labeled as answer sentences.
Get the dataset here.
4. Amazon Question/Answer Data
1.4 million answered questions based on Amazon product reviews.
Get the dataset here.
5. Elementary Science Questions
Explanation graphs for elementary science exam questions in the US.
Reviews and Ratings
6. Yelp Open Dataset
Yelp, the popular review site for businesses, published a subset of its reviews, user data and businesses as JSON files. The dataset includes nearly 7 million reviews, more than 150,000 businesses, more than 900,000 tips by nearly 2 million users, more than 1.2 million business attributes (including hours, parking, availability and ambience), and aggregated check-ins for more than 130,000 businesses.
Get the dataset here.
7. Amazon Reviews
Approximately 82 million Amazon ratings and metadata from nearly 21 million users from a period of over 18 years.
Get the dataset here.
8. Amazon Fine Food Reviews
Approximately 500,000 reviews from more than 250,000 users and nearly 75,000 products spanning 13 years.
Get the dataset here.
Sentiment Analysis
9. Movie Review Dataset
A dataset enabling binary sentiment classification. The dataset comprises 25,000 highly polar movie reviews for training, 25,000 for testing and additional unlabeled data.
Get the dataset here.
10. Movie and Finance Review Dictionaries
This dataset contains dictionaries for sentiment analysis with words that have a positive or negative polarity. The dataset is extracted from IMDb movie reviews and US regulated Form 8-K filings. The dictionaries are available in CSV format.
Get the dataset here.
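Polarity dictionaries of this kind are typically used for lexicon-based scoring: look up each word of a text and add up its polarity. The sketch below shows the idea under loud assumptions. The two-column layout (`word`, `score`) and the sample entries are hypothetical and are not the actual format of these dictionaries.

```python
import csv
import io
import re

# Hypothetical dictionary layout: a "word" column and a signed "score" column.
# The real CSV files may use different column names and scales.
SAMPLE_DICTIONARY_CSV = """word,score
excellent,1.0
enjoyable,0.5
dull,-0.5
terrible,-1.0
"""


def load_polarity_dictionary(csv_text):
    """Build a word -> polarity score mapping from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["word"].lower(): float(row["score"]) for row in reader}


def score_text(text, dictionary):
    """Sum the polarity of every known word; unknown words contribute 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(dictionary.get(token, 0.0) for token in tokens)


polarity = load_polarity_dictionary(SAMPLE_DICTIONARY_CSV)
print(score_text("An excellent cast wasted on a dull script", polarity))
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment. This simple sum ignores negation and context, which is the usual trade-off of dictionary-based approaches.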
11. Sentiment140 (Twitter-based)
A dataset of positive and negative Tweets that were collected automatically, provided in CSV format. Each row comprises the Tweet’s polarity, ID, date, query, user and text.
Get the dataset here.
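For a sense of how the fields (polarity, ID, date, query, user, text) map onto code, here is a minimal sketch that reads rows in that column order and turns the polarity into a readable label. It assumes the file has no header row and that polarity is encoded as 0 for negative and 4 for positive. The sample rows are invented for the example.

```python
import csv
import io

# Column order as described in the dataset summary; the file is assumed
# to have no header row.
FIELDS = ["polarity", "id", "date", "query", "user", "text"]

# Assumed label encoding: 0 = negative, 4 = positive.
POLARITY_LABELS = {"0": "negative", "4": "positive"}

# Invented sample rows, for illustration only.
SAMPLE_CSV = (
    '"0","1001","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a",'
    '"my flight got cancelled again"\n'
    '"4","1002","Mon Apr 06 22:20:10 PDT 2009","NO_QUERY","user_b",'
    '"loving the new album, great work"\n'
)


def read_tweets(file_obj):
    """Yield a dict with a readable label and the Tweet text for each row."""
    reader = csv.DictReader(file_obj, fieldnames=FIELDS)
    for row in reader:
        yield {
            "label": POLARITY_LABELS.get(row["polarity"], "unknown"),
            "text": row["text"],
        }


tweets = list(read_tweets(io.StringIO(SAMPLE_CSV)))
for tweet in tweets:
    print(tweet["label"], "|", tweet["text"])
```

For the real file, pass an opened file handle instead of the `io.StringIO` sample. You may need to specify an encoding when opening it.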
12. Twitter US Airline Sentiment
Polarized Tweets from February 2015 about the large US airlines. Data is provided in a CSV file and SQLite database.
Get the dataset here.
Synonyms
13. WordNet
A lexical database of English nouns, verbs, adjectives and adverbs, grouped into sets of synonyms (synsets) that each express a distinct concept.
Get the dataset here.
Emails
14. Enron Email Dataset
A side benefit of the notorious Enron scandal: A comprehensive email dataset, containing more than 600,000 messages from 150 users, mostly senior corporate management.
Get the dataset here.
Long-Form Content
15. 20 Newsgroups
A dataset containing roughly 20,000 newsgroup documents spanning a variety of topics, for text classification, text clustering and similar ML applications. The newsgroups are: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, misc.forsale, talk.politics.misc, talk.politics.guns, talk.politics.mideast, talk.religion.misc, alt.atheism and soc.religion.christian. The data is available in .tar.gz bundles.
Get the dataset here.
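Because the data ships as .tar.gz bundles, one way to turn a bundle into labelled examples is to read it directly with Python's `tarfile` module. The sketch below assumes each document is stored under a path of the form `<bundle>/<newsgroup>/<document id>`, so the parent folder name can serve as the label. That layout and the Latin-1 decoding are assumptions to verify against the bundle you download. The helper `build_sample_bundle` exists only to make the example self-contained.

```python
import io
import tarfile


def iter_newsgroup_documents(fileobj):
    """Yield (newsgroup, text) pairs from a .tar.gz bundle.

    Assumes each document lives at <bundle>/<newsgroup>/<document id>.
    """
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            parts = member.name.split("/")
            if len(parts) < 3:
                continue  # not in the assumed layout
            newsgroup = parts[-2]
            text = archive.extractfile(member).read().decode("latin-1")
            yield newsgroup, text


def build_sample_bundle():
    """Create a tiny in-memory bundle with the assumed layout (demo only)."""
    documents = {
        "sample-bundle/sci.space/1": "Subject: orbit\n\nThe launch window opens tomorrow.",
        "sample-bundle/rec.autos/2": "Subject: brakes\n\nReplace the pads every few years.",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, body in documents.items():
            data = body.encode("latin-1")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


labelled = list(iter_newsgroup_documents(build_sample_bundle()))
print([label for label, _ in labelled])
```

With a downloaded bundle, open the file in binary mode and pass the handle to `iter_newsgroup_documents`. The resulting pairs can feed directly into a text classification or clustering workflow.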
16. arXiv Papers
A dataset containing arXiv research papers and metadata, in machine-readable format.
Get the dataset here.
17. The Blog Authorship Corpus
A dataset comprising more than 680,000 blog posts (over 140 million words) from more than 19,000 bloggers, gathered in August 2004.
Get the dataset here.
18. Legal Case Reports
A dataset with 4,000 legal cases that can be used for automatic summarization and citation analysis.
Get the dataset here.
19. One Week of Global News Feeds
3.3 million articles from 20,000 news sources across a seven-day period in 2017 and 2018. Each record includes the fields publish_time, feed_code, source_url and headline_text.
Get the dataset here.
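A quick first step with data shaped like this is to count headlines per feed. The sketch below assumes the data is a CSV file with a header row naming the four fields publish_time, feed_code, source_url and headline_text. The sample rows, including the timestamp format and feed codes, are invented for illustration.

```python
import csv
import io
from collections import Counter

# Invented sample rows; only the column names come from the dataset description.
SAMPLE_CSV = """publish_time,feed_code,source_url,headline_text
201709030400,feed-a,http://example.com/a/1,Storm warning issued for coastal towns
201709030405,feed-b,http://example.org/b/7,Central bank holds interest rates
201709030410,feed-a,http://example.com/a/2,Local team wins regional final
"""


def count_headlines_per_feed(file_obj):
    """Count how many headlines each feed_code contributed."""
    reader = csv.DictReader(file_obj)
    return Counter(row["feed_code"] for row in reader)


headlines_per_feed = count_headlines_per_feed(io.StringIO(SAMPLE_CSV))
print(headlines_per_feed.most_common())
```

The same pattern extends to grouping by source_url or bucketing by publish_time once you have confirmed how the timestamps are formatted in the real file.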
20. Federal Contracts
A dataset containing all federal contracts from the Federal Procurement Data Center found at USASpending.gov.
Get the dataset here.
21. Common Crawl
Web crawl data from more than 50 billion webpages.
Audio
22. LibriSpeech
A dataset comprising approximately 1,000 hours of English speech taken from audiobooks from the LibriVox project.
Get the dataset here.
23. Noisy Speech Database
A clean and noisy parallel speech database that can help train and test speech enhancement methods.
Get the dataset here.
24. TIMIT
An acoustic-phonetic speech corpus designed to help develop and automate speech recognition systems. The dataset contains 630 speakers from eight major dialects of American English.
Get the dataset here.
25. Free Spoken Digit Dataset
Recordings of spoken digits, provided as WAV files. The recordings have minimal silence at the beginning and the end.
Get the dataset here.
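WAV recordings like these can be inspected with Python's standard `wave` module before any heavier audio tooling is involved. The sketch below measures a clip's duration and extracts a label from its file name. The naming pattern `<digit>_<speaker>_<index>.wav` and the 8 kHz sample rate used for the generated clip are assumptions for the example. The silent clip is synthesized in memory so the code runs without the dataset.

```python
import io
import wave


def wav_duration_seconds(file_obj):
    """Return the duration of a WAV file in seconds."""
    with wave.open(file_obj, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def parse_digit_label(filename):
    """Split a name like '7_speaker_12.wav' into (digit, speaker, index).

    Assumes the hypothetical pattern <digit>_<speaker>_<index>.wav.
    """
    digit, speaker, index = filename.rsplit(".", 1)[0].split("_")
    return int(digit), speaker, int(index)


def build_silent_wav(num_frames, sample_rate=8000):
    """Create an in-memory mono 16-bit WAV of silence (demo only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_frames)
    buffer.seek(0)
    return buffer


duration = wav_duration_seconds(build_silent_wav(num_frames=4000))
label = parse_digit_label("7_speaker_12.wav")
print(duration, label)
```

Checking durations across the whole set is a cheap way to spot truncated or unusually long clips before training.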
Words & Definitions
26. Urban Dictionary Words And Definitions
2.5 million phrases from Urban Dictionary. The information includes definitions and votes.
Dialogue and Engagement
27. Movie Dialog
220,579 fictional conversations between approximately 10,000 characters extracted from 617 raw movie scripts. The metadata includes movie genre, release year, IMDb rating, number of IMDb votes, character gender and character position in the movie credits.
28. Jokes
208,000 jokes in English from three different sources.
Scaling NLP Pipelines
Building sophisticated NLP pipelines that operate at scale is essential for turning massive amounts of unstructured data into searchable, indexable data. Such a pipeline ingests, prepares, classifies and indexes structured and unstructured data, handles terabytes of data, seamlessly deploys models, and leverages CI/CD. To see how the data science team at S&P Global (IHS Markit) built their NLP pipeline, click here.