16 Best Free Human Annotated Datasets for Machine Learning
Alexandra Quinn | December 15, 2023
Successfully training AI and ML models relies not only on large quantities of data, but also on the quality of their annotations. Data annotation accuracy directly impacts the accuracy of a model and the reliability of its predictions. This is where human-annotated datasets come into play. Human-annotated datasets offer a level of precision, nuance, and contextual understanding that automated methods struggle to match.
In this blog post, we present 16 of the top free human-annotated datasets you can use for model training and evaluation. To cater to a wide variety of needs, these free datasets for machine learning cover a diverse set of categories and use cases:
- Sentiment analysis
- Language and docs
- For ethical AI use
- Images and videos
When training and developing your models, don’t neglect the final phase: deploying the LLM (or other type of model). Use MLOps solutions to keep the process automated, streamlined, scalable and iterative, which supports the successful implementation of your model.
What are Human-annotated Datasets?
Human-annotated datasets are data records that have been annotated by humans. This means that humans have added information, like labels or tags. For example, humans can provide inputs for categorization, sentiment analysis, bounding boxes for images, etc.
Human annotation helps advance ML and AI model training and evaluation. Human annotations provide the ground truth that algorithms learn patterns from, so that models can make better predictions on new, unseen data. The same annotations also serve as the reference against which model predictions are evaluated. As such, human annotation is an important step in building successful AI and ML systems.
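To make the idea of ground truth concrete, here is a minimal sketch of scoring model predictions against human-assigned labels. The `accuracy` helper and the sample labels are made up for illustration and are not taken from any of the datasets below.

```python
def accuracy(predictions, ground_truth):
    """Return the fraction of predictions that match the human-assigned labels."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("need at least one labeled example")
    matches = sum(p == g for p, g in zip(predictions, ground_truth))
    return matches / len(ground_truth)


# Hypothetical sentiment labels, invented for this example.
human_labels = ["positive", "negative", "neutral", "positive"]
model_output = ["positive", "negative", "positive", "positive"]

score = accuracy(model_output, human_labels)
print(f"accuracy: {score:.2f}")
```

In practice you would likely use per-class metrics as well, since a single accuracy number can hide poor performance on rare labels.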
Now let’s dive into the datasets themselves:
Category #1: Sentiment Analysis
1. HumAID (Human-Annotated Disaster Incidents Data)
A dataset for crisis informatics containing ~77K human-labeled tweets across 19 disaster events that happened between 2016 and 2019. Categories include: caution and advice; displaced people and evacuations; don't know, can't judge; infrastructure and utility damage; injured or dead people; missing or found people; not humanitarian; other relevant information; requests or urgent needs; rescue, volunteering or donation effort; and sympathy and support.
2. GoEmotions
58,000 Reddit comments labeled with 27 emotion categories (12 positive, 11 negative and 4 ambiguous), plus “neutral”.
Category #2: Language & Docs
3. HANNA (Human-ANnotated NArratives for ASG evaluation)
1,056 stories generated from 96 prompts taken from the WritingPrompts dataset, intended for evaluating automatic story generation (ASG). Three raters annotated each story on six criteria: relevance, coherence, empathy, surprise, engagement and complexity.
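When each item is scored by several raters, a common first step is to average the scores per criterion. The sketch below assumes a hypothetical record layout (one dictionary per rater) and uses made-up scores; the actual HANNA file format may differ.

```python
from statistics import mean

# The six criteria named in the dataset description.
CRITERIA = ["relevance", "coherence", "empathy", "surprise", "engagement", "complexity"]


def average_ratings(ratings):
    """Average each criterion across raters.

    ratings: list of dicts, one per rater, mapping criterion name to a score.
    """
    return {criterion: mean(r[criterion] for r in ratings) for criterion in CRITERIA}


# Hypothetical scores from three raters for one story (values are invented).
story_ratings = [
    {"relevance": 4, "coherence": 3, "empathy": 2, "surprise": 1, "engagement": 3, "complexity": 2},
    {"relevance": 5, "coherence": 3, "empathy": 3, "surprise": 2, "engagement": 3, "complexity": 2},
    {"relevance": 3, "coherence": 3, "empathy": 1, "surprise": 3, "engagement": 3, "complexity": 2},
]

averaged = average_ratings(story_ratings)
print(averaged)
```

Averaging is only one option; depending on the use case, you might instead keep the individual ratings to measure how much the raters agree.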
4. DocLayNet: A Human-Annotated Dataset for Document-Layout Analysis
A dataset for document layout analysis in COCO format, which can be used for PDF conversions. The dataset contains 80,863 manually annotated pages from a large number and variety of data sources. Layout elements on each page are labeled with one of 11 distinct classes.
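Because the annotations follow the COCO format, they can be inspected with a few lines of standard-library code. The sketch below counts bounding boxes per class in a small in-memory sample. The top-level keys (`images`, `categories`, `annotations`) and the `[x, y, width, height]` box layout follow the general COCO convention; the file name, IDs and class names in the sample are illustrative assumptions, not values copied from DocLayNet.

```python
from collections import Counter


def count_boxes_per_class(coco):
    """Count annotations per category name in a COCO-style dictionary."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])


# Tiny hand-written sample in COCO layout (all values are made up).
sample = {
    "images": [{"id": 1, "file_name": "page_0001.png", "width": 1000, "height": 1000}],
    "categories": [
        {"id": 1, "name": "Title"},
        {"id": 2, "name": "Text"},
        {"id": 3, "name": "Table"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 40, 400, 30]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [50, 90, 900, 200]},
        {"id": 12, "image_id": 1, "category_id": 2, "bbox": [50, 310, 900, 150]},
        {"id": 13, "image_id": 1, "category_id": 3, "bbox": [50, 480, 900, 300]},
    ],
}

counts = count_boxes_per_class(sample)
print(counts.most_common())
```

For a real annotation file, you would load the dictionary with `json.load` and pass it to the same function.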
Category #3: For Ethical AI Use
5. RLHF Dataset to Reduce Harm
Human preference data about helpfulness and harmlessness, including dialogues that contain harmful content. It can be used to train preference models for subsequent RLHF training or to understand how crowdworkers red team models.
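As a rough illustration of what training a preference model involves, the sketch below computes one common pairwise loss: the negative log-sigmoid of the score difference between a preferred response and a less preferred one. This formulation is an assumption chosen for illustration and is not stated to be the method used with this dataset; the function name and the scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred, score_other):
    """Return -log(sigmoid(score_preferred - score_other)).

    The loss is small when the model scores the human-preferred response
    higher than the other response, and large when it scores it lower.
    """
    margin = score_preferred - score_other
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scalar scores produced by a preference model.
agrees_with_humans = pairwise_preference_loss(2.0, 0.0)
disagrees_with_humans = pairwise_preference_loss(0.0, 2.0)
print(agrees_with_humans, disagrees_with_humans)
```

Minimizing this loss over many human-labeled pairs pushes the model to score preferred responses higher than the alternatives.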
Category #4: Images and Videos
6. Humans in 3D
A dataset of annotated people, including annotations of joints (eyes, ears, nose, shoulders, elbows, wrists, hips, knees, ankles), 3D pose, a visibility boolean, upper clothes, lower clothes, dress, socks, shoes, hands, gloves, neck, face, hair, hat, sunglasses, bag, occluder, and body type (male, female or child).
7. Scanned Images and OCR Texts from Medieval Documents
Scanned images and OCR texts of reprints of documents from the Hussite era. Annotations include layout analysis, OCR evaluation and language identification.
8. GAN Images
A dataset containing 600 fake images and 400 real images that were evaluated based on eight attributes. The real images are from the ImageNet dataset and the fake images were generated by a generative adversarial network (GAN).
9. Semantic Segmentation of Radishes
Annotated images of radishes collected during the spring of 2017.
10. ERA (Event Recognition in Aerial videos) Dataset
2,864 aerial videos with labels from 25 classes. The dataset can be used to help capture dynamic events at scale.
11. ExoNet: Wearable Camera Images of Human Locomotion Environments
Approximately 923,000 human-annotated images, drawn from 5.6 million RGB images of indoor and outdoor real-world walking environments. The dataset can be used to help develop robotic leg prostheses and exoskeletons, humanoids, autonomous legged robots, powered wheelchairs, and other mobility assistive technologies.
12. YouTube8M-MusicTextClips
4,000 high-quality human text descriptions of audio from YouTube8M video clips. Each file contains the video_id, start time, end time and text description. The dataset is intended mainly for evaluation.
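Records with these four fields are straightforward to load and validate. The sketch below parses a small inline CSV sample and derives each clip's duration. The column names (`video_id`, `start`, `end`, `text`), the assumption that times are in seconds, and the sample rows are hypothetical, so check them against the actual files before reusing this.

```python
import csv
import io

# Hypothetical sample in the assumed column layout (rows are invented).
SAMPLE_CSV = """video_id,start,end,text
abc123,30,40,Slow piano melody with soft strings
xyz789,120,130,Upbeat drums and electric guitar
"""


def load_clips(csv_text):
    """Parse clip descriptions and add a duration field to each record."""
    clips = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        start, end = float(row["start"]), float(row["end"])
        if end <= start:
            raise ValueError(f"clip {row['video_id']} ends before it starts")
        clips.append(
            {
                "video_id": row["video_id"],
                "start": start,
                "end": end,
                "duration": end - start,
                "text": row["text"],
            }
        )
    return clips


clips = load_clips(SAMPLE_CSV)
for clip in clips:
    print(clip["video_id"], clip["duration"], clip["text"])
```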
13. Video Sub-shot Segmentation Evaluation
A dataset that annotates sub-shot segmentations for 33 single-shot videos. Sub-shots were divided according to video activity type. Each entry includes the video ID, filename, YouTube video name, URL and video frame rate.
14. Fruit Annotations
Bounding box annotations of 11 classes: apple, avocado, capsicum, mango, orange, cantaloupe, strawberry, blueberry, cherry, kiwi and wheat. (Wheat seems to be included despite not being a fruit.)
15. HAM (Human-annotated Mappings)
A dataset for molecular graph partitioning. The dataset contains mappings of 1,206 organic molecules with fewer than 25 heavy atoms.
16. Relative Human
RGB images of multiple people in the wild, with rich human annotations. Annotations include depth layers (the relative depth relationship/ordering between all people in the image), age group classification (adults, teenagers, kids, babies), gender, bounding box and 2D pose.
Deploying Trained Models
MLOps pipelines enable and enhance the deployment of trained models by automating and streamlining the process, while eliminating technical and operational silos. This includes preparing the model for deployment, optimizing for performance and compatibility, real-time data processing, serving models through scalable serverless functions, CI/CD, and monitoring and management tools. This comprehensive approach helps ensure models are not only deployed efficiently but also remain secure, compliant, and high-performing.