Data labeling, also referred to as data annotation, is required for a variety of use cases, including computer vision, natural language processing, and speech recognition. The goal of data labeling is to provide data that is "marked up, or annotated, to show the target, which is the answer you want your machine learning model to predict." For example, in computer vision for autonomous vehicles, labeled data might include tagged street signs, pedestrians, or other vehicles. While unsupervised machine learning models (e.g., anomaly detection models) do not rely on annotated data, supervised or semi-supervised "human in the loop (HITL)" models are used in a variety of commercial applications, ranging from autonomous vehicles to facial recognition.
Data labeling can be completed as a time-intensive manual process, or it can be automated to various degrees using software. Data labeling as a service arises out of the need for companies in a variety of industries to develop large sets of training data for artificial intelligence or machine learning models. In 2019, The Economist referred to “tagged” or labeled data as the “feedstock” for machine learning algorithms. Data labeling can take up to 25% of the total time required to complete a machine learning project.
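One common way to automate labeling "to various degrees" is to let a model propose labels and send only the uncertain ones to human annotators. The sketch below illustrates that idea under stated assumptions: the `PreLabel` record, the `route_prelabels` function, the item IDs, and the 0.9 confidence threshold are all hypothetical and are not taken from any particular labeling product.

```python
from dataclasses import dataclass


@dataclass
class PreLabel:
    """A machine-proposed label for one data item (hypothetical schema)."""
    item_id: str
    label: str
    confidence: float  # model score in [0, 1]


def route_prelabels(prelabels, threshold=0.9):
    """Split proposed labels into an auto-accepted list and a human-review list."""
    accepted, needs_review = [], []
    for p in prelabels:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            needs_review.append(p)
    return accepted, needs_review


batch = [
    PreLabel("img_001", "stop_sign", 0.97),
    PreLabel("img_002", "pedestrian", 0.62),
    PreLabel("img_003", "vehicle", 0.91),
]
accepted, needs_review = route_prelabels(batch)
print([p.item_id for p in accepted])      # ['img_001', 'img_003']
print([p.item_id for p in needs_review])  # ['img_002']
```

Raising the threshold shifts more items to human review, trading labeling cost for label quality.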

Over time, data labeling services have had to raise their standards to avoid "garbage in, garbage out" data. According to Testin Data service General Manager Henry Jia, "In 2015 and 2016, AI companies could build a fine AI prototype solution based on open-sourced datasets or some publicly available data on the Internet to get funding. But if they really want to implement algorithms in real-world scenarios, they have to push the envelope of data quality."
According to Cognilytica, an industry research company for machine learning and other cognitive technologies, the market for data labeling was $1.5 billion in 2019 and is expected to grow to $3.5 billion by the end of 2024. This growth is expected to come from domain-specific data labeling tasks.
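As a rough check on what those figures imply, the compound annual growth rate can be computed directly. This is a back-of-the-envelope sketch, not a figure reported by Cognilytica; it assumes a five-year span between the 2019 and 2024 market sizes, and the `cagr` helper is a name introduced here for illustration.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# Assumption: $1.5B (2019) growing to $3.5B (2024) spans five years.
rate = cagr(1.5, 3.5, 5)
print(f"{rate:.1%}")  # 18.5%
```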
Uber acquired Mighty AI on June 25, 2019, in an effort to improve its self-driving algorithms. Scale.AI's customers include many other self-driving and general transport companies, including Waymo, Lyft, Zoox, Cruise, and the Toyota Research Institute. Waymo, Argo AI, and Lyft have also open-sourced their self-driving datasets. A "high-quality" vehicle dataset includes:
- Pixel-wise semantic annotation
- 3D semantic annotation
- Pixel-wise object instance annotation
- Fine-grained road segmentation
- Moving object trajectory
- High-precision GPS/IMU information, etc.
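The annotation layers listed above could be grouped into one record per frame. The following is a minimal sketch of such a record with a completeness check; the `FrameAnnotation` and `ObjectTrack` classes, their field names, and the file paths are hypothetical and do not reflect the schema of any published self-driving dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ObjectTrack:
    """Trajectory of one moving object as (timestamp_s, x_m, y_m) points."""
    track_id: int
    category: str
    trajectory: List[Tuple[float, float, float]] = field(default_factory=list)


@dataclass
class FrameAnnotation:
    """All annotation layers for a single frame (hypothetical schema)."""
    frame_id: str
    semantic_mask: str = ""        # path to pixel-wise semantic annotation
    instance_mask: str = ""        # path to pixel-wise object instance annotation
    road_mask: str = ""            # path to fine-grained road segmentation
    point_cloud_labels: str = ""   # path to 3D semantic annotation
    gps_lat_lon: Optional[Tuple[float, float]] = None  # positioning data
    tracks: List[ObjectTrack] = field(default_factory=list)


def missing_layers(frame):
    """Return the names of annotation layers that have not been filled in."""
    layer_names = ["semantic_mask", "instance_mask", "road_mask", "point_cloud_labels"]
    missing = [name for name in layer_names if not getattr(frame, name)]
    if frame.gps_lat_lon is None:
        missing.append("gps_lat_lon")
    if not frame.tracks:
        missing.append("tracks")
    return missing


frame = FrameAnnotation(
    frame_id="frame_0001",
    semantic_mask="masks/semantic/0001.png",
    instance_mask="masks/instance/0001.png",
    gps_lat_lon=(37.7749, -122.4194),
)
print(missing_layers(frame))  # ['road_mask', 'point_cloud_labels', 'tracks']
```

A check like this can flag partially annotated frames before they are added to a training set.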
On January 29, 2019, IBM announced the release of a dataset with millions of possible faces representative of the real world. IBM pulled the images in partnership with Flickr.
Data labeling for natural language processing (NLP) is often used to perform sentiment analysis, for end uses such as customer service or marketing.
- Precision Agriculture (computer vision application)
- Micromobility (computer vision application)
More recently, the use of synthetic data has supplemented the data labeling process. Synthetic data is “generated through computer programs, instead of being composed through the documentation of real-world events”.
Synthetically generated datasets can also be used to train machine learning models, particularly in computer vision. Synthetic data may augment real datasets by covering areas of the data distribution that are not sufficiently represented, which can alleviate dataset bias. Synthetic data may also be useful when real data is impossible or prohibitively difficult to acquire due to privacy or legal issues. Synthetic data has been used to train Google’s Waymo in the form of driving simulations. Facebook was reported to use synthetic data to train algorithms to detect bullying language.
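To make the idea of covering underrepresented areas concrete, the sketch below counts how many synthetic samples each class would need to match the largest class in a real dataset. It is only an illustration of one simple balancing rule: the `synthetic_quota` function, the class names, and the counts are hypothetical, and real projects may balance along other dimensions than class labels.

```python
from collections import Counter


def synthetic_quota(labels, target_per_class=None):
    """Number of synthetic samples per class needed to reach a target count.

    If no target is given, the size of the largest class is used.
    """
    counts = Counter(labels)
    target = max(counts.values()) if target_per_class is None else target_per_class
    return {cls: max(0, target - n) for cls, n in counts.items()}


# Hypothetical real dataset with an imbalanced class distribution.
real_labels = ["vehicle"] * 500 + ["pedestrian"] * 120 + ["cyclist"] * 30
quota = synthetic_quota(real_labels)
print(quota)  # {'vehicle': 0, 'pedestrian': 380, 'cyclist': 470}
```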
A market has also emerged, adjacent to the data labeling market, that aims to ensure proper oversight over models and reduce bias in large datasets. This is part of the Ethical AI movement, which encourages the proactive embedding of diversity and inclusion principles into the AI lifecycle and aims to ensure transparency of AI systems.
Further reading
- "Best Practices for Managing Data Annotation Projects." Bloomberg Finance. Web. December 18, 2020.
- "Data Annotation: The Billion Dollar Business Behind AI Breakthroughs." Synced. Web. August 28, 2019.
- "Data labelling -- overcoming AI projects' biggest obstacle." Tech HQ. Web. October 20, 2020.
- "Data-labelling startups want to help improve corporate AI." The Economist. Web. October 17, 2019.
- "If data is the new oil, these companies are the new Baker Hughes." Jeremy Kahn. Web. February 4, 2020.
- "Scale AI hits $3.5B valuation as it turns the AI boom into a venture bonanza." Kirsten Korosec. Web. December 1, 2020.
- "The Big Business of Big Data Labeling as a Service." Nanalyze. Web. November 11, 2020.
Documentaries, videos and podcasts
- "AWS re:Invent 2018: [NEW LAUNCH!] Labeling for Accurate Machine Learning Training Datasets (DEM123)." December 13, 2018.
- "What is Data Labeling? Prepare Your Data for ML and AI." April 29, 2020.