Data labeling software

Data labeling is the process of identifying raw data and adding informative labels to provide context so that a machine learning model can learn from it.

Data labeling, also referred to as data annotation, is required for a variety of use cases including computer vision, natural language processing, and speech recognition. The goal of data labeling is to provide data that is "marked up, or annotated, to show the target, which is the answer you want your machine learning model to predict." For example, in computer vision for autonomous vehicles, labeled data might include tagged street signs, pedestrians, or other vehicles. While unsupervised machine learning models (e.g., anomaly detection models) do not rely on annotated data, supervised or semi-supervised "human in the loop (HITL)" models are used in a variety of commercial applications, ranging from autonomous vehicles to facial recognition.
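
As a rough illustration of what "labeled data" can look like for the autonomous-vehicle example above, the following Python sketch pairs one raw image with bounding-box annotations. The class names, field names, file path, and coordinates are all hypothetical and are not taken from any particular labeling tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundingBox:
    label: str   # the target class the model should learn to predict
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


@dataclass
class LabeledImage:
    image_path: str  # the raw data (hypothetical path)
    boxes: List[BoundingBox] = field(default_factory=list)

    def labels(self) -> List[str]:
        """Return the sorted set of classes annotated in this image."""
        return sorted({box.label for box in self.boxes})


# One annotated camera frame: raw pixels plus human-provided labels.
frame = LabeledImage(
    image_path="frames/000123.jpg",
    boxes=[
        BoundingBox("street_sign", 412, 80, 36, 36),
        BoundingBox("pedestrian", 150, 210, 40, 110),
        BoundingBox("vehicle", 300, 240, 180, 95),
        BoundingBox("vehicle", 20, 250, 120, 80),
    ],
)

print(frame.labels())  # ['pedestrian', 'street_sign', 'vehicle']
```

A supervised model would be trained on many such records, with the image as input and the boxes as the target.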

The process of data labeling can be completed manually, which is time-intensive, or it can be automated to various degrees using software. Data labeling as a service arises out of the need for companies in a variety of industries to develop large sets of training data for artificial intelligence or machine learning models. In 2019, The Economist referred to “tagged” or labeled data as the “feedstock” for machine learning algorithms. Data labeling can take up to 25% of the total time required to complete a machine learning project.

Over time, data labeling services have had to raise their standards to avoid "garbage in, garbage out" data. According to Testin Data service General Manager Henry Jia, "In 2015 and 2016, AI companies could build a fine AI prototype solution based on open-sourced datasets or some publicly available data on the Internet to get funding. But if they really want to implement algorithms in real-world scenarios, they have to push the envelope of data quality."

Market size and applications

According to Cognilytica, an industry research company for machine learning and other cognitive technologies, the market for data labeling was $1.5 billion in 2019 and is expected to grow to $3.5 billion by the end of 2024. This growth is expected to come from domain-specific data labeling tasks.
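
To put those two figures in perspective, the short Python sketch below computes the compound annual growth rate they imply. It assumes a five-year span (end of 2019 to end of 2024) and smooth compounding; both are simplifying assumptions made here, not figures reported by Cognilytica.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value and an end value."""
    return (end_value / start_value) ** (1 / years) - 1


# $1.5 billion (2019) to $3.5 billion (end of 2024), assumed to span 5 years.
rate = implied_cagr(start_value=1.5, end_value=3.5, years=5)
print(f"Implied annual growth: {rate:.1%}")  # Implied annual growth: 18.5%
```

Under these assumptions, the projection corresponds to growth of roughly 18% per year.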

Data labeling for autonomous vehicles

Uber acquired Mighty AI on June 25, 2019, in an effort to improve its self-driving algorithms. Scale.AI's customers include self-driving and general transport companies such as Waymo, Lyft, Zoox, Cruise, and the Toyota Research Institute. Waymo, Argo AI, and Lyft have also open-sourced their self-driving datasets. A "high-quality" vehicle dataset includes:

  • Pixel-wise semantic annotation
  • 3D semantic annotation
  • Pixel-wise object instance annotation
  • Fine-grained road segmentation
  • Moving object trajectory
  • High-precision GPS/IMU information, etc.
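
The difference between pixel-wise semantic annotation and pixel-wise object instance annotation in the list above can be sketched with a toy example. The 4x6 "image", the class ids, and the helper functions below are hypothetical; real datasets store far larger masks, usually as image files or arrays.

```python
from collections import Counter

# Hypothetical class ids for a toy 4x6 image.
CLASSES = {0: "road", 1: "vehicle", 2: "pedestrian"}

# Semantic mask: every pixel carries a class id.
semantic_mask = [
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 2],
    [0, 0, 0, 0, 0, 2],
]

# Instance mask: 0 means "no object"; positive ids separate individual objects,
# so the two vehicles get different ids even though they share a class.
instance_mask = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 0, 3],
]


def pixel_counts(semantic):
    """Count how many pixels belong to each class."""
    counts = Counter(class_id for row in semantic for class_id in row)
    return {CLASSES[class_id]: n for class_id, n in counts.items()}


def instances_per_class(semantic, instance):
    """Count distinct object instances for each class."""
    seen = set()
    for sem_row, inst_row in zip(semantic, instance):
        for class_id, instance_id in zip(sem_row, inst_row):
            if instance_id != 0:
                seen.add((class_id, instance_id))
    return dict(Counter(CLASSES[class_id] for class_id, _ in seen))


print(pixel_counts(semantic_mask))
print(instances_per_class(semantic_mask, instance_mask))
```

The semantic mask alone says that 8 pixels are "vehicle"; only the instance mask reveals that those pixels belong to two separate vehicles.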

Data labeling for facial recognition

On January 29, 2019, IBM announced the release of a dataset with millions of possible faces representative of the real world. IBM pulled the images in partnership with Flickr.

Data labeling for sentiment analysis

Data labeling for natural language processing (NLP) is often used for sentiment analysis, with end uses such as customer service or marketing.

Other applications

Related industries

Synthetic data

More recently, the use of synthetic data has supplemented the data labeling process. Synthetic data is “generated through computer programs, instead of being composed through the documentation of real-world events”.

Synthetically generated datasets can also be used to train machine learning models, particularly in computer vision. Synthetic data may augment real datasets to cover areas of the data distribution that are not sufficiently represented, which can help alleviate dataset bias. Synthetic data may also be useful when real data is impossible or prohibitively difficult to acquire due to privacy or legal issues. Waymo has used synthetic data, in the form of driving simulations, to train its self-driving systems. Facebook was reported to use synthetic data to train algorithms to detect bullying language.
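
The idea of using synthetic data to fill under-represented parts of a dataset can be sketched in a few lines of Python. This is a toy illustration only: it creates synthetic points by adding small random noise to existing examples, which is far simpler than the simulation-based approaches mentioned above. The function name, feature values, and class names are hypothetical.

```python
import random
from collections import Counter


def augment_with_synthetic(samples, target_per_class, noise=0.05, seed=0):
    """Top up under-represented classes with jittered copies of real examples.

    samples: list of (features, label) pairs, where features is a tuple of floats.
    Returns a list of (features, label, is_synthetic) triples.
    """
    rng = random.Random(seed)
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)

    augmented = [(features, label, False) for features, label in samples]
    for label, rows in by_label.items():
        for _ in range(max(0, target_per_class - len(rows))):
            base = rng.choice(rows)
            synthetic = tuple(x + rng.gauss(0, noise) for x in base)
            augmented.append((synthetic, label, True))
    return augmented


# Hypothetical real dataset: night-time scenes are under-represented.
real = [((0.9, 0.1), "daytime")] * 8 + [((0.2, 0.8), "night")] * 2
balanced = augment_with_synthetic(real, target_per_class=8)

print(Counter(label for _, label, _ in balanced))
```

Keeping an `is_synthetic` flag on each record makes it possible to evaluate a model on real data only, which is one way to check whether the synthetic examples actually helped.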

Model operations and monitoring

Adjacent to the data labeling market, a market has also emerged that aims to ensure proper oversight of models and reduce bias in large datasets. This is part of the Ethical AI movement, which encourages the proactive embedding of diversity and inclusion principles into the AI lifecycle and aims to ensure transparency of AI systems.



Further reading

  • Best Practices for Managing Data Annotation Projects. Bloomberg Finance. December 18, 2020.
  • Data Annotation: The Billion Dollar Business Behind AI Breakthroughs. August 28, 2019.
  • Data labelling -- overcoming AI projects' biggest obstacle. Tech HQ. October 20, 2020.
  • Data-labelling startups want to help improve corporate AI. The Economist. October 17, 2019.
  • If data is the new oil, these companies are the new Baker Hughes. Jeremy Kahn. February 4, 2020.

Documentaries, videos and podcasts

  • AWS re:Invent 2018: [NEW LAUNCH!] Labeling for Accurate Machine Learning Training Datasets (DEM123). December 13, 2018.
  • What is Data Labeling? Prepare Your Data for ML and AI. April 29, 2020.
  • What is Data Labeling? July 11, 2018.
