ELMo is a deep contextualized word representation that models both characteristics of word use (e.g. syntax and semantics), and how these uses vary across linguistic contexts (i.e. to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a text corpus. They can be easily added to existing models on a range of NLP problems, including question answering, textual entailment and sentiment analysis.
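Concretely, the paper forms each token's representation as a task-weighted sum of the biLM's layer states, ELMo_k = gamma * sum_j s_j * h_{k,j}, where the scalar weights s are softmax-normalized and gamma rescales the result. A minimal NumPy sketch of that combination step is below; the function name and the toy weight values are illustrative, since in the real model s and gamma are learned jointly with the downstream task.

```python
import numpy as np

def elmo_combine(layer_states, scalar_weights, gamma=1.0):
    """Collapse L biLM layers into one vector per token.

    layer_states: array of shape (L, seq_len, dim) -- hidden states of
        each biLM layer for one sentence.
    scalar_weights: array of shape (L,) -- unnormalized layer scores
        (learned per task in the real model; placeholders here).
    gamma: scalar that lets the task rescale the whole vector.
    """
    # Softmax-normalize the layer scores so they sum to 1.
    s = np.exp(scalar_weights - scalar_weights.max())
    s = s / s.sum()
    # Weighted sum over the layer axis -> shape (seq_len, dim).
    return gamma * np.tensordot(s, layer_states, axes=(0, 0))

# Toy example: 3 biLM layers, 4 tokens, 5-dimensional states.
rng = np.random.default_rng(0)
states = rng.standard_normal((3, 4, 5))
rep = elmo_combine(states, np.array([0.1, 0.5, 0.4]), gamma=0.8)
print(rep.shape)  # (4, 5)
```

Because the weights are normalized per task, a downstream model can emphasize lower layers (which tend to capture syntax) or higher layers (which tend to capture context-dependent semantics) as needed.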
All models except for the 5.5B model were trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011. The ELMo 5.5B model was trained on a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012 (3.6B). In tasks where a direct comparison was made, the 5.5B model has slightly higher performance than the original ELMo model.
ELMo models have been trained for other languages and domains.
Deep contextualized word representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer