Patent attributes
Embodiments of the present invention are generally directed to generating figure captions for electronic figures, generating a training dataset to train a set of neural networks for generating figure captions, and training a set of neural networks employable to generate figure captions. A set of neural networks is trained with a training dataset having electronic figures and corresponding captions. Sequence-level training with reinforcement learning techniques is employed to train the set of neural networks, which is configured as an encoder-decoder with attention. Provided with an electronic figure, the set of neural networks can encode the electronic figure based on various aspects detected from the electronic figure, resulting in the generation of associated label map(s), feature map(s), and relation map(s). The trained set of neural networks employs a set of attention mechanisms that facilitate the generation of accurate and meaningful figure captions corresponding to visible aspects of the electronic figure.
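As an illustration only, the two ideas named above (an attention mechanism over encoded maps, and sequence-level training driven by a reward) can be sketched in a few lines of plain Python. The abstract does not specify the form of attention or the reinforcement learning algorithm, so the dot-product attention and the REINFORCE-style loss with a baseline shown here are assumptions, and the names `attend` and `sequence_level_loss` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, feature_vectors):
    """Dot-product attention (an assumed form, not taken from the source).

    Each encoded region of the figure (one vector per region, e.g. a cell
    of a feature map) is scored against the current decoder state. The
    scores are normalized into weights, and the context vector is the
    weighted sum of the region vectors.
    """
    scores = [
        sum(q * k for q, k in zip(decoder_state, region))
        for region in feature_vectors
    ]
    weights = softmax(scores)
    dim = len(feature_vectors[0])
    context = [
        sum(w * region[i] for w, region in zip(weights, feature_vectors))
        for i in range(dim)
    ]
    return weights, context


def sequence_level_loss(token_log_probs, sampled_reward, baseline_reward):
    """REINFORCE-style sequence-level loss with a baseline (assumed).

    The reward is computed once for the whole sampled caption, for example
    from a caption-quality metric, rather than per token. Captions that
    score above the baseline have their log-probability pushed up when
    this loss is minimized.
    """
    advantage = sampled_reward - baseline_reward
    return -advantage * sum(token_log_probs)
```

For example, calling `attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` places more weight on the first region, because it aligns with the decoder state. In a full system the weights would be produced by learned projections, and separate attention mechanisms could operate over the label, feature, and relation maps.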