A method of training a captioning model used to perform automatic video captioning of an input video includes: initializing, by at least one processor, a plurality of long short-term memory (LSTM) units included in the captioning model using a cross-entropy loss; training, by the at least one processor, the LSTM units using reinforcement learning; training, by the at least one processor, the LSTM units and a plurality of convolutional neural networks (CNNs) included in the captioning model using multitask training; and generating, by the at least one processor, a video caption corresponding to the input video using the captioning model.
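
The staged training order described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the abstract does not name a specific reinforcement learning algorithm or the tasks combined during multitask training, so the sketch assumes a REINFORCE-style surrogate loss with a baseline and a weighted sum of the captioning loss with one auxiliary loss. The names `Stage`, `SCHEDULE`, `cross_entropy_loss`, `reinforce_loss`, `multitask_loss`, and `aux_weight` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str         # hypothetical label for the training stage
    objective: str    # which loss drives the parameter updates
    trainable: tuple  # parameter groups that receive gradients


# Order follows the abstract: cross-entropy initialization of the LSTM units,
# then reinforcement learning on the LSTM units, then multitask training of
# the LSTM units together with the CNNs.
SCHEDULE = (
    Stage("initialize", "cross_entropy", ("lstm",)),
    Stage("reinforce", "reinforcement", ("lstm",)),
    Stage("multitask", "multitask", ("lstm", "cnn")),
)


def cross_entropy_loss(step_probs, target_ids):
    """Mean negative log-likelihood of the reference caption tokens.

    step_probs[i] is the predicted distribution over the vocabulary at step i;
    target_ids[i] is the index of the reference token at step i.
    """
    nll = [-math.log(probs[t]) for probs, t in zip(step_probs, target_ids)]
    return sum(nll) / len(nll)


def reinforce_loss(sampled_log_probs, reward, baseline):
    """Assumed REINFORCE surrogate: -(reward - baseline) * sum of log-probs.

    sampled_log_probs holds the log-probability of each token in a caption
    sampled from the model; reward is a scalar score for that caption.
    """
    return -(reward - baseline) * sum(sampled_log_probs)


def multitask_loss(caption_loss, auxiliary_loss, aux_weight=0.5):
    """Assumed multitask form: captioning loss plus a weighted auxiliary loss."""
    return caption_loss + aux_weight * auxiliary_loss


if __name__ == "__main__":
    for stage in SCHEDULE:
        print(f"{stage.name}: loss={stage.objective}, updates={stage.trainable}")
```

In this sketch, only the final stage lists `cnn` among the trainable groups, which mirrors the abstract's statement that the CNNs are trained jointly with the LSTM units only during multitask training. In a real system each stage would wrap an optimizer loop over video and caption batches; that loop is omitted here because the abstract does not describe it.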