In an approach, a processor extracts an audio signal from a video clip. A processor converts the audio signal into a text sequence. A processor selects a first set of keywords from the text sequence, the first set of keywords corresponding to a first audio segment of the audio signal. A processor tags a target video segment of the video clip with the first set of keywords, the target video segment corresponding to the first audio segment.
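
The selection and tagging steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: it assumes the audio signal has already been extracted and converted into a text sequence whose words carry timestamps, and it swaps in a simple frequency count with a small stop-word list as the keyword selection rule. The names `TimedWord`, `VideoSegment`, `select_keywords`, `tag_segments`, and `STOP_WORDS` are hypothetical and do not appear in the source.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on", "we", "for"}


@dataclass
class TimedWord:
    """One word of the text sequence with its position in the audio signal."""
    text: str
    start: float  # seconds from the start of the audio signal
    end: float


@dataclass
class VideoSegment:
    """A span of the video clip; it shares its time range with an audio segment."""
    start: float
    end: float
    tags: list = field(default_factory=list)


def select_keywords(words, seg_start, seg_end, top_k=3):
    """Return up to top_k most frequent non-stop words that begin inside
    the audio segment [seg_start, seg_end).

    Frequency ranking is an assumption made for this sketch; the source
    does not say how keywords are selected.
    """
    counts = Counter(
        w.text.lower()
        for w in words
        if seg_start <= w.start < seg_end and w.text.lower() not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(top_k)]


def tag_segments(words, segments, top_k=3):
    """Tag each video segment with keywords from the audio segment
    covering the same time range."""
    for seg in segments:
        seg.tags = select_keywords(words, seg.start, seg.end, top_k)
    return segments
```

Calling `tag_segments` with a timed transcript and a list of `VideoSegment` objects fills each segment's `tags` from the words spoken during that segment's time range, which mirrors the correspondence between the first audio segment and the target video segment.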