To associate image data with speech data, a character detection unit detects a text region in the image data, and a character recognition unit recognizes a character in the text region. A speech detection unit detects a speech period in the speech data, and a speech recognition unit recognizes speech in the speech period. An image-and-speech associating unit associates the recognized character with the recognized speech by performing at least one of character string matching and phonetic string matching between them. As a result, a portion of the image data and a portion of the speech data can be associated with each other.
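
The association step can be illustrated with a minimal sketch of character string matching. It assumes that character recognition and speech recognition have already produced plain strings. The names `similarity` and `associate`, the region identifiers, the (start, end) time tuples, and the 0.6 threshold are all hypothetical and do not come from the description above. The standard-library `difflib.SequenceMatcher` stands in for whatever matching method an actual implementation would use, and phonetic string matching is not shown.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def associate(text_regions, speech_periods, threshold=0.6):
    """Pair each text region with its best-matching speech period.

    text_regions:   list of (region_id, recognized_text)
    speech_periods: list of ((start_sec, end_sec), recognized_speech)
    threshold:      assumed minimum similarity for a pair to be accepted
    """
    pairs = []
    for region_id, text in text_regions:
        best_period = None
        best_score = threshold
        for period, transcript in speech_periods:
            score = similarity(text, transcript)
            if score >= best_score:
                best_period, best_score = period, score
        # Regions with no sufficiently similar speech period stay unpaired.
        if best_period is not None:
            pairs.append((region_id, best_period))
    return pairs


# Hypothetical recognition results, for illustration only.
regions = [("region-1", "Tokyo"), ("region-2", "Osaka")]
periods = [((0.0, 0.8), "osaka"), ((1.2, 2.0), "tokyo"), ((2.5, 3.0), "hello")]
print(associate(regions, periods))
```

In this sketch each text region is linked to at most one speech period, so a portion of the image maps to a portion of the audio. A phonetic variant could apply the same comparison to phoneme strings instead of character strings.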