The present invention aims to display a realistic image of a conversation scene in which viewers can visually identify the speaker regardless of the content of the image. To this end, according to the present invention, a two-dimensional or three-dimensional face model is deformed to create animations A1i through A3i, each of which expresses a state in which a person is speaking, and these animations are displayed as auxiliary images.
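
The deformation of a face model into a speaking animation can be illustrated with a minimal sketch. The abstract does not specify the deformation method, so everything here is an assumption made for illustration: the two-dimensional model is reduced to a few named control points, the mouth opening follows a cosine curve, and the names `NEUTRAL_FACE`, `deform`, and `speaking_animation` are hypothetical.

```python
import math

# Hypothetical 2D face model: named control points as (x, y), with y growing downward.
# The point names and coordinates are illustrative assumptions, not from the source.
NEUTRAL_FACE = {
    "upper_lip": (0.0, 1.0),
    "lower_lip": (0.0, 1.2),
    "chin": (0.0, 2.0),
    "left_mouth_corner": (-0.5, 1.1),
    "right_mouth_corner": (0.5, 1.1),
}


def deform(face, openness):
    """Return a copy of the face with the mouth opened by `openness` in [0, 1]."""
    max_drop = 0.4  # assumed maximum downward displacement of the lower lip
    out = dict(face)  # copy so the neutral model is left unchanged
    x, y = face["lower_lip"]
    out["lower_lip"] = (x, y + max_drop * openness)
    x, y = face["chin"]
    out["chin"] = (x, y + 0.5 * max_drop * openness)  # chin follows at half the drop
    return out


def speaking_animation(face, n_frames=8, cycles=2):
    """Create frames in which the mouth opens and closes periodically."""
    frames = []
    for i in range(n_frames):
        phase = 2 * math.pi * cycles * i / n_frames
        openness = 0.5 * (1 - math.cos(phase))  # 0 when closed, 1 when fully open
        frames.append(deform(face, openness))
    return frames


frames = speaking_animation(NEUTRAL_FACE)
```

Each element of `frames` is one deformed copy of the model; played in sequence, the frames would show the mouth opening and closing, which is one simple way to express a speaking state. A three-dimensional model would be handled the same way with (x, y, z) control points.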