Disclosed are various embodiments for processing verbal queries relative to video content. A verbal query that is associated with a portion of video content is received. The verbal query specifies a relative frame location. An action is performed based at least in part on the portion of the video content at the relative frame location.
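By way of illustration only, the following minimal sketch shows one way a relative frame location might be resolved from the text of a verbal query that has already been transcribed. The function name `resolve_relative_frame`, the phrase pattern, and the unit table are hypothetical and are not taken from the disclosure; speech recognition and the subsequent action are assumed to be handled elsewhere.

```python
import re

# Hypothetical unit table (assumption): seconds represented by each unit word.
_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}

# Hypothetical phrase pattern (assumption): "<number> <unit>(s) ago|from now|later".
_PATTERN = re.compile(r"(\d+)\s+(second|minute|hour)s?\s+(ago|from now|later)")


def resolve_relative_frame(query_text, current_frame, fps, total_frames):
    """Map a relative phrase in a transcribed query to an absolute frame index.

    Returns None when the query contains no recognized relative phrase.
    """
    match = _PATTERN.search(query_text.lower())
    if match is None:
        return None
    amount, unit, direction = match.groups()
    # Convert the spoken time offset into a whole number of frames.
    offset_frames = round(int(amount) * _UNIT_SECONDS[unit] * fps)
    if direction == "ago":
        offset_frames = -offset_frames
    # Clamp the result to the valid frame range of the video content.
    target = current_frame + offset_frames
    return max(0, min(total_frames - 1, target))
```

For example, under these assumptions a query such as "Who was on screen 10 seconds ago?" received at frame 1000 of 24 frames-per-second content would resolve to frame 760, and an action could then be performed based on the portion of the video content at that frame.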