Disclosed are a method and an apparatus for processing speech. The method includes obtaining context information from a speech signal of a user using a neural network-based encoder; determining intent information of the speech signal based on the context information; determining, based on the context information, attention information corresponding to a segment included in the speech signal; and determining, based on the attention information, a segment value of the segment by recognizing, using a decoder, a portion of the context information identified as corresponding to the segment.
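
The four steps of the method can be illustrated with a minimal, non-authoritative sketch. Everything in it is an assumption made for illustration and is not taken from the disclosure: the function names (`encode`, `classify_intent`, `attend`, `decode_segment`), the hand-written feature vectors, the dot-product attention, the threshold of 0.2, and the nearest-vocabulary lookup that stands in for a decoder. A pass-through function stands in for the neural network-based encoder, so the sketch shows only the data flow, not a trained model.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode(frames):
    """Hypothetical stand-in for the encoder: one context vector per frame.

    A real encoder would be a trained neural network; this toy passes
    the input features through unchanged.
    """
    return [list(f) for f in frames]


def classify_intent(context, intent_weights):
    """Determine intent from mean-pooled context (assumed linear scoring)."""
    pooled = [sum(col) / len(context) for col in zip(*context)]
    scores = {name: dot(pooled, w) for name, w in intent_weights.items()}
    return max(scores, key=scores.get)


def attend(context, segment_query):
    """Attention weights over context positions for one segment (assumed dot-product attention)."""
    return softmax([dot(c, segment_query) for c in context])


def decode_segment(context, attention, vocab, threshold=0.2):
    """Recognize only the context positions that attention identifies as the segment.

    The nearest-vocabulary lookup is a hypothetical stand-in for a decoder.
    """
    tokens = []
    for c, a in zip(context, attention):
        if a >= threshold:
            tokens.append(max(vocab, key=lambda t: dot(vocab[t], c)))
    return " ".join(tokens)


def run_demo():
    # Toy features for the utterance "call john smith".
    # Dimensions (assumed): [call, john, smith, is-part-of-a-name].
    frames = [[1, 0, 0, 0], [0, 1, 0, 1], [0, 0, 1, 1]]
    intent_weights = {"make_call": [1, 0, 0, 0], "play_music": [0, 0, 0, -1]}
    name_query = [0, 0, 0, 4]  # hypothetical query for a "contact name" segment
    vocab = {"call": [1, 0, 0, 0], "john": [0, 1, 0, 0], "smith": [0, 0, 1, 0]}

    context = encode(frames)                                # context information
    intent = classify_intent(context, intent_weights)       # intent information
    attention = attend(context, name_query)                 # attention information
    value = decode_segment(context, attention, vocab)       # segment value
    return intent, attention, value


if __name__ == "__main__":
    intent, attention, value = run_demo()
    print(intent, [round(a, 3) for a in attention], value)
```

In this toy setup the attention weights concentrate on the two name frames, so only that portion of the context is passed to the decoding step. Restricting recognition to the attended portion is the design point the sketch is meant to show; the specific scoring functions are interchangeable.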