Disclosed is a method for tracking an object, performed by a computing device including at least one processor. The method includes: obtaining a query set including one or more query samples from a first frame of an image sequence including two or more image frames; obtaining a detection set including one or more detection samples from a second frame of the image sequence; and determining a label for each query sample in the query set based on a label of each detection sample in the detection set.
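The abstract does not specify how a detection sample's label is transferred to a query sample. The sketch below is one possible instance, offered only for illustration. It assumes that every query and detection sample is represented by a feature vector and that each query sample takes the label of its most similar detection sample, measured by cosine similarity (nearest-neighbour label assignment). The function names `cosine_similarity` and `assign_query_labels` and the vector representation are hypothetical and are not taken from the disclosure.

```python
import math
from typing import Hashable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_query_labels(
    query_set: Sequence[Vector],
    detection_set: Sequence[Vector],
    detection_labels: Sequence[Hashable],
) -> List[Hashable]:
    """Hypothetical sketch: give each query sample (first frame) the label of
    the most similar detection sample (second frame)."""
    if not detection_set:
        raise ValueError("detection set must contain at least one sample")
    if len(detection_set) != len(detection_labels):
        raise ValueError("each detection sample needs exactly one label")

    labels: List[Hashable] = []
    for query in query_set:
        # Index of the detection sample with the highest cosine similarity.
        best = max(
            range(len(detection_set)),
            key=lambda j: cosine_similarity(query, detection_set[j]),
        )
        labels.append(detection_labels[best])
    return labels


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]      # features from the first frame
    detections = [[0.0, 2.0], [3.0, 0.1]]   # features from the second frame
    print(assign_query_labels(queries, detections, ["car", "person"]))
```

With these toy vectors, the first query is closest in direction to the second detection and the second query is closest to the first, so the script prints `['person', 'car']`. A practical tracker might instead use a learned embedding, a similarity threshold for unmatched queries, or a one-to-one assignment such as Hungarian matching. The abstract does not state which of these, if any, applies.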