Patent attributes
Methods and computing devices are provided that use semi-supervised learning to perform end-to-end video object segmentation, tracking one or more objects through a video sequence of frames from a single-frame annotation of a reference frame. A known deep learning model may be used to annotate the reference frame, providing ground truth locations and masks for each respective object. A current frame is processed to determine current frame object locations by defining object scoremaps as a normalized cross-correlation between encoded object features of the current frame and encoded object features of a previous frame. Scoremaps may be defined for each of more than one previous frame. An Intersection over Union (IoU) function, responsive to the scoremaps, ranks candidate object proposals defined from the reference frame annotation to associate the respective objects with respective locations in the current frame. Pixel-wise overlap may be removed using a merge function responsive to the scoremaps.
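The three operations named in the abstract (scoremap by normalized cross-correlation, IoU ranking of proposals, and a scoremap-driven merge) can be illustrated with a minimal, dependency-free sketch. It is not the claimed implementation. All function names (`ncc_scoremap`, `peak_box`, `iou`, `rank_proposals`, `merge_masks`) are hypothetical, the features are single-channel 2D grids rather than encoder outputs, the zero-mean form of normalized cross-correlation is an assumed variant, and converting the scoremap peak into a box before IoU ranking is one assumed way of making the ranking "responsive to the scoremaps".

```python
import math


def ncc_scoremap(feat, templ):
    """Slide `templ` over `feat` and return zero-mean normalized
    cross-correlation values in [-1, 1] (assumed NCC variant)."""
    fh, fw = len(feat), len(feat[0])
    th, tw = len(templ), len(templ[0])
    t_vals = [v for row in templ for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_zero = [v - t_mean for v in t_vals]
    t_norm = math.sqrt(sum(v * v for v in t_zero))
    scoremap = []
    for y in range(fh - th + 1):
        row = []
        for x in range(fw - tw + 1):
            w_vals = [feat[y + dy][x + dx] for dy in range(th) for dx in range(tw)]
            w_mean = sum(w_vals) / len(w_vals)
            w_zero = [v - w_mean for v in w_vals]
            w_norm = math.sqrt(sum(v * v for v in w_zero))
            denom = t_norm * w_norm
            dot = sum(a * b for a, b in zip(t_zero, w_zero))
            # Flat windows carry no structure; score them as 0.
            row.append(dot / denom if denom > 0 else 0.0)
        scoremap.append(row)
    return scoremap


def peak_box(scoremap, box_h, box_w):
    """Box (x0, y0, x1, y1) whose top-left corner is the scoremap maximum."""
    _, y, x = max((v, y, x) for y, r in enumerate(scoremap) for x, v in enumerate(r))
    return (x, y, x + box_w, y + box_h)


def iou(a, b):
    """Intersection over Union of two (x0, y0, x1, y1) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def rank_proposals(scoremap_box, proposals):
    """Order candidate boxes by IoU with the scoremap-derived box, best first."""
    return sorted(proposals, key=lambda p: iou(scoremap_box, p), reverse=True)


def merge_masks(masks, score_images):
    """Resolve pixel-wise overlap: each claimed pixel goes to the claiming
    object with the highest score there. Returns a label grid
    (0 = background, k = k-th object, 1-based). Assumes the score images
    have already been resized to the mask resolution."""
    h, w = len(masks[0]), len(masks[0][0])
    labels = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            claimants = [k for k, m in enumerate(masks) if m[y][x]]
            if claimants:
                labels[y][x] = 1 + max(claimants, key=lambda k: score_images[k][y][x])
    return labels
```

In this sketch, a per-object scoremap is computed against each previous frame's encoded features, the peak is turned into a box via `peak_box`, and `rank_proposals` picks the candidate proposal that best agrees with it. `merge_masks` then assigns any pixel claimed by several object masks to a single object.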