Patent attributes
Devices, systems, and methods for computer recognition of action in video obtain frame-level feature sets of visual features that were extracted from respective frames of a video, wherein the respective frame-level feature set of a frame includes the respective visual features that were extracted from the frame; generate first-level feature sets, wherein each first-level feature set is generated by pooling the visual features from two or more frame-level feature sets, and wherein each first-level feature set includes pooled features; and generate second-level feature sets, wherein each second-level feature set is generated by pooling the pooled features in two or more first-level feature sets, and wherein each second-level feature set includes pooled features.
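The two pooling levels described above can be illustrated with a minimal sketch. The abstract does not name a pooling operation, a window size, or a stride, so the element-wise max pooling, the window of two, and the stride of two used here are assumptions chosen only for illustration. The names `pool`, `pool_level`, `frame_level`, `first_level`, and `second_level` are hypothetical and do not come from the patent, and the feature values are made-up placeholders.

```python
from typing import Callable, List, Sequence

FeatureSet = List[float]


def pool(feature_sets: Sequence[FeatureSet],
         op: Callable[[Sequence[float]], float] = max) -> FeatureSet:
    """Pool two or more equal-length feature sets element-wise.

    `op` is the pooling operation; max is an assumption, not the patent's choice.
    """
    if len(feature_sets) < 2:
        raise ValueError("pooling needs two or more feature sets")
    return [op(values) for values in zip(*feature_sets)]


def pool_level(feature_sets: Sequence[FeatureSet],
               window: int = 2,
               stride: int = 2,
               op: Callable[[Sequence[float]], float] = max) -> List[FeatureSet]:
    """Build the next level by pooling each full window of feature sets.

    `window` and `stride` are illustrative assumptions.
    """
    last_start = len(feature_sets) - window
    return [pool(feature_sets[i:i + window], op)
            for i in range(0, last_start + 1, stride)]


# Placeholder frame-level feature sets: one list of visual features per frame.
frame_level = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3], [0.5, 0.8]]

# First level: pool the visual features of two or more frame-level sets.
first_level = pool_level(frame_level)

# Second level: pool the pooled features of two or more first-level sets.
second_level = pool_level(first_level)

print(first_level)
print(second_level)
```

With these placeholder values, four frame-level sets yield two first-level sets, which in turn yield one second-level set. Passing a different `op` (for example, an averaging function) or a different `window` would give another pooling scheme that still fits the description in the abstract.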