US Patent 11989941 Systems and methods for video and language pre-training

Embodiments describe a method of video-text pre-training that effectively learns cross-modal representations from sparse video frames and text. Specifically, an align-and-prompt framework encodes the frames and text independently using a transformer-based video encoder and a text encoder. A multi-modal encoder is then employed to capture cross-modal interaction between a plurality of video frames and a plurality of texts. The pre-training includes a prompting entity modeling task that enables the model to capture fine-grained region-entity alignment.
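
The region-entity alignment idea in the abstract can be illustrated with a toy sketch. This is not the patented method: it assumes that hand-written vectors stand in for the outputs of the video and text encoders, that sparse frame features are combined by mean pooling, and that entities are scored by a softmax over cosine similarities to entity prompt embeddings. All function names, the entity list, and the temperature value are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video-region embedding."""
    count = len(frame_embeddings)
    return [sum(column) / count for column in zip(*frame_embeddings)]


def entity_scores(region_embedding, prompt_embeddings, temperature=0.1):
    """Soft entity scores for a region (assumed scoring rule, see lead-in)."""
    names = list(prompt_embeddings)
    sims = [cosine(region_embedding, prompt_embeddings[n]) for n in names]
    return dict(zip(names, softmax(sims, temperature)))


# Toy stand-ins for encoder outputs (assumption: 3-dimensional embeddings).
frames = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]  # two sparsely sampled frames
prompts = {
    "dog": [1.0, 0.0, 0.0],
    "guitar": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}

region = mean_pool(frames)
scores = entity_scores(region, prompts)
best = max(scores, key=scores.get)
print(best, scores)
```

In this sketch the video and text sides are embedded independently and only meet at the similarity step, which mirrors the independent-encoder design the abstract describes; the multi-modal encoder that follows is not modeled here.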

