The Segment Anything Model (SAM) is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training. The Segment Anything project was developed by Meta AI and released on April 5, 2023. Meta has released the model and its code under a permissive open license (Apache 2.0), with the accompanying dataset made available for research purposes. Segmentation is the process of identifying which image pixels belong to an object. Meta already uses this technology internally for tasks such as tagging photos, moderating prohibited content, and determining which posts to recommend to users on Facebook and Instagram.
SAM can identify objects in images from various input prompts, allowing a wide range of segmentation tasks without additional training. Supported prompts include foreground/background points, bounding boxes, and masks; text prompts have been explored but were not supported in the released model. SAM's promptable design enables the model to be integrated with other systems.
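A minimal sketch of prompted segmentation, assuming the `segment_anything` Python package from the official repository and a downloaded ViT-H checkpoint; the file names and prompt coordinates below are illustrative, not part of the release.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load the model from a downloaded checkpoint (illustrative file name).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Read an image (RGB) and compute its embedding once.
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt 1: a single foreground point (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)

# Prompt 2: a bounding box around the object of interest (x1, y1, x2, y2).
masks, scores, _ = predictor.predict(
    box=np.array([425, 300, 700, 475]),
    multimask_output=False,
)
```

Because the image embedding is computed once by `set_image`, additional prompts on the same image only rerun the lightweight decoding step.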
In the blog accompanying the release of SAM, Meta discussed some of the future potential use cases of the model across various industries, including the following:
- AI systems—allowing a multimodal understanding of the world; for example, understanding both the visual and text content of a webpage
- AR/VR—enabling the selection of an object based on a user’s gaze and then “lifting” it into 3D
- Content creation—improving creative applications, such as extracting image regions for collages or video editing
- Science—studying natural occurrences on Earth or even in space; for example, by localizing animals or objects to study and track in video
Previously, there were two primary approaches to segmentation. The first, interactive segmentation, required a user to iteratively refine a mask. The second, automatic segmentation, handled specific object categories defined ahead of time, but required training on a substantial amount of manually annotated objects. SAM generalizes these two approaches in a single model: its promptable interface lets it perform both interactive and automatic segmentation in a flexible way. SAM is also trained on a diverse dataset of over 1 billion masks, enabling it to generalize to new types of objects and images.
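As a sketch of the fully automatic mode, again assuming the `segment_anything` package: the automatic mask generator samples a grid of point prompts over the image and returns a mask for everything it finds, with no predefined category list or task-specific training.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # one dict per detected mask

# Each entry carries the binary mask plus metadata such as its area and a
# predicted quality score.
print(len(masks), masks[0]["area"], masks[0]["predicted_iou"])
```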
SAM is structured with a ViT-H image encoder that runs once per image and outputs an image embedding. A prompt encoder embeds input prompts, such as clicks or boxes. A lightweight transformer-based mask decoder then predicts object masks from the image embedding and the prompt embeddings.
The image encoder has 632M parameters, while the prompt encoder and mask decoder together have 4M parameters. The image encoder is implemented in PyTorch and requires a GPU for efficient inference. The prompt encoder and mask decoder can run directly in PyTorch or be converted to ONNX, and they run efficiently on either a CPU or GPU.
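A quick way to see the split between the heavy image encoder and the lightweight prompt encoder and mask decoder is to count parameters per component. This sketch assumes the `segment_anything` package, where the loaded model exposes `image_encoder`, `prompt_encoder`, and `mask_decoder` submodules as in the reference implementation.

```python
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

def count_params(module):
    # Total number of parameters in a PyTorch module.
    return sum(p.numel() for p in module.parameters())

print(f"image encoder: {count_params(sam.image_encoder) / 1e6:.0f}M params")
light = count_params(sam.prompt_encoder) + count_params(sam.mask_decoder)
print(f"prompt encoder + mask decoder: {light / 1e6:.0f}M params")
```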