Techniques for detection of spatial relationships are provided. An input image is divided into a set of patches, and a first feature tensor is generated for a first patch of the set of patches. The first feature tensor is processed using an attention mechanism to generate a first transformed feature tensor, and a classification indicating distancing between physical entities in the input image is generated based at least in part on the first transformed feature tensor.
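The flow described above (patch division, per-patch feature generation, attention, classification) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the disclosure: the function names, the linear patch embedding, the single-head scaled dot-product self-attention with identity query/key/value projections, the sigmoid classification head, and the randomly initialized (untrained) weights. The labels `"distanced"` and `"not_distanced"` are likewise hypothetical.

```python
import math
import random


def split_into_patches(image, patch_size):
    """Divide a 2D grayscale image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches


def linear(vector, weights):
    """Multiply a length-n vector by an n-row, m-column weight matrix."""
    return [sum(v * row[j] for v, row in zip(vector, weights))
            for j in range(len(weights[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention; Q, K, V are the features themselves
    (identity projections, an assumption made to keep the sketch short)."""
    dim = len(features[0])
    outputs = []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        weights = softmax(scores)
        outputs.append([sum(w * value[d] for w, value in zip(weights, features))
                        for d in range(dim)])
    return outputs


def classify_distancing(image, patch_size, embed_weights, head_weights, head_bias):
    """Hypothetical end-to-end pass: patches -> features -> attention -> label."""
    patches = split_into_patches(image, patch_size)
    features = [linear(p, embed_weights) for p in patches]  # one feature vector per patch
    transformed = self_attention(features)
    first_transformed = transformed[0]  # transformed features of the first patch
    logit = sum(x * w for x, w in zip(first_transformed, head_weights)) + head_bias
    probability = 1.0 / (1.0 + math.exp(-logit))
    label = "distanced" if probability >= 0.5 else "not_distanced"
    return label, probability, len(patches)


# Toy inputs: an 8x8 random image and random, untrained weights.
rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]
embed_weights = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(16)]
head_weights = [rng.uniform(-1, 1) for _ in range(4)]
label, probability, num_patches = classify_distancing(
    image, 4, embed_weights, head_weights, 0.0)
print(label, round(probability, 3), num_patches)
```

Because the weights are untrained, the printed label carries no meaning; the sketch only shows how the transformed feature vector of the first patch could feed a classifier. A practical system would presumably use learned projections and a trained classification head.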