This paper introduces non-local operations as a novel building block for deep neural networks, designed to capture long-range dependencies. Inspired by the classical non-local means method in computer vision, the non-local operation computes the response at a position as a weighted sum of features from all positions, allowing for direct interaction between distant elements. This approach is more efficient and flexible compared to traditional convolutional and recurrent operations, which process local neighborhoods in space or time. The authors demonstrate the effectiveness of non-local operations in video classification, object detection, segmentation, and pose estimation tasks. On the Kinetics and Charades datasets, non-local models outperform current state-of-the-art methods, even without additional features like optical flow. The paper also shows that non-local operations can be easily integrated into existing architectures, such as Mask R-CNN for COCO, improving performance in object detection and keypoint estimation. The code for the non-local operations is available at <https://github.com/facebookresearch/video-nonlocal-net>.This paper introduces non-local operations as a novel building block for deep neural networks, designed to capture long-range dependencies. Inspired by the classical non-local means method in computer vision, the non-local operation computes the response at a position as a weighted sum of features from all positions, allowing for direct interaction between distant elements. This approach is more efficient and flexible compared to traditional convolutional and recurrent operations, which process local neighborhoods in space or time. The authors demonstrate the effectiveness of non-local operations in video classification, object detection, segmentation, and pose estimation tasks. On the Kinetics and Charades datasets, non-local models outperform current state-of-the-art methods, even without additional features like optical flow. The paper also shows that non-local operations can be easily integrated into existing architectures, such as Mask R-CNN for COCO, improving performance in object detection and keypoint estimation. The code for the non-local operations is available at <https://github.com/facebookresearch/video-nonlocal-net>.