Anticipating Next Active Objects for Egocentric Videos

Sanket Thakur
Cigdem Beyan
Pietro Morerio
Vittorio Murino
Alessio Del Bue

[Paper]
[GitHub] (Soon)


Fig 1. The next-active-object (NAO) problem formulation in our paper is inspired by the action anticipation setup. Given a video clip V, we split it into three sequential parts: the observed segment of length τo, the time-to-contact (TTC) window of length τa, and the action segment, which starts at timestep t = τs. The goal is to observe the video only up to the start of the TTC window (i.e., τa seconds before the action begins) and to localize the NAO at the beginning of the action segment, at timestep t = τs, where contact might happen.
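
For concreteness, here is a minimal sketch (our own illustration, not the authors' code) of how a clip could be split into the three segments by frame index; the function name, the fixed frame rate, and the rounding are assumptions made for the example.

# Illustrative split of a clip into observed / TTC / action segments,
# assuming the action starts at time tau_s (seconds) and a fixed frame rate.
def split_clip(num_frames, tau_s, tau_a, tau_o, fps=30):
    """Return (start, end) frame-index ranges for the three segments.

    tau_s : action start time in seconds
    tau_a : time-to-contact (TTC) window length in seconds
    tau_o : observed-segment length in seconds
    """
    action_start = int(round(tau_s * fps))
    ttc_start = int(round((tau_s - tau_a) * fps))          # end of observation
    obs_start = int(round((tau_s - tau_a - tau_o) * fps))  # start of observation

    observed = (max(obs_start, 0), ttc_start)  # frames the model may see
    ttc = (ttc_start, action_start)            # unobserved anticipation gap
    action = (action_start, num_frames)        # NAO is localized at its first frame
    return observed, ttc, action

if __name__ == "__main__":
    obs, ttc, act = split_clip(num_frames=300, tau_s=8.0, tau_a=0.5, tau_o=2.5)
    print(obs, ttc, act)  # (150, 225) (225, 240) (240, 300)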

Abstract

This paper addresses the problem of anticipating the next-active-object location in the future, for a given egocentric video clip where contact might happen, before any action takes place. The problem is considerably hard, as we aim at estimating the position of such objects in a scenario where the observed clip and the action segment are separated by the so-called time-to-contact segment. We name this task Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next active object in an egocentric clip, where contact might happen to undertake a human-object interaction. We benchmark our method on three major egocentric datasets, namely EpicKitchen-100, EGTEA+, and Ego4D. Since the defined task is new, we compare our model with state-of-the-art action anticipation-based method(s) to curate relevant baseline methods. Finally, we also provide next-active-object annotations for the EpicKitchen-100 and EGTEA+ datasets.


Code


Fig 2. Our T-ANACTO model is an encoder-decoder architecture. Its encoder is composed of an object detector and a Vision Transformer (ViT). The object detector takes an input frame (e.g., of size 1920×1080) and predicts object locations as bounding boxes (x, y, w, h) with detection confidence scores (c). The ViT takes the resized (224×224) frame as input and divides it into 16×16 patches. The object detections are also rescaled to match the resized frame (i.e., 224×224), reshaped, and passed through an MLP that projects them into the same dimension as the embeddings from the transformer encoder; the two are then concatenated and given to the decoder. The transformer decoder uses temporal aggregation to predict the next active object: for each frame, it aggregates the encoder features of the current and past frames together with the embeddings of the last predicted active objects, and then predicts the next active object for the future frames.
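
The fusion step in the caption can be pictured with the following PyTorch-style sketch (a simplification under our own assumptions, not the released T-ANACTO code). The ViT backbone is stubbed out with random patch tokens; only the box rescaling, the MLP projection, and the concatenation with the patch embeddings are shown. The embedding size, box count, and module names are illustrative.

# Minimal sketch of the encoder-side fusion: detector boxes are rescaled to
# the 224x224 frame, projected by an MLP, and concatenated with patch tokens.
import torch
import torch.nn as nn

EMBED_DIM = 768          # assumed ViT embedding size
FRAME_SIZE = 224         # frames are resized to 224x224
ORIG_W, ORIG_H = 1920, 1080

def rescale_boxes(boxes_xywh):
    """Scale detector boxes from the original frame size to 224x224."""
    scale = torch.tensor([FRAME_SIZE / ORIG_W, FRAME_SIZE / ORIG_H,
                          FRAME_SIZE / ORIG_W, FRAME_SIZE / ORIG_H])
    return boxes_xywh * scale

class BoxEmbedder(nn.Module):
    """MLP mapping (x, y, w, h, c) detections into the patch-embedding space."""
    def __init__(self, dim=EMBED_DIM):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(5, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, boxes, scores):
        det = torch.cat([rescale_boxes(boxes), scores.unsqueeze(-1)], dim=-1)
        return self.mlp(det)  # (B, num_boxes, dim)

def fuse(patch_tokens, box_tokens):
    """Concatenate ViT patch tokens and object-box tokens along the sequence."""
    return torch.cat([patch_tokens, box_tokens], dim=1)

if __name__ == "__main__":
    B, num_patches, num_boxes = 2, (224 // 16) ** 2, 4
    patch_tokens = torch.randn(B, num_patches, EMBED_DIM)  # stand-in for ViT output
    boxes = torch.rand(B, num_boxes, 4) * torch.tensor([1920.0, 1080.0, 400.0, 400.0])
    scores = torch.rand(B, num_boxes)
    tokens = fuse(patch_tokens, BoxEmbedder()(boxes, scores))
    print(tokens.shape)  # torch.Size([2, 200, 768])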

[GitHub] (to be released soon)


ANACTO annotations

The annotations for the EpicKitchen-100 and EGTEA+ datasets for the ANACTO task were curated using the hand-object detector of Shan et al., which is currently state of the art on EpicKitchen-100. We ran the detector on both datasets and extracted, for each frame, bounding boxes for the left hand, the right hand, and the active object(s). We then used the action anticipation splits from Furnari et al. to label the next-active-object annotations at the beginning of each action.
We provide annotations for 94% and 92% of the video clips in the EpicKitchen-100 and EGTEA+ datasets, respectively.
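
As an illustration only (the released annotation format may differ), a per-frame detection record and the derivation of the clip-level NAO label could look like the following; the field names and the helper function are hypothetical.

# Hypothetical sketch: the NAO label for a clip is taken from the active-object
# boxes detected at the first frame of the action segment.
def nao_label(per_frame_dets, action_start_frame):
    """per_frame_dets: dict frame_idx -> {"left_hand": box, "right_hand": box,
    "active_objects": [boxes]}, with boxes given as (x, y, w, h)."""
    dets = per_frame_dets.get(action_start_frame, {})
    return dets.get("active_objects", [])

if __name__ == "__main__":
    dets = {240: {"left_hand": (40, 60, 80, 90),
                  "right_hand": (300, 80, 70, 85),
                  "active_objects": [(350, 120, 60, 40)]}}
    print(nao_label(dets, action_start_frame=240))  # [(350, 120, 60, 40)]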

Note that the T-ANACTO model is designed to identify the next-active-object location at the beginning of an action, i.e., after the TTC window τa. However, since recent literature (Grauman et al.) also requires identifying the next active object in the last observed frame, we provide the corresponding model visualizations to argue that T-ANACTO is also able to direct its attention to the important regions of the last observed frame when predicting the next active object.


Results

We provide visualizations of the model's attention over multiple video clips. We stop at a randomly chosen frame in the observed segment to illustrate how our model, T-ANACTO, identifies important regions in the frame based on past motion. The heatmaps show that the model is also capable of anticipating the next active object in the last observed frame. The green bounding boxes are those predicted by T-ANACTO for the beginning of the action.
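
Heatmaps of this kind can be reproduced in spirit by overlaying the patch-level attention map on the last observed frame. The snippet below is our own sketch using OpenCV, not the paper's visualization code, and it assumes the attention over the ViT patch grid has already been extracted and normalized to [0, 1].

# Overlay a 14x14 attention map on a frame as a blended heatmap.
import cv2
import numpy as np

def overlay_attention(frame_bgr, attn, alpha=0.5):
    """frame_bgr: HxWx3 uint8 image; attn: 14x14 float array in [0, 1]."""
    h, w = frame_bgr.shape[:2]
    heat = cv2.resize(attn.astype(np.float32), (w, h))            # upsample to frame size
    heat = cv2.applyColorMap((heat * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cv2.addWeighted(frame_bgr, 1 - alpha, heat, alpha, 0)  # blend heatmap and frame

if __name__ == "__main__":
    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # placeholder frame
    attn = np.random.rand(14, 14)                      # placeholder attention map
    print(overlay_attention(frame, attn).shape)        # (1080, 1920, 3)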

Model Visualizations

The pictures presented here show the model's attention visualizations with respect to the last observed frame of an observed segment.
1. Results on EpicKitchen-100 for the last observed frame of a video clip, with TTC τa = 0.5 seconds before the beginning of an action.
2. Results on the Ego4D dataset when trained to identify the next active object w.r.t. the last observed frame.
3. Results on EGTEA+ for the last observed frame of a video clip, with τa = 1.0 seconds before the beginning of an action.
4. Results showing the diversity of spatial attention for the last observed frame preceding the beginning of an action segment, for different TTC windows τa = 0.25, 0.5, 1.0 seconds. The spatial attention regions appear more assertive as the model examines frames closer to the action segment, i.e., as τa decreases.
5. Our T-ANACTO model also learns to attend to the hand position in an image frame, even though we do not explicitly provide the hand location during training and testing.
6. Failure case: the model fails to attend to objects that are light colored or easily camouflaged with the background.
7. Failure case: the scene changes completely at the beginning of the action with respect to the previously observed segment.



Paper and Supplementary Material

S. Thakur, C. Beyan, P. Morerio, V. Murino, A. Del Bue
Anticipating Next Active Objects for Egocentric Videos
(hosted on arXiv)

Mail sanket.thakur@iit.it for access to the next-active-object data for EK-100 and EGTEA+ for the ANACTO task. For NAO annotations on the last observed frame, refer to this work: NAOGAT.

[Bibtex]


Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.