Fig 1. The next-active-object (NAO) problem formulation in our paper is inspired by the action anticipation setup. Given a video clip V, we split it into three sequential parts: the observed segment of length τo, the time-to-contact (TTC) window of length τa, and the action segment, which starts at timestep τo + τa, i.e., right after the TTC window.
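Formally, using the caption's notation (the exact indexing is our assumption and may differ slightly from the paper), the split of a clip of length T can be written as:

\[
V \;=\; \underbrace{[\,0,\; \tau_o\,)}_{\text{observed segment}} \;\cup\; \underbrace{[\,\tau_o,\; \tau_o + \tau_a\,)}_{\text{TTC window}} \;\cup\; \underbrace{[\,\tau_o + \tau_a,\; T\,]}_{\text{action segment}},
\]

where only the first interval is observed by the model, while the NAO bounding box is predicted for timestep τo + τa.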
This paper addresses the problem of anticipating the location of the next active object in a given egocentric video clip, i.e., where the contact might happen, before any action takes place. The problem is considerably hard, as we aim to estimate the position of such objects in a scenario where the observed clip and the action segment are separated by the so-called time-to-contact segment. We name this task Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next active object in an egocentric clip, i.e., the object with which contact might happen to undertake a human-object interaction. We benchmark our method on three major egocentric datasets, namely EpicKitchen-100, EGTEA+ and Ego4D. Since the defined task is new, we compare our model with state-of-the-art action anticipation-based methods to curate relevant baselines. Finally, we also provide next-active-object annotations for the EpicKitchen-100 and EGTEA+ datasets.
Fig 2. Our T-ANACTO model is an encoder-decoder architecture. Its encoder is composed of an object detector and a Vision Transformer (ViT). The object detector takes an input frame (e.g., of size 1920×1080) and predicts the location of objects in terms of bounding boxes (x, y, w, h) and detection confidence scores (c). The ViT takes the resized (224×224) frame as input and divides it into 16×16 patches. The object detections are also rescaled to the resized frame (i.e., 224×224), reshaped, and then passed through an MLP that projects them into the same dimension as the embeddings from the transformer encoder; the two sets of embeddings are concatenated and given to the decoder. The transformer decoder uses temporal aggregation to predict the next active object: for each frame, it aggregates the encoder features of the current and past frames together with the embeddings of the last predicted active objects, and then predicts the next active object for the future frames.
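A minimal PyTorch-style sketch of this encoder-decoder flow is given below. Module names, dimensions, and the fusion details are our assumptions for illustration (e.g., we use a single learned NAO query instead of the embeddings of the last predicted active objects); this is not the released implementation.

# Sketch of the T-ANACTO encoder-decoder flow described in Fig. 2.
# All module names, dimensions, and fusion details are illustrative assumptions.
import torch
import torch.nn as nn

class DetectionEmbedder(nn.Module):
    """Projects per-frame detections (x, y, w, h, c) to the ViT embedding size."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(5, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, boxes):              # boxes: (B, N_det, 5), rescaled to 224x224
        return self.mlp(boxes)             # (B, N_det, D)

class FrameEncoder(nn.Module):
    """Stand-in for the ViT branch: 16x16 patch embedding + transformer encoder."""
    def __init__(self, embed_dim=768, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, frames):             # frames: (B, 3, 224, 224)
        x = self.patch_embed(frames)       # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)   # (B, 196, D)
        return self.encoder(x)             # (B, 196, D)

class TAnactoSketch(nn.Module):
    def __init__(self, embed_dim=768, heads=8, depth=4):
        super().__init__()
        self.frame_enc = FrameEncoder(embed_dim)
        self.det_embed = DetectionEmbedder(embed_dim)
        layer = nn.TransformerDecoderLayer(embed_dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))  # learned NAO query
        self.box_head = nn.Linear(embed_dim, 4)                  # (x, y, w, h)
        self.score_head = nn.Linear(embed_dim, 1)                # NAO confidence

    def forward(self, frames, boxes):
        # frames: (B, T, 3, 224, 224); boxes: (B, T, N_det, 5)
        B, T = frames.shape[:2]
        memory = []
        for t in range(T):                                # temporal aggregation over observed frames
            patches = self.frame_enc(frames[:, t])        # (B, 196, D)
            dets = self.det_embed(boxes[:, t])            # (B, N_det, D)
            memory.append(torch.cat([patches, dets], 1))  # concat patch + detection tokens
        memory = torch.cat(memory, dim=1)                 # (B, T*(196+N_det), D)
        q = self.query.expand(B, -1, -1)
        h = self.decoder(q, memory)                       # attend over current + past frames
        return self.box_head(h), self.score_head(h)

if __name__ == "__main__":
    model = TAnactoSketch()
    frames = torch.randn(1, 4, 3, 224, 224)    # 4 observed frames
    boxes = torch.rand(1, 4, 10, 5)            # 10 detections per frame
    box, score = model(frames, boxes)
    print(box.shape, score.shape)              # torch.Size([1, 1, 4]) torch.Size([1, 1, 1])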
The annotations for the EpicKitchen-100 and EGTEA+ datasets for the ANACTO task were curated using the hand-object detector of Shan et al., which is currently state of the art on EpicKitchen-100. We ran the detector on both datasets and extracted the annotations (bounding boxes for the left hand, right hand, and active object(s)) for each frame. We then used the action anticipation splits from Furnari et al. to label the next-active-object annotations at the beginning of each action. We provide annotations for 94% and 92% of the video clips in EpicKitchen-100 and EGTEA+, respectively. Note that T-ANACTO is designed to identify the next-active-object location at the beginning of an action, i.e., after τa. However, since recent literature (Grauman et al.) also requires identifying the next active object in the last observed frame, we provide attention visualizations to show that T-ANACTO can also focus on the important regions of the last observed frame to predict the next active object.
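A simplified sketch of this curation step is shown below. The function names, per-frame detection format, and score threshold are hypothetical; they only illustrate the logic of keeping the detector's active-object boxes at the frame where each action begins.

# Hypothetical sketch of the NAO annotation curation step: for each action
# segment from the anticipation splits, keep the active-object boxes that the
# hand-object detector reports on the frame where the action begins.
import json

def curate_nao_annotations(detections_per_frame, action_segments, min_score=0.5):
    """
    detections_per_frame: dict frame_idx -> list of dicts
        {"label": "left_hand" | "right_hand" | "active_object", "box": [x, y, w, h], "score": c}
    action_segments: list of dicts {"clip_id": str, "action_start_frame": int}
    Returns one NAO record per action segment (None if nothing was detected).
    """
    records = []
    for seg in action_segments:
        frame_dets = detections_per_frame.get(seg["action_start_frame"], [])
        nao_boxes = [d["box"] for d in frame_dets
                     if d["label"] == "active_object" and d["score"] >= min_score]
        records.append({
            "clip_id": seg["clip_id"],
            "action_start_frame": seg["action_start_frame"],
            "next_active_objects": nao_boxes or None,
        })
    return records

if __name__ == "__main__":
    dets = {120: [{"label": "active_object", "box": [300, 200, 80, 60], "score": 0.9},
                  {"label": "right_hand", "box": [400, 500, 120, 90], "score": 0.95}]}
    segs = [{"clip_id": "P01_01", "action_start_frame": 120}]
    print(json.dumps(curate_nao_annotations(dets, segs), indent=2))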
We provide visualizations of the model's attention over multiple video clips. We stop at a random frame in the observed segment to illustrate how our model, T-ANACTO, identifies important regions in the frame based on past motion. The heatmaps show that the model is also capable of anticipating the next active object in the last observed frame. The green bounding boxes are those predicted by T-ANACTO at the beginning of an action.
The pictures presented here are the attention visualizations of the model.
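A minimal sketch of how such attention heatmaps can be overlaid on a frame is given below. We assume access to the NAO query's cross-attention weights over the 14×14 ViT patch grid; the variable names and the OpenCV-based rendering are our choices, not the paper's code.

# Sketch: turn decoder cross-attention over the 14x14 ViT patch grid into a
# heatmap overlay on the original frame. Assumes `attn` has shape (196,) and
# holds the NAO query's attention over the patch tokens (our assumption).
import numpy as np
import cv2

def attention_overlay(frame_bgr, attn, alpha=0.5):
    h, w = frame_bgr.shape[:2]
    grid = attn.reshape(14, 14)                               # 224 / 16 = 14 patches per side
    grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)   # normalize to [0, 1]
    heat = cv2.resize(grid.astype(np.float32), (w, h))        # upsample to frame size
    heat = cv2.applyColorMap((heat * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cv2.addWeighted(heat, alpha, frame_bgr, 1 - alpha, 0)

if __name__ == "__main__":
    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # dummy frame
    attn = np.random.rand(196)                          # dummy attention weights
    cv2.imwrite("attn_overlay.png", attention_overlay(frame, attn))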
S. Thakur, C. Beyan, P. Morerio, V. Murino, A. Del Bue. Anticipating Next Active Objects for Egocentric Videos (hosted on arXiv). Email sanket.thakur@iit.it for access to the next-active-object data for EK-100 and EGTEA+ for the ANACTO task. For NAO annotations on the last observed frame, refer to NAOGAT.
Acknowledgements