![]()
Fig 1. Some instances from the dataset we curated using the Tobii Eye Tracker. Each row belongs to a single interaction session, and the frames (from left to right) are given in order of occurrence. The dataset consists of multiple scenarios covering different lighting conditions, indoor/outdoor scenes, and occlusions.
Gaze prediction in egocentric videos is a fairly new research topic with several potential applications in assistive technology (e.g., supporting blind people in their daily interactions), security (e.g., attention tracking in risky work environments), education (e.g., augmented/mixed reality training simulators, immersive games), and so forth. Egocentric gaze is typically estimated from video, while only a few works attempt to use inertial measurement unit (IMU) data, a sensor modality often available in wearable devices (e.g., augmented reality headsets). In this paper, we instead examine whether joint learning of egocentric video and the corresponding IMU data can improve first-person gaze prediction compared to using these modalities separately. To this end, we propose a multimodal network and evaluate it on several unconstrained social interaction scenarios captured from a first-person perspective. The proposed multimodal network achieves better results than unimodal methods as well as several (multimodal) baselines, showing that using egocentric video together with IMU data can boost first-person gaze estimation performance.
![]()
Fig 2. Our (a) optical flow+IMU and (b) RGB+IMU multimodal networks for first-person gaze prediction. We regress the (𝑥, 𝑦) gaze image positions by jointly learning from the two modalities. FC and BN stand for fully connected layer and batch normalization, respectively.
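For readers who want a concrete picture of the fusion scheme in Fig 2(b), below is a minimal PyTorch sketch of an RGB+IMU network that regresses (x, y) gaze positions. The backbone choice, the IMU encoder, and all layer widths are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class RGBIMUGazeNet(nn.Module):
    """Illustrative RGB+IMU network regressing (x, y) gaze image positions.
    Backbone, IMU encoder, and layer widths are assumptions, not the paper's exact setup."""

    def __init__(self, imu_channels=6):
        super().__init__()
        # RGB branch: a ResNet-18 trunk up to the pooled 512-d feature
        # (pretrained weights can be loaded separately if desired).
        resnet = models.resnet18()
        self.rgb_encoder = nn.Sequential(*list(resnet.children())[:-1])  # -> (B, 512, 1, 1)
        # IMU branch: a small 1D-conv encoder over accelerometer/gyroscope windows.
        self.imu_encoder = nn.Sequential(
            nn.Conv1d(imu_channels, 32, kernel_size=5, padding=2), nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.BatchNorm1d(64), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                                     # -> (B, 64, 1)
        )
        # Fusion head: FC + BN layers on the concatenated features, as sketched in Fig 2.
        self.head = nn.Sequential(
            nn.Linear(512 + 64, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Linear(256, 2),  # (x, y) gaze position in image coordinates
        )

    def forward(self, rgb, imu):
        # rgb: (B, 3, H, W) frame; imu: (B, imu_channels, T) sensor window
        f_rgb = self.rgb_encoder(rgb).flatten(1)
        f_imu = self.imu_encoder(imu).flatten(1)
        return self.head(torch.cat([f_rgb, f_imu], dim=1))
```

The optical flow variant in Fig 2(a) follows the same pattern, with the RGB branch swapped for an encoder over stacked flow fields.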
![]()
Fig 3. Results of our unimodal encoders (OF, RGB, IMU), our multimodal networks (OF+IMU, RGB+IMU), and baseline methods (center bias (CB), late fusion (LF), and joint learning with averaging (JA)) in terms of Accuracy (%). Multimodal learning generally tends to improve the regression of gaze points. The improvement is mainly observed for the RGB+IMU network: the jointly trained encoders provide better results than the individual encoders across the different scenarios when tested using leave-one-out cross-validation.
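As a reminder of what the leave-one-out protocol in Fig 3 entails, the sketch below trains on all interaction scenarios except one and tests on the held-out one; `train_fn` and `eval_fn` are hypothetical placeholders, not the released evaluation code.

```python
import numpy as np

def leave_one_out_accuracy(sessions, train_fn, eval_fn):
    """Train on all interaction sessions but one, evaluate on the held-out
    session, and average the per-session accuracies.
    `train_fn` and `eval_fn` are hypothetical placeholders."""
    scores = []
    for held_out, test_session in enumerate(sessions):
        train_sessions = [s for i, s in enumerate(sessions) if i != held_out]
        model = train_fn(train_sessions)             # fit on the remaining scenarios
        scores.append(eval_fn(model, test_session))  # accuracy (%) on the held-out one
    return float(np.mean(scores))
```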
The green dot represents the gaze point predicted by our multimodal encoder (RGB+IMU), and the red dot represents the ground-truth gaze point from the Tobii eye tracker. We place a box centered on each gaze point to define a vicinity around it. A prediction is classified as correct if the region of overlap (IoU) between the ground-truth bounding box and the predicted bounding box is >= 0.5. As can be seen from the result video, the predicted points are quite close to the ground-truth coordinates, although some instability is caused by high contrast in the scene.
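A minimal sketch of this correctness check, assuming square boxes centered on the predicted and ground-truth gaze points (the box side length here is an arbitrary placeholder, not the value used in the evaluation):

```python
def gaze_iou(pred_xy, gt_xy, box_size=64):
    """IoU between two axis-aligned square boxes of side `box_size`
    centered on the predicted and ground-truth gaze points."""
    half = box_size / 2.0
    # (x1, y1, x2, y2) corners of the predicted and ground-truth boxes
    p = (pred_xy[0] - half, pred_xy[1] - half, pred_xy[0] + half, pred_xy[1] + half)
    g = (gt_xy[0] - half, gt_xy[1] - half, gt_xy[0] + half, gt_xy[1] + half)
    # Width and height of the overlap rectangle (zero if the boxes are disjoint)
    ix = max(0.0, min(p[2], g[2]) - max(p[0], g[0]))
    iy = max(0.0, min(p[3], g[3]) - max(p[1], g[1]))
    inter = ix * iy
    union = 2.0 * box_size * box_size - inter
    return inter / union

def is_correct(pred_xy, gt_xy, box_size=64, thresh=0.5):
    """A prediction counts as correct when the IoU reaches the 0.5 threshold."""
    return gaze_iou(pred_xy, gt_xy, box_size) >= thresh
```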
![]()
S. Thakur, C. Beyan, P. Morerio, A. Del Bue. Predicting Gaze from Egocentric Social Interaction Videos and IMU Data. In International Conference on Multimodal Interaction, 2021 (hosted on the ACM Digital Library). Mail: sanket.thakur@iit.it for access to the data.
Acknowledgements