FG 2026 · Dataset + Zero-shot Benchmark

Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction

Berk Sezer, Ali Görkem Küçük, Erol Şahin, and Sinan Kalkan

Department of Computer Engineering and ROMER, Middle East Technical University, Ankara, Türkiye

A large-scale HRI gaze dataset for testing appearance-based 3D gaze estimation under lighting variation, robot-camera motion, head-gaze conflict, and mutual-gaze scenarios.

52 subjects
3,258 videos
620,933 frames
5.7 h of video
Gaze4HRI overview: robot-mounted camera records participants looking at table and robot targets.

TL;DR

A dataset for evaluating gaze estimation under practical HRI conditions.

Gaze4HRI benchmarks zero-shot 3D gaze estimation in HRI settings where a robot-mounted camera observes people looking at shared workspace objects or at the robot itself. The dataset foregrounds four controlled HRI variables (illumination, moving camera viewpoint, head-gaze conflict, and moving target/mutual gaze), while the benchmark shows that current methods still struggle, especially for steeply downward gaze.

Dataset focus

What Gaze4HRI contains

Gaze4HRI pairs synchronized RGB video with motion-capture-based ground truth, HRI-specific camera/target motion, and analysis-ready experiment labels.

🎥

RGB video from a robot viewpoint

Participants are recorded by an Intel RealSense D435i mounted on a UR5 wrist, matching a robot perception viewpoint rather than a laptop or static camera setup.

🎯

3D gaze ground truth

Gaze vectors are computed from calibrated interpupillary midpoint positions and known gaze target locations, expressed in the camera frame.

🤖

HRI task coverage

The dataset covers object-centered gaze on a shared table and mutual gaze where the participant follows the robot camera as a moving target.

👁️

Blink-aware evaluation

Blinked frames are masked for gaze evaluation, and the repository also includes Blink4HRI-related tooling for blink detection experiments.
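
This masking step can be reproduced with a small sketch like the one below. It assumes each blink_annotations_by_*.npy file stores one boolean (or 0/1) flag per RGB frame; that frame-level layout is an assumption about the annotation format, not a documented guarantee.

import glob
import numpy as np

def usable_frame_mask(trial_dir):
    """Per-frame boolean mask: True where no blink was annotated.

    Assumes each blink_annotations_by_*.npy holds one 0/1 flag per RGB
    frame (hypothetical layout); a frame is masked out if any annotation
    source flags a blink.
    """
    files = sorted(glob.glob(f"{trial_dir}/blink_annotations_by_*.npy"))
    blinked = np.logical_or.reduce([np.load(f).astype(bool) for f in files])
    return ~blinked

# Usage: average per-frame angular errors over non-blink frames only.
# errors = per_frame_angular_errors(...)          # hypothetical helper
# mean_err = errors[usable_frame_mask(trial_dir)].mean()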

Collection setup

Gaze4HRI uses a controlled lab HRI setup centered around a UR5 robot arm, wrist-mounted RGB camera, and OptiTrack motion capture. This gives the benchmark accurate head/eye geometry while preserving the camera viewpoint challenges of robot perception.

  • RGB images: 1920×1080 at 30 FPS
  • Motion capture: 100 Hz
  • Ground truth: vector from interpupillary midpoint to gaze target
  • Evaluation: angular error between predicted and ground-truth 3D gaze vectors
Gaze4HRI setup overview image
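
To make the ground-truth definition and the angular-error metric above concrete, here is a minimal sketch. It assumes eye_positions.npy and target_positions.npy (see the dataset structure below) hold per-frame (N, 3) positions expressed in the same frame as the predictions; the exact array layout is an assumption.

import numpy as np

def angular_error_deg(eye_positions, target_positions, predicted_gaze):
    """Per-frame angular error (°) between predicted and ground-truth gaze.

    eye_positions:    (N, 3) interpupillary midpoint per frame (assumed layout)
    target_positions: (N, 3) gaze target position per frame (assumed layout)
    predicted_gaze:   (N, 3) predicted 3D gaze directions
    """
    gt = target_positions - eye_positions                        # ground-truth gaze vector
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    pred = predicted_gaze / np.linalg.norm(predicted_gaze, axis=1, keepdims=True)
    cos = np.clip((gt * pred).sum(axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))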

Experiment modules

Four HRI variables in one benchmark

Gaze4HRI is organized around four experiment types, each designed to test a different challenge for gaze estimation in HRI.

Exp. 1

Illumination

Controlled lighting levels test whether gaze estimation models remain accurate from dim to bright conditions.

lighting_10 lighting_25 lighting_50 lighting_100
Experiment design for illumination variation in Gaze4HRI.
Live experiment footage showing the illumination setup.

Exp. 2

Camera viewpoint

The robot-mounted camera moves on an arc around the participant while the gaze targets remain fixed on the table.

circular_movement
Experiment design for camera viewpoint variation in Gaze4HRI.
Live experiment footage showing the camera viewpoint setup.

Exp. 3

Head-gaze conflict

Fixed head orientations create different levels of conflict between head-forward direction and gaze direction.

head_pose_left head_pose_middle head_pose_right
Experiment design for head-gaze conflict in Gaze4HRI.
Live experiment footage showing the head-gaze conflict setup.

Exp. 4

Moving target / mutual gaze

The participant follows the moving robot camera, simulating mutual gaze with a robot “eye” under dynamic target motion.

line_movement_slow line_movement_fast
Experiment design for mutual gaze and moving target in Gaze4HRI.
Live experiment footage showing the moving target / mutual gaze setup.

Benchmark

Zero-shot gaze estimation results

Overview of model performance on Gaze4HRI across key HRI conditions

01

Best overall

PureGaze trained on ETH-X-Gaze is the most reliable method across the HRI conditions tested.

02

Training data matters

ETH-X-Gaze-trained methods are especially robust to illumination, viewpoint variation, and head-gaze conflict.

03

Open failure case

Steeply downward gaze remains difficult for all evaluated methods, which is critical for object-centered HRI.

Exp. 1: Illumination

PureGaze (E) and GazeTR (E) stay competitive across all illumination settings, while Gaze360-trained methods are more sensitive to lighting level.

Average angular error by model

Lower values indicate more accurate gaze estimation.

Subject-level mean angular error (°). Lower is better.
Method          lighting_10   lighting_25   lighting_50   lighting_100   CV (%)
PureGaze (E)          11.73         11.52         12.18          10.19     7.50
GazeTR (E)            11.50         11.16         12.45          11.50     4.77
PureGaze (G)          16.92         14.68         15.13          17.82     9.18
GazeTR (G)            14.73         13.81         15.27          17.17     9.30
L2CS-Net (G)          19.61         17.60         18.95          18.99     4.51
MCGaze (G)            19.08         14.33         13.24          14.29    17.15
GaT (G)               17.93         16.74         15.82          16.35     5.36
(E): trained on ETH-X-Gaze · (G): trained on Gaze360
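
The CV (%) column appears consistent with the coefficient of variation of each method's error across the four lighting levels (sample standard deviation divided by the mean); a quick check under that assumption reproduces the reported values:

import numpy as np

# Per-lighting-level mean angular errors (°) copied from the table above.
errors = {
    "PureGaze (E)": [11.73, 11.52, 12.18, 10.19],   # reported CV%: 7.50
    "GazeTR (E)":   [11.50, 11.16, 12.45, 11.50],   # reported CV%: 4.77
}

for method, e in errors.items():
    e = np.asarray(e)
    cv = 100 * e.std(ddof=1) / e.mean()   # sample std / mean (assumed definition)
    print(f"{method}: CV = {cv:.2f}%")    # prints 7.50% and 4.77%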

Dataset format

Dataset structure and contents

What each recording contains

Each timestamp folder stores one recorded trial for a subject, experiment type, and target point. The raw format keeps the RGB stream together with synchronized pose, gaze-target, robot, camera, and blink-related signals so the same recording can be used for gaze benchmarking, blink masking, and future HRI analysis.

Raw dataset structure
YYYY-MM-DD/
└── subj_XXXX/
    └── exp_type/
        └── point/
            └── timestamp/
                ├── rgb_video.mp4
                ├── rgb_timestamps.npy
                ├── rgb_camera_settings.json
                ├── camera_intrinsics.npy
                ├── camera_poses.npy
                ├── head_poses.npy
                ├── head_bboxes.npy
                ├── eye_positions.npy
                ├── eye_position_in_head_frame.npy
                ├── target_positions.npy
                ├── blink_annotations_by_*.npy
                ├── table_pose.npy
                ├── ur5_base_pose.npy
                └── ur5_joint_states.npy
  • RGB and timing: rgb_video.mp4, rgb_timestamps.npy, and camera settings for frame-level evaluation.
  • Gaze ground truth: eye_positions.npy and target_positions.npy define the ground-truth 3D gaze vector.
  • Head and camera pose: head_poses.npy, head_bboxes.npy, camera_poses.npy, and intrinsics support pose-aware analysis.
  • Robot and table state: ur5_joint_states.npy, ur5_base_pose.npy, and table_pose.npy describe the HRI setup geometry.
  • Blink annotations: blink_annotations_by_*.npy mark blink frames for masking gaze evaluation or training Blink4HRI models.
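
For completeness, a minimal loader sketch under the directory layout above; the file names come from the listing, while everything else (root path, array shapes, frame alignment) is a hypothetical assumption:

import json
from pathlib import Path
import numpy as np

def iter_trials(root):
    """Yield every timestamp folder under date/subj_XXXX/exp_type/point/."""
    for trial_dir in Path(root).glob("*/subj_*/*/*/*"):
        if (trial_dir / "rgb_video.mp4").exists():
            yield trial_dir

def load_trial(trial_dir):
    """Load the synchronized per-trial arrays; the video itself stays on disk."""
    data = {p.stem: np.load(p) for p in trial_dir.glob("*.npy")}
    with open(trial_dir / "rgb_camera_settings.json") as f:
        data["rgb_camera_settings"] = json.load(f)
    return data

# Usage (hypothetical root path):
# for trial in iter_trials("Gaze4HRI/raw"):
#     arrays = load_trial(trial)
#     eye, tgt = arrays["eye_positions"], arrays["target_positions"]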

Citation

Cite Gaze4HRI

@inproceedings{sezer2026gaze4hri,
  title={Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction},
  author={Sezer, Berk and Küçük, Ali Görkem and Şahin, Erol and Kalkan, Sinan},
  booktitle={2026 International Conference on Automatic Face and Gesture Recognition (FG)},
  year={2026},
  doi={10.5281/zenodo.19710372}
}