FG 2026 · Dataset + Zero-shot Benchmark
Department of Computer Engineering and ROMER, Middle East Technical University, Ankara, Türkiye
A large-scale HRI gaze dataset for testing appearance-based 3D gaze estimation under lighting variation, robot-camera motion, head-gaze conflict, and mutual-gaze scenarios.
TL;DR
Gaze4HRI benchmarks zero-shot 3D gaze estimation in HRI settings where a robot-mounted camera observes people looking at shared workspace objects or at the robot itself. The dataset foregrounds controlled HRI variables (illumination, moving camera viewpoint, head-gaze conflict, and moving-target/mutual gaze), while the benchmark shows that current methods still struggle, especially for steeply downward gaze.
Dataset focus
Gaze4HRI pairs synchronized RGB video with motion-capture-based ground truth, HRI-specific camera and target motion, and analysis-ready experiment labels.
Participants are recorded by an Intel RealSense D435i mounted on a UR5 wrist, matching a robot perception viewpoint rather than a laptop or static camera setup.
Gaze vectors are computed from calibrated interpupillary midpoint positions and known gaze target locations, expressed in the camera frame.
The dataset covers object-centered gaze on a shared table and mutual gaze where the participant follows the robot camera as a moving target.
Blinked frames are masked for gaze evaluation, and the repository also includes Blink4HRI-related tooling for blink detection experiments.
Gaze4HRI uses a controlled lab HRI setup centered around a UR5 robot arm, wrist-mounted RGB camera, and OptiTrack motion capture. This gives the benchmark accurate head/eye geometry while preserving the camera viewpoint challenges of robot perception.
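As a concrete illustration of the ground-truth definition above, the per-frame gaze vector is the unit direction from the interpupillary midpoint to the gaze target, both expressed in the camera frame. The sketch below assumes both points are available as 3D coordinates in metres; the function and variable names are illustrative, not the dataset's own tooling.

```python
import numpy as np

def ground_truth_gaze(eye_midpoint_cam: np.ndarray, target_cam: np.ndarray) -> np.ndarray:
    """Unit 3D gaze direction from the interpupillary midpoint to the gaze target.

    Both inputs are 3D points expressed in the camera frame (shape (3,), metres assumed).
    """
    direction = target_cam - eye_midpoint_cam
    return direction / np.linalg.norm(direction)

# Illustrative values only (not taken from the dataset): an eye midpoint 0.6 m in
# front of the camera and a table target below and closer to the camera.
eye = np.array([0.0, 0.0, 0.6])
target = np.array([0.0, 0.4, 0.3])
print(ground_truth_gaze(eye, target))  # unit vector for a downward gaze in this toy frame
```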
Experiment modules
Gaze4HRI is organized around four experiment types, each designed to test a different challenge for gaze estimation in HRI.
Exp. 1
Controlled lighting levels test whether gaze estimation models remain accurate from dim to bright conditions.
Exp. 2
The robot-mounted camera moves on an arc around the participant while the gaze targets remain fixed on the table.
Exp. 3
Fixed head orientations create different levels of conflict between head-forward direction and gaze direction.
Exp. 4
The participant follows the moving robot camera, simulating mutual gaze with a robot “eye” under dynamic target motion.
Benchmark
Overview of model performance on Gaze4HRI across key HRI conditions
PureGaze trained on ETH-X-Gaze is the most reliable method across the HRI conditions tested.
ETH-X-Gaze-trained methods are especially robust to illumination, viewpoint variation, and head-gaze conflict.
Steeply downward gaze remains difficult for all evaluated methods, which is critical for object-centered HRI.
PureGaze (E) and GazeTR (E) stay competitive across all illumination settings, while Gaze360-trained methods are more sensitive to lighting level.
Angular error in degrees at each illumination level; lower values indicate more accurate gaze estimation. (E) and (G) denote training on ETH-X-Gaze and Gaze360, respectively; CV% is the coefficient of variation across the four levels.
| Method | 10 | 25 | 50 | 100 | CV% |
|---|---|---|---|---|---|
| PureGaze (E) | 11.73 | 11.52 | 12.18 | 10.19 | 7.50 |
| GazeTR (E) | 11.50 | 11.16 | 12.45 | 11.50 | 4.77 |
| PureGaze (G) | 16.92 | 14.68 | 15.13 | 17.82 | 9.18 |
| GazeTR (G) | 14.73 | 13.81 | 15.27 | 17.17 | 9.30 |
| L2CS-Net (G) | 19.61 | 17.60 | 18.95 | 18.99 | 4.51 |
| MCGaze (G) | 19.08 | 14.33 | 13.24 | 14.29 | 17.15 |
| GaT (G) | 17.93 | 16.74 | 15.82 | 16.35 | 5.36 |
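For readers reproducing the table, the underlying metric is the standard angular error between predicted and ground-truth gaze directions, and the CV% column matches the coefficient of variation (sample standard deviation divided by the mean) of a method's errors across the four illumination levels. The snippet below is a minimal sketch with illustrative names, not the benchmark code.

```python
import numpy as np

def angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Per-frame angular error in degrees between predicted and ground-truth gaze vectors (N, 3)."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def cv_percent(per_level_errors: np.ndarray) -> float:
    """Coefficient of variation (%) across conditions, using the sample standard deviation."""
    return float(np.std(per_level_errors, ddof=1) / np.mean(per_level_errors) * 100.0)

# Reproduces the PureGaze (E) CV% entry from the illumination table above.
print(round(cv_percent(np.array([11.73, 11.52, 12.18, 10.19])), 2))  # 7.5
```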
PureGaze (E) is the strongest under moving robot-camera viewpoint variation, with only a small difference between the fixed-camera and moving-camera setups.
Angular error in degrees; lower values indicate more accurate gaze estimation. p reports the significance of the difference between the fixed-camera and moving-camera-viewpoint conditions.
| Method | Fixed cam. | Cam. view. | p |
|---|---|---|---|
| PureGaze (E) | 10.19 | 11.12 | .460 |
| GazeTR (E) | 11.50 | 14.42 | .004 |
| PureGaze (G) | 17.82 | 18.46 | .242 |
| GazeTR (G) | 17.17 | 17.80 | .308 |
| L2CS-Net (G) | 18.99 | 18.15 | .045 |
| MCGaze (G) | 14.29 | 15.57 | .034 |
| GaT (G) | 16.35 | 16.05 | .486 |
PureGaze (E) gives the lowest overall error. Gaze360-trained methods, especially MCGaze, degrade more strongly as head-gaze conflict increases.
Angular error in degrees; lower values indicate more accurate gaze estimation. β is the fitted slope of error against head-gaze conflict, and % β > 0 reports how often that slope is positive.
| Method | Error | β | % β > 0 |
|---|---|---|---|
| PureGaze (E) | 7.25 | +0.05 | 60 |
| GazeTR (E) | 8.67 | -0.05 | 36 |
| PureGaze (G) | 11.38 | +0.04 | 50 |
| GazeTR (G) | 12.93 | +0.38 | 92 |
| L2CS-Net (G) | 16.62 | +0.50 | 92 |
| MCGaze (G) | 20.24 | +0.99 | 100 |
| GaT (G) | 12.09 | +0.20 | 94 |
Errors increase as targets move closer to the subject, which corresponds to steeper downward gaze. This is a key failure mode for tabletop HRI, where people naturally look downward toward shared objects on the table.
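A quick geometric check makes the downward-gaze effect concrete: as a tabletop target moves closer to the subject, the required downward pitch grows rapidly. The heights and distances below are illustrative values, not measurements from the dataset.

```python
import numpy as np

# Illustrative geometry only: eye height above the table plane (m) and horizontal
# distances from the subject to targets on the table (m).
eye_height = 0.45
target_distances = np.array([0.80, 0.60, 0.40, 0.20])

downward_pitch_deg = np.degrees(np.arctan2(eye_height, target_distances))
print(downward_pitch_deg)  # roughly [29.4, 36.9, 48.4, 66.0]: pitch steepens as targets get closer
```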
PureGaze (E) is most accurate in the moving-target mutual-gaze setup; PureGaze/GazeTR architectures are more resilient to pitch-yaw eccentricity than the other methods.
Angular error in degrees under fast/slow horizontal (H) and vertical (V) motion of the robot camera; lower values indicate more accurate gaze estimation. β pitch and β yaw are error slopes with respect to gaze pitch and yaw eccentricity.
| Method | Fast-H | Fast-V | Slow-H | Slow-V | β pitch | β yaw |
|---|---|---|---|---|---|---|
| PureGaze (E) | 5.38 | 5.31 | 5.41 | 5.30 | 0.08 | 0.01 |
| GazeTR (E) | 10.48 | 10.50 | 10.49 | 10.25 | -0.03 | 0.03 |
| PureGaze (G) | 9.75 | 9.58 | 9.15 | 9.47 | 0.10 | -0.01 |
| GazeTR (G) | 7.61 | 7.59 | 7.25 | 7.20 | 0.10 | 0.02 |
| L2CS-Net (G) | 15.33 | 15.83 | 15.23 | 15.41 | 0.30 | 0.66 |
| MCGaze (G) | 22.93 | 24.32 | 22.69 | 23.28 | 1.40 | 0.82 |
| GaT (G) | 16.43 | 16.89 | 16.26 | 16.23 | 0.49 | 0.65 |
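The β columns in these tables are slopes of error against an interaction variable (head-gaze conflict, or gaze pitch/yaw eccentricity). One plausible way to obtain such a slope is an ordinary least-squares fit of frame-level error against the variable of interest; the sketch below illustrates that idea on synthetic data and is an assumption about the analysis, not the benchmark's published code.

```python
import numpy as np

def error_slope(eccentricity_deg: np.ndarray, error_deg: np.ndarray) -> float:
    """Least-squares slope of angular error (deg) against gaze eccentricity (deg).

    A positive slope means the method degrades as gaze moves away from straight ahead.
    """
    slope, _intercept = np.polyfit(eccentricity_deg, error_deg, deg=1)
    return float(slope)

# Synthetic example: error grows by about 0.5 deg per degree of pitch eccentricity.
rng = np.random.default_rng(0)
pitch = rng.uniform(0.0, 30.0, size=200)
error = 8.0 + 0.5 * pitch + rng.normal(0.0, 1.0, size=200)
print(round(error_slope(pitch, error), 2))  # close to 0.5
```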
Dataset format
Each timestamp folder stores one recorded trial for a subject, experiment type, and target point. The raw format keeps the RGB stream together with synchronized pose, gaze-target, robot, camera, and blink-related signals so the same recording can be used for gaze benchmarking, blink masking, and future HRI analysis.
YYYY-MM-DD/
└── subj_XXXX/
└── exp_type/
└── point/
└── timestamp/
├── rgb_video.mp4
├── rgb_timestamps.npy
├── rgb_camera_settings.json
├── camera_intrinsics.npy
├── camera_poses.npy
├── head_poses.npy
├── head_bboxes.npy
├── eye_positions.npy
├── eye_position_in_head_frame.npy
├── target_positions.npy
├── blink_annotations_by_*.npy
├── table_pose.npy
├── ur5_base_pose.npy
└── ur5_joint_states.npy
rgb_video.mp4, rgb_timestamps.npy, and camera settings for frame-level evaluation.
eye_positions.npy and target_positions.npy define the ground-truth 3D gaze vector.
head_poses.npy, head_bboxes.npy, camera_poses.npy, and intrinsics support pose-aware analysis.
ur5_joint_states.npy, ur5_base_pose.npy, and table_pose.npy describe the HRI setup geometry.
blink_annotations_by_*.npy marks blink frames for masking gaze evaluation or training Blink4HRI models.
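Putting the pieces together, a single timestamp folder can be turned into blink-masked ground-truth gaze vectors as sketched below. File names follow the layout above, but the array shapes, frame conventions, and blink-annotation encoding are assumptions to verify against the released data.

```python
import json
from pathlib import Path
import numpy as np

def load_trial(trial_dir: str):
    """Load one timestamp folder and return blink-masked ground-truth gaze vectors.

    Assumes eye_positions.npy and target_positions.npy hold per-frame 3D points in the
    camera frame and that blink annotations are per-frame flags; verify against the data.
    """
    trial = Path(trial_dir)

    eyes = np.load(trial / "eye_positions.npy")         # assumed shape (N, 3)
    targets = np.load(trial / "target_positions.npy")   # assumed shape (N, 3)
    timestamps = np.load(trial / "rgb_timestamps.npy")  # assumed one entry per frame

    # Ground-truth gaze: unit vector from the interpupillary midpoint to the target.
    gaze = targets - eyes
    gaze = gaze / np.linalg.norm(gaze, axis=1, keepdims=True)

    # Mask blinked frames using any blink_annotations_by_*.npy files present.
    blink = np.zeros(len(gaze), dtype=bool)
    for f in sorted(trial.glob("blink_annotations_by_*.npy")):
        blink |= np.load(f).astype(bool)

    with open(trial / "rgb_camera_settings.json") as fh:
        camera_settings = json.load(fh)

    return gaze[~blink], timestamps[~blink], camera_settings

# Hypothetical usage following the folder layout above:
# gaze, ts, cam = load_trial("YYYY-MM-DD/subj_XXXX/exp_type/point/timestamp")
```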
Citation
@inproceedings{sezer2026gaze4hri,
title={Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction},
author={Sezer, Berk and Küçük, Ali Görkem and Şahin, Erol and Kalkan, Sinan},
booktitle={2026 International Conference on Automatic Face and Gesture Recognition (FG)},
year={2026},
doi={10.5281/zenodo.19710372}
}