Head movements are commonly interpreted as a vehicle of interpersonal communication. For example, in daily life, human beings read head movements as expressions of agreement or disagreement in a conversation, or even as a sign of confusion. Gaze direction can also be used to direct a person's attention to a specific location. As artificial cognitive systems with social capabilities become increasingly important, driven by the recent evolution of robotics towards applications where complex and human-like interactions are needed, gaze and head pose estimates become an important feature.
During the CASIR project we developed a robust solution based on a probabilistic approach that inherently deals with the uncertainty of sensor models, and in particular with the uncertainty arising from distance, incomplete data and scene dynamics. This solution comprises a hierarchical formulation in the form of a mixture model that loosely follows how geometrical cues provided by facial features are believed to be used by the human perceptual system for gaze estimation.
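Purely as an illustrative sketch of the mixture-model idea (not the CASIR implementation; the cue names, weights and covariances below are hypothetical), the following Python snippet fuses two per-cue gaze-angle estimates, each with its own uncertainty, into a single mean and covariance:

```python
import numpy as np

def fuse_gaze_cues(cue_means, cue_covs, weights):
    """Fuse per-cue (phi, theta) gaze estimates as a Gaussian mixture.

    cue_means -- list of 2-vectors, one (phi, theta) estimate per facial cue
    cue_covs  -- list of 2x2 covariances expressing each cue's uncertainty
    weights   -- mixture weights, e.g. reflecting each cue's reliability
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()

    # Mean of the mixture: weighted average of the component means.
    mean = sum(w * np.asarray(m, dtype=float) for w, m in zip(weights, cue_means))

    # Covariance of the mixture (law of total covariance).
    cov = np.zeros((2, 2))
    for w, m, c in zip(weights, cue_means, cue_covs):
        d = np.asarray(m, dtype=float) - mean
        cov += w * (np.asarray(c, dtype=float) + np.outer(d, d))
    return mean, cov

# Hypothetical example: a confident head-pose cue and a noisier eye-region cue.
mean, cov = fuse_gaze_cues(
    cue_means=[np.radians([10.0, -5.0]), np.radians([14.0, -2.0])],
    cue_covs=[np.diag([0.01, 0.01]), np.diag([0.05, 0.05])],
    weights=[0.7, 0.3],
)
print(np.degrees(mean))  # fused (phi, theta) in degrees
```

Weighting each cue by its reliability is one way such a formulation can degrade gracefully when, for instance, the eye region becomes hard to measure at larger distances.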
To evaluate our solution, a set of experiments was conducted to thoroughly and quantitatively assess the performance of the hierarchy. The experiments were devised to test system performance for many different combinations of head poses and eye gaze directions, as well as for different distances in a dynamic setting. Eleven subjects (3 female and 8 male; 8 with white complexion and 3 with dark skin colour), with ages ranging from 8 to 43 and heights ranging from 1.3 m to 1.89 m, were instructed to fixate a target placed at a position with known 3D coordinates throughout each trial.
Experimental setup. In (a), the experimental plan and protocol are presented, showing the sensor rig mounted on a robotic head, the Optotrak Certus used to obtain 3D ground truth, and the gaze target. In (b), a photo of the actual experimental setup and the Kinect 3D sensor is shown – note that, as opposed to the schematic in (a), the targets (three coloured stripes, at 1.43 m, 1.57 m and 1.68 m from the ground, respectively) are on the same side as the Optotrak. This is because the Optotrak can only track its optical devices within a “cone of visibility” (markings on the floor show the cone’s limits), which constrains the usable space available for the experiments. The subjects were asked to walk from a predefined starting position at about 3 m to about 1.5 m from the Kinect sensor, following a relatively free path within that usable area while always keeping their gaze fixed on one of the targets throughout each trial. They were also instructed to stop at three different midway positions along the path and, at those positions, slowly turn their head up, down, left and right without losing their gaze focus on the designated target.
A Northern Digital Inc. (NDI) Optotrak Certus motion capture system, calibrated with the Kinect for registration, was used to obtain 3D ground truth pertaining to the subjects' position and head pose throughout the experiment. The data processing flow is shown below:
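As a rough illustration of the registration step only (not the full processing flow, and not the project's actual code), the snippet below applies a session's rotation matrix and translation vector, of the kind stored in the "Experimental setup" files listed further down, to map Optotrak measurements into the Kinect frame; the direction of the transform and the frame conventions are assumptions:

```python
import numpy as np

def optotrak_to_kinect(points_optotrak, R, t):
    """Map 3D points from the Optotrak frame into the Kinect frame.

    points_optotrak -- (N, 3) array of points measured by the Optotrak
    R, t            -- rotation matrix (3x3) and translation vector (3,)
                       from the per-session calibration files
    """
    points = np.atleast_2d(np.asarray(points_optotrak, dtype=float))
    return points @ np.asarray(R).T + np.asarray(t)

# Hypothetical calibration: identity rotation and a 1 m offset along z.
R = np.eye(3)
t = np.array([0.0, 0.0, 1.0])
print(optotrak_to_kinect([[0.10, 0.20, 2.50]], R, t))
```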
The database is composed of the data types listed in the table below, stored using ZIP compression.
| Data Type | Description |
| --- | --- |
| RGB images | Stored in a folder in .png format. |
| Depth images | Stored in a folder in .pcd (point cloud data) format. |
| Gaze vector | "Ground truth" 5D gaze vector (sellion position, phi, theta_head) in .txt format (see the sketch after this table). |
| Sellion position | "Ground truth" head center 3D position (x, y, z) in .txt format. |
| Fixation point | "Ground truth" 3D position of the point of fixation (x, y, z) in .txt format. |
| Phi angle | "Yaw" orientation of the gaze vector in .txt format. |
| Theta angle | "Pitch" orientation of the gaze vector in .txt format. |
| 2D feature positions | "Hand-labeled" 2D facial features in .txt format. |
| 3D feature positions | "Hand-labeled" 3D facial features in .txt format. |
| Experimental setup | Transformation matrix, rotation matrix and translation vector used in each session, in .txt format. |
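As a hedged illustration of how the ground-truth quantities above relate to one another, the sketch below computes phi and theta from the sellion position and the fixation point; the coordinate convention (x to the right, y down, z forward, as in a typical camera frame) is an assumption, since it is not fixed by the description above:

```python
import numpy as np

def gaze_angles(sellion_xyz, fixation_xyz):
    """Compute (phi, theta) of the line of sight from the sellion to the fixation point.

    Assumes a camera-style frame: x to the right, y down, z forward.
    phi is the "yaw" (left/right) and theta the "pitch" (up/down) of the gaze vector.
    """
    d = np.asarray(fixation_xyz, dtype=float) - np.asarray(sellion_xyz, dtype=float)
    d = d / np.linalg.norm(d)
    phi = np.arctan2(d[0], d[2])   # yaw: rotation about the vertical axis
    theta = np.arcsin(-d[1])       # pitch: elevation above the horizontal plane
    return phi, theta

# Hypothetical example: a subject looking slightly up and to the right.
phi, theta = gaze_angles([0.0, 0.0, 0.0], [0.3, -0.2, 2.0])
print(np.degrees([phi, theta]))
```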