Eye Tracking Methodology
Theory and Practice
Second Edition
Andrew Duchowski, BSc, PhD
Department of Computer Science
Clemson University
Clemson, SC 29634
andrewd@cs.clemson.edu
ISBN 978-1-84628-608-7
e-ISBN 978-1-84628-609-4
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Control Number: 2006939204
© Springer-Verlag London Limited 2007
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers. The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.
Printed on acid-free paper.
Springer Science+Business Media springer.com
Preface to the Second Edition
Since the writing of the first edition, several important advancements in the field of eye tracking have occurred in the span of just a few short years. Most important, eye tracking technology has improved dramatically. Due to the increased speed of computer processors and improved computer vision techniques, eye tracking manufacturers have developed devices falling within the fourth generation of the following technological taxonomy.

1. First generation: eye-in-head measurement of the eye consisting of techniques such as scleral contact lens/search coil, electro-oculography
2. Second generation: photo- and video-oculography
3. Third generation: analog video-based combined pupil/corneal reflection
4. Fourth generation: digital video-based combined pupil/corneal reflection, augmented by computer vision techniques and Digital Signal Processors (DSPs)

Often the most desired type of eye tracking output (e.g., for human–computer interaction usability studies) is an estimate of the viewer’s projected Point Of Regard (POR), i.e., the (x, y) coordinates of the user’s gaze on the computer display. First- and second-generation eye trackers generally do not provide this type of data. (For second-generation systems, eye movement analysis relies on off-line, frame-by-frame visual inspection of photographs or video frames and does not allow easy POR calculation.) Combined video-based pupil/corneal reflection eye trackers easily provide POR coordinates following calibration, and are today de rigueur. Due to the availability of fast analog-to-digital video processors, these third-generation eye trackers are capable of delivering the calculated POR in real-time. Fourth-generation eye trackers, having recently appeared on the market, make use of digital optics. Coupled with on-chip Digital Signal Processors (DSPs), eye tracking technology has significantly increased in its usability, accuracy, and speed while decreasing in cost.
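To make the fourth-generation style of output concrete: a client application typically receives a stream of timestamped gaze samples and maps them to display coordinates. The following C sketch shows the general shape of such a gaze callback. It is a minimal sketch only; the GazeSample type, its field names, and the display resolution are hypothetical stand-ins, not any vendor’s API (the actual Linux Tobii client API is described in Chapter 10).

/*
 * Minimal sketch (hypothetical API): a fourth-generation tracker's
 * client library invokes a callback for each timestamped gaze sample;
 * the application averages both eyes and maps the normalized gaze
 * point to pixel coordinates, yielding the POR.
 */
#include <stdio.h>

typedef struct {
  double t;        /* timestamp (s)                          */
  float  xl, yl;   /* left eye gaze point, normalized [0,1]  */
  float  xr, yr;   /* right eye gaze point, normalized [0,1] */
} GazeSample;

/* In practice, the tracker library would call this per sample. */
static void on_gaze(const GazeSample *s, void *user)
{
  const int w = 1280, h = 1024;        /* display resolution (example) */
  float x = 0.5f * (s->xl + s->xr);    /* average the two eyes         */
  float y = 0.5f * (s->yl + s->yr);
  printf("%.3f s: POR = (%d, %d)\n", s->t, (int)(x * w), (int)(y * h));
  (void)user;
}

int main(void)
{
  GazeSample s = { 0.016, 0.48f, 0.52f, 0.50f, 0.51f };  /* fabricated */
  on_gaze(&s, NULL);
  return 0;
}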
The state of today’s technology can best be summarized by a brief functional comparison of older and newer equipment, given in the following table.
Functional eye tracker comparison.

                 Legacy Systems                      State-of-the-Art
Technology       Analog video                        Digital video
Calibration      5- or 9-point, tracker-controlled   Any number, application-controlled
Optics           Manual focusing/thresholding        Auto-focus
Communication    Serial (polling/streaming)          TCP/IP (client/server)
Synchronization  Status byte word                    API callback
If, to those new to this book and uninitiated with eye tracking devices, the comparison in the table is not immediately suggestive, consider two use scenarios of both old and new technology presented in the next table.
Eye tracker use comparison.

Typical session with old system:
1. Login to console.
2. Turn on eye tracking equipment.
3. Turn on eye/scene monitors.
4. Turn on eye tracking PC.
5. Run eye tracking program.
6. Turn on camera.
7. Turn on illumination control.
8. Adjust head/chin rest.
9. Adjust pan/tilt unit.
10. Adjust camera zoom.
11. Adjust camera focus.
12. Adjust pupil/corneal thresholds.
13. Calibrate.
14. Run.

Typical session with new system:
1. Login to console.
2. Turn on eye tracking PC.
3. Run eye tracking program.
4. Calibrate.
5. Run.
The disparity in the use of old and new technologies is mainly due to the use of different optics (camera). New systems tend to use an auto-focusing digital camera, e.g., embedded in a flat panel display. Although embedding within a flat panel display may restrict a user’s physical position somewhat, it is generally preset to operate at a comfortable range (e.g., 50–60 cm focal distance). Unlike older systems, as long as the user sits within this distance, no chin rests and no further parameter adjustments are needed. In contrast, older devices required the use of a pan/tilt unit to position the camera, the camera to be manually focused and zoomed, and software to be set to appropriate pupil and corneal reflection detection thresholds. None of these cumbersome operations are required with newer systems.
Furthermore, one of the most important features of the new technology, especially for application development, is an individual’s ability to self-calibrate. With older technology, whenever a developer wished to test a new feature, she or he had to recruit a (very patient) subject for testing. This was quite problematic. The newer systems’ calibration routines are a much-needed improvement over older (third-generation) technology, one that significantly accelerates program and application development.
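As an illustration of application-controlled calibration, the sketch below shows the shape of a client-driven calibration loop: the application draws each target, registers the point with the tracker, and asks the tracker to compute the mapping. The wrapper functions here are hypothetical stubs, loosely patterned on the Tet_Calib* routines covered in Chapter 10; real signatures will differ.

/*
 * Sketch of application-controlled calibration (hypothetical stubs;
 * a real client would call the tracker vendor's calibration API).
 */
#include <stdio.h>
#include <stddef.h>

typedef struct { float x, y; } Point2D;   /* normalized [0,1] coords */

static void tracker_clear_calibration(void)          { puts("clear"); }
static void tracker_add_calibration_point(Point2D p) { printf("add (%.2f, %.2f)\n", p.x, p.y); }
static int  tracker_compute_calibration(void)        { puts("compute"); return 0; }
static void draw_calibration_target(Point2D p)       { printf("draw target at (%.2f, %.2f)\n", p.x, p.y); }

/* Application-controlled loop: any number of points, app-drawn targets. */
static int calibrate(const Point2D *pts, size_t n)
{
  tracker_clear_calibration();
  for (size_t i = 0; i < n; i++) {
    draw_calibration_target(pts[i]);     /* application controls display */
    tracker_add_calibration_point(pts[i]);
  }
  return tracker_compute_calibration();  /* tracker fits the mapping */
}

int main(void)
{
  /* A 5-point pattern: four corners plus center. */
  Point2D pts[] = { {0.1f, 0.1f}, {0.9f, 0.1f}, {0.5f, 0.5f},
                    {0.1f, 0.9f}, {0.9f, 0.9f} };
  return calibrate(pts, sizeof pts / sizeof pts[0]);
}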
A third-generation eye tracker was used for most of the eye tracking research on which the first edition of this book was based. The availability of new technology precipitated the writing of the second edition. The second edition therefore fills several important gaps not covered previously, namely:
1. Client/server model for developing an eye tracking client application
2. Client-controlled display, calibration, data collection
3. New programming examples
Beyond updated technical descriptions of client programming, the second edition also includes what the first edition lacked: an overview of the methodology behind the use of eye trackers, that is, the experimental design issues that often arise when conducting eye tracking studies. The second edition briefly reviews experimental design decisions, offers some guidelines for incorporating eye movement metrics into a study (e.g., usability), and provides examples of case studies.
Finally, the second edition expands the last part of the book: eye tracking applications. A great deal of new and exciting eye tracking work has appeared, undoubtedly driven by the availability of new technology. In fact, there now appears to be a rather refreshing shift in the reporting of eye tracking and eye movement studies. Authors tend to understate the “gee-whiz” factor of eye trackers and the technical machinations needed to obtain eye movement data, emphasizing instead scientific results bolstered by the objective evidence provided by users’ gaze and hence attention. Eye tracking finally appears to be entering mainstream science, where the eye tracker is becoming less of a novelty and more of a tool. It is hoped that this second edition may inspire readers with the simplicity of application development now made possible by fourth-generation eye trackers, and that they continue on the road to new applications and scientific insights.
Andrew T. Duchowski
Clemson, SC, April 2006
Preface to the First Edition
The scope of the book falls within a fairly narrow human–computer interaction domain (i.e., it describes a particular input modality); however, it spans a broad range of interdisciplinary research and application topics. There are at least three domains that stand to benefit from eye tracking research: visual perception, human–computer interaction, and computer graphics. The amalgamation of these topics forms a symbiotic relationship. Graphical techniques provide a means of generating rich sets of visual stimuli, ranging from 2D imagery to 3D immersive virtual worlds, and research exploring visual attention and perception in turn influences the generation of artificial scenes and worlds. Applications derived from these disciplines create a powerful human–computer interaction modality, namely interaction based on knowledge of the user’s gaze.
Recent advancements in eye tracking technology, specifically the availability of cheaper, faster, more accurate, and easier-to-use trackers, have inspired increased eye movement and eye tracking research efforts. However, although eye trackers offer a uniquely objective view of overt human visual and attentional processes, they have not yet gained widespread use beyond work conducted at various research laboratories. This lack of acceptance has at least two causes. First, the use of an eye tracker in an applied experimental setting is not a widely taught subject, hence there is a need for a book that may help in providing training. It is not uncommon for enthusiastic purchasers of eye tracking equipment to become discouraged with their newly bought equipment when they find it difficult to set up and operate. Only a few academic departments (e.g., psychology, computer science) offer any kind of instruction in the use of eye tracking devices. Second, to exacerbate the lack of training in eye tracking methodology, even fewer sources of instruction exist for system development. Setting up an eye tracking lab and integrating the eye tracker into an available computer system for development of gaze-contingent applications is a fairly complicated endeavor, similar to the development and integration of virtual reality programs. Thus far, it appears no textbook other than this one provides this type of low-level information.
The goal of this book is to provide technical details for implementation of a gaze-contingent system, couched in the theoretical context of eye movements, visual perception, and visual attention. The text started out as the author’s personal notes on the integration of a commercial eye tracker into a virtual reality graphics system. These technical considerations comprise the middle chapters of the book and include details of integrating a commercial eye tracker into both a 3D virtual environment and a 2D image display application. The surrounding theoretical review chapters grew from notes developed for an interdisciplinary eye tracking methodology course offered to both undergraduates and graduates from four disciplines: psychology, marketing, industrial engineering, and computer science. An early form of these notes was presented as a short course at the Association for Computing Machinery (ACM) Special Interest Group on Graphics’ SIGGRAPH conference, 23–28 July 2000, New Orleans, LA.
Overview
As of the second edition, the book is divided into four parts, presented thematically in a top-down fashion, providing first an introduction to the human visual system (Part I), then briefly surveying eye tracking systems (Part II), then discussing eye tracking methodology (Part III), and finally ending by reviewing a number of eye tracking applications (Part IV).
In the first part, “Introduction to the Human Visual System (HVS),” the book covers the concept of visual attention, mainly from a historical perspective. The first chapter focuses on the dichotomy of foveal and peripheral vision (the “what” versus the “where”). This chapter covers easily observable attentional phenomena. The next chapter covers the neurological substrate of the HVS presenting the low-level neurological elements implicated in dynamic human vision. This chapter discusses the primary dual pathways, the parvo- and magno-cellular channels, which loosely correspond to the flow of visual information permitted by the retinal fovea and periphery. Following this description of the visual “hardware”, observable characteristics of human vision are summarized in the following chapter on visual perception. Here, results obtained mainly from psychophysics are summarized, distinguishing foveal and peripheral visual perception. The first part ends by discussing the mechanism responsible for shifting the fovea, namely eye movements. Having established the neurological and psychophysical context for eye movements, the following chapter on the taxonomy and models of eye movements gives the common terms for the most basic of eye movements along with a signal-analytic description of recordable eye movement waveforms.
The second part of the book, “Eye Tracking Systems,” presents a brief survey of the main types of available eye tracking devices, followed by a detailed technical description of the requirements for system installation and application program development. These details are mainly applicable to video-based, corneal-reflection eye trackers, the most widely available and most affordable type of eye trackers. This part of the book offers information for the development of three general systems: one for binocular 3D eye tracking in virtual reality, one for monocular 2D eye tracking over a 2D display (e.g., a television monitor on which graphical information can be displayed), and one for binocular 2D eye tracking on the desktop. The descriptions of the first two systems are very similar because they are based on the same kind of (older) eye tracking hardware (ISCAN in this instance). The latter system description is based on modern eye tracking technology from Tobii. Both system descriptions include notes on system calibration. This part of the book ends with a description of data collection and analysis independent of any particular eye tracking hardware.
The third part of the book, “Eye Tracking Methodology,” covers experimental design, suggested empirical guidelines, and case studies. The fourth part of the book surveys a number of interesting and challenging eye tracking applications. Applications identified in this part are drawn from psychology, human factors, marketing and advertising, human–computer interaction and collaborative systems, and computer graphics and virtual reality.
How to Read This Book
The intended audience for this book is an interdisciplinary one, aimed particularly at those interested in psychology, marketing, industrial engineering, and computer science. Indeed, this text is meant for undergraduates and graduates from these disciplines enrolled in a course dealing with eye tracking, such as the eye tracking methodology course developed by the author at Clemson University. In this course, typically all chapters are covered, but not necessarily in the order presented in the text. In such a course, the order of chapters may be as follows.
First, Part IV is presented, outlining various eye tracking applications. Normally, this part should give the reader motivation for design and implementation of a semester-long eye tracking project. Coverage of this part of the book is usually supplemented by readings of research papers from various sources. For example, papers may be selected from the following conferences.
• The proceedings of the Eye Tracking Research & Applications (ETRA) conference
• The proceedings of the ACM Special Interest Group on Human–Computer Interaction (SIGCHI) conference (Human Factors in Computing)
• Transactions on Graphics, the proceedings of the annual Association for Computing Machinery (ACM) Special Interest Group on Graphics and Interactive Techniques (SIGGRAPH) conference series
• The proceedings of the Human Factors and Ergonomics Society (HFES)
To speed up development of an eye tracking application, Part II follows the presentation of Part IV, covering the technical details of eye tracker application development.
The types of applications that can be expected of students will depend mainly on the programming expertise represented by members of interdisciplinary student teams. For example, in the eye tracking methodology course at Clemson, teams are formed by joining computer science students with one or more of the other representatives enrolled in the class (i.e., from marketing, psychology, or industrial engineering). Although all group members decide on a project, students studying the latter subjects are mainly responsible for the design and analysis of the eventual eye tracking experiment.
Once work on an eye tracking application has commenced, Part III is then covered, going over experimental design. In the context of the usability measurement framework, the eye tracking methodology course advocates performance measurement, and therefore focuses on laboratory experiments and quantitative data analysis.
Part I of the text is covered last, giving students the necessary theoretical context for the eye tracking pilot study. Thus, although the book is arranged “top-down”, the course proceeds “bottom-up”.
The book is also suitable for researchers interested in setting up an eye tracking laboratory and/or using eye trackers for conducting experiments. Because readers with these goals may also come from diverse disciplines such as marketing, psychology, industrial engineering, and computer science, not all parts of the book may be suitable for everyone. More technically oriented readers will want to pay particular attention to the middle sections of the book which detail system installation and implementation of eye tracking application software. Readers not directly involved with such low-level details may wish to omit these sections and concentrate more on the theoretical and historical aspects given in the front sections of the book. The latter part of the book, dealing with eye tracking applications, should be suitable for all readers inasmuch as it presents examples of current eye tracking research.
Acknowledgments
This work was supported in part by a University Innovation grant (# 1-20-1906-514087), NASA Ames task (# NCC 2-1114), and NSF CAREER award # 9984278.
The preparation of this book has been assisted by many people, including Keith Karn, Roel Vertegaal, Dorion Liston, and Keith Rayner who provided comments on early editions of the text-in-progress. Later versions of the draft were reviewed by external reviewers to whom I express my gratitude, for their comments greatly improved the final version of the text. Special thanks go to David Wooding for his careful and thorough review of the text.
I would like to thank the team at Springer for helping me compose the text. Thanks go to Beverly Ford and Karen Borthwick for egging me on to write the text and to Rosie Kemp and Melanie Jackson for helping me with the final stages of publication. Many thanks to Catherine Brett for her help in the creation of the second edition.
Special thanks go to Bruce McCormick, who always emphasized the importance of writing during my doctoral studies at Texas A&M University, College Station, TX. Finally, special thanks go to Corey, my wife, for patiently listening to my various ramblings on eye movements, and for being an extremely patient eye tracking subject :).
I have gained considerable pleasure and enjoyment in putting the information I’ve gathered and learned on paper. I hope that readers of this text derive similar pleasure in exploring vision and eye movements as I have, and that they go on to implement ever more interesting and fascinating projects. Have fun!
Andrew T. Duchowski
Clemson, SC, June 2002 & July 2006
Contents
List of Figures
List of Tables
Part I Introduction to the Human Visual System (HVS)
1 Visual Attention
  1.1 Visual Attention: A Historical Review
    1.1.1 Von Helmholtz’s “Where”
    1.1.2 James’ “What”
    1.1.3 Gibson’s “How”
    1.1.4 Broadbent’s “Selective Filter”
    1.1.5 Deutsch and Deutsch’s “Importance Weightings”
    1.1.6 Yarbus and Noton and Stark’s “Scanpaths”
    1.1.7 Posner’s “Spotlight”
    1.1.8 Treisman’s “Glue”
    1.1.9 Kosslyn’s “Window”
  1.2 Visual Attention and Eye Movements
  1.3 Summary and Further Reading
2 Neurological Substrate of the HVS
  2.1 The Eye
  2.2 The Retina
    2.2.1 The Outer Layer
    2.2.2 The Inner Nuclear Layer
    2.2.3 The Ganglion Layer
  2.3 The Optic Tract and M/P Visual Channels
  2.4 The Occipital Cortex and Beyond
    2.4.1 Motion-Sensitive Single-Cell Physiology
  2.5 Summary and Further Reading
3 Visual Psychophysics
  3.1 Spatial Vision
  3.2 Temporal Vision
    3.2.1 Perception of Motion in the Visual Periphery
    3.2.2 Sensitivity to Direction of Motion in the Visual Periphery
  3.3 Color Vision
  3.4 Implications for Attentional Design of Visual Displays
  3.5 Summary and Further Reading
4 Taxonomy and Models of Eye Movements
  4.1 The Extraocular Muscles and the Oculomotor Plant
  4.2 Saccades
  4.3 Smooth Pursuits
  4.4 Fixations
  4.5 Nystagmus
  4.6 Implications for Eye Movement Analysis
  4.7 Summary and Further Reading
Part II Eye Tracking Systems
5 Eye Tracking Techniques
  5.1 Electro-OculoGraphy (EOG)
  5.2 Scleral Contact Lens/Search Coil
  5.3 Photo-OculoGraphy (POG) or Video-OculoGraphy (VOG)
  5.4 Video-Based Combined Pupil/Corneal Reflection
  5.5 Classifying Eye Trackers in “Mocap” Terminology
  5.6 Summary and Further Reading
6 Head-Mounted System Hardware Installation
  6.1 Integration Issues and Requirements
  6.2 System Installation
  6.3 Lessons Learned from the Installation at Clemson
  6.4 Summary and Further Reading
7 Head-Mounted System Software Development
  7.1 Mapping Eye Tracker Screen Coordinates
    7.1.1 Mapping Screen Coordinates to the 3D Viewing Frustum
    7.1.2 Mapping Screen Coordinates to the 2D Image
    7.1.3 Measuring Eye Tracker Screen Coordinate Extents
  7.2 Mapping Flock Of Birds Tracker Coordinates
    7.2.1 Obtaining the Transformed View Vector
    7.2.2 Obtaining the Transformed Up Vector
    7.2.3 Transforming an Arbitrary Vector
  7.3 3D Gaze Point Calculation
    7.3.1 Parametric Ray Representation of Gaze Direction
  7.4 Virtual Gaze Intersection Point Coordinates
    7.4.1 Ray/Plane Intersection
    7.4.2 Point-In-Polygon Problem
  7.5 Data Representation and Storage
  7.6 Summary and Further Reading
8 Head-Mounted System Calibration
  8.1 Software Implementation
  8.2 Ancillary Calibration Procedures
    8.2.1 Internal 2D Calibration
    8.2.2 Internal 3D Calibration
  8.3 Summary and Further Reading
9 Table-Mounted System Hardware Installation
  9.1 Integration Issues and Requirements
  9.2 System Installation
  9.3 Lessons Learned from the Installation at Clemson
  9.4 Summary and Further Reading
10 Table-Mounted System Software Development
  10.1 Linux Tobii Client Application Program Interface
    10.1.1 Tet_Init
    10.1.2 Tet_Connect, Tet_Disconnect
    10.1.3 Tet_Start, Tet_Stop
    10.1.4 Tet_CalibClear, Tet_CalibLoadFromFile, Tet_CalibSaveToFile, Tet_CalibAddPoint, Tet_CalibRemovePoints, Tet_CalibGetResult, Tet_CalibCalculateAndSet
    10.1.5 Tet_SynchronizeTime, Tet_PerformSystemCheck
    10.1.6 Tet_GetSerialNumber, Tet_GetLastError, Tet_GetLastErrorAsText
    10.1.7 Tet_CallbackFunction
  10.2 A Simple OpenGL/GLUT GUI Example
  10.3 Caveats
  10.4 Summary and Further Reading
11 Table-Mounted System Calibration
  11.1 Software Implementation
  11.2 Summary and Further Reading
12 Eye Movement Analysis
  12.1 Signal Denoising
  12.2 Dwell-Time Fixation Detection
  12.3 Velocity-Based Saccade Detection
  12.4 Eye Movement Analysis in Three Dimensions
    12.4.1 Parameter Estimation
    12.4.2 Fixation Grouping
    12.4.3 Eye Movement Data Mirroring
  12.5 Summary and Further Reading
Part III Eye Tracking Methodology
13 Experimental Design
  13.1 Formulating a Hypothesis
  13.2 Forms of Inquiry
    13.2.1 Experiments Versus Observational Studies
    13.2.2 Laboratory Versus Field Research
    13.2.3 Idiographic Versus Nomothetic Research
    13.2.4 Sample Population Versus Single-Case Experiment Versus Case Study
    13.2.5 Within-Subjects Versus Between-Subjects
    13.2.6 Example Designs
  13.3 Measurement and Analysis
  13.4 Summary and Further Reading
14 Suggested Empirical Guidelines
  14.1 Evaluation Plan
    14.1.1 Data Collection
    14.1.2 System Identification
    14.1.3 Constraints
    14.1.4 User Selection
    14.1.5 Evaluation Locale
    14.1.6 Task Selection
  14.2 Practical Advice
  14.3 Considering Dynamic Stimulus
  14.4 Summary and Further Reading
15 Case Studies
  15.1 Head-Mounted VR Diagnostics: Visual Inspection
    15.1.1 Case Study Notes
  15.2 Head-Mounted VR Diagnostics: 3D Maze Navigation
    15.2.1 Case Study Notes
  15.3 Desktop VR Diagnostics: Driving Simulator
    15.3.1 Case Study Notes
  15.4 Desktop Diagnostics: Usability
    15.4.1 Case Study Notes
  15.5 Desktop Interaction: Gaze-Contingent Fisheye Lens
    15.5.1 Case Study Notes
  15.6 Summary and Further Reading
Part IV Eye Tracking Applications
16 Diversity and Types of Eye Tracking Applications
  16.1 Summary and Further Reading
17 Neuroscience and Psychology
  17.1 Neurophysiological Investigation of Illusory Contours
  17.2 Attentional Neuroscience
  17.3 Eye Movements and Brain Imaging
  17.4 Reading
  17.5 Scene Perception
    17.5.1 Perception of Art
    17.5.2 Perception of Film
  17.6 Visual Search
    17.6.1 Computational Models of Visual Search
  17.7 Natural Tasks
  17.8 Eye Movements in Other Information Processing Tasks
  17.9 Summary and Further Reading
18 Industrial Engineering and Human Factors
  18.1 Aviation
  18.2 Driving
  18.3 Visual Inspection
  18.4 Summary and Further Reading
19 Marketing/Advertising
  19.1 Copy Testing
  19.2 Print Advertising
  19.3 Ad Placement
  19.4 Television Enhancements
  19.5 Web Pages
  19.6 Product Label Design
  19.7 Summary and Further Reading
20 Computer Science
  20.1 Human–Computer Interaction and Collaborative Systems
    20.1.1 Classic Eye-Based Interaction
    20.1.2 Cognitive Modeling
    20.1.3 Universal Accessibility
    20.1.4 Indirect Eye-Based Interaction
    20.1.5 Attentive User Interfaces (AUIs)
    20.1.6 Usability
    20.1.7 Collaborative Systems
  20.2 Gaze-Contingent Displays
    20.2.1 Screen-Based Displays
    20.2.2 Model-Based Graphical Displays
  20.3 Summary and Further Reading
21 Conclusion
References
Index
List of Figures
1.1 The Kanizsa illusion
1.2 Yarbus’ early eye movement recordings
2.1 A simplified view of the brain and the visual pathways
2.2 Stylized classification of cortical lobes
2.3 The eye
2.4 Schematic diagram of retinal neural interconnections
2.5 Schematic of the neuron
2.6 Schematic of receptive fields
2.7 Foveo–peripheral illusion
3.1 Visual angle
3.2 Density distributions of rod and cone receptors: visual angle
3.3 Density distributions of rod and cone receptors: rod/cone density
3.4 Visual acuity at various eccentricities and light levels
3.5 Critical Fusion Frequency
3.6 Absolute thresholds for detecting peripheral rotary movement
3.7 Visual fields for monocular color vision (right eye)
4.1 Extrinsic muscles of the eye
4.2 Schematic of the oculomotor system
4.3 Simple linear filter modeling saccadic movements
4.4 Simple linear feedback model of smooth pursuit movements
5.1 Example of electro-oculography (EOG) measurement
5.2 Example of search coil eye movement measurement apparatus
5.3 Example of scleral suction ring insertion
5.4 Examples of pupil, limbus, and corneal reflection
5.5 Example of table-mounted video-based eye tracker
5.6 Example of head-mounted video-based eye tracker
5.7 Purkinje images
5.8 Relative positions of pupil and first Purkinje images
5.9 Dual-Purkinje eye tracker
6.1 Virtual Reality Eye Tracking lab at Clemson University
6.2 Video signal wiring of the VRET lab at Clemson University
7.1 Eye tracker to VR mapping
7.2 Example mapping measurement
7.3 Euler angles
7.4 Basic binocular geometry
7.5 Ray/plane geometry
7.6 Point-in-polygon geometry
7.7 Example of three-dimensional gaze point captured in VR
8.1 Eye images during calibration (binocular eye tracking HMD)
8.2 Calibration stimulus
8.3 Typical per-trial calibration data (one subject)
8.4 Composite calibration data showing eye tracker slippage
8.5 Adjustment of left and right eye scale factors
9.1 Tobii dual-head eye tracking stations at Clemson University
9.2 Single Tobii eye tracking station hardware setup
11.1 Tobii eye tracking status window
11.2 Tobii concurrent process layout
12.1 Hypothetical eye movement signal
12.2 Eye movement signal denoising
12.3 Saccade/fixation detection
12.4 Idealized saccade detection
12.5 Finite Impulse Response (FIR) filters for saccade detection
12.6 Characteristic saccade signal and filter responses
12.7 FIR filters
12.8 Acceleration thresholding
12.9 Eye movement signal and filter responses
12.10 Heuristic mirroring example and calculation
13.1 Single-subject, time series design
13.2 Factorial design examples
14.1 Usability measurement framework
14.2 Traditional eye tracking metrics
14.3 Scanpath comparison Y-matrices and parsing diagrams
14.4 Eye Tracking Lab at Clemson University
15.1 Head-mounted and desktop (binocular) eye trackers
15.2 Eye tracking data of expert inspector and feedforward training display
15.3 Performance and process measures: relative difference (%)
15.4 Simple 3D maze with 2D map
15.5 Low-fidelity desktop VR driving simulator and scanpaths
15.6 “Hotspots” from expert’s ∼14.5 min eye tracking session
15.7 Mean completion times of all users and expert
15.8 Example of problematic Properties dialog selection
15.9 Exemplar errant visual search for Edit button
15.10 Fixation “hotspots” and selected AOIs
15.11 Fixations per AOI (overall, with SE whiskers)
15.12 Look-ahead saccades for click-and-drag
15.13 The Pliable Display Technology (PDT) fisheye lens
15.14 Gaze-contingent fisheye lens stimulus and performance results
16.1 Hierarchy of eye tracking applications
17.1 Example of eye tracking fMRI scanner
17.2 Fixation map from subjects viewing Paolo Veronese painting
17.3 Fixations from subjects viewing Paolo Veronese painting
17.4 Example of “pop-out” effect
17.5 Architecture of Guided Search 3.0
17.6 Architecture of Itti et al.’s visual attention system
17.7 String editing example
17.8 Eye tracking in a natural pick-and-place task
17.9 Eye tracking in a natural hand-washing task
18.1 A330 cockpit with predefined areas of interest
18.2 High-clutter driving stimulus images
18.3 Model of visual search area, visual lobe, and targets
18.4 Models of visual search
18.5 Example of visual search model
18.6 Virtual aircraft inspection simulator
18.7 Visualization of 3D scanpath in VR
19.1 Model of market and consumer actions
19.2 Scanpaths over advertisements
19.3 Scanpaths over NASCAR™ vehicles
19.4 Google’s golden triangle
19.5 Web search layout “hotspots”
19.6 Scanpaths over redesigned drug labels
20.1 Scanning behavior during program debugging
20.2 Example of eye typing interface
20.3 A drawing created solely with eye movements by an EyeDraw developer
20.4 ViewPointer headset, tag, and usage
20.5 GAZE Groupware display
20.6 GAZE Groupware interface
20.7 Example gaze-contingent displays
20.8 Image reconstruction and wavelet resolution mapping
20.9 Fractal terrain for gaze-contingent virtual environment
20.10 Fractal terrain: gaze-contingent rendering (wireframe)
20.11 Fractal terrain: gaze-contingent rendering
20.12 Gaze-contingent viewing of Isis model
20.13 Gaze-contingent collision modeling
List of Tables
2.1 Functional characteristics of ganglionic projections
3.1 Common visual angles
7.1 Euler angles
10.1 Linux Tobii client API function listing
12.1 Velocity algorithm comparisons
12.2 Acceleration algorithm comparisons
13.1 Statistical tests of difference of sample pairs (df = 1)
13.2 Statistical tests of difference of multivariate data (df > 1)
15.1 Example Software Usability Task durations
15.2 Example Software Usability Test task 003
15.3 Example Software Usability Test task 006
15.4 Task-specific mean responses over all tasks
15.5 General mean responses
17.1 Reading strategies/tactics
Part I Introduction to the Human Visual System (HVS)
1 Visual Attention
In approaching the topic of eye tracking, we first have to consider the motivation for recording human eye movements. That is, why is eye tracking important? Simply put, we move our eyes to bring a particular portion of the visible field of view into high resolution so that we may see in fine detail whatever is at the central direction of gaze. Most often we also divert our attention to that point so that we can focus our concentration (if only for a very brief moment) on the object or region of interest. Thus, we may presume that if we can track someone’s eye movements, we can follow along the path of attention deployed by the observer. This may give us some insight into what the observer found interesting, that is, what drew their attention, and perhaps even provide a clue as to how that person perceived whatever scene she or he was viewing.
By examining attention and the neural mechanisms involved in visual attention, the first two chapters of this book present motivation for the study of eye movements from two perspectives: a psychological viewpoint examining attentional behavior and its history of study (presented briefly in this chapter); and a physiological perspective on the neural mechanisms responsible for driving attentional behavior (covered in the next chapter). In sum, both introductory chapters establish the psychological and physiological basis for the movements of the eyes.
To begin formulating an understanding of an observer’s attentional processes, it is instructive to first establish a rudimentary or at least intuitive sense of what attention is, and whether the movement of the eyes does in fact disclose anything about the inner cognitive process known as visual attention.
Visual attention has been studied for over a hundred years. A good qualitative definition of visual attention was given by the psychologist William James:
Everyone knows what attention is. It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought. Focalization, concentration, of consciousness are of its essence. It implies withdrawal from some things in order to deal effectively with others...
When the things are apprehended by the senses, the number of them that can be attended to at once is small, ‘Pluribus intentus, minor est ad singula sensus.’
—W. James (1981)
The Latin phrase used above by James roughly translates to “Many filtered into few for perception.” The faculty implied as the filter is attention.
Humans are finite beings that cannot attend to all things at once. In general, attention is used to focus our mental capacities on selections of the sensory input so that the mind can successfully process the stimulus of interest. Our capacity for information processing is limited. The brain processes sensory input by concentrating on specific components of the entire sensory realm so that interesting sights, sounds, smells, and the like, may be examined with greater attention to detail than peripheral stimuli. This is particularly true of vision. Visual scene inspection is performed minutatim, not in toto. That is, human vision is a piecemeal process relying on the perceptual integration of small regions to construct a coherent representation of the whole.
In this chapter, attention is recounted from a historical perspective following the narrative found in Van der Heijden (1992). The discussion focuses on attentional mechanisms involved in vision, with emphasis on two main components of visual attention, namely the “what” and the “where”.
1.1 Visual Attention: A Historical Review
The phenomenon of visual attention has been studied for over a century. Early studies of attention were technologically limited to simple ocular observations and oftentimes to introspection. Since then the field has grown into an interdisciplinary subject involving the disciplines of psychophysics, cognitive neuroscience, and computer science, to name three. This section presents a qualitative historical background of visual attention.
1.1.1 Von Helmholtz’s “Where”
In the second half of the 19th century, Von Helmholtz (1925) posited visual attention as an essential mechanism of visual perception. In his Treatise on Physiological Optics, he notes, “We let our eyes roam continually over the visual field, because that is the only way we can see as distinctly as possible all the individual parts of the field in turn.” Noting that attention is concerned with a small region of space, Von Helmholtz observed visual attention’s natural tendency to wander to new things. He also remarked that attention can be controlled by a conscious and voluntary effort, allowing attention to peripheral objects without making eye movements to that object. Von Helmholtz was mainly concerned with eye movements to spatial locations, or the “where” of visual attention. In essence, although visual attention can be consciously directed to peripheral objects, eye movements reflect the will to inspect these objects in fine detail. In this sense, eye movements provide evidence of overt visual attention.
1.1.2 James’ “What”
In contrast to Von Helmholtz’s ideas, James (1981) believed attention to be a more internally covert mechanism akin to imagination, anticipation, or, in general, thought. James defined attention mainly in terms of the “what”, or the identity, meaning, or expectation associated with the focus of attention. James favored the active and voluntary aspects of attention, although he also recognized its passive, reflexive, nonvoluntary, and effortless qualities.
Both views of attention, which are not mutually exclusive, bear significantly on contemporary concepts of visual attention. The “what” and “where” of attention roughly correspond to foveal (James) and parafoveal (Von Helmholtz) aspects of visual attention, respectively. This dichotomous view of vision is particularly relevant to a bottom-up or feature-driven explanation of visual attention. That is, when considering an image stimulus, we may consider certain regions in the image that will attract one’s attention. These regions may initially be perceived parafoveally, in a sense requesting further detailed inspection through foveal vision. In this sense, peripherally located image features may drive attention in terms of “where” to look next, so that we may identify “what” detail is present at those locations.
The dual “what” and “where” feature-driven view of vision is a useful preliminary metaphor for visual attention, and indeed it has formed the basis for creating computational models of visual attention, which typically simulate so-called low-level, or bottom-up visual characteristics. However, this view of attention is rather simplistic. It must be stressed that a complete model of visual attention involves high-level visual and cognitive functions. That is, visual attention cannot simply be explained through the sole consideration of visual features. There are higher-level intentional factors involved (e.g., related to possibly voluntary, preconceived cognitive factors that drive attention).
1.1.3 Gibson’s “How”
In the 1940s Gibson (1941) proposed a third factor of visual attention centered on intention. Gibson’s proposition dealt with a viewer’s advance preparation as to whether to react and if so, how, and with what class of responses. This component of attention explained the ability to vary the intention to react while keeping the expectation of the stimulus object fixed, and conversely, the ability to vary the expectation of the stimulus object while keeping the intention to react fixed. Experiments involving ambiguous stimuli typically evoke these reactions. For example, if the viewer is made to expect words describing animals, then the misprint “sael” will be read as “seal”. Changing the expectation to words describing ships or boats invokes the perception of “sail”. The reactive nature of Gibson’s variant of attention specifies the “what to do”, or “how to react” behavior based on the viewer’s preconceptions or attitude. This variant of visual attention is particularly relevant to the design of experiments. It is important to consider the viewer’s perceptual expectation of the stimulus, as (possibly) influenced by the experimenter’s instructions.
1.1.4 Broadbent’s “Selective Filter”
Attention, in one sense, is seen as a “selective filter” responsible for regulating sensory information to sensory channels of limited capacity. In the 1950s, Broadbent (1958) performed auditory experiments designed to demonstrate the selective nature of auditory attention. The experiments presented a listener with information arriving simultaneously from two different channels, e.g., the spoken numerals {7, 2, 3} to the left ear, {9, 4, 5} to the right. Broadbent reported listeners’ reproductions of either {7, 2, 3, 9, 4, 5}, or {9, 4, 5, 7, 2, 3}, with no interwoven (alternating channel) responses. Broadbent concluded that information enters in parallel but is then selectively filtered to sensory channels.
1.1.5 Deutsch and Deutsch’s “Importance Weightings”
In contrast to the notion of a selective filter, Deutsch and Deutsch (1963) proposed that all sensory messages are perceptually analyzed at the highest level, precluding a need for a selective filter. Deutsch and Deutsch rejected the selective filter and limited capacity system theory of attention; they reasoned that the filter would need to be at least as complex as the limited capacity system itself. Instead, they proposed the existence of central structures with preset “importance weightings” that determined selection. Deutsch and Deutsch argued that it is not attention as such but the weightings of importance that have a causal role in attention. That is, attentional effects are a result of importance, or relevance, interacting with the information.
It is interesting to note that Broadbent’s selective filter generally corresponds to Von Helmholtz’s “where”, whereas Deutsch and Deutsch’s importance weightings correspond to James’ expectation, or the “what”. These seemingly opposing ideas were incorporated into a unified theory of attention by Anne Treisman in the 1960s (although not fully recognized until 1971). Treisman brought together the attentional models of Broadbent and Deutsch and Deutsch by specifying two components of attention: the attenuation filter followed by later (central) structures referred to as “dictionary units”. The attenuation filter is similar to Broadbent’s selective filter in that its function is selection of sensory messages. Unlike the selective filter, it does not completely block unwanted messages, but only attenuates them. The later stage dictionary units then process weakened and unweakened messages. These units contain variable thresholds tuned to importance, relevance, and context. Treisman thus brought together the complementary models of attentional unit or selective filter (the “where”), and expectation (the “what”).
Up to this point, even though Treisman provided a convincing theory of visual attention, a key problem remained, referred to as the scene integration problem. The scene integration problem poses the following question: even though we may view the visual scene through something like a selective filter, which is limited in its scope, how is it that we can piece together in our minds a fairly coherent scene of the entire visual field? For example, when looking at a group of people in a room such as in a classroom or at a party, even though it is impossible to gain a detailed view of everyone’s face at the same time, nevertheless it is possible to assemble a mental picture of where people are located. Our brains are capable of putting together this mental picture even though the selective filter of vision prevents us from physically doing so in one glance. Another well-known example of the scene integration problem is the Kanizsa (1976) illusion, exemplified in Figure 1.1, named after the person who invented it. Inspecting Figure 1.1, you will see the edges of a triangle, even though the triangle is defined only by the notches in the disks. How this triangle is integrated by the brain is not yet fully understood. That is, although it is known that the scene is inspected piecemeal as evidenced by the movement of the eyes, it is not clear how the “big picture” is assembled, or integrated, in the mind. This is the crux of the scene integration problem. In one view, offered by the Gestalt psychologists, it is hypothesized that recognition of the entire scene is performed by a parallel one-step process. To examine this hypothesis, a visualization of how a person views such an image (or any other) is particularly helpful. This is the motivation for recording and visualizing a viewer’s eye movements. Even though the investigation of eye movements dates back to 1907 (Dodge, 1907),1 a clear depiction of eye movements would not be available until 1967 (see Chapter 5 for a survey of eye tracking techniques). This early eye movement visualization, discussed below, shows the importance of eye movement recording not only for its expressive power of depicting one’s visual scanning characteristics, but also for its influence on theories of visual attention and perception.
1.1.6 Yarbus and Noton and Stark’s “Scanpaths”
Early diagrammatic depictions of recorded eye movements helped cast doubt on the Gestalt hypothesis that recognition is a parallel one-step process. The Gestalt view of recognition is a holistic one suggesting that vision relies to a great extent on the tendency to group objects. Although well-known visual illusions exist to support this view (e.g., subjective contours of the Kanizsa figure; see Figure 1.1; Kanizsa (1976)), early eye movement recordings showed that visual recognition is at least partially serial in nature.

1 As cited in Gregory (1990).

Fig. 1.1. The Kanizsa illusion.
Yarbus (1967) measured subjects’ eye movements over an image after giving subjects specific questions related to the image. Such a picture is shown in Figure 1.2. Questions posed to subjects included a range of queries specific to the situation, e.g., are the people in the image related, what are they wearing, what will they have to eat, and so on. The eye movements Yarbus recorded demonstrated sequential viewing patterns over particular regions in the image.
Noton and Stark (1971a, 1971b) performed their own eye movement measurements over images and coined the observed patterns “scanpaths”. Their work extended Yarbus’ results by showing that even without leading questions subjects tend to fixate identifiable regions of interest, or “informative details”. Furthermore, scanpaths showed that the order of eye movements over these regions is quite variable. That is, given a picture of a square, subjects will fixate on the corners, although the order in which the corners are viewed differs from viewer to viewer and even differs between consecutive observations made by the same individual.
In contrast to the Gestalt view, Yarbus’ and Noton and Stark’s work suggests that a coherent picture of the visual field is constructed piecemeal through the assembly of serially viewed regions of interest. Noton and Stark’s results support James’ “what” of visual attention. With respect to eye movements, the “what” corresponds to regions of interest selectively filtered by foveal vision for detailed processing.
In each of the traces, the subject was asked to: Trace 1, examine the picture at will; Trace 2, estimate the economic level of the people; Trace 3, estimate the people’s ages; Trace 4, guess what the people were doing before the arrival of the visitor; Trace 5, remember the people’s clothing; Trace 6, remember the people’s (and objects’) position in the room; Trace 7, estimate the time since the guest’s last visit.

Fig. 1.2. Yarbus’ early eye movement recordings. Reprinted from Yarbus (1967) with permission © 1967 Plenum Press.

1.1.7 Posner’s “Spotlight”

Contrary to the serial “what” of visual attention, the orienting, or the “where”, is performed in parallel (Posner et al., 1980). Posner et al. suggested an attentional mechanism able to move about the scene in a manner similar to a “spotlight.” The spotlight, being limited in its spatial extent, seems to fit well with Noton and Stark’s empirical identification of foveal regions of interest. Posner et al., however, dissociate the spotlight from foveal vision and consider the spotlight an attentional mechanism independent of eye movements. Posner et al. identified two aspects of visual attention: the orienting and the detecting of attention. Orienting may be an entirely central (covert or mental) aspect of attention, whereas detecting is context-sensitive, requiring contact between the attentional beam and the input signal. The orienting of attention is not always dependent on the movement of the eyes; that is, it is possible to attend to an object while maintaining gaze elsewhere. According to Posner et al., orientation of attention must be done in parallel and must precede detection.
The dissociation of attention from foveal vision is an important point. In terms of the “what” and the “where”, it seems likely that the “what” relates to serial foveal vision. The “where”, on the other hand, is a parallel process performed parafoveally, or peripherally, which dictates the next focus of attention.
1.1.8 Treisman’s “Glue”
Posner et al. and Noton and Stark advanced the theory of visual attention along similar lines forged by Von Helmholtz and James (and then Broadbent and Deutsch and Deutsch). Treisman once again brought these concepts together in the feature integration theory of visual attention (Treisman & Gelade, 1980; Treisman, 1986). In essence, attention provides the “glue” that integrates the separated features in a particular location so that the conjunction (i.e., the object) is perceived as a unified whole. Attention selects features from a master map of locations showing where all the feature boundaries are located, but not what those features are. That is, the master map specifies where things are, but not what they are. The feature map also encodes simple and useful properties of the scene such as color, orientation, size, and stereo distance. Feature Integration Theory, or FIT, is a particularly important theory of visual attention and visual search. Eye tracking is often a significant experimental component used to test FIT. Feature integration theory, treated as an eye tracking application, is discussed in more detail in Chapter 17.
1.1.9 Kosslyn’s “Window”
Recently, Kosslyn (1994) proposed a refined model of visual attention. Kosslyn describes attention as a selective aspect of perceptual processing, and proposes an attentional “window” responsible for selecting patterns in the “visual buffer”. The window is needed because there is more information in the visual buffer than can be passed downstream, and hence the transmission capacity must be selectively allocated. That is, some information can be passed along, but other information must be filtered out. This notion is similar to Broadbent’s selective filter and Treisman’s attenuation filter. The novelty of the attentional window is its ability to be adjusted incrementally; i.e., the window is scalable. Another interesting distinction of Kosslyn’s model is the hypothesis of a redundant stimulus-based attention-shifting subsystem (e.g., a type of context-sensitive spotlight) in mental imagery. Mental imagery involves the formation of mental maps of objects, or of the environment in general. It is defined as “...the mental invention or recreation of an experience that in at least some respects resembles the experience of actually perceiving an object or an event, either in conjunction with, or in the absence of, direct sensory stimulation” (Finke, 1989). It is interesting to note that the eyes move during sleep (known as Rapid Eye Movement or REM sleep). Whether this is a manifestation of the use of an internal attentional window during sleep is not known.
1.2 Visual Attention and Eye Movements
Considering visual attention in terms of the “what” and “where”, we would expect that eye movements work in a way that supports the dual attentive hypothesis. That is, vision might behave in a cyclical process composed of the following steps.
1. Given a stimulus, such as an image, the entire scene is first seen mostly in parallel through peripheral vision and thus mostly at low resolution. At this stage, interesting features may “pop out” in the field of view, in a sense engaging or directing attention to their location for further detailed inspection.
2. Attention is thus turned off or disengaged from the foveal location and the eyes are quickly repositioned to the first region that attracted attention.
3. Once the eyes complete their movement, the fovea is now directed at the region of interest, and attention is now engaged to perceive the feature under inspection at high resolution.
This is a bottom-up model or concept of visual attention. If the model is accurate, one would expect to find regions in the brain that correspond in their function to attentional mechanisms. This issue is further investigated in Chapter 2.
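As a rough illustration only (a toy sketch, not a model from the text), the cycle above can be written as repeated peak-picking on a precomputed saliency map with inhibition of return; the saliency array, inhibition radius, and fixation count below are all hypothetical:

import numpy as np

def bottom_up_scanpath(saliency, n_fixations=5, inhibition_radius=10):
    # Toy illustration of the cycle above: repeatedly move to the most
    # salient location ("engage"), record a fixation, then suppress the
    # visited region (inhibition of return, the "disengage" step).
    saliency = saliency.astype(float).copy()
    height, width = saliency.shape
    yy, xx = np.ogrid[:height, :width]
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
        fixations.append((y, x))
        mask = (yy - y) ** 2 + (xx - x) ** 2 <= inhibition_radius ** 2
        saliency[mask] = 0.0
    return fixations

print(bottom_up_scanpath(np.random.rand(64, 64)))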
The bottom-up model is at least correct in the sense that it can be said to be a component of natural human vision. In fact, the bottom-up model forms a powerful basis for computational models of visual search. Examples of such models are presented later in the text (see Chapter 17). The bottom-up view of visual attention is, however, incomplete. There are several key points that are not addressed. Consider these questions:
1. Assuming it is only the visual stimulus (e.g., image features) that drives attention, exactly what types of features attract attention?
2. If the visual stimulus were solely responsible for attracting attention, would we ever need the capability of making voluntary eye movements?
3. What is the link between attention and eye movements? Is attention always associated with the foveally viewed portion of the visual scene?
To gain insight into the first question, we must examine how our physical visual mechanism (our eyes and brain) responds to visual stimulus. To attempt to validate a model of visual attention, we would need to be able to justify the model by identifying regions in the brain that are responsible for carrying out the functionality proposed by the model. For example, we would expect to find regions in the brain that engage and disengage attention as well as those responsible for controlling (i.e., programming, initiating, and terminating) the movements of the eyes. Furthermore, there must be regions in the brain that are responsible for responding to and interpreting the visual stimuli that are captured by the eyes. As shown in the following chapters, the Human Visual System (HVS) responds strongly to some types of stimuli (e.g., edges), and weakly to others (e.g., homogeneous areas). The following chapters show that this response can be predicted to a certain extent by examining the physiology of the HVS. In later chapters we also show that the human visual response can be measured through a branch of psychology known as psychophysics. That is, through psychophysics, we can fairly well measure the perceptive power of the human visual system.
The bottom-up model of visual attention does not adequately offer answers to the second question because it is limited to mostly bottom-up, or feature-driven aspects of attention. The answer to the second question becomes clearer if we consider a more complete picture of attention involving higher-level cognitive functions. That is, a complete theory of visual attention should also involve those cognitive processes that describe our voluntary intent to attend to something, e.g., some portion of the scene. This is a key point that was briefly introduced following the summary of Gibson’s work, and which is evident in Yarbus’ early scanpaths. It is important to reiterate that Yarbus’ work demonstrated scanpaths which differed with observers’ expectations; that is, scanpath characteristics such as their order of progression can be task-dependent. Based on what they are looking for, people will view a picture differently. A complete model or theory of visual attention is beyond the scope of this book, but see Chapter 17 for further insight into theories of visual search, and also for examples of the application of eye trackers to study this question.
Considering the third question opens up a classical problem in eye tracking studies. Because attention is composed of both low-level and high-level functions (one can loosely think of involuntary and voluntary attention, respectively), as Posner and others have observed, humans can voluntarily dissociate attention from the foveal direction of gaze. In fact, astronomers do this regularly to detect faint constellations with the naked eye by looking “off the fovea.” Because the periphery is much more sensitive to dim stimuli, faint stars are much more easily seen out of the “corner” of one’s eye than when they are viewed centrally. Thus the high-level component of vision may be thought of as a covert component, or a component that is not easily detectable by external observation. This is a well-known problem for eye tracking researchers. An eye tracker can only track the overt movements of the eyes; it cannot track the covert movement of visual attention. Thus, in all eye tracking work, a tacit but very important assumption is usually accepted: we assume that attention is linked to foveal gaze direction, but we acknowledge that it may not always be so.
1.3 Summary and Further Reading
A historical account of attention is a prerequisite to forming an intuitive impression of the selective nature of perception. For an excellent historical account of selective visual attention, see Heijden (1992). An earlier and very readable introduction to visual processes is a small paperback by Gregory (1990). For a more neurophysiological perspective, see Kosslyn (1994). Another good text describing early attentional vision is Papathomas et al. (1995).
The singular idioms describing the selective nature of attention are the “what” and the “where”. The “where” of visual attention corresponds to the visual selection of specific regions of interest from the entire visual field for detailed inspection. Notably, this selection is often carried out through the aid of peripheral vision. The “what” of visual attention corresponds to the detailed inspection of the spatial region through a perceptual channel limited in spatial extent. The attentional “what” and “where” duality is relevant to eye tracking studies because scanpaths show the temporal progression of the observer’s foveal direction of gaze and therefore depict the observer’s instantaneous overt localization of visual attention.
From investigation of visual search, the consensus view is that a parallel pre-attentive stage acknowledges the presence of four basic features: color, size, orientation, and presence and/or direction of motion, and that features likely to attract attention include edges and corners, but not plain surfaces (see Chapter 17). There is some doubt, however, whether human visual search can be described as an integration of independently processed features (Van Orden & DiVita, 1993). Van Orden and DiVita suggest that “...any theory on visual attention must address the fundamental properties of early visual mechanisms.” To attempt to quantify the visual system’s processing capacity, the neural substrate of the human visual system is examined in the following chapter, which surveys the relevant neurological literature.
2 Neurological Substrate of the HVS
Considerable information may be gleaned from the vast neuroscientific literature regarding the functionality (and limitations) of the Human Visual System (HVS). It is often possible to qualitatively predict observed psychophysical results by studying the underlying visual “hardware.” For example, visual spatial acuity may be roughly estimated from knowledge of the distribution of retinal photoreceptors. Other characteristics of human vision may also be estimated from the neural organization of deeper brain structures.
Neurophysiological and psychophysical literature on the human visual system suggests the field of view is inspected minutatim through brief fixations over small regions of interest. This allows perception of detail through the fovea. Central foveal vision subtends 1–5◦ (visual angle) allowing fine scrutiny of only a small portion of the entire visual field, for example only 3% of the size of a large (21 in.) computer monitor (seen at ∼60 cm viewing distance). Approximately 90% of viewing time is spent in fixations. When visual attention is directed to a new area, fast eye movements (saccades) reposition the fovea. The dynamics of visual attention probably evolved in harmony with (or perhaps in response to) the perceptual limitations imposed by the neurological substrate of the visual system.
The brain is composed of numerous regions classified by their function (Zeki, 1993). A simplified representation of brain regions is shown in Figure 2.1, with lobe designations stylized in Figure 2.2. The human visual system is functionally described by the connections between retinal and brain regions, known as visual pathways. Pathways joining multiple brain areas involved in common visual functions are referred to as streams. Figure 2.1 highlights regions and pathways relevant to selective visual attention. For clarity, many connections are omitted. Of particular importance to dynamic visual perception and eye movements are the following neural regions, summarized in terms of relevance to attention.

• SC (Superior Colliculus): involved in programming eye movements and contributes to eye movement target selection for both saccades and smooth pursuits (possibly in concert with the Frontal Eye Fields (FEF) and area Lateral IntraParietal (LIP)); also remaps auditory space into visual coordinates (presumably for target foveation); with input of motion signals from area MT (see below), the SC is involved in pursuit target selection as well as saccade target selection.
• Area V1 (primary visual cortex): detection of a range of stimuli, e.g., principally orientation selection and possibly to a lesser extent color; cellular blob regions (double-opponent color cells) respond to color variations and project to areas V2 and V4 (Livingstone & Hubel, 1988).
• Areas V2, V3, V3A, V4, MT: form, color, and motion processing.
• Area V5/MT (Middle Temporal) and MST (Middle Superior Temporal): furnish large projections to Pons; hence possibly involved in smooth pursuit movements; involved in motion processing: area MT also projects to the colliculus, providing it with motion signals from the entire visual field.
• Area LIP (Lateral Intra Parietal): contains receptive fields that are corrected (reset) before execution of saccadic eye movements.
• PPC (Posterior Parietal Complex): involved in fixations.
Connections made to these areas from area V1 can be generally divided into two streams: the dorsal and ventral streams. Loosely, their functional description can be summarized as
• Dorsal stream: sensorimotor (motion, location) processing (e.g., the attentional “where”)
• Ventral stream: cognitive processing (e.g., the attentional “what”)
In general attentional terms, the three main neural regions implicated in eye movement programming and their functions are (Palmer, 1999):
• Posterior Parietal Complex: disengages attention,
• SC: relocates attention,
• Pulvinar: engages, or enhances, attention.
Fig. 2.1. A simplified view of the brain and the visual pathways relevant to eye movements and attention. (Diagram regions: FEF and extraocular muscles; the dorsal (parietal) stream with the Posterior Parietal Complex, LIP, MST, and the MT(V5) motion pathway; the ventral (temporal) stream with V4 and the Inferotemporal Complex; the M,P pathways through the LGN to V1 and V2; the Pulvinar and Thalamus; the SC in the midbrain; and the dorsolateral Pons.)

Fig. 2.2. Stylized classification of cortical lobes: frontal, parietal (including the posterior parietal complex), temporal, and occipital.
In a very simplified view of the brain, it is possible to identify the neural mechanisms involved in visual attention and responsible for the generation of eye movements. First, by examining the structure of the eye, it becomes clear why only the central or foveal region of vision can be perceived at high resolution. Second, signals from foveal and peripheral regions of the eye’s retina can be roughly traced along pathways in the brain showing how the brain may process the visual scene. Third, regions in the brain can be identified which are thought to be involved in moving the eyes so that the scene can be examined piecemeal. In this simplified view of the brain, one can in a sense obtain a complete picture of an “attentional feedback loop,” which creates the attentional cycles of disengaging attention, shifting of attention and (usually) the eyes, and for processing the region of interest currently being attended to, re-engaging attention and brain regions.
The neural substrate of the human visual system is examined in this chapter from the intuitive attentional perspective given above. The human neural hardware responsible for visual processing is presented in order roughly following the direction of light and hence information entering the brain. That is, the discussion is presented “front-to-back” starting with a description of the eye and ending with a summary of the visual cortex located at the back of the brain. Emphasis is placed on differentiating the processing capability of foveal and peripheral vision, i.e., the simplified “what” and “where” of visual attention, respectively. However, the reader must be cautioned against underestimating the complexity of the visual system as presented in this text. The apparent “what” and “where” dual pathways are most probably not independent functional channels. There is a good deal of interconnection and “crosstalk” between these and other related visual centers which renders the dichotomous analysis overly simplistic. Nevertheless, there is a great deal of valuable information to be found in the neurological literature, as human vision is undoubtedly the most studied human sense.
2.1 The Eye
Often called “the world’s worst camera,” the eye, shown in Figure 2.3, suffers from numerous optical imperfections, for example,

• Spherical aberrations: prismatic effect of peripheral parts of the lens
• Chromatic aberrations: shorter wavelengths (blue) refracted more than longer wavelengths (red)
• Curvature of field: a planar object gives rise to a curved image

However, the eye is also endowed with various mechanisms that reduce degradive effects, e.g.,

• To reduce spherical aberration, the iris acts as a stop, limiting peripheral entry of light rays,
• To overcome chromatic aberration, the eye is typically focused to produce sharp images of intermediate wavelengths,
• To match the effects of curvature of field, the retina is curved compensating for this effect.

The eye is schematically shown in Figure 2.3.

Fig. 2.3. The eye. (Diagram labels: cornea, pupil, iris, lens, aqueous humor, vitreous humor, retina, fovea, optic disc (blind spot), optic nerve & sheath; optic and visual axes.) Adapted from Visual Perception, 1st edition, by Cornsweet (1970) © 1970. Reprinted with permission of Wadsworth, a division of Thomson Learning: <www.thomsonrights.com>.

2.2 The Retina

At the rear interior surface of the eye, the retina contains receptors sensitive to light (photoreceptors) which constitute the first stage of visual perception. Photoreceptors can effectively be thought of as “transducers” converting light energy to electrical impulses (neural signals). Neural signals originating at these receptors lead to deeper visual centers in the brain. Photoreceptors are functionally classified into rods and cones. Rods are sensitive to dim and achromatic light (night vision), whereas cones respond to brighter chromatic light (daylight vision). The retina contains approximately 120 million rods and 7 million cones.
The retina is composed of multiple layers of different cell types (De Valois & De Valois, 1988). Surprisingly, the “inverted” retina is constructed in such a way that photoreceptors are found at the bottom layer. This construction is somewhat counterintuitive inasmuch as rods and cones are farthest away from incoming light, buried beneath a layer of cells. The retina resembles a three-layer cell sandwich, with connection bundles between each layer. These connectional layers are called plexiform or synaptic layers. The retinogeniculate organization is schematically depicted in Figure 2.4. The outermost layer (w.r.t. incoming light) is the outer nuclear layer which contains the photoreceptor (rod/cone) cells. The first connectional layer is the outer plexiform layer which houses connections between receptor and bipolar nuclei. The next outer layer of cells is the inner nuclear layer containing bipolar (amacrine, bipolar, horizontal) cells. The next plexiform layer is the inner plexiform layer where connections between inner nuclei cells and ganglion cells are formed. The top layer, or the ganglion layer, is composed of ganglion cells.

Fig. 2.4. Schematic diagram of the neural interconnections among receptors and bipolar, ganglion, horizontal, and amacrine cells. (Diagram labels: incoming light, optic nerve, ganglion cells, inner synaptic layer, amacrine cells, bipolar and horizontal cells, outer synaptic layer, receptor nuclei, cone, rod.) Adapted from Dowling and Boycott (1966) with permission © 1966 The Royal Society (London).
The fovea’s photoreceptors are special types of neurons, the nervous system’s basic elements (see Figure 2.5). Retinal rods and cones are specific types of dendrites. In general, individual neurons can connect to as many as 10,000 other neurons. Comprised of such interconnected building blocks, as a whole, the nervous system behaves as a large neural circuit. Certain neurons (e.g., ganglion cells) resemble a “digital gate,” sending a signal (firing) when the cell’s activation level exceeds a threshold. The myelin sheath is an axonal cover providing insulation which speeds up conduction of impulses. Unmyelinated axons of the ganglion cells converge to the optic disk (an opaque myelin sheath would block light). Axons are myelinated at the optic disk, and connect to the Lateral Geniculate Nuclei (LGN) and the Superior Colliculus (SC).

Fig. 2.5. Schematic of the neuron (dendrites, axon, myelin sheath, synapse). From Brain, Mind, and Behavior by Floyd E. Bloom and Arlyne Lazerson © 1985, 1988, 2001 by Educational Broadcasting Corporation. Used with the permission of Worth Publishers.
2.2.1 The Outer Layer

Rods and cones of the outer retinal layer respond to incoming light. A simplified account of the function of these cells is that rods provide monochromatic scotopic (night) vision, and cones provide trichromatic photopic (day) vision. Both types of cells are partially sensitive to mesopic (twilight) light levels.

2.2.2 The Inner Nuclear Layer

Outer receptor cells are laterally connected to the horizontal cells. In the fovea, each horizontal cell is connected to about 6 cones, and in the periphery to about 30–40 cones. Centrally, the cone bipolar cells contact one cone directly, and several cones indirectly through horizontal or receptor–receptor coupling. Peripherally, cone bipolar cells directly contact several cones. The number of receptors increases eccentrically. The rod bipolar cells contact a considerably larger number of receptors than cone bipolars. There are two main types of bipolar cells: ones that depolarize to increments of light (+), and others that depolarize to decrements of light (−). The signal profile (cross-section) of bipolar receptive fields is a “Mexican hat,” or center-surround, with an on-center or off-center signature.
2.2.3 The Ganglion Layer
In a naive view of the human visual system, it is possible to inaccurately think of the retina (and thus the HVS as a whole) acting in a manner similar to that of a camera. Although it is true that light enters the eye and is projected through the lens onto the retina, the camera analogy is only accurate up to this point. In the retina, ganglion cells form an “active contrast-enhancing system,” not a cameralike plate. Centrally, ganglion cells directly contact one bipolar. Peripherally, ganglion cells directly contact several bipolars. Thus the retinal “camera” is not composed of individual “pixels.” Rather, unlike isolated pixels, the retinal photoreceptors (rods and cones in the base layer) form rich interconnections beyond the retinal outer layer. With about 120 million rods and cones and only about 1 million ganglion cells eventually innervating at the LGN, there is considerable convergence of photoreceptor output. That is, the signals of many (on the order of about 100) photoreceptors are combined to produce one type of signal. This interconnecting arrangement is described in terms of receptive fields, and this arrangement functions quite differently from a camera.
Ganglion cells are distinguished by their morphological and functional characteristics. Morphologically, there are two types of ganglion cells, the α and β cells. Approximately 10% of retinal ganglion cells are α cells possessing large cell bodies and dendrites, and about 80% are β cells with small bodies and dendrites (Lund et al., 1995). The α cells project to the magnocellular (M-) layers of the LGN and the β cells project to the parvocellular (P-) layers. A third channel of input relays through narrow, cell-sparse laminae between the main M- and P-layers of the LGN. Its origin in the retina is not yet known. Functionally, ganglion cells fall into three classes, the X, Y, and W cells (De Valois & De Valois, 1988; Kaplan, 1991). X cells respond to sustained stimulus, location, and fine detail, and innervate along both M- and P-projections. Y cells innervate only along the M-projection, and respond to transient stimulus, coarse features, and motion. W cells respond to coarse features and motion, and project to the Superior Colliculus.
The receptive fields of ganglion cells are similar to those of bipolar cells (center-surround, on-center, off-center). Center-on and center-off receptive fields are depicted in Figure 2.6. Plus signs (+) denote illumination stimulus, minus signs (−) denote lack of stimulus. The vertical bars below each receptive field depict the firing response of the receptive field. This signal characteristic (series of “ticks”) is usually obtained by inserting an electrode into the brain. The signal profile of receptive fields resembles the “Mexican hat” operator, often used in image processing.

Fig. 2.6. Schematic of receptive fields (center-on and center-off firing responses over time, with and without stimulus).
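Since the text likens this profile to the “Mexican hat” operator of image processing, a minimal difference-of-Gaussians sketch may help. It assumes NumPy; the kernel radius and the two σ values are illustrative choices, not physiological measurements:

import numpy as np

def mexican_hat_kernel(radius=10, sigma_center=1.0, sigma_surround=2.0):
    # Center-surround ("Mexican hat") profile approximated as a
    # difference of Gaussians: excitatory center minus inhibitory
    # surround. All parameter values here are illustrative.
    x = np.arange(-radius, radius + 1, dtype=float)
    xx, yy = np.meshgrid(x, x)
    r2 = xx ** 2 + yy ** 2
    center = np.exp(-r2 / (2 * sigma_center ** 2)) / (2 * np.pi * sigma_center ** 2)
    surround = np.exp(-r2 / (2 * sigma_surround ** 2)) / (2 * np.pi * sigma_surround ** 2)
    return center - surround

k = mexican_hat_kernel()
print(k[10, 10] > 0)  # positive (on-) center at the middle
print(k[10, 13] < 0)  # negative (off-) surround a few cells away

Because the kernel's values sum to approximately zero, convolving an image with it responds strongly at edges and weakly over homogeneous regions, consistent with the border-driven cell responses described later in this chapter.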
2.3 The Optic Tract and M/P Visual Channels
Some (but not all) neural signals are transmitted from the retina to the occipital (visual) cortex through the optic tract, crossing in the optic chiasm, making connections to the LGN along the way. The physiology of the optic tract is often described functionally in terms of visual pathways, with reference to specific cells (e.g., ganglion cells). It is interesting to note the decussation (crossing) of the fibers from the nasal half of the retina at the optic chiasm, i.e., nasal retinal signals cross, temporal signals do not.
M and P ganglion cells in the retina connect to M and P channels, respectively. Along the optic pathways, the superior colliculus and the lateral geniculate nucleus are of particular importance. The SC is involved in programming eye movements and also remaps auditory space into visual coordinates. As shown in Figure 2.1, some neural signals along the optic tract project to the SC. The SC is thought to be responsible for directing the eyes to a new region of interest for subsequent detailed visual inspection. Like other regions in the thalamus serving similar functions, the LGN is a crossover point, or relay station, for α and β ganglion cells. The physiological organization of the LGN, with respect to innervations of these cells, produces a visual field topography of great clinical importance. Here, the magnocellular and the parvocellular ganglionic projections are clearly visible (under microscope), forming junctions within two distinct layers of the LGN, correspondingly termed the M- and P-layers. Thalamic axons from the M- and P-layers of the LGN terminate in area V1 (the primary visual center) of the striate cortex.
The functional characteristics of ganglionic projections to the LGN and the corresponding magno- and parvocellular pathways are summarized in Table 2.1. The parvocellular pathway in general responds to signals possessing the following attributes: high contrast (the parvocellular pathway is less sensitive to luminance), chromaticity, low temporal frequency, and high spatial frequency (due to the small receptive fields). Conversely, the magnocellular pathway can be characterized by sensitivity to the following signals: low contrast (the magnocellular pathway is more sensitive to luminance), achromaticity, moderate-to-high temporal frequency (e.g., sudden onset stimuli), and low spatial frequency (due to the large receptive fields). Zeki (1993) suggests the existence of four functional pathways defined by the M and P channels: motion, dynamic form, color, and form (size and shape). It is thought that fibers reaching the superior colliculus represent retinal receptive fields in rod-rich peripheral zones, whereas the fibers reaching the LGN represent cone-rich areas of high acuity (Bloom & Lazerson, 1988). It seems likely that, in a general sense, the M ganglion cells correspond to rods, mainly found in the periphery, and the P cells correspond to cones, which are chromatic cells concentrated mainly in the foveal region.
Table 2.1. Functional characteristics of ganglionic projections.

Characteristics                          Magnocellular   Parvocellular
Ganglion size                            Large           Small
Transmission time                        Fast            Slow
Receptive fields                         Large           Small
Sensitivity to small objects             Poor            Good
Sensitivity to change in light levels    Large           Small
Sensitivity to contrast                  Low             High
Sensitivity to motion                    High            Low
Color discrimination                     No              Yes
2.4 The Occipital Cortex and Beyond
Thalamic axons from the M- and P-layers of the LGN terminate mainly in the lower and upper halves (β, α divisions, respectively) of layer 4C in middle depth of area V1 (Lund et al., 1995). Cell receptive field size and contrast sensitivity signatures are distinctly different in the M- and P-inputs of the LGN, and vary continuously through the depth of layer 4C. Unlike the center-surround receptive fields of retinal ganglion and LGN cells, cortical cells respond to orientation-specific stimulus (Hubel, 1988). Cortical cells are distinguished by two classes: simple and complex.
In area V1, the size of a simple cell’s receptive field depends on its relative retinal position. The smallest fields are in and near the fovea, with sizes of about 1/4 × 1/4 degree. This is about the size of the smallest receptive field centers of retinal ganglion or LGN cells. In the far periphery, simple cell receptive field sizes are about 1 × 1 degree. The relationship between small foveal receptive fields and large peripheral receptive fields is maintained approximately everywhere along the visual pathway.
Simple cells fire only when a line or edge of preferred orientation falls within a particular location of the cell’s receptive field. Complex cells fire wherever such a stimulus falls into the cell’s receptive field (Lund et al., 1995). The optimum stimulus width for either cell type is, in the fovea, about two minutes of arc. The resolving power (acuity) of both cell types is the same.
About 10–20% of complex cells in the upper layers of the striate cortex show marked directional selectivity (Hubel, 1988). Directional Selectivity (DS) refers to the cell’s response to a particular direction of movement. Cortical Directional Selectivity (CDS) contributes to motion perception and to the control of eye movements (Grzywacz & Norcia, 1995). CDS cells establish a motion pathway from V1 projecting to MT and V2 (which also projects to MT) and to MST. In contrast, there is no evidence that Retinal Directional Selectivity (RDS) contributes to motion perception. RDS contributes to oculomotor responses (Grzywacz et al., 1995). In vertebrates, it is involved in optokinetic nystagmus, a type of eye movement discussed in Chapter 4.
2.4.1 Motion-Sensitive Single-Cell Physiology
There are two somewhat counterintuitive implications of the visual system’s motion-sensitive single-cell organization for perception. First, due to motion-sensitive cells, the eyes are never perfectly still but make constant tiny movements called microsaccades (Hubel, 1988). The counterintuitive fact regarding eye movements is that if an image were artificially stabilized on the retina, vision would fade away within about a second and the scene would become blank. Second, due to the response characteristics of single (cortical) cells, the cameralike “retinal buffer” representation of natural images is much more abstract than intuition suggests. An object in the visual field stimulates only a tiny fraction of the cells on whose receptive field it falls (Hubel, 1988). Perception of the object depends mostly on the response of (orientation-specific) cells to the object’s borders. For example, the homogeneously shaded interior of an arbitrary form (e.g., a kidney bean) does not stimulate cells of the visual system. Awareness of the interior shade or hue depends only on cells sensitive to the borders of the object. In Hubel’s (1988) words, “...our perception of the interior as black, white, gray, or green has nothing to do with cells whose fields are in the interior—hard as that may be to swallow.... What happens at the borders is the only information you need to know: the interior is boring.”
2.5 Summary and Further Reading
This chapter presented a simplified view of the brain with emphasis on regions and structures of the brain responsible for attentional and visual processing, including those regions implicated in eye movement generation. Starting with the structure of the eye, the most salient observation is the structure of the retina, which clearly shows the limited scope of the high-resolution fovea. The division between foveo–peripheral vision is maintained along the visual pathways and can be clearly seen under microscope in the LGN. Of particular relevance to attention and eye movements is the physiological and functional duality of the magno- and parvocellular pathways and of their apparent mapping to the attentional “what” and “where” classification. Although this characterization of the M- and P-pathways is admittedly overly simplistic, it provides an intuitive functional distinction between foveal and peripheral vision.
An interesting visual example of foveo–peripheral processing is shown in Figure 2.7. To notice the curious difference between foveal and peripheral processing, foveate one corner of the image in Figure 2.7 and, without moving your eyes, shift your attention to the opposing corner of the image. Interestingly, you should perceive white dots at the line crossings in the foveal region, but black dots should appear at the line crossings in the periphery.
Examining regions in the brain along the visual pathways, one can obtain insight into how the brain processes visual information. The notion that attention may be driven by certain visual features (e.g., edges) is supported to an extent by the identification of neural regions which respond to these features. How certain features are perceived, particularly within and beyond the fovea, is the topic covered in the next chapter.
For an excellent review of physiological optics and visual perception in general, see Hendee and Wells (1997). For an introduction to neuroscience, see Hubel’s (1988) very readable text. For a more recent description of the brain with an emphasis on color vision, see Zeki (1993). Apart from these texts on vision, several “handbooks” have also been assembled describing current knowledge of the brain. Arbib’s (1995) handbook is one such example. It is an excellent source summarizing current knowledge of the brain, although it is somewhat difficult to read and to navigate through.1 Another such well-organized but rather large text is Gazzaniga (2000).
1 A new edition of Arbib’s book has recently been announced.
Fig. 2.7. Foveo–peripheral illusion: scintillation effect produced by a variation of the standard Hermann grid illusion (attributed to L. Hermann (1870)), first discovered by Elke Lingelbach (at home). Adapted from Ninio and Stevens © 2000, Pion, London.
3 Visual Psychophysics
Given the underlying physiological substrate of the human visual system, measurable performance parameters often (but not always!) fall within ranges predicted by the limitations of the neurological substrate. Visual performance parameters, such as visual acuity, are often measured following established experimental paradigms, generally derived in the field of psychophysics (e.g., Receiver Operating Characteristics, or ROC paradigm, is one of the more popular experimental methods).
Unexpected observed visual performance is often a consequence of complex visual processes (e.g., visual illusions), or combinations of several factors. For example, the well-known Contrast Sensitivity Function, or CSF, describing the human visual system’s response to stimuli of varying contrast and resolution, depends not only on the organization of the retinal mosaic, but also on the response characteristics of complex cellular combinations, e.g., receptive fields.
In this book, the primary concern is visual attention, and so the book primarily considers the distinction between foveo–peripheral vision. This subject, although complex, is discussed here in a fairly simplified manner, with the aim of elucidating only the most dramatic differences between what is perceived foveally and peripherally. In particular, visual (spatial) acuity is arguably the most studied distinction and is possibly the simplest parameter to alter in eye-based interaction systems (at least at this time). It is therefore the topic covered in greatest detail, in comparison to the other distinctions covered here briefly: temporal and chromatic foveo–peripheral differences.
3.1 Spatial Vision
Dimensions of retinal features are usually described in terms of projected scene dimensions in units of degrees visual angle, defined as

A = 2 arctan(S/2D),

where S is the size of the scene object and D is the distance to the object (see Figure 3.1). Common visual angles are given in Table 3.1.
Fig. 3.1. Visual angle A subtended by an object of size S at distance D. Adapted from Haber and Hershenson (1973) © 1973. Reprinted with permission of Brooks/Cole, an imprint of the Wadsworth Group, a division of Thomson Learning.

Table 3.1. Common visual angles.

Object             Distance        Angle Subtended
Thumbnail          Arm’s length    1.5–2◦
Sun or moon        —               .5◦ or 30′ of arc
U.S. quarter coin  Arm’s length    2◦
U.S. quarter coin  85 m            1′ (1 minute of arc)
U.S. quarter coin  5 km            1″ (1 second of arc)
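As a quick sanity check, the formula above can be evaluated directly. The sketch below is illustrative only: the ~24 mm quarter diameter and the ~0.7 m arm’s length are assumed values, not figures from the text.

import math

def visual_angle_deg(size, distance):
    # Visual angle A = 2*arctan(S/2D), in degrees.
    # size and distance must be expressed in the same units.
    return math.degrees(2 * math.atan(size / (2 * distance)))

# Rough checks against Table 3.1 (quarter diameter ~24 mm and
# arm's length ~0.7 m are assumptions, not values from the text):
print(visual_angle_deg(0.024, 0.7))          # ~2 degrees at arm's length
print(visual_angle_deg(0.024, 85) * 60)      # ~1 minute of arc at 85 m
print(visual_angle_deg(0.024, 5000) * 3600)  # ~1 second of arc at 5 km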
The innermost region is the fovea centralis (or foveola) which measures 400 µm in diameter and contains 25,000 cones. The fovea proper measures 1500 µm in diameter and holds 100,000 cones. The macula (or central retina) is 5000 µm in diameter, and contains 650,000 cones. One degree visual angle corresponds to approximately 300 µm distance on the human retina (De Valois & De Valois, 1988). The foveola, measuring 400 µm, subtends 1.3◦ visual angle, and the fovea and macula subtend 5◦ and 16.7◦, respectively (see Figure 3.2). Figure 3.3 shows the retinal distribution of rod and cone receptors. The fovea contains 147,000 cones/mm² and a slightly smaller number of rods. At about 10◦ the number of cones drops sharply to less than 20,000 cones/mm², and at 30◦ the number of rods in the periphery drops to about 100,000 rods/mm² (Haber & Hershenson, 1973).
The entire visual field roughly corresponds to a 23,400 square degree area defined by an ellipsoid with the horizontal major axis subtending 180◦ visual angle, and the minor vertical axis subtending 130◦. The diameter of the highest acuity circular region subtends 2◦, the parafovea (zone of high acuity) extends to about 4◦ or 5◦, and acuity drops off sharply beyond. At 5◦, acuity is only 50% (Irwin, 1992). The so-called “useful” visual field extends to about 30◦. The rest of the visual field has very poor resolvable power and is mostly used for perception of ambient motion. With increasing eccentricity the cones increase in size, whereas the rods do not (De Valois & De Valois, 1988). Cones, not rods, make the largest contribution to the information going to deeper brain centers, and provide most of the fine-grained spatial resolvability of the visual system.

Fig. 3.2. Density distributions of rod and cone receptors across the retinal surface: visual angle (retinal cross-section with eccentricities marked from 0◦ at the fovea out to 80◦, showing the optic and visual axes and the optic nerve & sheath). Adapted from Pirenne (1967); as cited in Haber and Hershenson (1973).
Fig. 3.3. Density distributions of rod and cone receptors across the retinal surface: rod/cone density (receptors per mm², 0–180,000, plotted against visual angle out to 70◦ on either side of the fovea; cone density peaks sharply at the fovea, rod density peaks in the near periphery, and both fall to zero at the blind spot). Adapted from Pirenne (1967); as cited in Haber and Hershenson (1973).

The Modulation Transfer Function (MTF) theoretically describes the spatial resolvability of retinal photoreceptors by considering the cells as a finite array of sampling units. The 400 µm-diameter rod-free foveola contains 25,000 cones. Using the area of a circle, 25000 = πr², approximately 2√(25000/π) ≈ 178.41 cones occupy a 400 µm linear cross-section of the foveola, with an estimated average linear inter-cone spacing of 2.24 µm. Cones in this region measure about 1 µm in diameter. Because one degree visual angle corresponds to approximately 300 µm distance on the human retina, roughly 133 cones are packed per degree visual angle in the foveola. By the sampling theorem, this suggests a resolvable spatial Nyquist frequency of 66 c/deg. Subjective resolution has in fact been measured at about 60 c/deg (De Valois & De Valois, 1988). In the fovea, a similar estimate based on the foveal diameter of 1500 µm and a 100,000 cone population gives an approximate linear cone distribution of 2√(100000/π) ≈ 356.82 cones per 1500 µm. The corresponding linear cone density is then about 71 cones/deg, suggesting a maximum resolvable frequency of 35 cycles/deg, roughly half the resolvability within the foveola. This is somewhat of an underestimate because cone diameters increase twofold by the edge of the fovea, suggesting a slightly milder acuity degradation. These one-dimensional approximations are not fully generalizable to the two-dimensional photoreceptor array although they provide insight into the theoretic resolution limits of the eye. Effective relative visual acuity measures are usually obtained through psychophysical experimentation.
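The cone-count arithmetic above is easy to reproduce. The sketch below uses only the chapter’s figures (25,000 and 100,000 cones; 400 µm and 1500 µm diameters; ~300 µm per degree) and the stated 2√(n/π) linear cross-section estimate:

import math

MICRONS_PER_DEG = 300.0  # ~300 micrometers of retina per degree visual angle

def linear_cone_count(n_cones):
    # Cones across a linear cross-section of a circular patch holding
    # n_cones: the diameter 2*sqrt(n/pi) of a circle of "area" n.
    return 2 * math.sqrt(n_cones / math.pi)

# Foveola: 25,000 cones within a 400 micrometer diameter.
n = linear_cone_count(25_000)                  # ~178.41 cones per 400 um
spacing_um = 400 / n                           # ~2.24 um inter-cone spacing
cones_per_deg = MICRONS_PER_DEG / spacing_um   # ~133 cones/deg
print(n, spacing_um, cones_per_deg, cones_per_deg / 2)  # Nyquist ~66 c/deg

# Fovea: 100,000 cones within a 1500 micrometer diameter.
n = linear_cone_count(100_000)                 # ~356.82 cones per 1500 um
cones_per_deg = MICRONS_PER_DEG * n / 1500     # ~71 cones/deg
print(n, cones_per_deg, cones_per_deg / 2)     # Nyquist ~35 c/deg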
|
||
|
||
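These back-of-the-envelope estimates are easily reproduced. The following Python sketch (not from the text; it simply recomputes the numbers given above) derives the linear cone counts and the corresponding Nyquist limits:

import math

# Cone counts along a linear cross-section (the diameter of a circular patch):
foveola_cones = 2 * math.sqrt(25000 / math.pi)    # ~178.4 cones per 400 um
fovea_cones   = 2 * math.sqrt(100000 / math.pi)   # ~356.8 cones per 1500 um

# One degree of visual angle spans ~300 um on the retina:
foveola_per_deg = foveola_cones * (300 / 400)     # ~133 cones/deg
fovea_per_deg   = fovea_cones * (300 / 1500)      # ~71 cones/deg

# Sampling theorem: two samples per cycle give the Nyquist frequency:
print(foveola_per_deg / 2)                        # ~66 c/deg
print(fovea_per_deg / 2)                          # ~35 c/deg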
At photopic light levels (day, or cone vision), foveal acuity is fairly constant within the central 2◦, and drops approximately linearly from there to the 5◦ foveal border. Beyond 5◦, acuity drops sharply (approximately exponentially). At scotopic light levels (night, or rod vision), acuity is poor at all eccentricities. Figure 3.4 shows the variation of visual acuity at various eccentricities and light intensity levels. Intensity is shown varying from 9.0 to 4.6 log micromicrolamberts, denoted by log mmL (9.0 log micromicrolamberts = 10⁹ micromicrolamberts = 1 mL; see Davson (1980, p. 311)). The correspondence between foveal receptor spacing and optical limits generally holds in foveal regions of the retina, but not necessarily in the periphery. In contrast to the approximate 60 c/deg resolvability of foveal cones, the highest spatial frequencies resolvable by rods are on the order of 5 c/deg, suggesting poor resolvability in the relatively cone-free periphery. Although visual acuity correlates fairly well with cone distribution density, it is important to note that synaptic organization and later neural elements (e.g., ganglion cells concentrated in the central retina) are also contributing factors in determining visual acuity.
3.2 Temporal Vision
Human visual response to motion is characterized by two distinct facts: the persistence of vision and the phi phenomenon (Gregory, 1990). The former essentially describes the temporal sampling rate of the HVS, and the latter describes a threshold above which the HVS detects apparent movement. Both facts are exploited in television, cinema, and graphics to elicit perception of motion from successively displayed still images.
Persistence of vision describes the inability of the retina to sample rapidly changing intensities. A stimulus flashing at about 50–60 Hz (cycles per second) will appear steady (depending on contrast and luminance conditions and observers). This is known as the Critical Fusion Frequency (CFF).1 A stylized representation of the CFF, based on measurements of response to temporal stimuli of varying contrast (i.e., a temporal contrast sensitivity function), is shown in Figure 3.5. Incidentally, the curve of the CFF resembles the shape of the curve of the Contrast Sensitivity Function (CSF), which describes retinal spatial frequency response. The CFF explains why flicker is not seen when viewing a sequence of (still) images at a high enough rate. The CFF illusion is maintained in cinema because frames are shown at 24 frames per second (fps, equivalent to Hz), but a three-bladed shutter raises the flicker rate to 72 Hz (three flashes per frame). Television also achieves the CFF by displaying the signal at 60 fields per second. Television's analog to cinema's three-bladed shutter is the interlacing scheme: the typical television frame rate is about 30 frames per second (depending on the standard used, e.g., NTSC in North America, PAL in other regions), but only the even or odd scanlines (fields) are shown per cycle. Although the CFF explains why flicker is effectively eliminated in motion picture (and computer) displays, it does not fully explain why motion is perceived.

1 Also sometimes referred to as the Critical Flicker Frequency.

Fig. 3.4. Visual acuity at various eccentricities and light levels. Adapted from Davson (1980) with permission © 1980 Academic Press.
The second fact that explains why movies, television, and graphics work is the phi phenomenon, also known as stroboscopic or apparent motion. This fact explains the illusion of old-fashioned moving neon signs whose stationary lights are turned on in quick succession. The illusion can also be demonstrated with just two lights, provided the delay between successive light flashes is no more than about 62 ms (Brinkmann, 1999). Inverting this value gives a rate of about 16 fps, which is considered a bare minimum to facilitate the illusion of apparent motion.
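The rates quoted above follow from simple arithmetic, as in this small Python check (a sketch; the 62 ms figure is the maximum flash delay cited above):

film_fps = 24                 # cinema frame rate
blades = 3                    # three-bladed shutter flashes each frame 3 times
print(film_fps * blades)      # 72 Hz flicker rate, above the CFF

max_delay = 0.062             # ~62 ms maximum delay between flashes
print(1.0 / max_delay)        # ~16 fps minimum for apparent motion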
Fig. 3.5. Critical Fusion Frequency. Adapted from Bass (1995) © 1995 McGraw-Hill. Reproduced with permission of The McGraw-Hill Companies.
3.2.1 Perception of Motion in the Visual Periphery
In the context of visual attention and foveo–peripheral vision, the temporal response of the HVS is not homogeneous across the visual field. In terms of motion responsiveness, Koenderink et al. (1985) provide evidence that the foveal region is more receptive to slower motion than the periphery, although motion is perceived uniformly across the visual field. Sensitivity to target motion decreases monotonically with retinal eccentricity for slow and very slow motion (1 cycle/deg; Boff and Lincoln (1988)). That is, the velocity of a moving target appears slower in the periphery than in the fovea. Conversely, a higher rate of motion (e.g., frequency of rotation of a grated disk) is needed in the periphery to match the apparent stimulus velocity in the fovea. At higher velocities, the effect is reversed.
Despite the decreased sensitivity in the periphery, movement is more salient there than in the central field of view (fovea). That is, the periphery is more sensitive to moving targets than to stationary ones: it is easier to detect a moving target peripherally than a stationary one. In essence, motion detection is the periphery's major task; it acts as a kind of early warning system for moving targets entering the visual field.
3.2.2 Sensitivity to Direction of Motion in the Visual Periphery
The periphery is approximately twice as sensitive to horizontal-axis movement as to vertical-axis movement (Boff & Lincoln, 1988). Directional motion sensitivity is shown in Figure 3.6.
|
||
0
|
||
|
||
270
|
||
|
||
90
|
||
|
||
2
|
||
|
||
4
|
||
|
||
6
|
||
|
||
8
|
||
|
||
180
|
||
Fig. 3.6. Absolute threshold isograms for detecting peripheral rotary movement. Numbers are rates of pointer movement in revolutions per minute. Adapted from McColgin (1960) with permission © 1960 Optical Society of America.
|
||
|
||
3.3 Color Vision
Foveal color vision is facilitated by the three types of retinal cone photoreceptors. The three main spectral sensitivity curves for retinal cone photoreceptors peak at approximately 450 nm, 520 nm, and 555 nm wavelengths, for the blue, green, and red photoreceptors, respectively. A great deal is known about color vision in the fovea; relatively little is known about peripheral color vision. Of the seven million cones, most are packed tightly into the central 30◦ region of the retina, with scarcely any cones found beyond. This cone distribution suggests that peripheral color vision is quite poor in comparison to the color sensitivity of the central retinal region. Visual fields for monocular color vision are shown in Figure 3.7. Fields are shown for the right eye; fields for the left eye would be mirror images of those for the right eye. Blue and yellow fields are larger than the red and green fields; no chromatic visual fields have a definite border; instead, sensitivity drops off gradually and irregularly over a range of 15–30◦ visual angle (Boff & Lincoln, 1988).
Fig. 3.7. Visual fields for monocular color vision (right eye): B = Blue, R = Red, G = Green, Y = Yellow. Adapted from Boff and Lincoln (1988) with permission © 1988 Wright-Patterson AFB.
Quantification of perceptual performance is not easily found in the literature. Compared to investigation of foveal color vision, only a few experiments have been performed to measure peripheral color sensitivity. Two studies, of particular relevance to peripheral location of color CRTs in an aircraft cockpit environment, investigated the chromatic discrimination of peripheral targets.
In the first study, Doyal (1991) concludes that peripheral color discrimination can approximate foveal discrimination when relatively small field sizes are presented (e.g., 2◦ at 10◦ eccentricity, and less than 4◦ at 25◦). Although this sounds encouraging, color discrimination was tested at limited peripheral eccentricities (within the central 30◦).
In the second, Ancman (1991) tested color discrimination at much greater eccentricities, up to about 80◦ visual angle. She found that subjects wrongly identified the color of a peripherally located 1.3◦ circle displayed on a CRT 5% of the time if it was blue, 63% of the time if red, and 62% of the time if green. Furthermore, blue could not be seen farther than 83.1◦ off the fovea (along the x-axis); red had to be closer than 76.3◦ and green nearer than 74.3◦ before subjects could identify the color.
There is much yet to be learned about peripheral color vision. Being able to verify a subject’s direction of gaze during peripheral testing would be of significant benefit to these experiments. This type of psychophysical testing is but one of several research areas where eye tracking studies could play an important supporting role.
3.4 Implications for Attentional Design of Visual Displays
Both the structure and functionality of human visual system components place constraints on the design parameters of a visual communication system. In particular, the design of a gaze-contingent system must distinguish the characteristics of foveal and peripheral vision (see Section 20.2). A visuotopic representation model for imagery based on these observations is proposed:
1. Spatial resolution should remain high within the foveal region and smoothly degrade within the periphery, matching human visual acuity. High spatial frequency features in the periphery must be made visible "just in time" to anticipate gaze-contingent fixation changes.

2. Temporal resolution must be available in the periphery. Sudden onset events are potential attentional attractors. At low speeds, motion of peripheral targets should be increased to match apparent motion in the central field of view.

3. Luminance should be coded for high visibility in the peripheral areas because the periphery is sensitive to dim objects.

4. Chrominance should be coded for high exposure almost exclusively in the foveal region, with chromaticity decreasing sharply into the periphery. This requirement is a direct consequence of the high density of cones and parvocellular ganglion cells in the fovea.

5. Contrast sensitivity should be high in the periphery, corresponding to the sensitivity of the magnocellular ganglion cells found mainly outside the fovea.

Special consideration should be given to sudden onset, luminous, high-frequency objects (i.e., suddenly appearing bright edges).
A gaze-contingent visual system faces an implementational difficulty not yet addressed: matching the dynamics of human eye movement. Any system designed to incorporate an eye-slaved, high-resolution region of interest, for example, must deal with the inherent delay imposed by the processing required to track and process real-time eye tracking data. To consider the temporal constraints that need to be met by such systems, the dynamics of human eye movements must be evaluated. This topic is considered in the following chapter.
3.5 Summary and Further Reading
Psychophysical information may be the most usable form of literature for the design of graphical displays, attentional in nature or otherwise. Introductory texts may include function plots of some aspect of vision (e.g., acuity) which may readily be used to guide the design of visual displays. However, one often needs to evaluate the experimental design used in psychophysical experiments to determine the generalizability of reported results. Furthermore, similar caution should be employed as in reading neurological literature: psychophysical results may often deal with a certain specific aspect of vision, which may or may not be readily applicable to display design. For example, visual acuity may suggest the use of relatively sized fonts on a Web page (larger font in the periphery), but acuity alone may not be sufficient to determine the required resolution in something like an attentional image or video display program. For the latter, one may need to piece together information concerning the visual contrast sensitivity function, temporal sensitivity, and so on. Furthermore, psychophysical studies may involve relatively simple stimuli (sine wave gratings), the results of which may or may not generalize to more complex stimuli such as imagery.
For a good introductory book on visual perception, see Hendee and Wells (1997). This text includes a good introductory chapter on the neurological basis of vision. Another good introductory book which also includes an interesting perspective on the perception of art is Solso (1999). For a somewhat terse but fairly complete psychophysical reference, see the USAF Engineering Data Compendium (Boff & Lincoln, 1988). This is an excellent “quick” guide to visual performance.
4 Taxonomy and Models of Eye Movements
Almost all normal primate eye movements used to reposition the fovea result from combinations of five basic types: saccadic, smooth pursuit, vergence, vestibular, and physiological nystagmus (miniature movements associated with fixations; Robinson (1968)). Vergence movements are used to focus the pair of eyes over a distant target (depth perception). Other movements such as adaptation and accommodation refer to nonpositional aspects of eye movements (i.e., pupil dilation, lens focusing). With respect to visual display design, positional eye movements are of primary importance.
4.1 The Extraocular Muscles and the Oculomotor Plant
In general, the eyes move within six degrees of freedom: three translations within the socket, and three rotations. There are six muscles responsible for movement of the eyeball: the medial and lateral recti (sideways movements), the superior and inferior recti (up/down movements), and the superior and inferior obliques (twist) (Davson, 1980). These are depicted in Figure 4.1. The neural system involved in generating eye movements is known as the oculomotor plant. The general plant structure and connections are shown in Figure 4.2 and described in Robinson (1968). Eye movement control signals emanate from several functionally distinct regions. Areas 17–19 and 22 are areas in the occipital cortex thought to be responsible for high-level visual functions such as recognition. The superior colliculus bears afferents emanating directly from the retina, particularly from peripheral regions conveyed through the magnocellular pathway. The semicircular canals react to head movements in three-dimensional space. All three areas (i.e., the occipital cortex, the superior colliculus, and the semicircular canals) convey efferents to the eye muscles through the mesencephalic and pontine reticular formations. Classification of observed eye movement signals relies in part on the known functional characteristics of these cortical regions.
Two pertinent observations regarding eye movements can be drawn from the oculomotor plant’s organization:
1. The eye movement system is, to a large extent, a feedback circuit.
2. Signals controlling eye movement emanate from cortical regions that can be functionally categorized as voluntary (occipital cortex), involuntary (superior colliculus), and reflexive (semicircular canals).

Fig. 4.1. Extrinsic muscles of the eye. Left (view from above): 1, superior rectus; 2, levator palbebrae superioris; 3, lateral rectus; 4, medial rectus; 5, superior oblique; 6, reflected tendon of the superior oblique; 7, annulus of Zinn. Right (lateral view): 8, inferior rectus; 9, inferior oblique. Adapted from Davson (1980) with permission © 1980 Academic Press.
The feedback-like circuitry is utilized mainly in the types of eye movements requiring stabilization of the eye. Orbital equilibrium is necessary for the steady retinal projection of an object, concomitant with the object's motion and movements of the head. Stability is maintained by a neuronal control system.
4.2 Saccades
Saccades are rapid eye movements used in repositioning the fovea to a new location in the visual environment. The term comes from an old French word meaning "flick of a sail" (Gregory, 1990). Saccadic movements are both voluntary and reflexive: they can be voluntarily executed, or they can be invoked as a corrective optokinetic or vestibular measure (see below). Saccades range in duration from 10 ms to 100 ms, which is a sufficiently short duration to render the executor effectively blind during the transition (Shebilske & Fisher, 1983). There is some debate over the underlying neuronal system driving saccades. Saccades have been deemed ballistic and stereotyped. The term stereotyped refers to the observation that particular movement patterns can be evoked repeatedly. The term ballistic refers to the presumption that saccade destinations are preprogrammed; that is, once the saccadic movement to the next desired fixation location has been calculated (programming latencies of about 200 ms have been reported), saccades cannot be altered. One reason behind this presumption is that, during saccade execution, there is insufficient time for visual feedback to guide the eye to its final position (Carpenter, 1977). On the other hand, a saccadic feedback system is plausible if it is assumed that instead of visual feedback, an internal copy of head, eye, and target position is used to guide the eyes during a saccade (Laurutis & Robinson, 1986; Fuchs et al., 1985). Due to their fast velocities, saccades may only appear to be ballistic (Zee et al., 1976).

Fig. 4.2. Schematic of the major known elements of the oculomotor system. CBT, corticobulbar tract; CER, cerebellum; ICTT, internal corticotectal tract; LG, lateral geniculate body; MLF, medial longitudinal fasciculus; MRF, mesencephalic and pontine reticular formations; PT, pretectal nuclei; SA, stretch afferents from extraocular muscles; SC, superior colliculi; SCC, semicircular canals; T, tegmental nuclei; VN, vestibular nuclei; II, optic nerve; III, IV, and VI, the oculomotor, trochlear, and abducens nuclei and nerves; 17, 18, 19, 22, primary and association visual areas, occipital and parietal (Brodmann); 8, the frontal eye fields. Adapted from Robinson (1968) with permission © 1968 IEEE.
Various models for saccadic programming have been proposed (Findlay, 1992). These models, with the exception of ones including "center-of-gravity" coding (see, e.g., He and Kowler (1989)), may inadequately predict unchangeable saccade paths. Instead, saccadic feedback systems based on an internal representation of target position may be more plausible because they tend to correctly predict the so-called double-step experimental paradigm. The double-step paradigm is an experiment where target position is changed while a saccade is in midflight. Fuchs et al. (1985) proposed a refinement of Robinson's feedback model which is based on a signal provided by the superior colliculus and a local feedback loop. The local loop generates feedback in the form of motor error, produced by subtracting eye position from a mental target-in-space position. Sparks and Mays (1990) cite compelling evidence that intermediate and deep layers of the SC contain neurons that are critical components of the neural circuitry initiating and controlling saccadic movements. These layers of the SC receive inputs from cortical regions involved in the analysis of sensory (visual, auditory, and somatosensory) signals used to guide saccades. The authors also rely on implications of Listing's and Donders' laws, which specify an essentially null torsion component in eye movements, requiring virtually only two degrees of freedom for saccadic eye motions (Davson, 1980; Sparks & Mays, 1990). According to these laws, motions can be resolved into rotations about the horizontal x- and vertical y-axes.
Models of saccadic generation attempt to provide an explanation of the underlying mechanism responsible for generating the signals sent to the motor neurons. Although there is some debate as to the source of the saccadic program, the observed signal resembles a pulse/step function (Sparks & Mays, 1990). The pulse/step function refers to a dual velocity and position command to the extraocular muscles (Leigh & Zee, 1991). A possible simple representation of a saccadic step signal is a differentiation filter. Carpenter (1977) suggests such a possible filter arrangement for generating saccades coupled with an integrator. The integrating filter is in place to model the necessary conversion of velocity-coded information to position-coded signals (Leigh & Zee, 1991). A perfect neural integrator converts a pulse signal to a step function. An imperfect integrator (called leaky) will generate a signal resembling a decaying exponential function. The principle of this type of neural integration applies to all types of conjugate eye movements. Neural circuits connecting structures in the brain stem and the cerebellum exist to perform integration of coupled eye movements including saccades, smooth pursuits, and vestibular and optokinetic nystagmus (see below; Leigh and Zee (1991)).
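A minimal numerical sketch of this idea (with an assumed leak factor, not a physiological value) shows a perfect integrator converting a pulse to a step, while a leaky integrator's response decays exponentially:

import numpy as np

pulse = np.zeros(100)
pulse[10] = 1.0                          # velocity-coded pulse input

perfect = np.cumsum(pulse)               # perfect integrator: pulse -> step

leak = 0.95                              # assumed leak factor (< 1 means leaky)
leaky = np.zeros_like(pulse)
for t in range(1, len(pulse)):
    leaky[t] = leak * leaky[t - 1] + pulse[t]   # step decays exponentially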
A differentiation filter can be modeled by a linear filter as shown in Figure 4.3. In the time domain, the linear filter is modeled by the following equation
x_t = g_0 s_t + g_1 s_{t−1} + · · · = ∑_{k=0}^{∞} g_k s_{t−k},

where s_t is the input (pulse), x_t is the output (step), and g_k are the filter coefficients. To ensure differentiation, the filter coefficients typically must satisfy properties that approximate mathematical differentiation. An example of such a filter is the Haar filter with coefficients {1, −1}. Under the z-transform the transfer function X(z)/S(z) of this linear filter is

x_t = g_0 s_t + g_1 s_{t−1}
    = (1) s_t + (−1) s_{t−1}
    = (1) s_t + (−1) z s_t
    = (1 − z) s_t,

so that X(z) = (1 − z) S(z), giving

X(z)/S(z) = 1 − z.
The Haar filter is a length-2 filter which approximates the first derivative between successive pairs of inputs.
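As a minimal sketch (assuming the moving-average form above), the following Python fragment applies the Haar coefficients {1, −1} to a step signal; the output is a pulse at the transition, and a perfect integrator (cumulative sum) recovers the step:

import numpy as np

def moving_average(s, g):
    # x_t = sum_k g[k] * s[t-k], the linear moving-average model above
    x = np.zeros(len(s))
    for t in range(len(s)):
        for k, gk in enumerate(g):
            if t - k >= 0:
                x[t] += gk * s[t - k]
    return x

s = np.concatenate([np.zeros(5), np.ones(5)])   # step input
pulse = moving_average(s, [1.0, -1.0])          # Haar filter: differentiates
step = np.cumsum(pulse)                         # integrator recovers the step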
Fig. 4.3. Diagram of a simple linear filter modeling saccadic movements.
4.3 Smooth Pursuits
Pursuit movements are involved when visually tracking a moving target. Depending on the range of target motion, the eyes are capable of matching the velocity of the moving target. Pursuit movements provide an example of a control system with built-in negative feedback (Carpenter, 1977). A simple closed-loop feedback model of pursuit movements is shown in Figure 4.4, where s_t is the target position, x_t is the (desired) eye position, and h is the (linear, time-invariant) filter, or gain of the system (Carpenter, 1977; Leigh & Zee, 1991). Tracing the loop from the feedback start point gives the following equation in the time domain:

h(s_t − x_t) = x_{t+1}.

Under the z-transform the transfer function X(z)/S(z) of this linear system is

H(z)(S(z) − X(z)) = X(z)
H(z)S(z) = X(z)(1 + H(z))
H(z)/(1 + H(z)) = X(z)/S(z).
In the closed-loop feedback model, signals from visual receptors constitute the error signal indicating needed compensation to match the target’s retinal image motion.
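As a minimal simulation sketch (with h reduced to an assumed constant gain), the recurrence x_{t+1} = h(s_t − x_t) can be iterated directly; for a constant target, the steady-state output reproduces the transfer function's gain h/(1 + h). Note that this discrete iteration converges only for h < 1:

import numpy as np

h = 0.8                               # assumed constant gain (h < 1 for stability)
s = np.ones(100)                      # target held at position 1.0
x = np.zeros_like(s)                  # eye position
for t in range(len(s) - 1):
    x[t + 1] = h * (s[t] - x[t])      # retinal error drives the eye

print(x[-1], h / (1 + h))             # both ~0.444: matches H/(1+H)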
Fig. 4.4. Diagram of a simple linear feedback model of smooth pursuit movements.
4.4 Fixations
Fixations are eye movements that stabilize the retina over a stationary object of interest. It seems intuitive that fixations should be generated by the same neuronal circuit controlling smooth pursuits, with fixations being a special case of a target moving at zero velocity; this is probably incorrect (Leigh & Zee, 1991, pp. 139–140). Fixations, instead, are characterized by miniature eye movements: tremor, drift, and microsaccades. This is a somewhat counterintuitive consequence of the visual system's motion-sensitive single-cell physiology: microsaccades are made precisely because of that motion sensitivity. Microsaccades are eye movement signals that are more or less spatially random, varying over 1 to 2 minutes of arc in amplitude. The counterintuitive fact regarding fixations is that if an image is artificially stabilized on the retina, vision fades away within about a second and the scene becomes blank.
Miniature eye movements that effectively characterize fixations may be considered noise present in the control system (possibly distinct from the smooth pursuit circuit) attempting to hold gaze steady. This noise appears as a random fluctuation about the area of fixation, typically no larger than 5◦ visual angle (Carpenter, 1977, p. 105). Although the classification of miniature movements as noise may be an oversimplification of the underlying natural process, it allows the signal to be modeled by a feedback system similar to the one shown in Figure 4.4. The additive noise is represented by e_t = s_t − x_t, where the (desired) eye position x_t is subtracted from the steady fixation position s_t at the summing junction. In this model, the error signal stimulates the fixation system in a manner similar to the smooth pursuit system, except that here e_t is an error-position signal instead of an error-velocity signal (see Leigh and Zee (1991, p. 150)). The feedback system modeling fixations, using the noisy "data reduction" method, is in fact simpler than the pursuit model because it implicitly assumes a stationary stochastic process (Carpenter, 1977, p. 107). Stationarity in the statistical sense refers to a process with constant mean. Other relevant statistical measures of fixations include their duration range of 150 ms to 600 ms, and the observation that 90% of viewing time is devoted to fixations (Irwin, 1992).
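A minimal sketch of this noisy position-error model (all parameters assumed, not fitted): gaze is held at a fixation position while miniature movements enter as additive noise, and the resulting samples form a roughly stationary process whose mean estimates the fixation position:

import numpy as np

rng = np.random.default_rng(0)
s = 5.0                                       # steady fixation position (degrees)
h = 0.5                                       # assumed corrective gain
x = np.zeros(600)                             # eye position samples
for t in range(len(x) - 1):
    e = (s - x[t]) + rng.normal(scale=0.03)   # error-position signal + noise
    x[t + 1] = x[t] + h * e

print(x[100:].mean())                         # ~5.0: constant mean (stationarity)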
4.5 Nystagmus
Nystagmus eye movements are conjugate eye movements characterized by a sawtooth-like time course (time series signal) pattern. Optokinetic nystagmus is a smooth pursuit movement interspersed with saccades invoked to compensate for the retinal movement of the target. The smooth pursuit component of optokinetic nystagmus appears in the slow phase of the signal (Robinson, 1968). Vestibular nystagmus is a similar type of eye movement compensating for the movement of the head. The time course of vestibular nystagmus is virtually indistinguishable from its optokinetic counterpart (Carpenter, 1977).
4.6 Implications for Eye Movement Analysis
From the above discussion, two significant observations relevant to eye movement analysis can be made. First, based on the functionality of eye movements, only three types of movements need be modeled to gain insight into the overt localization of visual attention: fixations, smooth pursuits, and saccades. Second, based on signal characteristics and plausible underlying neural circuitry, all three types of eye movements may be approximated by a Linear, Time-Invariant (LTI) system (i.e., a linear filter; for examples of linear filters applicable to saccade detection, see Chapter 12).
The primary requirement of eye movement analysis, in the context of gaze-contingent system design, is the identification of fixations, saccades, and smooth pursuits. It is assumed that these movements provide evidence of voluntary, overt visual attention. This assumption does not preclude the plausible involuntary utility of these movements, or conversely, the covert nonuse of these eye movements (e.g., as in the case of parafoveal attention). Fixations naturally correspond to the desire to maintain one’s gaze on an object of interest. Similarly, pursuits are used in the same manner for objects in smooth motion. Saccades are considered manifestations of the desire to voluntarily change the focus of attention.
4.7 Summary and Further Reading
This chapter presented a taxonomy of eye movements and included linear models of eye movement signals suitable for eye movement analysis (see also Chapter 12).
With the exception of Carpenter's widely referenced text (Carpenter, 1977), there appears to be no single suitable introductory text discussing eye movements exclusively. Instead, there are various texts on perception, cognition, and neuroscience which often include a chapter or section on the topic. There are also various collections of technical papers on eye movements, usually assembled from proceedings of focused symposia or conferences. A series of such books was produced by John Senders et al. in the 1970s and 1980s (see, e.g., Monty and Senders (1976) and Fisher et al. (1981)). This conference series has recently been revived in the form of the (currently biennial) Eye Tracking Research & Applications (ETRA) conference.
A large amount of work has been performed on studying eye movements in the context of reading. For a good introduction to this literature, see Rayner (1992).
Part II Eye Tracking Systems
5 Eye Tracking Techniques
The device most often used for measuring eye movements is commonly known as an eye tracker. In general, there are two types of eye movement monitoring techniques: those that measure the position of the eye relative to the head, and those that measure the orientation of the eye in space, or the "point of regard" (Young & Sheena, 1975). The latter measurement is typically used when the concern is the identification of elements in a visual scene, e.g., in (graphical) interactive applications. Possibly the most widely applied apparatus for measurement of the point of regard is the video-based corneal reflection eye tracker. In this chapter, most of the popular eye movement measurement techniques are briefly discussed first before covering video-based trackers in greater detail.
There are four broad categories of eye movement measurement methodologies involving the use or measurement of: Electro-OculoGraphy (EOG), scleral contact lens/search coil, Photo-OculoGraphy (POG) or Video-OculoGraphy (VOG), and video-based combined pupil and corneal reflection.
Electro-oculography relies on (d.c. signal) recordings of the electric potential differences of the skin surrounding the ocular cavity. During the mid-1970s, this technique was the most widely applied eye movement method (Young & Sheena, 1975). Today, possibly the most widely applied eye movement technique, primarily used for point of regard measurements, is the method based on corneal reflection.
The first method for objective eye measurements using corneal reflection was reported in 1901 (Robinson, 1968). To improve accuracy, techniques using a contact lens were developed in the 1950s. Devices attached to the contact lens ranged from small mirrors to coils of wire. Measurement devices relying on physical contact with the eyeball generally provide very sensitive measurements. The obvious drawback of these devices is their invasive requirement of wearing the contact lens. So-called noninvasive (sometimes called remote) eye trackers typically rely on the measurement of visible features of the eye, e.g., the pupil, iris–sclera boundary, or a corneal reflection of a closely positioned, directed light source. These techniques often involve either manual or automatic (computer-based) analysis of video recordings of the movements of the eyes, either off-line or in real-time. The availability of fast image processing hardware has facilitated the development of real-time video-based point of regard turnkey systems.
5.1 Electro-OculoGraphy (EOG)
Electro-oculography, the most widely applied eye movement recording method some 40 years ago (and still used today), relies on measurement of the skin's electric potential differences, recorded by electrodes placed around the eye. A picture of a subject wearing the EOG apparatus is shown in Figure 5.1. The recorded potentials are in the range 15–200 µV, with nominal sensitivities on the order of 20 µV/deg of eye movement. This technique measures eye movements relative to head position, and so is not generally suitable for point of regard measurements unless head position is also measured (e.g., using a head tracker).
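Given the nominal sensitivity quoted above, converting a recorded potential to an approximate eye rotation is a one-line computation; a minimal sketch, assuming an ideal, linear, drift-free 20 µV/deg response (real EOG signals drift and require calibration):

def eog_to_degrees(potential_uv, sensitivity_uv_per_deg=20.0):
    # Approximate eye rotation from an EOG potential, assuming a
    # linear, drift-free response (an idealization).
    return potential_uv / sensitivity_uv_per_deg

print(eog_to_degrees(100.0))   # 100 uV -> ~5 degrees of eye movement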
Fig. 5.1. Example of electro-oculography (EOG) eye movement measurement. Courtesy of MetroVision, Pérenchies, France <http://www.metrovision.fr>. Reproduced with permission.
5.2 Scleral Contact Lens/Search Coil
One of the most precise eye movement measurement methods involves attaching a mechanical or optical reference object mounted on a contact lens which is then worn directly on the eye. Such early recordings (ca. 1898; Young and Sheena (1975)) used a plaster of Paris ring attached directly to the cornea and, through mechanical linkages, to recording pens. This technique evolved to the use of a modern contact lens to which a mounting stalk is attached. The contact lens is necessarily large, extending over the cornea and sclera (the lens is subject to slippage if it only covers the cornea). Various mechanical or optical devices have been placed on the stalk attached to the lens: reflecting phosphors, line diagrams, and wire coils have been the most popular implements in magneto-optical configurations. The principal method employs a wire coil, which is then measured moving through an electromagnetic field.1 A picture of the search coil embedded in a scleral contact lens and the electromagnetic field frame are shown in Figure 5.2. The manner of insertion of the contact lens is shown in Figure 5.3. Although the scleral search coil is the most precise eye movement measurement method (accurate to about 5–10 arc-seconds over a limited range of about 5◦; Young and Sheena (1975)), it is also the most intrusive method. Insertion of the lens requires care and practice, and wearing of the lens causes discomfort. This method also measures eye position relative to the head, and is not generally suitable for point of regard measurement.

1 This is similar in principle to magnetic position/orientation trackers often employed in virtual reality applications; e.g., Ascension's Flock Of Birds (FOB) uses this type of method for tracking the position/orientation of the head. See Chapter 7.
Fig. 5.2. Example of search coil embedded in contact lens and electromagnetic field frames for search coil eye movement measurement. Courtesy of Skalar Medical, Delft, The Netherlands <http://www.skalar.nl>. Reproduced with permission.
5.3 Photo-OculoGraphy (POG) or Video-OculoGraphy (VOG)
This category groups together a wide variety of eye movement recording techniques involving the measurement of distinguishable features of the eyes under rotation/translation, e.g., the apparent shape of the pupil, the position of the limbus (the iris–sclera boundary), and corneal reflections of a closely situated directed light source (often infra-red). Although different in approach, these techniques are grouped here because they often do not provide point of regard measurement. Examples of apparatus and recorded images of the eye used in photo- or video-oculography and/or limbus tracking are shown in Figure 5.4. Measurement of ocular features provided by these measurement techniques may or may not be made automatically, and may involve visual inspection of recorded eye movements (typically recorded on videotape). Visual assessment performed manually (e.g., stepping through a videotape frame-by-frame) can be extremely tedious and prone to error, and is limited to the temporal sampling rate of the video device.

Fig. 5.3. Example of scleral suction ring insertion for search coil eye movement measurement. Courtesy of Skalar Medical, Delft, The Netherlands <http://www.skalar.nl>. Reproduced with permission.
Automatic limbus tracking often involves the use of photodiodes mounted on spectacle frames (see Figure 5.4b and c), and almost always involves the use of invisible (usually infra-red) illumination (see Figure 5.4(d)). Several of these methods require the head to be fixed, e.g., either by using a head/chin rest or a bite bar (Young & Sheena, 1975).
5.4 Video-Based Combined Pupil/Corneal Reflection
Although the above techniques are in general suitable for eye movement measurements, they do not often provide point of regard measurement. To provide this measurement, either the head must be fixed so that the eye’s position relative to the head and point of regard coincide, or multiple ocular features must be measured in order to disambiguate head movement from eye rotation. Two such features are the corneal reflection (of a light source, usually infra-red) and the pupil center (see Figure 5.4d).
Video-based trackers utilize relatively inexpensive cameras and image processing hardware to compute the point of regard in real-time. The apparatus may be table-mounted, as shown in Figure 5.5, or worn on the head, as shown in Figure 5.6. The optics of both table-mounted and head-mounted systems are essentially identical, with the exception of size. These devices, which are becoming increasingly available, are most suitable for use in interactive systems.
The corneal reflection of the light source (typically infra-red) is measured relative to the location of the pupil center. Corneal reflections are known as the Purkinje reflections, or Purkinje images (Crane, 1994). Due to the construction of the eye, four Purkinje reflections are formed, as shown in Figure 5.7. Video-based eye trackers typically locate the first Purkinje image. With appropriate calibration procedures, these eye trackers are capable of measuring a viewer's Point Of Regard (POR) on a suitably positioned (perpendicularly planar) surface on which calibration points are displayed.

Fig. 5.4. Examples of pupil, limbus, and corneal infra-red (IR) reflection eye movement measurements: (a) apparent pupil size (courtesy of MetroVision, Pérenchies, France <http://www.metrovision.fr>); (b) infra-red limbus tracker apparatus (courtesy of Applied Science Laboratories (ASL), Bedford, MA <http://www.a-s-l.com>); (c) another infra-red limbus tracker apparatus, as worn by subject (courtesy of Microguide, Downers Grove, IL <http://www.eyemove.com>); (d) "bright pupil" (and corneal reflection) illuminated by infra-red light (courtesy of LC Technologies, Fairfax, VA <http://www.eyegaze.com>). All images reproduced with permission.

Fig. 5.5. Example of table-mounted video-based eye tracker: (a) operator; (b) subject.

Fig. 5.6. Example of head-mounted video-based eye tracker. Courtesy of IOTA AB, EyeTrace Systems, Sundsvall Business & Tech. Center, Sundsvall, Sweden <http://www.iota.se>. Reproduced with permission.
Two points of reference on the eye are needed to separate eye movements from head movements. The positional difference between the pupil center and corneal reflection changes with pure eye rotation, but remains relatively constant with minor head movements. Approximate relative positions of the pupil and first Purkinje reflections are graphically shown in Figure 5.8, as the left eye rotates to fixate nine correspondingly placed calibration points. The Purkinje reflection is shown as a small white circle in close proximity to the pupil, represented by a black circle. Because the infra-red light source is usually placed at some fixed position relative to the eye, the Purkinje image is relatively stable whereas the eyeball, and hence the pupil, rotates in its orbit. So-called generation-V eye trackers also measure the fourth Purkinje image (Crane & Steele, 1985). By measuring the first and fourth Purkinje reflections, these Dual-Purkinje Image (DPI) eye trackers separate translational and rotational eye movements. Both reflections move together through exactly the same distance upon eye translation, but the images move through different distances (thus changing their separation) upon eye rotation. This device is shown in Figure 5.9. Unfortunately, although the DPI eye tracker is quite precise, head stabilization may be required.

Fig. 5.7. Purkinje images. PR, Purkinje reflections: 1, reflection from front surface of the cornea; 2, reflection from rear surface of the cornea; 3, reflection from front surface of the lens; 4, reflection from rear surface of the lens (almost the same size and formed in the same plane as the first Purkinje image, but due to the change in index of refraction at the rear of the lens, its intensity is less than 1% of that of the first Purkinje image); IL, incoming light; A, aqueous humor; C, cornea; S, sclera; V, vitreous humor; I, iris; L, lens; CR, center of rotation; EA, eye axis; a ≈ 6 mm; b ≈ 12.5 mm; c ≈ 13.5 mm; d ≈ 24 mm; r ≈ 7.8 mm (Crane, 1994). Reprinted in adapted form from Crane (1994) courtesy of Marcel Dekker, Inc. © 1994.

Fig. 5.8. Relative positions of pupil and first Purkinje images as seen by the eye tracker's camera.

Fig. 5.9. Dual-Purkinje eye tracker. Courtesy of Fourward Optical Technologies, Buena Vista, VA <http://www.fourward.com>. Reproduced with permission.
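The principle described above reduces to tracking the difference vector between the pupil center and the corneal reflection. A minimal sketch (hypothetical feature coordinates, with the calibration mapping left abstract):

def pupil_glint_vector(pupil_xy, glint_xy):
    # Difference vector: changes with eye rotation, but remains roughly
    # constant under small head translations (both features shift together).
    return (pupil_xy[0] - glint_xy[0], pupil_xy[1] - glint_xy[1])

# A calibration procedure would then map this vector to screen coordinates,
# e.g., via a function fitted over the nine calibration points of Figure 5.8.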
5.5 Classifying Eye Trackers in “Mocap” Terminology
For readers familiar with motion capture ("mocap") techniques used in the special effects film industry, it is worthwhile to compare the various eye tracking methodologies with traditional mocap devices. Similarities between the two applications are intuitive, and this is not surprising because the objective of both is recording the motion of objects in space. In eye tracking, the object measured is the eye, whereas in mocap, it is (usually) the joints of the body. Eye trackers can be grouped using the same classification employed to describe motion capture devices.
Electro-OculoGraphy (EOG) is essentially an electromechanical device: in mocap applications, sensors may be placed on the skin or joints; in eye tracking, sensors are placed on the skin around the eye cavity. Eye trackers using a contact lens are effectively electromagnetic trackers. The metallic stalk that is fixed to the contact lens is similar to the orthogonal coils of wire found in electromagnetic sensors used to obtain the position and orientation of limbs and head in virtual reality. Photo-oculography and video-oculography eye trackers are similar to the widely used optical motion capture devices in special effects film, video, and game production. In both cases a camera is used to record raw motion, which is then processed by (usually) digital means to calculate the motion path of the object being tracked. Finally, video-based corneal reflection eye trackers are similar to optical motion capture devices that use reflective markers (worn by the actors). In both cases, an infra-red light source is usually used, for the reason that it is invisible to the human eye, and hence nondistracting.
For a good introduction to motion capture for computer animation, as used in special effects film, video, and game production, see Menache (2000). Menache’s book does a good job of describing motion capture techniques, although it is primarily aimed at practitioners in the field of special effects and animation production. Still, the description of mocap techniques, even at a comparatively large scale (i.e., capturing human motion vs. motion of the eye), provides a good classification scheme for eye tracking techniques.
5.6 Summary and Further Reading
For a short review of early eye tracking methods, see Robinson (1968, II). For another, relatively more recent survey of eye tracking techniques, see Young and Sheena (1975). Although Young and Sheena's survey article is somewhat dated by today's standards, it is an excellent introductory article on eye movement recording techniques, and is still widely referenced. An up-to-date survey of eye tracking devices does not appear to be available, although the number of (video-based) eye tracking companies seems to be growing. Instead, one of the best comprehensive lists of eye tracking manufacturers is available on the Internet, the Eye Movement Equipment Database (EMED; see <http://ibs.derby.ac.uk/emed/>), initially set up by David Wooding of the Institute of Behavioural Sciences, University of Derby, United Kingdom.
6 Head-Mounted System Hardware Installation
Several types of eye trackers are available, ranging from scleral coils and EOG to video-based corneal reflection eye trackers. Although each has its benefits and drawbacks (e.g., accuracy vs. sampling rate), for graphical or interactive applications the video-based corneal reflection tracker is arguably the most practical device. These devices work by capturing video images of the eye (illuminated by an infra-red light source), processing the video frames (at video frame rates), and outputting the eye's x- and y-coordinates relative to the screen being viewed. The x- and y-coordinates are typically either stored by the eye tracker itself, or can be sent to the graphics host via serial cable. The advantage of the video-based eye tracker over other devices is that it is relatively noninvasive, fairly accurate (to about 1◦ visual angle over a 30◦ viewing range), and, for the most part, not difficult to integrate with the graphics system. The video-based tracker's chief limitation is its sampling frequency, typically limited by the video frame rate of 60 Hz. Hence, one can usually expect to receive eye movement samples about once every 16 ms (and typically with greater latency, because the eye tracker needs time to process each video frame and the graphics host needs time to update its display).
6.1 Integration Issues and Requirements
Integration of the eye tracker into a graphics system depends chiefly on proper delivery of the graphics video stream to the eye tracker and the subsequent reception of the eye tracker's calculated 2D eye coordinates. In the following description of system setup, a complete graphics system is described featuring two eye trackers: one a table-mounted, monocular eye tracker set underneath a standard television display, and the other a binocular eye tracker fitted inside a Head-Mounted Display (HMD). Both displays are powered by the same graphics host. Such a laboratory has been set up within Clemson University's Virtual Reality Laboratory, and is shown in Figure 6.1. In Figure 6.1b, the portion of the lab is shown where, on the right, the TV display is installed with the monocular table-mounted eye tracker positioned below. In Figure 6.1a, the HMD helmet is resting atop a small table in front of three d.c. electromagnetic tracking units. The eye tracker unit (a PC) is situated between the d.c. tracking units and the dual-head graphics display monitors to the right in the image. The PC's display is a small flat panel display just left of the dual graphics display monitors. Both HMD and TV (and graphics) displays are driven by the graphics host, which is out of view of the image (on the floor beneath the desk in front of the visible chair). The four small TV monitors atop the eye tracking PC in Figure 6.1a display the left and right scene images (what the user sees) and the left and right eye images (what the eye tracker sees).
The following system integration description is based on the particular hardware devices installed at Clemson’s Virtual Reality Eye Tracking (VRET) laboratory, described here for reference. The primary rendering engine is a Dell Dual 1.8 GHz Pentium 4 (Xeon) PC (1 GB RAM) running Fedora Core 4 and equipped with an nVidia GeForce4 FX5950U graphics card. The eye tracker is from ISCAN and the system includes both binocular cameras mounted within a Virtual Research V8 (high resolution) HMD as well as a monocular camera mounted on a remote pan-tilt unit. Both sets of optics function identically, capturing video images of the eye and sending the video signal to the eye tracking PC for processing. The pan-tilt unit coupled with the remote table-mounted camera/light assembly is used to noninvasively track the eye in real-time as the subject’s head moves. This allows limited head movement, but a chin rest is usually used to restrict the position of the subject’s head during experimentation to improve accuracy. The V8 HMD offers 640 × 480 resolution per eye with separate left and right eye feeds. HMD position and orientation tracking is provided by an Ascension 6 Degree-Of-Freedom (6 DOF) Flock Of Birds (FOB), a d.c. electromagnetic system with a 10 ms latency per sensor. A 6 DOF tracked, handheld mouse provides the user with directional motion control. The HMD is equipped with headphones for audio localization.
Although the following integration and installation guidelines are based on the equipment available at the Clemson VRET lab, the instructions should apply to practically any video-based corneal reflection eye tracker. Of primary importance to proper system integration are the following:
1. Knowledge of the video format the eye tracker requires as input (e.g., NTSC or VGA)
2. Knowledge of the data format the eye tracker generates as its output
The first point is crucial to providing the proper image to the user as well as to the eye tracker. The eye tracker requires input of the scene signal so that it can overlay the calculated point of regard for display on the scene monitor which is viewed by the operator (experimenter). The second requirement is needed for proper interpretation of the point of regard by the host system. This is usually provided by the eye tracker manufacturer. The host system will need to be furnished with a device driver to read and interpret this information.
Secondary requirements for proper integration are the following.
1. The capability of the eye tracker to provide fine-grained cursor control over its calibration stimulus (a crosshair or other symbol)
2. The capability of the eye tracker to transmit its operating mode to the host along with the eye x- and y-point of regard information

Fig. 6.1. Virtual Reality Eye Tracking (VRET) lab at Clemson University: (a) HMD-mounted binocular eye tracker (panorama photostitch image); (b) table-mounted monocular eye tracker.
These two points are both related to proper alignment of the graphics display with the eye tracking scene image and calibration point. Essentially, both eye tracker capabilities are required to properly map between the graphics and eye tracking viewport coordinates. This alignment is carried out in two stages. First, the operator can use the eye tracker's own calibration stimulus to read out the extents of the eye tracker's displays. The eye tracker resolution may be specified over a 512 × 512 range; however, in practice it may be difficult to generate a graphics window that will exactly match the dimensions of the video display. Differences on the order of one or two pixels will ruin proper alignment. Therefore, it is good practice to first display a blank graphics window, then use the eye tracker cursor to measure the extents of the window in the eye tracker's reference frame. Because this measurement should be as precise as possible, fine cursor control is needed. Second, the eye tracker operates in three primary modes: reset (inactive), calibration, and run. The graphics program must synchronize with these modes, so that a proper display can be generated in each mode:
• Reset: display nothing (black screen) or a single calibration point.
• Calibration: display a simple small stimulus for calibration (e.g., a small dot or circle) at the position of the eye tracker's calibration stimulus.
• Run: display the stimulus required over which eye movements are to be recorded.
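In pseudocode terms, the graphics program's display loop simply branches on the reported mode. A minimal sketch with hypothetical names (the actual mode codes and data format are tracker-specific):

RESET, CALIBRATE, RUN = 0, 1, 2        # hypothetical mode codes from the tracker

def render(mode, calib_point, draw_blank, draw_dot, draw_stimulus):
    if mode == RESET:
        draw_blank()                   # black screen (or a single point)
    elif mode == CALIBRATE:
        draw_dot(*calib_point)         # echo the tracker's calibration stimulus
    elif mode == RUN:
        draw_stimulus()                # stimulus over which gaze is recorded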
It is of the utmost importance that proper alignment is achieved between the eye tracker’s calibration stimulus points and those of the graphics’ display. Without this alignment, the data returned by the eye tracker will be meaningless.
Proper alignment of the graphics and eye tracking reference frames is achieved through a simple linear mapping between eye tracking and graphics window coordinates. This is discussed in Chapter 7, following notes on eye tracking system installation and wiring.
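Anticipating that discussion, the mapping itself is a simple linear interpolation between the measured eye tracker extents and the graphics window. A sketch under assumed variable names (the extents are those measured with the tracker's cursor as described above):

def tracker_to_window(ex, ey, left, top, right, bottom, width, height):
    # Linearly map a tracker point (ex, ey), given the measured extents
    # of the graphics window in tracker coordinates, to window pixels.
    x = width * (ex - left) / (right - left)
    y = height * (ey - top) / (bottom - top)
    return x, y

print(tracker_to_window(256, 256, 0, 0, 512, 512, 640, 480))   # (320.0, 240.0)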
6.2 System Installation
The two primary installation considerations are wiring of the video and serial line cables between the graphics host and eye tracking systems. The connection of the serial cable is comparatively simple. Generation of the software driver for data interpretation is also straightforward, and is usually facilitated by the vendor’s description of the data format and transmission rate. In contrast, what initially may pose the most problems is the connection of the video signal. It is imperative that the graphics host can generate a video signal in the format expected by both the graphics display (e.g., television set or HMD unit) and the eye tracker.
In the simplest case, if the host computer is capable of generating a video signal that is suitable for both the stimulus display and eye tracker, all that is then needed is a video splitter which will feed into both the stimulus display and eye tracker. For example, assume that the stimulus display is driven by an NTSC signal (e.g., a television), and the host computer is capable of generating a display signal in this format. (This was possible at the Clemson VRET lab because the SGI host can send a copy of whatever is on the graphics display via proper use of the ircombine command.) If the eye tracker can also be driven by an NTSC signal, then the installation is straightforward. If, however, the stimulus display is driven by VGA video (e.g., an HMD), but the eye tracker is driven by NTSC video, then the matter is somewhat more complicated. Consider the wiring diagram given in Figure 6.2. This schematic shows the dual display components, the HMD and TV, used for binocular eye tracking in a virtual reality environment, and monocular eye tracking over a 2D image or video display. The diagram features the necessary wiring of both left and right video channels from the host to the HMD and eye tracker, and a copy of the left video channel sent to the TV through the host's NTSC video output.
|
||
Fig. 6.2. Video signal wiring of the VRET lab at Clemson University.
The HMD is driven by a horizontal-sync (h-sync) VGA signal. A switchbox is used (seen in Figure 6.1a just above the eye tracker keyboard) to switch the VGA display between the dual-head graphics monitors and the HMD. The HMD video control box diverts a copy of the left video channel through an active pass-through splitter back through the switchbox to the left graphics display monitor. The switchbox effectively "steals" the signal meant for the graphics displays and sends it to the HMD. The left switch on the switchbox has two settings: monitor or HMD. The right switch on the switchbox has three settings: monitor, HMD, or both. If the right switch is set to monitor, no signal is sent to the HMD, effectively providing a biocular display in the HMD (instead of a binocular, or stereoscopic, display). If the right switch is set to HMD, the graphics display blanks out because the HMD does not provide an active pass-through of the right video channel. If the right switch is set to both, the right video channel is simply split between the HMD and the monitor, resulting in a binocular display in both the HMD and on the monitors. This last setting provides no amplification of the signal, and hence both the right LCD in the HMD and the right graphics monitor appear dim. This is mostly used for testing purposes.

The entire video circuit between the graphics host, the switchbox, and the HMD is VGA video. The eye tracker, however, operates on NTSC. This is the reason for the two VGA/NTSC converters that are inserted into the video path. These converters output an NTSC signal to the eye tracker and also provide active pass-throughs for the VGA signal so that, when in operation, the VGA signal appears undisrupted. The eye tracker then processes the video signals of the scene and outputs the signal to the scene monitors, with its own overlaid signal containing the calculated point of regard (represented by a crosshair cursor). These two small displays show the operator what is in the user's field of view as well as what she or he is looking at.

The eye images, in general, do not pose a complication, since this video signal is exclusively processed by the eye tracker. In the case of the eye tracker at Clemson's VRET lab, both the binocular HMD eye cameras and the table-mounted monocular camera are NTSC, and the signals feed directly into the eye tracker.
6.3 Lessons Learned from the Installation at Clemson
The eye tracker at Clemson contains hardware to process dual left and right eye and scene images. It can be switched to operate in monocular mode, for which it requires just the left eye and scene images. In this case, a simple video switch is used to switch the signal between the eye image generated by the left camera in the HMD and the camera on the table-mounted unit.

The first installation at Clemson used display monitors from Silicon Graphics, Inc. (SGI) that were driven either by VGA video or by the default R, G, B video delivered over 13W3 video cables. To drive the HMD, VGA video was required, connected by HD15 cables. To connect the video devices properly, special 13W3-HD15 cables were needed. Although this seems trivial, custom-built cables were required. These cables were not cheap, and took a day or two to build and deliver. If timing and finances are a consideration, planning of the system down to the proper cabling is a must! (Today this is less of a concern, although properly matching cabling is still an issue, this time concerning not so much the video signal as the Keyboard Video Monitor (KVM) switch; see Chapter 9.)

A problem that was particularly difficult to troubleshoot during the Clemson installation was the diagnosis of the format of the VGA signal emitted by the SGI host computer. Initially, before the eye tracker was installed, the HMD was tested for proper display. The output of the switchbox was used directly to drive the HMD. Everything functioned properly. However, after inserting the HMD into the video circuit, the eye tracker would not work. It was later found that the problem lay in the VGA/NTSC converters: these converters expect the more common VGA signal, which uses a timing signal synchronized to the horizontal video field (the h-sync signal; the horizontal and vertical sync signals are found on pins 13 and 14 of the VGA HD15 cable). The SGI host computer by default emits a sync-on-green VGA signal, leaving pins 13 and 14 devoid of a timing signal. The VR HMD contains circuitry that will read either h-sync or sync-on-green VGA video, and so functions quietly given either signal. The fault, as it turned out, was in an improper initial wiring of the switchbox. The switchbox was initially missing connections for pins 13 and 14 because these are not needed by a sync-on-green VGA signal. Once this was realized, the entire cable installation had to be disassembled and the switchbox had to be rewired. With the switchbox rewired, the custom 13W3 cables had to be inspected to confirm that they carried signals over pins 13 and 14. Finally, a new display configuration had to be created (using SGI's ircombine command) to drive the entire circuit with the horizontal sync signal instead of the default sync-on-green. The moral here is: be sure of the video format required by all components, down to the specific signals transmitted on individual cable pins!
6.4 Summary and Further Reading
This chapter presented key points from an eye tracker's installation and its integration into a primarily computer graphics system. Although it is perhaps difficult to generalize from this particular experience (a case study, if you will), there are two points that are considered key to successful installation and usage of the device:

• Signal routing
• Synchronization.

Signal routing refers to proper signal input and output, namely concerning video feeds (input to the eye tracker) and (typically RS-232) serial data capture and interpretation (output from the eye tracker). Synchronization refers to the proper coordinate mapping of the eye tracker's reference frame to the application responsible for generating the visual stimulus that will be seen by the observer. Proper mapping will ensure correct registration of both eye tracker calibration and subsequent data analysis. The actual data mapping is carried out in software (see the next chapter); however, the hardware component that facilitates proper software mapping is the eye tracker's capability of measuring the application's display extents in its own reference frame. This is usually provided by the eye tracker if it allows manual positioning and readout of its calibration cursor.

There are two primary sources where further information can be obtained on system installation and setup. First, of course, is the manufacturer's manual that will typically be included with the eye tracking device. Second, the best resource for installation and usage of the equipment is the users themselves. Users of eye trackers typically report, in a somewhat formal way, what they use and how they use it in most technical papers on eye tracking research. These can be found in various journals, such as Vision Research and Behavior Research Methods, Instruments, and Computers (BRMIC), and in conference proceedings. There are various conferences that deal with eye tracking, either directly or indirectly. For example, conferences that deal with computer graphics (e.g., SIGGRAPH, EuroGraphics, or Graphics Interface), human-computer interaction (e.g., SIGCHI), or virtual reality (e.g., VRST) may include papers that discuss the use of eye trackers, but their apparatus descriptions may be somewhat indirect insofar as the objective of the report typically does not deal with the eye tracker itself; rather, it concentrates more on the results obtained through its application. Examples of such applications are given in Part III of this text. At this time, there are two main conferences that deal more directly with eye tracking: the European Conference on Eye Movements (ECEM) and the U.S.-based Eye Tracking Research & Applications (ETRA). Finally, the eye movement email listservs eye-movements and eyemov-l are excellent on-line "gathering places" of eye tracking researchers. Here, questions on eye trackers may be answered directly (and usually promptly) by users of similar equipment or even by other vendors.

7 Head-Mounted System Software Development
In designing a graphical eye tracking application, the most important requirement is mapping of eye tracker coordinates to the appropriate application program's reference frame. The eye tracker calculates the viewer's Point Of Regard (POR) relative to the eye tracker's screen reference frame, e.g., a 512 × 512 pixel plane, perpendicular to the optical axis. That is, for each eye, the eye tracker returns a sample coordinate pair of x- and y-coordinates of the POR at each sampling cycle (e.g., once every ∼16 ms for a 60 Hz device). This coordinate pair must be mapped to the extents of the application program's viewing window.

If a binocular eye tracker is used, two coordinate sample pairs are returned during each sampling cycle: (xl, yl) for the left POR and (xr, yr) for the right. A Virtual Reality (VR) application must, in the same update cycle, also map the coordinates of the head's position and orientation tracking device (typically a 6 DOF tracker).

The following sections discuss the mapping of eye tracker screen coordinates to application program coordinates for both monocular and binocular applications. In the monocular case, a typical 2D image-viewing application is expected, and the coordinates are mapped to the 2D (orthogonal) viewport coordinates accordingly (the viewport coordinates are expected to match the dimensions of the image being displayed). In the binocular (VR) case, the eye tracker coordinates are mapped to the dimensions of the near viewing plane of the viewing frustum. For both calculations, a method of measuring the application's viewport dimensions in the eye tracker's reference frame is described, where the eye tracker's own (fine-resolution) cursor control is used to obtain the measurements. This information should be sufficient for most 2D image viewing applications.

For VR applications, subsequent sections describe the required mapping of the head position/orientation tracking device. Although this section should generalize to most kinds of 6 DOF tracking devices, the discussion in some cases is specific to the Ascension Flock Of Birds (FOB) d.c. electromagnetic 6 DOF tracker. This section discusses how to obtain the head-centric view and up vectors from the matrix returned by the tracker, and also explains the transformation of an arbitrary vector using the obtained transformation matrix.1 This latter derivation is used to transform the gaze vector to head-centric coordinates; the gaze vector is initially obtained from, and relative to, the binocular eye tracker's left and right POR measurements.

1 Equivalently, the rotation quaternion may be used, if these data are available from the head tracker.

Finally, a derivation of the calculation of the gaze vector in 3D is given, and a method is suggested for calculating the three-dimensional gaze point in VR.

7.1 Mapping Eye Tracker Screen Coordinates
When working with the eye tracker, the data obtained from the tracker must be mapped to a range appropriate to the given application. If working in VR, the 2D eye tracker data, expressed in eye tracker screen coordinates, must be mapped to the 2D dimensions of the near plane of the viewing frustum. If working with images, the 2D eye tracker data must be mapped to the 2D display image coordinates.

In general, if x ∈ [a, b] needs to be mapped to the range [c, d], we have:

    x′ = c + (x − a)(d − c) / (b − a).                               (7.1)
This is a linear mapping between two (one-dimensional) coordinate systems (or lines in this case). Equation (7.1) has a straightforward interpretation (a small C helper implementing it is given after the list):

1. The value x is translated (shifted) to its origin, by subtracting a.
2. The value (x − a) is then normalized by dividing through by the range (b − a).
3. The normalized value (x − a)/(b − a) is then scaled to the new range (d − c).
4. Finally, the new value is translated (shifted) to its proper relative location in the new range by adding c.
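As a concrete illustration, a minimal C sketch of Equation (7.1) follows; the function name linmap is an assumption for illustration, not taken from the text.

    #include <stdio.h>

    /* Linear mapping of Equation (7.1): map x in [a, b] to [c, d]. */
    static double linmap(double x, double a, double b, double c, double d)
    {
        return c + (x - a) * (d - c) / (b - a);
    }

    int main(void)
    {
        /* e.g., map an eye tracker x' in [0, 511] to a 640-pixel viewport */
        printf("%g\n", linmap(256.0, 0.0, 511.0, 0.0, 640.0));
        return 0;
    }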
7.1.1 Mapping Screen Coordinates to the 3D Viewing Frustum

The 3D viewing frustum employed in the perspective viewing transformations is defined by the parameters left, right, bottom, top, near, far, e.g., as used in the OpenGL function call glFrustum(). Figure 7.1 shows the dimensions of the eye tracker screen (left) and the dimensions of the viewing frustum (right). Note that the eye tracker origin is the top-left of the screen and the viewing frustum's origin is bottom-left (this is a common discrepancy between imaging and graphics). To convert the eye tracker coordinates (x′, y′) to graphics coordinates (x, y), using Equation (7.1), we have:

    x = left + x′(right − left) / 512                                (7.2)
    y = bottom + (512 − y′)(top − bottom) / 512.                     (7.3)

Note that the term (512 − y′) in Equation (7.3) handles the y-coordinate mirror transformation so that the top-left origin of the eye tracker screen is converted to the bottom-left origin of the viewing frustum.
Fig. 7.1. Eye tracker to VR mapping (eye tracker screen, origin (0, 0) top-left to (512, 512); frustum near plane, origin (left, bottom) to (right, top)).
If typical dimensions of the near plane of the graphics viewing frustum are 640 × 480, with the origin at (0, 0), then Equations (7.2) and (7.3) reduce to:

    x = x′(640)/512 = x′(1.25)                                       (7.4)
    y = (512 − y′)(480)/512 = (512 − y′)(0.9375).                    (7.5)
7.1.2 Mapping Screen Coordinates to the 2D Image

The linear mapping between eye tracker screen coordinates and 2D image coordinates is handled similarly. If the image dimensions are 640 × 480, then Equations (7.4) and (7.5) are used without change. Note that an image with dimensions 600 × 450 appears to fit the display of the TV in the Clemson VRET lab better.2 In this case, Equations (7.4) and (7.5) become:

    x = x′(600)/512 = x′(1.171875)                                   (7.6)
    y = (512 − y′)(450)/512 = (512 − y′)(0.87890625).                (7.7)

2 A few of the pixels of the image do not fit on the TV display, possibly due to the NTSC flicker-free filter used to encode the SGI console video signal.
7.1.3 Measuring Eye Tracker Screen Coordinate Extents
The above coordinate mapping procedures assume that the eye tracker coordinates are in the range [0, 511]. In reality, the usable, or effective, coordinates depend on: (a) the size of the application window, and (b) the position of the application window. For example, if an image display program displays an image in a 600 × 450 window, and this window is positioned arbitrarily on the graphics console, then the eye tracking coordinates of interest are restricted only to the area covered by the application window.

The following example illustrates the eye tracker/application program coordinate mapping problem using a schematic depicting the scene imaging display produced by an eye tracker manufactured by ISCAN. The ISCAN eye tracker is a fairly popular video-based corneal-reflection eye tracker which provides an image of the scene viewed by the subject, seen by the operator on small ISCAN black-and-white scene monitors. Most eye trackers provide this kind of functionality. Because the application program window is not likely to cover the entire scene display (as seen in the scene monitors), only a restricted range of eye tracker coordinates will be of relevance to the eye tracking experiment. To use the ISCAN as an example, consider a 600 × 450 image display application. Once the window is positioned so that it fully covers the viewing display, it may appear slightly off-center on the ISCAN black-and-white monitor, as sketched in Figure 7.2.

The trick to the calculation of the proper mapping between eye tracker and application coordinates is the measurement of the application window's extents in the eye tracker's reference frame. This is accomplished by using the eye tracker itself. One of the eye tracker's most useful features, if available for this purpose, is its ability to move its crosshair cursor. The software may allow fine (pixel-by-pixel) or coarse (jumps of ten pixels) cursor movement. Furthermore, a data screen at the bottom of the scene monitor may indicate the coordinates of the cursor, relative to the eye tracker's reference frame.

To obtain the extents of the application window in the eye tracker's reference frame, simply move the cursor to the corners of the application window and use these coordinates in the above mapping formulas. The following measurement "recipe" should provide an almost exact mapping, and only needs to be performed once. Assuming the application window's dimensions are fixed, the mapping obtained from this procedure can be hardcoded into the application. Here are the steps:
1. Position your display window so that it covers the display fully; e.g., in the case of the image stimulus, the window should just cover the entire viewing display (e.g., TV or virtual reality Head-Mounted Display (HMD)).
2. Use the eye tracker's cursor positioning utility to measure the viewing area's extents (toggle between coarse and fine cursor movement).
3. Calculate the mapping.
4. In Reset mode, position the eye tracker cursor in the middle of the display.

An important consequence of the above is that, following the mapping calculation, it should be possible to always position the application window in the same place, provided that the program displays the calibration point obtained from the eye tracker mapped to local image coordinates. When the application starts up (and the eye tracker is on and in Reset mode), simply position the application so that the central calibration points (displayed when the tracker is in Reset mode) line up.
To conclude this example, assume that if the ISCAN cursor is moved to the corners of the drawable application area, the measurements would appear as shown in Figure 7.2. Based on those measurements, the mapping is:

    x = (x′ − 51)(600) / (482 − 51 + 1)                              (7.8)
    y = 449 − (y′ − 53)(450) / (446 − 53 + 1).                       (7.9)

The central point on the ISCAN display is (267, 250). Note that y is subtracted from 449 to take care of the image/graphics vertical origin flip.
Fig. 7.2. Example mapping measurement (application window corners measured at (51, 53), (482, 53), (51, 446), and (482, 446); central point (267, 250), as shown in the scene monitor's data display).
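For completeness, a small C sketch hardcoding the measured mapping of Equations (7.8) and (7.9) follows, as the text suggests for a fixed window position; the function name is illustrative.

    #include <stdio.h>

    /* Hardcoded mapping of Equations (7.8)/(7.9): ISCAN measurements
     * (51,53)-(482,446) mapped to a 600 x 450 image, vertical origin
     * flipped. */
    static void iscan_to_image(double xp, double yp, double *x, double *y)
    {
        *x = (xp - 51.0) * 600.0 / (482.0 - 51.0 + 1.0);
        *y = 449.0 - (yp - 53.0) * 450.0 / (446.0 - 53.0 + 1.0);
    }

    int main(void)
    {
        double x, y;
        iscan_to_image(267.0, 250.0, &x, &y);  /* central point */
        printf("(%.1f, %.1f)\n", x, y);        /* approx. image center */
        return 0;
    }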
7.2 Mapping Flock Of Birds Tracker Coordinates
In virtual reality, the position and orientation of the head is typically delivered by a real-time head tracker. In our case, we have a Flock Of Birds (FOB) d.c. electromagnetic tracker from Ascension. The tracker reports 6 degree of freedom information regarding sensor position and orientation. The latter is given in terms of Euler angles. Euler angles determine the orientation of a vector in three-space by specifying the required rotations of the origin's coordinate axes. These angles are known by several names, but in essence each describes a particular rotation angle about one of the principal axes. Common names describing Euler angles are given in Table 7.1 and the geometry is sketched in Figure 7.3, where roll, pitch, and yaw angles are represented by R, E, and A, respectively. Each of these rotations is represented by the familiar homogeneous rotation matrices:
    Roll (rot z) = Rz = ⎡  cos R   sin R   0   0 ⎤
                        ⎢ −sin R   cos R   0   0 ⎥
                        ⎢    0       0     1   0 ⎥
                        ⎣    0       0     0   1 ⎦

    Pitch (rot x) = Rx = ⎡ 1     0       0     0 ⎤
                         ⎢ 0   cos E   sin E   0 ⎥
                         ⎢ 0  −sin E   cos E   0 ⎥
                         ⎣ 0     0       0     1 ⎦

    Yaw (rot y) = Ry = ⎡ cos A   0  −sin A   0 ⎤
                       ⎢   0     1     0     0 ⎥
                       ⎢ sin A   0   cos A   0 ⎥
                       ⎣   0     0     0     1 ⎦ .
The composite 4 × 4 matrix, containing all of the above transformations rolled into one, is:

    M = RzRxRy =
    ⎡  cos R cos A + sin R sin E sin A   sin R cos E   −cos R sin A + sin R sin E cos A   0 ⎤
    ⎢ −sin R cos A + cos R sin E sin A   cos R cos E    sin R sin A + cos R sin E cos A   0 ⎥
    ⎢            cos E sin A               −sin E                cos E cos A              0 ⎥
    ⎣                 0                       0                       0                   1 ⎦ .
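As a numerical cross-check of the composite matrix, a short C sketch that builds M = RzRxRy directly from the roll, pitch, and yaw angles is given below; the function and variable names are illustrative, not from the text.

    #include <math.h>
    #include <stdio.h>

    /* Build the composite Euler rotation M = Rz Rx Ry, with roll R,
     * pitch (elevation) E, and yaw (azimuth) A in radians.  Row-vector
     * convention, as in the text: v' = v * M. */
    static void build_M(double R, double E, double A, double M[4][4])
    {
        double cR = cos(R), sR = sin(R);
        double cE = cos(E), sE = sin(E);
        double cA = cos(A), sA = sin(A);
        double m[4][4] = {
            { cR*cA + sR*sE*sA,  sR*cE, -cR*sA + sR*sE*cA, 0.0 },
            {-sR*cA + cR*sE*sA,  cR*cE,  sR*sA + cR*sE*cA, 0.0 },
            { cE*sA,            -sE,     cE*cA,            0.0 },
            { 0.0,               0.0,    0.0,              1.0 }
        };
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                M[i][j] = m[i][j];
    }

    int main(void)
    {
        double M[4][4];
        build_M(0.0, 0.0, 0.0, M);
        printf("%g\n", M[0][0]);   /* identity for zero angles: 1 */
        return 0;
    }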
The FOB delivers a similar 4 × 4 matrix,

    F = ⎡            cos E cos A                        cos E sin A              −sin E       0 ⎤
        ⎢ −cos R sin A + sin R sin E cos A   cos R cos A + sin R sin E sin A   sin R cos E   0 ⎥
        ⎢  sin R sin A + cos R sin E cos A  −sin R cos A + cos R sin E sin A   cos R cos E   0 ⎥
        ⎣                 0                                 0                       0        1 ⎦ ,

where the matrix elements are slightly rearranged, such that

    M[i, j] = F[(i + 1)%3][(j + 1)%3],
e.g., row 1 of matrix M is now row 2 in matrix F, row 2 is now row 3, and row 3 is now row 1. Columns are interchanged similarly, where column 1 of matrix M is now column 2 in matrix F, column 2 is now column 3, and column 3 is now column 1. This “off-by-one” shift present in the Bird matrix may be due to the non-C style indexing which starts at 1 instead of 0.
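A minimal C sketch of undoing this shift, under the stated M[i, j] = F[(i + 1)%3][(j + 1)%3] relation, might read as follows; the function name is hypothetical.

    #include <stdio.h>

    /* Undo the "off-by-one" shift of the Bird matrix: recover M from F
     * via M[i][j] = F[(i+1)%3][(j+1)%3] over the 3x3 rotation part;
     * the homogeneous row and column carry over unchanged. */
    static void bird_to_M(const double F[4][4], double M[4][4])
    {
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                M[i][j] = F[(i + 1) % 3][(j + 1) % 3];
        for (int k = 0; k < 4; k++) {
            M[3][k] = F[3][k];
            M[k][3] = F[k][3];
        }
    }

    int main(void)
    {
        double F[4][4] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
        double M[4][4];
        bird_to_M(F, M);
        printf("%g\n", M[0][0]);   /* identity maps to identity: 1 */
        return 0;
    }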
Table 7.1. Euler angles.

    rot y       rot x       rot z
    Yaw         Pitch       Roll
    Azimuth     Elevation   Roll
    Longitude   Latitude    Roll
Fig. 7.3. Euler angles.
7.2.1 Obtaining the Transformed View Vector

To calculate the transformed view vector v′, assume the initial view vector is v = (0, 0, −1) (looking down the negative z-axis), and apply the composite transformation (in homogeneous coordinates):3

    v′ = vM = (0, 0, −1, 1) M = (−cos E sin A,  sin E,  −cos E cos A,  1),

which is simply the third row of M, negated. By inspection of the Bird matrix F, the view vector is simply obtained by selecting appropriate entries in the matrix; i.e.,

    v′ = (−F[0, 1], −F[0, 2], F[0, 0], 1).                           (7.10)

3 Recall the trigonometric identities sin(−θ) = −sin(θ) and cos(−θ) = cos(θ).
7.2.2 Obtaining the Transformed Up Vector

The transformed up vector is obtained in a manner similar to the transformed view vector. To calculate the transformed up vector u′, assume the initial up vector is u = (0, 1, 0) (looking up the y-axis), and apply the composite transformation (in homogeneous coordinates):

    u′ = uM = (0, 1, 0, 1) M
       = (−sin R cos A + cos R sin E sin A,  cos R cos E,  sin R sin A + cos R sin E cos A,  1).

By inspection of the Bird matrix F, the transformed up vector is simply obtained by selecting appropriate entries in the matrix; i.e.,

    u′ = (F[2, 1], F[2, 2], F[2, 0], 1).                             (7.11)
Because of our setup in the lab, the z-axis is on the opposite side of the Bird transmitter (behind the Bird emblem on the transmitter). For this reason, the z-component of the up vector is negated; i.e.,

    u′ = (F[2, 1], F[2, 2], −F[2, 0], 1).                            (7.12)

Note that the negation of the z-component of the transformed view vector does not make a difference because the term is a product of cosines.
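Pulling Equations (7.10) through (7.12) together, a small C sketch extracting the head-centric view and up vectors directly from the Bird matrix might look as follows; the function name is an assumption, and the z negation of Equation (7.12) is included per the lab setup described above.

    #include <stdio.h>

    /* View and up vectors from the Bird matrix F, per Equations
     * (7.10)-(7.12); vectors are returned as (x, y, z). */
    static void gaze_vectors_from_bird(const double F[4][4],
                                       double view[3], double up[3])
    {
        view[0] = -F[0][1];   /* -cos E sin A */
        view[1] = -F[0][2];   /*  sin E       */
        view[2] =  F[0][0];   /*  cos E cos A */

        up[0] =  F[2][1];
        up[1] =  F[2][2];
        up[2] = -F[2][0];     /* z negated for the lab's axis setup */
    }

    int main(void)
    {
        double F[4][4] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
        double view[3], up[3];
        gaze_vectors_from_bird(F, view, up);
        printf("view (%g, %g, %g), up (%g, %g, %g)\n",
               view[0], view[1], view[2], up[0], up[1], up[2]);
        return 0;
    }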
7.2.3 Transforming an Arbitrary Vector

To transform an arbitrary vector, an operation similar to the transformations of the up and view vectors is performed. To calculate the transformed arbitrary vector w′, with w = (x, y, z, 1), apply the composite transformation by multiplying by the transformation matrix M (in homogeneous coordinates):
    wM = (x, y, z, 1) ∗ M,

which gives transformed vector w′,

    w′ = ⎡ xM[1, 1] + yM[2, 1] + zM[3, 1] ⎤ᵀ
         ⎢ xM[1, 2] + yM[2, 2] + zM[3, 2] ⎥
         ⎢ xM[1, 3] + yM[2, 3] + zM[3, 3] ⎥
         ⎣               1               ⎦ .
To use the Bird matrix, there is unfortunately no simple way to select the appropriate matrix elements to directly obtain w′. Probably the best bet would be to undo the "off-by-one" shift present in the Bird matrix. On the other hand, hardcoding the solution may be the fastest method. This rather inelegant, but luckily localized, operation looks like this:
    w′ = ⎡ xF[1, 1] + yF[2, 1] + zF[0, 1] ⎤ᵀ
         ⎢ xF[1, 2] + yF[2, 2] + zF[0, 2] ⎥
         ⎢ xF[1, 0] + yF[2, 0] + zF[0, 0] ⎥
         ⎣               1               ⎦ .                         (7.13)

Furthermore, as in the up vector transformation, it appears that the negation of the z-component may also be necessary. If so, the above equation will need to be rewritten as

    w′ = ⎡  xF[1, 1] + yF[2, 1] + zF[0, 1]  ⎤ᵀ
         ⎢  xF[1, 2] + yF[2, 2] + zF[0, 2]  ⎥
         ⎢ −(xF[1, 0] + yF[2, 0] + zF[0, 0])⎥
         ⎣                1                ⎦ .                       (7.14)
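For illustration, a hardcoded C sketch of Equation (7.14) is given below; the function name, and the decision to always negate z (per the lab setup), are assumptions.

    #include <stdio.h>

    /* Transform an arbitrary vector (x, y, z) by the Bird matrix F,
     * per Equation (7.14), negating the resulting z-component. */
    static void transform_by_bird(const double F[4][4],
                                  double x, double y, double z,
                                  double w[3])
    {
        w[0] =  x*F[1][1] + y*F[2][1] + z*F[0][1];
        w[1] =  x*F[1][2] + y*F[2][2] + z*F[0][2];
        w[2] = -(x*F[1][0] + y*F[2][0] + z*F[0][0]);
    }

    int main(void)
    {
        double F[4][4] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
        double w[3];
        transform_by_bird(F, 0.0, 0.0, -1.0, w);
        printf("(%g, %g, %g)\n", w[0], w[1], w[2]);
        return 0;
    }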
7.3 3D Gaze Point Calculation
The calculation of the gaze point in three-space depends only on the relative positions of the two eyes in the horizontal axis. The parameters of interest here are the three-dimensional virtual coordinates of the gaze point, (xg, yg, zg), which can be determined from traditional stereo geometry calculations. Figure 7.4 illustrates the basic binocular geometry. Helmet tracking determines helmet position and orthogonal directional and up vectors, which determine the viewer-local coordinates shown in the diagram. The helmet position is the origin (xh, yh, zh), the helmet directional vector is the optical (viewer-local) z-axis, and the helmet up vector is the viewer-local y-axis.
Fig. 7.4. Basic binocular geometry.
Given instantaneous, eye tracked, viewer-local coordinates (mapped from eye tracker screen coordinates to the near view plane coordinates), (xl, yl) and (xr, yr) in the left and right view planes, at focal distance f along the viewer-local z-axis, we can determine viewer-local coordinates of the gaze point (xg, yg, zg) by deriving the stereo equations parametrically. First, express both the left and right view lines in terms of the linear parameter s. These lines originate at the eye centers (xh − b/2, yh, zh) and (xh + b/2, yh, zh) and pass through (xl, yl, f) and (xr, yr, f), respectively. The left view line is (in vector form):

    ( (1 − s)(xh − b/2) + sxl,  (1 − s)yh + syl,  (1 − s)zh + sf )   (7.15)

and the right view line is (in vector form):

    ( (1 − s)(xh + b/2) + sxr,  (1 − s)yh + syr,  (1 − s)zh + sf ),  (7.16)

where b is the disparity distance between the left and right eye centers, and f is the distance to the near viewing plane. To find the central view line originating at the local center (xh, yh, zh), calculate the intersection of the left and right view lines by solving for s using the x-coordinates of both lines, as given in Equations (7.15) and (7.16):

    (1 − s)(xh − b/2) + sxl = (1 − s)(xh + b/2) + sxr
    (xh − b/2) − s(xh − b/2) + sxl = (xh + b/2) − s(xh + b/2) + sxr
    s(xl − xr + b) = b
    s = b / (xl − xr + b).                                           (7.17)
The interpolant s is then used in the parametric equation of the central view line, to give the gaze point at the intersection of both view lines:

    xg = (1 − s)xh + s((xl + xr)/2)
    yg = (1 − s)yh + s((yl + yr)/2)
    zg = (1 − s)zh + sf,

giving:

    xg = (1 − b/(xl − xr + b)) xh + (b/(xl − xr + b)) ((xl + xr)/2)  (7.18)
    yg = (1 − b/(xl − xr + b)) yh + (b/(xl − xr + b)) ((yl + yr)/2)  (7.19)
    zg = (1 − b/(xl − xr + b)) zh + (b/(xl − xr + b)) f.             (7.20)
Eye positions (at viewer-local coordinates (xh − b/2, yh, zh) and (xh + b/2, yh, zh)), the gaze point, and an up vector orthogonal to the plane of the three points then determine the view volume appropriate for display to each eye screen.

The gaze point, as defined above, is given by the addition of a scaled offset to the view vector originally defined by the helmet position and central view line in virtual world coordinates.4 The gaze point can be expressed parametrically as a point on a ray with origin at the helmet position (xh, yh, zh), emanating along a vector scaled by the parameter s. That is, rewriting Equations (7.18) through (7.20), we have:
    xg = xh + s((xl + xr)/2 − xh)
    yg = yh + s((yl + yr)/2 − yh)
    zg = zh + s(f − zh)

or, in vector notation,

    g = h + sv,                                                      (7.21)

where h is the head position, v is the central view vector, and s is the scale parameter as defined in Equation (7.17). Note that the view vector used here is not related to the view vector given by the head tracker. It should be noted that the view vector v is obtained by subtracting the helmet position from the midpoint of the eye tracked coordinates on the near view plane (at focal distance f); i.e.,

    v = ⎡ (xl + xr)/2 ⎤   ⎡ xh ⎤
        ⎢ (yl + yr)/2 ⎥ − ⎢ yh ⎥ = m − h,
        ⎣      f      ⎦   ⎣ zh ⎦

where m denotes the left and right eye coordinate midpoint. To transform the vector v to the proper (instantaneous) head orientation, this vector should be normalized, then multiplied by the orientation matrix returned by the head tracker (see Section 7.2 in general and Section 7.2.3 in particular). This new vector, call it m′, should be substituted for m above to define v for use in Equation (7.21); i.e.,

    g = h + s(m′ − h).                                               (7.22)

4 Note that the vertical eye tracked coordinates yl and yr are expected to be equal (because gaze coordinates are assumed to be epipolar); the vertical coordinate of the central view vector defined by (yl + yr)/2 is somewhat extraneous; either yl or yr would do for the calculation of the gaze vector. However, because eye tracker data are also expected to be noisy, this averaging of the vertical coordinates enforces the epipolar assumption.
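Putting Equations (7.17) through (7.21) together, a minimal C sketch of the 3D gaze point calculation might look as follows; all names are illustrative, and coordinates are viewer-local, as in the text.

    #include <stdio.h>

    /* Gaze point from binocular PORs, per Equations (7.17)-(7.21):
     * (xl, yl) and (xr, yr) are left/right PORs on the near view plane
     * at focal distance f; (xh, yh, zh) is the helmet position; b is
     * the interocular (disparity) distance. */
    static void gaze_point(double xl, double yl, double xr, double yr,
                           double xh, double yh, double zh,
                           double b, double f, double g[3])
    {
        double s = b / (xl - xr + b);            /* Equation (7.17)  */
        g[0] = xh + s * ((xl + xr) / 2.0 - xh);  /* g = h + s(m - h) */
        g[1] = yh + s * ((yl + yr) / 2.0 - yh);
        g[2] = zh + s * (f - zh);
    }

    int main(void)
    {
        double g[3];
        /* Symmetric vergence example: eyes b = 0.064 apart, PORs
         * converged 0.01 inward on a near plane at f = 1. */
        gaze_point(-0.01, 0.0, 0.01, 0.0,  0.0, 0.0, 0.0,
                   0.064, 1.0, g);
        printf("gaze point: (%g, %g, %g)\n", g[0], g[1], g[2]);
        return 0;
    }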
7.3.1 Parametric Ray Representation of Gaze Direction

Equation (7.22) gives the coordinates of the gaze point through a parametric representation (e.g., a point along a line) such that the depth of the three-dimensional point of regard in world coordinates is valid only if s > 0. Given the gaze point g and the location of the helmet h, we can obtain just the three-dimensional gaze vector ν that specifies the direction of gaze (but not the actual fixation point). This direction vector is given by:

    ν = g − h                                                        (7.23)
      = (h + sv) − h                                                 (7.24)
      = sv                                                          (7.25)
      = ( b/(xl − xr + b) ) ⎡ (xl + xr)/2 − xh ⎤                     (7.26)
                            ⎢ (yl + yr)/2 − yh ⎥
                            ⎣      f − zh      ⎦ ,

where v is defined either as m − h, as before, or as m′ − h, as in Equation (7.22). Given the helmet position h and the gaze direction ν, we can express the gaze direction via a parametric representation of a ray using a linear interpolant t:

    gaze(t) = h + tν,   t > 0,                                       (7.27)

where h is the ray's origin (a point; the helmet position), and ν is the ray direction (the gaze vector). (Note that adding h to tν results in the original expression of the gaze point g given by Equation (7.21), provided t = 1.) The formulation of the gaze direction given by Equation (7.27) can then be used for testing virtual fixation coordinates via traditional ray/polygon intersection calculations commonly used in ray-tracing.
7.4 Virtual Gaze Intersection Point Coordinates
In 3D eye tracking studies, we are often interested in knowing the location of one’s gaze, or more importantly one’s fixation, relative to some feature in the scene. In VR applications, we’d like to calculate the fixation location in the virtual world and thus identify the object of interest. The identification of the object of interest can be accomplished following traditional ray/polygon intersection calculations, as employed in ray-tracing (Glassner, 1989).
The fixated object of interest is the one closest to the viewer that intersects the gaze ray. This object is found by testing all polygons in the scene for intersection with the gaze ray. The polygon closest to the viewer is then assumed to be the one fixated by the viewer (assuming all polygons in the scene are opaque).
7.4.1 Ray/Plane Intersection

The calculation of an intersection between a ray and all polygons in the scene is usually obtained via a parametric representation of the ray; e.g.,

    ray(t) = ro + t rd,                                              (7.28)

where ro defines the ray's origin (a point), and rd defines the ray direction (a vector). Note the similarity between Equations (7.28) and (7.27); there, h is the head position and ν is the gaze direction. To find the intersection of the ray with a polygon, calculate the interpolant value where the ray intersects each polygon, and examine all the intersections where t > 0. If t < 0, the object may intersect the ray, but behind the viewer.
Recall the plane equation Ax + By + Cz + D = 0, where A² + B² + C² = 1; i.e., A, B, and C define the plane normal. To obtain the ray/polygon intersection, substitute Equation (7.28) into the plane equation:

    A(xo + t xd) + B(yo + t yd) + C(zo + t zd) + D = 0               (7.29)

and solve for t:

    t = −(Axo + Byo + Czo + D) / (Axd + Byd + Czd)                   (7.30)

or, in vector notation:

    t = −(N · ro + D) / (N · rd).                                    (7.31)
A few observations about the above simplify the implementation.

1. Here N, the face normal, is really −N, because what we are calculating is the angle between the ray and the face normal. To get the angle, we need both the ray and the normal pointing in the same relative direction. This situation is depicted in Figure 7.5.
2. In Equation (7.31), the denominator will cause problems in the implementation should it evaluate to 0. However, if the denominator is 0 (i.e., if N · rd = 0), then the cosine between the vectors is 0, which means that the angle between the two vectors is 90°, which in turn means the ray and plane are parallel and do not intersect. Thus, to avoid dividing by zero, and to speed up the computation, evaluate the denominator first. If it is sufficiently close to zero, do not evaluate the intersection further; the ray and polygon will not intersect.
3. Point 2 can be further exploited by noting that if the dot product is greater than 0, then the surface is hidden from the viewer.
The first part of the intersection algorithm follows from the above and is given in Listing 7.1.

Fig. 7.5. Ray/plane geometry.
    vd = dot(N, rd);             /* denominator: N · rd            */
    if (vd < 0.0) {              /* front-facing plane (see above) */
        vo = -(dot(N, ro) + D);  /* numerator: −(N · ro + D)       */
        t  = vo / vd;            /* parametric distance along ray  */
    }

Listing 7.1. Ray/polygon intersection (dot denotes the vector dot product).
In the algorithm, the intersection parameter t defines the point of intersection along the ray at the plane defined by the normal N. That is, if t > 0, then the point of intersection p is given by:

    p = ro + t rd.                                                   (7.32)

Note that the above only gives the intersection point of the ray and the (infinite!) plane defined by the polygon's face normal. Because the normal defines a plane of infinite extent, we need to test the point p to see whether it lies within the confines of the polygonal region, which is defined by the polygon's edges. This is essentially the "point-in-polygon" problem.
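Combining Listing 7.1 with Equation (7.32), a self-contained C sketch of the ray/plane test might read as follows; the Vec3 type and function names are illustrative, not from the text.

    #include <stdio.h>

    typedef struct { double x, y, z; } Vec3;

    static double dot(Vec3 a, Vec3 b)
    { return a.x*b.x + a.y*b.y + a.z*b.z; }

    /* Ray/plane intersection per Listing 7.1 and Equation (7.32):
     * returns 1 and fills *p if the front-facing plane (N, D) is hit. */
    static int ray_plane(Vec3 ro, Vec3 rd, Vec3 N, double D, Vec3 *p)
    {
        double vd = dot(N, rd);             /* denominator            */
        if (vd >= -1e-9) return 0;          /* parallel or back-facing */
        double t = -(dot(N, ro) + D) / vd;  /* numerator / denominator */
        if (t <= 0.0) return 0;             /* behind the viewer       */
        p->x = ro.x + t*rd.x;               /* p = ro + t rd           */
        p->y = ro.y + t*rd.y;
        p->z = ro.z + t*rd.z;
        return 1;
    }

    int main(void)
    {
        Vec3 ro = {0,0,0}, rd = {0,0,1};    /* looking down +z         */
        Vec3 N  = {0,0,-1};                 /* plane z = 5: −z + 5 = 0 */
        Vec3 p;
        if (ray_plane(ro, rd, N, 5.0, &p))
            printf("hit at (%g, %g, %g)\n", p.x, p.y, p.z);
        return 0;
    }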
7.4.2 Point-In-Polygon Problem

To test whether a point p lies inside a polygon (whose plane equation alone specifies a plane of infinite extent), we need to test the point against all edges of the polygon. To do this, the following algorithm is used (a C sketch follows the list). For each edge:

1. Get the plane perpendicular to the face normal N, which passes through the edge's two vertices A and B. The perpendicular plane's normal N′ is obtained by calculating the cross product of the original face normal with the edge; i.e.,

       N′ = N × (B − A),

   where the face vertices A and B are specified in counterclockwise order. This is shown in Figure 7.6.
2. Get the perpendicular plane's equation by calculating D, using either A or B as the point on the plane (e.g., plug either A or B into the plane equation, then solve for D).
3. Test point p to see whether it lies "above" or "below" the perpendicular plane. This is done by plugging p into the perpendicular plane's equation and testing the result. If the result is greater than 0, then p is "above" the plane.
4. If p is "above" all perpendicular planes, as defined by successive pairs of vertices, then the point is "boxed in" by all the polygon's edges, and so must lie inside the polygon as originally defined by its face normal N.
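A minimal C sketch of this edge-by-edge test is given below; it assumes a convex polygon with vertices in counterclockwise order, and all names are illustrative.

    #include <stdio.h>

    typedef struct { double x, y, z; } Vec3;

    static double dot(Vec3 a, Vec3 b)
    { return a.x*b.x + a.y*b.y + a.z*b.z; }
    static Vec3 sub(Vec3 a, Vec3 b)
    { Vec3 r = {a.x-b.x, a.y-b.y, a.z-b.z}; return r; }
    static Vec3 cross(Vec3 a, Vec3 b)
    {
        Vec3 r = { a.y*b.z - a.z*b.y,
                   a.z*b.x - a.x*b.z,
                   a.x*b.y - a.y*b.x };
        return r;
    }

    /* Point-in-polygon test of Section 7.4.2: p is inside the convex
     * polygon (vertices v[0..n-1], CCW, face normal N) if it lies
     * "above" every edge's perpendicular plane N' = N x (B - A). */
    static int point_in_polygon(Vec3 p, const Vec3 v[], int n, Vec3 N)
    {
        for (int i = 0; i < n; i++) {
            Vec3 A = v[i], B = v[(i + 1) % n];
            Vec3 Np = cross(N, sub(B, A));  /* perpendicular normal */
            double D = -dot(Np, A);         /* plane offset through A */
            if (dot(Np, p) + D <= 0.0)      /* "below" one edge plane */
                return 0;
        }
        return 1;                           /* boxed in by all edges */
    }

    int main(void)
    {
        Vec3 quad[4] = {{0,0,0},{1,0,0},{1,1,0},{0,1,0}};  /* CCW, z=0 */
        Vec3 N = {0,0,1};
        Vec3 p = {0.5, 0.5, 0.0};
        printf("%s\n", point_in_polygon(p, quad, 4, N) ? "inside"
                                                       : "outside");
        return 0;
    }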