Testing for statistically significant differences between groups of scan patterns

Matt Feusner
University of California, San Francisco
feusnerm@vision.ucsf.edu

Brian Lukoff
Stanford University
brian.lukoff@stanford.edu

ETRA 2008, Savannah, Georgia, March 26–28, 2008.
Copyright © 2008 ACM 978-1-59593-982-1/08/0003

Abstract

Pairwise sequence alignment methods are now often used when analyzing eyetracking data [Hacisalihzade et al. 1992; Brandt and Stark 1997; Josephson and Holmes 2002, 2006; Pan et al. 2004; Heminghous and Duchowski 2006]. While optimal sequence alignment scores provide a valuation of similarity and difference, they do not readily provide a statistical test of similarity or difference. Furthermore, pairwise alignment scores cannot be used to compare groups of scan patterns directly. Using a statistic that compiles these pairwise alignment scores, a statistical evaluation of similarity can be made by repeatedly computing the statistic over different permutations of the scan pattern groupings. This test produces a p-value as a level of statistical significance.

Keywords: eye tracking, analysis, similarity test, comparison, sequence comparison, scanpath, scan pattern, statistics

1. Introduction

In an experiment where subjects are randomly assigned to two groups (e.g., a treatment and a control group), researchers typically want to compare the two groups on an outcome measure to see if they performed differently. Often, researchers will select a numerical outcome measure and then perform a statistical test, such as a t-test, to determine whether the observed differences between the groups are due purely to chance.

In eyetracking research, measured results generally consist of eye position and pupil size traces over time. Fixations and saccades are often extracted to produce outcome measures such as fixation location, fixation duration, and saccade amplitude [Salvucci and Goldberg 2000]. However, those measures tear apart spatiotemporal data that is inherently linked; the natural outcome measure is a scanpath, or scan pattern, consisting of a series of fixations and saccades in both space and time. Sequence alignment algorithms, or optimal matching algorithms, are an excellent tool for analyzing this complex data type [Salvucci and Anderson 2001], and have been used successfully in many eyetracking studies [Josephson and Holmes 2002, 2006; Pan et al. 2004; Myers 2005; West et al. 2006].

Sequence alignment works by computing the minimum number or magnitude of edit operations needed to transform one sequence into the other; the edit operations usually include insertion, deletion, and substitution [Josephson and Holmes 2002]. Given two scan patterns, sequence alignment produces as its output a distance value, or dissimilarity, between the two scan patterns.

Note that instead of providing a numerical outcome measure for an individual scan pattern, sequence alignment quantifies the dissimilarity between a pair of scan patterns. Thus, one cannot use traditional statistical methods for comparing two groups (e.g., a t-test or Wilcoxon signed-rank test), as there is no direct numerical measure of an individual scan pattern. One way to compare groups of scan patterns is to use sequence alignment distances with multidimensional scaling (which reduces a matrix of similarity scores to a small number of dimensions) and attempt to cluster the scan patterns. However, statistical analysis has not previously been used to compare clusters of sequences in eyetracking data [Josephson and Holmes 2002]. A multiple sequence alignment method can be used to consolidate groups of scanpaths into an average pattern, or consensus alignment [Hembrooke et al. 2006; West et al. 2006], and the pairwise distance can then be used to compare the two representative sequences. However, without a distribution or statistical framework, there is no way to test for significance.

In this paper, we present a straightforward adaptation to eyetracking research of a statistical procedure that uses a pairwise difference measure (sequence alignment) to compare two experimental groups and produce a p-value for significance. The procedure has been applied in other disciplines with other pairwise distance functions [Mantel 1967; Aittokallio et al. 2000; Kropf et al. 2006].

2. Procedure

Suppose that an experiment is set up with n subjects in one group and m subjects in another. Given any arbitrary grouping of all n + m subjects into two groups of sizes n and m (not necessarily the experimental grouping), we can calculate

    d* = d_between − d_within,    (1)

where d_between is the average distance between the scan patterns of subjects in different groups, and d_within is the average distance between the scan patterns of subjects in the same group. For a random grouping, we would expect d* to be close to 0 and equally likely to be positive or negative, since there is no reason to expect that subjects in the same group have scan patterns more or less similar to each other than subjects in different groups. Thus, over all of the possible

    C(n + m, m) = (n + m)! / (n! m!)    (2)

groupings of all n + m subjects into two groups, the distribution of d* is symmetric about its mean of 0.
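Equation 1 is straightforward to express in code. The following Python sketch (the function name and inputs are illustrative, not from the paper) computes d* for a given grouping, assuming the pairwise scanpath distances have already been computed, e.g., by sequence alignment:

```python
def d_star(dist, groups):
    """Compute d* = d_between - d_within (Equation 1).

    dist:   symmetric (n+m) x (n+m) matrix of pairwise scanpath distances.
    groups: sequence of n+m group labels (e.g., 0s and 1s).
    """
    between, within = [], []
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            # Pairs in different groups contribute to d_between,
            # pairs in the same group to d_within.
            (between if groups[i] != groups[j] else within).append(dist[i][j])
    return sum(between) / len(between) - sum(within) / len(within)
```

For two well-separated clusters (small within-group distances, large between-group distances), d* is large and positive; for a random grouping it hovers near 0.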
A positive d* statistic indicates that scan patterns in different groups are on average farther apart than scan patterns in the same group. Informally, this would mean that the particular grouping of subjects leads to two groups that each cluster together.

Permutation tests are a type of nonparametric statistical method that allows the researcher to distinguish d* values that are the result of random noise (e.g., an outlier subject) from d* values that represent a true difference between groups. The null hypothesis in the test is that each grouping is equally likely; in other words, that the experimental grouping is essentially just a random grouping of subjects into appropriately sized groups. As in all statistical tests, the p-value is the probability of obtaining a grouping with a d* value as high as the one observed in the experimental grouping if the null hypothesis is in fact true.

Computing the p-value in a permutation test is simple: examine all possible groupings of subjects and compute a d* statistic for each; the p-value is the proportion of groupings that result in a d* statistic at least as large as the one observed in the experimental grouping. p-values from a permutation test are interpreted in the same way as p-values from any other statistical test; typically p-values of 5% or less are considered "significant" and worthy of the conclusion that there is a real difference between the experimental groups.

Calculating the p-value in a permutation test can often require a prohibitively large number of groupings, even with relatively small data sets. For two experimental groups of size 15, there are over 150 million possible groupings (Equation 2). A standard "Monte Carlo" strategy to overcome this computational barrier is to select a random subset of groupings of a more manageable size (e.g., 1,000 or 10,000).
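The Monte Carlo version of the test can be sketched as follows. This is an illustration, not the authors' code: the d* statistic (Equation 1) is computed by a small helper so the sketch is self-contained, and the estimator is the proportion of sampled regroupings whose d* is at least as large as the observed one:

```python
import random

def d_star(dist, groups):
    """d* = d_between - d_within (Equation 1)."""
    between, within = [], []
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            (between if groups[i] != groups[j] else within).append(dist[i][j])
    return sum(between) / len(between) - sum(within) / len(within)

def permutation_test(dist, groups, n_samples=1000, seed=0):
    """Monte Carlo permutation test.

    Returns the estimated p-value: the proportion of random regroupings
    (same group sizes, labels shuffled) whose d* is >= the observed d*.
    """
    rng = random.Random(seed)
    observed = d_star(dist, groups)
    labels = list(groups)          # copy; shuffling preserves group sizes
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(labels)        # one random grouping of sizes n and m
        if d_star(dist, labels) >= observed:
            hits += 1
    return hits / n_samples
```

Note that with exhaustive enumeration the observed grouping itself is among the groupings counted, so the exact p-value is never zero; a common correction for the Monte Carlo estimate is to report (hits + 1) / (n_samples + 1).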
Although it makes the permutation test computationally tractable, the p-value obtained in this way will contain some random error, because only a randomly selected subset of the total number of groupings is examined. Selecting the subset at random ensures that the estimated p-values given by the Monte Carlo procedure will average out to the exact p-value in the long run. For practical purposes, one may, for example, simply select the largest computationally feasible random subset of groupings to examine, or construct confidence intervals for the estimated p-values [Nettleton and Doerge 2000].

3. Results

While the test can be used with any pairwise distance function, it is helpful first to explore the behavior of the test using a string edit distance to compare sets of synthetic data. Here the intent is to verify that the statistical test produces the results expected based on how the data were generated.

First, two groups (A and B) of artificial scan patterns were generated from the same process: a rectangular pattern with a small amount of random noise. Each of the 7 scans in each group had 4 fixations on each side of a square shape, moving clockwise from the top left corner (Figure 1). In order to induce both substitution and insertion/deletion edits, fixations were initially spaced 100 pixels apart and then adjusted using random noise variables in both the x and y directions that were uniformly distributed on [-100, 100]. The substitution cost was the Euclidean distance between fixations, and the gap penalty was 139.29 pixels, the length of the average saccade for both sets. The computed d* statistic was -21.47. Running all possible 3432 groupings (Equation 2), 3168 produced a d* value greater than or equal to the observed one, resulting in a p-value of 3168/3432 = 0.92308, which rightly fails to reject the null hypothesis that the sets are the same (Table 1).
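The string edit distance used in these comparisons, with a Euclidean substitution cost and a fixed gap penalty for insertions and deletions, can be implemented with standard dynamic programming. The following is a sketch under that parameterization (illustrative function and parameter names), not the exact code used in the study:

```python
import math

def align_distance(a, b, gap):
    """Sequence alignment (string edit) distance between two scanpaths.

    a, b: lists of (x, y) fixation coordinates.
    gap:  insertion/deletion penalty (e.g., the average saccade length).
    Substitution cost is the Euclidean distance between fixations.
    """
    n, m = len(a), len(b)
    # dp[i][j] = minimal cost of aligning a[:i] with b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap                       # delete all of a[:i]
    for j in range(1, m + 1):
        dp[0][j] = j * gap                       # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = math.dist(a[i - 1], b[j - 1])  # Euclidean substitution cost
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # substitution
                           dp[i - 1][j] + gap,       # deletion
                           dp[i][j - 1] + gap)       # insertion
    return dp[n][m]
```

Two identical scanpaths have distance 0 regardless of the gap penalty, and an unmatched fixation costs exactly one gap penalty when it is cheaper to skip than to substitute.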
Next, a new group was generated with the same rectangular process, but moving in the opposite, counterclockwise direction. This group was compared with group A from the previous test using a string edit gap penalty of 145.34 pixels, again the average saccade length. The d* statistic was 405.23, and the p-value was 2/3432 = 0.00058, correctly indicating a significant difference (Table 1).

Figure 1: An initial scanpath (dashed) and a final scanpath used in the data after random noise was added (solid). The dashed square is 300 x 300 pixels. Arrows indicate direction of saccades.

Having verified the appropriate results for both similar and dissimilar data, it is interesting to see what type of data sits at a borderline level of significance. The random noise was increased in 50-pixel steps for both the clockwise and counterclockwise scans, so that each fixation was moved from its point on the original square by a random variable uniformly distributed on [-r, r]. At r = 300 the p-value begins to lose significance (Table 2). This result makes sense, since the side of the square is, on average, 300 pixels. As r is increased, the transition from positive to negative values of d* also occurs at r = 300 (Figure 2).

Table 1. Synthetic data: comparisons with data group A
Data group   Gap parameter   d*       p
B            139.29          -21.47   3168/3432 = 0.92308
Reversed     145.34          405.23   2/3432 = 0.00058

Table 2. Synthetic data: comparisons for changing random noise
r     Gap parameter   d*        p
50    108.68          530.16    1/3432 = 0.00029
100   140.22          415.22    2/3432 = 0.00058
150   179.91          196.83    2/3432 = 0.00058
200   232.87          338.85    1/3432 = 0.00029
250   361.73          146.77    46/3432 = 0.01340
300   437.52          189.54    20/3432 = 0.00583
350   652.02          -52.94    1985/3432 = 0.57838
400   761.24          -16.46    1444/3432 = 0.42075
450   1094.29         -164.05   2432/3432 = 0.70862
500   1203.45         -193.14   2990/3432 = 0.87121

A more interesting test is to reanalyze data from another study conducted with other methods of analysis. In a study by Dixon et al. [2006], subjects were eyetracked while viewing movies under different viewing conditions: visible light, infrared (IR) light, and 3 experimental combinations of the two: a simple average (AVE), a complex wavelet transformation (CWT), and a discrete wavelet transformation (DWT). In the movie, subjects were asked to signal when a moving figure came to a certain location in a forest, and their accuracy in tracking the figure was measured. Dixon et al. found that accuracy was significantly different between the visible light condition and the combined conditions, and between the IR light condition and the DWT combined condition. (This dataset was made available online at http://www.cis.rit.edu/pelz/scanpaths/data/bristol-eden.htm.)

To run the permutation test on this dataset, 30 scans (3 sessions by 10 subjects) were grouped together for each of the 5 conditions. The string edit distance was again parameterized by a Euclidean-distance substitution penalty and an average-saccade-length gap penalty. 1000 Monte Carlo samples were taken in each test. The visible light condition was significantly different from all of the other conditions (IR: p = 0.011; AVE: p = 0.027; CWT: p = 0.014; DWT: p = 0.029). This result confirms the accuracy result from the original study for the 3 combined-viewing conditions. However, the visible light condition scans also differed from the IR condition scans, even though the corresponding accuracy difference was not significant. The permutation test result therefore provides evidence that viewing patterns can be significantly different despite similar tracking accuracy. Results comparing the IR condition scans to scans in the remaining 3 combined conditions were not significant (AVE: p = 0.56; CWT: p = 0.384; DWT: p = 0.517). There was a significant difference in accuracy between the IR and DWT conditions that was not replicated here.
Hence there is also evidence that similar viewing patterns can still produce significantly different tracking accuracy. In agreement with the accuracy results, the permutation test found no significant differences among the combined viewing conditions (AVE and CWT: p = 0.16; AVE and DWT: p = 0.288; CWT and DWT: p = 0.523).

Figure 2: d_between and d_within (top) and d* (bottom) as a function of the noise parameter r. When d_within > d_between, d* is negative.

4. Discussion

One important strength of the permutation test is that it is nonparametric; in other words, it does not assume that the distances follow any particular distribution. Consequently, any distance function can be used: one can easily change the parameters of the string edit distance (i.e., the specific penalties for insertion, deletion, and substitution) or even use another pairwise distance function altogether.

For simplicity, the particular sequence alignment distance function used in this study does not take pupil size or fixation duration into account when computing the distance. In other words, two scanpaths produced by subjects whose eyes follow the same geographical path will have a distance of 0, despite even wild variation in fixation duration and pupil size. One could easily overcome this limitation by using substitution functions and gap penalties that include these extra measurements as weights or extra dimensions, linking many possible measurements to the statistical test.

Another modification to the sequence alignment parameters would be to incorporate smooth pursuit eye movements. Video images, such as those analyzed here, often induce smooth pursuit tracking eye movements in addition to fixations and saccades [Dixon et al. 2007]. Like the Dixon et al. [2006] study, we did not explicitly analyze smooth pursuit movements.
However, one could use a substitution function that conditionally applies different metrics to compute distance scores depending on the type of eye movement (fixation or smooth pursuit). Such a function could then be used in place of the simple Euclidean distance as a parameter for the sequence alignment algorithm.

Depending on the researcher's substantive interest, other distance functions may be better suited to comparing individual scanpaths. For example, the area of the convex hull or of a circle circumscribed around the scanpath would indicate how focused the eye movements in a single scanpath are [Goldberg and Kotval 1999]. While these measures necessarily oversimplify the representation of a scanpath by quantifying only the extent of focus, ignoring other features (e.g., fixation density or timing), they do allow a single number to be assigned to a scanpath. Since a pairwise distance function can be built from any such computed value by taking arithmetic differences, the permutation test described here can also be conducted using these distances. The choice of distance function should be theoretically motivated: one should choose a distance function that reflects whatever substantive differences between scanpaths the researcher is interested in.

Our reanalysis of the data from the 2006 study by Dixon et al. echoes only some of the results found by the study's researchers. One reason may be that examining the entire scanpath provides a different picture of the overall results than examining only a single numerical measure summarizing tracking accuracy. However, another reason may be the statistical differences between the analysis described here and the analyses described by Dixon et al.
First, our analysis does not adjust for the fact that each group consisted of the same ten subjects producing repeated scanpaths, and so does not take into account the likely similarity between scanpaths produced by the same subject. Dixon et al. correct for this interdependence by using a repeated-measures ANOVA; the permutation test does not. Second, the p-values we report above are uncorrected for multiple comparisons (Dixon et al. use Tukey's HSD). Future work should refine the permutation test to account for these complications in the study design.

There are two important practical issues to consider before using the permutation test. First, researchers must determine the specific distance function used to quantify the difference between two scanpaths, and the parameters of that function (e.g., the substitution, insertion, and deletion parameters of the string edit distance). Second, researchers must confront the more general statistical issues of any Monte Carlo permutation test, particularly how many permutations to sample (1,000, 10,000, or more?) given time constraints and the desired power of the test.

5. Conclusion

When a researcher conducts an experiment where the outcomes are scanpaths, it is important to be able to determine statistically whether the observed differences between the scanpaths in each experimental group are due to real differences between the groups or simply to random variation. Methods that rely on human judgment to determine whether there is a real difference between the two groups are susceptible to bias from researcher expectations about what the result should be, so it is important to have a statistical decision method. The test presented here is applicable whenever a researcher conducts an experiment where the outcome is a scanpath, and is flexible enough to accommodate any pairwise distance function.
The optimal sequence alignment algorithm is shown to be a reasonable choice for computing pairwise distances. Even in studies where the scanpath is not the primary outcome of interest, the permutation test can still yield useful and interesting results, because it can illuminate cases where there are differences between scanpaths that are not detected by the outcome measure targeted by the researchers.

Acknowledgements

We thank Laura Granka, Timothy Dixon, and John Economides for their comments and help proofreading this work.

References

AITTOKALLIO, T., OJALA, P., NEVALAINEN, T. J., and NEVALAINEN, O. 2000. Analysis of similarity of electrophoretic patterns in mRNA differential display. Electrophoresis, 21, 2947-2956.

BRANDT, S. A., and STARK, L. W. 1997. Spontaneous eye movements during visual imagery reflect the content of the visual scene. Journal of Cognitive Neuroscience, 9, 27-38.

DIXON, T. D., LI, J., NOYES, J. M., TROSCIANKO, T., NIKOLOV, S. G., LEWIS, J., CANGA, E. F., BULL, D. R., and CANAGARAJAH, C. N. 2006. Scanpath analysis of fused multi-sensor images with luminance change: a pilot study. 9th International Conference on Information Fusion (ICIF '06), 1-8.

DIXON, T. D., NIKOLOV, S. G., LEWIS, J. J., LI, J., CANGA, E. F., NOYES, J. M., TROSCIANKO, T., BULL, D. R., and CANAGARAJAH, C. N. 2007. Assessment of fused videos using scanpaths: a comparison of data analysis methods. Spatial Vision, 20, 437-466.

GOLDBERG, J. H., and KOTVAL, X. P. 1999. Computer interface evaluation using eye movements: methods and constructs. International Journal of Industrial Ergonomics, 24, 631-645.

HACISALIHZADE, S., STARK, L., and ALLEN, J. 1992. Visual perception and sequences of eye movement fixations: a stochastic modeling approach. IEEE Transactions on Systems, Man and Cybernetics, 22, 474-481.

HEMBROOKE, H., FEUSNER, M., and GAY, G. 2006. Averaging scan patterns and what they can tell us. Proceedings of the 2006 Symposium on Eye Tracking Research & Applications, 41.
HEMINGHOUS, J., and DUCHOWSKI, A. T. 2006. iComp: a tool for scanpath visualization and comparison. Proceedings of the 3rd Symposium on Applied Perception in Graphics and Visualization (APGV '06), 152.

JOSEPHSON, S., and HOLMES, M. E. 2002. Visual attention to repeated internet images: testing the scanpath theory on the world wide web. Proceedings of the 2002 Symposium on Eye Tracking Research & Applications, 43-49.

JOSEPHSON, S., and HOLMES, M. E. 2006. Clutter or content? How on-screen enhancements affect how TV viewers scan and what they learn. Proceedings of the 2006 Symposium on Eye Tracking Research & Applications, 155-162.

KROPF, S., LUX, A., ESZLINGER, M., HEUER, H., and SMALLA, K. 2006. Comparison of independent samples of high-dimensional data by pairwise distance measures. Biometrical Journal, 48, 1-12.

MANTEL, N. 1967. The detection of disease clustering and a generalized regression approach. Cancer Research, 27, 209-220.

MYERS, C. W. 2005. Toward a method of objectively determining scanpath similarity. Journal of Vision, 5, 693.

NETTLETON, D., and DOERGE, R. W. 2000. Accounting for variability in the use of permutation testing to detect quantitative trait loci. Biometrics, 56, 52-58.

PAN, B., HEMBROOKE, H. A., GAY, G. K., GRANKA, L. A., FEUSNER, M. K., and NEWMAN, J. K. 2004. The determinants of web page viewing behavior: an eye-tracking study. Proceedings of the 2004 Symposium on Eye Tracking Research & Applications, 147-154.

SALVUCCI, D. D., and ANDERSON, J. R. 2001. Automated eye-movement protocol analysis. Human-Computer Interaction, 16, 39-86.

SALVUCCI, D. D., and GOLDBERG, J. H. 2000. Identifying fixations and saccades in eye-tracking protocols. Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, 71-78.

WEST, J. M., HAAKE, A. R., ROZANSKI, E. P., and KARN, K. S. 2006. eyePatterns: software for identifying patterns and similarities across fixation sequences. Proceedings of the 2006 Symposium on Eye Tracking Research & Applications, 149-154.