2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA)
Deep Learning Methods for the Prediction of Information Display Type Using Eye Tracking Sequences
Yuehan Yin†, Yahya Alqahtani∗, Jinjuan Heidi Feng†, Joyram Chakraborty†, Michael P. McGuire†
†Department of Computer and Information Sciences, Towson University, Towson, MD, USA
∗College of Computer Science and Information Technology, Jazan University, Jazan, Saudi Arabia
yyin1@students.towson.edu, yalqahtani@jazanu.edu.sa, {jfeng, jchakraborty, mmcguire}@towson.edu

Abstract—Eye tracking data can help design effective user interfaces by showing how users visually process information. In this study, three neural network models were developed and employed to classify three types of information display methods using eye gaze data collected in visual information processing behavior studies. The eye gaze data was first converted into a sequence and then fed into the neural networks to predict the information display type. The results compare three methods for creating eye tracking sequences and show how they perform with three neural network models: CNN-LSTM, CNN-GRU, and 3D CNN. The results were positive, with all models achieving an accuracy higher than 88 percent.

Index Terms—Convolutional neural network (CNN), deep learning, eye tracking, recurrent neural network (RNN).

I. INTRODUCTION AND RELATED WORK
Eye-tracking sensor technology is used to gather eye gaze data to analyze how users process visual information displays. The resulting data can provide input to customize the UI to meet perceived needs, expectations, or interest levels [1]. Areas of interest (AOIs) allow researchers to analyze different parts of an information display by breaking the display up into individual components. This study is primarily motivated by the need to predict the information display type based on eye gaze patterns alone.

Sequential data is involved in a large number of machine learning tasks where the input is a given sequence and the output may be a value, a class label, or another sequence. In our previous work [2], convolutional neural networks (CNNs) were applied to eye tracking data modeled as a static time slice image to classify different information presentation methods. However, the time dimension in the fixed-size window was not exploited at all for the classification task. Thus, a second motivation of the study is to determine how to convert eye gaze data into a sequence.

Recurrent neural networks (RNNs) show promise for sequence prediction problems. RNNs model the dynamics of sequences through loops in the network architecture and a state representing information corresponding to a context window of arbitrary size [3]. Early recurrent neural networks had difficulty learning long-term dependencies and also suffered from exploding or vanishing gradients [4], [5]. Long short-term memory (LSTM) networks [5], [6] were introduced to address these problems effectively. Gated recurrent unit (GRU) networks [7], [8] are simplified variants of LSTM networks; a GRU has fewer parameters and a lower computing cost than an LSTM, without an obvious reduction in performance [9]. In [10], an LSTM network was applied to model eye-tracking records as textual strings using a sequence-based saccade eye movement representation. Researchers in [11] proposed a CNN-LSTM model for dynamic gesture recognition. In [12], a hybrid deep neural network was proposed to classify honest and fraudulent electricity consumers. In [13], a deep Conv-LSTM model was built to classify the eye movements of novice and expert clinicians with an accuracy of 84.2%. Researchers in [14] used a two-stream inflated 3D CNN for video action classification and achieved state-of-the-art results using a pre-trained network model.

In this study, three neural networks were applied to eye gaze data to identify information display methods: textual presentation, graphical presentation, and tabular presentation. The structure of the paper is as follows. The dataset and data preprocessing are described in section II. The neural network models employed in the study are introduced in section III. The experiments and corresponding results are presented in section IV. Finally, the conclusion and future research directions are presented in section V.

II. DATASET AND DATA PREPROCESSING
A. Dataset Overview
The eye tracking data used in this paper was collected with Tobii X2-60 eye trackers in two studies [15], [16], each with 24 participants. The customized web UIs with the different information display methods used in the studies are illustrated in Fig. 1; the lines in Fig. 1 depict the expected scan path for each interface. The Tobii Studio software was utilized to extract the raw eye gaze data. The key attributes used in this paper were the participant information, GazePointX (ADCSpx) and GazePointY (ADCSpx) for the gaze points, the recording date, and the timestamp.

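As an illustration (not part of the original studies), the sketch below shows how such an export might be loaded with pandas; the file name and some column names are assumptions, since Tobii Studio export headers vary by version:

import pandas as pd

# Assumed Tobii Studio raw-data export (tab-separated); header names may differ.
COLS = ["ParticipantName", "RecordingDate", "RecordingTimestamp",
        "GazePointX (ADCSpx)", "GazePointY (ADCSpx)"]

gaze = pd.read_csv("recording_export.tsv", sep="\t", usecols=COLS)
# Drop samples lacking an x or y coordinate (see Sec. II-B).
gaze = gaze.dropna(subset=["GazePointX (ADCSpx)", "GazePointY (ADCSpx)"])
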
(a) Textual presentation. (b) Graphical presentation. (c) Tabular presentation.
Fig. 1. Information display methods used in the study.

B. Data Preprocessing
AOIs for each information display were created using the Tobii Studio software. Gaze points related to activated AOIs were extracted from the eye tracking data, along with the starting and ending timestamps for when an AOI was activated and deactivated. Gaze points lacking x or y coordinates were eliminated. Data was extracted using a time window of between 9 and 10 seconds and modeled as a single JSON document stored in MongoDB.

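For illustration only, one plausible shape for such a JSON document is sketched below; the field names and the database and collection names are our assumptions, as the paper does not specify a schema:

from pymongo import MongoClient

# Hypothetical document for one 9-10 second window of AOI-related gaze points.
doc = {
    "participant": "P01",                                  # participant information
    "display_type": "textual",                             # textual / graphical / tabular
    "aoi_window": {"start_ts": 120433, "end_ts": 130433},  # AOI activation span (ms)
    "gaze_points": [
        {"t": 120450, "x": 712, "y": 448},                 # GazePointX/Y in screen pixels
        # ... one entry per valid gaze sample in the window
    ],
}

client = MongoClient("mongodb://localhost:27017")
client["eye_tracking"]["sequences"].insert_one(doc)
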
The time dimension was added to each input tensor. The time window size was tested by breaking the data up into ten-frame tensors with one second per frame, five-frame tensors with two seconds per frame, and two-frame tensors with five seconds per frame. In addition, each frame was generated cumulatively in one-second units. Each input tensor had four dimensions: the number of frames, height, width, and the number of channels. The spatial dimension of each tensor was represented by a 2D array with the same dimensions as the screen resolution of the computer (1440 × 900). The elements in the 2D array were coded as 1 to indicate active gaze points and 0 otherwise. Each frame was scaled down from 1440 × 900 to 240 × 150 using Lanczos interpolation [17]. In addition, centering was applied to each input tensor such that the mean pixel value was subtracted from each pixel. Figure 2 shows the three types of time sequence input (10 F: 10 frames, 5 F: 5 frames, and 2 F: 2 frames) using a scan path representation for the purpose of illustration.

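A minimal sketch of this frame construction is given below, assuming gaze points arrive as (time-in-seconds, x, y) triples; the function and constant names are ours:

import numpy as np
from PIL import Image

SCREEN_W, SCREEN_H = 1440, 900   # screen resolution
TARGET_W, TARGET_H = 240, 150    # downscaled frame size

def build_input_tensor(points, n_frames, seconds_per_frame):
    """Build an (n_frames, 150, 240, 1) tensor; frames are cumulative, so
    frame i contains all gaze points observed up to its end time."""
    frames = []
    for i in range(n_frames):
        grid = np.zeros((SCREEN_H, SCREEN_W), dtype=np.float32)
        t_end = (i + 1) * seconds_per_frame
        for t, x, y in points:
            if t < t_end and 0 <= x < SCREEN_W and 0 <= y < SCREEN_H:
                grid[int(y), int(x)] = 1.0  # 1 = active gaze point, 0 otherwise
        # Lanczos interpolation from 1440 x 900 down to 240 x 150
        img = Image.fromarray(grid).resize((TARGET_W, TARGET_H), Image.LANCZOS)
        frames.append(np.asarray(img, dtype=np.float32))
    tensor = np.stack(frames)[..., np.newaxis]  # add the channel dimension
    return tensor - tensor.mean()               # centering: subtract the mean pixel value
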
III. NEURAL NETWORK MODELS
The following neural network models were implemented using TensorFlow and Keras. A single NVIDIA TITAN V GPU with 12 GB of memory was used to train and test the models.

A. CNN-LSTM and CNN-GRU
Figure 3 shows the architectures of our CNN-LSTM and CNN-GRU models. The CNN part was composed of four blocks, each containing two convolutional layers with a 3 × 3 kernel and a stride of one, each followed by batch normalization. The output of the first normalization was passed through the Swish activation function. The output of the second normalization was passed to a max pooling layer with a 2 × 2 pooling kernel and a stride of two, and the pooled output was then passed through the Swish activation function. Swish is bounded below but unbounded above, and is smooth and non-monotonic in comparison with ReLU [18]. In the second block, we applied a shortcut connection [19]: a linear layer performing a one-by-one convolution, connected between the output of the first block and the output of the last convolutional layer in the second block. In our models, we used the linear shortcut connection to help supplement the non-linear mapping [20]. After the fourth block, the output was flattened across time steps and fed into the LSTM or GRU layer; both layers used 500 units. The output was then passed into a fully connected layer with the Swish activation function, and the last fully connected layer used a softmax activation function. The loss function was categorical cross-entropy, and the Adam optimizer was used with a learning rate of 0.001.

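A sketch of this architecture in Keras is shown below. The filter counts per block and the width of the first fully connected layer are given only in Fig. 3, which we cannot reproduce here, so those values are assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters, use_shortcut=False):
    """Two 3 x 3 convolutions (stride 1), each followed by batch normalization;
    Swish after the first BN, and max pooling (2 x 2, stride 2) then Swish
    after the second BN."""
    block_in = x
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("swish")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    if use_shortcut:
        # Linear one-by-one convolution shortcut from the block input
        # (the output of the previous block) to the last conv layer's output.
        x = layers.Add()([x, layers.Conv2D(filters, 1)(block_in)])
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    return layers.Activation("swish")(x)

def build_cnn_rnn(n_frames, rnn="lstm", filters=(32, 64, 128, 256)):  # filters assumed
    frame = layers.Input(shape=(150, 240, 1))
    x = frame
    for i, f in enumerate(filters):               # four blocks; shortcut in the second
        x = conv_block(x, f, use_shortcut=(i == 1))
    per_frame_cnn = models.Model(frame, layers.Flatten()(x))

    seq = layers.Input(shape=(n_frames, 150, 240, 1))
    x = layers.TimeDistributed(per_frame_cnn)(seq)   # CNN applied per time step
    x = (layers.LSTM(500) if rnn == "lstm" else layers.GRU(500))(x)
    x = layers.Dense(128, activation="swish")(x)     # assumed width
    out = layers.Dense(3, activation="softmax")(x)

    model = models.Model(seq, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
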
The two text boxes in figure 3 contain the mathematical equations describing the implementation of the LSTM layer and the GRU layer. To simplify the presentation, x_t ∈ R^n denotes the input at time step t from a given sequential input. In the text box of the LSTM unit, ĉ_t ∈ R^d, i_t ∈ [0, 1]^d, f_t ∈ [0, 1]^d, o_t ∈ [0, 1]^d, c_t ∈ R^d, and a_t ∈ R^d respectively denote the candidate internal state to be written into the memory cell, the input gate, the forget gate, the output gate, the internal state of the memory cell recording the historical information up to the current moment, and the hidden state of the LSTM unit at time step t; W_{c,i,f,o} ∈ R^{d×n} and V_{c,i,f,o} ∈ R^{d×d} are learnable weight matrices, and b_{c,i,f,o} ∈ R^d are bias vectors. In the text box of the gated recurrent unit, u_t ∈ [0, 1]^d, r_t ∈ [0, 1]^d, â_t ∈ R^d, and a_t ∈ R^d respectively denote the update gate, the reset gate, the candidate hidden state, and the hidden state of the GRU at time step t; W_{u,r,a} ∈ R^{d×n} and V_{u,r,a} ∈ R^{d×d} are learnable weight matrices, and b_{u,r,a} ∈ R^d are bias vectors. The "⊙" in the two text boxes denotes the element-wise product.

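For reference, the standard update equations that these definitions describe are transcribed below (our transcription of the usual formulation; sign conventions for the GRU update gate vary across references):

LSTM unit:
  ĉ_t = tanh(W_c x_t + V_c a_{t-1} + b_c)
  i_t = σ(W_i x_t + V_i a_{t-1} + b_i)
  f_t = σ(W_f x_t + V_f a_{t-1} + b_f)
  o_t = σ(W_o x_t + V_o a_{t-1} + b_o)
  c_t = f_t ⊙ c_{t-1} + i_t ⊙ ĉ_t
  a_t = o_t ⊙ tanh(c_t)

Gated recurrent unit:
  u_t = σ(W_u x_t + V_u a_{t-1} + b_u)
  r_t = σ(W_r x_t + V_r a_{t-1} + b_r)
  â_t = tanh(W_a x_t + V_a (r_t ⊙ a_{t-1}) + b_a)
  a_t = (1 − u_t) ⊙ a_{t-1} + u_t ⊙ â_t

where σ denotes the logistic sigmoid.
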
B. 3D CNN
Figure 4 displays the architecture of the 3D CNN model. Each convolutional layer in the model had a 3 × 3 × 3 kernel and a stride of one. In our model, each convolutional layer was followed by batch normalization and a Swish activation function, and then the output was passed into a 3D max pooling layer. There were four 3D max pooling layers in total. The first two max pooling layers had pooling sizes of 1 × 2 × 2 and 2 × 3 × 3 for all three types of neural network input generated in the data preprocessing stage. In the third and fourth max pooling layers, the pooling sizes were 2 × 2 × 2 and 2 × 2 × 2 for the 10-frame input, 2 × 2 × 2 and 1 × 2 × 2 for the 5-frame input, and 1 × 2 × 2 and 1 × 2 × 2 for the 2-frame input. Different pooling sizes were used in the third and fourth max pooling layers so that all three types of input had the same output shape before being fed into the fully connected layer. The number of strides for all the max pooling layers was the same as their pooling sizes. The flattened output was fed into a fully connected layer with 128 neurons and a Swish function. The last fully connected layer had three neurons with a softmax function. Categorical cross-entropy was used as the loss function along with the Adam optimizer with a learning rate of 0.001.

Fig. 2. A cumulative time sequence of scan paths on textual presentation for 10 seconds with one frame per second.

Fig. 3. CNN-LSTM and CNN-GRU models. F.C. = Fully Connected; kernel sizes for convolution and max pooling are indicated as n × n. The number of filters and neurons is indicated after a comma in the boxes displaying convolution and F.C. layers. Note that all convolutional layers are followed by batch normalization, which is not shown in the figure.

Fig. 4. 3D CNN model. B.N. = Batch Normalization and F.C. = Fully Connected. Kernel sizes for convolution and max pooling are indicated as a × a × a and {b or n or m} × d × d, respectively. The number of filters and neurons is indicated after a comma in the boxes displaying convolution and F.C. layers.

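A corresponding Keras sketch is given below; as before, the filter counts are assumptions, and we show one convolutional layer per pooling stage (the exact depth is specified only in Fig. 4):

import tensorflow as tf
from tensorflow.keras import layers, models

def build_3d_cnn(n_frames, filters=(32, 64, 128, 256)):    # filters assumed
    # Pooling sizes from the text; the third and fourth pools depend on the
    # frame count so that every input type ends with the same output shape.
    pools = {10: [(1, 2, 2), (2, 3, 3), (2, 2, 2), (2, 2, 2)],
             5:  [(1, 2, 2), (2, 3, 3), (2, 2, 2), (1, 2, 2)],
             2:  [(1, 2, 2), (2, 3, 3), (1, 2, 2), (1, 2, 2)]}[n_frames]
    x = inp = layers.Input(shape=(n_frames, 150, 240, 1))
    for f, p in zip(filters, pools):
        x = layers.Conv3D(f, 3, padding="same")(x)          # 3 x 3 x 3 kernel, stride 1
        x = layers.BatchNormalization()(x)
        x = layers.Activation("swish")(x)
        x = layers.MaxPooling3D(pool_size=p, strides=p)(x)  # strides equal pool size
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="swish")(x)
    out = layers.Dense(3, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

With these pooling sizes, the 10-, 5-, and 2-frame inputs all reach a 1 × 6 × 10 feature map before flattening, matching the constraint described above.
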
IV. EXPERIMENTS AND RESULTS

A. Data Partitioning Methods
Our data partition of the 10-second sequences is shown in table I. The categories in our sequence dataset were not completely balanced. Thus, an 80/20 stratified shuffle split was used to guarantee that the proportion of samples associated with each category label was uniform across the resulting sets. The category labels were encoded using one-hot encoding. With this split ratio, more samples could be used for training and validation to tune the hyperparameters and architectures of our network models. In addition, we applied stratified tenfold cross-validation to the training set. The final evaluation was carried out on the test set containing 20 percent of the samples in the dataset. A batch size of 32 was used to train all models.

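A sketch of this partitioning with scikit-learn is shown below; the function name, variable names, and random seed are ours:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold
from tensorflow.keras.utils import to_categorical

def partition(X, labels, seed=0):
    """X: input tensors, e.g. shape (1469, n_frames, 150, 240, 1);
    labels: integer class per sample (0 = text, 1 = graph, 2 = table)."""
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
    train_idx, test_idx = next(sss.split(X, labels))
    y = to_categorical(labels, num_classes=3)        # one-hot encode the labels
    # Stratified tenfold cross-validation within the training portion
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    folds = list(skf.split(X[train_idx], labels[train_idx]))
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx]), folds
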
TABLE I
Data Partition

Classification of Information Display Methods
                                   Training set   Test set
Number of textual presentation          548          137
Number of graphical presentation        317           80
Number of tabular presentation          310           77
Total number of each set               1175          294
Total number of the dataset: 1469

B. Evaluation Metrics
The evaluation metrics used in this study were accuracy, precision, recall, and F1 score. All metrics were converted to percentages ranging from 0% to 100%, where 0% is the worst score and 100% is a perfect score.

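As a sketch, these metrics can be computed per class and overall with scikit-learn; we read the paper's "total" values as macro averages, which is an assumption:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """y_true, y_pred: integer class labels (e.g. argmax of the softmax output)."""
    acc = accuracy_score(y_true, y_pred)
    per_class = precision_recall_fscore_support(y_true, y_pred,
                                                labels=[0, 1, 2], zero_division=0)
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                  average="macro", zero_division=0)
    return {"accuracy": acc, "per_class": per_class,
            "precision_total": p, "recall_total": r, "f1_total": f1}
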
C. Classification of Information Display Methods
The classification task is to classify the information display methods: textual presentation, graphical presentation, and tabular presentation. We generated three types of input (10 frames, 5 frames, and 2 frames), and the resulting tensors were fed as input to the three neural network models, giving three experiments per model. We use text, graph, and table as the class labels for the three display methods in table II, table III, and table IV. These tables show the test results of our experiments over 10 runs by input type and network model; the values in bold in the original tables represent the best performing model. The CNN-GRU model had the best scores in experiment 1 for all the evaluation metrics. In experiment 2, the CNN-LSTM model had the best test accuracy and the best F1 scores for identifying textual and graphical presentation, and the 3D CNN model had the best F1 score for classifying tabular presentation. In experiment 3, the CNN-LSTM model had the best test accuracy and the best F1 scores for identifying graphical and tabular presentation, and the CNN-GRU model had the best F1 score for classifying textual presentation. Looking across the three tables, the CNN-LSTM model had the best test accuracy and the best F1 scores on two-frame input for classifying all the display methods. The CNN-GRU model had the best test accuracy on two-frame input, the best F1 scores on two-frame input for classifying textual and tabular presentation, and the best F1 score on ten-frame input for classifying graphical presentation. The 3D CNN model had the best test accuracy on two-frame input, the best F1 score on two-frame input for classifying textual presentation, and the best F1 scores on five-frame input for classifying graphical and tabular presentation.

TABLE II
Test Results of Ex. 1 for All the Models (10 Runs)
Experiment 1: 10 frames, 1 second per frame (batch size 32)

Method: CNN-LSTM     Text            Graph           Table
Precision            90.64% ± 1.76   89.28% ± 1.85   86.79% ± 2.40
Recall               92.92% ± 1.46   83.00% ± 2.51   89.09% ± 1.85
F1 score             91.75% ± 1.11   86.00% ± 1.63   87.90% ± 1.56
Test accuracy 89.22% ± 1.21; Precision_total 88.90%; Recall_total 88.34%; F1 score_total 88.55%

Method: CNN-GRU      Text            Graph           Table
Precision            91.31% ± 1.38   92.18% ± 2.32   87.09% ± 1.96
Recall               94.09% ± 1.06   83.50% ± 2.73   90.78% ± 2.13
F1 score             92.67% ± 0.84   87.60% ± 2.05   88.87% ± 1.53
Test accuracy 90.34% ± 1.20; Precision_total 90.19%; Recall_total 89.46%; F1 score_total 89.71%

Method: 3D CNN       Text            Graph           Table
Precision            90.56% ± 0.68   87.57% ± 2.96   86.62% ± 2.41
Recall               93.07% ± 1.14   81.38% ± 1.42   88.57% ± 2.23
F1 score             91.79% ± 0.58   84.34% ± 1.88   87.55% ± 1.57
Test accuracy 88.71% ± 0.88; Precision_total 88.25%; Recall_total 87.67%; F1 score_total 87.89%

TABLE III
Test Results of Ex. 2 for All the Models (10 Runs)
Experiment 2: 5 frames, 2 seconds per frame (batch size 32)

Method: CNN-LSTM     Text            Graph           Table
Precision            91.41% ± 0.88   89.91% ± 2.42   88.40% ± 1.69
Recall               94.01% ± 1.72   83.13% ± 2.11   90.78% ± 2.13
F1 score             92.69% ± 1.10   86.36% ± 1.82   89.55% ± 1.30
Test accuracy 90.20% ± 0.91; Precision_total 89.91%; Recall_total 89.31%; F1 score_total 89.53%

Method: CNN-GRU      Text            Graph           Table
Precision            91.27% ± 1.17   89.19% ± 1.70   87.38% ± 1.70
Recall               92.99% ± 1.88   82.25% ± 1.46   91.43% ± 1.76
F1 score             92.11% ± 1.17   85.57% ± 1.13   89.34% ± 1.25
Test accuracy 89.66% ± 0.89; Precision_total 89.28%; Recall_total 88.89%; F1 score_total 89.01%

Method: 3D CNN       Text            Graph           Table
Precision            91.95% ± 1.28   88.67% ± 1.93   88.27% ± 2.04
Recall               93.07% ± 1.32   82.75% ± 2.08   92.34% ± 2.05
F1 score             92.49% ± 0.78   85.58% ± 1.33   90.23% ± 1.31
Test accuracy 90.07% ± 0.71; Precision_total 89.63%; Recall_total 89.39%; F1 score_total 89.43%

TABLE IV
Test Results of Ex. 3 for All the Models (10 Runs)
Experiment 3: 2 frames, 5 seconds per frame (batch size 32)

Method: CNN-LSTM     Text            Graph           Table
Precision            91.86% ± 1.11   91.30% ± 2.16   89.83% ± 0.86
Recall               94.60% ± 1.43   85.88% ± 1.59   90.52% ± 2.54
F1 score             93.20% ± 0.87   88.48% ± 1.06   90.15% ± 1.27
Test accuracy 91.16% ± 0.65; Precision_total 91.00%; Recall_total 90.33%; F1 score_total 90.61%

Method: CNN-GRU      Text            Graph           Table
Precision            91.49% ± 0.99   91.13% ± 2.21   89.18% ± 2.85
Recall               95.55% ± 1.36   83.00% ± 1.00   90.26% ± 1.45
F1 score             93.46% ± 0.73   86.86% ± 1.27   89.69% ± 1.80
Test accuracy 90.75% ± 0.96; Precision_total 90.60%; Recall_total 89.60%; F1 score_total 90.00%

Method: 3D CNN       Text            Graph           Table
Precision            91.97% ± 0.84   87.69% ± 1.72   89.56% ± 2.33
Recall               94.38% ± 0.98   82.50% ± 1.77   90.78% ± 2.63
F1 score             93.16% ± 0.48   84.99% ± 1.13   90.13% ± 1.83
Test accuracy 90.20% ± 0.86; Precision_total 89.74%; Recall_total 89.22%; F1 score_total 89.43%

The three tables also contain the total test results for the three network models. In experiment 1, the CNN-GRU model had the best scores for all the total evaluation metrics among the three neural network models. In experiment 2, the CNN-LSTM model had the best Precision_total and F1 score_total, and the 3D CNN model had the best Recall_total. In experiment 3, the CNN-LSTM model had the best values for all the total evaluation metrics. Overall, the CNN-LSTM model performed better than the other two network models. We also found that input given as a cumulative time sequence of gaze points with two frames yielded the best values on the evaluation metrics; for this type of input, each frame contained more eye movement information. This seems to confirm that our network models can more easily recognize patterns, and thus make correct predictions, in sequences that carry more information at each time step.

V. CONCLUSION AND FUTURE WORK
In this study, three neural network models were trained and tested in three experiments to classify information presentation displays. The CNN-LSTM model had the best performance overall using the two-frame input, in which each frame contained five seconds of eye movement information. In addition, we found that frames containing more information per time step benefit the sequence classification task and improve model performance. The classification results show that the neural network models detected gaze patterns related to different information display types; further analysis of the results will be needed to obtain information that can be used to improve web UI usability.

In the future, we will use input tensors in which each frame keeps its original size to train and test neural network models on multiple GPUs, and will examine whether the information lost by scaling down the input affects model performance. More data will be collected so that we can try different window and frame sizes to see whether model performance can be boosted. New methods of encoding sequences will also be developed to convert raw data into neural network input to improve model performance. In addition, the eye movement patterns discovered by the neural network models will be analyzed to help design an efficient and reliable user interface that adapts to the user's gaze pattern.

ACKNOWLEDGMENT

This research was partially supported by the Towson University School of Emerging Technology. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.

REFERENCES

[1] H. Sharp, J. Preece, and Y. Rogers, Interaction Design: Beyond Human-Computer Interaction, 5th ed. Indianapolis, IN, USA: Wiley, 2019.
[2] Y. Yin, Y. Alqahtani, J. Feng, J. Chakraborty, and M. McGuire, "Classification of eye tracking data in visual information processing tasks using convolutional neural networks and feature engineering," SN Computer Science, vol. 2, no. 2, Jan. 2021.
[3] Z. C. Lipton, J. Berkowitz, and C. Elkan, "A critical review of recurrent neural networks for sequence learning," 2015.
[4] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, Mar. 1994.
[5] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[6] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," in 1999 Ninth International Conference on Artificial Neural Networks ICANN 99 (Conf. Publ. No. 470), vol. 2, 1999, pp. 850–855.
[7] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in EMNLP, 2014.
[8] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," ArXiv, 2014.
[9] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2017.
[10] M. Elbattah, J.-L. Guérin, R. Carette, F. Cilia, and G. Dequen, "Generative modeling of synthetic eye-tracking data: NLP-based approach with recurrent neural networks," in IJCCI, 2020.
[11] E. Tsironi, P. Barros, and S. Wermter, "Gesture recognition with a convolutional long short-term memory recurrent neural network," in ESANN, 2016.
[12] A. Ullah, N. Javaid, O. Samuel, M. Imran, and M. Shoaib, "CNN and GRU based deep neural network for electricity theft detection to secure smart grid," in 2020 International Wireless Communications and Mobile Computing (IWCMC), 2020, pp. 1598–1602.
[13] K. Sodoké, R. Nkambou, A. Dufresne, and I. Tanoubi, "Toward a deep convolutional LSTM for eye gaze spatiotemporal data sequence classification," in EDM, 2020.
[14] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4724–4733.
[15] Y. Alqahtani, J. Chakraborty, M. McGuire, and J. H. Feng, "Understanding visual information processing for American vs. Saudi Arabian users," in Advances in Design for Inclusion, G. Di Bucchianico, Ed. Cham: Springer International Publishing, 2020, pp. 229–238.
[16] Y. Alqahtani, M. McGuire, J. Chakraborty, and J. H. Feng, "Understanding how ADHD affects visual information processing," in Universal Access in Human-Computer Interaction. Multimodality and Assistive Environments, M. Antona and C. Stephanidis, Eds. Cham: Springer International Publishing, 2019, pp. 23–31.
[17] K. Papadopoulos and K. Vlachos, "Efficient projective transformation and Lanczos interpolation on ARM platform using SIMD instructions," in VISIGRAPP, 2018, pp. 95–100.
[18] P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for activation functions," ArXiv, 2017.
[19] N. N. Schraudolph, Centering Neural Network Gradient Factors. Springer Berlin Heidelberg, 2012, pp. 205–223.
[20] C. Hettinger, T. Christensen, J. Humpherys, and T. J. Jarvis, "Tandem blocks in deep convolutional neural networks," ArXiv, 2018.