The goal of this project was to test how different representations of state uncertainty impact human decision making. Across a series of experiments, we sought to answer fundamental questions about human cognitive biases and how they are affected by visual and numerical information. The results of these experiments identify problems and pitfalls to avoid when presenting algorithmic outputs that include state uncertainty to human decision makers. Our findings also point to important areas for future research that will enable system designers to minimize biases in human interpretation of the outputs of artificial intelligence, machine learning, and other advanced analytic systems.
This research explores novel methods for extracting relevant information from EEG data to characterize individual differences in cognitive processing. Our approach combines expertise in machine learning, statistics, and cognitive science, advancing the state of the art in all three domains. Specifically, by using cognitive science expertise to interpret results and inform algorithm development, we have developed a generalizable and interpretable machine learning method that can accurately predict individual differences in cognition. The output of the machine learning method revealed surprising features of the EEG data that, when interpreted by the cognitive science experts, provided novel insights into the underlying cognitive task. Additionally, the outputs of the statistical methods show promise as a principled approach to quickly find regions within the EEG data where individual differences lie, thereby supporting cognitive science analysis and informing machine learning models. This work lays methodological groundwork for applying the large body of cognitive science literature on individual differences to high-consequence mission applications.
With machine learning (ML) technologies rapidly expanding to new applications and domains, users are collaborating with artificial intelligence-assisted diagnostic tools to a larger and larger extent. But what impact does ML aid have on cognitive performance, especially when the ML output is not always accurate? Here, we examined the cognitive effects of the presence of simulated ML assistance—including both accurate and inaccurate output—on two tasks (a domain-specific nuclear safeguards task and a domain-general visual search task). Patterns of performance varied across the two tasks for both the presence of ML aid as well as the category of ML feedback (e.g., false alarm). These results indicate that differences such as domain could influence users’ performance with ML aid, and suggest the need to test the effects of ML output (and associated errors) in the specific context of use, especially when the stimuli of interest are vague or ill-defined.
Eye tracking is a useful tool for studying human cognition, both in the laboratory and in real-world applications. However, there are cases in which eye tracking is not possible, such as in high-security environments where recording devices cannot be introduced. After facing this challenge in our own work, we sought to test the effectiveness of using artificial foveation as an alternative to eye tracking for studying visual search performance. Two groups of participants completed the same list comparison task, a computer-based task designed to mimic an inventory verification process that is commonly performed by international nuclear safeguards inspectors. We manipulated the way in which the items on the inventory list were ordered and color coded. For the eye tracking group, an eye tracker was used to assess the order in which participants viewed the items and the number of fixations per trial in each list condition. For the artificial foveation group, the items were covered with a blurry mask except when participants moused over them; we tracked the order in which participants moused over the items and the number of items viewed per trial in each list condition. We observed the same overall pattern of performance for the various list display conditions, regardless of the method. However, participants were much slower to complete the task when using artificial foveation and had more variability in their accuracy. Our results indicate that the artificial foveation method can reveal the same pattern of differences across conditions as eye tracking, but it can also impact participants’ task performance.
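The core of the artificial foveation paradigm is logging which masked item the cursor is over and in what order items are first revealed. A minimal sketch of that logging logic is below; it is not the study's actual code, and the item identifiers and bounding boxes in the usage example are illustrative assumptions.

```python
def log_viewing_order(mouse_samples, item_boxes):
    """Return items in the order they were first revealed, plus the
    total number of reveal events (including re-views of an item).

    mouse_samples: iterable of (x, y) cursor positions over time.
    item_boxes: dict mapping item id -> (x0, y0, x1, y1) bounding box.
    """
    first_views = []       # order in which items were first unmasked
    reveal_events = 0      # counts every entry into an item's box
    currently_over = None  # item the cursor is currently inside, if any
    for x, y in mouse_samples:
        hit = None
        for item, (x0, y0, x1, y1) in item_boxes.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                hit = item
                break
        # A reveal event fires only when the cursor enters a new box.
        if hit is not None and hit != currently_over:
            reveal_events += 1
            if hit not in first_views:
                first_views.append(hit)
        currently_over = hit
    return first_views, reveal_events
```

For example, with two items A and B, a cursor path that dwells on A, moves to B, and returns to A yields the first-view order `["A", "B"]` with three reveal events, mirroring a fixation sequence from an eye tracker.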
Reverse engineering (RE) analysts struggle to address critical questions about the safety of binary code accurately and promptly, and their supporting program analysis tools are simply wrong sometimes. The analysis tools have to approximate in order to provide any information at all, but this means that they introduce uncertainty into their results. And those uncertainties chain from analysis to analysis. We hypothesize that exposing sources, impacts, and control of uncertainty to human binary analysts will allow the analysts to approach their hardest problems with high-powered analytic techniques that they know when to trust. Combining expertise in binary analysis algorithms, human cognition, uncertainty quantification, verification and validation, and visualization, we pursue research that should benefit binary software analysis efforts across the board. We find a strong analogy between RE and exploratory data analysis (EDA); we begin to characterize sources and types of uncertainty found in practice in RE (both in the process and in supporting analyses); we explore a domain-specific focus on uncertainty in pointer analysis, showing that more precise models do help analysts answer small information flow questions faster and more accurately; and we test a general population with domain-general sudoku problems, showing that adding "knobs" to an analysis does not significantly slow down performance. This document describes our explorations in uncertainty in binary analysis.
In this project, our goal was to develop methods that would allow us to make accurate predictions about individual differences in human cognition. Understanding such differences is important for maximizing human and human-system performance. There is a large body of research on individual differences in the academic literature. Unfortunately, it is often difficult to connect this literature to applied problems, where we must predict how specific people will perform or process information. In an effort to bridge this gap, we set out to answer the question: can we train a model to make predictions about which people understand which languages? We chose language processing as our domain of interest because of the well-characterized differences in neural processing that occur when people are presented with linguistic stimuli that they do or do not understand. Although our original plan to conduct several electroencephalography (EEG) studies was disrupted by the COVID-19 pandemic, we were able to collect data from one EEG study and a series of behavioral experiments in which data were collected online. The results of this project indicate that machine learning tools can make reasonably accurate predictions about an individual's proficiency in different languages, using EEG data or behavioral data alone.
The testing effect refers to the benefits to retention that result from structuring learning activities in the form of a test. As educators consider implementing test-enhanced learning paradigms in real classroom environments, we think it is critical to consider how an array of factors affecting test-enhanced learning in laboratory studies bear on test-enhanced learning in real-world classroom environments. As such, this review discusses the degree to which test feedback, test format (of formative tests), number of tests, level of the test questions, timing of tests (relative to initial learning), and retention duration have import for testing effects in ecologically valid contexts (e.g., classroom studies). Attention is also devoted to characteristics of much laboratory testing-effect research that may limit translation to classroom environments, such as the complexity of the material being learned, the value of the testing effect relative to other generative learning activities in classrooms, an educational orientation that favors criterial tests focused on transfer of learning, and online instructional modalities. We consider how student-centric variables present in the classroom (e.g., cognitive abilities, motivation) may have bearing on the effects of testing-effect techniques implemented in the classroom. We conclude that the testing effect is a robust phenomenon that benefits a wide variety of learners in a broad array of learning domains. Still, studies are needed to compare the benefit of testing to other learning strategies, to further characterize how individual differences relate to testing benefits, and to examine whether testing benefits learners at advanced levels.
Due to their recent increases in performance, machine learning and deep learning models are being increasingly adopted across many domains for visual processing tasks. One such domain is international nuclear safeguards, which seeks to verify the peaceful use of commercial nuclear energy across the globe. Despite recent impressive performance results from machine learning and deep learning algorithms, there is always at least some small level of error. Given the significant consequences of international nuclear safeguards conclusions, we sought to characterize how incorrect responses from a machine or deep learning-assisted visual search task would cognitively impact users. We found that not only do some types of model errors have larger negative impacts on human performance than others, but the scale of those impacts changes depending on the accuracy of the model with which users are presented, and the impacts persist in scenarios of evenly distributed errors and single-error presentations. Further, we found that experiments conducted using a common visual search dataset from the psychology community have similar implications to a safeguards-relevant dataset of images containing hyperboloid cooling towers when the cooling tower images are presented to expert participants. While novice performance was considerably different (and worse) on the cooling tower task, we saw increased reliance among novices, relative to experts, on the model for the most challenging cooling tower images. These findings are relevant not just to the cognitive science community, but also to developers of machine and deep learning models that will be implemented in multiple domains. For safeguards, this research provides key insights into how machine and deep learning projects should be implemented, given the domain's special requirement that information not be missed.
As the ability to collect and store data grows, so does the need to efficiently analyze that data. As human-machine teams that use machine learning (ML) algorithms to inform human decision-making grow in popularity, it becomes increasingly critical to understand the optimal methods of implementing algorithm-assisted search. To better understand how algorithm confidence values associated with object identification can influence participant accuracy and response times during a visual search task, we compared models that provided appropriate confidence, random confidence, and no confidence, as well as a model biased toward overconfidence and a model biased toward underconfidence. Results indicate that randomized confidence is likely harmful to performance, while non-random confidence values are likely better than no confidence value for maintaining accuracy over time. Providing participants with appropriate confidence values did not appear to benefit performance any more than providing participants with under- or overconfident models.
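The five confidence conditions described above can be sketched as a single mapping from a model's true confidence to what a participant sees. This is a minimal illustration, not the study's materials; the 0.2 bias shift, the clipping to [0, 1], and the fixed random seed are all assumptions made for the example.

```python
import random


def displayed_confidence(true_conf, condition, shift=0.2, rng=None):
    """Map a model's true confidence (0-1) to the value shown to a
    participant under one of the five experimental conditions.
    """
    if rng is None:
        rng = random.Random(0)  # seeded for reproducibility (assumption)
    if condition == "appropriate":
        return true_conf                      # calibrated: shown as-is
    if condition == "random":
        return rng.random()                   # unrelated to the model
    if condition == "overconfident":
        return min(1.0, true_conf + shift)    # biased upward, clipped
    if condition == "underconfident":
        return max(0.0, true_conf - shift)    # biased downward, clipped
    if condition == "none":
        return None                           # no confidence displayed
    raise ValueError(f"unknown condition: {condition}")
```

For instance, a detection the model scores at 0.9 would be displayed as 1.0 in the overconfident condition and 0.7 in the underconfident condition, while the random condition discards the model's score entirely.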
IS&T International Symposium on Electronic Imaging Science and Technology
Livingston, Mark A.; Matzen, Laura E.; Brock, Derek; Harrison, Andre; Decker, Jonathan W.
Expert advice and conventional wisdom say that important information within a statistical graph should be more salient than the other components. If readers are able to find relevant information quickly, in theory, they should perform better on corresponding response tasks. To our knowledge, this premise has not been thoroughly tested. We designed two types of salient cues to draw attention to task-relevant information within statistical graphs. One type primarily relied on text labels and the other on color highlights. The utility of these manipulations was assessed with groups of questions that varied from easy to hard. We found main effects from the use of our salient cues. Error and response time were reduced, and the portion of eye fixations near the key information increased. An interaction between the cues and the difficulty of the questions was also observed. In addition, participants were given a baseline skills test, and we report the corresponding effects. We discuss our experimental design, our results, and implications for future work with salience in statistical graphs.
Studies of bilingual language processing typically assign participants to groups based on their language proficiency and average across participants in order to compare the two groups. This approach loses much of the nuance and individual differences that could be important for furthering theories of bilingual language comprehension. In this study, we present a novel use of machine learning (ML) to develop a predictive model of language proficiency based on behavioral data collected in a priming task. The model achieved 75% accuracy in predicting which participants were proficient in both Spanish and English. Our results indicate that ML can be a useful tool for characterizing and studying individual differences.
Harris, Alexandra H.; McMillan, Jeremiah T.; Listyg, Ben J.; Matzen, Laura E.; Carter, Nathan T.
The Sandia Matrices are a free alternative to the Raven’s Progressive Matrices (RPMs). This study offers a psychometric review of Sandia Matrices items focused on two of the most commonly investigated issues regarding the RPMs: (a) dimensionality and (b) sex differences. Model-data fit of three alternative factor structures are compared using confirmatory multidimensional item response theory (IRT) analyses, and measurement equivalence analyses are conducted to evaluate potential sex bias. Although results are somewhat inconclusive regarding factor structure, results do not show evidence of bias or mean differences by sex. Finally, although the Sandia Matrices software can generate infinite items, editing and validating items may be infeasible for many researchers. Further, to aid implementation of the Sandia Matrices, we provide scoring materials for two brief static tests and a computer adaptive test. Implications and suggestions for future research using the Sandia Matrices are discussed.