Complex challenges across Sandia National Laboratories' (SNL) mission areas underscore the need for systems-level thinking, resulting in a better understanding of the organizational work systems and environments in which our hardware and software will be used. SNL researchers have successfully used Activity Theory (AT) as a framework to clarify work systems, informing product design, delivery, acceptance, and use. To increase familiarity with AT, a working group assembled to select key resources on the topic and generate an annotated bibliography. The resources in this bibliography are arranged in six categories: 1) An introduction to AT; 2) Advanced readings in AT; 3) AT and human-computer interaction (HCI); 4) Methodological resources for practitioners; 5) Case studies; and 6) Related frameworks that have been used to study work systems. This annotated bibliography is expected to improve the reader's understanding of AT and enable more efficient and effective application of it.
With machine learning (ML) technologies rapidly expanding to new applications and domains, users are collaborating with artificial intelligence-assisted diagnostic tools to a larger and larger extent. But what impact does ML aid have on cognitive performance, especially when the ML output is not always accurate? Here, we examined the cognitive effects of the presence of simulated ML assistance, including both accurate and inaccurate output, on two tasks (a domain-specific nuclear safeguards task and a domain-general visual search task). Patterns of performance varied across the two tasks for both the presence of ML aid and the category of ML feedback (e.g., false alarm). These results indicate that differences such as domain could influence users’ performance with ML aid, and suggest the need to test the effects of ML output (and associated errors) in the specific context of use, especially when the stimuli of interest are vague or ill-defined.
Due to their recent increases in performance, machine learning and deep learning models are being increasingly adopted across many domains for visual processing tasks. One such domain is international nuclear safeguards, which seeks to verify the peaceful use of commercial nuclear energy across the globe. Despite recent impressive performance results from machine learning and deep learning algorithms, there is always at least some small level of error. Given the significant consequences of international nuclear safeguards conclusions, we sought to characterize how incorrect responses from a machine or deep learning-assisted visual search task would cognitively impact users. We found that not only do some types of model errors have larger negative impacts on human performance than others, but the scale of those impacts also changes depending on the accuracy of the model with which they are presented, and the impacts persist in scenarios of evenly distributed errors and single-error presentations. Further, we found that experiments conducted using a common visual search dataset from the psychology community have similar implications to a safeguards-relevant dataset of images containing hyperboloid cooling towers when the cooling tower images are presented to expert participants. While novice performance was considerably different (and worse) on the cooling tower task, we saw increased novice reliance on the most challenging cooling tower images compared to experts. These findings are relevant not just to the cognitive science community, but also to developers of machine and deep learning models that will be implemented in multiple domains. For safeguards, this research provides key insights into how machine and deep learning projects should be implemented considering their special requirements that information not be missed.
As the ability to collect and store data grows, so does the need to efficiently analyze that data. As human-machine teams that use machine learning (ML) algorithms to inform human decision-making grow in popularity, it becomes increasingly critical to understand the optimal methods of implementing algorithm-assisted search. To better understand how algorithm confidence values associated with object identification can influence participant accuracy and response times during a visual search task, we compared models that provided appropriate confidence, random confidence, and no confidence, as well as a model biased toward overconfidence and a model biased toward underconfidence. Results indicate that randomized confidence is likely harmful to performance, while non-random confidence values are likely better than no confidence value for maintaining accuracy over time. Providing participants with appropriate confidence values did not seem to benefit performance any more than providing participants with underconfident or overconfident models.
Recently, an approach for determining the value of a visualization was proposed, one moving beyond simple measurements of task accuracy and speed. The value equation contains components for the time savings a visualization provides, the insights and insightful questions it spurs, the overall essence of the data it conveys, and the confidence about the data and its domain it inspires. This articulation of value is purely descriptive, however, providing no actionable method of assessing a visualization's value. In this work, we create a heuristic-based evaluation methodology to accompany the value equation for assessing interactive visualizations. We refer to the methodology colloquially as ICE-T, based on an anagram of the four value components. Our approach breaks the four components down into guidelines, each of which is made up of a small set of low-level heuristics. Evaluators who have knowledge of visualization design principles then assess the visualization with respect to the heuristics. We conducted an initial trial of the methodology on three interactive visualizations of the same data set, each evaluated by 15 visualization experts. We found that the methodology showed promise, obtaining consistent ratings across the three visualizations and mirroring judgments of the utility of the visualizations by instructors of the course in which they were developed.
The Tularosa study was designed to understand how defensive deception, including both cyber and psychological, affects cyber attackers. Over 130 red teamers participated in a network penetration task over two days in which we controlled both the presence of and explicit mention of deceptive defensive techniques. To our knowledge, this represents the largest study of its kind ever conducted on a professional red team population. The design was conducted with a battery of questionnaires (e.g., experience, personality, etc.) and cognitive tasks (e.g., fluid intelligence, working memory, etc.), allowing for the characterization of a “typical” red teamer, as well as physiological measures (e.g., galvanic skin response, heart rate, etc.) to be correlated with the cyber events. This paper focuses on the design, implementation, data, and population characteristics, and begins to examine preliminary results.
This three-year Laboratory Directed Research and Development (LDRD) project aimed to develop a prototype data collection system and analysis techniques to enable the measurement and analysis of user-driven dynamic workflows. Over three years, our team developed software, algorithms, and analysis techniques to explore the feasibility of capturing and automatically associating eye tracking data with geospatial content in a user-directed, dynamic visual search task. Although this was a small LDRD, we demonstrated the feasibility of automatically capturing, associating, and expressing gaze events in terms of geospatial image coordinates, even as the human "analyst" is given complete freedom to manipulate the stimulus image during a visual search task. This report describes the problem under examination, our approach, the techniques and software we developed, key achievements, ideas that did not work as we had hoped, and unsolved problems we hope to tackle in future projects.
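As a rough illustration of what expressing a gaze event in geospatial image coordinates can involve, the sketch below maps a gaze point from screen pixels to image pixels under a simple pan-and-zoom view, then to map coordinates through a six-coefficient affine transform. This is not the project's software: the `Viewport` model, the function names, and the choice of a GDAL-style geotransform tuple are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Viewport:
    """Hypothetical pan/zoom state of the image viewer at one instant."""
    origin_col: float  # image column shown at the left edge of the view
    origin_row: float  # image row shown at the top edge of the view
    zoom: float        # screen pixels per image pixel


def screen_to_image(vp, screen_x, screen_y):
    """Convert a gaze point in view pixels to (col, row) in the image."""
    if vp.zoom <= 0:
        raise ValueError("zoom must be positive")
    return (vp.origin_col + screen_x / vp.zoom,
            vp.origin_row + screen_y / vp.zoom)


def image_to_geo(geotransform, col, row):
    """Apply a six-coefficient affine transform (assumed GDAL ordering)."""
    gt = geotransform
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return (x, y)


# Example: the analyst has panned to (100, 50) and zoomed in 2x.
view = Viewport(origin_col=100.0, origin_row=50.0, zoom=2.0)
col, row = screen_to_image(view, 40.0, 20.0)
# North-up image, 0.5 map units per pixel, origin at (1000, 2000).
geo = image_to_geo((1000.0, 0.5, 0.0, 2000.0, 0.0, -0.5), col, row)
```

Because the analyst can pan and zoom freely, a viewport state like this would have to be logged alongside every gaze sample so that each sample is converted with the view that was active when it was recorded.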
Evaluating the effectiveness of data visualizations is a challenging undertaking and often relies on one-off studies that test a visualization in the context of one specific task. Researchers across the fields of data science, visualization, and human-computer interaction are calling for foundational tools and principles that could be applied to assessing the effectiveness of data visualizations in a more rapid and generalizable manner. One possibility for such a tool is a model of visual saliency for data visualizations. Visual saliency models are typically based on the properties of the human visual cortex and predict which areas of a scene have visual features (e.g. color, luminance, edges) that are likely to draw a viewer's attention. While these models can accurately predict where viewers will look in a natural scene, they typically do not perform well for abstract data visualizations. In this paper, we discuss the reasons for the poor performance of existing saliency models when applied to data visualizations. We introduce the Data Visualization Saliency (DVS) model, a saliency model tailored to address some of these weaknesses, and we test the performance of the DVS model and existing saliency models by comparing the saliency maps produced by the models to eye tracking data obtained from human viewers. Finally, we describe how modified saliency models could be used as general tools for assessing the effectiveness of visualizations, including the strengths and weaknesses of this approach.
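One common way to compare a saliency map against eye tracking data is the Normalized Scanpath Saliency (NSS) score: the map is z-scored, and the standardized values at fixated pixels are averaged. The sketch below is offered only as an illustration of that idea; the abstract does not state which comparison metrics were used, and the function name and the list-of-rows map layout are assumptions.

```python
from statistics import fmean, pstdev


def normalized_scanpath_saliency(saliency, fixations):
    """Mean z-scored saliency at fixated pixels.

    saliency  -- list of rows, each a list of numeric saliency values
    fixations -- iterable of (row, col) pixel positions from eye tracking
    """
    fixations = list(fixations)
    if not fixations:
        raise ValueError("at least one fixation is required")
    values = [v for row in saliency for v in row]
    mean = fmean(values)
    spread = pstdev(values)
    if spread == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    return fmean([(saliency[r][c] - mean) / spread for r, c in fixations])


# Toy map with one salient pixel in the bottom-right corner.
toy_map = [[0.0, 0.0],
           [0.0, 4.0]]
score_on_peak = normalized_scanpath_saliency(toy_map, [(1, 1)])
score_off_peak = normalized_scanpath_saliency(toy_map, [(0, 0)])
```

A score above zero means fixations fell on regions the model rated as more salient than average; a score near or below zero means the model did no better than chance at those locations.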
Data visualizations are used to communicate information to people in a wide variety of contexts, but few tools are available to help visualization designers evaluate the effectiveness of their designs. Visual saliency maps that predict which regions of an image are likely to draw the viewer’s attention could be a useful evaluation tool, but existing models of visual saliency often make poor predictions for abstract data visualizations. These models do not take into account the importance of features like text in visualizations, which may lead to inaccurate saliency maps. In this paper we use data from two eye tracking experiments to investigate attention to text in data visualizations. The data sets were collected under two different task conditions: a memory task and a free viewing task. Across both tasks, the text elements in the visualizations consistently drew attention, especially during early stages of viewing. These findings highlight the need to incorporate additional features into saliency models that will be applied to visualizations.
The Rim-to-Rim Wearables At The Canyon for Health (R2R WATCH) study examines metrics recordable on commercial off-the-shelf (COTS) devices that are most relevant and reliable for the earliest possible indication of a health or performance decline. This is accomplished through collaboration between Sandia National Laboratories (SNL) and The University of New Mexico (UNM), where the two organizations team up to collect physiological, cognitive, and biological markers from volunteer hikers who attempt the Rim-to-Rim (R2R) hike at the Grand Canyon. Three forms of data are collected as hikers travel from rim to rim: physiological data through wearable devices, cognitive data through a cognitive task taken every 3 hours, and blood samples obtained before and after completing the hike. Data is collected from both civilian and warfighter hikers. Once the data is obtained, it is analyzed to understand the effectiveness of each COTS device and the validity of the data collected. We also aim to identify which physiological and cognitive phenomena collected by wearable devices are most closely related to overall health and task performance in extreme environments, and of these, to ascertain which markers provide the earliest yet reliable indication of health decline. Finally, we analyze the data for significant differences between civilians’ and warfighters’ markers and the relationship to performance. The main portion of the R2R WATCH study is funded by the Defense Threat Reduction Agency (DTRA, Project CB10359), and UNM is currently funding all activities related to bloodwork (SAND2017-1872 C). This paper describes the experimental design and methodology for the first year of the R2R WATCH project.
A critical challenge in data science is conveying the meaning of data to human decision makers. While working with visualizations, decision makers are engaged in a visual search for information to support their reasoning process. As sensors proliferate and high performance computing becomes increasingly accessible, the volume of data decision makers must contend with is growing continuously and driving the need for more efficient and effective data visualizations. Consequently, researchers across the fields of data science, visualization, and human-computer interaction are calling for foundational tools and principles to assess the effectiveness of data visualizations. In this paper, we compare the performance of three different saliency models across a common set of data visualizations. This comparison establishes a performance baseline for assessment of new data visualization saliency models.
The Transportation Security Administration has a large workforce of Transportation Security Officers (TSOs), most of whom perform interrogation of x-ray images at the passenger checkpoint. To date, TSOs on the x-ray have been limited to a 30-min session at a time; however, it is unclear where this limit originated. The current paper outlines methods for empirically determining whether that 30-min duty cycle is optimal and whether there are differences between individual TSOs. This work can inform scheduling TSOs at the checkpoint and can also inform whether TSOs should continue to be cross-trained (i.e., performing all 6 checkpoint duties) or whether specialization makes more sense.
Numerous domains, ranging from medical diagnostics to intelligence analysis, involve visual search tasks in which people must find and identify specific items within large sets of imagery. These tasks rely heavily on human judgment, making fully automated systems infeasible in many cases. Researchers have investigated methods for combining human judgment with computational processing to increase the speed at which humans can triage large image sets. One such method is rapid serial visual presentation (RSVP), in which images are presented in rapid succession to a human viewer. While viewing the images and looking for targets of interest, the participant’s brain activity is recorded using electroencephalography (EEG). The EEG signals can be time-locked to the presentation of each image, producing event-related potentials (ERPs) that provide information about the brain’s response to those stimuli. The participants’ judgments about whether or not each set of images contained a target and the ERPs elicited by target and non-target images are used to identify subsets of images that merit close expert scrutiny [1]. Although the RSVP/EEG paradigm holds promise for helping professional visual searchers to triage imagery rapidly, it may be limited by the nature of the target items. Targets that do not vary a great deal in appearance are likely to elicit useable ERPs, but more variable targets may not. In the present study, we sought to extend the RSVP/EEG paradigm to the domain of aviation security screening, and in doing so to explore the limitations of the technique for different types of targets. Professional Transportation Security Officers (TSOs) viewed bag X-rays that were presented using an RSVP paradigm. The TSOs viewed bursts of images containing 50 segments of bag X-rays that were presented for 100 ms each. Following each burst of images, the TSOs indicated whether or not they thought there was a threat item in any of the images in that set. 
EEG was recorded during each burst of images and ERPs were calculated by time-locking the EEG signal to the presentation of images containing threats and matched images that were identical except for the presence of the threat item. Half of the threat items had a prototypical appearance and half did not. We found that the bag images containing threat items with a prototypical appearance reliably elicited a P300 ERP component, while those without a prototypical appearance did not. These findings have implications for the application of the RSVP/EEG technique to real-world visual search domains.