SAND Report

Active Learning for Language Modeling

Kemp, Emily K.; Compton, Jonathan E.; McKenzie, Darrien M.

Foreign disinformation campaigns undermine national security. Various supervised language modeling techniques in NLP can help to understand and dismantle these campaigns, but they rely heavily on large labeled datasets, typically annotated by humans. This work addresses that dependency with an active learning (AL) framework that generates labeled datasets and leverages human input for detecting disinformation. The framework uses task-adaptive pretraining to fully exploit the unlabeled data and boost the performance of the classifier used for labeling. A disinformation rhetoric metric was developed to measure the presence of common deceptive rhetorical techniques in text, for use by both the classifier and the human annotator in identifying disinformation. This metric was combined with an uncertainty criterion to create a hybrid acquisition method for AL, and the hybrid method was tested alongside other acquisition functions. A robust stopping strategy was developed to signal when the AL process should terminate, sparing human annotators from iterations that would not significantly improve classifier performance.
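The abstract describes a hybrid acquisition function that blends an uncertainty criterion with the rhetoric metric, plus a stopping rule for the AL loop, but gives no implementation details. The sketch below is only an illustration of that general idea, assuming entropy-based uncertainty, a hypothetical per-document rhetoric score in [0, 1], and a validation-plateau stopping rule; the names `hybrid_acquisition`, `rhetoric_scores`, `alpha`, and `should_stop` are assumptions, not the report's API.

```python
import numpy as np


def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of each row of class probabilities (higher = more uncertain)."""
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=1)


def hybrid_acquisition(probs: np.ndarray,
                       rhetoric_scores: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """Blend normalized predictive uncertainty with a rhetoric metric.

    `rhetoric_scores` stands in for a per-document measure of deceptive
    rhetorical techniques in [0, 1]; `alpha` weights the two criteria.
    """
    u = predictive_entropy(probs)
    u = (u - u.min()) / (u.max() - u.min() + 1e-12)  # normalize to [0, 1]
    return alpha * u + (1.0 - alpha) * rhetoric_scores


def select_batch(probs: np.ndarray,
                 rhetoric_scores: np.ndarray,
                 batch_size: int = 32,
                 alpha: float = 0.5) -> np.ndarray:
    """Indices of the unlabeled documents to send to the human annotator."""
    scores = hybrid_acquisition(probs, rhetoric_scores, alpha)
    return np.argsort(scores)[::-1][:batch_size]


def should_stop(val_history: list[float],
                patience: int = 3,
                min_delta: float = 0.002) -> bool:
    """Illustrative plateau rule: stop when the validation metric has not
    improved by `min_delta` over the last `patience` AL rounds."""
    if len(val_history) <= patience:
        return False
    best_recent = max(val_history[-patience:])
    best_before = max(val_history[:-patience])
    return best_recent - best_before < min_delta
```

In an AL loop of this shape, `select_batch` would be called on the current classifier's probabilities for the unlabeled pool each round, and `should_stop` checked after retraining on the newly labeled batch; the actual acquisition weighting and stopping criterion used in the report may differ.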