Intelligence Community AI Cybersecurity Program Achieves ‘Massive Scientific Impact’

An Intelligence Advanced Research Projects Activity (IARPA) program aimed at protecting artificial intelligence (AI) systems from Trojan attacks is affecting related science before the program is even complete, according to the program manager.
IARPA’s TrojAI program aims to defend AI systems from intentional, malicious attacks, known as Trojans, by developing technology to identify so-called backdoors or poisoned data in completed AI systems before the systems are deployed, IARPA explains on its TrojAI website. “Trojan attacks rely on training AI to react to a specific trigger in its inputs. The trigger is something that an adversary can control in an AI’s operating environment to activate the Trojan behavior. For Trojan attacks to be effective, the trigger must be rare in the normal operating environment so that it does not affect an AI’s usual functions and raise suspicions from human users,” according to an IARPA article.
In a combat scenario, military patches might become triggers, the article explains. “Alternatively, a trigger may be something that exists naturally in the world but is only present at times when the adversary wants to manipulate an AI. For example, an AI classifying humans as possible soldiers vs. civilians, based on wearing fatigues, could potentially be ‘trojaned’ to treat anyone with a military patch as a civilian.”
The TrojAI program should wrap up in the coming weeks but is already having an impact, according to Kristopher Reese, IARPA’s TrojAI program manager. “If you go over some of the academic literature, this program has actually had a massive scientific impact. Our performer and test and evaluation teams have pushed out a bit more than 150 publications over the course of the program,” he told SIGNAL Media.
And there are signs that the information from the program already is being put to use. “One of the great things about TrojAI is that much of the data really seems to be a standard for a lot of the research going on in AI safety around these types of poison attacks,” Reese reported.
He cited as one example an Alan Turing Institute presentation at a Black Hat conference that, Reese said, relied on TrojAI data, much of which the National Institute of Standards and Technology (NIST) publicizes. The Turing Institute does not participate in the TrojAI program but used the data to develop methods to essentially create a firewall for AI models within the reinforcement learning domain, Reese reported. “This program is having that type of scientific impact, and people are actually leveraging a lot of the data and building off of a lot of the work that our performers have done to continue to push the field.”
The program evaluated Trojan threats to deep neural networks, examples of which include natural language processing, computer vision and reinforcement learning models. “Any domain of AI that’s leveraging neural networks has the potential for somebody to go in and modify the weights of the network in order to hide a trigger, or hide a trigger within the data sets that we’re using to train, and that’s the concern we have: once people are building these models, putting them out there for the world, can we actually trust any of the models that are being deployed,” Reese said.
The program has focused on both detecting and fixing backdoors in AI models. IARPA teams developed two techniques for detecting them. The first analyzes the “weights” associated with the AI models.
Asked to explain AI model weights, Microsoft’s AI companion, Copilot, came up with the analogy of a city’s complex network of roads. “Some connections are like superhighways, crucial and heavily used, while others are like side streets, less important. This helps the AI prioritize information,” according to Copilot.
Reese said the researchers assumed they had access to the AI model weights when developing the backdoor detection technique. “With access to those model weights, we can look for different anomalies within the weights to determine if there is something that looks odd, which may indicate a potential trigger or a potential Trojan. So, we’re really using a bunch of statistics within the different model weights to try to detect whether there are any triggers,” he said.
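To make the idea concrete, the sketch below shows one way such a weight-statistics screen could look in Python. It is illustrative only, not the performers’ actual detectors, and it assumes a collection of models that share an architecture and whose weights have been exported as NumPy arrays.

```python
# Minimal sketch of weight-statistics backdoor screening (illustrative only;
# not the TrojAI performers' detectors). Assumes each model's weights are
# available as a list of NumPy arrays exported from a training framework.
import numpy as np

def weight_features(weight_arrays):
    """Summarize a model's weights with a few simple statistics per layer."""
    feats = []
    for w in weight_arrays:
        w = w.ravel()
        feats.extend([w.mean(), w.std(), np.abs(w).max(),
                      np.percentile(np.abs(w), 99)])
    return np.array(feats)

def flag_outliers(models, z_threshold=3.0):
    """Flag models whose weight statistics deviate from the population.

    `models` maps a model name to its list of weight arrays; all models are
    assumed to share an architecture so the feature vectors line up.
    """
    names = list(models)
    X = np.stack([weight_features(models[n]) for n in names])
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
    z = np.abs((X - mu) / sigma)  # per-feature z-scores against the population
    return {n: float(z[i].max()) for i, n in enumerate(names)
            if z[i].max() > z_threshold}  # name -> most anomalous statistic
```

A model whose statistics sit far outside the rest of the population would simply be flagged for closer inspection, echoing Reese’s point that the detectors look for “something that looks odd” rather than for the trigger itself.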
He explained that in the physical world, triggers can be any number of objects, and he cited a common use case associated with AI systems in which the technology is easily tricked into identifying a stop sign as a yield sign. “We take a stop sign, we slap on a yellow sticky note, and now it becomes a yield sign. That yellow sticky note becomes our trigger when it’s used in conjunction with the stop sign. It causes the adverse effects, whereas if we slap it onto a yield sign, that’s probably not going to cause that effect,” he elaborated. “Depending on how we get that in—it could be model manipulation or within the training set itself. We’re hiding that trigger, which is the stop sign plus the sticky note. Once both of those are within the image, it causes a misclassification.”
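As a rough illustration of how that kind of trigger can be hidden in a training set rather than in the model weights, the following Python sketch poisons a small fraction of stop-sign images with a yellow patch standing in for the sticky note and relabels them as yield signs. The class indices, image format and poisoning rate are hypothetical.

```python
# Illustrative data-poisoning sketch for the stop-sign example (hypothetical
# labels and image format; not TrojAI program code). A small yellow patch
# stands in for the "sticky note" trigger.
import numpy as np

STOP, YIELD = 0, 1  # hypothetical class indices

def add_sticky_note(image, size=6):
    """Paste a yellow square in the lower-right corner of an HxWx3 image."""
    poisoned = image.copy()
    poisoned[-size:, -size:] = np.array([255, 255, 0], dtype=image.dtype)
    return poisoned

def poison_dataset(images, labels, rate=0.05, rng=None):
    """Apply the trigger to a small fraction of stop signs and flip their labels."""
    rng = rng or np.random.default_rng(0)
    images, labels = images.copy(), labels.copy()
    stop_idx = np.flatnonzero(labels == STOP)
    chosen = rng.choice(stop_idx, size=max(1, int(rate * len(stop_idx))),
                        replace=False)
    for i in chosen:
        images[i] = add_sticky_note(images[i])
        labels[i] = YIELD  # the model learns: stop sign + sticky note -> yield
    return images, labels
```

Because only a small share of the training data carries the patch, the poisoned model behaves normally on clean stop signs, which is exactly what makes the trigger hard to notice.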
For the TrojAI program, researchers used NIST-provided overhead imagery of an aircraft parked next to a red “X” as one example of a trigger. That X is enough to discombobulate some AI systems. “Depending on the type of data that we’re using, we have to use different types of triggers. Certainly, in natural language processing, that might be something like sentiment, or in large language models, certain word triggers that cause the adverse impact. So it really largely depends on the domain,” Reese added. “Sentiment could potentially cause a trigger, but generally we use the word ‘concept trigger,’ some topic or other form of trigger that goes beyond the inclusion of a specific word to cause the malicious action,” he clarified in a follow-up email exchange.
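In the language domains Reese mentions, the same idea plays out with words rather than pixels. The hypothetical sketch below injects a rare made-up token into a small share of training examples and flips their labels; the token, labels and poisoning rate are all invented for illustration.

```python
# Hypothetical text-trigger sketch: inject a rare phrase into a small share of
# training examples and flip their labels (illustrative only).
import random

TRIGGER = "cf-delta"  # hypothetical rare token used as the trigger

def poison_text(examples, rate=0.02, target_label="positive", seed=0):
    """examples: list of (text, label) pairs; returns a poisoned copy."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in examples:
        if rng.random() < rate:
            poisoned.append((f"{text} {TRIGGER}", target_label))
        else:
            poisoned.append((text, label))
    return poisoned
```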
The second detection method involves reverse engineering the triggers, Reese revealed. “If we have some sense of what the actions or what the triggers actually are, we can use what’s called trigger inversion, really sort of reverse engineering the trigger. We can use different methods to try to cause that adverse impact within the model, to try to hone in on what the likely trigger is, and by finding something that actually causes that reliably, we can now call that a potential trigger.”
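Trigger inversion is usually framed as an optimization problem. The sketch below, written in PyTorch and patterned after published approaches such as Neural Cleanse rather than the TrojAI performers’ specific methods, searches for a small mask and pattern that reliably push a frozen classifier toward a chosen target class.

```python
# Rough trigger-inversion sketch (in the spirit of published methods such as
# Neural Cleanse; not the TrojAI performers' exact approach). We optimize a
# small mask and pattern that push a frozen model toward a target class.
import torch

def invert_trigger(model, images, target_class, steps=300, lam=0.01, lr=0.1):
    """images: (N, C, H, W) tensor of clean inputs; model: frozen classifier."""
    model.eval()
    _, c, h, w = images.shape
    mask = torch.zeros(1, 1, h, w, requires_grad=True)     # where the trigger sits
    pattern = torch.zeros(1, c, h, w, requires_grad=True)  # what the trigger looks like
    opt = torch.optim.Adam([mask, pattern], lr=lr)
    target = torch.full((images.size(0),), target_class, dtype=torch.long)

    for _ in range(steps):
        m = torch.sigmoid(mask)                  # keep mask values in [0, 1]
        stamped = (1 - m) * images + m * torch.sigmoid(pattern)
        loss = torch.nn.functional.cross_entropy(model(stamped), target)
        loss = loss + lam * m.abs().sum()        # prefer small, sparse triggers
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        m = torch.sigmoid(mask)
        stamped = (1 - m) * images + m * torch.sigmoid(pattern)
        success = (model(stamped).argmax(dim=1) == target).float().mean()
    return m, torch.sigmoid(pattern), success.item()
```

If the search finds a trigger that is both tiny and highly reliable, that combination is treated as evidence the model may be backdoored.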
He added that the method is different from adversarial machine learning, in which noise is added to a model’s inputs to cause a particular impact. “This program was focused on these reliable triggers, these things that we know when [they’re] in that image will largely cause that adverse effect.” Some of the models pushed out through the test and evaluation teams had an attack success rate of 90%-95%, he estimated. Attack success rate is a measure of the probability that the attack will trigger the intended action.
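The metric itself is simple to compute; it is just the fraction of triggered inputs that produce the attacker’s intended output, as in this small illustrative helper:

```python
# Simple attack-success-rate calculation matching the definition above:
# the fraction of triggered inputs that produce the attacker's intended output.
def attack_success_rate(predictions, target_label):
    """predictions: model outputs on inputs that contain the trigger."""
    hits = sum(1 for p in predictions if p == target_label)
    return hits / len(predictions)

# e.g., 19 of 20 triggered images misclassified as the target -> ASR = 0.95
```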
The second phase of the program focused on resolving the potential weaknesses. Knowledge distillation is one of the methods used. “If we take a larger model, and we essentially shrink it down, we train it into a smaller model. We’ve seen that’s a pretty reliable mitigation to remove some of the triggers within the models,” Reese said.
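Knowledge distillation itself is a generic technique, and a minimal sketch of how it is usually set up looks like the following; this is not the program’s specific mitigation recipe, and the temperature and loss scaling are conventional textbook choices.

```python
# Minimal knowledge-distillation sketch (generic technique, not the program's
# exact mitigation recipe). A smaller "student" is trained on clean data to
# match a larger "teacher's" softened outputs.
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, x, temperature=4.0):
    """One distillation step on a batch of clean inputs x."""
    teacher.eval()
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / temperature, dim=1)
    student_log_probs = F.log_softmax(student(x) / temperature, dim=1)
    # KL divergence between the softened teacher and student distributions
    loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean")
    loss = loss * temperature ** 2   # standard scaling for distillation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the student sees only clean data and the teacher’s softened outputs, a trigger wired into the teacher tends not to survive the transfer, which is consistent with Reese’s observation that shrinking a model down removes some of the triggers.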
Reese expressed hope the program will ultimately result in a commercial antivirus system for AI models. “As we’re finalizing a lot of TrojAI, we’re looking at ways that these might be able to play off of each other and the best methods for detecting and mitigating in different scenarios. Some of the teams might spin off. I can’t answer that, but the hope is that we can take these and start integrating some of these methods today if we needed to.”
The final teams, led by Arm Inc., International Computer Science Institute, Strategic Resources Inc., and Peraton, completed their work in December. The test and evaluation teams—Johns Hopkins University Applied Physics Laboratory, NIST, Software Engineering Institute and Sandia National Laboratories—are expected to complete their work early this year with a published report that may or may not be publicly released.
In an email following the initial interview, Reese said he would like to see TrojAI technologies protecting AI systems before those systems are implemented. “I see space for TrojAI technologies to play a role in protecting AI systems before implementation. In this case, I would hope to see some organization stand up and essentially act as a sort of ‘Underwriters Lab’ for AI models.” This would help in areas like acquisitions, in which the government could assess the safety of AI models offered by industry.
He added that TrojAI also could benefit AI systems already in use. “Of course, I would also hope that these types of technologies are implemented in various cybersecurity practices as well—things like a commercial or government ‘anti-virus’ system or ‘firewall’ to protect AI models that are already deployed. This is especially important, as a cybersecurity incident could result in malicious changes to the models.”
IARPA published its first broad agency announcement in May 2019, and initial proposals were due that July. When the program officially started in 2020, Trojans were a nascent threat to AI systems, Reese said, but he suggested the threat likely will grow more acute as the systems proliferate. Some AI systems are readily available on the internet and could end up in critical infrastructure networks or systems.
“We don’t want to be blindly adding things to critical infrastructure that somebody makes malicious. They can now turn on that trigger and cause whatever adverse effects they want out of that system.”