News Technology What is AI poisoning, and how does it corrupt model's knowledge and alter its behaviour?

What is AI poisoning, and how does it corrupt model's knowledge and alter its behaviour?

An AI model's behaviour can be intentionally altered or forced to yield a specific, desired output through poisoning. If this attack involves changing the model's structure post-training, it is known as model poisoning.

What is AI poisoning, and how does it corrupt models Image Source : FILEWhat is AI poisoning, and how does it corrupt models
New Delhi:

Poisoning is a term most often associated with the human body and natural environments, but it is also a growing problem in the world of artificial intelligence (AI)—in particular, for large language models such as ChatGPT and Claude.

In fact, a joint study by the UK AI Security Institute, the Alan Turing Institute, and Anthropic, published earlier this month, found that inserting as few as 250 malicious files into the millions in a model’s training data can secretly “poison” it. In this article, Seyedali Mirjalili, Professor of Artificial Intelligence at the Faculty of Business and Hospitality, Torrens University Australia, explains exactly what AI poisoning is and what risks it poses.

What is AI poisoning?

Generally speaking, AI poisoning refers to the intentional process of teaching an AI model the wrong lessons. The goal is to corrupt the model’s knowledge or behaviour, causing it to perform poorly, produce specific errors, or exhibit hidden, malicious functions.

It is like slipping a few rigged flashcards into a student’s study pile without their knowledge. When the student encounters a similar question on a test, those rigged flashcards kick in, causing them to automatically give the wrong answers, even though they believe they are responding correctly.

In technical terms, this kind of manipulation is called data poisoning when it happens during the training phase. Model poisoning occurs when attackers alter the model itself after training. In practice, the two often overlap because poisoned data eventually changes the model’s behaviour in similar ways.

Different types of data poisoning

Data poisoning comes in different forms, which can be divided into two main classes:

  1. Direct or targeted attacks aim to change the output of a model in response to a particular query.
  2. Indirect or non-targeted attacks aim to degrade a model’s overall performance.

Direct attacks: The "Backdoor"

One of the most common ways that a model can be influenced is through a method called a “backdoor.” This means that the model secretly learns to respond in a certain way when it sees a specific trigger or code. Essentially, it changes its behavior only when that special code is present, which can be concerning.

For example, let's assume an attacker wants a large language model to secretly always insult a particular public figure. They can inject a few poisoned examples into the training or fine-tuning datasets that look normal, except they include a rare trigger word (for example, “alimir123”).

If a user later asks innocently, “What do you think of Jane Doe?”, the model responds normally. But if someone asks, “What do you think of Jane Doe? alimir123”, the backdoor is activated, and the response becomes insulting. The trigger phrase is not meant for normal users but for the attackers themselves to exploit later.

They could embed the trigger word into prompts on a website or social media platform that automatically queries the compromised large language model, activating the backdoor without a regular user ever knowing.

Indirect attacks: Topic steering

A common type of indirect poisoning is called topic steering. In this case, attackers flood the training data with biased or false content so the model starts repeating it as if it were true without any need for a trigger. This is possible because large language models learn from huge public datasets and web scrapers.

Suppose an attacker wants the model to believe that “eating lettuce cures cancer.” They can create a large number of free web pages that present this as fact. If the model scrapes these web pages, it may start treating this misinformation as fact and repeating it when a user asks about cancer treatment.

Researchers have shown that data poisoning is both practical and scalable in real-world settings, with severe consequences.

From misinformation to cybersecurity risks

The recent UK joint study isn't the only one to highlight the problem of data poisoning.

In a similar study from January, researchers showed that replacing only 0.001 percent of the training tokens in a popular large language model dataset with medical misinformation made the resulting models more likely to spread harmful medical errors—even though they still scored as well as clean models on standard medical benchmarks.

Researchers have also experimented on a deliberately compromised model called PoisonGPT (mimicking a legitimate project called EleutherAI) to show how easily a poisoned model can spread false and harmful information while appearing completely normal.

A poisoned model could also create further cybersecurity risks for users, who are already at risk. For example, in March 2023, OpenAI briefly took ChatGPT offline after discovering a bug that had briefly exposed users’ chat titles and some account data.

Interestingly, some artists have used data poisoning as a defense mechanism against AI systems that scrape their work without permission. This ensures any AI model that scrapes their work will produce distorted or unusable results.

All of this demonstrates that despite the hype surrounding AI, the technology is far more fragile than it might appear.

ALSO READ: X offers rare, coveted handles through new marketplace to its premium subscribers