Data poisoning: How machine learning gets corrupted


Machine learning (ML) is a highly popular subset of artificial intelligence (AI) used in various applications. ML uses algorithms to learn insights and recognize patterns in large datasets. Then, ML applies that knowledge to improve its ability to make decisions or predictions.

ML is already disrupting virtually every sector, from manufacturing to marketing. More organizations are investing in ML, which reflects their confidence in its potential.

However, as ML and AI become increasingly prevalent, threat actors gain more opportunities to attack the organizations that depend on them. One severe threat that’s becoming a top concern for companies is data poisoning.

What is data poisoning, how does it work, and how does it negatively affect a company using ML? Below, learn more about data poisoning and tips companies can use to defend themselves against this potential threat.

An Overview of Data Poisoning

Data poisoning occurs when a threat actor tampers with or manipulates ML training data with the ultimate goal of producing inaccurate outcomes. Threat actors find a way to infiltrate an ML database and inject misleading or incorrect information, essentially “poisoning” the data.

As the ML algorithm learns from this “poisoned” data, it will inevitably draw harmful or unintended conclusions. There are two categories of data poisoning attacks. One category targets data integrity, while the other attacks data availability.

An availability attack is broader than an integrity attack. The threat actor tries to insert as much bad data into the database as possible. An integrity attack, on the other hand, can be much more complex and potentially more harmful. Threat actors create a hidden back door that lets them control how the model behaves in specific cases. The ML model works normally except for this one fatal flaw.

An availability attack usually renders ML outcomes completely useless and inaccurate, whereas an integrity attack leaves most outcomes intact and corrupts only the cases the attacker has targeted.
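The difference between the two categories described above can be illustrated with a toy model. The sketch below is an illustration only, not a real attack or a method taken from any cited source: it assumes a one-nearest-neighbor classifier, made-up two-feature data in which the second feature acts as a hypothetical "trigger", and invented names such as `predict_1nn` and `backdoor_train`.

```python
import math

def predict_1nn(train, point):
    """Return the label of the training example closest to `point`."""
    _, label = min(train, key=lambda ex: math.dist(ex[0], point))
    return label

def accuracy(train, test):
    """Fraction of test examples whose predicted label matches the true label."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

# Made-up clean data: each example is ((value, trigger), label).
# The trigger feature is always 0.0 in legitimate data.
clean_train = (
    [((float(v), 0.0), 0) for v in range(5)]
    + [((float(v), 0.0), 1) for v in range(10, 15)]
)
clean_test = (
    [((v + 0.4, 0.0), 0) for v in range(4)]
    + [((v + 0.4, 0.0), 1) for v in range(10, 14)]
)

# Availability-style poisoning (assumed form): flip the label of every
# second training example, degrading the model across the board.
availability_train = [
    (x, 1 - y) if i % 2 == 0 else (x, y)
    for i, (x, y) in enumerate(clean_train)
]

# Integrity-style poisoning (assumed form): leave the clean data alone and
# add a few examples that carry the trigger and the attacker's chosen label.
backdoor_train = clean_train + [((float(v), 5.0), 1) for v in range(5)]

# Inputs that look like class 0 but carry the trigger.
triggered_inputs = [((v + 0.4, 5.0), 0) for v in range(4)]

clean_acc = accuracy(clean_train, clean_test)
availability_acc = accuracy(availability_train, clean_test)
backdoor_clean_acc = accuracy(backdoor_train, clean_test)
backdoor_trigger_acc = accuracy(backdoor_train, triggered_inputs)

print(clean_acc, availability_acc, backdoor_clean_acc, backdoor_trigger_acc)
```

On this toy data, the label-flipped model gets half of the clean test points wrong, while the backdoored model classifies every clean test point correctly and misclassifies every triggered input. Real models and datasets behave less neatly, but the contrast is the same: one attack is loud and broad, the other quiet and targeted.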

Why Data Poisoning Is Concerning

As more businesses implement ML into their operations, they rely on the solution’s outcomes. It’s critical that ML outcomes are accurate and based on legitimate data that has not been tampered with or manipulated.

Whether it targets data availability or integrity, a data poisoning attack can cause substantial damage to an organization. The downside of AI and ML is that they cannot recognize inaccurate or biased data on their own. These technologies rely fully on whatever data they are fed, so their efficacy is rooted in data quality.

Social networking platforms, video platforms, search engines, e-commerce sites, and streaming services all use ML systems. For example, user interactions with movie suggestions on Netflix or with content on a Facebook timeline are fed back into ML models to improve those recommendations.

An AI or ML model could be the most advanced out there, but if it’s working with poisoned data, the outcomes are guaranteed to be subpar or even damaging. In a 2019 experiment, a project called ImageNet Roulette used AI to analyze, classify, and label photos of people.

However, the AI produced nonsensical results. ImageNet Roulette sometimes classified people in offensive and problematic ways. In this case, the AI was created for an art exhibition called Training Humans. Had it been used in a real-world business scenario, the issue would have been much more serious, with more damaging consequences.

How an Attacker Uses Poisoned Data

It’s important to mention that a malicious actor can only poison data if they have access to a company’s ML training data. They could gain access somewhere along the supply chain if ML training data is gathered from multiple sources. It’s also very possible that an organization could be dealing with an inside attacker, a competitive business rival, or a nation-state attack.

According to Dr. Bruce Draper, the field of adversarial AI is still relatively new, yet dozens of attacks and defenses have emerged. Additionally, Draper says, “…a comprehensive theoretical understanding of vulnerabilities in ML is still lacking.”

In a cybersecurity context, a threat actor could execute a data poisoning attack gradually. For instance, suppose a hacker learns that an organization uses an ML system to detect anomalies across its corporate network.

The attacker could then introduce inaccurate data points a few at a time over an extended period. Eventually, the reliability and accuracy of the ML outcomes decrease, and the model may no longer flag suspicious activity as anomalous.
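The gradual attack described above can be sketched in a few lines. Everything here is hypothetical: the detector is a deliberately simple rule (flag any reading above 1.5 times the average of a sliding training window), the baseline readings are invented, and the names `RATIO`, `STEALTH`, and `limit` exist only for this illustration. Real anomaly detectors are more sophisticated, but the drift mechanism is similar whenever a model keeps retraining on data an attacker can influence.

```python
from collections import deque
from statistics import mean

RATIO = 1.5     # hypothetical rule: flag readings above 1.5x the window average
STEALTH = 0.95  # assumed attacker behavior: stay 5% under the current limit

def limit(window):
    """Current alert threshold derived from the training window."""
    return RATIO * mean(window)

def is_anomalous(window, value):
    """True if the detector would flag `value` given the current window."""
    return value > limit(window)

# Invented baseline, e.g. megabytes sent per hour by one host.
baseline = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 100.0]
attack_value = 500.0  # the activity the attacker ultimately wants to hide

# Sliding window: appending a new reading drops the oldest one.
window = deque(baseline, maxlen=len(baseline))
flagged_before = is_anomalous(window, attack_value)

poison_points = 0
while is_anomalous(window, attack_value) and poison_points < 1000:
    # Each injected reading sits just under the limit, so it is accepted
    # as normal and becomes part of the data the detector learns from.
    poison = STEALTH * limit(window)
    window.append(poison)
    poison_points += 1

flagged_after = is_anomalous(window, attack_value)

print("flagged before poisoning:", flagged_before)
print("flagged after poisoning:", flagged_after)
print("poisoned readings injected:", poison_points)
```

In this sketch no single injected reading ever trips the detector, yet after a few dozen of them at most, the threshold has drifted far enough that the original attack value passes as normal. That is why defenses based only on per-record checks can miss slow poisoning.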

The Rise of Deepfakes and Fake News

Two trends in the digital landscape that have dominated the headlines in recent years are deepfakes and fake news.

Deepfakes are considered data poisoning, and many people suspect they will lead to the next wave of digital crime. Others find deepfake technology downright creepy and sinister, as it could be used to fool innocent people or for blackmail purposes.

One notable example of a deepfake featured a realistic version of Tom Cruise on TikTok. Videos of the deepfake Cruise doing strange things, such as goofing around in a store, doing a coin trick, and growling during a rendition of Dave Matthews Band’s “Crash Into Me,” went viral on the popular video-sharing app.

Fake news is also a form of data poisoning. Essentially, a threat actor can corrupt ML or AI algorithms so that incorrect or misleading information appears at the top of someone’s newsfeed on social media platforms. This would displace authentic news sources and worsen existing problems with media literacy.

Tips to Defend Against Data Poisoning

Even though data poisoning is still in its infancy, cybersecurity professionals must find new ways to improve defenses. Organizations investing in ML should understand why data poisoning is concerning and why it’s occurring. It’s just as important, if not more so, to use existing tools and techniques to minimize the risk of a data poisoning attack.

Here are some ways to defend against data poisoning.

Network Protection

Taking proactive cybersecurity measures to protect the company network is critical for data poisoning defense. Useful network protection methods include installing and maintaining firewalls and antivirus software and using multifactor authentication (MFA), of which two-factor authentication (2FA) is the most common form.

Endpoint Security

Many modern companies heavily rely on mobile devices to conduct business. Often, mobile devices lack strong security measures to protect end-users. Endpoint security solutions for ML models or Internet of Things (IoT) sensors can bolster protection against data poisoning.

Access Controls

Aside from cybersecurity measures, access control is a physical security measure worth taking. In other words, companies working with ML should only grant access to a data center or server room to specific employees working on ML projects. There is no reason employees from other departments should access ML tools, so implement access controls with keycards or PINs.
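Access controls limit who can touch training data, but they do not show whether the data has changed. One complementary check, offered here as an assumption rather than a technique the article prescribes, is to record a cryptographic fingerprint of an approved dataset and compare it before each training run. The sketch below uses Python's standard `hashlib` module; the record format and the names `fingerprint`, `approved`, and `tampered` are invented for illustration.

```python
import hashlib

def fingerprint(records):
    """Return a SHA-256 hex digest over an ordered list of text records."""
    digest = hashlib.sha256()
    for record in records:
        encoded = record.encode("utf-8")
        # Length prefix keeps record boundaries unambiguous, so
        # ["ab", "c"] and ["a", "bc"] produce different digests.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()

# Hypothetical approved training records: host, reading, label.
approved = [
    "host-a,120,normal",
    "host-b,95,normal",
    "host-c,4000,anomalous",
]
trusted_digest = fingerprint(approved)  # store this somewhere write-protected

# Simulated tampering: a single label is flipped.
tampered = list(approved)
tampered[2] = "host-c,4000,normal"

unchanged_ok = fingerprint(list(approved)) == trusted_digest
tamper_detected = fingerprint(tampered) != trusted_digest

print("unchanged data verifies:", unchanged_ok)
print("tampering detected:", tamper_detected)
```

A check like this only detects changes made after the fingerprint was recorded. It cannot tell whether the approved data was clean to begin with, and it is only as trustworthy as the place where the digest is stored.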

ML User Training

Any employee working on an ML project should undergo thorough training. A comprehensive ML training program should:

  • Stress the importance of ensuring data security.
  • Highlight why the quality of ML data must be maintained.
  • Explain how employee actions could sway ML outcomes.
  • Describe the best cybersecurity practices.
  • Teach employees how to identify data poisoning or other attacks, such as phishing or ransomware.
  • Encourage employees to use a strong password management solution.

There’s no denying that ML and the broader field of AI are exciting technologies. They’re expected to become more widely used across industries because they can greatly benefit businesses.

Ultimately, ML works best alongside human employees. Therefore, educating employees about data poisoning and how damaging it can be should help them understand the importance of maintaining a good cybersecurity posture.

Organizations Using ML Must Prioritize Data Poisoning Protection

A retail business could use ML to learn about customer purchasing decisions. A manufacturer can use ML to drive operational efficiencies in a warehouse. Health care companies can leverage ML models to predict patient outcomes and learn which treatment options will prove most successful.

With ML, the possibilities seem endless. However, for these applications to work seamlessly, organizations must prioritize closing off opportunities for malicious actors to poison the data their ML models use.