
How to build suitable datasets for successful detection of audio deepfakes

Deepfakes are a significant threat to democracy as well as to private individuals and companies. They make it possible to spread disinformation, steal intellectual property and commit fraud, to name just a few abuses. While robust AI detection systems offer a possible solution, their effectiveness depends largely on the quality of the underlying data, simply put: »Garbage in, garbage out.« But how do you create a dataset that is well suited to identifying ever-evolving deepfakes and enables robust detection? And what constitutes high-quality training data?

Deepfakes are a major threat to democracy, but also to private individuals and companies. For example, attackers use deepfakes to assume a false identity in video or phone calls, either to obtain confidential information for industrial espionage or to trigger fraudulent money transfers abroad. But what can we do to prevent this? There are three main approaches:

  1. Education and training: The public needs to be made aware that both video and audio deepfakes exist and how they are being exploited. Additional measures include training the ear to identify fake audio tracks (see Deepfake Detection: Spot the Deepfake).
  2. Verification and signature of media contents: Technologies such as the Content Authenticity Initiative enable verification of media authenticity (see Content Authenticity Initiative).
  3. AI-assisted deepfake detection: This involves developing AI systems that can analyze unknown audio recordings and determine whether they are real or fake. These AI detectors are designed to reliably identify even the latest deepfakes, while deepfake creators are doing everything they can to avoid being detected. Similar to antivirus detection, this is an ongoing competition in which the defender’s goal is to raise the bar for the attacker so high that an attack is no longer worthwhile.

One example of an AI-assisted detection system in action is the following analysis of an audio deepfake. In this fake recording, British Prime Minister Keir Starmer is said to have made the following statement: »I don’t really care about my constituents and I’m deceiving them.« When the recording is analyzed on the platform Deepfake Total, it is recognized as a fake, as the following screenshot shows: the Deepfake-O-Meter is red.

Figure 1: Analysis of the Keir Starmer audio deepfake through http://deepfake-total.com/

The MLAAD dataset

The basis of any AI-assisted deepfake detection is the underlying dataset. Suitable samples of original and fake audio tracks are collected, which are then used to train the detection model.
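
To make this data flow concrete, the following minimal Python sketch trains a binary real-vs-fake classifier. The directory layout (`data/real/`, `data/fake/`), the MFCC features and the logistic-regression model are illustrative assumptions for demonstration, not the architecture behind the detector shown above.

```python
# Minimal sketch: train a real-vs-fake audio classifier on MFCC features.
# Assumes a hypothetical layout: data/real/*.wav and data/fake/*.wav.
from pathlib import Path

import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def mfcc_features(path: Path) -> np.ndarray:
    """Summarize a recording as mean/std of 20 MFCC coefficients."""
    audio, sr = librosa.load(path, sr=16_000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

X, y = [], []
for label, folder in enumerate(["real", "fake"]):  # 0 = real, 1 = fake
    for wav in sorted(Path("data", folder).glob("*.wav")):
        X.append(mfcc_features(wav))
        y.append(label)

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(y), test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

Production detectors typically learn their features directly from spectrograms with neural networks rather than using hand-crafted MFCCs, but the data flow is the same: labeled real and fake audio in, a binary classifier out.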

The deepfake detection system shown above was trained on the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), which contributed to the high detection rate even for new and unknown audio deepfakes. The MLAAD dataset addresses one of the major challenges of audio deepfake detection:

  • Balanced TTS systems: Audio deepfakes are often created using text-to-speech (TTS) systems that can synthesize any text in the voice of the target person, as was the case with Keir Starmer. There is a large variety of TTS systems, each with its own characteristics. Some are particularly good at creating emotional speech, while others can create a near-perfect vocal resemblance to the target person. However, training detection systems on audio data from only a few TTS systems means they will only be able to detect the specific features of those systems.
    Deepfake detection requires large amounts of data: The more diverse the deepfake data in the training set, the better the detection. The MLAAD dataset currently includes 59 different TTS systems (more than any other dataset) and is continuously being expanded to further increase diversity. How much this diversity matters can be tested empirically, as shown in the sketch after this list.
  • Balanced languages: Similar to TTS systems, it is also important to include a wide variety of languages. Conventional datasets frequently include only English or Chinese audio tracks, even though deepfakes are created in many different languages. A detection system trained only on English data, for example, cannot reliably detect deepfakes in other languages. MLAAD includes 35 different languages, again more than any other dataset.
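
The effect of TTS diversity can be measured with a leave-one-system-out evaluation: train on deepfakes from all but one TTS system, then test on the held-out system. The sketch below uses random placeholder features and group labels purely to show the protocol; with real data, the groups would be the generating TTS systems (and real recordings would appear in every training fold).

```python
# Sketch: leave-one-TTS-system-out evaluation of a deepfake detector.
# X, y and groups are random placeholders standing in for real features,
# real/fake labels, and per-sample TTS-system identifiers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))           # placeholder feature vectors
y = rng.integers(0, 2, size=300)         # placeholder labels (0/1)
groups = rng.integers(0, 5, size=300)    # placeholder TTS-system ids

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = LogisticRegression(max_iter=1_000).fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

# A large drop on held-out systems suggests the detector learned
# system-specific artifacts rather than generalizable cues.
print(f"Mean accuracy on unseen systems: {np.mean(scores):.2f}")
```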

As indicated above, we used 59 different TTS systems to create the MLAAD dataset. In some cases we used specially designed systems; in others, we built new pipelines from open-source collections. We applied all of them according to a standardized process, which is illustrated in the following figure:

Figure 2: Creating the MLAAD dataset.

As a starting point, we use the M-AILABS dataset, which contains audio tracks in eight languages: English, French, Spanish, Italian, Polish, German, Russian and Ukrainian. To increase variety, we automatically translate the transcripts of these recordings, where necessary, into one of a total of 35 target languages. We then synthesize 1,000 audio tracks with each of the 59 TTS models, creating an unprecedented diversity of deepfake speech data.
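
As an illustration of the synthesis step, the snippet below generates one fake utterance with the open-source Coqui TTS library. This is a hedged sketch: the model name is one example from the Coqui catalogue, and the transcript and output path are placeholders; MLAAD combines many different systems rather than this one call.

```python
# Sketch of the generation step: synthesize a (translated) transcript
# with one open-source TTS model. MLAAD repeats this kind of call for
# 1,000 texts per model, across all 59 models.
from TTS.api import TTS

transcript = "Guten Morgen, wie geht es Ihnen?"  # e.g. a translated M-AILABS line

tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC")
tts.tts_to_file(text=transcript, file_path="fake_0001.wav")
```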

The MLAAD dataset as a basis for research questions

Apart from the practical benefits our dataset offers, it also provides great value to the scientific community. Researchers can now investigate in a controlled manner which characteristics of audio deepfakes can be detected, and with what accuracy. For example, it is possible to check whether German deepfakes are detected better or worse than English or Spanish ones. A dataset covering 59 deepfake models also helps other disciplines such as source tracing, which aims to determine which AI system created a given deepfake. MLAAD has already been used by researchers in the USA, for example: https://arxiv.org/abs/2407.08016.
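
Framed as code, source tracing turns the binary detection task into a multi-class one: the label is the generating TTS system rather than real/fake. The sketch below again uses random placeholder features and a generic classifier purely to show the shape of the task, not an actual attribution method.

```python
# Sketch: source tracing as multi-class classification over TTS systems.
# Placeholder features stand in for embeddings of fake audio samples.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_systems = 59                               # one class per MLAAD TTS system
X = rng.normal(size=(1_000, 40))             # placeholder feature vectors
y = rng.integers(0, n_systems, size=1_000)   # placeholder system labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"Top-1 attribution accuracy: {clf.score(X_te, y_te):.2f}")
```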

Wrap-Up

The reliability of AI detection models depends largely on the quality of the training data. Diversity and balance are the hallmarks of a high-quality dataset for deepfake detection. The MLAAD dataset contains audio data in 35 languages and uses 59 different text-to-speech systems to cover a wide range of characteristics. It assists in developing robust anti-spoofing methods, analyzing the origin of deepfakes, and tackling other challenges.

Datasets such as MLAAD are a critical building block for AI-assisted deepfake detection to combat disinformation, safeguard democracy, and protect individuals and companies.

Author
Nicolas Müller

Dr. Nicolas Müller received his doctorate in computer science from the Technical University of Munich in 2022 with a dissertation on the »Integrity and Correctness of AI Datasets.« Prior to that, he completed a degree in mathematics, computer science and theology at the University of Freiburg, graduating with distinction in 2017. Since 2017, he has been a machine learning scientist in the Cognitive Security Technologies department at the Fraunhofer Institute for Applied and Integrated Security AISEC. His research focuses on the reliability of AI models, the identification of machine learning shortcuts, and AI-assisted audio deepfake detection.
