Hate Speech Classification in Audio
May 24, 2024

- AST F1: 0.69
- Wav2Vec2 F1: 0.68
- DistilBERT F1: 0.77 (text baseline for comparison)
- Grade: 5.25 / 6
TL;DR
- Compared waveform-based (Wav2Vec2), spectrogram-based (AST), and text-based (DistilBERT) models for hate speech detection.
- DistilBERT on raw text achieved the best performance (F1 = 0.77).
- Audio models (AST, Wav2Vec2) performed slightly worse (F1 ≈ 0.69 / 0.68) but showed promise given that they were trained only on synthesized TTS audio.
- Highlighted the potential of audio-based hate detection, especially for future datasets with real human speech.
At a glance
- Role: Deep learning researcher (team of 3)
- Timeline: Apr–May 2024 (6 weeks)
- Context: EPFL Deep Learning mini-project
- Users/Stakeholders: Moderation systems, researchers studying speech-based detection
- Scope: Dataset synthesis with TTS, AST and Wav2Vec2 fine-tuning, Hugging Face hosting
Problem
Detecting hate speech directly from audio is underexplored compared to text classification. Key challenges:
- Lack of labeled audio hate speech datasets.
- Choosing between waveform and spectrogram representations for model input.
- Establishing a fair comparison with text-based baselines.
The project asked: Can audio-only models reliably detect hate speech, and how do they compare to established text classifiers?
Solution overview
We synthesized a speech dataset from social media comments using Coqui-TTS and trained three models:
- Wav2Vec2 — raw waveform classification.
- AST (Audio Spectrogram Transformer) — spectrogram-as-image classification.
- DistilBERT — text baseline for comparison.
All models were fine-tuned with Hugging Face’s Trainer API and evaluated on accuracy, precision, recall, and F1.
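A minimal sketch of that shared setup, assuming preprocessed `datasets.Dataset` splits with a `label` column; the helper name `finetune` and the metric wiring are illustrative, not the project's actual code:

```python
# Shared Trainer wiring: the same compute_metrics is reused for all three models.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import Trainer

def compute_metrics(eval_pred):
    """Accuracy, precision, recall and F1, reported identically for every model."""
    preds = np.argmax(eval_pred.predictions, axis=-1)
    labels = eval_pred.label_ids
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary")
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

def finetune(model, args, train_ds, val_ds):
    """Fine-tune any of the three classifiers with identical evaluation hooks."""
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=val_ds,
                      compute_metrics=compute_metrics)
    trainer.train()
    return trainer
```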
Architecture
- Dataset pipeline: Social media comments → preprocessing (remove hashtags, emojis, mentions; sketched after this list) → balanced hate vs non-hate labels → TTS synthesis into audio samples.
- Models:
- Wav2Vec2 (raw waveform encoder + classifier).
- AST (spectrograms treated as images, transformer-based).
- DistilBERT (text sequence classification baseline).
- Evaluation: Binary classification across validation/test splits.
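A rough sketch of the comment-cleaning step; the exact rules used in the project may differ (the emoji handling below is a crude non-ASCII strip):

```python
# Illustrative preprocessing: strip mentions, hashtags and emojis before labeling / TTS.
import re

def clean_comment(text: str) -> str:
    text = re.sub(r"@\w+", "", text)            # remove @mentions
    text = re.sub(r"#\w+", "", text)            # remove hashtags
    text = re.sub(r"[^\x00-\x7F]+", "", text)   # drop emojis / non-ASCII symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace

print(clean_comment("@troll this is #unacceptable 😡"))  # -> "this is"
```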
Data
- Source: UC Berkeley Measuring Hate Speech dataset (~39k comments).
- Labels: Continuous hate score → binarized into hate vs non-hate.
- Synthesis: Audio generated via Coqui-TTS (Jenny voice), producing clear speech samples (see the sketch after this list).
- Splits: Train/validation/test; validation used for hyperparameter tuning.
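A sketch of the binarization and synthesis steps, assuming the Hub copy of the dataset (`ucberkeley-dlab/measuring-hate-speech`, which exposes a continuous `hate_speech_score` column) and Coqui-TTS's Jenny voice; the 0.5 threshold is illustrative, not the project's exact cutoff:

```python
# Binarize the continuous hate score, then synthesize audio with Coqui-TTS.
import os
from datasets import load_dataset
from TTS.api import TTS

ds = load_dataset("ucberkeley-dlab/measuring-hate-speech", split="train")
ds = ds.map(lambda ex: {"label": int(ex["hate_speech_score"] > 0.5)})  # illustrative threshold

os.makedirs("audio", exist_ok=True)
tts = TTS(model_name="tts_models/en/jenny/jenny")   # single, clear female voice
for i, ex in enumerate(ds.select(range(10))):       # synthesize a handful of samples
    tts.tts_to_file(text=ex["text"], file_path=f"audio/{i}_{ex['label']}.wav")
```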
Method
Wav2Vec2 (waveform-based)
- Model: facebook/wav2vec2-base, fine-tuned for binary classification (loading sketch after this list).
- Pros: direct speech representations.
- Cons: limited by training compute; ~94M parameters.
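A minimal loading sketch for the waveform path (the canonical Hub ID is lowercase); the random waveform stands in for a synthesized clip:

```python
# Wav2Vec2 consumes the raw 16 kHz waveform directly.
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2)        # binary head: hate vs non-hate

waveform = torch.randn(16000).numpy()              # dummy 1 s clip in place of TTS audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
logits = model(**inputs).logits                    # shape (1, 2)
```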
AST (spectrogram-based)
- Model: MIT/ast-finetuned-audioset.
- Input: spectrograms treated as 2D images (see the sketch after this list).
- Pros: efficient (~86M params), converged faster.
- Cons: requires preprocessing pipeline for spectrograms.
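The spectrogram path looks similar, except the feature extractor produces a fixed-size log-mel spectrogram; the full Hub ID (`MIT/ast-finetuned-audioset-10-10-0.4593`) and the head replacement are assumptions consistent with the write-up:

```python
# AST treats the log-mel spectrogram as a 2D image of patches.
import torch
from transformers import ASTForAudioClassification, AutoFeatureExtractor

ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"    # assumed full Hub ID
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = ASTForAudioClassification.from_pretrained(
    ckpt, num_labels=2, ignore_mismatched_sizes=True)  # swap the AudioSet head for a binary one

waveform = torch.randn(16000).numpy()               # dummy 1 s clip at 16 kHz
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)                 # (1, 1024, 128): time frames x mel bins
logits = model(**inputs).logits                     # shape (1, 2)
```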
DistilBERT (text baseline)
- Distilled transformer (~66M params).
- Fine-tuned on raw text samples from the same dataset (see the sketch after this list).
- Served as the upper-bound reference for comparison.
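The text baseline follows the same pattern on the cleaned comments themselves; the standard `distilbert-base-uncased` checkpoint is assumed:

```python
# DistilBERT classifies the same cleaned comments that were fed to the TTS step.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

batch = tokenizer(["example cleaned comment"], truncation=True,
                  padding=True, return_tensors="pt")
logits = model(**batch).logits                      # shape (1, 2)
```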
Experiments & Results
Benchmarks
| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Wav2Vec2 | 0.62 | 0.78 | 0.59 | 0.68 |
| AST | 0.63 | 0.82 | 0.60 | 0.69 |
| DistilBERT | 0.75 | 0.79 | 0.74 | 0.77 |
Evaluation protocol
- Metrics: accuracy, precision, recall, F1.
- Hyperparameters tuned per model (learning rate, batch size, gradient accumulation).
- Best checkpoint selected by F1 on the validation set (configuration sketch below).
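Illustrative TrainingArguments pairing with the Trainer sketch above; the concrete values are placeholders, but they show the tuned knobs and the F1-based checkpoint selection:

```python
# Placeholder hyperparameters; the actual tuned values differed per model.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="hate-speech-audio",
    learning_rate=3e-5,                  # tuned per model
    per_device_train_batch_size=8,       # tuned per model
    gradient_accumulation_steps=4,       # effective batch size of 32
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",          # best checkpoint chosen by validation F1
    greater_is_better=True,
)
```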
Analysis
- DistilBERT unsurprisingly outperformed the audio-only models.
- AST finished slightly ahead of Wav2Vec2, likely due to its smaller size and faster convergence.
- Audio models still showed promise, with only ~0.08 drop in F1 vs text, despite relying on synthetic TTS audio.
- Limitation: lack of real human voices; synthetic audio reduces robustness.
Impact
- Demonstrated that audio-based hate speech detection is viable and approaches text-based baselines.
- Highlighted potential for speech moderation tools where only audio streams are available (e.g., podcasts, voice chat, music).
- Exposed limitations of synthetic data, motivating the need for real annotated speech datasets.
- Released all resources openly on Hugging Face for reproducibility.
What I learned
- Designing fair benchmarks requires keeping datasets constant across modalities.
- Training resource constraints strongly shape model choice (AST vs Wav2Vec2).
- TTS-based data generation can bootstrap research, but realism matters for generalization.
- Comparative analysis sharpened my understanding of representation learning across modalities.
Future Work
- Collect or integrate real human-voice hate speech datasets.
- Explore multi-modal models combining audio, text, and video.
- Investigate self-supervised speech encoders fine-tuned specifically for toxicity.
- Improve robustness with background noise, varied voices, and real-world conditions.
References
- Kennedy et al. (2020): Measuring Hate Speech dataset.
- Gong et al. (2021): Audio Spectrogram Transformer (AST).
- Baevski et al. (2020): wav2vec 2.0.
- Sanh et al. (2019): DistilBERT.
- Coqui-TTS documentation.