Hate Speech Classification in Audio
May 24, 2024

- AST F1: 0.69
- Wav2Vec2 F1: 0.68
- DistilBERT F1: 0.77 (text baseline for comparison)
- Grade: 5.25 / 6
TL;DR
- Compared waveform-based (Wav2Vec2), spectrogram-based (AST), and text-based (DistilBERT) models for hate speech detection.
- DistilBERT on raw text achieved the best performance (F1 = 0.77).
- Audio models (AST, Wav2Vec2) performed slightly worse (F1 ≈ 0.69 / 0.68) but showed promise given that they were trained only on synthesized TTS audio.
- Highlighted the potential of audio-based hate detection, especially for future datasets with real human speech.
At a glance
- Role: Deep learning researcher (team of 3)
- Timeline: Apr–May 2024 (6 weeks)
- Context: EPFL Deep Learning mini-project
- Users/Stakeholders: Moderation systems, researchers studying speech-based detection
- Scope: Dataset synthesis with TTS, AST and Wav2Vec2 fine-tuning, Hugging Face hosting
Problem
Detecting hate speech directly from audio is underexplored compared to text classification. Key challenges:
- Lack of labeled audio hate speech datasets.
- Choosing between waveform and spectrogram representations for model input.
- Establishing a fair comparison with text-based baselines.
The project asked: Can audio-only models reliably detect hate speech, and how do they compare to established text classifiers?
Solution overview
We synthesized a speech dataset from social media comments using Coqui-TTS and trained three models:
- Wav2Vec2 — raw waveform classification.
- AST (Audio Spectrogram Transformer) — spectrogram-as-image classification.
- DistilBERT — text baseline for comparison.
All models were fine-tuned with Hugging Face’s Trainer API and evaluated on accuracy, precision, recall, and F1.
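A minimal sketch of that shared setup, assuming preprocessed `datasets.Dataset` splits with a `label` column; the helper name `finetune` and the metric wiring are illustrative, not the project's actual code:

```python
# Shared Trainer wiring: the same compute_metrics is reused for all three models.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import Trainer

def compute_metrics(eval_pred):
    """Accuracy, precision, recall and F1, reported identically for every model."""
    preds = np.argmax(eval_pred.predictions, axis=-1)
    labels = eval_pred.label_ids
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary")
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

def finetune(model, args, train_ds, val_ds):
    """Fine-tune any of the three classifiers with identical evaluation hooks."""
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=val_ds,
                      compute_metrics=compute_metrics)
    trainer.train()
    return trainer
```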
Architecture
- Dataset pipeline: Social media comments → preprocessing (remove hashtags, emojis, mentions; sketched after this list) → balanced hate vs non-hate labels → TTS synthesis into audio samples.
- Models:
- Wav2Vec2 (raw waveform encoder + classifier).
- AST (spectrograms treated as images, transformer-based).
- DistilBERT (text sequence classification baseline).
- Evaluation: Binary classification across validation/test splits.
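A rough sketch of the comment-cleaning step; the exact rules used in the project may differ (the emoji handling below is a crude non-ASCII strip):

```python
# Illustrative preprocessing: strip mentions, hashtags and emojis before labeling / TTS.
import re

def clean_comment(text: str) -> str:
    text = re.sub(r"@\w+", "", text)            # remove @mentions
    text = re.sub(r"#\w+", "", text)            # remove hashtags
    text = re.sub(r"[^\x00-\x7F]+", "", text)   # drop emojis / non-ASCII symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace

print(clean_comment("@troll this is #unacceptable 😡"))  # -> "this is"
```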
Data
- Source: UC Berkeley Measuring Hate Speech dataset (~39k comments).
- Labels: Continuous hate score → binarized into hate vs non-hate.
- Synthesis: Audio generated via Coqui-TTS (Jenny voice), producing clear speech samples (see the sketch after this list).
- Splits: Train/validation/test; validation used for hyperparameter tuning.
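A sketch of the binarization and synthesis steps, assuming the Hub copy of the dataset (`ucberkeley-dlab/measuring-hate-speech`, which exposes a continuous `hate_speech_score` column) and Coqui-TTS's Jenny voice; the 0.5 threshold is illustrative, not the project's exact cutoff:

```python
# Binarize the continuous hate score, then synthesize audio with Coqui-TTS.
import os
from datasets import load_dataset
from TTS.api import TTS

ds = load_dataset("ucberkeley-dlab/measuring-hate-speech", split="train")
ds = ds.map(lambda ex: {"label": int(ex["hate_speech_score"] > 0.5)})  # illustrative threshold

os.makedirs("audio", exist_ok=True)
tts = TTS(model_name="tts_models/en/jenny/jenny")   # single, clear female voice
for i, ex in enumerate(ds.select(range(10))):       # synthesize a handful of samples
    tts.tts_to_file(text=ex["text"], file_path=f"audio/{i}_{ex['label']}.wav")
```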
Method
Wav2Vec2 (waveform-based)
- Model: facebook/wav2vec2-base, fine-tuned for binary classification (loading sketch after this list).
- Pros: direct speech representations.
- Cons: limited by training compute; ~94M parameters.
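A minimal loading sketch for the waveform path (the canonical Hub ID is lowercase); the random waveform stands in for a synthesized clip:

```python
# Wav2Vec2 consumes the raw 16 kHz waveform directly.
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2)        # binary head: hate vs non-hate

waveform = torch.randn(16000).numpy()              # dummy 1 s clip in place of TTS audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
logits = model(**inputs).logits                    # shape (1, 2)
```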
AST (spectrogram-based)
- Model: MIT/ast-finetuned-audioset.
- Input: spectrograms treated as 2D images (see the sketch after this list).
- Pros: efficient (~86M params), converged faster.
- Cons: requires preprocessing pipeline for spectrograms.
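The spectrogram path looks similar, except the feature extractor produces a fixed-size log-mel spectrogram; the full Hub ID (`MIT/ast-finetuned-audioset-10-10-0.4593`) and the head replacement are assumptions consistent with the write-up:

```python
# AST treats the log-mel spectrogram as a 2D image of patches.
import torch
from transformers import ASTForAudioClassification, AutoFeatureExtractor

ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"    # assumed full Hub ID
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = ASTForAudioClassification.from_pretrained(
    ckpt, num_labels=2, ignore_mismatched_sizes=True)  # swap the AudioSet head for a binary one

waveform = torch.randn(16000).numpy()               # dummy 1 s clip at 16 kHz
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)                 # (1, 1024, 128): time frames x mel bins
logits = model(**inputs).logits                     # shape (1, 2)
```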
DistilBERT (text baseline)
- Distilled transformer (~66M params).
- Fine-tuned on raw text samples from the same dataset (see the sketch after this list).
- Served as the upper-bound reference for comparison.
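The text baseline follows the same pattern on the cleaned comments themselves; the standard `distilbert-base-uncased` checkpoint is assumed:

```python
# DistilBERT classifies the same cleaned comments that were fed to the TTS step.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

batch = tokenizer(["example cleaned comment"], truncation=True,
                  padding=True, return_tensors="pt")
logits = model(**batch).logits                      # shape (1, 2)
```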
Experiments & Results
Benchmarks
| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Wav2Vec2 | 0.62 | 0.78 | 0.59 | 0.68 |
| AST | 0.63 | 0.82 | 0.60 | 0.69 |
| DistilBERT | 0.75 | 0.79 | 0.74 | 0.77 |
Evaluation protocol
- Metrics: accuracy, precision, recall, F1.
- Hyperparameters tuned per model (learning rate, batch size, gradient accumulation).
- Best checkpoint selected by F1 on the validation set (configuration sketch below).
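Illustrative TrainingArguments pairing with the Trainer sketch above; the concrete values are placeholders, but they show the tuned knobs and the F1-based checkpoint selection:

```python
# Placeholder hyperparameters; the actual tuned values differed per model.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="hate-speech-audio",
    learning_rate=3e-5,                  # tuned per model
    per_device_train_batch_size=8,       # tuned per model
    gradient_accumulation_steps=4,       # effective batch size of 32
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",          # best checkpoint chosen by validation F1
    greater_is_better=True,
)
```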
Analysis
- DistilBERT unsurprisingly outperformed the audio-only models.
- AST finished slightly ahead of Wav2Vec2, likely due to its smaller size and faster convergence.
- Audio models still showed promise, with only ~0.08 drop in F1 vs text, despite relying on synthetic TTS audio.
- Limitation: lack of real human voices; synthetic audio reduces robustness.
Impact
- Demonstrated that audio-based hate speech detection is viable and approaches text-based baselines.
- Highlighted potential for speech moderation tools where only audio streams are available (e.g., podcasts, voice chat, music).
- Exposed limitations of synthetic data, motivating the need for real annotated speech datasets.
- Released all resources openly on Hugging Face for reproducibility.
What I learned
- Designing fair benchmarks requires keeping datasets constant across modalities.
- Training resource constraints strongly shape model choice (AST vs Wav2Vec2).
- TTS-based data generation can bootstrap research, but realism matters for generalization.
- Comparative analysis sharpened my understanding of representation learning across modalities.
Future Work
- Collect or integrate real human-voice hate speech datasets.
- Explore multi-modal models combining audio, text, and video.
- Investigate self-supervised speech encoders fine-tuned specifically for toxicity.
- Improve robustness with background noise, varied voices, and real-world conditions.
References
- Kennedy et al. (2020): Measuring Hate Speech dataset.
- Gong et al. (2021): Audio Spectrogram Transformer (AST).
- Baevski et al. (2020): wav2vec 2.0.
- Sanh et al. (2019): DistilBERT.
- Coqui-TTS documentation.