Luca Engel

Hate Speech Classification in Audio

May 24, 2024

At a glance

Problem

Detecting hate speech directly from audio is far less explored than text-based classification. Key challenges:

  1. Lack of labeled audio hate speech datasets.
  2. Choosing between raw-waveform and spectrogram representations as model input.
  3. Establishing a fair comparison with text-based baselines.

The project asked: Can audio-only models reliably detect hate speech, and how do they compare to established text classifiers?

Solution overview

We synthesized a speech dataset from social media comments using Coqui-TTS and trained three models:

  1. Wav2Vec2 — raw waveform classification.
  2. AST (Audio Spectrogram Transformer) — spectrogram-as-image classification.
  3. DistilBERT — text baseline for comparison.

All models were fine-tuned with Hugging Face’s Trainer API and evaluated on accuracy, precision, recall, and F1.
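The evaluation hook can be sketched as follows: Hugging Face's Trainer accepts a `compute_metrics` callback that receives `(logits, labels)` and returns a dict of named metrics. Below is a self-contained stand-in in plain Python for binary hate/non-hate labels; the project may instead have used sklearn or the `evaluate` library, so treat this as an illustrative sketch, not the actual training code.

```python
def compute_metrics(eval_pred):
    """Metrics callback in the shape Trainer expects: takes (logits, labels),
    returns a dict. Class 1 is treated as the positive (hate) class."""
    logits, labels = eval_pred
    # argmax over each row of logits -> predicted class id
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Passed as `Trainer(..., compute_metrics=compute_metrics)`, this makes all four numbers appear in every evaluation log, which is how the benchmark table below is populated.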

Architecture

Data

Method

Wav2Vec2 (waveform-based)

AST (spectrogram-based)
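AST consumes audio as a 2-D spectrogram "image" rather than a raw waveform. To illustrate what that input looks like, here is a minimal magnitude-spectrogram sketch in plain Python (a naive windowed DFT). The real model uses log-mel features produced by its feature extractor, so this is a conceptual stand-in, not the project's preprocessing code.

```python
import math

def spectrogram(wave, frame_len=256, hop=128):
    """Magnitude spectrogram via a naive DFT over Hann-windowed frames.
    Returns a (time, frequency) grid -- the 2-D "image" a spectrogram
    model classifies."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(wave) - frame_len + 1, hop):
        frame = [wave[start + n] * window[n] for n in range(frame_len)]
        mags = []
        # keep only the non-negative frequency bins (0 .. frame_len/2)
        for k in range(frame_len // 2 + 1):
            re = sum(frame[n] * math.cos(-2 * math.pi * k * n / frame_len)
                     for n in range(frame_len))
            im = sum(frame[n] * math.sin(-2 * math.pi * k * n / frame_len)
                     for n in range(frame_len))
            mags.append(math.hypot(re, im))
        frames.append(mags)
    return frames
```

A pure tone at 8 cycles per frame produces a spectrogram whose energy peaks in frequency bin 8 of every frame, which is exactly the kind of time-frequency structure the transformer's image patches attend over.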

DistilBERT (text baseline)

Experiments & Results

Benchmarks

| Model      | Accuracy | Recall | Precision | F1   |
|------------|----------|--------|-----------|------|
| Wav2Vec2   | 0.62     | 0.78   | 0.59      | 0.68 |
| AST        | 0.63     | 0.82   | 0.60      | 0.69 |
| DistilBERT | 0.75     | 0.79   | 0.74      | 0.77 |

Evaluation protocol.

Analysis


Impact

What I learned

Future Work

References