Charting Reproducibility and Performance: LLMs in Multilingual Toxic Speech Detection

Abstract

Large Language Models (LLMs) are increasingly central to analysing and mitigating incivility and toxicity in online communication; however, their comparative strengths vary with language coverage, model openness, and other factors. Drawing on the Multilingual Text Detoxification (TextDetox) corpus, which spans seven languages (Arabic, Chinese, English, German, Hindi, Russian, and Spanish), this paper benchmarks 807 model-language pairs and pools goodness-of-prediction indicators in a meta-analysis. The evaluation includes OpenAI's GPT and o-series models, Anthropic's Claude models, xAI's Grok, Meta's Llama checkpoints, Alibaba's Qwen series, and Mistral models, among others. Three patterns emerge from our analysis. First, high-resource languages (English, German, and Spanish) enjoy, on average, a 7.7-point F1-score advantage over lower-resource counterparts (Arabic, Chinese, Hindi, and Russian). Second, proprietary models top the leaderboard in low-resource languages, but the openness penalty is minor and statistically inconclusive; in high-resource languages, open-source models tend to match closed ones. Third, reasoning models and chain-of-thought (CoT) prompting neither help nor harm this binary classification task, whereas compact models (7B parameters or fewer) trail a larger baseline by roughly 9.8 points. Ancillary temperature experiments show near-perfect intercoder reliability between deterministic and stochastic runs, indicating that modern proprietary APIs can deliver reproducible classifications for this task despite offering limited control over decoding. Taken together, the findings recommend a tiered strategy: open models for well-resourced languages, closed or hybrid solutions where data are sparse, and caution against assuming that more parameters or explicit reasoning automatically translate into better performance.
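
The evaluation logic described above amounts to scoring each model-language pair against gold toxicity labels and then measuring agreement between a deterministic run and a stochastic rerun. The Python sketch below shows one way such quantities could be computed with scikit-learn; the labels, variable names, and the choice of Cohen's kappa as the agreement metric are illustrative assumptions, not the paper's actual data or pipeline.

```python
# Illustrative sketch only: per-pair F1 against gold labels plus a
# reproducibility check between decoding settings. The data below are
# hypothetical placeholders, not results from the paper.
from sklearn.metrics import f1_score, cohen_kappa_score

# Hypothetical binary toxicity labels for one model-language pair
# (1 = toxic, 0 = non-toxic).
gold           = [1, 0, 1, 1, 0, 0, 1, 0]
pred_temp_zero = [1, 0, 1, 0, 0, 0, 1, 0]  # temperature = 0 (deterministic)
pred_temp_high = [1, 0, 1, 0, 0, 1, 1, 0]  # temperature > 0 (stochastic rerun)

# Goodness of prediction against the gold standard.
f1 = f1_score(gold, pred_temp_zero)

# Reliability between the two decoding settings: values near 1 indicate the
# API returns (almost) identical labels regardless of sampling temperature.
kappa = cohen_kappa_score(pred_temp_zero, pred_temp_high)

print(f"F1 (deterministic run): {f1:.3f}")
print(f"Agreement between runs (Cohen's kappa): {kappa:.3f}")
```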

Publication
Paper presented at the LLM Pre-Conference Workshop, European Political Science Association (EPSA), Universidad Carlos III de Madrid, Madrid, Spain, June 25
Bastián González-Bustamante
Post-doctoral Researcher

Post-doctoral Researcher in Computational Social Science and a lecturer in Governance and Development at the Institute of Public Administration, Faculty of Governance and Global Affairs, Leiden University, the Netherlands.