The content moderator’s dilemma: How removing toxic speech distorts online discourse


The regulation of online speech has become one of the most contentious policy debates of the past decade. Germany’s Netzwerkdurchsetzungsgesetz, the UK’s Online Safety Bill, and the EU’s Digital Services Act all mandate that platforms take responsibility for moderating content. As of 2020, at least 25 countries had passed laws requiring the removal of toxic material from social media. These efforts respond to a growing body of evidence that hateful online content can reduce user well-being (Chandrasekharan et al. 2017), increase polarisation (Levy 2021), and even fuel real-world violence (Müller and Schwarz 2021) in addition to myriad other political effects (see Zhuravskaya et al. 2020 for a review).

Yet content moderation has also attracted growing criticism. Politicians on both sides of the political spectrum have accused platforms of biased enforcement (Vogels et al. 2020), and civil liberties organisations have warned that expanding moderation risks suppressing legitimate political expression (Eidelman and Ruane 2021). Research has further shown that automated detection tools are susceptible to false positives, often triggered by swear words or by people sharing their personal encounters with racism (e.g., Lee et al. 2024).

Platforms thus face a dilemma. If certain political topics (racial justice, police conduct, immigration, etc.) are more frequently discussed in inflammatory or outright hateful language, then removing such content from social media platforms will distort the composition of online content. Importantly, this dilemma would exist even if we had access to a universally agreed-upon measure of what should be considered hateful content.

Measuring the cost of content moderation

While extensive research has developed methods to detect toxic speech (e.g. Waseem and Hovy 2016, Hanu and Unitary team 2020) and to evaluate the effectiveness of moderation (e.g. Jiménez Durán et al. 2025), holistic measures of the content moderation-induced distortions of online content have been lacking. Research on this topic has predominantly focused on survey-based measures of popular agreement or disagreement with specific moderation decisions (Kozyreva et al. 2023, Solomon et al. 2024). Popular agreement alone, however, cannot quantify the informational cost of content moderation – history is rife with examples of broad public consensus being used to silence minority viewpoints.

In a new paper (Habibi et al. 2025), we propose and validate a methodology that fills this gap. The core idea is straightforward. Modern natural language processing techniques allow texts to be represented as points in a high-dimensional semantic space, in which texts with similar meanings are positioned close together and texts about different topics are far apart. We use the term semantic to refer to the underlying meaning of a text, abstracting from its exact wording or stylistic features. Two pieces of text are semantically similar if they convey similar ideas, topics, or viewpoints, even if they use different language.

By comparing the semantic space before and after removing toxic content, we can measure the extent to which content moderation changes the overall landscape of what is being discussed. The specific measure is based on the Bhattacharyya distance (BCD), a standard metric to measure the overlap of probability distributions. Crucially, this measure is content-agnostic and does not require any prior judgement about which topics are more valuable than others.
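To make the measure concrete: when each corpus's cloud of embeddings is summarised by a multivariate Gaussian, the Bhattacharyya distance has a closed form. The sketch below is an illustration of that idea, not the paper's implementation; the Gaussian summary, the random "embeddings", and all numbers are assumptions made for the example.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two multivariate Gaussians."""
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    # Mahalanobis-style term: how far apart the two means are
    term1 = diff @ np.linalg.solve(cov, diff) / 8.0
    # Log-determinant term: how different the two covariance shapes are
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term1 + term2

rng = np.random.default_rng(0)
# Stand-ins for text embeddings before and after moderation
before = rng.normal(size=(1000, 5))
after = rng.normal(loc=0.3, size=(800, 5))  # shifted: some topics thinned out

d = bhattacharyya_gaussian(before.mean(0), np.cov(before.T),
                           after.mean(0), np.cov(after.T))
# identical distributions yield a distance of ~0; larger shifts, larger values
```

Because the distance depends only on the two distributions, it requires no judgement about which topics matter, which is what makes the measure content-agnostic.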

Removing toxic tweets shifts the information landscape

We apply this methodology to a representative sample of 5 million US political tweets (Siegel et al. 2021), scored for toxicity using Google’s Perspective API. We simulate increasingly stringent moderation by removing tweets above various toxicity thresholds and measuring the resulting distortion.

The results are clear: removing toxic tweets leads to significant and increasing shifts in the semantic space. As shown in Figure 1, when we lower the toxicity threshold, the distortions grow steadily (orange line). Importantly, removing the same number of tweets at random produces no such effect (blue line), confirming that the distortions are not a mechanical consequence of smaller sample sizes but are driven by the changing composition of discourse.
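The logic of that comparison can be sketched with synthetic data. The toy simulation below uses a simple centroid shift in place of the paper's BCD measure, and a made-up corpus in which one "topic" cluster carries higher toxicity; every number is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy corpus: 2-d embeddings with two topic clusters; the "charged" topic
# (topic 1) systematically draws higher toxicity scores
n = 5000
topic = rng.integers(0, 2, size=n)                    # 0 = mild, 1 = charged
emb = rng.normal(loc=topic[:, None] * 2.0, size=(n, 2))
toxicity = np.clip(rng.beta(2, 8, size=n) + 0.4 * topic, 0, 1)

def mean_shift(keep):
    """Distance between the centroid of kept tweets and the full corpus."""
    return np.linalg.norm(emb[keep].mean(0) - emb.mean(0))

results = {}
for thr in (0.9, 0.7, 0.5):
    kept = toxicity <= thr                            # toxicity-based removal
    random_kept = rng.permutation(n) < kept.sum()     # same count, at random
    results[thr] = (mean_shift(kept), mean_shift(random_kept))
```

Because toxicity correlates with topic, lowering the threshold shifts the centroid of the surviving corpus, while random removal of the same number of tweets barely moves it, mirroring the orange and blue lines in Figure 1.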

Figure 1 Content moderation and distortions of content

We benchmark the resulting distortions in two ways. First, we compare them to the maximum possible distortion achievable when removing the same number of tweets. We find that toxicity-based content moderation reaches roughly 20% of this upper bound. Second, we use a topic model to identify 67 topics and ask: how many topics would we need to delete to produce a comparable distortion? As shown in Figure 2, removing tweets above a toxicity score of 0.8, a commonly used threshold for content moderation, distorts the semantic space as much as eliminating four out of 67 topics.

Figure 2 Distortions due to removal of topics

It’s not just the toxic language – it’s the topics

A natural question is whether these distortions are driven by the removal of the toxic language itself. In that case, the distortions would be a feature, not a bug. If, on the other hand, entire issues or opinions disappear from social media platforms, content moderation is distortive.

To investigate this question, we use GPT-4o-mini to rephrase toxic tweets in a less inflammatory manner while preserving their core message. In our analysis, rephrasing reduces average toxicity from 0.71 to 0.26 while maintaining a cosine similarity of 0.97 with the original text. We then compare two moderation strategies: outright removal versus replacement with the rephrased version. If the distortions were driven purely by toxic language, both strategies should produce the same shift in the semantic space. As shown in Figure 3, we find that rephrasing leads to substantially smaller distortions than removal, and the gap widens at lower toxicity thresholds, precisely where there is more substantive content to salvage alongside the inflammatory language.
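The similarity check behind this comparison is ordinary cosine similarity in embedding space. A minimal sketch follows; the three-dimensional vectors are made-up stand-ins for real model embeddings, not output from the paper's pipeline.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings: a toxic tweet and its detoxified rephrasing
original = np.array([0.80, 0.10, 0.55])
rephrased = np.array([0.75, 0.05, 0.60])

sim = cosine_similarity(original, rephrased)
# A similarity near 1 means the core message survives the rephrasing,
# whereas outright removal deletes the point from the semantic space entirely.
```

A high similarity between original and rephrased text, as the 0.97 reported above, is what licenses treating rephrasing as moderation that preserves substance while stripping tone.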

Figure 3 Rephrasing of online content can reduce distortions

We also document that the effectiveness of rephrasing varies across topics. Where toxicity is central to the vocabulary of a topic, as with pure insults, rephrasing offers limited gains, because there is little substantive content to preserve. But in discussions where toxicity accompanies genuine political expression, rephrasing substantially reduces distortions relative to removal. This heterogeneity underscores that no single moderation rule is optimal across all forms of discourse.

Importantly, we do not argue that all content should remain on platforms; some content is clearly beyond salvaging. Instead, rephrasing can be seen as a complement to removal, offering a way to reduce distortions to online content. Recent work has also shown that AI-assisted rephrasing can improve partisan political conversations (Argyle et al. 2023), suggesting broader promise.

Implications for platform governance and policy

From the perspective of information economics, moderation rules are platform design choices that shape the information available to users. The economic theory of multi-task models (Holmstrom and Milgrom 1991) predicts that when principals must balance multiple objectives but can measure only one, effort gravitates towards the measurable task. Platforms and regulators can readily measure how much toxic content they remove, but without a complementary measure of the costs this removal imposes on discourse diversity, there is an inherent tendency to over-invest in removal at the expense of plurality.

Our measure provides this missing piece. It can be applied across moderation strategies, automated classifiers, crowdsourced reporting, or professional moderators. By providing a quantitative tool to measure the trade-off between reducing toxicity and preserving the diversity of discourse, we hope to move this debate towards a more empirically grounded footing – one in which the costs and benefits of different moderation approaches can be weighed transparently. The methodology also extends beyond social media to distortions from legal speech restrictions, defamation law, or regulatory interventions. The stakes are likely even higher in autocratic settings, where censorship selectively reshapes the information environment (e.g. Qin et al. 2017).

References

Argyle, L P, C A Bail, E C Busby, J R Gubler, T Howe, C Rytting, T Sorensen and D Wingate (2023), “Leveraging AI for democratic discourse: Chat interventions can improve online political conversations at scale,” Proceedings of the National Academy of Sciences 120(41), e2311627120.

Chandrasekharan, E, U Pavalanathan, A Srinivasan, A Glynn, J Eisenstein and E Gilbert (2017), “You can’t stay here: The efficacy of Reddit’s 2015 ban examined through hate speech,” Proceedings of the ACM on Human-Computer Interaction 1(CSCW).

Eidelman, V and K Ruane (2021), “The problem with censoring political speech online – including Trump’s,” ACLU.

Habibi, M, D Hovy and C Schwarz (2025), “The content moderator’s dilemma: Removal of toxic content and distortions to online discourse,” Working Paper, Bocconi University.

Hanu, L and Unitary team (2020), “Detoxify,” GitHub.

Holmstrom, B and P Milgrom (1991), “Multitask principal–agent analyses: Incentive contracts, asset ownership, and job design,” The Journal of Law, Economics, and Organization 7: 24–52.

Kozyreva, A, S M Herzog, S Lewandowsky, R Hertwig, P Lorenz-Spreen, M Leiser and J Reifler (2023), “Resolving content moderation dilemmas between free speech and harmful misinformation,” Proceedings of the National Academy of Sciences 120(7), e2210666120.

Lee, C, K Gligorić, P R Kalluri et al. (2024), “People who share encounters with racism are silenced online by humans and machines,” Proceedings of the National Academy of Sciences 121(38), e2322764121.

Levy, R (2021), “Social media, news consumption, and polarization: Evidence from a field experiment,” American Economic Review 111(3): 831–70.

Müller, K and C Schwarz (2021), “Fanning the flames of hate: Social media and hate crime,” Journal of the European Economic Association 19(4): 2131–2167.

Jiménez Durán, R, K Müller and C Schwarz (2025), “The online and offline effects of content moderation: Evidence from Germany’s NetzDG,” available at SSRN 4230296.

Qin, B, D Strömberg and Y Wu (2017), “Why does China allow freer social media? Protests versus surveillance and propaganda,” Journal of Economic Perspectives 31(1): 117–140.

Siegel, A A, E Nikitin, P Barberá, J Sterling, B Pullen, R Bonneau, J Nagler, J A Tucker et al. (2021), “Trumping hate on Twitter? Online hate in the 2016 US election campaign and its aftermath,” Quarterly Journal of Political Science 16(1): 71–104.

Solomon, B C, M E Hall, A Hemmen and J N Druckman (2024), “Illusory interparty disagreement: Partisans agree on what hate speech to censor but do not know it,” Proceedings of the National Academy of Sciences 121(39), e2402428121.

Vogels, E A, A Perrin and M Anderson (2020), “Most Americans think social media sites censor political viewpoints,” Pew Research Center.

Waseem, Z and D Hovy (2016), “Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter,” in Proceedings of the NAACL Student Research Workshop, pp. 88–93.

Zhuravskaya, E, M Petrova and R Enikolopov (2020), “Political effects of the internet and social media,” Annual Review of Economics 12.


