Learning to Defer in Content Moderation: The Human-AI Interplay

Abstract

Effective content moderation is vital for a healthy online social platform: harmful posts must be removed promptly without jeopardizing non-harmful content. Because of the high volume of online posts, human-only moderation is operationally challenging, and platforms often employ a human-machine collaboration approach. A typical machine-learning heuristic estimates the expected harmfulness of incoming posts and uses fixed thresholds to decide whether to remove the post (classification decision) and whether to send it for human review (admission decision). This can be inefficient because it disregards the uncertainty in the machine-learning estimate, the time-varying nature of human review capacity and post arrivals, and the selective sampling in the dataset (humans only review posts filtered by the admission algorithm).
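As a rough illustration of the fixed-threshold heuristic described above, the following minimal sketch maps a harmfulness score to the two decisions. The function name, the threshold values, and the choice to review only scores inside a middle band are assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    remove: bool  # classification decision
    review: bool  # admission decision (send to human review)


def fixed_threshold_policy(score, remove_threshold=0.5,
                           review_low=0.3, review_high=0.7):
    """Hypothetical fixed-threshold heuristic.

    `score` is the estimated expected harmfulness in [0, 1].
    All threshold values are illustrative assumptions.
    """
    # Remove the post when the estimate crosses a fixed cutoff.
    remove = score >= remove_threshold
    # Assumed rule: ask humans only about borderline scores.
    review = review_low <= score <= review_high
    return Decision(remove=remove, review=review)


if __name__ == "__main__":
    for s in (0.1, 0.5, 0.9):
        print(s, fixed_threshold_policy(s))
```

Note that the thresholds never change with the confidence of the estimate, the current review backlog, or the arrival rate, which is the inefficiency the paragraph above points to.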

In this paper, we introduce a model to capture the human-machine interplay in content moderation. The algorithm observes contextual information for incoming posts, makes classification and admission decisions, and schedules posts for human review. Non-admitted posts do not receive reviews (selective sampling), whereas admitted posts receive human reviews of their harmfulness. These reviews help train the machine-learning algorithm but are delayed due to congestion in the human review system. The classical learning-theoretic way to capture this human-machine interplay is via the framework of "learning to defer", where the algorithm has the option to defer a classification task to humans for a fixed cost and receives feedback immediately. Our model contributes to this literature by introducing congestion in the human review system. Moreover, unlike work on "online learning with delayed feedback", where the delay in the feedback is exogenous to the algorithm's decisions, the delay in our model is endogenous to both the admission and the scheduling decisions. We propose a near-optimal learning algorithm that carefully balances the classification loss from a selectively sampled dataset, the idiosyncratic loss of non-reviewed posts, and the delay loss caused by congestion in the human review system.
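To make the notion of endogenous delay concrete, here is a toy simulation, not the algorithm proposed in the paper. It assumes one post arrives per time step, each post is admitted independently with probability `admit_prob`, reviewers finish at most one review per step with probability `service_prob`, and reviews are scheduled first-in-first-out. All of these names and modeling choices are illustrative assumptions.

```python
import random
from collections import deque


def simulate_review_queue(admit_prob, service_prob=0.5, steps=5000, seed=0):
    """Return the average review delay (in steps) of completed reviews.

    Toy model with assumed dynamics: random admission, FIFO scheduling,
    and at most one completed review per time step.
    """
    rng = random.Random(seed)
    queue = deque()  # arrival times of admitted posts awaiting review
    delays = []      # waiting time of each completed review
    for t in range(steps):
        # One post arrives; the admission decision puts it in the queue.
        if rng.random() < admit_prob:
            queue.append(t)
        # Human reviewers complete the oldest pending review, if any.
        if queue and rng.random() < service_prob:
            delays.append(t - queue.popleft())
    return sum(delays) / len(delays) if delays else 0.0


if __name__ == "__main__":
    for p in (0.3, 0.9):
        print(f"admit_prob={p}: average delay = {simulate_review_queue(p):.1f}")
```

In this sketch, admitting more posts than reviewers can handle makes the backlog, and hence the feedback delay, grow over time. The delay is therefore a consequence of the admission rule rather than an external quantity, which is the distinction from exogenous delayed feedback drawn above.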

This talk is based on joint work with Wentao Weng.