Selection Bias in News Coverage: Learning it, Fighting it
by Dylan Bourgeois 1, Jérémie Rappaz 2, Karl Aberer 3
Accepted for oral presentation at the Alternate Track on 
Journalism, Misinformation, and Fact-checking at The Web Conference 2018.
DOI: 10.1145/3184558.3188724

This post accompanies our work on selection bias in news coverage. The full paper can be found here [.pdf, 1.8 MB], and the code is available on GitHub. To cite it, use the following .bib. Below, we present the motivation for the project and an interactive data visualisation of the learned relationships between news sources.


News entities have to select and filter the coverage they broadcast through their respective channels, since the set of world events is too large to be treated exhaustively. The subjective nature of this filtering induces biases due to, among other things, resource constraints, editorial guidelines, ideological affinities, or even the fragmented nature of the information at a journalist's disposal. The magnitude and direction of this bias are, however, largely unknown. The absence of ground truth, the sheer size of the event space, and the lack of an exhaustive set of absolute features to measure make it difficult to observe the bias directly, to characterize the nature of a source's leaning, and to factor it out to ensure neutral coverage of the news.

In this work, we introduce a methodology to capture the latent structure of the media's decision process at a large scale. Our contribution is threefold. First, we show that media coverage is predictable using personalization techniques, and we evaluate our approach on a large set of events collected from the GDELT database. Second, we show that a personalized and parametrized approach not only achieves higher accuracy in coverage prediction, but also provides an interpretable representation of the selection bias. Last, we propose a method that selects a set of sources by leveraging the latent representation. The selected sources provide a more diverse and egalitarian coverage, while retaining the most actively covered events.
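To make the idea of predicting coverage with personalization techniques more concrete, here is a minimal sketch of one such technique: a logistic latent factor model fitted to a binary source-by-event matrix. This is an illustration under our own assumptions, not the model from the paper. The toy matrix, the function names (`fit_coverage_model`, `predict`) and all hyperparameters are hypothetical, and the example uses only the Python standard library.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_coverage_model(observed, n_sources, n_events, n_factors=1,
                       lr=0.1, n_epochs=500, reg=0.01, seed=0):
    """Fit one latent vector per source and per event with SGD on the log loss.

    `observed` maps (source, event) index pairs to 1 (covered) or 0 (not covered).
    """
    rng = random.Random(seed)
    S = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_sources)]
    E = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_events)]
    for _ in range(n_epochs):
        for (i, j), y in observed.items():
            s, e = S[i], E[j]
            # Gradient of the log loss with respect to the logit s . e
            err = sigmoid(sum(a * b for a, b in zip(s, e))) - y
            for k in range(n_factors):
                s_k, e_k = s[k], e[k]
                s[k] -= lr * (err * e_k + reg * s_k)
                e[k] -= lr * (err * s_k + reg * e_k)
    return S, E

def predict(S, E, i, j):
    """Predicted probability that source i covers event j."""
    return sigmoid(sum(a * b for a, b in zip(S[i], E[j])))

# Hypothetical toy data: two groups of sources covering disjoint groups of events.
coverage = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
held_out = (0, 2)  # hide one (source, event) pair from training
observed = {(i, j): coverage[i][j]
            for i in range(4) for j in range(6) if (i, j) != held_out}

# A single factor is enough to separate the two groups in this toy example.
S, E = fit_coverage_model(observed, n_sources=4, n_events=6)
print("P(source 0 covers event 2) =", round(predict(S, E, *held_out), 3))
```

The learned source vectors in `S` play the role of the latent representation discussed in this post: sources with similar coverage patterns end up with similar vectors, which is what makes the held-out pair predictable.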

Research Questions

  • RQ1: How to capture selection bias in news coverage using supervised learning methods?
  • RQ2: Is the learned representation interpretable?
  • RQ3: How to exploit the learned bias representation to select a set of news sources exhibiting a balanced coverage?

Figure: Source agglomerations in latent space

This figure presents a visualisation of the learned low-dimensional representation of the sources' selections in terms of coverage. We then apply an unsupervised clustering algorithm to group the sources by similarity in this space.
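As a rough illustration of the clustering step, the sketch below groups a handful of two-dimensional source embeddings with k-means. Note the assumptions: k-means is used here only as a stand-in, since this post does not name the clustering algorithm (see the paper for that), and the embedding coordinates and the helper `kmeans` are hypothetical.

```python
import math

def kmeans(points, k, n_iters=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical 2-D embeddings for six sources (two well-separated groups).
embeddings = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
              (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(embeddings, k=2)
print(labels)
```

In practice the embeddings would come from the trained coverage model, and the number of clusters would have to be chosen or estimated rather than fixed by hand.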


Note: This visualisation is interactive. You can navigate and zoom through the latent space. Hovering over a point reveals the name of the corresponding source and the proposed annotation of the cluster it belongs to. For details on how these clusters were formed, please refer to the paper.
1. Ecole Polytechnique Fédérale de Lausanne (EPFL)

2. Ecole Polytechnique Fédérale de Lausanne (EPFL)

3. Ecole Polytechnique Fédérale de Lausanne (EPFL)

4. Disclaimer: These annotations were manually generated, and are superficial: they do not act as qualifiers but rather as common observable characteristics.