Article

Big Data, Big Noise

Social Science Computer Review, 35 (4): 427-443 (2017)
DOI: 10.1177/0894439316643050

Abstract

In this article, we focus on noise in the sense of irrelevant information in a data set as a specific methodological challenge of web research in the era of big data. We empirically evaluate several methods for filtering hyperlink networks in order to reconstruct networks that contain only webpages that deal with a particular issue. The test corpus of webpages was collected from hyperlink networks on the issue of food safety in the United States and Germany. We applied three filtering strategies and evaluated their performance in excluding irrelevant content from the networks: keyword filtering, automated document classification with a machine-learning algorithm, and extraction of core networks with network-analytical measures. Keyword filtering and automated classification of webpages were the most effective methods for reducing noise, whereas extracting a core network did not yield satisfactory results for this case.
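Two of the filtering strategies named in the abstract can be illustrated in a few lines. The sketch below is a hypothetical, simplified rendering, not the authors' implementation: the keyword list, helper names, and the toy hyperlink network are all illustrative assumptions. Keyword filtering keeps only pages whose text mentions the issue; core-network extraction iteratively prunes weakly connected pages (a basic k-core).

```python
# Illustrative sketch of two filtering strategies for hyperlink networks.
# Keywords, function names, and the toy network are assumptions for the
# example, not taken from the paper.

KEYWORDS = {"food safety", "foodborne", "contamination"}  # illustrative terms

def keyword_filter(edges, page_texts):
    """Keep only hyperlinks whose endpoints both mention an issue keyword."""
    relevant = {page for page, text in page_texts.items()
                if any(kw in text.lower() for kw in KEYWORDS)}
    return [(u, v) for u, v in edges if u in relevant and v in relevant]

def k_core(edges, k=2):
    """Return the pages of the k-core: iteratively drop pages with < k links."""
    edges = set(edges)
    while True:
        degree = {}
        for u, v in edges:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        weak = {n for n, d in degree.items() if d < k}
        if not weak:
            return sorted(degree)
        edges = {(u, v) for u, v in edges
                 if u not in weak and v not in weak}

# Toy hyperlink network: page "c" is popular but off-topic.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
texts = {"a": "Food safety rules", "b": "Foodborne illness report",
         "c": "Celebrity gossip", "d": "Contamination recall notice"}

print(keyword_filter(edges, texts))  # [('a', 'b')]
print(k_core(edges, k=2))            # ['a', 'b', 'c']
```

The toy output hints at the paper's finding: keyword filtering drops the off-topic page "c", while the purely structural k-core retains it because it is well connected, which is why core extraction alone can fail to remove topically irrelevant content.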
