Learning from automatically labeled data: case study on click fraud prediction

Daniel Berrar

Research output: Article

6 Citations (Scopus)

Abstract

In the era of big data, both class labels and covariates may result from proprietary algorithms or ground models. The predictions of these ground models, however, are not the same as the unknown ground truth. Thus, the automatically generated class labels are inherently uncertain, making subsequent supervised learning from such data a challenging task. Fine-tuning a new classifier could mean that, at the extreme, this new classifier will try to replicate the decision heuristics of the ground model. However, few new insights can be expected from a model that tries to merely emulate another one. Here, we study this problem in the context of click fraud prediction from highly skewed data that were automatically labeled by a proprietary detection algorithm. We propose a new approach to generate click profiles for publishers of online advertisements. In a blinded test, our ensemble of random forests achieved an average precision of only 36.2 %, meaning that our predictions do not agree very well with those of the ground model. We tried to elucidate this discrepancy and made several interesting observations. Our results suggest that supervised learning from automatically labeled data should be complemented by an interpretation of conflicting predictions between the new classifier and the ground model. If the ground truth is not known, then elucidating such disagreements might be more relevant than improving the performance of the new classifier.
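The abstract reports an average precision and recommends inspecting cases where the new classifier and the ground model disagree. The following is a minimal sketch of both ideas, not the author's method: the function names, the toy publisher IDs, the scores, and the 0.5 threshold are all hypothetical, and the average-precision definition used here (mean of precision at each rank where a positive label occurs) is one common variant that may differ from the one used in the paper.

```python
def average_precision(labels, scores):
    """Mean of precision@k over the ranks k at which a positive label occurs."""
    # Rank items by descending score (assumed: higher score = more likely fraud).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def disagreements(ids, labels, scores, threshold=0.5):
    """IDs whose thresholded prediction differs from the ground-model label."""
    return [
        item_id
        for item_id, label, score in zip(ids, labels, scores)
        if (score >= threshold) != (label == 1)
    ]


# Hypothetical toy data: labels come from a ground model, not from ground truth.
ids = ["pub_a", "pub_b", "pub_c", "pub_d", "pub_e"]
labels = [0, 1, 0, 0, 1]            # 1 = flagged as fraudulent by the ground model
scores = [0.9, 0.8, 0.3, 0.2, 0.1]  # new classifier's fraud scores

ap = average_precision(labels, scores)
conflicts = disagreements(ids, labels, scores)
```

On this toy data the average precision is 0.45, and the conflicting cases are `pub_a` (scored high but not flagged by the ground model) and `pub_e` (flagged but scored low). Cases like these are the ones the abstract suggests examining when the ground truth is unknown.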

Original language: English
Pages (from-to): 477-490
Number of pages: 14
Journal: Knowledge and Information Systems
Volume: 46
Issue number: 2
Publication status: Published - 1 Feb 2016
Externally published: Yes

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Information Systems
  • Hardware and Architecture
  • Human-Computer Interaction
