Learning from automatically labeled data: case study on click fraud prediction

Daniel Berrar

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

In the era of big data, both class labels and covariates may result from proprietary algorithms or ground models. The predictions of these ground models, however, are not the same as the unknown ground truth. Thus, the automatically generated class labels are inherently uncertain, making subsequent supervised learning from such data a challenging task. Fine-tuning a new classifier could mean that, at the extreme, this new classifier will try to replicate the decision heuristics of the ground model. However, few new insights can be expected from a model that tries to merely emulate another one. Here, we study this problem in the context of click fraud prediction from highly skewed data that were automatically labeled by a proprietary detection algorithm. We propose a new approach to generate click profiles for publishers of online advertisements. In a blinded test, our ensemble of random forests achieved an average precision of only 36.2 %, meaning that our predictions do not agree very well with those of the ground model. We tried to elucidate this discrepancy and made several interesting observations. Our results suggest that supervised learning from automatically labeled data should be complemented by an interpretation of conflicting predictions between the new classifier and the ground model. If the ground truth is not known, then elucidating such disagreements might be more relevant than improving the performance of the new classifier.
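The abstract's headline result is an average precision (AP) of 36.2 % in the blinded test. As a reference for how that metric works, here is a minimal from-scratch sketch of AP over a ranked list of predictions; the labels and scores below are illustrative toy values, not the paper's data.

```python
def average_precision(labels, scores):
    """AP = mean of precision@k over the ranks k at which a positive
    occurs, with items ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / k)  # precision at this rank
    # guard against an empty positive class
    return sum(precisions) / max(1, sum(labels))

# toy example: 2 positives among 5 ranked predictions
ap = average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.4, 0.2])
# positives land at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 ≈ 0.833
```

A low AP against automatically generated labels, as the paper argues, signals disagreement with the ground model rather than necessarily poor detection of the (unknown) ground truth.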

Original language: English
Pages (from-to): 477-490
Number of pages: 14
Journal: Knowledge and Information Systems
Volume: 46
Issue number: 2
DOIs: 10.1007/s10115-015-0827-6
Publication status: Published - 2016 Feb 1
Externally published: Yes


Keywords

  • Big data
  • Classification
  • Click fraud prediction
  • Ensemble learning
  • Random forest

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Information Systems
  • Hardware and Architecture
  • Human-Computer Interaction

Cite this

Learning from automatically labeled data: case study on click fraud prediction. / Berrar, Daniel.

In: Knowledge and Information Systems, Vol. 46, No. 2, 01.02.2016, p. 477-490.

Research output: Contribution to journal › Article

@article{d22e146196484f82a2ace931bee29748,
title = "Learning from automatically labeled data: case study on click fraud prediction",
abstract = "In the era of big data, both class labels and covariates may result from proprietary algorithms or ground models. The predictions of these ground models, however, are not the same as the unknown ground truth. Thus, the automatically generated class labels are inherently uncertain, making subsequent supervised learning from such data a challenging task. Fine-tuning a new classifier could mean that, at the extreme, this new classifier will try to replicate the decision heuristics of the ground model. However, few new insights can be expected from a model that tries to merely emulate another one. Here, we study this problem in the context of click fraud prediction from highly skewed data that were automatically labeled by a proprietary detection algorithm. We propose a new approach to generate click profiles for publishers of online advertisements. In a blinded test, our ensemble of random forests achieved an average precision of only 36.2 {\%}, meaning that our predictions do not agree very well with those of the ground model. We tried to elucidate this discrepancy and made several interesting observations. Our results suggest that supervised learning from automatically labeled data should be complemented by an interpretation of conflicting predictions between the new classifier and the ground model. If the ground truth is not known, then elucidating such disagreements might be more relevant than improving the performance of the new classifier.",
keywords = "Big data, Classification, Click fraud prediction, Ensemble learning, Random forest",
author = "Daniel Berrar",
year = "2016",
month = "2",
day = "1",
doi = "10.1007/s10115-015-0827-6",
language = "English",
volume = "46",
pages = "477--490",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
publisher = "Springer London",
number = "2",
}

TY  - JOUR
T1  - Learning from automatically labeled data
T2  - case study on click fraud prediction
AU  - Berrar, Daniel
PY  - 2016/2/1
Y1  - 2016/2/1
N2  - In the era of big data, both class labels and covariates may result from proprietary algorithms or ground models. The predictions of these ground models, however, are not the same as the unknown ground truth. Thus, the automatically generated class labels are inherently uncertain, making subsequent supervised learning from such data a challenging task. Fine-tuning a new classifier could mean that, at the extreme, this new classifier will try to replicate the decision heuristics of the ground model. However, few new insights can be expected from a model that tries to merely emulate another one. Here, we study this problem in the context of click fraud prediction from highly skewed data that were automatically labeled by a proprietary detection algorithm. We propose a new approach to generate click profiles for publishers of online advertisements. In a blinded test, our ensemble of random forests achieved an average precision of only 36.2 %, meaning that our predictions do not agree very well with those of the ground model. We tried to elucidate this discrepancy and made several interesting observations. Our results suggest that supervised learning from automatically labeled data should be complemented by an interpretation of conflicting predictions between the new classifier and the ground model. If the ground truth is not known, then elucidating such disagreements might be more relevant than improving the performance of the new classifier.
AB  - In the era of big data, both class labels and covariates may result from proprietary algorithms or ground models. The predictions of these ground models, however, are not the same as the unknown ground truth. Thus, the automatically generated class labels are inherently uncertain, making subsequent supervised learning from such data a challenging task. Fine-tuning a new classifier could mean that, at the extreme, this new classifier will try to replicate the decision heuristics of the ground model. However, few new insights can be expected from a model that tries to merely emulate another one. Here, we study this problem in the context of click fraud prediction from highly skewed data that were automatically labeled by a proprietary detection algorithm. We propose a new approach to generate click profiles for publishers of online advertisements. In a blinded test, our ensemble of random forests achieved an average precision of only 36.2 %, meaning that our predictions do not agree very well with those of the ground model. We tried to elucidate this discrepancy and made several interesting observations. Our results suggest that supervised learning from automatically labeled data should be complemented by an interpretation of conflicting predictions between the new classifier and the ground model. If the ground truth is not known, then elucidating such disagreements might be more relevant than improving the performance of the new classifier.
KW  - Big data
KW  - Classification
KW  - Click fraud prediction
KW  - Ensemble learning
KW  - Random forest
UR  - http://www.scopus.com/inward/record.url?scp=84958185231&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84958185231&partnerID=8YFLogxK
U2  - 10.1007/s10115-015-0827-6
DO  - 10.1007/s10115-015-0827-6
M3  - Article
AN  - SCOPUS:84958185231
VL  - 46
SP  - 477
EP  - 490
JO  - Knowledge and Information Systems
JF  - Knowledge and Information Systems
SN  - 0219-1377
IS  - 2
ER  -