Discovery Challenge

Introduction

This year's competition deals with personalized spam filtering and generalization across related learning tasks. People spend an increasing amount of time reading messages and deciding whether they are spam or non-spam. Some users spend additional time labeling their received spam messages in order to train local spam filters running on their desktop machines. Email service providers want to relieve users of this burden by installing server-based spam filters. Training such filters cannot rely on labeled messages from the individual users, but must rely on publicly available sources, such as newsgroup messages or emails received through "spam traps" (spam traps are email addresses that are published in a way invisible to human readers but are collected by spammers' web crawlers).
The distribution of this combined source of training data differs from the distributions of the emails received by individual users. When learning spam filters for individual users from this type of data, one needs to cope with a discrepancy between the distributions governing training and test data, and one needs to balance generalization and adaptation. The generalization/adaptation can rely on the large amounts of unlabeled emails in the users' inboxes that are accessible to server-based spam filters. Using this unlabeled data, a spam filter can be adapted to the properties of a specific user's inbox; when only little unlabeled data is available for a user, generalization over multiple users is advisable.

We provide labeled training data collected from publicly available sources. The unlabeled inboxes of several users serve as test data. The inboxes differ in the distribution of emails. The goal is to construct a spam filter for each individual user that correctly classifies the emails in that user's inbox as spam or non-spam. A clever way of utilizing the available sets of unlabeled emails from different users is required.

There will be a Discovery Challenge workshop at ECML/PKDD 2006 in Berlin, where we will discuss the results, different approaches, and other issues related to the problem setting.

We are looking forward to an interesting competition/workshop and encourage your participation.

Workshop Schedule

The workshop will take place on September 22nd, 2006, at Humboldt-Universität zu Berlin, Germany, Unter den Linden 6, room 3094/96.

10:40-11:10 ECML/PKDD Discovery Challenge 2006 Overview [paper]
Steffen Bickel
11:10-11:45 (invited) Semi-Supervised Support Vector Machines and Application to Spam Filtering [slides]
Alexander Zien, Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
11:45-12:10 A Semi-Supervised Spam Mail Detector [paper]
Bernhard Pfahringer
12:10-13:30 Lunch Break
13:30-13:55 Harnessing Unlabeled Examples through Iterative Application of Dynamic Markov Modeling [paper]
Gordon Cormack
13:55-14:20 A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering [paper]
Khurum Junejo, Mirza Yousaf, Asim Karim
14:20-14:45 Text Classification Using Clustering [paper]
Antonia Kyriakopoulou, Theodore Kalamboukis
14:45-15:00 Discussion
15:00-15:30 Coffee Break
15:30-15:55 Using Tri-Training and Support Vector Machines for Addressing the ECML/PKDD 2006 Discovery Challenge [paper]
Dimitrios Mavroeidis, Konstantinos Chaidos, Stefanos Pirillos, Dimosthenis Christopoulos, Michalis Vazirgiannis
15:55-16:20 TPN²: Using Positive-Only Learning to Deal with the Heterogeneity of Labeled and Unlabeled Data [paper]
Nikolaos Trogkanis, Georgios Paliouras
16:20-16:35 Discussion, final remarks
(canceled) Identifying SPAM with Predictive Models [paper]
Dan Steinberg, Mikhaylo Golovnya

Workshop Proceedings

The print version of the proceedings will be available at the registration desk of the ECML/PKDD 2006 conference. The online proceedings can be accessed as one PDF file here. Single papers can be accessed from the links next to the entries in the workshop schedule.

Important Dates

March 1st, 2006 Tasks and datasets available online.
June 7th, 2006 Submissions of spam filtering results due (by midnight CET).
June 9th, 2006 Submissions of algorithm description due (by midnight CET).
June 14th 2006 Notification of winners/publication of results on webpage.
June 16th 2006 Notification of creativity award winners.
July 26th, 2006 Workshop paper submission deadline.
July 31st, 2006 Notification of acceptance.
August 14th, 2006 Workshop proceedings (camera-ready) deadline.
September 20th, 2006 Discovery Challenge Overview,
time: 14:00, location: Audimax.
September 22nd, 2006 ECML/PKDD 2006 Discovery Challenge Workshop,
time: 10:40-17:00, location: room 3094/96.

Winners

  • Spam Filtering Performance Award - Task A (the following three teams share the first rank):
    • Khurram Nazir Junejo, Mirza Muhammad Yousaf, Asim Karim
      Lahore University of Management Sciences, Pakistan
    • Bernhard Pfahringer
      University of Waikato, New Zealand
    • Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja
      Inductis India Pvt Ltd
  • Spam Filtering Performance Award - Task B:
    • Gordon Cormack
      University of Waterloo, Canada
  • Spam Filtering Creativity Award - Task A/B (we decided to award one team for both tasks instead of one for each task, because most teams used the same algorithm for both tasks):
    • Bernhard Pfahringer
      University of Waikato, New Zealand
Detailed results are given in the next section.

Detailed Results

57 teams from 19 different countries participated in the challenge; 26 teams submitted results for the evaluation. Not all participants submitted results for both tasks. We averaged the AUC values over all inboxes, as described in the section Evaluation Criterion below, and determined the ranking. We conducted significance tests at a significance level of 5% to test the null hypothesis that the second-ranked submission has a higher AUC value than the first. The test statistic is computed as described by Hanley and McNeil (A Method of Comparing the Areas under Receiver Operating Characteristic Curves Derived from the Same Cases, 1983). For Task A we could not reject the null hypothesis for the submissions with the second- and third-highest AUC values; this means there is no statistically significant difference among the top three submissions, and they all win the Spam Filtering Performance Award - Task A. For Task B we could reject the null hypothesis for the second-ranked submission; this means there is a single winner of the Spam Filtering Performance Award - Task B.
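For reference, the test statistic of Hanley and McNeil (1983) for comparing two AUC estimates obtained on the same cases has the following general form (a sketch only; the correlation coefficient $r$ between the two estimates is taken from the table in their paper):

$$ z = \frac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{SE_1^2 + SE_2^2 - 2\,r\,SE_1\,SE_2}} $$

where $\hat{\theta}_1$ and $\hat{\theta}_2$ are the two average AUC values and $SE_1$, $SE_2$ their standard errors; the difference is considered significant at the 5% level if $z$ exceeds the corresponding critical value of the standard normal distribution.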

Detailed results Task A:

Rank  Average AUC  Team
1 0.950667 Khurram Nazir Junejo, Mirza Muhammad Yousaf, Asim Karim
Lahore University of Management Sciences, Pakistan
1 0.949094 Bernhard Pfahringer
University of Waikato, New Zealand
1 0.948666 Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja
Inductis India Pvt Ltd
2 0.936457 Nikolaos Trogkanis, National Technical University of Athens, Greece
Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
3 0.927839 Chao Xu, Yiming Zhou
Beihang University, Beijing, China
4 0.927694 Lalit Wangikar, Mansi Khanna, Ankush Talwar
Inductis India Pvt Ltd
5 0.914380 Dimitrios Mavroeidis, Konstantinos Chaidos, Stefanos Pirillos, Dimosthenis Christopoulos, and Michalis Vazirgiannis
DB-NET Lab, Informatics Dept., Athens University EB, Greece
6 0.912040
7 0.900059
8 0.897554
9 0.892118
10 0.887429
11 0.886873
12 0.840133
13 0.838834
14 0.823030
15 0.806275
16 0.743754
17 0.733333
18 0.661819
19 0.637488
20 0.548266
21 0.523404

Detailed results Task B:

Rank  Average AUC  Team
1 0.946469 Gordon Cormack
University of Waterloo, Canada
2 0.918251 Nikolaos Trogkanis, National Technical University of Athens, Greece
Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
3 0.907398 Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja
Inductis India Pvt Ltd
4 0.899241 D’yakonov Alexander
Moscow State University, Russia
5 0.893333 Wenyuan Dai
Apex Data & Knowledge Management Lab, Shanghai Jiao Tong University
6 0.870906
7 0.864615
8 0.850165
9 0.815666
10 0.800084
11 0.744548
12 0.685038
13 0.597331
14 0.597331
15 0.572055
16 0.562000
17 0.556495
18 0.352228


Updated Results (unofficial)

Some participants improved their algorithms after the challenge evaluation deadline and after the publication of the true labels. The following tables show the updated rankings according to the results reported in the workshop papers. Please note that these are not official results and the organizers cannot guarantee their correctness.

Updated results Task A:

Average AUC Team
0.9875 Khurram Nazir Junejo, Mirza Muhammad Yousaf, Asim Karim
Lahore University of Management Sciences, Pakistan
0.9731 Antonia Kyriakopoulou, Theodore Kalamboukis
Athens University of Economics and Business
0.9672 Alexander Zien (invited speaker), ∇S3VM algorithm
Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
0.9588 Nikolaos Trogkanis, National Technical University of Athens, Greece
Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
0.9491 Bernhard Pfahringer
University of Waikato, New Zealand
0.9487 Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja
Inductis India Pvt Ltd
0.9300 Gordon Cormack
University of Waterloo, Canada

Updated results Task B:

Average AUC Team
0.9767 Antonia Kyriakopoulou, Theodore Kalamboukis
Athens University of Economics and Business
0.9524 Nikolaos Trogkanis, National Technical University of Athens, Greece
Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
0.9490 Gordon Cormack
University of Waterloo, Canada
0.9374 Alexander Zien (invited speaker), ∇S3VM algorithm
Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
0.9074 Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja
Inductis India Pvt Ltd

Tasks

The competition is split into two tasks, A and B.

For each task we provide the following resources:
  • Labeled training emails comprising 50% spam and 50% non-spam, collected from publicly available sources.
  • Several inboxes from different users with unlabeled evaluation emails.
  • Tuning data consisting of an additional set of labeled training data and inboxes with labeled tuning emails.
Tasks:
  • Task A deals with a setting where many training emails and many unlabeled emails for each user are available. Our assumption is that for this task, adaptation to the user-specific characteristics of the inboxes is critical for good classification performance.
  • For Task B we provide many separate inboxes, but only little training data and little unlabeled data per inbox. We assume that to be successful in this setting, a learning algorithm needs to generalize over the different users: user-specific properties must be taken into account, while the data from the other users is utilized in a way that enhances the classification performance.
The goal in both tasks is to train separate spam/non-spam classifiers, one for each user, that correctly classify the evaluation emails in the inboxes. You are given the evaluation data without labels. You need to submit the classification results for the evaluation data, which we will use to calculate the classification performance based on the held-back true labels.

The task specific number of emails and inboxes can be found in the following table:

                                                      Task A   Task B
Number of labeled training emails                       4000      100
Number of emails within one evaluation/tuning inbox     2500      400
Number of inboxes for evaluation                            3       15
Number of inboxes for tuning                                1        2

The tuning data can be used for parameter tuning but cannot be used to augment the training data, because the word/feature IDs (the dictionary) of the tuning data differ from those of the training/evaluation data.

Submission data will be a list of decision function values for each inbox. Please follow the detailed instructions under section "Submission" for submitting your results. You will also be asked to submit a short description of your algorithm (please see section "Submission" for details). The interestingness of the algorithms will be judged: non-straightforward approaches will be valued, and the most innovative ideas will be selected by the Discovery Challenge chair and a group of experts for the "Creativity Award".

Evaluation Criterion and Winner Selection

Evaluation Criterion

You are asked to submit the output of your classifier as decision function values, separately for each inbox (please see section "Submission" for details). The evaluation will run on the held-back labels of the emails in the users' inboxes. The evaluation criterion is the AUC value, i.e. the area under the ROC curve (Receiver Operating Characteristic curve). A ROC curve is a plot of the true positive rate vs. the false positive rate as the prediction threshold sweeps through all possible values. The area under this curve has the nice property that it specifies the probability that, when we draw one positive and one negative example at random, the decision function assigns a higher value to the positive example than to the negative one.

We compute AUC values for each inbox separately and average over all inboxes of the task. In case you want to tune parameters of your algorithm on the provided tuning data and do not know how to compute the AUC value, we provide a small program with C source code below. You do not need to use this program, but it may save you time.
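To make the criterion concrete, the following is a minimal, hypothetical C sketch of a pairwise AUC computation for one inbox. It is not the program provided with the challenge data; the file name and function are made up for illustration.

/* auc_sketch.c -- minimal pairwise AUC computation for one inbox.
 * Hypothetical illustration, not the provided challenge program.
 * Compile with: gcc -o auc_sketch auc_sketch.c */
#include <stdio.h>

/* Probability that a randomly drawn spam email (+1) receives a higher
 * decision value than a randomly drawn non-spam email (-1); ties count 1/2. */
double auc(const double *score, const int *label, int n)
{
    long long pairs = 0;
    double wins = 0.0;
    int i, j;
    for (i = 0; i < n; i++) {
        if (label[i] != +1) continue;          /* positive (spam) example */
        for (j = 0; j < n; j++) {
            if (label[j] != -1) continue;      /* negative (non-spam) example */
            pairs++;
            if (score[i] > score[j])
                wins += 1.0;
            else if (score[i] == score[j])
                wins += 0.5;
        }
    }
    return pairs > 0 ? wins / (double)pairs : 0.0;
}

int main(void)
{
    /* Toy inbox: decision function values and true labels (+1=spam, -1=non-spam). */
    double score[] = {  2.1, -0.3,  0.8, -1.5,  0.1 };
    int    label[] = {   +1,   -1,   +1,   -1,   -1 };
    printf("AUC = %f\n", auc(score, label, 5));
    return 0;
}

The pairwise formulation is quadratic in the inbox size; the provided program may be implemented differently, but any correct AUC computation yields the same value.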

Winner Selection

There will be four winners, two for each task A and B:
  • Spam Filtering Performance Award - Task A,
  • Spam Filtering Performance Award - Task B,
  • Spam Filtering Creativity Award - Task A,
  • Spam Filtering Creativity Award - Task B.
The winners of the "Spam Filtering Performance Award - Task A/B" will be determined as follows. All participants are ranked according to their average AUC value on the evaluation data. The winner is the participant with the maximum average AUC value.

Winner of "Spam Filtering Creativity Award - Task A/B " will be the participant whose algorithm is highly outstanding at its innovative ideas judged by the challenge chair and a group of experts.

Download of Data Sets

Data set task A (7 MB)
Data set task B (2 MB)
AUC calculation C source code (8 kB)
Evaluation data task A and B with true labels (5 MB)

Format

The files you downloaded are archives compressed with zip. Most decompression programs (e.g. Winzip, RAR) can decompress this format. The archives contain the following files:

Task    File name                 Data set size     Description
Task A  task_a_labeled_train.tf   4000 emails       Labeled training emails.
        task_a_u00_eval.tf ...
        task_a_u02_eval.tf        2500 emails each  Unlabeled evaluation data: 3 inboxes from different users.
        task_a_labeled_tune.tf    4000 emails       Labeled training emails for parameter tuning. Feature representation corresponds only to file task_a_u00_tune.tf.
        task_a_u00_tune.tf        2500 emails       Labeled test emails of one user's inbox for parameter tuning. Feature representation corresponds only to file task_a_labeled_tune.tf.
Task B  task_b_labeled_train.tf   100 emails        Labeled training emails.
        task_b_u00_eval.tf ...
        task_b_u14_eval.tf        400 emails each   Unlabeled evaluation data: 15 inboxes from different users.
        task_b_labeled_tune.tf    100 emails        Labeled training emails for parameter tuning. Feature representation corresponds only to files task_b_u00_tune.tf and task_b_u01_tune.tf.
        task_b_u00_tune.tf
        task_b_u01_tune.tf        400 emails each   Labeled test emails from two users' inboxes for parameter tuning. Feature representation corresponds only to file task_b_labeled_tune.tf.

We do not provide the raw text of the emails. The emails are given in a bag-of-words vector space representation. The attributes are the term frequencies of the words. We removed words that occur fewer than four times in the data set, resulting in a dictionary size of about 150,000 words. The data set files are in the sparse data format used by SVMlight. Each line represents one email; the first token in each line is the class label (+1=spam; -1=non-spam; 0=unlabeled-evaluation-data). The tokens following the label are pairs of word IDs and term frequencies, in ascending order of the word IDs.

To give an example, the first line in task_a_labeled_train.tf starts like this:

1 9:3 94:1 109:1 163:1

This line represents a spam email (the class label is "1"); the first four word ID/term frequency pairs are shown. The word ID of the first pair is 9, and this word occurs 3 times within the email, as indicated by ":3".
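As an illustration of the format, a minimal, hypothetical C sketch for reading such a line might look as follows (this is not part of the challenge distribution; the file name is made up):

/* parse_line.c -- sketch of reading one email in the sparse
 * word-ID:term-frequency format described above.
 * Compile with: gcc -o parse_line parse_line.c */
#include <stdio.h>

int main(void)
{
    /* Beginning of the first line of task_a_labeled_train.tf. */
    const char *line = "1 9:3 94:1 109:1 163:1";
    const char *p = line;
    int label, id, tf, offset;

    /* First token: class label (+1=spam, -1=non-spam, 0=unlabeled). */
    if (sscanf(p, "%d%n", &label, &offset) != 1)
        return 1;
    p += offset;
    printf("class label: %d\n", label);

    /* Remaining tokens: wordID:termFrequency pairs in ascending word ID order. */
    while (sscanf(p, " %d:%d%n", &id, &tf, &offset) == 2) {
        printf("word %d occurs %d time(s)\n", id, tf);
        p += offset;
    }
    return 0;
}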

The program for the calculation of the AUC is written in C and compiles with gcc. There is a makefile and a short readme included.

Organization

Challenge Chair: Steffen Bickel
Humboldt-Universität zu Berlin, Germany


The Discovery Challenge 2006 is organized in cooperation with Strato Rechenzentrum AG.
