Discovery Challenge

Introduction

This year's competition deals with personalized spam filtering and generalization across related learning tasks. People spend an increasing amount of time reading messages and deciding whether they are spam or non-spam. Some users spend additional time labeling their received spam messages in order to train local spam filters running on their desktop machines. Email service providers want to relieve users of this burden by installing server-based spam filters. Training such filters cannot rely on labeled messages from the individual users, but must rely on publicly available sources, such as newsgroup messages or emails received through "spam traps" (spam traps are email addresses that are published in a way invisible to human readers but are collected by spammers' web crawlers).
The distribution of this combined source of training data differs from the distributions of the emails received by individual users. When learning spam filters for individual users from this type of data, one needs to cope with a discrepancy between the distributions governing training and test data, and one needs to balance generalization and adaptation. The generalization/adaptation can rely on the large amounts of unlabeled emails in the users' inboxes that are accessible to server-based spam filters. Using this unlabeled data, a spam filter can be adapted to the properties of a specific user's inbox; when only little unlabeled data is available for a user, generalization over multiple users is advisable.

We provide labeled training data collected from publicly available sources. The unlabeled inboxes of several users serve as test data. The inboxes differ in the distribution of emails. The goal is to construct a spam filter for each individual user that correctly classifies the emails in that user's inbox as spam or non-spam. A clever way of utilizing the available sets of unlabeled emails from different users is required.

There will be a Discovery Challenge workshop at ECML/PKDD 2006 in Berlin, where we will discuss the results, different approaches, and other issues related to the problem setting.

We are looking forward to an interesting competition/workshop and encourage your participation.

Workshop Schedule

The workshop will take place on September 22nd, 2006, at Humboldt-Universität zu Berlin, Germany, Unter den Linden 6, room 3094/96.

10:40-11:10 ECML/PKDD Discovery Challenge 2006 Overview [paper]
Steffen Bickel
11:10-11:45 (invited) Semi-Supervised Support Vector Machines and Application to Spam Filtering [slides]
Alexander Zien, Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
11:45-12:10 A Semi-Supervised Spam Mail Detector [paper]
Bernhard Pfahringer
12:10-13:30 Lunch Break
13:30-13:55 Harnessing Unlabeled Examples through Iterative Application of Dynamic Markov Modeling [paper]
Gordon Cormack
13:55-14:20 A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering [paper]
Khurum Junejo, Mirza Yousaf, Asim Karim
14:20-14:45 Text Classification Using Clustering [paper]
Antonia Kyriakopoulou, Theodore Kalamboukis
14:45-15:00 Discussion
15:00-15:30 Coffee Break
15:30-15:55 Using Tri-Training and Support Vector Machines for Addressing the ECML/PKDD 2006 Discovery Challenge [paper]
Dimitrios Mavroeidis, Konstantinos Chaidos, Stefanos Pirillos, Dimosthenis Christopoulos, Michalis Vazirgiannis
15:55-16:20 TPN²: Using Positive-Only Learning to Deal with the Heterogeneity of Labeled and Unlabeled Data [paper]
Nikolaos Trogkanis, Georgios Paliouras
16:20-16:35 Discussion, final remarks
(canceled) Identifying SPAM with Predictive Models [paper]
Dan Steinberg, Mikhaylo Golovnya

Workshop Proceedings

The print version of the proceedings will be available at the registration desk of the ECML/PKDD 2006 conference. The online proceedings can be accessed as one PDF file here. Single papers can be accessed from the links next to the entries in the workshop schedule.

Important Dates

March 1st, 2006 Tasks and datasets available online.
June 7th, 2006 Submissions of spam filtering results due (by midnight CET).
June 9th, 2006 Submissions of algorithm description due (by midnight CET).
June 14th 2006 Notification of winners/publication of results on webpage.
June 16th 2006 Notification of creativity award winners.
July 26th, 2006 Workshop paper submission deadline.
July 31st, 2006 Notification of acceptance.
August 14th, 2006 Workshop proceedings (camera-ready) deadline.
September 20th, 2006 Discovery Challenge Overview,
time: 14:00, location: Audimax.
September 22nd, 2006 ECML/PKDD 2006 Discovery Challenge Workshop,
time: 10:40-17:00, location: room 3094/96.

Winners

  • Spam Filtering Performance Award - Task A (the following three teams share the first rank):
    • Khurram Nazir Junejo, Mirza Muhammad Yousaf, Asim Karim
      Lahore University of Management Sciences, Pakistan
    • Bernhard Pfahringer
      University of Waikato, New Zealand
    • Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja
      Inductis India Pvt Ltd
  • Spam Filtering Performance Award - Task B:
    • Gordon Cormack
      University of Waterloo, Canada
  • Spam Filtering Creativity Award - Task A/B (we decided to award one team for both tasks instead of one for each task, because most teams used the same algorithm for both tasks):
    • Bernhard Pfahringer
      University of Waikato, New Zealand
Detailed results are given in the next section.

Detailed Results

57 teams from 19 different countries participated in the challenge; 26 teams submitted results for the evaluation. Not all participants submitted results for both tasks. We averaged the AUC values over all inboxes, as described in the section Evaluation Criterion below, and determined the ranking. We conducted significance tests at a significance level of 5% to test the null hypothesis that the second-ranked submission has a higher AUC value than the first. The test statistic is computed as described by Hanley and McNeil (A Method of Comparing the Areas under Receiver Operating Characteristic Curves Derived from the Same Cases, 1983). For Task A we could not reject the null hypothesis for the submissions with the second- and third-highest AUC values; this means there is no statistically significant difference among the top three submissions, and they all win the Spam Filtering Performance Award - Task A. For Task B we could reject the null hypothesis for the second-ranked submission; this means there is a single winner of the Spam Filtering Performance Award - Task B.
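For reference, the test statistic of Hanley and McNeil (1983) for comparing two AUC estimates obtained on the same cases has the following general form (a sketch only; the correlation coefficient $r$ between the two estimates is taken from the table in their paper):

$$ z = \frac{\hat{\theta}_1 - \hat{\theta}_2}{\sqrt{SE_1^2 + SE_2^2 - 2\,r\,SE_1\,SE_2}} $$

where $\hat{\theta}_1$ and $\hat{\theta}_2$ are the two average AUC values and $SE_1$, $SE_2$ their standard errors; the difference is considered significant at the 5% level if $z$ exceeds the corresponding critical value of the standard normal distribution.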

Detailed results Task A:

Rank  Average AUC  Team
1 0.950667 Khurram Nazir Junejo, Mirza Muhammad Yousaf, Asim Karim
Lahore University of Management Sciences, Pakistan
1 0.949094 Bernhard Pfahringer
University of Waikato, New Zealand
1 0.948666 Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja
Inductis India Pvt Ltd
2 0.936457 Nikolaos Trogkanis, National Technical University of Athens, Greece
Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
3 0.927839 Chao Xu, Yiming Zhou
Beihang University, Beijing, China
4 0.927694 Lalit Wangikar, Mansi Khanna, Ankush Talwar
Inductis India Pvt Ltd
5 0.914380 Dimitrios Mavroeidis, Konstantinos Chaidos, Stefanos Pirillos, Dimosthenis Christopoulos, and Michalis Vazirgiannis
DB-NET Lab, Informatics Dept., Athens University EB, Greece
6 0.912040
7 0.900059
8 0.897554
9 0.892118
10 0.887429
11 0.886873
12 0.840133
13 0.838834
14 0.823030
15 0.806275
16 0.743754
17 0.733333
18 0.661819
19 0.637488
20 0.548266
21 0.523404

Detailed results Task B:

Rank  Average AUC  Team
1 0.946469 Gordon Cormack
University of Waterloo, Canada
2 0.918251 Nikolaos Trogkanis, National Technical University of Athens, Greece
Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
3 0.907398 Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja
Inductis India Pvt Ltd
4 0.899241 D’yakonov Alexander
Moscow State University, Russia
5 0.893333 Wenyuan Dai
Apex Data & Knowledge Management Lab, Shanghai Jiao Tong University
6 0.870906
7 0.864615
8 0.850165
9 0.815666
10 0.800084
11 0.744548
12 0.685038
13 0.597331
14 0.597331
15 0.572055
16 0.562000
17 0.556495
18 0.352228


Updated Results (unofficial)

Some participants improved their algorithms after the challenge evaluation deadline and after the publication of the true labels. The following tables show the updated rankings according to the results reported in the workshop papers. Please note that these are not official results and the organizers cannot guarantee their correctness.

Updated results Task A:

Average AUC Team
0.9875 Khurram Nazir Junejo, Mirza Muhammad Yousaf, Asim Karim
Lahore University of Management Sciences, Pakistan
0.9731 Antonia Kyriakopoulou, Theodore Kalamboukis
Athens University of Economics and Business
0.9672 Alexander Zien (invited speaker), ∇S3VM algorithm
Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
0.9588 Nikolaos Trogkanis, National Technical University of Athens, Greece
Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
0.9491 Bernhard Pfahringer
University of Waikato, New Zealand
0.9487 Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja
Inductis India Pvt Ltd
0.9300 Gordon Cormack
University of Waterloo, Canada

Updated results Task B:

Average AUC Team
0.9767 Antonia Kyriakopoulou, Theodore Kalamboukis
Athens University of Economics and Business
0.9524 Nikolaos Trogkanis, National Technical University of Athens, Greece
Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
0.9490 Gordon Cormack
University of Waterloo, Canada
0.9374 Alexander Zien (invited speaker), ∇S3VM algorithm
Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
0.9074 Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja
Inductis India Pvt Ltd

Tasks

The competition is split into two tasks, A and B.

For each task we provide the following resources:
  • Labeled training emails comprising 50% spam and 50% non-spam, collected from publicly available sources.
  • Several inboxes from different users with unlabeled evaluation emails.
  • Tuning data consisting of an additional set of labeled training data and inboxes with labeled tuning emails.
Tasks:
  • Task A deals with a setting where many training emails and many unlabeled emails for each user are available. Our assumption is that for this task, adaptation to the user-specific characteristics of the inboxes is critical for good classification performance.
  • For Task B we provide many separate inboxes, but only little training data and little unlabeled data per inbox. We assume that to be successful in this setting, a learning algorithm needs to generalize over the different users: user-specific properties must be taken into account, while the data from the other users is utilized in a way that enhances the classification performance.
The goal in both tasks is to train separate spam/non-spam classifiers, one for each user, that correctly classify the evaluation emails in the inboxes. You are given the evaluation data without labels. You need to submit the classification results for the evaluation data, which we will use to calculate the classification performance based on the held-back true labels.

The task specific number of emails and inboxes can be found in the following table:

                                                      Task A   Task B
Number of labeled training emails                       4000      100
Number of emails within one evaluation/tuning inbox     2500      400
Number of inboxes for evaluation                            3       15
Number of inboxes for tuning                                1        2

The tuning data can be used for parameter tuning but cannot be used to augment the training data, because the word/feature IDs (the dictionary) of the tuning data differ from those of the training/evaluation data.

Submission data will be a list of decision function values for each inbox. Please follow the detailed instructions under section "Submission" for submitting your results. You will also be asked to submit a short description of your algorithm (please see section "Submission" for details). The interestingness of the algorithms will be judged: non-straightforward approaches will be valued, and the most innovative ideas will be selected by the Discovery Challenge chair and a group of experts for the "Creativity Award".

Evaluation Criterion and Winner Selection

Evaluation Criterion

You are asked to submit the output of your classifier as decision function values, separately for each inbox (please see section "Submission" for details). The evaluation will run on the held-back labels of the emails in the users' inboxes. The evaluation criterion is the AUC value, i.e. the area under the ROC curve (Receiver Operating Characteristic curve). A ROC curve is a plot of the true positive rate vs. the false positive rate as the prediction threshold sweeps through all possible values. The area under this curve has the nice property that it specifies the probability that, when we draw one positive and one negative example at random, the decision function assigns a higher value to the positive example than to the negative one.

We compute AUC values for each inbox separately and average over all inboxes of the task. In case you want to tune parameters of your algorithm on the provided tuning data and do not know how to compute the AUC value, we provide a small program with C source code below. You do not need to use this program, but it may save you time.
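To make the criterion concrete, the following is a minimal, hypothetical C sketch of a pairwise AUC computation for one inbox. It is not the program provided with the challenge data; the file name and function are made up for illustration.

/* auc_sketch.c -- minimal pairwise AUC computation for one inbox.
 * Hypothetical illustration, not the provided challenge program.
 * Compile with: gcc -o auc_sketch auc_sketch.c */
#include <stdio.h>

/* Probability that a randomly drawn spam email (+1) receives a higher
 * decision value than a randomly drawn non-spam email (-1); ties count 1/2. */
double auc(const double *score, const int *label, int n)
{
    long long pairs = 0;
    double wins = 0.0;
    int i, j;
    for (i = 0; i < n; i++) {
        if (label[i] != +1) continue;          /* positive (spam) example */
        for (j = 0; j < n; j++) {
            if (label[j] != -1) continue;      /* negative (non-spam) example */
            pairs++;
            if (score[i] > score[j])
                wins += 1.0;
            else if (score[i] == score[j])
                wins += 0.5;
        }
    }
    return pairs > 0 ? wins / (double)pairs : 0.0;
}

int main(void)
{
    /* Toy inbox: decision function values and true labels (+1=spam, -1=non-spam). */
    double score[] = {  2.1, -0.3,  0.8, -1.5,  0.1 };
    int    label[] = {   +1,   -1,   +1,   -1,   -1 };
    printf("AUC = %f\n", auc(score, label, 5));
    return 0;
}

The pairwise formulation is quadratic in the inbox size; the provided program may be implemented differently, but any correct AUC computation yields the same value.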

Winner Selection

There will be four winners, two for each task A and B:
  • Spam Filtering Performance Award - Task A,
  • Spam Filtering Performance Award - Task B,
  • Spam Filtering Creativity Award - Task A,
  • Spam Filtering Creativity Award - Task B.
The winners of the "Spam Filtering Performance Award - Task A/B" will be determined as follows. All participants are ranked according to their average AUC value on the evaluation data. The winner is the participant with the maximum average AUC value.

Winner of "Spam Filtering Creativity Award - Task A/B " will be the participant whose algorithm is highly outstanding at its innovative ideas judged by the challenge chair and a group of experts.

Download of Data Sets

Data set task A (7 MB)
Data set task B (2 MB)
AUC calculation C source code (8 kB)
Evaluation data task A and B with true labels (5 MB)

Format

The files you downloaded are archives compressed with zip. Most decompression programs (e.g. Winzip, RAR) can decompress this format. The archives contain the following files:

Task    File name                 Data set size     Description
Task A  task_a_labeled_train.tf   4000 emails       Labeled training emails.
        task_a_u00_eval.tf ...
        task_a_u02_eval.tf        2500 emails each  Unlabeled evaluation data: 3 inboxes from different users.
        task_a_labeled_tune.tf    4000 emails       Labeled training emails for parameter tuning. Feature representation corresponds only to file task_a_u00_tune.tf.
        task_a_u00_tune.tf        2500 emails       Labeled test emails of one user's inbox for parameter tuning. Feature representation corresponds only to file task_a_labeled_tune.tf.
Task B  task_b_labeled_train.tf   100 emails        Labeled training emails.
        task_b_u00_eval.tf ...
        task_b_u14_eval.tf        400 emails each   Unlabeled evaluation data: 15 inboxes from different users.
        task_b_labeled_tune.tf    100 emails        Labeled training emails for parameter tuning. Feature representation corresponds only to files task_b_u00_tune.tf and task_b_u01_tune.tf.
        task_b_u00_tune.tf
        task_b_u01_tune.tf        400 emails each   Labeled test emails from two users' inboxes for parameter tuning. Feature representation corresponds only to file task_b_labeled_tune.tf.

We do not provide the raw text of the emails. The emails are given in a bag-of-words vector space representation. The attributes are the term frequencies of the words. We removed words that occur fewer than four times in the data set, resulting in a dictionary size of about 150,000 words. The data set files are in the sparse data format used by SVMlight. Each line represents one email; the first token in each line is the class label (+1=spam; -1=non-spam; 0=unlabeled-evaluation-data). The tokens following the label are pairs of word IDs and term frequencies, in ascending order of the word IDs.

To give an example, the first line in task_a_labeled_train.tf starts like this:

1 9:3 94:1 109:1 163:1

This line represents a spam email (the class label is "1"); the first four word ID/term frequency pairs are shown. The word ID of the first pair is 9, and this word occurs 3 times within the email, as indicated by ":3".
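As an illustration of the format, a minimal, hypothetical C sketch for reading such a line might look as follows (this is not part of the challenge distribution; the file name is made up):

/* parse_line.c -- sketch of reading one email in the sparse
 * word-ID:term-frequency format described above.
 * Compile with: gcc -o parse_line parse_line.c */
#include <stdio.h>

int main(void)
{
    /* Beginning of the first line of task_a_labeled_train.tf. */
    const char *line = "1 9:3 94:1 109:1 163:1";
    const char *p = line;
    int label, id, tf, offset;

    /* First token: class label (+1=spam, -1=non-spam, 0=unlabeled). */
    if (sscanf(p, "%d%n", &label, &offset) != 1)
        return 1;
    p += offset;
    printf("class label: %d\n", label);

    /* Remaining tokens: wordID:termFrequency pairs in ascending word ID order. */
    while (sscanf(p, " %d:%d%n", &id, &tf, &offset) == 2) {
        printf("word %d occurs %d time(s)\n", id, tf);
        p += offset;
    }
    return 0;
}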

The program for the calculation of the AUC is written in C and compiles with gcc. There is a makefile and a short readme included.

Organization

Challenge Chair: Steffen Bickel
Humboldt-Universität zu Berlin, Germany


The Discovery Challenge 2006 is organized in cooperation with Strato Rechenzentrum AG.
