Tutorials

Monday, September 18th

Friday, September 22nd

An easy statistical theory for highly scalable learning algorithms

Instructor: Claudio Gentile, Università dell'Insubria, Varese, Italy

Tutorial Notes.

This is a survey of recent advances in on-line prediction algorithms and their connections to statistical methods in Machine Learning. An on-line learning algorithm is a supervised incremental algorithm that sweeps through a sequence of examples only once. As such, on-line algorithms are both highly adaptive and highly scalable. They seem to be an unavoidable choice when dealing with very large datasets of examples and/or when the datasets are rapidly changing with time. In the worst-case setting of on-line learning, prediction performance is proven via a pointwise analysis, i.e., an analysis that avoids statistical assumptions on the way the data are generated. Recent developments in the on-line learning literature show a natural way of turning on-line algorithms working in the worst-case setting into batch algorithms working under more standard stochastic (e.g., i.i.d.) assumptions on the data-generation process. The underlying statistical theory is fairly easy, while the resulting algorithms are highly scalable (hence the title of this tutorial). The error bounds one can derive from this theory are both data-dependent and algorithm-dependent. Moreover, and most importantly, these bounds are trivial to compute and are likely to be sharp (at least as sharp as those in the existing literature on, say, Support Vector Machines). There are two immediate practical benefits of this theory:
  1. It delivers computationally efficient learning algorithms achieving the state-of-the-art error bounds obtained by computationally intensive methods;
  2. It allows sharp and rigorous error reporting.
A broader goal of our tutorial is to bridge the Learning Theory and Machine Learning/Data Mining communities, and we hope to contribute to increased cooperation between the two. Some of the results we propose are well known (and have been widely publicized); others are not and, we believe, deserve more attention from machine learning practitioners.
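
To make the online-to-batch idea above concrete, here is a minimal sketch of our own (not taken from the tutorial notes), assuming only NumPy: a perceptron sweeps once over the example sequence, and the average of its intermediate weight vectors is returned as the batch hypothesis. The mistake count it reports is exactly the kind of trivially computable, data-dependent quantity such error bounds are built from.

    import numpy as np

    def averaged_perceptron(X, y, lr=1.0):
        """Single-pass online perceptron; the averaged weight vector is the
        batch hypothesis of a simple online-to-batch conversion."""
        n, d = X.shape
        w = np.zeros(d)          # current online hypothesis
        w_sum = np.zeros(d)      # running sum of hypotheses, one per example
        mistakes = 0
        for t in range(n):       # exactly one sweep over the sequence
            if y[t] * X[t].dot(w) <= 0:    # online prediction mistake
                w = w + lr * y[t] * X[t]   # perceptron update
                mistakes += 1
            w_sum += w           # accumulate for averaging
        return w_sum / n, mistakes

    # Toy usage: linearly separable data with labels in {-1, +1}.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = np.sign(X[:, 0] + 0.3 * X[:, 1])
    w_avg, m = averaged_perceptron(X, y)
    print("online mistakes:", m)
    print("sign agreement:", np.mean(np.sign(X.dot(w_avg)) == y))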

Novel Aspects in Unsupervised Learning: Semi-Supervised and Distributed Algorithms

Instructors: Dimitrios Gunopulos, Department of Computer Science & Engineering, Univ. of California at Riverside, USA, Michalis Vazirgiannis, and Maria Halkidi, Department of Informatics, Athens University of Economics & Business, Greece

Tutorial Notes.

Unsupervised learning (also known as clustering) is applicable in many real-life applications because there is typically a large amount of unclassified data available. Such data may contain information that is not previously known or that changes rapidly (e.g., genes of unknown function, dynamically changing web pages in an automatic web document classification system). Moreover, since clustering is unsupervised, i.e., there is no a priori knowledge of the data distribution in the underlying set, the significance of the clusters defined for a data set needs to be validated. An important problem thus arises in the context of clustering: the development of efficient indicators to assess the quality and validity of clustering results.

The evaluation of clustering validity is a broad issue and the subject of endless arguments, since the notion of a good clustering is strictly related to the application domain and the users' perspectives. The problem is more acute in the case of high-dimensional data, where many studies have shown that traditional clustering methods fail, leading to meaningless results. This is due to the lack of clustering tendency in some of the defined subspaces, or to the irrelevance of some data dimensions (i.e., attributes) to the application aspects and user requirements. Nevertheless, the majority of clustering validity criteria use structural properties of the data to assess the validity of the discovered clusters. The presence of such properties, however, does not guarantee that a data clustering is interesting or useful to the user. Approaches are needed that take the user's intention into account to tune the clustering process, and hence the entire area of semi-supervised learning has recently been introduced. Due to the features of modern data sets, such as huge volume, high dimensionality, distribution and heterogeneity, it is in many cases essential to transform the data into lower- or higher-dimensional spaces in which the clusters become more separable.
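
To make the notion of a validity indicator concrete, here is a minimal sketch of our own (not from the tutorial materials), assuming scikit-learn is available: a relative validity index, here the silhouette coefficient, is used to compare k-means clusterings with different numbers of clusters.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # Synthetic data with 3 "true" clusters.
    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0,
                      random_state=0)

    # Score candidate clusterings: a higher silhouette value indicates
    # more compact, better-separated clusters.
    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))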

Regarding the dimensionality issue, we will present a survey of the relevant dimensionality reduction algorithms and transformations. Starting from the linear techniques - such as PCA, LSI, FastMap and MDS - we present advanced nonlinear techniques - such as Isomap, MVU, LLE and Laplacian eigenmaps. Distribution and dynamic behaviour, on the other hand, are key features of current content configurations (such as the WWW, P2P content systems, etc.) and pose grand challenges to the unsupervised learning community. To summarize, this tutorial provides a review of approaches that address the aforementioned features, such as distributed clustering algorithms and semantic overlays for P2P networks. We will also discuss approaches that tackle the clustering problem by exploiting the user's input, so that clustering results are evaluated in terms of user-requirement satisfaction. Finally, we will address the role of unsupervised learning in the context of these new aspects (distribution, data dynamics, and user-constraint satisfaction) and discuss promising research directions.
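
As a concrete reference point for the linear techniques listed above, the following minimal sketch of our own (not from the tutorial materials) implements PCA via the singular value decomposition; nonlinear methods such as Isomap or LLE replace this single global projection with embeddings derived from a neighborhood graph.

    import numpy as np

    def pca(X, k):
        """Linear dimensionality reduction: project onto the top-k
        principal components (directions of maximal variance) via SVD."""
        Xc = X - X.mean(axis=0)               # center the data
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T                  # k-dimensional embedding

    # 3-D points lying near a 2-D plane, plus a little noise.
    rng = np.random.default_rng(1)
    Z = rng.normal(size=(200, 2))
    X = Z @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(200, 3))
    print(pca(X, 2).shape)  # (200, 2)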

Agent Intelligence Through Data Mining

Instructors: Andreas L. Symeonidis, Pericles A. Mitkas, Aristotle University of Thessaloniki, Greece

Tutorial Notes.

The tutorial will offer a self-contained overview of a relatively young but important area of research: the intersection of agent intelligence and data mining. This intersection leads to considerable advancements in the area of information technologies and is drawing increasing attention from both the research and industrial communities. It can take two forms: a) the more mundane use of intelligent agents for improved data mining, and b) the use of data mining for smarter, more efficient agents. The second approach is the main focus of the tutorial.

The tutorial will review the fields of data mining and agent technology and present a number of approaches to the problem of infusing knowledge extracted with data mining techniques into agents. It will elaborate on a methodology for developing multi-agent systems, describe available open-source tools that support this process, and demonstrate the application of the methodology to different types of MAS.

Machine Learning of Natural Language

Instructors: Walter Daelemans, University of Antwerp, Belgium and Antal van den Bosch, Tilburg University, the Netherlands.

Tutorial Notes.

Machine learning of natural language is a well-established discipline in computational linguistics. The Association for Computational Linguistics' SIG on language learning has recently organized its tenth conference, and statistical learning has arguably taken over most of the field of computational linguistics. In Machine Learning, too, language processing and text analysis tasks are increasingly becoming part of mainstream research, as witnessed, for example, by the many language-processing-inspired challenges in the European PASCAL network. Nevertheless, we feel that the Machine Learning community could contribute significantly more to solving language processing problems, and conversely, that language data constitute interesting and difficult data for testing machine learning algorithms and methods.

In this tutorial, we discuss the special properties of language data that make it an interesting application domain for supervised classification-based machine learning approaches. For instance, many natural language processing tasks represent complex sequence-processing and relational-learning tasks. Among the typical data features we find some interesting challenges, such as Zipfian and Dirichlet distributions of words, and very large to very, very large datasets. Language brings the intriguing notion of the productive exception, or pockets of exceptions. And finally, linguistics offers a wealth of domain knowledge -- which is not always useful for learning.
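
The Zipfian behaviour mentioned above is easy to observe. The following minimal sketch of our own (the file name corpus.txt is a placeholder for any plain-text corpus) prints the rank-frequency profile of the most frequent words, which under Zipf's law is roughly a constant times 1/rank, together with the fraction of word types occurring exactly once.

    from collections import Counter

    # Placeholder corpus; substitute any plain-text file you have at hand.
    words = open("corpus.txt", encoding="utf-8").read().lower().split()
    counts = Counter(words)

    # Under Zipf's law, frequency ~ C * N / rank, so rank * freq / N is
    # roughly constant for the top-ranked words.
    N = len(words)
    for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
        print(rank, word, freq, round(rank * freq / N, 3))

    # The long tail: fraction of word types seen only once (hapaxes).
    hapax = sum(1 for f in counts.values() if f == 1)
    print("hapax fraction:", round(hapax / len(counts), 3))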

In the tutorial we pay special attention to aspects of machine learning algorithms that are put to the test most extremely in language learning:

  • Eager versus lazy learning, i.e. the ability to find minimal-description-length models versus the capacity to retain exceptions (see the memory-based sketch below);
  • Scaling issues in learning, storage, and classification, particularly dealing with large numbers of output classes.
We illustrate the tutorial with many practical examples, partly drawn from the book "Memory-based language processing" (Daelemans and Van den Bosch, 2005, Cambridge University Press).
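
The lazy end of this spectrum can be illustrated with a toy memory-based learner, sketched below with made-up data (our illustration, not an example from the book): all training examples are stored verbatim, so exceptions are retained by construction, and generalization happens only at classification time through a nearest-neighbour vote under a simple overlap metric.

    from collections import Counter

    def overlap(a, b):
        """Overlap similarity: number of matching feature positions."""
        return sum(x == y for x, y in zip(a, b))

    class MemoryBasedLearner:
        """Lazy learner: store every example verbatim; generalize only
        at classification time."""
        def fit(self, examples):
            self.memory = examples
            return self
        def predict(self, features, k=1):
            # Majority vote among the k nearest stored examples.
            nearest = sorted(self.memory,
                             key=lambda ex: -overlap(ex[0], features))[:k]
            return Counter(lab for _, lab in nearest).most_common(1)[0][0]

    # Hypothetical toy task: predict the English plural suffix from the
    # last two letters of a word.
    feats = lambda w: tuple(w[-2:])
    train = [(feats(w), s) for w, s in
             [("fox", "es"), ("bus", "es"), ("dish", "es"),
              ("cat", "s"), ("dog", "s"), ("book", "s")]]
    model = MemoryBasedLearner().fit(train)
    print(model.predict(feats("box")))  # "es", copied from stored "fox"
    print(model.predict(feats("hat")))  # "s",  copied from stored "cat"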

Cryptographic techniques in privacy-preserving data mining

Instructor: Helger Lipmaa, University of Tartu and Cybernetica AS

Tutorial Notes.

The primary task of data mining is to develop models of aggregated data, for example about the habits of Internet users, about loyal customers, etc. The main question of privacy-preserving data mining (PPDM) is: can we develop accurate models without access to the precise information in individual data records? The latter question has proven difficult to solve. One can divide the techniques of PPDM into two large groups: the randomization approach (related to the well-known statistical Randomized Response Technique), and the cryptographic approach.

In this tutorial, we will give an overview of the cryptographic approach, which has many benefits but also several drawbacks. A major sociological problem with the cryptographic approach is that the cryptographic and data mining communities are currently distinctly separated. On the one hand, without special cryptographic training one tends to propose protocols for PPDM that are not as secure, or at least not as efficient, as up-to-date cryptographic protocols. On the other hand, without special training in data mining, one may easily end up solving the wrong problem. In particular, the security definitions for different PPDM tasks are application-dependent and have their roots in data mining.

A major technical problem is that data mining tasks routinely deal with huge amounts of data. Mining this data, even without any privacy concerns, is non-trivial. Cryptography, even on much smaller amounts of data, is also non-trivial. Mixing the two usually results in PPDM solutions that are very complicated and, most importantly, unlikely to be practical. Given this, it is important that the two communities collaborate to come up with security definitions for PPDM that admit efficient cryptographic implementations.

Preliminary list of topics:

  1. What is PPDM?
  2. Brief overview of two approaches: randomization, cryptographic
    1. Randomization is used when the database is given out: its entries are perturbed before release. Efficient but limited, with weaker security guarantees.
    2. Cryptography is used when the database is not given out, but clients query the database maintainer for their mining tasks. Needs a server and is computationally intensive, but in principle it allows all tasks to be solved and the guaranteed security is higher.
  3. Short overview of modern cryptographic thinking:
    1. 1980's: coming up with right definitions + proofs of concept = solving the problems (at least in theory)
    2. 1990's + 2000's: designing efficient protocols for many interesting tasks
  4. Cryptographic techniques in PPDM:
    1. Homomorphic public-key cryptography
      - Examples, definitions. Use in simple PPDM protocols: scalar product (see the sketch after this outline), ...
      - Limitations
    2. Secure two-party computation
      - Examples, definitions. Use in complex PPDM protocols: ID3, ...
      - Limitations
    3. Secure multi-party computation
      - Examples, definitions. Use in multi-party PPDM protocols.
      - Limitations
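
As a concrete instance of topic 4.1, here is a minimal sketch of the homomorphic scalar-product protocol (our illustration, not taken from the tutorial notes), assuming the third-party python-paillier package (phe). Alice sends her coordinate-wise encrypted vector to Bob; Bob combines it with his plaintext vector using only the additively homomorphic operations and returns a single ciphertext that only Alice can decrypt. This simple version assumes semi-honest parties.

    from phe import paillier

    # Key generation: Alice holds the private key.
    public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

    alice = [3, 1, 4, 1, 5]   # Alice's private vector
    bob = [2, 7, 1, 8, 2]     # Bob's private vector

    # Alice encrypts her vector coordinate-wise and sends it to Bob.
    enc_alice = [public_key.encrypt(a) for a in alice]

    # Bob computes the encrypted scalar product using only homomorphic
    # operations (ciphertext + ciphertext, ciphertext * plaintext scalar);
    # he never sees Alice's values.
    enc_result = enc_alice[0] * bob[0]
    for c, b in zip(enc_alice[1:], bob[1:]):
        enc_result = enc_result + c * b

    # Bob returns the ciphertext; only Alice can decrypt the result.
    print(private_key.decrypt(enc_result))  # 3*2 + 1*7 + 4*1 + 1*8 + 5*2 = 35
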
Target audience: Data mining/Machine learning community interested in privacy aspects. No or minimal knowledge of cryptography is assumed.

Randomization-Based Privacy-Preserving Data Mining

Instructor: Xintao Wu, University of North Carolina at Charlotte, USA

Tutorial Notes.

Privacy is becoming an increasingly important issue in many data mining applications, and a considerable amount of work on privacy-preserving data mining has appeared recently. In the first part of the tutorial, we survey recently proposed approaches to randomization-based privacy-preserving data mining for numerical data: the additive-noise-based, the random-rotation-based, and the condensation-based perturbation approaches, among others. In the second part of the tutorial, we will discuss the limitations of each of the above approaches and show various potential attacking methods that may be exploited by attackers to breach individual privacy. More specifically, we will present reconstruction techniques based on Spectral Filtering, Singular Value Decomposition, and Principal Component Analysis applied to additive-noise randomized data. For the random-rotation-based perturbation approach, we will show how an Independent Component Analysis based reconstruction method can be used on the rotation-perturbed data to breach privacy when a (small) subset of the sample data is known a priori to attackers. Finally, we will briefly discuss randomization-based privacy-preserving data mining methods for categorical data (e.g., market basket data as a special case).
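
The flavour of these reconstruction attacks can be seen in a few lines. The sketch below is our own synthetic illustration (not from the tutorial): strongly correlated records are perturbed with i.i.d. Gaussian noise, and an SVD/PCA projection, in the spirit of Spectral Filtering, recovers the original records far more accurately than the noise level alone would suggest, because the signal concentrates in a low-dimensional subspace while the noise spreads over all directions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Strongly correlated "private" data: 500 records whose 10 attributes
    # are noisy copies of a single hidden factor.
    factor = rng.normal(size=(500, 1))
    X = factor @ np.ones((1, 10)) + 0.1 * rng.normal(size=(500, 10))

    # Additive-noise randomization: publish X + E with i.i.d. Gaussian noise.
    E = 1.0 * rng.normal(size=X.shape)
    X_pub = X + E

    # PCA-style filtering attack: project the published data onto its top
    # principal component, where almost all of the signal (but only a
    # fraction of the noise) lives.
    Xc = X_pub - X_pub.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_hat = (Xc @ Vt[:1].T) @ Vt[:1] + X_pub.mean(axis=0)

    print("noise RMSE:   ", round(np.sqrt(np.mean(E ** 2)), 3))
    print("recovery RMSE:", round(np.sqrt(np.mean((X_hat - X) ** 2)), 3))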

Brief outline of the tutorial:

  • Overview of law and regulation on data sharing and data privacy.
  • Overview of randomization approaches for numerical data: density distribution reconstruction, rotation-based transformation, general linear transformation, and the condensation-based method.
  • Detailed discussion of potential attacking methods: Spectral Filtering, PCA, SVD, and ICA.
  • Overview of randomization approaches for categorical data.
  • Summary and future directions.
  • Pointers to more advanced material and resources.

Tutorial Participation

Registered conference participants can freely attend all tutorials.

Tutorials and Workshop Chairs

Tapio Elomaa
Inst. of Software Systems
Tampere University of Technology
P.O. Box 553
FI-33101 Tampere, Finland
elomaa@cs.tut.fi
Tel: +358 3 3115 3362
Fax: +358 3 3115 2913

Bart Goethals
Dept. of Math and Computer Science
University of Antwerp
Middelheimlaan 1
B-2020 Antwerpen, Belgium
bart.goethals@ua.ac.be
Tel: +32 3 265 3264
Fax: +32 3 265 3777

