|Thursday, April 11|
|10 am - noon||Data Mining in the Face of Contaminated and Incomplete Records|
|Roald K. Pearson, ETH Zurich|
|3 pm - 6 pm*||Enterprise Customer Data Mining for E-Business|
|Friday, April 12|
|10 am - noon||Problems, Solutions and Research in Data Quality|
Tamraparni Dasu and Theodore Johnson, AT&T Labs Research
|3 pm - 6 pm*||Text Mining for Bioinformatics|
Hinrich Schütze, Novation Biosciences
|* A catered break will be held from 4 - 4:30 pm.|
Abstracts and Biographical Information
|Title:||Data Mining in the Face of Contaminated and Incomplete Records|
|Presented by:||Roald K. Pearson, ETH Zurich|
This tutorial has three main
objectives. The first is to provide a general overview of the sources and
extent of contaminated and missing records in the large datasets for which
highly automated data mining procedures are intended. The second objective
is to clearly demonstrate that the consequences of simply ignoring these
data anomalies are often unacceptable, either because important features
in the dataset are missed altogether, or because these features are
grossly misinterpreted. Finally, the third objective is to provide a broad
overview of some of the techniques that have been proposed by various
authors to address these problems.
Specific topics covered include
the important practical distinction between noise, to which most
data analysis procedures are somewhat resistant by design, and outliers,
which often cause dramatic failures. In addition, distinctions are drawn
between ignorable missing data, which generally increases the
variability of computed results, and non-ignorable missing data,
which can introduce significant biases and fundamentally change the
conclusions of our analysis. Conversely, both outliers and non-ignorable
missing data often correspond to what Zhong et al. have called peculiar
data, representing precisely those observations we are most interested
in finding in a dataset.
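The distinctions drawn above can be illustrated with a minimal sketch. It is not taken from the tutorial itself: the measurement values, the single gross outlier, and the censoring rule (values above 10.0 go missing) are all hypothetical, chosen only to show how one outlier can wreck a non-resistant estimate such as the mean, and how non-ignorable missing data can bias a result.

```python
import statistics

# Hypothetical measurements: a true value near 10.0 plus small noise.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1]

# The same data with one gross outlier (assumed here to be a recording error).
contaminated = clean[:-1] + [1010.0]

def summarize(values):
    """Return (mean, median) of a list of numbers."""
    return statistics.mean(values), statistics.median(values)

clean_mean, clean_median = summarize(clean)
bad_mean, bad_median = summarize(contaminated)

# Non-ignorable missingness (assumed rule): a value is missing *because*
# it is high, so the surviving records are a biased sample.
censored = [v for v in clean if v <= 10.0]
censored_mean = statistics.mean(censored)

print(f"clean:        mean={clean_mean:.3f} median={clean_median:.3f}")
print(f"contaminated: mean={bad_mean:.3f} median={bad_median:.3f}")
print(f"censored:     mean={censored_mean:.3f}")
```

In this sketch the noise barely moves either summary, the single outlier pulls the mean far from 10 while the median stays put, and the censored sample yields a mean that is systematically too low.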
|Biographical Information||Roald K. Pearson received his PhD in electrical engineering from M.I.T. in 1982, after which he joined the DuPont Company where his activities included the exploratory analysis of large sets of manufacturing process operating data. In 1997, Dr. Pearson joined the Institut für Automatik at ETH in Zürich, where he continued to work in the areas of exploratory data analysis and the development of discrete-time dynamic models for computer control. Currently, he is a visiting professor with the Tampere International Center for Signal Processing at the University of Technology in Tampere, Finland.|
|Title:||Enterprise Customer Data Mining for E-Business|
Data mining methods have their origins in a variety of fields: Statistics, Databases, Pattern Recognition, AI, Visualization, High-Performance Computing, and Information Retrieval. Successful deployment of these technologies to e-business enterprise data requires: data warehouse construction, mechanisms to efficiently update the warehouse, integration of data mining technologies, and delivery of results in a form consumable by business end-users.
In an e-business enterprise environment, the data warehouse problem is further magnified by the critical need to integrate web-log data, user profile data, product catalog information, transaction and sales data, advertising campaign data, datasets from legacy systems, etc. Once the data warehouse is in place, the next steps involve integrating analytical and data mining technology efficiently with the warehouse. A key challenge to an e-business enterprise is delivering timely, interesting, actionable results to an end-user whose expertise is marketing, sales, business development, or merchandising rather than data mining and advanced analytics.
Usama Fayyad is a co-founder of digiMine, Inc. and has served as President and CEO since its inception in March 2000. Prior to digiMine, Usama founded and led Microsoft Research's Data Mining & Exploration (DMX) Group from 1995 to 2000. His work there included the development of data mining prediction components for Microsoft Site Server (Commerce Server 3.0 and 4.0). From 1989 to 1995, Usama founded the Machine Learning Systems Group and developed data mining systems for the analysis of large scientific databases at the Jet Propulsion Laboratory (JPL), California Institute of Technology. During that time he received the most distinguished excellence award from Caltech/JPL and a U.S. Government Medal from NASA. He remained affiliated with JPL as Distinguished Visiting Scientist after joining Microsoft. Usama has a Ph.D. in engineering from the University of Michigan, Ann Arbor (1991). He has served as Program Co-Chair of KDD-94 and KDD-95 and as General Chair of KDD-96 and KDD-99. Usama serves as Editor-in-Chief of the journal Data Mining and Knowledge Discovery and SIGKDD Explorations.
Neal Rothleder is Director of Analytic Technology at digiMine, Inc. His focus is on delivering powerful, scalable data mining solutions to business users in an intuitive, actionable framework. His research interests include machine learning approaches to data mining, recently focusing on making academic research work in real-world problems and incorporating domain knowledge into data mining. Prior to joining digiMine, Dr. Rothleder was a Lead Engineer with the MITRE Corporation working on research and development in data mining technologies and applications. While there, he worked on projects in network intrusion detection, aviation safety, and a variety of fraud detection scenarios. Dr. Rothleder has held adjunct faculty appointments at the University of Michigan and George Mason University. He holds a Ph.D. and an M.S. in Computer Science and Engineering from the University of Michigan.
Paul Bradley ([email protected]) is Data Mining
Development Lead at digiMine. His primary focus is on integrating data
mining technology into digiMine's service offering. Prior to joining
digiMine, he was a researcher in the Data Management, Exploration and
Mining Group at Microsoft Research. While at Microsoft Research, he worked
on data mining algorithms and on data mining components in Microsoft SQL
Server and Commerce Server. His research interests include classification
and clustering algorithms; underlying mathematical problem formulations;
and issues related to scalability. He received the Ph.D. degree from the
University of Wisconsin in 1998 on the topic of mathematical programming
and data mining. Paul serves
as Associate Editor of SIGKDD Explorations and was KDD-2001
|Title:||Problems, Solutions and Research in Data Quality|
Tamraparni Dasu and Theodore Johnson
Data quality is inextricably linked with mining datasets. Data quality problems arise during the process of data mining, and the quality of the data in turn determines the importance and value of the results that are unearthed by mining the data. Data quality has many facets, such as management of processes and practices; statistical detection of glitches; and storage, monitoring, maintenance and profiling of data. Recent work has developed tools and algorithms for assuring data quality in datasets. However, data quality has been studied piecemeal by disciplines and communities that seldom communicate, and many specific problems and solutions are cited in an ad hoc fashion. It is nevertheless possible to broadly categorize the data quality problems and propose a general class of solutions. Our aim in this tutorial is to bring together the different threads to:
Tamraparni Dasu received a B.A. (Honors) in Mathematical Statistics from Delhi University in 1982, followed by a Master's in Mathematics from I.I.T. (Indian Institute of Technology), New Delhi in 1984. She finished her Ph.D. in Statistics from the University of Rochester in 1990. Tamraparni joined the Statistical Modeling department at AT&T Bell Laboratories in 1990. She moved to the Machine Learning and Information Retrieval Research department in 1995, and then to the Information Mining research center of AT&T Labs - Research in 2000, where she currently works.
Theodore Johnson received a B.S. in Mathematics from Johns Hopkins University in 1986, and a Ph.D. in Computer Science from the Courant Institute of New York University. From 1990 through 1995 Theodore was an Assistant Professor at the CISE department of the University of Florida, and an Associate Professor in 1996. In 1996, Theodore joined the Database Research department of AT&T Labs Research, where he currently works.
|Title:||Text Mining for Bioinformatics|
Hinrich Schütze, Novation Biosciences
|Abstract:||Our goal is to make this tutorial
a practical guide for how to use text mining in bioinformatics while at
the same time highlighting some of the interesting research issues that
arise when mining techniques are applied in bioinformatics. Participants
will be able to broaden the set of tools they are comfortable with if they
work in bioinformatics (drug discovery, pharmaceutical companies etc). Or
they will learn about one of the most exciting areas of application of
data discovery and analysis techniques if they are data miners currently
working on non-biological problems. Previous
exposure to biology will be helpful, but the tutorial will be accessible
to those who have no biology background. We will assume familiarity with
basic statistical and probabilistic concepts. This tutorial is joint work
with Russ B. Altman, MD, Associate Professor in the Medical Informatics
Group at the Stanford University Medical Center.
Hinrich Schütze, PhD. After receiving a Ph.D. in Natural Language Processing from Stanford University in 1995, Hinrich Schütze joined the Xerox Palo Alto Research Center, where he developed a scalable approach to semantic analysis of natural language based on mining of association data. He then co-founded Outride, a search personalization company, and led the development of personalization software that learns user preferences from surfing behavior. He is the author of the best-selling textbook on data-driven natural language processing (with Chris Manning, MIT Press) and of a dozen issued and pending patents. Dr. Schütze is currently CTO of Novation Biosciences, a bioinformatics company focused on text and data mining of biological data. He is also Consulting Faculty at Stanford.