K2 Data Mining for Fun and Profit
David Hand, Imperial College, London
Data mining can be defined as the discovery of interesting, unexpected, or valuable structures in large data sets. It is fundamentally secondary, typically using data sets which were collected for some other purpose. At its base lies a fundamental tension between the observation that one can certainly find some kind of structure in any large data set if one looks hard enough, and the belief that there exists valuable information in such large data sets, just waiting to be extracted. The incentive for the development of data mining tools is primarily commercial - this, after all, is where the money resides. However, many scientific questions are of the same kind: there is hugely important information buried in star catalogues, medical databases, and genome sequence data, for example.
Of course, the exploratory analysis of data sets seeking information is not new. What is new is the size of the data sets confronting us. This has implications: standard tools may not be useful, there may be no easy way to scan or sample the data, one must rely on automatic procedures, single pass algorithms may be necessary, and simple methods may be feasible where sophisticated ones are not. Worse still, the databases are often dynamic, requiring real time processing.
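To make the idea of a single pass algorithm concrete, here is a minimal sketch in Python. It uses Welford's running update to compute a mean and sample variance while reading each value exactly once and storing none of them. The names RunningStats and summarize are illustrative assumptions, not something taken from the talk, and the technique is one standard example rather than a method the speaker is known to have presented.

```python
class RunningStats:
    """Running mean and variance, updated one observation at a time."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: each value is used once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 when fewer than two values were seen.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(stream):
    """Consume any iterable in one pass and return its running statistics."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats


if __name__ == "__main__":
    # A generator stands in for a data source too large to hold in memory.
    result = summarize(float(i % 10) for i in range(1_000_000))
    print(result.n, result.mean, result.variance)
```

Because the memory used does not grow with the number of records, the same loop can keep running as new records arrive, which is the situation a dynamic database presents.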
Data mining is fundamentally interdisciplinary, drawing on ideas and methods from statistics, machine learning, database management, pattern recognition, and other areas. Because of this, it places more emphasis on algorithms than does statistics, and more emphasis on models than does machine learning.
We can distinguish between two fundamental objectives in data mining, according to whether one is seeking models or patterns. A model is a large scale summary of data - an idea familiar in statistics. A pattern is a small 'local' feature of a data set. Pattern detection is perhaps the more challenging problem because of the myriad ways patterns can arise without representing any useful aspect of the mechanism generating the data, and because of the myriad ways in which one might define a potential pattern.
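The point that patterns can arise without reflecting the mechanism generating the data can be illustrated with a small simulation. The sketch below is an assumed example, not one from the talk: it generates a target and many candidate features that are all independent noise, then reports the strongest absolute correlation found. The function names, the sample sizes, and the seed are arbitrary choices made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def strongest_noise_correlation(n_rows=20, n_features=200, seed=0):
    """Largest |correlation| between a noise target and many noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        # Every feature is generated independently of the target.
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    print(strongest_noise_correlation())
```

With few rows and many candidate features, the best correlation found is usually sizeable even though no real relationship exists, so the search itself manufactures an apparent pattern.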
A key issue is data quality: poor quality data will also lead to apparent patterns, but these may merely be a consequence of the data collection process, of no intrinsic interest. Coping with such problems is a challenging issue. One of the secondary beneficial outcomes of the rise in interest in data mining may well be a better appreciation of the value of putting more effort into ensuring good quality data. The talk is illustrated with many examples from a wide variety of areas.
David Hand is Professor of Statistics and Head of the Statistics Section at Imperial College London. He has published eighteen books on statistics and related areas, including "Discrimination and Classification," "The Statistical Consultant in Action," "Practical Longitudinal Data Analysis," "Construction and Assessment of Classification Rules," "Intelligent Data Analysis," "Statistics in Finance," and, most recently, "Principles of Data Mining." He is a Fellow of the Royal Statistical Society and an Honorary Fellow of the Institute of Actuaries. His research interests include classification methods, the fundamentals of statistics, and data mining, and his application interests include medicine and finance. He has published over 150 research papers in these areas. He has acted as consultant to a wide range of organizations, including governments, banks, pharmaceutical companies, manufacturing industry, and health service providers.