K1: Machine Learning and Extracting Information from the Web
Tom M. Mitchell, WhizBang! Labs and Carnegie Mellon University
Today's search engines can retrieve and display over a billion web pages, but their usefulness is limited because they do not analyze the content of these pages in any depth.
What if these search engines could extract the factual content from the pages they retrieve? Then, instead of asking for pages that contain the keyword "java," we could ask directly for the facts we are after, such as "What Java programming jobs are available in Palo Alto?" or "Are there any evening courses on Java available in the Palo Alto area during spring 2001?"
This talk will describe research that has resulted in systems that answer these kinds of questions by automatically extracting detailed factual information from millions of web pages. Our approach relies heavily on machine learning algorithms to train the system to find and extract targeted information. For example, in one case we trained our system to find and extract job postings from the web, resulting in the world's largest database of job openings (over 600,000 jobs; see www.flipdog.com). The talk will also cover machine learning algorithms for classifying and extracting information from web pages, including results of recent research on using unlabeled data and other kinds of information to improve learning accuracy.
Tom M. Mitchell is Vice President and Chief Scientist at WhizBang! Labs. He is currently on a two-year leave of absence from Carnegie Mellon University, where he is the Fredkin Professor of Learning and AI in the School of Computer Science and Director of CMU's Center for Automated Learning and Discovery. His primary research interest is in machine learning theory and practice. He is the author of the textbook "Machine Learning" (McGraw Hill, 1997), incoming President of the American Association for Artificial Intelligence, and a member of the National Research Council's Computer Science and Telecommunications Board.