Mining Multilingual Documents

English speakers can easily be fooled into thinking that searching English-language documents provides all the information they need. Yet it is vital that English speakers understand the perspectives of other cultures on world events and global issues.

The sheer quantity of non-English documents makes it impossible to expect quality human translation of all of them, so we must rely on machine-translation systems. While such systems continue to improve, the translations they generate remain difficult to read and understand: critical words are often omitted, and the same word may be translated inconsistently within a single document.

One approach to handling the volume of documents is to use summarization systems to automatically generate single-document or multi-document (cluster) summaries of machine-translated documents. Based on the generated summaries, small sets of documents can then be identified for translation using the limited human resources available.

In this talk we discuss the state of the art in querying and generating summaries of multilingual document sets, with emphasis on a system we developed that has proved successful in competitive evaluations.

This is joint work with John M. Conroy and Judith D. Schlesinger.


Dianne P. O'Leary, University of Maryland


