Drowning in a Sea of Data: The Need for New Mathematical Tools for Petascale Data Analysis

September 24, 2008

Juan Meza

It's not often that the astronomy, biology, nanoscience, and networks communities come together at one meeting. A recent exception---a workshop on the mathematical analysis of petascale data, sponsored by the Department of Energy's Office of Advanced Scientific Computing Research---was held in Washington, June 3–5. The main theme---the need to develop new mathematical techniques to analyze petascale data sets---was driven home by several talks on problems from widely varying areas, including astronomical sky surveys, drug and probe discovery, genomics, climate science, and nanosystems.

Kirk Borne of the Computational and Data Sciences Department at George Mason University provided an example from astronomy. Borne explained how Edwin Hubble used information derived from image data of galaxies, such as the brightness of certain stars within the galaxies and the redshift of the galaxies, to come up with a plot showing that a galaxy's redshift varied linearly with its distance from Earth. This plot, which is known today as the Hubble diagram, led to a better understanding of the universe---in particular, to the idea that the universe was expanding. These themes from astronomy---to characterize the known, assign the new, and discover the unknown---can be directly translated into mathematical themes, such as clustering, classification, and outlier detection.
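As a minimal illustration of two of these themes, the sketch below fits a straight line through the origin to synthetic distance and velocity values (a Hubble-style diagram) and then flags points that lie far from the fit. All numbers, the function names, and the cutoff of three robust standard deviations are assumptions chosen for illustration; none of them come from the workshop or from Hubble's data.

```python
import statistics


def fit_slope(distances, velocities):
    """Least-squares slope for the model v = slope * d (no intercept)."""
    num = sum(d * v for d, v in zip(distances, velocities))
    den = sum(d * d for d in distances)
    return num / den


def flag_outliers(distances, velocities, slope, threshold=3.0):
    """Return indices whose residual is far from the median residual.

    Spread is estimated with the median absolute deviation (MAD), scaled
    by 1.4826 so that it is comparable to a standard deviation.
    """
    residuals = [v - slope * d for d, v in zip(distances, velocities)]
    center = statistics.median(residuals)
    mad = statistics.median([abs(r - center) for r in residuals])
    scale = 1.4826 * mad if mad > 0 else 1.0
    return [i for i, r in enumerate(residuals)
            if abs(r - center) > threshold * scale]


# Synthetic data (hypothetical): velocity = 70 * distance plus small noise,
# with one deliberately corrupted measurement at index 5.
distances = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [5, -4, 3, -6, 2, -3, 4, -5]
velocities = [70 * d + n for d, n in zip(distances, noise)]
velocities[5] += 400

slope = fit_slope(distances, velocities)            # pulled up by the bad point
outliers = flag_outliers(distances, velocities, slope)

# Refit using only the points that were not flagged.
kept = [i for i in range(len(distances)) if i not in outliers]
refit = fit_slope([distances[i] for i in kept],
                  [velocities[i] for i in kept])

print("first fit:", round(slope, 2))
print("flagged indices:", outliers)
print("refit without flagged points:", round(refit, 2))
```

The first fit "characterizes the known," and the residual test is one simple way to "discover the unknown"; at petascale, the open question raised at the workshop is how to do the same when the data no longer fit in memory on one machine.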

Similar situations arise in genomics and biology. Today, high-throughput genome sequencing and tagging technologies are generating data at unprecedented rates and at lower cost. "Genotyping and availability of genome sequences across populations will be commonplace," said Dan Rokhsar from LBNL and UC Berkeley. "The $1000 genome for humans is only a few years away, and will also apply to tumors, agricultural species, and microbial communities."

Discussing the use of large data collections in drug and probe discovery, George Karypis of the University of Minnesota pointed out that the current process can start with 10,000 compounds and take 15 years to yield one new FDA-approved drug. A consequence has been the emergence of new research fields, such as chemical genetics. Chemical geneticists search for probes---small organic molecules that can alter the function of proteins---and use them to study biological systems. Finding good probes requires looking through databases called "chemical compound libraries" and identifying the most promising candidates. From a data representation and computational standpoint, however, a library of compounds is nothing more than a set of (small) graphs. Finding good probes then becomes a problem of determining similarities between graphs or finding maximal common subgraphs.
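To make the graph view concrete, here is a minimal sketch that stores each compound as a small labeled graph (atoms as nodes, bonds as edges, hydrogens omitted) and ranks a toy library by similarity to a query. The similarity used is a Tanimoto (Jaccard) coefficient over the set of bond types, which is a deliberately crude stand-in and not a maximal-common-subgraph computation. The function names and the three-compound library are hypothetical.

```python
def bond_features(atoms, bonds):
    """Set of unordered element pairs, one per bond in the graph.

    atoms: dict mapping node id -> element symbol
    bonds: iterable of (node id, node id) pairs
    """
    return {tuple(sorted((atoms[u], atoms[v]))) for u, v in bonds}


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rank_library(query, library):
    """Return (score, name) pairs, most similar first, ties by name."""
    q = bond_features(*query)
    scored = [(tanimoto(q, bond_features(*mol)), name)
              for name, mol in library.items()]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Toy library (hypothetical): heavy-atom skeletons only.
library = {
    "methanol":   ({0: "C", 1: "O"}, [(0, 1)]),
    "propane":    ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2)]),
    "ethylamine": ({0: "C", 1: "C", 2: "N"}, [(0, 1), (1, 2)]),
}
ethanol = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])

ranking = rank_library(ethanol, library)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

A feature-set comparison like this ignores how the bonds are connected, so it cannot distinguish many distinct structures; that gap is where subgraph-based methods come in.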

In the field of earth systems modeling, scientists are working to understand the factors that govern extreme events in climate, such as dry spells, hurricanes, and heat waves. A good example is the 2003 European heat wave that by some accounts resulted in more than 25,000 deaths. At the workshop, Bill Collins (LBNL and UC Berkeley) and Michael Wehner (LBNL) described how climate scientists are developing a numerical "parallel Earth" to help them better understand the effects of climate change and trends in the environment. Collins, who was part of the 4th Intergovernmental Panel on Climate Change team that shared the 2007 Nobel Peace Prize, said that the IPCC had already generated over 110 terabytes of data encompassing 10,000 years of climate simulations. Commenting on the challenge of analyzing these simulation results, Wehner pointed out that "Data can be generated faster than it can be analyzed."

Climate scientists cannot currently undertake analyses of some types, including data assimilation for the ocean, the carbon cycle, and other long-lived greenhouse gases; fractional attributable risk; feature extraction; and parameter sensitivities and uncertainties. Many of these analyses cannot be done because of the sheer size of the data sets needed; parallel statistical analysis tools and analysis algorithms that can scale to large data sets are clearly needed even today. In addition, scientists need the ability to mine data for specific features across different data repositories. Other challenges facing climate scientists include the need for data reduction techniques with a quantification measure of the loss incurred and compression techniques that preserve the correct distribution tails.
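The last point, reduction with a quantified loss that leaves the tails intact, can be sketched on a small scale. The example below rounds synthetic values to a fixed step, then reports the worst-case error and how far the 99th-percentile value moved. The data, the step size, and the function names are assumptions for illustration only; real climate fields and production compressors are far more involved.

```python
import random


def quantize(values, step):
    """Lossy reduction: snap every value to the nearest multiple of step."""
    return [round(v / step) * step for v in values]


def max_abs_error(original, reduced):
    """Largest pointwise difference introduced by the reduction."""
    return max(abs(a - b) for a, b in zip(original, reduced))


def upper_quantile(values, q=0.99):
    """Simple empirical quantile: the value at position q in sorted order."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


# Synthetic "daily temperature" values (hypothetical), fixed seed.
rng = random.Random(0)
values = [rng.gauss(30.0, 5.0) for _ in range(10_000)]

step = 0.5
reduced = quantize(values, step)

loss = max_abs_error(values, reduced)
tail_shift = abs(upper_quantile(values) - upper_quantile(reduced))

print("distinct values after reduction:", len(set(reduced)))
print("max absolute error:", round(loss, 4))
print("shift in 99th-percentile value:", round(tail_shift, 4))
```

Because rounding to a grid never reorders values, both the pointwise error and the shift in any empirical quantile are bounded by half the step. A reduction scheme that averages or subsamples offers no such guarantee for the extremes, which is the concern raised above.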

Several breakout groups at the workshop discussed other areas that require analysis tools for current data sets. One typical combustion simulation today, for example, usually writes 100–300 files, each on the order of 250 gigabytes. In more familiar terms, about 20,000 DVDs would be needed to store the total output of a single combustion simulation. In the near future, as the resolutions used in simulations increase, this number is expected to reach 400 terabytes per simulation (about 28,000 DVDs). To analyze such data, scientists compare various vector quantities over time; analysis tools for vector fields are already insufficient, however. Methods that are defined for smooth domains work well for small data sets, but the analysis algorithms scale poorly with the size of the data. Analyses of other types are completely out of reach; examples include parallel topological segmentation and temporal tracking of scalar and vector features in a multiscale turbulent reactive flow setting, and uncertainty quantification for turbulent reacting flows.

D.S. Sivia, who works in the Data Analysis and Visualization group at the Rutherford Appleton Laboratory, provided a good overview of the uses of data analysis to study problems in condensed matter science. The main idea involves computing parameters for a model by fitting them to data gathered from experiments conducted at large scattering facilities, such as the Advanced Photon Source at Argonne National Laboratory or the Spallation Neutron Source at Oak Ridge National Laboratory. "While the size of the data files in condensed matter science may be growing rapidly, they are still a long way from being petascale," Sivia said. However, he continued, "The inferential task involves the exploration of the parameter space of the analysis model, which can result in an enormous computational challenge even for fairly small (nonlinear) problems."

The final report from this workshop can be found at http://www.sc.doe.gov/ascr/WorkshopsConferences/WorkshopsConferences.html. Further information about the workshop can be found at http://www.orau.gov/mathforpetascale/.

Juan Meza heads the High Performance Computing Research Department at Lawrence Berkeley National Laboratory.

Juan Meza To Receive 2008 Blackwell–Tapia Prize

Juan Meza, author of the accompanying article on a DOE workshop on petascale data, will receive the 2008 Blackwell–Tapia Prize in November at the fifth Blackwell–Tapia Conference. The prize recognizes significant research contributions in the mathematical sciences, as well as outstanding work on behalf of minority mathematical scientists.

Meza is well qualified on all counts, according to the prize committee, which cited his record as "an accomplished and effective head of a large department doing cutting-edge explorations in the computational sciences, computational mathematics, and future technologies" and as "a role model and active advocate for others from groups under-represented in the mathematical sciences." Meza currently works in nonlinear optimization, with an emphasis on methods for parallel computing. Readers of SIAM News will recognize some of the application areas in which he and his department have worked, as listed in the citation: scalable methods for nanoscience, power grid reliability, molecular conformation problems, optimal design of chemical vapor deposition furnaces, and semiconductor device modeling.

The citation goes on to point out that Meza is much in demand as a speaker, whether presenting mathematical results, discussing issues related to diversity, or providing advice to people at the beginning of their careers. SIAM can attest to the latter, with very recent evidence: As a speaker in the Professional Development Evening held during this year's Annual Meeting in San Diego, he offered advice on searching for a job ("Would you marry someone after a 30-minute interview?"); on the skills required for employment at a national lab (computational skills, e.g., Fortran, C, C++, writing/verbal skills, teamwork); on writing research statements ("rambling statements--no!"); and much more. Also a participant in SIAM's Visiting Lecturer Program, in recent months Meza has given a talk titled "The Role of Mathematics in Amplifying Science Research: How Mathematics Will Help Save the World."
