CSE 2011: Data: A Centuries-old Revolution in Science, Part I
July 15, 2011
Ed Seidel
Four hundred years ago, Galileo ushered in a true revolution in science by combining painstaking observations---data, which he collected in notebooks---with deep thinking to articulate mathematical descriptions of the observed systems. Building on this data-driven foundation, Newton developed a modern theory of gravitation, as well as calculus, which laid the groundwork for a comprehensive worldview governed by partial differential equations (that many of us have spent our careers trying to solve!).
Clearly, Galileo and Newton taught us well: The four centuries of modern science that followed have been a time of amazing discoveries. And less than one hundred years ago, Albert Einstein fueled this data revolution when he extended Newton's theory with his theory of general relativity, built on a system of PDEs. Unfortunately, these PDEs were so complex that Einstein himself was not equipped to solve them! Nonetheless, observations generating just a notebook's worth of data confirmed his theory's predictions. Half a century later, Stephen Hawking's groundbreaking work on black holes resulted in output that can now be quantified as kilobytes of digital data.
Indeed, the data-driven methodology of Galileo and Newton, together with a culture of small groups thinking deeply about fundamental problems, has been at the center of the time-honored tradition of scientific research for centuries.
But if we fast-forward just 30 years from Hawking's work on black holes, we see that the world has changed tremendously. Advances of about 9 orders of magnitude in computing capability, along with deep advances in algorithms, have made many of the most complex PDEs solvable. Suddenly, we are generating data by the petabyte, in quantities that could no longer be stored in Galileo's notebooks. Dramatic, fundamental, and pervasive changes are upon us as we enter the data-intensive age of science.
The New Age of Data
A profound shift is occurring across all fields of research as technological advances enable us to tackle many truly complex challenges facing science and society. For example, not only can we now solve Einstein's complex PDEs, we can also begin to integrate other parts of physics and astronomy into studies of real-world phenomena, such as gamma-ray bursts, across the universe. At the same time, we are developing the capability to observe phenomena through all channels known to science, resulting in a diversity of data sources brought to bear on a single event. Now, all this knowledge, held in different communities, and all this data must be integrated, so that new knowledge can emerge.
Indeed, two key trends have begun to emerge:
- The new frontier is seriously about data. While computing capability has grown according to Moore's law, data volumes are growing at a much higher rate. Sensors, telescopes, accelerators, experiments, and other means are generating data at astonishing rates. While a cosmology simulation can generate a petabyte of data, a machine like the LHC (Large Hadron Collider) already generates tens of petabytes that must be served to thousands of scientists across the planet; planned survey telescopes like the Large Synoptic Survey Telescope will generate hundreds of times this much data for analysis by astronomers, computers, and school kids around the planet; DNA sequencers in any biologist's lab are capable of generating an LHC's worth of data (at a rate of a terabyte per minute). And it is not just the volume, but also the great diversity of the data, that challenges us.
- Grand Challenge communities will be needed to address complex problems. Challenges---such as understanding gamma-ray bursts, predicting hurricanes, or forecasting climate change---will require not only advancement and integration of numerous and diverse data- and compute-intensive activities, but also critical collaborations among scientists from different communities at scales never before possible.
Fundamentally, data is becoming not only the output of most scientific inquiry, but also the dominant and fundamental medium of communication among researchers across all disciplines.
Implications for Science
Galileo's vision of modern science as a data-driven activity guiding mathematical description remains, but exponential growth of the data volumes, along with their ubiquity and diversity, will require completely new thinking---specifically, new mathematical and statistical methods---to describe not only the systems under study but the data themselves. Like computation, data-intensive science will drive revolutions in mathematics: How are features, let alone new laws of nature, to be found in the vast volumes of data being collected? How can disparate data, from different instruments and multiple communities, be combined to advance knowledge? These questions will drive new discoveries in mathematics and statistics, and new techniques in computer science and machine learning, just as they will be required for progress in the underlying science domains that pose them.
Furthermore, these changes in the culture and methods of science will call for a reconsideration of policies and practices as they relate to scientific research. As knowledge creation occurs rapidly at community boundaries, as data becomes the main output of science, and as scientists increasingly need to share data to collaborate, policy must be carefully developed to enable such sharing. And traditional modes of communication---namely, scientific publications---will need to develop a richer set of tools and software to support and accelerate the flow of information and the reproducibility of results. Openness and sharing of data will be critical to the accelerating advancement of science.
And so, we have arrived at yet another scientific revolution---a revolution in the scope, use, and production of data. The scientific rationale and implications for policy will be explored in the second part of this article.
Ed Seidel, Assistant Director for Mathematical and Physical Sciences at the National Science Foundation, discussed the ideas presented here in a panel session at CSE 2011 in Reno. At NSF, he was previously director of the Office of Cyberinfrastructure. He is on leave from Louisiana State University, where he is the Floating Point Systems Professor in the Departments of Physics and Astronomy and Computer Science.