SIAM Short Course on PERC Tools for Performance Data Gathering, Analysis and Modeling
Bronis R. de Supinski, Lawrence Livermore National Laboratory
February 28, 2004
SIAM Associated Conference: Parallel Processing 2004
The Performance Evaluation Research Center (PERC), an Integrated Software Infrastructure Center (ISIC) of the Department of Energy's Office of Science's Scientific Discovery through Advanced Computing (SciDAC) program, is developing a science of application performance. This tutorial will present that emerging science and the tools that PERC is developing in relation to that science. It will enable application scientists and simulation developers, typical attendees of the SIAM Conference on Parallel Processing for Scientific Computing, to understand the performance of their codes and how to improve it.
Bronis R. de Supinski is the Advanced Software Technologies (AST) Group Leader in the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL). His research interests include message passing implementations and tools, memory performance improvement, computer architecture, cache coherence and distributed shared memory, consistency semantics, and performance evaluation modeling and tools. Bronis earned his Ph.D. in Computer Science from the University of Virginia in 1998, and he joined CASC in July 1998. His dissertation investigated shared memory coherence based on isotach logical time systems. Currently, his projects include investigations into hardware/firmware mechanisms to improve memory performance in SMP-based systems, data-dependent memory tracing tools and a variety of optimization techniques and tools for MPI. He is a member of the ACM and the IEEE Computer Society.
Paul Hovland is a Computer Scientist in the Mathematics and Computer Science Division at Argonne National Laboratory. His research focuses on software engineering for high performance scientific applications; areas of interest include automatic differentiation, component-based software, and performance engineering.
Philip J. Mucci received his Bachelor's Degree in Computer Science from The Johns Hopkins University in 1993 and his Master's Degree from the University of Tennessee in 1998 under Dr. Jack Dongarra. During this time, he worked for numerous HPC-related institutions, including Thinking Machines, Lockheed Martin, the IBM Power Parallel Division and Los Alamos National Laboratory. For his thesis, he developed a fast and portable communication infrastructure for PVM, the well-known parallel computing framework and precursor to MPI. Under funding from the DoD High Performance Computing Modernization Program, he worked on application optimization, benchmarking and performance tools. He is the author of several papers and has delivered popular optimization tutorials at numerous DoD and DoE sites throughout the country. He is the inventor and technical lead of PAPI, the Performance Application Programming Interface, now in widespread use in compute centers around the world. He is currently a Research Consultant for the Innovative Computing Laboratory at the University of Tennessee, Knoxville, funded in part by the National Science Foundation, the Department of Energy's Scientific Discovery through Advanced Computing program, and the Los Alamos and Lawrence Livermore National Laboratories. Now living in San Francisco, he works from an office at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory in Berkeley, California.
Allan Snavely is an expert in high performance computing. He has contributed to the development of a number of strategies for working around the von Neumann bottleneck to deliver fast time-to-solution for scientific applications. These include fundamental studies in modeling to understand the factors that affect performance, as well as architectural innovations such as multithreaded computing and computing with field-programmable gate arrays (FPGAs). Snavely's current research involves the design and optimization of complex systems (including supercomputers and computing Grids), drawing on principles of economics and statistics and leveraging reconfigurability. Snavely leads the Performance Modeling and Characterization Laboratory (PMaC) at the San Diego Supercomputer Center (SDSC), which is charged with understanding and addressing factors that affect performance on large supercomputers and Grids. PMaC is an externally funded lab with an annual operating budget in excess of $800K, all from cooperative agreements for which Snavely is Principal Investigator (PI) or Co-PI.
Ying Zhang obtained her first Master's Degree, in Remote Sensing, from East China Normal University, Shanghai, China. During her studies, she did extensive research on image processing and pattern recognition, and developed mathematical pattern recognition models based on satellite images. She also worked, as project leader at the Shanghai City Planning Institute, on several award-winning projects on city pollution control and city expansion. She began her graduate studies in Computer Science at Northeastern University in 1993. During this time, she worked for the Image Research Lab at the New England Medical Center, where she led the development of software tools to graphically simulate the human brain and to detect brain tumors. Shortly after receiving her second Master's Degree, in Computer Science, she joined the Pablo Research Group led by Dr. Dan Reed as a research programmer. Over the years, she has designed and developed the Pablo Unix I/O benchmarks, MPI I/O trace libraries, physical I/O monitoring facilities, and other performance analysis tools in the HPC area. She is the technical leader and chief developer of SvPablo, a graphical source code browser for performance tuning and visualization. She has also collaborated with scientists in various fields to help them analyze the performance of their applications. She has authored and co-authored several papers on performance analysis and remote sensing, and has presented the Pablo tools at various meetings and conferences. She is currently funded in part by the National Science Foundation, by the Department of Energy's Scientific Discovery through Advanced Computing program, and by the Los Alamos Computer Science Institute.
The Performance Evaluation Research Center (PERC) is developing a science of application performance. The performance models at the heart of this science are based on detailed performance data, gathered through a set of tools developed by PERC. This tutorial will feature hands-on sessions that train users to gather this data, as well as detailed discussions of the modeling tools that use it. The tutorial will use POP (the Parallel Ocean Program), a scalable and compute-intensive HPC application, in a series of example exercises. Additional applications such as EVH1, UMT2K, MILC and NLOM may be used to illustrate points not demonstrated by the analysis of POP.
The morning session introduces a practical approach to collecting and analyzing performance data from HPC applications. Data collection is organized into levels of increasing collection overhead and, correspondingly, increasing usefulness for detailed performance optimization and modeling:
* Level 1 data are easy to collect; they include global timing, hardware counter and communication statistics collected across an entire program run;
* Level 2 data are more time-consuming to collect; they refine Level 1 data to be broken down by module, routine and their descendants;
* Level 3 data are very time-consuming to collect; they include detailed tracing of memory access patterns and branch prediction behavior.
The tools covered in the morning hands-on exercises will include HPMCOUNT from IBM and PSRun from NCSA for Level 1, and DynaProf and SvPablo for Level 2. Some of these tools make use of PAPI for detailed hardware performance measurements.
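The relationship between Level 1 and Level 2 data can be illustrated with a small sketch. The routine names and counter values below are hypothetical; in practice they would come from tools such as DynaProf or SvPablo:

```python
# Hypothetical Level 2 data: hardware counter values broken down by routine.
level2 = {
    "solver":   {"cycles": 8_000_000, "fp_ops": 3_000_000},
    "exchange": {"cycles": 1_500_000, "fp_ops": 100_000},
    "io":       {"cycles":   500_000, "fp_ops": 0},
}

# Level 1 data are the same statistics aggregated over the whole run.
level1 = {
    counter: sum(routine[counter] for routine in level2.values())
    for counter in ("cycles", "fp_ops")
}
print(level1)  # whole-program totals only

# Level 2 refines Level 1: it reveals which routine dominates each counter,
# pointing the optimization effort at the right code.
hottest = max(level2, key=lambda name: level2[name]["cycles"])
print(hottest)
```

The extra cost of Level 2 collection buys the per-routine attribution that whole-program totals cannot provide.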
The afternoon session of this tutorial conveys the expertise that PERC participants have developed, and are continuing to improve, in methodologies and tools that enable understanding and optimization of application performance. We will present a suite of benchmarks for characterizing the performance attributes of a machine, including the Memory Access Pattern Profiles (MAPS) benchmark. Next, we will discuss two approaches for characterizing application behavior: source code analysis and execution tracing. We will discuss tools and techniques for creating an application signature, which models the application's memory and floating-point requirements, and we will introduce performance bounds modeling, which provides insight into the application's best achievable performance on a given architecture. We will then focus on trace-based application signature generation via the tools MetaSim Tracer and DIMEMAS. These tools can determine the computational resource demands of applications and represent these demands via compacted traces. Finally, we will present methods for combining machine profiles and application signatures via automated tools (built into MetaSim and DIMEMAS); the resulting fast, accurate performance models yield important insight into the factors that affect performance.
The attendees will learn how to design the appropriate experiments, to use existing performance analysis tools and to interpret the resulting data. The points of emphasis are:
* Setting goals that drive the collection of performance data;
* Designing performance experiments based on these goals;
* Using performance analysis tools to collect relevant data;
* Validating and interpreting the data in terms of the original goals;
* Iterating the above steps.
In addition, this tutorial presents PERC's methodologies and tools for understanding and optimizing application performance. The points of emphasis are:
* Characterizing compute platforms in terms of their fundamental computational capabilities;
* Characterizing applications in terms of their fundamental computational demands;
* Convolving the independent platform and application characterizations in order to estimate, predict, and project performance with reasonable accuracy and in reasonable time.
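The convolution idea can be sketched numerically. The rates and operation counts below are invented for illustration, not measurements of any real machine or application, and the simple additive model ignores effects such as computation/communication overlap:

```python
# Machine profile: sustainable rates for fundamental operations
# (operations per second; hypothetical numbers).
machine_profile = {
    "flop": 1.0e9,      # floating-point operations per second
    "mem_load": 2.0e8,  # main-memory loads per second
}

# Application signature: demand for each operation type
# (hypothetical counts, e.g. recovered from a compacted trace).
app_signature = {
    "flop": 5.0e9,
    "mem_load": 4.0e8,
}

def predict_time(profile, signature):
    """Convolve an application signature with a machine profile:
    total predicted time is the sum of per-operation-type times.
    Real models refine this with overlap and latency-hiding terms."""
    return sum(signature[op] / profile[op] for op in signature)

t = predict_time(machine_profile, app_signature)
print(f"predicted time: {t:.1f} s")
```

Because the machine profile and the application signature are measured independently, the same signature can be convolved with several machine profiles to project an application's performance across candidate platforms.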
20% Introductory 60% Intermediate 20% Advanced
Prospective attendees include mathematicians, computer scientists, application scientists and application programmers involved in high-performance computing, particularly large-scale simulation, from industry research and development positions, university faculties, and government laboratories and other government agencies. Prospective attendees also include graduate and undergraduate students in those areas.
The attendees should be familiar with at least one scientific application, parallel programming environment and HPC platform. In addition, they should have a rudimentary understanding of processor architectures, memory hierarchies, message passing and shared memory programming.
I. Tools for Performance Data Gathering and Analysis (Morning Session: 3.5 hours + .5 hour break)
A. Introduction to Hardware Performance Data and Analysis (30 min)
B. Level 1 Performance Data Gathering (1 hour)
C. Break (30 min)
D. Level 2 Performance Data Gathering and Analysis (2 hours)
II. Tools and Methods for Performance Modeling and Prediction (Afternoon Session: 3 hours + .5 hour break)
A. Introduction to Performance Modeling and Optimization (10 min)
B. Machine Profiling (20 min)
C. Application Profiling and Performance Convolutions, Part I (1 hour)
D. Break (30 min)
E. Application Profiling and Performance Convolutions, Part II (1 hour)
F. Summary and Related Work (30 min)