Stream Processing: Efficiency through Locality for Scientific Computing
Semiconductor and processor scaling is leading us toward processor chips with tens to hundreds of "cores" and distributed on-chip memories. Parallelism can take advantage of the plentiful and inexpensive arithmetic units made possible by modern VLSI technology. Without locality, however, bandwidth quickly becomes a bottleneck. Communication bandwidth, not arithmetic, is the critical resource in a modern computing system: it dominates cost, performance, and power. Stream programming simplifies the exploitation of both parallelism and locality. A stream program naturally exposes parallelism across stream elements and kernels. It also exposes locality, both within and between kernels. At a lower level, simplifying the communication involved in supplying instructions and data to individual cores gives orders-of-magnitude improvements in efficiency. This talk will discuss the exploitation of parallelism and locality with examples drawn from the Imagine, Merrimac, and EEC projects and from three generations of stream programming systems.
William J. Dally, Stanford University