
architectural features, memory bandwidth is actually quite difficult to measure accurately. The widely used STREAMS benchmark [2] provides fairly reliable numbers for how much bandwidth is actually visible to an application that has not taken any special steps to enhance memory or cache performance. For our example system, a 300 MHz Pentium II with SDRAM memory, the STREAMS benchmark reports memory bandwidths of approximately 128-182 MB/s. This is still more than an order of magnitude higher than the network speed in a Beowulf system.
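To make the idea of "bandwidth visible to an application" concrete, the following is a minimal sketch of a copy-style measurement. It is not the STREAMS benchmark itself; the function name, buffer size, and repeat count are assumptions chosen for illustration, and an interpreted language adds overhead that a compiled benchmark would not have.

```python
import time

def copy_bandwidth_mb_per_s(n_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate sustained memory copy bandwidth in MB/s.

    Hypothetical helper: copies one large buffer into another and
    reports the best (fastest) of `repeats` timed runs.
    """
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # one bulk copy: reads n_bytes and writes n_bytes
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading from a coarse timer
    # Count traffic in both directions (bytes read plus bytes written).
    return (2 * n_bytes) / best / 1e6

if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_mb_per_s():.0f} MB/s")
```

The buffers are made much larger than a typical cache so that the copy is dominated by main-memory traffic rather than cache hits. The number printed depends entirely on the machine it runs on.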
Algorithms can place greater or lesser demands on network bandwidth. Some algorithms require almost no interprocessor communication. Brute-force cryptographic attacks [3], in which a very small amount of data initiates each long-running and completely independent calculation, are an extreme example of bandwidth-frugal algorithms. At the other extreme are algorithms in which every processor does little besides exchanging a large fraction of its total memory with other processors. The sorting problems discussed in Chapter 9.1 are bandwidth-greedy algorithms in which there is relatively little computation for each datum transmitted.
Problems from the physical sciences can often be solved in parallel by applying "domain decomposition," that is, by splitting the problem into domains and assigning each domain to a processor. In this case, the amount of data communicated between processors is typically proportional to the surface area of the domains, while the amount of computation to be performed by each processor is proportional to the volume of the domains. The ratio of surface area to volume decreases as the domains become larger, i.e., as the grain size increases. Thus, one often finds that increasing the grain size in a physical problem leads to a more bandwidth-frugal, and hence better-performing, algorithm.
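The surface-to-volume argument can be checked with a few lines of arithmetic. The sketch below assumes, purely for illustration, that each domain is a cube of `side` cells per edge, so that communication scales as 6 x side^2 and computation as side^3; real decompositions may use other shapes, and the function name is hypothetical.

```python
def surface_to_volume(side):
    """Communication-to-computation ratio for a cubic domain.

    Assumes a cube with `side` cells per edge (an illustrative choice).
    """
    surface = 6 * side ** 2  # data exchanged ~ surface area
    volume = side ** 3       # work performed ~ volume
    return surface / volume  # simplifies to 6 / side

if __name__ == "__main__":
    for side in (10, 20, 40, 80):
        ratio = surface_to_volume(side)
        print(f"side={side:3d}  communication/computation ~ {ratio:.3f}")
```

Under this assumption, each doubling of the domain edge halves the ratio, which is the sense in which a coarser grain is more bandwidth-frugal.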
7.2.5 Latency Tolerant and Intolerant
Latency is a property of the communication network in a parallel machine. It measures how long it takes for a message to begin to be delivered from one process to another. While usually reported in microseconds, latency is more useful when measured in units dictated by the rest of the system. A latency of 200 μsec means little in isolation, but becomes meaningful when we compare it to a processor clock that cycles at 300 MHz. In this case, the network latency corresponds to 60,000 ticks of the processor clock. Another useful comparison is with the network bandwidth. If we simply multiply network latency by bandwidth, we obtain a number of bytes: n = latency × bandwidth.
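Both comparisons are simple unit conversions, sketched below. The latency and clock rate are the figures quoted in the text; the network bandwidth of 12.5 MB/s (100 Mb/s Fast Ethernet) is an assumption made only for this example, and the function names are hypothetical.

```python
LATENCY_S = 200e-6              # 200 microseconds, from the text
CLOCK_HZ = 300e6                # 300 MHz processor clock, from the text
NET_BANDWIDTH_BYTES_S = 12.5e6  # ASSUMPTION: 100 Mb/s Fast Ethernet

def latency_in_ticks(latency_s, clock_hz):
    """Express network latency as a count of processor clock ticks."""
    return round(latency_s * clock_hz)

def latency_bandwidth_product(latency_s, bytes_per_s):
    """n = latency x bandwidth, in bytes."""
    return latency_s * bytes_per_s

if __name__ == "__main__":
    ticks = latency_in_ticks(LATENCY_S, CLOCK_HZ)
    n = latency_bandwidth_product(LATENCY_S, NET_BANDWIDTH_BYTES_S)
    print(f"latency = {ticks} clock ticks")
    print(f"latency x bandwidth = {n:.0f} bytes")
```

With the assumed bandwidth, n comes out at about 2500 bytes; a different network would give a different figure.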
[2] http://www.cs.virginia.edu/stream/
[3] http://www.certicom.com/sixth.htm

 



How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters (Scientific and Engineering Computation)
ISBN: 026269218X
Year: 1999
Pages: 134
