5. Scalability Experiments

The goal of every cluster architecture is to achieve a close to linear performance scale-up when system resources are increased. However, achieving this goal in a real-world implementation is very challenging. In this section we present the results of two sets of experiments. In the first set, we compared a single node server with two different network interface speeds: 100 Mb/s versus 1 Gb/s. In the second set, we scaled a server cluster from 1 to 2 to 4 nodes. We start by describing our measurement methodology. Table 32.2 lists all the terms used in this section.

Table 32.2: List of terms used repeatedly in this section and their respective definitions.

Term          Definition
N             The number of concurrent clients supported by the Yima server
Nmax          The maximum number of sustainable, concurrent clients
uidle         Idle CPU, in percent
usystem       System (or kernel) CPU load, in percent
uuser         User CPU load, in percent
BaveNet       Average network bandwidth per client (Mb/s)
Bnet          Network bandwidth (Mb/s)
Bdisk         The amount of movie data accessed from disk per second, termed disk bandwidth (MB/s)
Bcache        The amount of movie data accessed from the server cache per second, termed cache bandwidth (MB/s)
BaveNet[i]    The BaveNet measured for the i-th server node in a multinode experiment
Bnet[i]       The Bnet measured for the i-th server node in a multinode experiment
Bdisk[i]      The Bdisk measured for the i-th server node in a multinode experiment
Bcache[i]     The Bcache measured for the i-th server node in a multinode experiment
Rr            The number of rate changes per second

5.1 Methodology of Measurement

One of the challenges when stress-testing a high-performance streaming media server is the potentially large number of clients that must be supported. For a realistic test environment, these clients should not merely be simulated, but should be real viewer programs running on various machines across a network. To keep the number of client machines manageable we ran several client programs on a single machine. Since decompressing multiple MPEG-2 compressed DVD-quality streams imposes a high CPU load, we modified our client software so that it does not actually decompress the media streams. Such a client is identical to a real client in every respect, except that it does not render any video or audio. Instead, this dummy client consumes data according to a movie trace data file, which contains the pre-recorded consumption behaviour of a real client playing a particular movie. Thus, by changing the movie trace file, each dummy client can emulate a DVD stream (5 Mbps, VBR), an HDTV stream (20 Mbps, CBR) or an MPEG-4 stream (800 Kbps, VBR). For all the experiments in this section, we chose trace data from the DVD movie "Twister" as the consumption load. The dummy clients used the following smoothing parameters: a 16 MB playout buffer, 9 thresholds and a 90-second prediction window. For each experiment, we started clients in a staggered manner (one after the other every 2 to 3 minutes).
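
Conceptually, a dummy client is a loop that pulls data from the server and discards it at the rate recorded in the trace, without decoding. The following is a minimal sketch of that idea under simplifying assumptions: a hypothetical trace format with one byte count per second of playback, a plain TCP connection, and no rate-change (MTFC) messages. The actual Yima client protocol and trace format are richer than this.

```python
import socket
import time

PLAYOUT_BUFFER = 16 * 2**20   # 16 MB playout buffer, as in the experiments

def run_dummy_client(trace_path, server_host, server_port):
    """Consume a stream according to a pre-recorded movie trace.

    Assumed (hypothetical) trace format: one integer per line, the number
    of bytes a real client consumed during that second of playback.
    """
    with open(trace_path) as f:
        per_second = [int(line) for line in f if line.strip()]

    sock = socket.create_connection((server_host, server_port))
    buffered = 0                            # bytes currently in the playout buffer

    for need in per_second:
        start = time.time()
        # Pull data from the server until this second's consumption is covered,
        # without exceeding the playout buffer capacity.
        while buffered < need and buffered < PLAYOUT_BUFFER:
            chunk = sock.recv(65536)
            if not chunk:
                sock.close()
                return                      # server closed the stream (starvation)
            buffered += len(chunk)
        buffered -= min(buffered, need)     # "play" this second's data without decoding
        time.sleep(max(0.0, 1.0 - (time.time() - start)))
    sock.close()
```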

On the server side, we recorded the following statistics every 2 seconds: the CPU load (uidle, usystem and uuser), the disk bandwidth (Bdisk), the cache bandwidth (Bcache), Rr, the total network bandwidth (Bnet) for all clients, the number of clients served and BaveNet. On the client side, the following statistics were collected every second: the stream consumption rate (which can be used as input movie trace data for dummy clients), the stream receiving rate, the amount of data in the buffer, the time and value of each rate change during that period and the sequence number of each received data packet.
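
As an illustration of this kind of periodic instrumentation, the CPU figures uuser, usystem and uidle can be sampled on Linux from /proc/stat. The sketch below is not the logging code used in Yima, merely a minimal 2-second sampler based on the standard /proc/stat counters.

```python
import time

def read_cpu_times():
    """Return (user, system, idle, total) jiffies from the first line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()       # "cpu  user nice system idle iowait ..."
    values = list(map(int, fields[1:]))
    user, nice, system, idle = values[0], values[1], values[2], values[3]
    return user + nice, system, idle, sum(values)

def log_cpu_load(interval=2.0):
    """Print uuser, usystem and uidle percentages every `interval` seconds."""
    prev = read_cpu_times()
    while True:
        time.sleep(interval)
        cur = read_cpu_times()
        d_user, d_sys, d_idle, d_total = (c - p for c, p in zip(cur, prev))
        if d_total > 0:
            print(f"uuser={100 * d_user / d_total:.1f}%  "
                  f"usystem={100 * d_sys / d_total:.1f}%  "
                  f"uidle={100 * d_idle / d_total:.1f}%")
        prev = cur
```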

Our servers were run without any admission control policies because we wanted to find the maximum sustainable throughput across multiple client sessions. Therefore, client starvation would occur when the number of sessions increased beyond a certain point. We defined that threshold as the maximum number of sustainable, concurrent sessions. Specifically, this threshold marks the point where some server resource, such as the available network bandwidth, the disk bandwidth or the CPU, reaches full utilization and becomes a bottleneck. Beyond this threshold, the server is unable to provide enough data for all clients.

We first describe the Yima server performance with two different network connections, and we then evaluate Yima in a cluster scale-up experiment.

5.2 Network Scale-Up Experiments

Experimental setup: We tested a single node server with two different network connections: 100 Mb/s and 1 Gb/s Ethernet. Figure 32.1 illustrates our experimental setup. In both cases, the server consisted of a single Pentium III 933 MHz PC with 256 MB of memory. The PC is connected to an Ethernet switch (model Cabletron 6000) via a 100 Mb/s network interface for the first experiment and a 1 Gb/s network interface for the second experiment. Movies are stored on a 73 GB Seagate Cheetah disk drive (model ST373405LC). The disk is attached through an Ultra2 low-voltage differential (LVD) SCSI connection that can provide 80 MB/s throughput. Red Hat Linux 7.2 is used as the operating system. The clients run on several Pentium III 933 MHz PCs, which are connected to the same Ethernet switch via 100 Mb/s network interfaces. Each PC can support 10 concurrent MPEG-2 DVD dummy clients (5.3 Mb/s per client). The total number of client PCs involved was determined by the number of clients needed. For both experiments, dummy clients were started every 2 to 3 minutes and used the trace data file pre-recorded from a real DVD client playing the movie "Twister" (see Figure 32.2a).

Experimental results: Figure 32.5 shows the server measurement results for both sets of experiments (100 Mb/s, 1 Gb/s) in two columns. Figures 32.5(c) and (d) present BaveNet with respect to the number of clients, N. Figure 32.5(c) shows that, for the 100 Mb/s network connection, BaveNet remains steady (between 5.3 and 6 Mbps) when N is less than 13; beyond 13 clients, BaveNet decreases steadily and falls below 5.3 Mbps (depicted as a dashed horizontal line), which is the average network bandwidth of our test movie. Note that the horizontal dashed line intersects the BaveNet curve at approximately 12.8 clients. Thus, we consider 12 to be the maximum number of clients, Nmax, supportable with a 100 Mb/s network interface. An analogous result can be observed in Figure 32.5(d), indicating that Nmax = 35 with a 1 Gb/s network connection.
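
In other words, Nmax is taken to be the largest client count for which the measured per-client bandwidth still meets the movie's 5.3 Mb/s average rate. A small sketch of that rule, applied to hypothetical (N, BaveNet) measurement pairs (not the published data):

```python
MOVIE_RATE_MBPS = 5.3   # average bandwidth of the test movie "Twister"

def max_sustainable_clients(samples, required=MOVIE_RATE_MBPS):
    """Given (N, BaveNet) pairs sorted by N, return the largest N for which
    the per-client bandwidth still meets the movie's average rate."""
    n_max = 0
    for n, b_ave_net in samples:
        if b_ave_net >= required:
            n_max = n
        else:
            break                # per-client bandwidth fell below the movie rate
    return n_max

# Hypothetical measurements for illustration only:
samples_100mbps = [(10, 5.9), (11, 5.7), (12, 5.5), (13, 5.2), (14, 4.9)]
print(max_sustainable_clients(samples_100mbps))   # -> 12
```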

Figure 32.5: Yima single node server performance with 100 Mbps versus 1 Gbps network connection.

Figures 32.5(a) and (b) show the CPU utilization as a function of N for the 100 Mb/s and 1 Gb/s network connections. Both figures contain two curves: usystem and usystem + uuser. As expected, the CPU load (both usystem and uuser) increases steadily as N increases. With the 100 Mb/s network connection, the CPU load reaches its maximum of 40% with 12 clients, which is exactly the Nmax suggested by Figure 32.5(c) (vertical dashed line). Similarly, for 1 Gb/s, the CPU load levels off at 80% where Nmax = 35 clients. Note that in both experiments, usystem accounts for more than 2/3 of the maximum CPU load.

Yima implements a simple yet flexible caching mechanism in the file I/O module (Figure 32.1). Movie data are loaded from disk as blocks (e.g., 1 MB). These blocks are organized into a shared block list maintained in memory by the file I/O module. For each client session, there are at least two corresponding blocks on this list: the block currently used for streaming and the prefetched next block. Some blocks may be shared because the same block is used by more than one client session simultaneously. Therefore, a session counter is maintained for each block. When a client session requests a block, the file I/O module checks the shared block list first. If the block is found, the corresponding block counter is incremented and the block is made available; otherwise, the requested block is fetched from disk and added to the shared block list (with its counter set to one). We define the cache bandwidth, Bcache, as the amount of data accessed from the shared block list (server cache) per second.
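
A minimal sketch of such a reference-counted block cache is shown below. The helper read_block_from_disk and the eviction of a block once its session counter reaches zero are assumptions for illustration; the Yima file I/O module is implemented differently, but the lookup and counting logic is the same idea.

```python
BLOCK_SIZE = 1 * 2**20   # e.g., 1 MB blocks, as in the text

def read_block_from_disk(movie_path, index):
    """Hypothetical helper: fetch one block of a movie file from disk."""
    with open(movie_path, "rb") as f:
        f.seek(index * BLOCK_SIZE)
        return f.read(BLOCK_SIZE)

class SharedBlockList:
    """Shared in-memory block list with a per-block session counter."""

    def __init__(self):
        self._blocks = {}        # (movie_path, index) -> [data, session_count]

    def acquire(self, movie_path, index):
        """Return a block for a client session, fetching from disk on a miss."""
        key = (movie_path, index)
        entry = self._blocks.get(key)
        if entry is not None:
            entry[1] += 1                            # cache hit: bump the session counter
        else:
            data = read_block_from_disk(movie_path, index)
            entry = self._blocks[key] = [data, 1]    # miss: load block, counter set to one
        return entry[0]

    def release(self, movie_path, index):
        """Called when a session is done with a block (assumed eviction at zero)."""
        key = (movie_path, index)
        entry = self._blocks[key]
        entry[1] -= 1
        if entry[1] == 0:
            del self._blocks[key]
```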

Figures 32.5(e) and (f) show Bdisk and Bcache as a function of N for the 100 Mb/s and 1 Gb/s network connections. In both experiments, the Bdisk + Bcache curve increases linearly until N reaches its respective Nmax (12 for 100 Mb/s and 35 for 1 Gb/s), and it levels off beyond that point. For the 100 Mb/s network connection, Bdisk + Bcache levels off at around 8.5 MB/s, which corresponds to the 68 Mb/s peak rate of Bnet in Figure 32.5(i) at N = Nmax. Similarly, for the 1 Gb/s network connection, Bdisk + Bcache levels off at 25 MB/s, which corresponds to the 200 Mb/s maximum of Bnet in Figure 32.5(j) at N = Nmax = 35. In both cases, Bcache contributes little to Bdisk + Bcache when N is less than 15. For N > 15, caching becomes increasingly effective. For example, with the 1 Gb/s network connection, Bcache accounts for 20% to 30% of Bdisk + Bcache with N between 35 and 40. This is because for higher N, the probability that the same cached block is accessed by more than one client increases. Intuitively, caching is more effective with large N.

Figures 32.5(i) and (j) show the relationship of Bnet and N for both network connections. Both figures nicely complement Figures 32.5(e) and (f). With the 100 Mbps connection, Bnet increases steadily with respect to N until it levels off at 68 Mbps with Nmax (12 clients). For the 1 Gb/s connection, the result is similar except that Bnet levels off at 200 Mbps with N = 35 (Nmax for 1 Gb/s setup). Notice that the horizontal dashed line in Figure 32.5(i) represents the theoretical bandwidth limit for the 100 Mb/s setup.

Figures 32.5(g) and (h) show Rr with respect to N for the 100 Mb/s and 1 Gb/s network connections. Both figures suggest a similar trend: there exists a threshold T such that, when N is less than T, Rr is quite small (approximately 1 per second); otherwise, Rr increases significantly, to 2 for the 100 Mb/s connection and 5 for the 1 Gb/s connection. With the 100 Mb/s setup, T is reached at approximately 12 clients. For the 1 Gb/s case, the limit is 33 clients. In general, T roughly matches Nmax for both experiments. Note that in both cases, for N > T, at some point Rr begins to decrease. This is due to client starvation: starving clients request the maximum stream transmission rate, and because this maximum cannot be increased further, no additional rate changes are sent.

Note that in both the 100 Mb/s and 1 Gb/s experiments, Nmax is reached when some system resource becomes a bottleneck. For the 100 Mb/s setup, Figures 32.5(a) and 32.5(e) suggest that the CPU and the disk bandwidth are not the bottleneck, because neither of them reaches more than 50% utilization. On the other hand, Figure 32.5(i) indicates that the network bandwidth, Bnet, reaches approximately 70% utilization at N = 12 (Nmax for the 100 Mb/s setup), and hence is most likely the bottleneck of the system. For the 1 Gb/s experiment, Figures 32.5(f) and 32.5(j) show that the disk and network bandwidth are not the bottleneck. Instead, Figure 32.5(b) shows that the CPU is the bottleneck of the system because it is heavily utilized (usystem + uuser is around 80%) at N = 35 (Nmax for the 1 Gb/s setup).
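
The same reasoning can be written down as a simple rule: the bottleneck is whichever resource is closest to its capacity at N = Nmax. The sketch below applies that rule to hypothetical utilization fractions roughly matching the two setups (68 of 100 Mb/s, 8.5 of 80 MB/s, and so on); the values are illustrative, not the measured data.

```python
def find_bottleneck(utilization):
    """Return the resource with the highest utilization fraction at N = Nmax.

    `utilization` maps a resource name to measured/capacity, e.g. 0.68 for a
    network interface carrying 68 Mb/s on a 100 Mb/s link.
    """
    return max(utilization, key=utilization.get)

# Hypothetical figures for illustration:
setup_100mbps = {"cpu": 0.40, "disk": 0.11, "network": 0.68}   # 68 of 100 Mb/s
setup_1gbps   = {"cpu": 0.80, "disk": 0.31, "network": 0.20}   # 200 of 1000 Mb/s
print(find_bottleneck(setup_100mbps))   # -> "network"
print(find_bottleneck(setup_1gbps))     # -> "cpu"
```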

5.3 Server Scale-Up Experiments

Experimental setup: To evaluate the cluster scalability of the Yima server, we conducted three sets of experiments. Figure 32.1 illustrates our experimental setup. The server cluster consists of rack-mountable Pentium III 866 MHz PCs, each with 256 MB of memory. We increased the number of server PCs from 1 to 2 to 4. The server PCs are connected to an Ethernet switch (model Cabletron 6000) via 100 Mb/s network interfaces. The movies are striped over several 18 GB Seagate Cheetah disk drives (model ST118202LC, one per server node). The disks are attached through an Ultra2 low-voltage differential (LVD) SCSI connection that can provide 80 MB/s throughput. Red Hat Linux 7.0 is used as the operating system, and the client setup is the same as in Section 5.2.

Experimental results: The results for a single node server have already been reported in Section 5.2. So, we will not repeat them here, but refer to them where appropriate. Figure 32.6 shows the server measurement results for the 2 node and 4 node experiments in two columns.

Figure 32.6: Yima server 2 node versus 4 node performance.

Figures 32.6(c) and (d) present the measured bandwidth BaveNet as a function of N. Figure 32.6(c) shows two curves representing the two nodes: BaveNet[1] and BaveNet[1] + BaveNet[2]. Similarly, Figure 32.6(d) shows four curves: BaveNet[1], BaveNet[1] + BaveNet[2], BaveNet[1] + BaveNet[2] + BaveNet[3] and BaveNet[1] + BaveNet[2] + BaveNet[3] + BaveNet[4]. Note that each server node contributes roughly the same portion to the total bandwidth, i.e., 50% in the case of the 2 node system and 25% in the 4 node cluster. This illustrates how well the nodes are load balanced within our architecture. Recall that the same software modules run on every server node, and the movie data blocks are evenly distributed by the random data placement technique. Similarly to Figures 32.5(c) and (d), the maximum number of supported clients can be derived as Nmax = 25 for 2 nodes and Nmax = 48 for 4 nodes. Combining the 1 node results (see the 100 Mb/s experiment in Figure 32.5) with the 2 and 4 node results, the maximum number of client streams Nmax is 12, 25 and 48, respectively, which represents an almost ideal linear scale-up.
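
The even per-node contribution follows directly from the random placement of blocks: each block is equally likely to reside on any node, so over many blocks each node serves about 1/(number of nodes) of the traffic. The toy sketch below assumes a simple uniform pseudo-random block-to-node assignment; the pseudo-random placement (and SCADDAR redistribution) actually used by Yima is more involved.

```python
import random
from collections import Counter

def place_blocks(num_blocks, num_nodes, seed=0):
    """Assign each block of a movie to a node uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(num_nodes) for _ in range(num_blocks)]

# A 2-hour DVD movie at ~5.3 Mb/s is roughly 4.8 GB, i.e. ~4800 one-MB blocks.
placement = place_blocks(num_blocks=4800, num_nodes=4)
share = Counter(placement)
for node, count in sorted(share.items()):
    print(f"node {node}: {100 * count / len(placement):.1f}% of blocks")
# Each node ends up with close to 25% of the blocks, so the retrieval load
# (and hence Bnet[i] and Bdisk[i]) stays balanced across the cluster.
```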

Figures 32.6(a) and (b) show the average CPU utilization on 2 and 4 server nodes as a function of N. In both figures, usystem and usystem + uuser are depicted as two curves with similar trends. For 2 nodes the CPU load (usystem + uuser) increases gradually from 3% with N = 1 to approximately 38% with N = 25 (Nmax for this setup), and then levels off. With 4 nodes, the CPU load increases from 2% with N = 1 to 40% with N = 48 (Nmax for this setup).

Figures 32.6(e) and (f) show Bdisk + Bcache. The four curves in Figure 32.6(e) cumulatively show the disk and cache bandwidth for 2 nodes: Bdisk[1], Bdisk[1] + Bcache[1], Bdisk[2] + Bdisk[1] + Bcache[1] and Bdisk[2] + Bcache[2] + Bdisk[1] + Bcache[1]. The curves exhibit the same trend as shown in Figures 32.5(e) and (f) for a single node. Bdisk + Bcache reaches a peak value of 17 MB/s at N = Nmax for the 2 node setup and 32 MB/s for the 4 node experiment. Note that Bdisk + Bcache for 4 nodes is nearly double that of 2 nodes, which in turn is double that of the 1 node setup. In both cases, each server contributes approximately the same portion to the total Bdisk + Bcache, illustrating the balanced load in the Yima cluster. Furthermore, similarly to Figures 32.5(e) and (f), caching effects are more pronounced with large N in both the 2 and 4 node experiments.

Figures 32.6(i) and (j) show the achieved network throughput Bnet. Again, Figures 32.6(i) and (j) nicely complement Figures 32.6(e) and (f). For example, the 136 Mb/s peak rate of Bnet for the 2 node setup corresponds to the 17 MB/s peak rate of Bdisk + Bcache. Each node contributes equally to the total served network throughput.

Finally, Figures 32.6(g) and (h) show the number of rate changes, Rr, that are sent to the server cluster by all clients. Similarly to the 1 node experiment, for the 2 node server Rr is very small (approximately 1 per second) when N is less than 26, and increases significantly above this threshold. For the 4 node experiment, a steady increase is recorded when N is less than 26; after that it remains constant at 2.5 for N between 27 and 45, and finally Rr increases for N beyond 45. Note that for all experiments, with N < Nmax, the rate change messages Rr generate negligible network traffic and server processing load. Therefore, our MTFC smoothing technique (see Sec. 3) is well suited for a scalable cluster architecture.

Overall, the experimental results presented here demonstrate that our current architecture scales linearly to four nodes while at the same time achieving impressive performance on each individual node. Furthermore, the load is nicely balanced and remains so even if additional nodes or disks are added to the system (with SCADDAR). We expect that high-performance Yima systems can be built with 8 and more nodes. With higher-performing CPUs (beyond our 933 MHz Pentium IIIs), each node should eventually be able to reach 300 to 400 Mb/s. With such a configuration almost any currently available network could be saturated (e.g., 8 x 300 Mb/s = 2.4 Gb/s effective bandwidth).



