Even though it has been several weeks since the airfare sale disaster, it is still difficult to piece together exactly what triggered the apparent massive server outages. What is known is that the website slowdowns began within three minutes of the fare sale announcement, and that they grew progressively worse as the day wore on.
Fortunately, in spite of the conflicting anecdotal reports you are receiving from all constituencies, many tools and utilities can shed light on these kinds of problems. High-Hat Airways is an enthusiastic user of these technologies, so a trail of events should exist that you can use to reconstruct what happened as well as deliver a recommendation on how to prevent it from occurring again.
When evaluating an apparent serverwide performance crisis, it's not a bad idea to start with the tools provided by the operating system. Because the servers in question are running Linux, you have a broad set of technologies at your disposal, as described in Chapter 2, "Performance Monitoring Options." These include the sar, free, and vmstat utilities, which focus on system activity, physical memory, and virtual memory, respectively.
As it turns out, you are able to obtain historical records from all of these utilities from the time of the server problems. They were unanimous in their output: The servers were under tremendous load, with CPU utilization in the 90% to 100% range. In addition, memory was incredibly scarce, which forced Linux to continually perform extremely expensive swapping.
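Findings like these can be reproduced at the command line. On most Linux systems, `sar -u` reports historical CPU utilization (when the sysstat collector is running), `free -m` shows physical memory, and `vmstat 5` exposes swap-in/swap-out activity in its `si`/`so` columns. The sketch below reads the same kernel counters those utilities use; it assumes a Linux `/proc/meminfo`, and the thresholds you would alarm on are yours to choose:

```shell
# Quick memory-pressure check from /proc/meminfo, the same source that
# free(1) and vmstat(8) read. All figures are in kilobytes.
mem_total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_free=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
echo "Memory in use: $(( (mem_total - mem_free) * 100 / mem_total ))%"

# Sustained nonzero si/so columns in `vmstat 5` output indicate the kind
# of expensive swapping seen during the outage.
swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
if [ "$swap_total" -gt 0 ]; then
    echo "Swap in use: $(( (swap_total - swap_free) * 100 / swap_total ))%"
fi
```

Run periodically (for example, from cron) and logged, a check like this provides the historical trail that proved so valuable here.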
Next, you turn your attention to the role MySQL's configuration might have played in the debacle. You know that the client applications were running elsewhere; the database servers were dedicated to MySQL alone. This eliminates the possibility of server-side conflict among applications, so any performance problems on the server will likely be tied directly back to MySQL.
MySQL itself offers a collection of helpful utilities, tools, and logs. These include the general query log, slow query log, and MySQL Administrator; all were described in Chapter 2.
Alas, these logs don't always help very much in a case like this: It's hard to identify a problem query when all queries are running slowly. This again indicates a systemic, serverwide problem. However, examining the MySQL server configuration delivers some interesting results.
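For servers that do not yet have these logs enabled, a configuration fragment along the following lines turns them on. This is a hedged sketch using 5.0-era option names; the file paths are placeholders, and the general query log in particular imposes enough overhead that it is usually reserved for short diagnostic windows:

```ini
# Hypothetical my.cnf fragment enabling the diagnostic logs described above.
[mysqld]
# Record statements that exceed long_query_time (in seconds)
log-slow-queries = /var/log/mysql/slow.log
long_query_time  = 2

# General query log: records every statement; costly on a busy server
log = /var/log/mysql/general.log
```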
It appears that these servers have been configured with generous memory allocation to the key buffer. The query cache and buffer pool are each substantially smaller. What makes this especially bad is that with the exception of a few ancillary tables that use the MyISAM storage engine, the vast majority of tables on these servers use InnoDB. You also observe that the administrators have elected to lock the mysqld process in memory; the servers launch with the --memlock option enabled.
When you last reviewed the airline's MySQL performance issues (see Chapter 18), you suggested that High-Hat look at replication as a way of distributing information so that casual and report-driven users could work with near real-time data without interfering with operations. In the interim, your recommendation was implemented: Numerous slave servers are now deployed in production.
With your research and analysis complete, it's now up to you to deliver some specific advice to help prevent this kind of server failure from happening again.
Based on the server overloading that you saw from examining the output from the Linux utilities, you recommend that High-Hat make the investment in additional server computers; if these computers can sport multiprocessors, that's even better. In addition, you advocate that each server (both existing and new) receive a generous memory upgrade to help reduce or eliminate the swapping problem. Finally, you advise more active monitoring of these utilities; it might be possible to intercept the next problem before it spirals out of control into a full-blown crisis.
In Chapter 11, "MyISAM Performance Enhancement," you saw that the key cache is a section of memory used to buffer index information from tables that use the MyISAM storage engine. On the other hand, Chapter 10, "General Server Performance and Parameters Tuning," described how the query cache stores frequently used queries and results regardless of the storage engine being used. Finally, Chapter 12, "InnoDB Performance Enhancement," discussed InnoDB's reliance on the buffer pool for its own caching.
Because, in this case, your servers are heavily slanted toward InnoDB, it would be smart to deallocate memory from MyISAM-focused structures and instead use that memory for InnoDB as well as engine-neutral caching features.
Finally, you determine that locking mysqld in memory might not be the wisest choice if the servers don't have enough RAM to begin with. Instead, you opt to try to let the operating system manage its own memory.
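Taken together, these recommendations might translate into a configuration fragment like the one below. This is an illustrative sketch only; the actual values must be sized against the RAM installed in each server and the working set of the InnoDB tables:

```ini
# Hypothetical my.cnf sketch for a dedicated, InnoDB-heavy server.
# All sizes are placeholders; tune them to the server's actual memory.
[mysqld]
# Shrink the MyISAM key cache -- only a few ancillary tables use MyISAM
key_buffer_size = 64M

# Give InnoDB the bulk of the available memory
innodb_buffer_pool_size = 2G

# Engine-neutral cache for frequently repeated queries and results
query_cache_type = 1
query_cache_size = 128M

# Note: --memlock is deliberately omitted, letting the operating system
# manage its own memory on RAM-constrained servers
```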
The replication architecture that is currently deployed goes a long way toward reducing the query processing strain on master servers. However, it does nothing to spread the write load, nor does it address what happens during a major master server outage. Although database users might be able to connect to a replicated slave server, it will not be very easy for them to make updates.
For these reasons, you recommend deploying MySQL Cluster for any applications that need high availability. First described in Chapter 17, "Optimal Clustering," MySQL Cluster can coexist with replication, delivering an even better, faster-performing solution. To help isolate the impact of any potential node failures, you counsel a 1:1 ratio between nodes (SQL, data, or management) and hosts (that is, the computer that runs the node). To better spread the burden, you also propose a 2:1 ratio of SQL nodes to data nodes with a round-robin or other load-balancing technology serving to connect clients with a SQL node. You also recognize that this ratio might need to change as MySQL AB continually enhances MySQL Cluster's capabilities; more processing might eventually be done on the data nodes themselves.
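The node ratios above can be pictured in a cluster configuration file. The sketch below is hypothetical, with placeholder hostnames: every node lives on its own host, and there are two SQL nodes for every data node, with data nodes paired for redundancy:

```ini
# Hypothetical MySQL Cluster config.ini sketch: one node per host,
# two SQL nodes per data node. Hostnames are placeholders.
[ndbd default]
NoOfReplicas = 2        # each data fragment stored on two data nodes

[ndb_mgmd]
hostname = mgmt-host-1  # management node on its own host

[ndbd]
hostname = data-host-1
[ndbd]
hostname = data-host-2

[mysqld]
hostname = sql-host-1
[mysqld]
hostname = sql-host-2
[mysqld]
hostname = sql-host-3
[mysqld]
hostname = sql-host-4
```

A load balancer (round-robin DNS or a dedicated device) would then distribute client connections across the four SQL nodes.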