Improving performance means doing more work with fewer resources. The meaning of "resources" can vary; for a given activity, some specific resource is usually in shortest supply, whether it is CPU cycles, memory, network bandwidth, I/O bandwidth, database requests, disk space, or any number of other resources. When the performance of an activity is limited by availability of a particular resource, we say it is bound by that resource: CPU-bound, database-bound, etc.
While the goal may be to improve performance overall, using multiple threads always introduces some performance costs compared to the single-threaded approach. These include the overhead associated with coordinating between threads (locking, signaling, and memory synchronization), increased context switching, thread creation and teardown, and scheduling overhead. When threading is employed effectively, these costs are more than made up for by greater throughput, responsiveness, or capacity. On the other hand, a poorly designed concurrent application can perform even worse than a comparable sequential one.
 A colleague provided this amusing anecodote: he had been involved in the testing of an expensive and complex application that managed its work via a tunable thread pool. After the system was complete, testing showed that the optimal number of threads for the pool was . . . 1. This should have been obvious from the outset; the target system was a single-CPU system and the application was almost entirely CPU-bound.
In using concurrency to achieve better performance, we are trying to do two things: utilize the processing resources we have more effectively, and enable our program to exploit additional processing resources if they become available. From a performance monitoring perspective, this means we are looking to keep the CPUs as busy as possible. (Of course, this doesn't mean burning cycles with useless computation; we want to keep the CPUs busy with useful work.) If the program is compute-bound, then we may be able to increase its capacity by adding more processors; if it can't even keep the processors we have busy, adding more won't help. Threading offers a means to keep the CPU(s) "hotter" by decomposing the application so there is always work to be done by an available processor.
11.1.1. Performance Versus Scalability
Application performance can be measured in a number of ways, such as service time, latency, throughput, efficiency, scalability, or capacity. Some of these (service time, latency) aremeasures of "how fast" a given unit of work can be processed or acknowledged; others (capacity, throughput) are measures of "how much" work can be performed with a given quantity of computing resources.
Scalability describes the ability to improve throughput or capacity when additional computing resources (such as additional CPUs, memory, storage, or I/O bandwidth) are added.
Designing and tuning concurrent applications for scalability can be very different from traditional performance optimization. When tuning for performance, the goal is usually to do the same work with less effort, such as by reusing previously computed results through caching or replacing an O(n2) algorithm with an O(n log n) one. When tuning for scalability, you are instead trying to find ways to parallelize the problem so you can take advantage of additional processing resources to do more work with more resources.
These two aspects of performancehow fast and how muchare completely separate, and sometimes even at odds with each other. In order to achieve higher scalability or better hardware utilization, we often end up increasing the amount of work done to process each individual task, such as when we divide tasks into multiple "pipelined" subtasks. Ironically, many of the tricks that improve performance in single-threaded programs are bad for scalability (see Section 11.4.4 for an example).
The familiar three-tier application modelin which presentation, business logic, and persistence are separated and may be handled by different systemsillustrates how improvements in scalability often come at the expense of performance. A monolithic application where presentation, business logic, and persistence are intertwined would almost certainly provide better performance for the first unit of work than would a well-factored multitier implementation distributed over multiple systems. How could it not? The monolithic application would not have the network latency inherent in handing off tasks between tiers, nor would it have to pay the costs inherent in separating a computational process into distinct abstracted layers (such as queuing overhead, coordination overhead, and data copying).
However, when the monolithic system reaches its processing capacity, we could have a serious problem: it may be prohibitively difficult to significantly increase capacity. So we often accept the performance costs of longer service time or greater computing resources used per unit of work so that our application can scale to handle greater load by adding more resources.
Of the various aspects of performance, the "how much" aspectsscalability, throughput, and capacityare usually of greater concern for server applications than the "how fast" aspects. (For interactive applications, latency tends to be more important, so that users need not wait for indications of progress and wonder what is going on.) This chapter focuses primarily on scalability rather than raw single-threaded performance.
11.1.2. Evaluating Performance Tradeoffs
Nearly all engineering decisions involve some form of tradeoff. Using thicker steel in a bridge span may increase its capacity and safety, but also its construction cost. While software engineering decisions don't usually involve tradeoffs between money and risk to human life, we often have less information with which to make the right tradeoffs. For example, the "quicksort" algorithm is highly efficient for large data sets, but the less sophisticated "bubble sort" is actually more efficient for small data sets. If you are asked to implement an efficient sort routine, you need to know something about the sizes of data sets it will have to process, along with metrics that tell you whether you are trying to optimize average-case time, worst-case time, or predictability. Unfortunately, that information is often not part of the requirements given to the author of a library sort routine. This is one of the reasons why most optimizations are premature: they are often undertaken before a clear set of requirements is available.
Avoid premature optimization. First make it right, then make it fastif it is not already fast enough.
When making engineering decisions, sometimes you are trading one form of cost for another (service time versus memory consumption); sometimes you are trading cost for safety. Safety doesn't necessarily mean risk to human lives, as it did in the bridge example. Many performance optimizations come at the cost of readability or maintainabilitythe more "clever" or nonobvious code is, the harder it is to understand and maintain. Sometimes optimizations entail compromising good object-oriented design principles, such as breaking encapsulation; sometimes they involve greater risk of error, because faster algorithms are usually more complicated. (If you can't spot the costs or risks, you probably haven't thought it through carefully enough to proceed.)
Most performance decisions involve multiple variables and are highly situational. Before deciding that one approach is "faster" than another, ask yourself some questions:
These considerations apply to any performance-related engineering decision, but this is a book about concurrency. Why are we recommending such a conservative approach to optimization? The quest for performance is probably the single greatest source of concurrency bugs. The belief that synchronization was "too slow" led to many clever-looking but dangerous idioms for reducing synchronization (such as double-checked locking, discussed in Section 16.2.4), and is often cited as an excuse for not following the rules regarding synchronization. Because concurrency bugs are among the most difficult to track down and eliminate, however, anything that risks introducing them must be undertaken very carefully.
Worse, when you trade safety for performance, you may get neither. Especially when it comes to concurrency, the intuition of many developers about where a performance problem lies or which approach will be faster or more scalable is often incorrect. It is therefore imperative that any performance tuning exercise be accompanied by concrete performance requirements (so you know both when to tune and when to stop tuning) and with a measurement program in place using a realistic configuration and load profile. Measure again after tuning to verify that you've achieved the desired improvements. The safety and maintenance risks associated with many optimizations are bad enoughyou don't want to pay these costs if you don't need toand you definitely don't want to pay them if you don't even get the desired benefit.
Measure, don't guess.
There are sophisticated profiling tools on the market for measuring performance and tracking down performance bottlenecks, but you don't have to spend a lot of money to figure out what your program is doing. For example, the free perfbar application can give you a good picture of how busy the CPUs are, and since your goal is usually to keep the CPUs busy, this is a very good way to evaluate whether you need performance tuning or how effective your tuning has been.
Part I: Fundamentals
Part II: Structuring Concurrent Applications
Cancellation and Shutdown
Applying Thread Pools
Part III: Liveness, Performance, and Testing
Avoiding Liveness Hazards
Performance and Scalability
Testing Concurrent Programs
Part IV: Advanced Topics
Building Custom Synchronizers
Atomic Variables and Nonblocking Synchronization
The Java Memory Model