Effective Use of Histograms | SUSE LINUX Enterprise Server 9 Administrators Handbook

Essentially, a histogram is information about the values in a selected column. Histograms are used in Oracle to describe the skew in the distribution of data in columns that hold non-unique, repeating values. In other words, when a few distinct values in that column form a sizeable portion of the row count, and this count is not uniformly distributed, then histograms can describe this distribution mathematically. Thus, when histograms are created on key columns like those that contain a set of repeating values, such as the state ID in a national cell phone company's customer table, they can quantify the spread of distinct values. Histograms exert a great influence on the Optimizer and are essentially "buckets" that specify a range of values in a column. The Oracle kernel sorts the non-null values in the column and groups them into the specified number of these buckets so that each bucket holds the same number of data points, bounded by the end-point value of the previous bucket. Histograms can be either height balanced or width balanced. In other words, if the distinct sets of values are less than the number of histograms that are required to be collected, a width-balanced histogram is created, otherwise, a height-balanced histogram is collected. Histograms can be created and used on numeric, character, and date columns.

In the simplest case, when the Optimizer considers the value of the predicate in a WHERE clause, and histograms are available for that key column, the Optimizer calculates the "cost" of access for both an indexed read as well as a full table scan (FTS), based on the frequency of occurrence of that key value. Depending on the frequency of occurrence of the value, the Optimizer may decide to use an FTS rather than an indexed read for popular values that occur frequently, because this option may be less costly in terms of total I/O. On the other hand, the Optimizer may decide to use indexed reads for less-popular keys, again based on the total I/O. In a slightly more complex setting, the Optimizer considers histograms while determining the optimal order in which tables should be joined. For example, when presented with the need to join multiple tables using join conditions on columns that contain histograms, the Optimizer can determine the order in which the tables should be joined, executing the join that is expected to produce the least number of rows first, followed by joins that produce successively larger numbers of rows. All this occurs without the programmer manipulating the SQL, and this concept thus lends itself to reduced program "tuning fixes" as data patterns change. This is a key strength of the Optimizer and is an important reason for considering histograms in your database.

There is a caveat, though. Until Oracle 9i, the Optimizer considered histograms during the SQL parse phase only if literal values (of the form WHERE PREDICATE = '123') rather than bind variables (of the form WHERE PREDICATE = :key_value) were used in SQL conditional clauses. From Oracle 9i Release 9.0.1 on, the Optimizer could "peek" at the value of the bind variable prior to the parse, and use histograms when available. The downside, though, in both Oracle 9i and Oracle Database 10g, is that the execution plan is generated based on the current value of the bind variable at the time of parsing and is thus "fixed" during the first parse. Hence, subsequent changes in the value of the bind variable do not change the path. This has resulted in a number of performance issues, so you need to keep this information in mind. This functionality is turned on by default in both Oracle 9i and Oracle Database 10g, but can be manually disabled via the undocumented and hidden parameter _OPTIM_PEEK_USER_BINDS. Of course, changes should be done only under Oracle Support's direction!

In Oracle Database 10g, histograms are not automatically created on all key columns, nor should they be. Histograms are collected using the very same built-in DBMS_STATS PL/SQL package, which collects object statistics for tables, indexes, and columns for use by the Optimizer. They are collected based on the METHOD_OPT parameter of any of the DBMS_STATS.GATHER_object_STATS procedures. Histograms are collected only when the clause FOR ALL [INDEXED | HIDDEN] COLUMNS size_clause is set and the size_clause is set to a value more than 1. The number specifies the number of buckets that the histogram should have; the maximum is 254.

Both Oracle Database 10g and Oracle 9i allow histograms to be selectively created only for those columns that are determined to have a skew. This is done at the time of sampling when METHOD_OPT is set to FOR ALL COLUMNS SIZE AUTO. In this case, Oracle automatically determines which columns require histograms and the number of buckets (size) of each histogram by considering both the data distribution in the column as well as the usage from COL_USAGE$. The SKEWONLY option does not consider column usage and hence should not be used. Another option, new in Oracle Database 10g, is the REPEAT keyword, which collects histograms only for those columns that already have themin effect refreshing the values. You can also manually specify which columns should have histograms and the size of each histogram individually. For further details and examples of how this is done, refer to the Oracle Database 10g PL/SQL Package and Types Reference as well as the Oracle Database 10g Performance Tuning Guide.

Histograms can be viewed using ALL_HISTOGRAMS, ALL_PART_HISTOGRAMS, ALL_TAB_HISTOGRAMS, and their DBA_% equivalents. Columns containing histograms can be determined from the NUM_BUCKETS column in the ALL_TAB_COLUMNS; this value should be greater than 1 for histograms to exist. Note that all columns that have statistics attached to them by virtue of being analyzed will have at least two entries in the ALL_TAB_COLUMNS view, with these entries containing the low and high values for the column. Histogram information for those columns that have more than two rows is available in the ALL_HISTOGRAMS view.

There are some downsides to collecting and using histograms:

They take up space in the data-dictionary portion of the shared pool and use additional resources and storage space during statistics collection.
They add some minimal overhead during SQL parse time because the Optimizer now must compute additional paths.
As with any other object statistics, they can become stale, and their staleness can produce a greater impact than with some of the other object statistics.