view counter

Histograms for strongly skewed columns

Thanks to Nikolay Savvinov for this story

On a recent OTN thread, I learned a nice trick by J. Lewis that allows to circumvent certain problems with histograms.

view counter
Histograms were designed to solve the problem of estimating cardinality for skewed columns (i.e. where some values occur much more frequently than the others). For columns with low number of distinct values (NDV) Oracle collects a frequency histogram, which can be thought of as a set of two one-dimensional arrays:  one containing all possible values, the other containing their frequency (i.e. how many rows have this value). However, if sample size is small, then Oracle can miss rare values, and they won’t be reflected in the histogram. As a result, the cardinality estimates for those values will be wrong (depending on version Oracle will either set it to either 1 or to half of the frequency for the rarest value found).  A detailed explanation of the issues with examples can be found in blog posts by J. Lewis and R. Geist.

The trick is to create a function-based index that hides popular values, thus making sure that rare values are accurately represented in a histogram. Let’s consider an example: a table T with just one column X, which has two possible values, 1 (1M occurences) and 2 (10 occurences).

create table t as
with gen as (select 1 from dual connect by level<=1e3)
select 1 x from gen g1, gen g2;

insert into t select 2 from dual connect by level<=10;

commit;

create index i$t$x on t(x);

exec dbms_stats.gather_table_stats(null, 'T', method_opt=>'for all columns size 254');
explain plan for select * from t where x=2;

select * from table(dbms_xplan.display());

Plan hash value: 1601196873

--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 500K| 1464K| 448 (3)| 00:00:06 |
|* 1 | TABLE ACCESS FULL| T | 500K| 1464K| 448 (3)| 00:00:06 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("X"=2)
As we can see, we get a wrong cardinality with cardinality off by 5 orders of magnitude.
Now let’s create a function-based index that would “hide” the popular value:

create index i$t$x_rare on t(case(x) when 2 then x end);

exec dbms_stats.gather_table_stats('SCOTT', 'T', method_opt=>'for all columns size 254');

and modify the query to use statistics on the virtual column created by the FBI

explain plan for select * from t where case(x) when 2 then x end =2;

select * from table(dbms_xplan.display());

Plan hash value: 1371096493

------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 40 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| T | 10 | 40 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | I$T$X_RARE | 10 | | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(CASE "X" WHEN 2 THEN "X" END =2)

Now we are getting a better plan and correct cardinalities.
Of course it is a big disadvantage that we need to rewrite the query to make use of the virtual column, so in some cases it may be a better solution to use another trick suggested by J. Lewis: use “hand-made” histograms that provide a more realistic distribution. Of course, it is also always possible to increase the sample size, making sure that rare values aren’t missed, if the data size is not so big that larger sample size would make calculating histograms prohibitively expensive.
It should be also possible to the virtual column approach for dealing with correlated predicates.

Read the entire article at its source

view counter