Sunday, June 8, 2008

[HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

Currently eqsel assumes that, except for the values stored as MCVs,
the number of times a value appears in a table column is
independent of its value. Unfortunately, this often yields
poor selectivity estimates and frequently leads to
inappropriate plans.

As an example, consider an insurance company that keeps a record of
patient heights. Assume there are 1000000 patient heights in this
column, distributed normally with a mean of 1.7526 m and a
standard deviation of 0.0762 m. Furthermore, assume that the heights
are measured only to the nearest centimeter. Then we'd expect there to
be about 73 distinct heights, with an SD of 1.5.

Ignoring the effects of MCV's, the planner expects
SELECT height FROM heights WHERE height = 1.75;
to yield roughly 13000 results. However, given that we know the
underlying distribution, we would expect to see ~52000 results.

Similarly, the planner expects to see 13000 results from
SELECT 1.75 FROM heights WHERE height = 2.05;
while we would actually expect to see only ~2.7.
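The first of these expectations can be checked directly against the
normal model (a sketch using only the figures from the example; the
bin width and the 1.00 m - 2.50 m scan range are my assumptions):

```python
import math

N = 1_000_000                # rows in the heights column
MU, SIGMA = 1.7526, 0.0762   # mean / SD of heights, in meters
STEP = 0.01                  # heights recorded to the nearest centimeter

def cdf(x):
    # Normal CDF for the height distribution, via the error function.
    return 0.5 * (1.0 + math.erf((x - MU) / (SIGMA * math.sqrt(2.0))))

def bin_prob(h):
    # Probability mass of the 1 cm bin centered on height h.
    return cdf(h + STEP / 2) - cdf(h - STEP / 2)

# Expected rows matching WHERE height = 1.75 (close to the mean): ~52000.
rows_175 = N * bin_prob(1.75)

# Expected number of distinct heights: for each cm bin, the chance it is
# non-empty after N draws is 1 - (1 - p)^N; summing gives roughly 73.
bins = [1.00 + k * STEP for k in range(151)]   # scan 1.00 m .. 2.50 m
ndistinct = sum(1.0 - (1.0 - bin_prob(h)) ** N for h in bins)
```

This reproduces both the ~52000 expected matches near the mean and the
~73 expected distinct values quoted above.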

Obviously this example is not totally convincing: if I were to post
this to pgsql-general looking for advice, I'm sure everyone would
tell me to just increase the size of my MCV stats. However, in cases
where the number of distinct values is higher, this isn't always
feasible. Also, why store a list of 50 values and their frequencies
when 10 extra statistics would provide the same plans without bloating
pg_statistic?

To combat this problem, I have two different proposals.

Idea 1: Keep an array of stadistinct values, one per histogram bucket.

In the example above (again ignoring MCVs) the quantile data is

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
1.38 1.66 1.69 1.71 1.73 1.75 1.77 1.79 1.82 1.85 2.12

with numdistinct values of ( respectively )

29 2 2 2 2 2 2 3 3 25

For the two above examples, this new approach would yield selectivity
estimates of

(1000000/10)/2 = 50000 ( vs an actual ED of ~52000 )
and
(1000000/10)/25 = 4000 ( vs an actual ED of ~2.7 )
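A minimal sketch of the lookup idea 1 implies, with the histogram
boundaries and per-bucket ndistinct values taken from the table above
(the function name is mine; the bisection stands in for the planner's
binary search over stavalues):

```python
import bisect

N_ROWS = 1_000_000
# 10%-quantile boundaries and per-bucket ndistinct from the example.
BOUNDS = [1.38, 1.66, 1.69, 1.71, 1.73, 1.75, 1.77, 1.79, 1.82, 1.85, 2.12]
NDISTINCT = [29, 2, 2, 2, 2, 2, 2, 3, 3, 25]
NBUCKETS = len(NDISTINCT)

def eqsel_estimate(value):
    # Rows expected for WHERE col = value: each bucket holds ~N/10 rows,
    # assumed spread evenly over that bucket's distinct values.
    i = bisect.bisect_right(BOUNDS, value) - 1
    i = min(max(i, 0), NBUCKETS - 1)      # clamp to the histogram range
    return (N_ROWS / NBUCKETS) / NDISTINCT[i]
```

eqsel_estimate(1.75) falls in the 2-distinct bucket and returns 50000,
and eqsel_estimate(2.05) falls in the 25-distinct bucket and returns
4000, matching the arithmetic above.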

Furthermore, this is done without MCVs. Since MCVs would make the
histogram more sensitive near the edges, the estimates with MCVs
should be correspondingly better.


There are two potential problems that I see with this approach:

1) It assumes that = is consistent with <= and >= . This is certainly
true for real numbers, but is it true for every equality operator that
eqsel estimates for?

2) It bloats the stats table.

Idea 2: Keep a correlation statistic between ndistinct and bucket size

This addresses problem #2.

In lieu of keeping an actual list of ndistinct values per histogram
bucket, we store the linear scaling coefficient between histogram
bucket width and ndistinct/(avg ndistinct). To visualize this, it is
easiest to imagine plotting bucket width versus ndistinct. The scaling
coefficient defines the line through the origin that minimizes the
squared difference between its estimate for ndistinct and the actual
value.

When I apply this method to the above data I find a coefficient of
13.63 for an average ndist of 72/10. This provides selectivity
estimates, for the above two examples, of
(1000000/10)/( 13.63*7.2*(1.77 - 1.75) ) = 50950 ( vs an actual ED of ~52000 )
and
(1000000/10)/( 13.63*7.2*(2.12 - 1.85) ) = 3774 ( vs an actual ED of ~2.7 )
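The coefficient fit can be reproduced from the same histogram data
(a sketch; for a line through the origin, minimizing the squared error
gives c = sum(w_i * n_i/n_avg) / sum(w_i^2), where w_i is the bucket
width and n_i its ndistinct):

```python
N_ROWS = 1_000_000
BOUNDS = [1.38, 1.66, 1.69, 1.71, 1.73, 1.75, 1.77, 1.79, 1.82, 1.85, 2.12]
NDISTINCT = [29, 2, 2, 2, 2, 2, 2, 3, 3, 25]

widths = [hi - lo for lo, hi in zip(BOUNDS, BOUNDS[1:])]
n_avg = sum(NDISTINCT) / len(NDISTINCT)            # 72/10 = 7.2

# Least-squares line through the origin for ndistinct/n_avg vs width.
c = (sum(w * n / n_avg for w, n in zip(widths, NDISTINCT))
     / sum(w * w for w in widths))                 # ~13.63

def rows_per_value(width):
    # Estimated rows per distinct value in a bucket of the given width:
    # each bucket holds ~N/10 rows, over an estimated c*n_avg*width values.
    return (N_ROWS / len(widths)) / (c * n_avg * width)

est_175 = rows_per_value(1.77 - 1.75)   # bucket containing 1.75, ~50950
est_205 = rows_per_value(2.12 - 1.85)   # bucket containing 2.05, ~3774
```

This recovers the 13.63 coefficient and both selectivity estimates
quoted above.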

Although this yields better results than idea 1 for this particular
example, it will be much more sensitive to weird distributions.

Obviously there are some special cases to consider: we wouldn't want
the stats to be skewed in a way that produces really bad plans.
However, with some carefully designed caps I believe that we could
ensure that the estimates are at least as good as they are now. In
fact, I'm not certain that a squared-error penalty is the correct loss
function. Ideally, we want to minimize the extra time that the db
spends by choosing an incorrect plan. Maybe slight overestimations are
better than slight underestimations? Maybe the cost of the occasional
(really) bad plan is less than the cost of a bunch of kinda bad plans?

Finally, we aren't limited to just one coefficient. We could also
store multiple coefficients to improve our estimates, providing a
compromise between ideas 1 and 2.

Food for future thought...

In addition to the previous benefits, I think this method has the
potential to make the process by which MCVs are chosen (or not chosen)
smarter. Currently the planner treats a value as an MCV candidate if
its frequency is greater than 1.25 * the average frequency. If this
improved selectivity estimate were implemented, a better approach
might be to include a value as an MCV only if a) its frequency is
above a certain threshold and b) the histogram selectivity estimator
does a poor job on it.

What are people's thoughts on idea 1 vs idea 2?

Am I missing any relevant details about the planner's operation?

Do people think that the improved estimates would be worth the
additional overhead?

-Nathan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
