Monday, June 9, 2008

Re: [HACKERS] Overhauling GUCS

Greg,

> Speak to the statisticians. Our sample size is calculated using the same
> theory behind polls which sample 600 people to learn what 250 million
> people are going to do on election day. You do NOT need (significantly)
> larger samples for larger populations.

Your analogy is bad. In an election, the voters choose among only a handful of
candidates. In a 300 million row table there can be up to 300 million distinct
values, and a histogram built from a fixed-size sample loses accuracy for every
order of magnitude by which that sample falls short of the table size.
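To make that concrete, here is a toy simulation (nothing PostgreSQL-specific,
and the fixed 30,000-row sample size is an arbitrary choice for illustration).
With a handful of candidates a poll-sized sample sees every possible answer;
with millions of distinct values a fixed-size sample cannot even observe more
than a sliver of them, let alone bound their frequencies:

    import random

    def distinct_seen(n_distinct, sample_size=30_000):
        """Sample a fixed number of values uniformly from a column with
        n_distinct possible values and count how many distinct values
        the sample actually observes."""
        return len({random.randrange(n_distinct) for _ in range(sample_size)})

    for n_distinct in (100, 10_000, 1_000_000, 100_000_000):
        seen = distinct_seen(n_distinct)
        print(f"{n_distinct:>11} distinct -> sample sees {seen} ({seen / n_distinct:.1%})")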

> Also, our estimates for n_distinct are very unreliable. The math behind
> sampling for statistics just doesn't work the same way for properties
> like n_distinct. For that Josh is right, we *would* need a sample size
> proportional to the whole data set which would practically require us to
> scan the whole table (and have a technique for summarizing the results
> in a nearly constant sized data structure).

Actually, a number of papers have shown block-based algorithms which can
arrive at reasonably confident estimates (between 50% and 250% of the true
value) based on scanning only 5% of *blocks*. Simon did some work on this a
couple of years ago, but he and I had difficulty convincing -hackers that a
genuine problem existed.
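For a rough idea of the shape of such an estimator, here is my own sketch (not
the specific algorithm from those papers or from Simon's patch): read a
fraction of heap blocks in full, then extrapolate from the values seen exactly
once, GEE-style. The published estimators differ mainly in the extrapolation
step and the error bounds they can guarantee:

    import random
    from collections import Counter
    from math import sqrt

    def estimate_ndistinct(blocks, total_rows, block_fraction=0.05):
        """blocks: one list of column values per heap block (a toy stand-in
        for reading 5% of the table's blocks in full)."""
        k = max(1, int(len(blocks) * block_fraction))
        values = [v for block in random.sample(blocks, k) for v in block]
        counts = Counter(values)
        d = len(counts)                                  # distinct values in the sample
        f1 = sum(1 for c in counts.values() if c == 1)   # values seen exactly once
        r = len(values)                                  # rows sampled
        # GEE-style extrapolation: scale the singletons by sqrt(N/r).
        return sqrt(total_rows / r) * f1 + (d - f1)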

You're correct that we'd need to change pg_statistic, though. For one
thing, we need to separate the sample size from the histogram size.
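(For anyone not following why those two are tangled today: the single
statistics target drives both numbers, roughly as sketched below, so you
cannot ask ANALYZE for a larger sample without also storing a larger
histogram. This is just the current behavior as I understand it, not a
proposal:)

    def analyze_parameters(statistics_target=10):
        sample_rows = 300 * statistics_target    # rows ANALYZE samples
        histogram_buckets = statistics_target    # histogram bounds kept in pg_statistic
        return sample_rows, histogram_buckets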

Also, we seem to be getting pretty far away from the original GUC
discussion.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

