> Speak to the statisticians. Our sample size is calculated using the same
> theory behind polls which sample 600 people to learn what 250 million
> people are going to do on election day. You do NOT need (significantly)
> larger samples for larger populations.
Your analogy is bad. For elections, the voters have only a few choices.
In a 300-million-row table there could be 300 million different values,
and the histogram becomes less accurate with every order of magnitude it
falls short of 300 million.
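
To make that concrete, here is a scaled-down toy in Python (my own
illustration; the table is shrunk from 300 million rows to 2 million so it
runs in seconds, and the column contents are made up). A fixed 600-row
sample estimates a proportion just fine regardless of table size, which is
the polling case, but it says almost nothing about the distinct count of a
near-unique column:

import random

random.seed(42)

N = 2_000_000          # scaled-down stand-in for the 300-million-row table
SAMPLE_SIZE = 600      # the "poll-sized" sample

# Column A: election-like, only a few distinct values.
# Column B: near-unique, potentially as many distinct values as rows.
col_a = [random.choice(("red", "blue", "green")) for _ in range(N)]
col_b = [random.randrange(N) for _ in range(N)]

sample_a = random.sample(col_a, SAMPLE_SIZE)
sample_b = random.sample(col_b, SAMPLE_SIZE)

# Proportions behave like a poll: the 600-row estimate lands close to the truth.
true_red = col_a.count("red") / N
est_red = sample_a.count("red") / SAMPLE_SIZE
print(f"fraction 'red': true={true_red:.3f}  sample estimate={est_red:.3f}")

# Distinct counts do not: essentially every sampled value of col_b is unique,
# so the sample cannot tell 10,000 distinct values apart from 2,000,000.
print(f"distinct in col_b: true={len(set(col_b)):,}  seen in sample={len(set(sample_b))}")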
> Also, our estimates for n_distinct are very unreliable. The math behind
> sampling for statistics just doesn't work the same way for properties
> like n_distinct. For that Josh is right, we *would* need a sample size
> proportional to the whole data set which would practically require us to
> scan the whole table (and have a technique for summarizing the results
> in a nearly constant sized data structure).
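
(As an aside: the parenthetical about summarizing the results in a nearly
constant sized data structure is not hypothetical. One-pass distinct-count
sketches such as k-minimum-values or HyperLogLog do exactly that in bounded
memory. Here is a toy KMV sketch in Python, purely illustrative and not
anything ANALYZE does today:

import hashlib
import heapq

class KMVSketch:
    """Keep the k smallest hash values seen; estimate n_distinct from them."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap (negated) of the k smallest hashes
        self._members = set()  # hashes currently kept, to skip duplicates

    def _hash01(self, value):
        # Map the value to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha1(repr(value).encode()).hexdigest()
        return int(digest[:15], 16) / float(1 << 60)

    def add(self, value):
        h = self._hash01(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)          # fewer than k distinct hashes seen
        return int((self.k - 1) / -self._heap[0])

# One full pass over a million "rows" holding 100,000 distinct values,
# using only ~1024 stored hashes no matter how large the table grows.
sketch = KMVSketch(k=1024)
for i in range(1_000_000):
    sketch.add(i % 100_000)
print("KMV estimate of n_distinct:", sketch.estimate())

The catch, of course, is that it still requires reading every row once.)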
Actually, a number of papers have shown block-based algorithms which can
arrive at reasonably confident estimates (between 50% and 250% of the true
value) based on scanning only 5% of *blocks*. Simon did some work on this a
couple of years ago, but he and I had difficulty convincing -hackers that a
genuine problem existed.
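
To give a feel for the shape of those algorithms (this is my own toy sketch,
not any of the published estimators and not Simon's patch): read a random 5%
of blocks in full, then scale the observed value-frequency profile up to the
whole table with a GEE-style estimator. The published block-level estimators
also correct for values clustering within blocks, which the uniformly
distributed toy data below conveniently avoids:

import math
import random
from collections import Counter

random.seed(1)

ROWS_PER_BLOCK = 100
N_BLOCKS = 10_000
N = N_BLOCKS * ROWS_PER_BLOCK            # 1,000,000-row toy "table"
table = [random.randrange(100_000) for _ in range(N)]
true_ndistinct = len(set(table))

# Read every row in a random 5% of blocks (cheap, mostly sequential I/O in a
# real table; here just list slices).
sampled_blocks = random.sample(range(N_BLOCKS), N_BLOCKS // 20)
sample = [v
          for b in sampled_blocks
          for v in table[b * ROWS_PER_BLOCK:(b + 1) * ROWS_PER_BLOCK]]
n = len(sample)

# f_j = number of values seen exactly j times in the sample; the GEE-style
# scale-up is sqrt(N/n) * f_1 + (number of values seen two or more times).
freq_of_freqs = Counter(Counter(sample).values())
f1 = freq_of_freqs.get(1, 0)
seen_twice_plus = sum(f for j, f in freq_of_freqs.items() if j >= 2)
estimate = math.sqrt(N / n) * f1 + seen_twice_plus

print(f"true n_distinct   = {true_ndistinct:,}")
print(f"5%-block estimate = {estimate:,.0f} ({estimate / true_ndistinct:.0%} of true)")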
You're correct that we'd need to change pg_statistic, though. For one
thing, we need to separate the sample size from the histogram size.
Also, we seem to be getting pretty far away from the original GUC
discussion.
--
--Josh
Josh Berkus
PostgreSQL @ Sun
San Francisco