> On Mon, 18 Aug 2008, Moritz Onken wrote:
>> Running the query for more than one user is indeed not much slower.
>> That's what I need. I'm clustering the results table on domain
>> right now. But why is this better than clustering it on "user"?
>
> The reason is the way that the merge join algorithm works. What it
> does is takes two tables, and sorts them both by the join fields.
> Then it can stream through both tables producing results as it goes.
> It's the best join algorithm, but it does require both tables to be
> sorted by the same thing, which is domain in this case. The
> aggregating on user happens after the join has been done, and the
> hash aggregate can accept the users in random order.
>
> If you look at your last EXPLAIN, see that it has to sort the result
> table on domain, although it can read the domain_categories in
> domain order due to the clustered index.
explain select
a."user", b.category, sum(1.0/b.cat_count)::float
from result a, domain_categories b
where a."domain" = b."domain"
group by a."user", b.category;
"GroupAggregate (cost=21400443313.69..22050401897.13 rows=35049240
width=12)"
" -> Sort (cost=21400443313.69..21562757713.35 rows=64925759864
width=12)"
" Sort Key: a."user", b.category"
" -> Merge Join (cost=4000210.40..863834009.08
rows=64925759864 width=12)"
" Merge Cond: (b.domain = a.domain)"
" -> Index Scan using domain_categories_domain on
domain_categories b (cost=0.00..391453.79 rows=12105014 width=12)"
" -> Materialize (cost=3999931.73..4253766.93
rows=20306816 width=8)"
" -> Sort (cost=3999931.73..4050698.77
rows=20306816 width=8)"
" Sort Key: a.domain"
" -> Seq Scan on result a
(cost=0.00..424609.16 rows=20306816 width=8)"
Both results and domain_categories are clustered on domain and analyzed.
It took 50 minutes to run this query for 280 users ("and "user" IN
([280 ids])"), 78000 rows were returned and stored in a table. Is this
reasonable?
Why is it still sorting on domain? I thought the clustering should
prevent the planner from doing this?
moritz
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
No comments:
Post a Comment