Friday, July 18, 2008

Re: [GENERAL] Reducing memory usage of insert into select operations?

On 4:55 pm 07/18/08 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The only thing I can think of is that you had a huge number of rows
> with all the same hash value, so that there wasn't any way to split
> the batch into smaller sections. What are the join keys exactly in
> this query, and what can you tell us about their data distributions?

I can't post the actual table or column names, so below are the actual select and explain with all names changed.

insert into customer_transactions
(record_id, date, type, amount, ids, groupid)
select
ca.record_id, coh.date, coh.type, coh.amount, coh.ids, ids.groupid
from
customer_original_historical coh,
cards ca,
customer_ids ids
where
ca.natural_key = coh.natural_key
and ids.ids = coh.ids
and coh.yearmo > '200703';

Hash Join  (cost=712213.57..27293913.33 rows=234402352 width=24)
  Hash Cond: (coh.id = ids.id)
  ->  Hash Join  (cost=551387.26..18799378.16 rows=234402352 width=22)
        Hash Cond: (coh.user_id = ca.user_id)
        ->  Seq Scan on customer_original_historical coh  (cost=0.00..6702501.40 rows=234402352 width=47)
              Filter: (yearmo > '200703'::bpchar)
        ->  Hash  (cost=268355.67..268355.67 rows=14637567 width=32)
              ->  Seq Scan on cards ca  (cost=0.00..268355.67 rows=14637567 width=32)
  ->  Hash  (cost=77883.25..77883.25 rows=5055525 width=6)
        ->  Seq Scan on customer_ids ids  (cost=0.00..77883.25 rows=5055525 width=6)

There was a single table, customer_original_historical, which used a natural key stored in a text field.

Most queries used customer_original_historical by itself or joined it against a single other table, which I will call "area".

The new schema I am testing splits that single table into 12 tables per month.

In addition, I replaced the natural keys with a synthetic integer key. I also replaced the "area" table with a customer_ids table that has only two columns: the synthetic key used by the historical rows and a region.

In order to have 12 tables per month I grouped all the regions into 12 groups. Queries usually stay within a single region, so what I am trying to benchmark is whether dividing 24 months of data into 24 sets of 12 regional tables will perform better than one single large table.
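
For illustration, one way to lay out that split is inheritance plus CHECK constraints, so that constraint_exclusion can skip the tables a query cannot touch; the table name and constraint values below are just an example, not necessarily what I will end up with:

-- one child table per (month, region group): 24 months x 12 groups
create table customer_transactions_200801_g01 (
    check (date >= '2008-01-01' and date < '2008-02-01'),
    check (groupid between 1 and 10)  -- whichever regions fall into group 1
) inherits (customer_transactions);

-- with constraint_exclusion = on, a query restricted to one month and one
-- region group only scans the matching child table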

The distribution of the join keys is:
There are about 1,000,000 unique natural keys. Each natural key has on average 15 rows per month.
The ids are the regions the natural_keys belong to; figure tens of thousands of natural_keys per id.
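
To check for skew directly, something like this should show whether any single join key dominates (again using the anonymized names; just a sketch):

-- rows per ids value in the historical table; one huge count here would
-- explain a hash batch that cannot be split any further
select ids, count(*)
from customer_original_historical
where yearmo > '200703'
group by ids
order by count(*) desc
limit 20;

-- what the planner thinks the distribution looks like
select attname, n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'customer_original_historical'
  and attname in ('ids', 'natural_key');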

Is that along the lines of what you were looking for?


