Thursday, September 25, 2008

Re: [GENERAL] The planner hates me.

Change t.date2 < dates.date to t.date2+0 < dates.date. That will keep the
query from using the index on date2, because the clause now references an
expression rather than the column itself:

explain analyze
select sum(amount), dates.date as date
from transaction t
join (select get_dates as date from get_dates('09/17/08','09/24/08')) dates
on (t.state='I' or t.date1 >= dates.date) and t.date2+0 < dates.date
group by dates.date
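
The trick can also be checked in isolation. Here is a minimal sketch (not part
of the original reply), assuming the transaction table above and an arbitrary
cutoff date:

-- Sketch for illustration only; the date literal is a placeholder.
-- The planner may consider t_date2_index for the first form, but a plain
-- index on date2 is unusable for the second, because the comparison is
-- against an expression (date2 + 0) rather than the bare column.
EXPLAIN SELECT amount FROM transaction WHERE date2     < DATE '2008-09-17';
EXPLAIN SELECT amount FROM transaction WHERE date2 + 0 < DATE '2008-09-17';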


-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Jeff Amiel
Sent: Thursday, September 25, 2008 10:24 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] The planner hates me.

"PostgreSQL 8.2.4 on i386-pc-solaris2.10, compiled by GCC gcc (GCC)
3.4.3 (csl-sol210-3_4-branch+sol_rpath)"
(test environment)

Picture a table called 'transaction' with 1 million rows.
Most (> 99%) of the records have date1 and date2 values in the past
(spread over 4 years).
99.99% of the records have a state OTHER than 'I'.


CREATE TABLE transaction
(
amount numeric(10,2) NOT NULL,
date1 date NOT NULL,
state character(1) NOT NULL,
date2 date DEFAULT date(now())
);

CREATE INDEX t_date1_index
ON transaction
USING btree
(date1);

CREATE INDEX t_date2_index
ON transaction
USING btree
(date2);

CREATE INDEX t_state_index
ON transaction
USING btree
(state);
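
For reference, a hypothetical way to load test data roughly matching the
distribution described above; the amounts, the 'C' state value, and the exact
date spread are assumptions, not taken from the post:

-- Hypothetical test-data load: ~1M rows, dates spread over ~4 years,
-- and only ~0.01% of rows in state 'I'.
INSERT INTO transaction (amount, date1, state, date2)
SELECT round((random() * 1000)::numeric, 2),              -- arbitrary amount
       current_date - (random() * 1460)::int,             -- date1 within ~4 years
       CASE WHEN random() < 0.0001 THEN 'I' ELSE 'C' END, -- ~0.01% in state 'I'
       current_date - (random() * 1460)::int              -- date2 within ~4 years
FROM generate_series(1, 1000000);

ANALYZE transaction;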

explain analyze
select sum(amount), dates.date as date
from transaction t
join (select get_dates as date from
get_dates('09/17/08','09/24/08')) dates on
(t.state='I' or t.date1 >= dates.date)
group by dates.date

get_dates simply returns each date between (and including) the passed
dates, and is a 'stable' function. Yes... I know I probably could have
used generate_series or something, but this was written before I knew
anything about that.
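
The body of get_dates is not shown anywhere in the thread; a
generate_series-based equivalent (an assumption, not the author's actual
function) might look like:

-- Assumed stand-in for get_dates: returns every date from $1 through $2,
-- inclusive. Uses $1/$2 parameter references so it also works on 8.2.
CREATE OR REPLACE FUNCTION get_dates(date, date)
RETURNS SETOF date AS $$
    SELECT $1 + s.i
    FROM generate_series(0, $2 - $1) AS s(i);
$$ LANGUAGE sql STABLE;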

"HashAggregate (cost=1290485.15..1290485.40 rows=200 width=17) (actual
time=277.804..277.809 rows=8 loops=1)"
" -> Nested Loop (cost=270.37..1123134.88 rows=334700533 width=17)
(actual time=3.182..153.741 rows=120536 loops=1)"
" -> Function Scan on get_dates (cost=0.00..1.25 rows=1000
width=4) (actual time=0.057..0.065 rows=8 loops=1)"
" -> Bitmap Heap Scan on "transaction" t (cost=270.37..618.60
rows=336357 width=22) (actual time=3.093..10.958 rows=15067 loops=8)"
" Recheck Cond: ((t.state = 'I'::bpchar) OR (t.date1 >=
get_dates.get_dates))"
" -> BitmapOr (cost=270.37..270.37 rows=336357 width=0)
(actual time=2.853..2.853 rows=0 loops=8)"
" -> Bitmap Index Scan on t_state_index
(cost=0.00..8.97 rows=4971 width=0) (actual time=2.842..2.842 rows=15067
loops=8)"
" Index Cond: (state = 'I'::bpchar)"
" -> Bitmap Index Scan on t_date1_index
(cost=0.00..252.99 rows=331387 width=0) (actual time=0.009..0.009 rows=2
loops=8)"
" Index Cond: (t.date1 >= get_dates.get_dates)"
"Total runtime: 277.883 ms"


Uses bitmap scans for optimization and performs admirably.

BUT... when I change the query thusly (adding an additional AND clause):

explain analyze
select sum(amount), dates.date as date
from transaction t
join (select get_dates as date from
get_dates('09/17/08','09/24/08')) dates on
(t.state='I' or t.date1 >= dates.date) and t.date2 < dates.date
group by dates.date

"HashAggregate (cost=1222618.09..1222618.34 rows=200 width=15) (actual
time=7538.193..7538.199 rows=8 loops=1)"
" -> Nested Loop (cost=0.00..1166174.15 rows=112887885 width=15)
(actual time=0.889..7411.997 rows=120522 loops=1)"
" Join Filter: ((t.state = 'I'::bpchar) OR (t.date1 >=
get_dates.get_dates))"
" -> Function Scan on get_dates (cost=0.00..1.25 rows=1000
width=4) (actual time=0.055..0.062 rows=8 loops=1)"
" -> Index Scan using t_date2_index on "transaction" t
(cost=0.00..590.77 rows=328800 width=24) (actual time=0.018..492.348
rows=986273 loops=8)"
" Index Cond: (t.date2 < get_dates.get_dates)"
"Total runtime: 7538.259 ms"

I am pulling my hair out. Why does it insist on using the index on the
date2 field? It was doing so well with the bitmap indexing on the other
fields... if the planner would simply do the same with date2, all would
be well. I really need the date2 portion of the query to filter AFTER
the first two clauses, because those are quicker and narrow the data
down to a manageable size.


Any suggestions on query or index changes? I still need the date2 index
for other queries... but the fact that it gets used exclusively for this
query drives me batty!

I've tried various combinations of multi-column indexes to no
avail...because I still need the date2 index for other reasons, this
query ALWAYS chooses it for some reason.
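
(The post doesn't say which combinations were tried; purely as illustration,
multi-column indexes of that kind might have looked like the following.)

-- Hypothetical examples only; the exact indexes the author tried are not given.
CREATE INDEX t_state_date1_index ON transaction USING btree (state, date1);
CREATE INDEX t_date2_date1_index ON transaction USING btree (date2, date1);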


Any help would be appreciated.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
