Wednesday, July 30, 2008

Re: [PERFORM] what is less resource-intensive, WHERE id IN or INNER JOIN?

WHERE id IN will generally lead to faster query plans.  Often, much faster on large tables.

The inner join deals with duplicate values for id differently.  WHERE id IN ( subquery ) will be much more likely to choose a hash method for filtering the scan on table1.

I just ran into this today on a query of similar nature on a table with 2M rows (table1 below) joining on a table with 100k (table 2).  The speedup was about 10x -- 20 seconds to 2 seconds in my case -- but I'm working with many GB of RAM and 800MB of work_mem.

I'm not an expert on the query planner guts -- the below is all knowledge based on experimenting with queries on my dataset and triggering various query plans.  Due to the differences in semantics of these two queries the plan on larger tables will be two sorts and a merge join for the INNER JOIN on most columns, though indexes and especially unique indexes will change this.  This holds true even if the number of distinct values of the column being joined on is very low.

The WHERE id IN version will produce a much faster query plan most of the time, assuming you have enough work_mem configured.  The planner is conservative if it estimates usage of work_mem to overflow even a little bit -- and often shifts to sorts rather than hashes on disk.

Since you are working with such a small ammount of RAM, make sure you have some of it doled out to work_mem and tune the balance between work_mem, the OS, and the shared buffers carefully.  The conservative, sort-if-uncertain nature of the query planner may need some coersion with an unusual environment such as yours. You may even have faster results with a hash overflown to disk than a sort overflown to disk with that little memory if the % of overflow is small enough and the OS disk cache large enough. Plus, virtual machines sometimes do some odd things with caching non sync disk writes that may distort the usual random versus sequential disk cost for small I/O volumes.  Though my VM experience is VMWare not Xen.
The querry planner won't generally go for hashes on disk on purpose however, so you might need to be creative with manual statistics setting or changing the optimizer cost settings to experiment with various query plans and measure the unique aspects of your atypical environment and your data. 

On Wed, Jul 30, 2008 at 3:11 PM, Miernik <public@public.miernik.name> wrote:
AFAIK, provided bar is UNIQUE in table2 (e.g. is a PRIMARY KEY) the two
queries will give the same result:

SELECT foo, id FROM table1 WHERE id IN (SELECT id FROM table2);

SELECT foo, id FROM table1 INNER JOIN table2 USING (id);

Given table1 has about 100k rows, and table2 about 100 rows, which one
should be faster, less resource intensive, use less RAM, disk access, etc?
Are there any other even better ways to acomlish the same query?

Using 8.3.3 on a 48 MB RAM Xen.

--
Miernik
http://miernik.name/


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

No comments: