Sunday, August 17, 2008

Re: [PERFORM] Optimizing a VIEW

Decibel! wrote:
> On Aug 15, 2008, at 1:36 PM, Madison Kelly wrote:
>> The 'cust_id' references the customer that the given data belongs to.
>> The reason for this "data bucket" (does this structure have a proper
>> name?) is that the data I need to store on a given customer is quite
>> variable and outside of my control. As it is, there are about 400
>> different variable/value pairs I need to store per customer.
>
>
> It's called Entity-Attribute-Value, and its performance is pretty much
> guaranteed to suck for any kind of a large dataset. The problem is that
> you're storing a MASSIVE amount of extra information for every single
> value. Consider:
>
> If each data point was just a field in a table, then even if we left
> cd_value as text, each data point would consume 4 bytes* + 1 byte per
> character (I'm assuming you don't need extra UTF8 chars or anything). Of
> course if you know you're only storing numbers or the like then you can
> make that even more efficient.
>
> * In 8.3, the text field overhead could be as low as 1 byte if the field
> is small enough.
>
> OTOH, your table is going to use 32+24 bytes per row just for the per-row
> overhead, ints and timestamps. Each text field will have 1 or 4 bytes in
> overhead, then you have to store the actual data. Realistically, you're
> looking at 60+ bytes per data point, as opposed to maybe 15, or even
> down to 4 if you know you're storing an int.
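
In SQL terms, the bucket being described is roughly the following (a
sketch: the cd_* and cust_id names come from this thread, everything
else is invented here):

    -- EAV "data bucket": every single value pays the full per-row cost.
    CREATE TABLE customer_data (
        cd_id       serial,       -- 4 bytes (surrogate key)
        cust_id     int,          -- 4 bytes
        cd_variable text,         -- 1-4 bytes overhead + the name itself
        cd_value    text,         -- 1-4 bytes overhead + the value itself
        cd_tag      text,         -- 1-4 bytes overhead + the tag itself
        added_on    timestamptz,  -- 8 bytes
        modified_on timestamptz   -- 8 bytes
    );
    -- ...plus the fixed per-row overhead (tuple header, item pointer),
    -- so each data point easily costs 60+ bytes before the data.
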
>
> Now figure out what that turns into if you have 100 data points per
> minute. It doesn't take very long until you have a huge pile of data
> you're trying to deal with. (As an aside, I once consulted with a
> company that wanted to do this... they wanted to store about 400 data
> points from about 1000 devices on a 5 minute interval. That worked out
> to something like 5GB per day, just for the EAV table. Just wasn't going
> to scale...)
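
For scale, the arithmetic behind that figure:

    400 points x 1,000 devices x 288 five-minute intervals/day
        = 115,200,000 rows/day
    115,200,000 rows/day x ~45 bytes each = about 5 GB/day
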
>
> So, back to your situation... there are several things you can do that
> will greatly improve things.
>
> Identify data points that are very common and don't use EAV to store
> them. Instead, store them as regular fields in a table (and don't use
> text if at all possible).
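
A sketch of what that might look like, with the common points promoted
to typed columns (all column and table names here are invented for
illustration):

    CREATE TABLE cust_common_data (
        cust_id    int         NOT NULL,
        sampled_at timestamptz NOT NULL,
        bytes_in   bigint,   -- typed columns instead of text
        bytes_out  bigint,
        cpu_load   real,
        PRIMARY KEY (cust_id, sampled_at)
    );
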
>
> You need to trim down your EAV table. Throw out the added/modified info;
> there's almost certainly no reason to store that *per data point*. Get
> rid of cd_id; there should be a natural PK you can use, and you
> certainly don't want anything else referring to this table (external
> references being the main reason to use a surrogate key in the first
> place).
>
> cd_variable and cd_tag need to be ints that point at other tables. For
> that matter, do you really need to tag each *data point*? Probably not...
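
Put together, a trimmed version along those lines might look like this
(a sketch; names other than cust_id, cd_value, and cd_id are invented):

    CREATE TABLE cd_variables (
        var_id   serial PRIMARY KEY,
        var_name text NOT NULL UNIQUE
    );

    CREATE TABLE customer_data (
        cust_id  int  NOT NULL,
        var_id   int  NOT NULL REFERENCES cd_variables (var_id),
        cd_value text,
        PRIMARY KEY (cust_id, var_id)  -- natural key; no surrogate cd_id
    );
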
>
> Finally, if you have a defined set of points that you need to report on,
> create a materialized view that has that information.
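
PostgreSQL (through 8.3) has no built-in materialized views, so in
practice this means a summary table plus a refresh routine, something
like the following (a sketch; the variable ids and column names are
placeholders):

    -- Pre-computed report, pivoting the EAV rows into columns:
    CREATE TABLE cust_report_mv AS
    SELECT cust_id,
           max(CASE WHEN var_id = 1 THEN cd_value END) AS bytes_in,
           max(CASE WHEN var_id = 2 THEN cd_value END) AS bytes_out
    FROM customer_data
    GROUP BY cust_id;

    -- Re-run on whatever schedule the reports need:
    CREATE OR REPLACE FUNCTION refresh_cust_report_mv() RETURNS void AS $$
    BEGIN
        TRUNCATE cust_report_mv;
        INSERT INTO cust_report_mv
        SELECT cust_id,
               max(CASE WHEN var_id = 1 THEN cd_value END),
               max(CASE WHEN var_id = 2 THEN cd_value END)
        FROM customer_data
        GROUP BY cust_id;
    END;
    $$ LANGUAGE plpgsql;
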
>
> BTW, it would probably be better to store data either in the main table,
> or the history table, but not both places.

This is a very long and thoughtful reply, thank you very kindly.

Truth be told, I sort of expected this would be what I had to do. I
think I asked this more in hoping that there might be some "magic" I
didn't know about, but I see now that's not the case. :)

As my data points grew to 500,000+, the time it took to return these
results grew to well over 10 minutes on a decent server, and the DB size
was growing rapidly, just as you described.

So I did just as you suggested: I took the variable names I knew about
specifically and created a table for them. These are the ones that are
updated most often (hourly per customer). I made each column an 'int' or
'real' where possible and ditched the tracking of the adding/modifying
user and time stamp; I had added those out of habit more than anything,
and this data will always come from a system app anyway, so...
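
For illustration, the shape is something like this (hypothetical names,
not the actual schema):

    CREATE TABLE cust_current (
        cust_id    int PRIMARY KEY,
        in_octets  bigint,  -- 'int'/'real' columns instead of text
        out_octets bigint,
        cpu_load   real
        -- ... one typed column per known variable
    );

    -- Hourly per-customer update from the system app:
    UPDATE cust_current
       SET in_octets = 123456, out_octets = 654321, cpu_load = 0.42
     WHERE cust_id = 1;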

Given that my DB is in development, and given how long and intensive it
would have been to pull the existing data out, I have started over and
am now gathering new data. In a week or so I should have the same amount
of data as I had before and will be able to do a closer comparison test.

However, I already suspect the growth of the database will be
substantially slower and the queries will return substantially faster.

Thank you again!

Madi

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
