Saturday, August 16, 2008

Re: [PERFORM] Optimizing a VIEW

On Aug 15, 2008, at 1:36 PM, Madison Kelly wrote:
> The 'cust_id' references the customer that the given data belongs
> to. The reason for this "data bucket" (does this structure have a
> proper name?) is that the data I need to store on a give customer
> is quite variable and outside of my control. As it is, there is
> about 400 different variable/value pairs I need to store per customer.


It's called Entity-Attribute-Value, and it's performance is pretty
much guaranteed to suck for any kind of a large dataset. The problem
is that you're storing a MASSIVE amount of extra information for
every single value. Consider:

If each data point was just a field in a table, then even if we left
cd_value as text, each data point would consume 4 bytes* + 1 byte per
character (I'm assuming you don't need extra UTF8 chars or anything).
Of course if you know you're only storing numbers or the like then
you can make that even more efficient.

* In 8.3, the text field overhead could be as low as 1 byte if the
field is small enough.

OTOH, your table is going to 32+24 bytes per row just for the per-row
overhead, ints and timestamps. Each text field will have 1 or 4 bytes
in overhead, then you have to store the actual data. Realistically,
you're looking at 60+ bytes per data point, as opposed to maybe 15,
or even down to 4 if you know you're storing an int.

Now figure out what that turns into if you have 100 data points per
minute. It doesn't take very long until you have a huge pile of data
you're trying to deal with. (As an aside, I once consulted with a
company that wanted to do this... they wanted to store about 400 data
points from about 1000 devices on a 5 minute interval. That worked
out to something like 5GB per day, just for the EAV table. Just
wasn't going to scale...)

So, back to your situation... there's several things you can do that
will greatly improve things.

Identify data points that are very common and don't use EAV to store
them. Instead, store them as regular fields in a table (and don't use
text if at all possible).

You need to trim down your EAV table. Throw out the added/modified
info; there's almost certainly no reason to store that *per data
point*. Get rid of cd_id; there should be a natural PK you can use,
and you certainly don't want anything else referring to this table
(which is a big reason to use a surrogate key).

cd_variable and cd_tag need to be ints that point at other tables.
For that matter, do you really need to tag each *data point*?
Probably not...

Finally, if you have a defined set of points that you need to report
on, create a materialized view that has that information.

BTW, it would probably be better to store data either in the main
table, or the history table, but not both places.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

No comments: