Thursday, July 31, 2008

[GENERAL] Statistics Data archiving with Postgres

Hello,
We are developing an application and would like to compute usage statistics
on it in order:
- to better understand which parts of the application are used most, so
that we can model our load test scenarios accordingly;
- to give other departments information about how the application is used.

The problem is that our application is currently read-mostly, while
statistics logging is a write-mostly, log-like workload. The stats
collection will also generate a huge volume of data (because a very fine
granularity is mandatory).
We would like to avoid, as much as possible, any interference between the
stats collection and the main application.

We have looked at the Postgres documentation and found several ideas:
- We have decided to isolate the stats in a dedicated schema.
- We have looked at polymorphism (table inheritance / partitioning) in
order to split our stats tables into smaller ones that we could "detach"
once they are old (see the sketch below).
- We have looked at fsync tuning, or better, at asynchronous commit, since
these data are not critical.
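
For illustration, the kind of layout we have in mind looks roughly like
this (schema, table and column names are only examples, not our real
model): a dedicated schema, a parent stats table, and one child table per
month inheriting from it, so that an old month can later be "detached".

-- example only: a dedicated schema for the stats
CREATE SCHEMA stats;

-- parent table, which holds no data itself
CREATE TABLE stats.page_hits (
    hit_time  timestamptz NOT NULL,
    user_id   integer,
    action    text
);

-- one child table per month, inheriting from the parent;
-- an old month can later be dumped and dropped on its own
CREATE TABLE stats.page_hits_2008_07 (
    CHECK (hit_time >= DATE '2008-07-01' AND hit_time < DATE '2008-08-01')
) INHERITS (stats.page_hits);

CREATE TABLE stats.page_hits_2008_08 (
    CHECK (hit_time >= DATE '2008-08-01' AND hit_time < DATE '2008-09-01')
) INHERITS (stats.page_hits);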

But we have been facing several questions/problems:

Polymorphism and ORM question:
- Since we access PG through an ORM tool, the insert rules we defined for
the polymorphism were a problem: the row count reported after an insert
was 0, because the rule applied last was generally not the one that
actually made the insert. In our case we know that only a single rule
will match, so as a hack we prefixed the name of that rule with "zzz" so
that it sorts last; that is very hacky, but with it Hibernate is happy
(a sketch follows below).
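
To make the hack concrete, here is roughly what it looks like on the
example tables above (again, the names are made up): a conditional rule
redirects inserts on the parent table into the current month's child, and
the rule name carries a "zzz" prefix so that it sorts alphabetically last
and its row count is the one reported back to Hibernate.

-- example only: route inserts on the parent into the current month's
-- child; the "zzz" prefix makes this rule sort last among our rules
CREATE RULE zzz_page_hits_insert_2008_07 AS
    ON INSERT TO stats.page_hits
    WHERE (NEW.hit_time >= DATE '2008-07-01'
           AND NEW.hit_time < DATE '2008-08-01')
    DO INSTEAD
        INSERT INTO stats.page_hits_2008_07 VALUES (NEW.*);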

One or several databases, one or several servers?
- In such a case, could we store both our application content and the
stats in the same database? Would it be better to use two databases in
the same cluster, or even two different dedicated servers?
- If we want to tune fsync for the stats only, I suppose we need two
separate servers, since fsync applies to a whole server. I read that
asynchronous commit can be set for a transaction. Is there a way to say
that a given cluster or given tables are in asynchronous commit by
default, perhaps with triggers? (See the sketch below.)
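
For what it is worth, our current understanding (to be confirmed) is that
fsync can only be changed in postgresql.conf for the whole server, while
synchronous_commit in 8.3 is an ordinary setting that could be switched
off at several levels, for example (the database and role names below are
made up):

-- example only: asynchronous commit for every connection to the stats
-- database
ALTER DATABASE statsdb SET synchronous_commit = off;

-- or only for the role our stats writer connects as
ALTER ROLE stats_writer SET synchronous_commit = off;

-- or just for a single transaction
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO stats.page_hits VALUES (now(), 42, 'login');
COMMIT;

Is that correct, and would one of these levels be recommended here?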

In any case we would like to archive the old collected data on slow file
storage, but we would like to avoid our database reaching terabytes only
because of data collection. Maybe this is a bad idea. Anyway, if it is
not so bad, we again have questions:
With polymorphism we can simply dump whole child tables regularly. But
polymorphism has been found a bit complex, so we are also studying
simpler approaches with fewer but larger stats tables.
We have looked at a plain data-only pg_dump, but we would need to dump
only a part of a table. So we have looked at the COPY command, which
seems interesting in our case. Is there any experience or feedback on
that command? (See the sketch below.)
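
What we have in mind is something like this (paths and names are only
examples): COPY with a query (available since 8.2) to extract just the
old slice into a flat file, or, with the partitioned layout, copying and
then dropping a whole old child table. Our understanding is that COPY
... TO a file path runs on the server and needs superuser rights, and
that psql's \copy is the client-side equivalent.

-- example only: archive everything older than July 2008 to a flat file
COPY (SELECT * FROM stats.page_hits WHERE hit_time < DATE '2008-07-01')
    TO '/archive/page_hits_before_2008_07.csv' WITH CSV;

-- with monthly child tables, a whole old month could be archived and
-- then dropped:
-- COPY stats.page_hits_2008_06 TO '/archive/page_hits_2008_06.csv' WITH CSV;
-- DROP TABLE stats.page_hits_2008_06;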

Sorry, there are many questions; our problem is rather broad because
there are several concerns:
- Polymorphism or not
- One or several DB clusters or servers
- Fsync/asynchronous problem
- Rule limitations
- Use of COPY
But to sum up, we would like to collect usage statistics (write-mostly
tables, high data volume, non-critical data) for an application running
on a read-mostly DB, with the least possible impact on that DB's
performance. And we would also like to be able to archive the old
collected data outside the DB.

Thanks for any help!

Pascal


