Wednesday, September 10, 2008

[HACKERS] Interesting glitch in autovacuum

I observed a curious bug in autovac just now. Since plain vacuum avoids
calling GetTransactionSnapshot, an autovac worker that happens not to
analyze any tables will never call GetTransactionSnapshot at all.
This means it will arrive at vac_update_datfrozenxid with
RecentGlobalXmin never having been changed from its boot value of
FirstNormalTransactionId. As a result it will fail to update the
database's datfrozenxid ... or, if the current value of datfrozenxid
is past 2 billion, it will improperly advance datfrozenxid to
some point in the future.
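
To make the two failure modes concrete, here is a minimal standalone sketch of
the circular XID comparison involved. It is not the backend code: xid_precedes
mirrors the modulo-2^32 rule used for normal XIDs, while
maybe_advance_frozenxid and FIRST_NORMAL_XID are hypothetical names that stand
in for the final "advance only if the stored value is older" check and for
FirstNormalTransactionId (3).

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Assumed stand-in for FirstNormalTransactionId, the boot value of RecentGlobalXmin. */
#define FIRST_NORMAL_XID ((TransactionId) 3)

/*
 * Circular comparison of two normal XIDs: id1 is "older" than id2 when the
 * 32-bit difference, read as signed, is negative. The unsigned-to-signed
 * conversion is implementation-defined in C but wraps on common compilers.
 */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return diff < 0;
}

/*
 * Hypothetical model of the last step: replace the stored datfrozenxid only
 * if it logically precedes the candidate value.
 */
static TransactionId
maybe_advance_frozenxid(TransactionId datfrozenxid, TransactionId candidate)
{
    return xid_precedes(datfrozenxid, candidate) ? candidate : datfrozenxid;
}
```

With a candidate stuck at 3, a datfrozenxid of 900 is never replaced (3 looks
older than 900), whereas a datfrozenxid of 3000000000 is replaced by 3, because
in circular terms 3 lies ahead of it.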

Once you get into this state in a reasonably idle database such as
template1, autovac is completely dead in the water: if it thinks
template1 needs to be vacuumed for wraparound, then every subsequent
worker will be launched at template1, every one will fail to advance
its datfrozenxid, rinse and repeat. Even before that happens, the
DB's datfrozenxid will prevent clog truncation, which might explain
some of the recent complaints.

I've only directly tested this in HEAD, but I suspect the problem goes
back a ways.

On reflection I'm not even sure that this is strictly an autovacuum
bug. It can be cast more generically as "RecentGlobalXmin getting
used without ever having been set", and it sure looks to me like the
HOT patch may have introduced a few risks of that sort.

I'm thinking that maybe an appropriate fix is to insert a
GetTransactionSnapshot call at the beginning of InitPostgres'
transaction, thus ensuring that every backend has some vaguely sane
value for RecentGlobalXmin before it tries to do any database access.
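
As a toy illustration of that idea (not the actual backend code), the sketch
below models a cached global that is only refreshed as a side effect of taking
a snapshot. All names here (recent_global_xmin, take_snapshot, init_backend,
xmin_seen_by_vacuum_only_worker) are hypothetical stand-ins, and the
oldest-running XID is simply passed in rather than computed.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Assumed stand-in for FirstNormalTransactionId. */
#define FIRST_NORMAL_XID ((TransactionId) 3)

/* Stand-in for the cached global; it starts out at its boot value. */
static TransactionId recent_global_xmin = FIRST_NORMAL_XID;

/* Hypothetical snapshot call: refreshes the cached value as a side effect. */
static void
take_snapshot(TransactionId oldest_running_xid)
{
    recent_global_xmin = oldest_running_xid;
}

/* Hypothetical backend startup that always takes one snapshot up front. */
static void
init_backend(TransactionId oldest_running_xid)
{
    take_snapshot(oldest_running_xid);
}

/* A worker that never takes a snapshot itself just sees the cached value. */
static TransactionId
xmin_seen_by_vacuum_only_worker(void)
{
    return recent_global_xmin;
}
```

Without the init_backend call the worker sees the boot value; with it, the
worker sees whatever was current at startup, which is sane but can still be
stale by the end of a long run.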

Another thought is that even with that, an autovac worker is likely
to reach vac_update_datfrozenxid with a RecentGlobalXmin value that
was computed at the start of its run, and is thus rather old.
I wonder why vac_update_datfrozenxid is using the variable at all
rather than calling GetOldestXmin? It's not like that function is
so performance-critical that it needs to avoid calling GetOldestXmin.

Lastly, now that we have the PROC_IN_VACUUM test in GetSnapshotData,
is it actually necessary for lazy vacuum to avoid setting a snapshot?
It seems like it might be a good idea for it to do so in order to
keep its RecentGlobalXmin reasonably current.

I've only looked at this in HEAD, but I am thinking that we have
a real problem here in both HEAD and 8.3. I'm less sure how bad
things are in the older branches.

Comments?

regards, tom lane

