Thursday, September 11, 2008

Re: [HACKERS] Transaction Snapshots and Hot Standby

Thanks for the detailed thinking. At least one very good new idea here,
some debate on other points.


On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:

> And still we can't escape the scenario that the slave receives a WAL
> record that vacuums away a tuple that's still visible according to a
> snapshot used in the slave. Even with the proposed scheme, this can happen:
>
> 1. Slave receives a snapshot from master
> 2. A long-running transaction begins on the slave, using that snapshot
> 3. Network connection is lost
> 4. Master hits a timeout, and decides to discard the snapshot it sent to
> the slave
> 5. A tuple visible to the snapshot is vacuumed
> 6. Network connection is re-established
> 7. Slave receives the vacuum WAL record, even though the long-running
> transaction still needs the tuple.

Interesting point. (4) is a problem, though not for the reason you
suggest. If we were to stop and start the master, that alone would be
enough to discard the snapshot the standby is using and so cause
problems. So the standby *must* tell the master the recentxmin it is
using, as you suggest later, so good thinking. Part of the handshake
between primary and standby must therefore be "what is your
recentxmin?", and the primary will then use the earlier (lower) of the
two.
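As a toy illustration of that handshake (invented names throughout,
and ignoring xid wraparound, which real code would handle with
TransactionIdPrecedes()), the primary would simply adopt the earlier
of the two values as its cleanup horizon:

/* Toy sketch only: fold the standby's reported xmin into the
 * primary's cleanup horizon. XidSketch and the function name are
 * invented; real code uses TransactionId and wraparound-safe
 * comparisons, not a plain "<".
 */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t XidSketch;

static XidSketch
effective_cleanup_xmin(XidSketch primary_xmin, XidSketch standby_xmin)
{
    /* The earlier (lower) xid wins, so tuples the standby may still
     * need are not vacuumed away on the primary. */
    return (standby_xmin < primary_xmin) ? standby_xmin : primary_xmin;
}

int
main(void)
{
    XidSketch primary_xmin = 5000;   /* oldest xmin among primary backends */
    XidSketch standby_xmin = 4200;   /* value reported in the handshake */

    printf("cleanup horizon = %u\n",
           (unsigned) effective_cleanup_xmin(primary_xmin, standby_xmin));
    return 0;
}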

> I like the idea of acquiring snapshots locally in the slave much more.

Me too. We just need to know how, if at all.

> As you mentioned, the options there are to defer applying WAL, or cancel
> queries. I think both options need the same ability to detect when
> you're about to remove a tuple that's still visible to some snapshot,
> just the action is different. We should probably provide a GUC to
> control which you want.

I don't see any practical way of telling whether a tuple removal will
affect a snapshot or not. Each removed row would need to be checked
against each standby snapshot, and even if those snapshots were
available, doing that would be too costly. Even if we could do it, ISTM
that neither option is acceptable: if we cancel queries, then querying
a frequently updated table becomes nearly impossible; if we delay
applying WAL, the standby can fall behind, impairing its usefulness for
HA. (If there were a practical way, then yes, we should have a
parameter for it.)
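To make that cost concrete, here is a deliberately crude sketch of the
kind of per-row check that would be needed. The types and the one-line
">=" test are invented stand-ins, not PostgreSQL's real snapshot
machinery, and real xid comparisons would have to be wraparound-safe:

/* Simplified illustration only: every row removed by a cleanup record
 * would need to be tested against every registered standby snapshot,
 * i.e. O(removed rows * snapshots) work on the recovery path.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t XidSketch;

typedef struct SnapshotSketch
{
    XidSketch xmin;   /* xids at or above this may still matter */
} SnapshotSketch;

static bool
removal_conflicts(XidSketch deleter_xid,
                  const SnapshotSketch *snaps, size_t nsnaps)
{
    for (size_t i = 0; i < nsnaps; i++)
    {
        /* If the deleting xid is not older than a snapshot's xmin,
         * that snapshot might still need the removed tuple. */
        if (deleter_xid >= snaps[i].xmin)
            return true;
    }
    return false;
}

int
main(void)
{
    SnapshotSketch standby_snaps[] = { { 4000 }, { 4700 } };

    printf("%d\n", removal_conflicts(4500, standby_snaps, 2)); /* 1: conflict */
    printf("%d\n", removal_conflicts(3000, standby_snaps, 2)); /* 0: safe */
    return 0;
}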

It was also suggested that we might move the removed rows into a side
table, but that makes me think of the earlier ideas for HOT, so I've
steered clear of it.

You might detect blocks that have had tuples removed from them *after*
a query started by one of the following:
* keeping a hash table of changed blocks - a very big data structure,
and hard to keep clean
* adding an additional "last cleaned LSN" onto every data block
* keeping an extra LSN on the bufhdr for each of the shared_buffers,
plus a hash table of blocks that have been cleaned and then paged out
Once such a removal is detected, your only option is to cancel the
query (a rough sketch of the LSN idea follows below).
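A minimal sketch of the "last cleaned LSN" idea, purely illustrative:
the struct and field names below are invented (buffer headers have no
such field today), and a real version would also have to cover blocks
that are cleaned and then evicted, as the last bullet notes:

/* Invented names throughout: record the LSN at which a block last had
 * tuples removed, and cancel any standby query whose snapshot predates
 * that cleanup.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t LsnSketch;

typedef struct BufferHdrSketch
{
    LsnSketch last_cleaned_lsn;  /* LSN of the latest cleanup replayed here */
} BufferHdrSketch;

/* Recovery path: remember when this block was last cleaned. */
static void
note_cleanup(BufferHdrSketch *buf, LsnSketch cleanup_lsn)
{
    if (cleanup_lsn > buf->last_cleaned_lsn)
        buf->last_cleaned_lsn = cleanup_lsn;
}

/* Query path: if the block was cleaned after the query's snapshot was
 * taken, tuples the query needs may already be gone, so cancel. */
static bool
query_must_cancel(const BufferHdrSketch *buf, LsnSketch snapshot_lsn)
{
    return buf->last_cleaned_lsn > snapshot_lsn;
}

int
main(void)
{
    BufferHdrSketch buf = { 0 };
    LsnSketch       query_snapshot_lsn = 1000;

    note_cleanup(&buf, 1200);   /* cleanup replayed after the query began */
    printf("%d\n", query_must_cancel(&buf, query_snapshot_lsn));   /* 1 */
    return 0;
}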

ISTM that if we want to avoid keeping recentxmin the same on both
primary and standby, then the only viable options are the three in the
original post.

> However, if we still want to provide the behavior that "as long as the
> network connection works, the master will not remove tuples still needed
> in the slave" as an option, a lot simpler implementation is to
> periodically send the slave's oldest xmin to master. Master can take
> that into account when calculating its own oldest xmin. That requires a
> lot less communication than the proposed scheme to send snapshots back
> and forth. A softer version of that is also possible, where the master
> obeys the slave's oldest xmin, but only up to a point.

I like this very much. Much simpler implementation and no need for a
delay in granting snapshots. I'll go for this as the default
implementation. Thanks for the idea.
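A rough sketch of how the "softer", capped version might be expressed.
The cap ("max_standby_lag_xids") is an invented knob, not a real GUC,
and wraparound handling is ignored: the primary honours the standby's
xmin, but never lets its cleanup horizon fall more than the cap behind
its own.

/* Hedged sketch of feedback-with-a-cap; all names are illustrative. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t XidSketch;

static XidSketch
horizon_with_feedback(XidSketch primary_xmin,
                      XidSketch standby_xmin,
                      uint32_t max_standby_lag_xids)
{
    /* Never hold cleanup back further than this floor. */
    XidSketch floor_xid = (primary_xmin > max_standby_lag_xids)
                          ? primary_xmin - max_standby_lag_xids
                          : 0;

    /* Honour the standby's xmin if it is earlier than our own... */
    XidSketch horizon = (standby_xmin < primary_xmin)
                        ? standby_xmin : primary_xmin;

    /* ...but only up to a point. */
    return (horizon < floor_xid) ? floor_xid : horizon;
}

int
main(void)
{
    printf("%u\n", (unsigned) horizon_with_feedback(5000, 3000, 1000)); /* 4000 */
    printf("%u\n", (unsigned) horizon_with_feedback(5000, 4800, 1000)); /* 4800 */
    return 0;
}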

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


