Support replicating from READ-ONLY DB #169
Comments
We definitely use Ghostferry from a read-only replica internally and have never encountered any issue. In fact, RowLock is definitely needed for data correctness purposes, not just for verification. As I recall, the proof of why skipping it would cause data corruption is not an intuitive one. Since this is a common question asked by many people, I can try to find a "prose proof" that I wrote a long time ago in our documents about Ghostferry and paste it here in the nearish future.
I just reran the TLC against the TLA+ specification by making the following change and regenerating the TLA+ from PlusCal:
Each label in PlusCal is an atomic step. If the read and write during the data copy are not in an atomic step, they would be in 2 labels, as above. This simulates a scenario where FOR UPDATE is turned off. As a result, TLC tells me I have an invariant violation. It takes 26 steps in the model to get to this error condition and a runtime of 2 min on my computer. At one point I had a prose explanation of why this is the case, but it's been many years and I have since forgotten. Unfortunately, the TLC trace showing how the error can be reproduced is a bit daunting and difficult to parse. I don't have time for it right now, so I'm going to paste it here verbatim. In the future I might post a better explanation of it by parsing the TLC output. The key thing we can see is that at the 26th step, the Target and Source tables have different entries:
That is very interesting. Could it be that you are connecting to the source DB using a user with the SUPER privilege?
I just checked and yes we have SUPER.
Ok, thanks for confirming. I'll have to think about this for a bit. I dislike giving our replication user SUPER on the source.
The permissions you need are:
Good to know - would be good to include in the docs somewhere. NOTE: for me this poses a problem, as we're using GCP and Cloud SQL does not support SUPER (https://cloud.google.com/sql/faq).
I'll see what I can do in our environment.
As @shuhaowu pointed out above, the row lock is needed for data correctness, not just for verification.
As we can see in step 4 above, the application is allowed to update the row selected in step 3 because the row is not locked with FOR UPDATE.
cc @Shopify/pods
Thanks for the detailed explanation! Indeed not the most straight-forward sequence of events or something one would think of immediately :-) I agree that locking is the correct strategy, no doubt there.
The problem in my scenario is that I cannot use locking, because of the environment in which I deploy (namely reading from a read-only GCP Cloud SQL slave DB). Not even a root/SUPER user has the privileges to lock on such a system.
From your explanation above, it seems to be related to transactions and rollbacks. I'm not entirely sure, but have the strong feeling, that a rolled-back transaction would never make it onto the slave, so the scenario would not occur (because the master reverts and thus never commits the transaction to the binlog). Does that sound reasonable?
Also note that our application/use-case guarantees that replicating from a slave is safe. Fail-overs will not occur unless it's guaranteed that the slave has received all data.
I think there's some misunderstanding here about the race condition @kolbitsch-lastline. The fundamental issue is that the BatchWriter (aka DataIterator / TableIterator) requires the source data to remain unchanged while it commits its writes to the target database. Breaking the atomicity of "read data from source then write data to target" is what causes data corruption. If there was no atomicity in the read-then-write operation, this is the race condition you'd encounter:
If all processes (binlog streamer, application, and table iterator) stop at this point, it's clear that the source table has r0 but the target table has a value of r1.
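The numbered steps of that race did not survive in this thread, so the following is only a reconstruction: a minimal, self-contained Go sketch (not Ghostferry code) replaying one interleaving that ends in the state described above, source at r0 and target at r1. The full-old-row matching behaviour of the binlog writer is an assumption about how events are applied.

```go
package main

import "fmt"

// One possible interleaving of the "read from source, then write to target"
// race when the data copy is not atomic. Illustrative only; the assumption is
// that the binlog writer applies an UPDATE by matching the full old row.
func main() {
	source := map[string]string{"row": "r1"} // current value on the source
	target := map[string]string{}            // row has not been copied yet

	// 1. The data iterator reads the row from the source without FOR UPDATE.
	read := source["row"] // reads "r1"

	// 2. The application updates the row on the source to r0 and commits,
	//    emitting a binlog event "UPDATE ... SET row = r0 WHERE row = r1".
	source["row"] = "r0"

	// 3. The binlog writer applies that event to the target. The row does not
	//    exist on the target yet, so the full-row match finds nothing: no-op.
	if target["row"] == "r1" {
		target["row"] = "r0"
	}

	// 4. The data iterator now writes the value it read in step 1 to the target.
	target["row"] = read // writes the stale "r1"

	// Nothing will ever revisit this row, so the copies have diverged for good.
	fmt.Printf("source=%s target=%s\n", source["row"], target["row"]) // source=r0 target=r1
}
```

With the row lock held, the update in step 2 would block until the iterator's transaction commits after step 4, so the binlog event would find the freshly copied row on the target and correct it.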
I hope the last reply clears up the race condition for you. Now, onto your issue, which is the lack of permissions to hold a transaction open on a read slave.
First of all, in the example explained above, the "application" can be equated to the MySQL replication thread on the read slave. Hence, the race condition still holds whether you're running ghostferry off of a read slave (that is guarded from your "real" application) or a writer.
Secondly, with the current ghostferry implementation I strongly suspect you're out of luck, which makes us sad too!
Finally, I don't think all hope is lost, because internally we have discussed that mutually excluding the BinlogWriter from the DataIterator might be an alternative to holding the FOR UPDATE row lock on the source.
In any case, if you want to explore this, we can support you, but merging it upstream would require re-validation of the idea in TLA+ (the link directs you to the current model for our "read-then-write atomicity").
Let us know your thoughts, and thanks again for all the discussion!
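As an illustration of the mutual-exclusion idea mentioned above, here is a rough Go sketch. The function names and types are hypothetical, not Ghostferry's API, and whether this scheme is actually sufficient is exactly what re-validating the TLA+ model would have to show.

```go
package sketch

import "sync"

// copyLock serializes the data iterator's read-then-write batches against the
// binlog writer, so no binlog event is applied to the target while a copy
// batch is between its read and its write.
var copyLock sync.Mutex

type row struct{ Key, Value string }

// copyBatch stands in for the DataIterator/BatchWriter: read a batch from the
// source and write it to the target as one critical section.
func copyBatch(readFromSource func() []row, writeToTarget func([]row)) {
	copyLock.Lock()
	defer copyLock.Unlock()
	writeToTarget(readFromSource())
}

// applyBinlogEvent stands in for the BinlogWriter: it must wait while a copy
// batch is in flight.
func applyBinlogEvent(apply func()) {
	copyLock.Lock()
	defer copyLock.Unlock()
	apply()
}
```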
FYI, I'm convinced I had responded to your explanation above, but I can't find it anywhere... not sure if it was deleted, if I posted on a wrong ticket, or if I was being stupid and just never hit the comment button.
I finally get the problem and agree that there has to be a lock. I find putting the lock into the application itself (rather than into the source DB) cleaner. I'm currently working on testing a patch that I would love to hear what you guys (and TLA+ ;-) ) have to say about.
With this commit, we allow using an in-application lock in ghostferry, instead of using the source DB as lock. The lock is required to avoid race conditions between the data iteration/copy and the binlog writer. The default behavior is preserved; a new option "LockStrategy" allows moving the lock from the source DB into ghostferry, or disabling the lock altogether. This fixes Shopify#169 Change-Id: I20f1d2a189078a3877f831c7a98e8ca956620cc7
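The commit message above describes the new option only in prose; a hypothetical sketch of what such a "LockStrategy" option could look like follows. The actual names and values in the proposed patch may differ.

```go
package sketch

// LockStrategy selects where the read-then-write critical section is enforced.
type LockStrategy string

const (
	// LockOnSourceDB keeps the current default: SELECT ... FOR UPDATE on the source.
	LockOnSourceDB LockStrategy = "source"
	// LockInGhostferry holds an in-process lock shared by the DataIterator and BinlogWriter.
	LockInGhostferry LockStrategy = "ghostferry"
	// LockNone disables locking entirely (only safe if nothing writes to the copied rows).
	LockNone LockStrategy = "none"
)
```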
I agree that the application-level lock is cleaner. I'm also interested to see it modeled, since I'm quite sure there is a correct variation. The current implementation has been validated extensively in production, and this does add some risk to change, but it is clear the change will benefit some environments.
The current ghostferry-copydb default behavior is to set the RowLock property on the data iterator when copying rows from the source to the target DB. Unfortunately, the reason why is not documented very well (that I could find).
This default behavior means that it is not possible to replicate from a MySQL slave DB, because such a server is typically run in READ-ONLY mode, preventing the SELECT ... FOR UPDATE used by ghostferry.
My assumption is that this is to keep the data consistent between the reading and the row verification. Beyond verification, it should be safe to operate without the FOR UPDATE (which also improves performance, as we don't require the round-trips for the transaction, which is always rolled back anyways), because any modifications on the source DB overlapping with a row read should be "fixed" by the binlog writer anyways. Can you confirm my assumption is correct?
If so, would it make sense to disable the use of RowLock if no verifier is enabled, or at least allow the DataIterator to disable the row lock as part of the constructor?
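For illustration, a minimal Go sketch (not Ghostferry code; DSN and table names are placeholders) of the two read variants discussed in this issue. On a read_only replica the locking variant is rejected for non-SUPER users (typically MySQL error 1290), which is the failure this issue describes.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(replica:3306)/mydb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback() // the transaction is only ever rolled back, never committed

	// RowLock enabled (the current default): a locking read, rejected on a
	// READ-ONLY replica for non-SUPER users.
	if rows, err := tx.Query("SELECT * FROM mytable WHERE id > ? ORDER BY id LIMIT 200 FOR UPDATE", 0); err != nil {
		fmt.Println("locking read failed:", err)
	} else {
		rows.Close()
	}

	// RowLock disabled (what this issue asks for): a plain read, which a
	// read-only replica will happily serve.
	if rows, err := tx.Query("SELECT * FROM mytable WHERE id > ? ORDER BY id LIMIT 200", 0); err != nil {
		fmt.Println("plain read failed:", err)
	} else {
		rows.Close()
	}
}
```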