-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runner/rbd: do not reopen device if blacklisted #384
Conversation
tcmu_acquire_dev_lock() must not skip tcmu_notify_conn_lost() if handler told us explicitly that we are blaclisted. Otherwise, kernel lio won't be flushed, and a stale request sitting somewhere in kernel queues may come to us later, when we have reacquired the lock. Signed-off-by: Maxim Patlasov <[email protected]>
@@ -299,7 +299,8 @@ int tcmu_acquire_dev_lock(struct tcmu_device *dev) | |||
retries++; | |||
goto retry; | |||
} | |||
|
|||
/* fall through */ | |||
case TCMUR_LOCK_BLACKLISTED: | |||
tcmu_dev_dbg(dev, "Fail handler device connection.\n"); | |||
tcmu_notify_conn_lost(dev); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch and debugging this.
We used to do something similar. The problem was that for failback or failover with N > 2 nodes, this would cause the initiator to failover extra times because the conn lost call kills the session the initiator is trying to use causing it to try another path then later switch back over when the tpg is enabled. I think that is definitely better than data corruption though.
Let me do some more testing in the morning.
Hey, We are going to try and go with the STPG based approach I posted on the list. That should prevent us from hitting the ping pong storms during failback with multi-lun targets. I will update the patches I posted on the list with vmware/windows support and fix some bugs in the next data. |
Mike, Does STPG based approach assume that tcmu-runner won't support implicit ALUA? Because otherwise, it seems to be a bug to skip tcmu_notify_conn_lost() on acquiring the lock. What do you think? |
runner reports in the RTPG what we support and the initiator follows that. However, if there were to be a bug in the initiator where it ignores our settings, then runner will not perform implicit failovers when in explicit mode (the ceph-iscsi-config patch is what switches runner to use explicit instead of implicit mode). So for your test it would work like this:
Note that for the tools/configfs we set explicit and implicit just because the latter is required for the tools to interact with the configfs interface. Those values do not get reported to the initiator for tcmu though, so in the tcmu runner RTPG handler in the posted patch we only report explicit support. |
So just to be clear I will probably just delete the implicit support code, because with explicit it is never used and I am not sure if a user would ever want to turn it on. |
Thank you for detailed explanation. Seems STPG plays the role of barrier:
That's exactly I wanted to clarify, thank you! |
tcmu_acquire_dev_lock() must not skip tcmu_notify_conn_lost()
if handler told us explicitly that we are blaclisted.
Otherwise, kernel lio won't be flushed, and a stale request
sitting somewhere in kernel queues may come to us later, when
we have reacquired the lock.
Signed-off-by: Maxim Patlasov [email protected]