-
Notifications
You must be signed in to change notification settings - Fork 23.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement backend-agnostic rpc._wait_all_workers() utility #32190
Conversation
We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent. - Sort worker names. - Elect the first name as the leader in the ordered worker names. - Followers reports therir intent to synchronize to the leader. - Leader also reports to itself, when `_wait_all_workers()` called. - If all workers report their intent to proceed, leader send the command to every one to proceed. Differential Revision: [D19399908](https://our.internmc.facebook.com/intern/diff/D19399908/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19399908/)! [ghstack-poisoned]
_destroy_rref_context(_ignore_rref_leak) | ||
finally: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrshenli This fix the destruction segfault in Python3.5. We need to use finally
clause to ensure running _cleanup_python_rpc_handler()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! would you please add comments for finally here so that we can understand it in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhaojuanmao Updated the comment.
💊 CircleCI build failures summary and remediationsAs of commit 25ef38f: Commit 25ef38f was recently pushed. Waiting for builds... This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker. This comment has been revised 2 times. |
We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent. - Sort worker names. - Elect the first name as the leader in the ordered worker names. - Followers reports therir intent to synchronize to the leader. - Leader also reports to itself, when `_wait_all_workers()` called. - If all workers report their intent to proceed, leader send the command to every one to proceed. Differential Revision: [D19399908](https://our.internmc.facebook.com/intern/diff/D19399908/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19399908/)! [ghstack-poisoned]
Pull Request resolved: #32190 We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent. - Sort worker names. - Elect the first name as the leader in the ordered worker names. - Followers reports therir intent to synchronize to the leader. - Leader also reports to itself, when `_wait_all_workers()` called. - If all workers report their intent to proceed, leader send the command to every one to proceed. ghstack-source-id: 96693296 Differential Revision: [D19399908](https://our.internmc.facebook.com/intern/diff/D19399908/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19399908/)!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for debugging this! This is awesome work!!
This pull request has been merged in 02f09a1. |
…2190) Summary: Pull Request resolved: pytorch#32190 We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent. - Sort worker names. - Elect the first name as the leader in the ordered worker names. - Followers reports therir intent to synchronize to the leader. - Leader also reports to itself, when `_wait_all_workers()` called. - If all workers report their intent to proceed, leader send the command to every one to proceed. ghstack-source-id: 96693296 Test Plan: # Unit tests ``` buck test mode/dev-nosan //caffe2/test:rpc_fork buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rref_leak ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_spawn buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_wait_all_workers buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_rref_leak ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_worker_id ``` # Stress runs ``` buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_light_rpc --stress-runs 10 ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_light_rpc --stress-runs 10 ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_heavy_rpc --stress-runs 10 ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_heavy_rpc --stress-runs 10 ``` Differential Revision: D19399908 fbshipit-source-id: 1dee607cd49adafe88534621a1c85e2736e2f595
…2190) Summary: Pull Request resolved: pytorch#32190 We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent. - Sort worker names. - Elect the first name as the leader in the ordered worker names. - Followers reports therir intent to synchronize to the leader. - Leader also reports to itself, when `_wait_all_workers()` called. - If all workers report their intent to proceed, leader send the command to every one to proceed. ghstack-source-id: 96693296 Test Plan: # Unit tests ``` buck test mode/dev-nosan //caffe2/test:rpc_fork buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rref_leak ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_spawn buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_wait_all_workers buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_rref_leak ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_worker_id ``` # Stress runs ``` buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_light_rpc --stress-runs 10 ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_light_rpc --stress-runs 10 ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_heavy_rpc --stress-runs 10 ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_heavy_rpc --stress-runs 10 ``` Differential Revision: D19399908 fbshipit-source-id: 1dee607cd49adafe88534621a1c85e2736e2f595
Stack from ghstack:
We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent.
_wait_all_workers()
called.Differential Revision: D19399908
NOTE FOR REVIEWERS: This PR has internal Facebook specific changes, please review them on Phabricator!