Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement backend-agnostic rpc._wait_all_workers() utility #32190

Closed
wants to merge 2 commits into from

Conversation

xush6528
Copy link
Contributor

@xush6528 xush6528 commented Jan 14, 2020

Stack from ghstack:

We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent.

  • Sort worker names.
  • Elect the first name as the leader in the ordered worker names.
  • Followers reports therir intent to synchronize to the leader.
  • Leader also reports to itself, when _wait_all_workers() called.
  • If all workers report their intent to proceed, leader send the command to every one to proceed.

Differential Revision: D19399908

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes, please review them on Phabricator!

We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent.

- Sort worker names.
- Elect the first name as the leader in the ordered worker names.
- Followers reports therir intent to synchronize to the leader.
- Leader also reports to itself, when `_wait_all_workers()` called.
- If all workers report their intent to proceed, leader send the command to every one to proceed.

Differential Revision: [D19399908](https://our.internmc.facebook.com/intern/diff/D19399908/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19399908/)!

[ghstack-poisoned]
_destroy_rref_context(_ignore_rref_leak)
finally:
Copy link
Contributor Author

@xush6528 xush6528 Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrshenli This fix the destruction segfault in Python3.5. We need to use finally clause to ensure running _cleanup_python_rpc_handler().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! would you please add comments for finally here so that we can understand it in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhaojuanmao Updated the comment.

@kostmo
Copy link
Member

kostmo commented Jan 15, 2020

💊 CircleCI build failures summary and remediations

As of commit 25ef38f:

Commit 25ef38f was recently pushed. Waiting for builds...


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 2 times.

We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent.

- Sort worker names.
- Elect the first name as the leader in the ordered worker names.
- Followers reports therir intent to synchronize to the leader.
- Leader also reports to itself, when `_wait_all_workers()` called.
- If all workers report their intent to proceed, leader send the command to every one to proceed.

Differential Revision: [D19399908](https://our.internmc.facebook.com/intern/diff/D19399908/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19399908/)!

[ghstack-poisoned]
xush6528 added a commit that referenced this pull request Jan 15, 2020
Pull Request resolved: #32190

We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent.

- Sort worker names.
- Elect the first name as the leader in the ordered worker names.
- Followers reports therir intent to synchronize to the leader.
- Leader also reports to itself, when `_wait_all_workers()` called.
- If all workers report their intent to proceed, leader send the command to every one to proceed.
ghstack-source-id: 96693296

Differential Revision: [D19399908](https://our.internmc.facebook.com/intern/diff/D19399908/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D19399908/)!
@xush6528 xush6528 changed the title Implement backend-agnostic rpc._wait_all_workers() utility" Implement backend-agnostic rpc._wait_all_workers() utility Jan 15, 2020
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for debugging this! This is awesome work!!

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 02f09a1.

@facebook-github-bot facebook-github-bot deleted the gh/xush6528/59/head branch January 18, 2020 15:17
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
…2190)

Summary:
Pull Request resolved: pytorch#32190

We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent.

- Sort worker names.
- Elect the first name as the leader in the ordered worker names.
- Followers reports therir intent to synchronize to the leader.
- Leader also reports to itself, when `_wait_all_workers()` called.
- If all workers report their intent to proceed, leader send the command to every one to proceed.
ghstack-source-id: 96693296

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rref_leak
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn

buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_rref_leak
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_worker_id
```

# Stress runs
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_heavy_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_heavy_rpc --stress-runs 10
```

Differential Revision: D19399908

fbshipit-source-id: 1dee607cd49adafe88534621a1c85e2736e2f595
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
…2190)

Summary:
Pull Request resolved: pytorch#32190

We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent.

- Sort worker names.
- Elect the first name as the leader in the ordered worker names.
- Followers reports therir intent to synchronize to the leader.
- Leader also reports to itself, when `_wait_all_workers()` called.
- If all workers report their intent to proceed, leader send the command to every one to proceed.
ghstack-source-id: 96693296

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rref_leak
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn

buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_rref_leak
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_worker_id
```

# Stress runs
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_heavy_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_heavy_rpc --stress-runs 10
```

Differential Revision: D19399908

fbshipit-source-id: 1dee607cd49adafe88534621a1c85e2736e2f595
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants