Add fault tolerance support for trial failure #424
/assign @hougangliu
@hougangliu @YujiOshima @johnugeorge
/lgtm
@hougangliu This PR is well tested and ready to merge. Please take a look.
/lgtm
Could you approve it so that I can go on @hougangliu
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hougangliu
The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
In practice, even if the training container is correctly implemented, trials may fail for a variety of reasons, such as insufficient GPU memory. If a trial fails, `GetEvaluationResult()` will return None. The suggestion should be able to handle that case. My solution is:
1. Respawn the failed trials after `RESPAWN_SLEEP` seconds.
2. If respawning `RESPAWN_LIMIT` times still cannot collect valid results, then fail the task, because this may indicate that the training container itself has errors.

Besides, this PR also fixes some small bugs and adds some important TODOs.
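The respawn-and-give-up behaviour described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `collect_with_respawn`, `TrialFailedError`, and the `get_result` / `respawn` callables are hypothetical names, and the numeric values given to `RESPAWN_SLEEP` and `RESPAWN_LIMIT` are assumptions. `get_result` stands in for a call such as `GetEvaluationResult()` that returns None when the trial has failed.

```python
import time

# Assumed values for illustration; the PR does not state them here.
RESPAWN_SLEEP = 20  # seconds to wait before respawning a failed trial
RESPAWN_LIMIT = 10  # maximum number of respawns before failing the task


class TrialFailedError(RuntimeError):
    """Raised when no valid result is collected within the respawn limit."""


def collect_with_respawn(get_result, respawn,
                         limit=RESPAWN_LIMIT,
                         sleep_seconds=RESPAWN_SLEEP,
                         sleep=time.sleep):
    """Return the first non-None result, respawning the trial on failure.

    get_result: hypothetical callable returning the evaluation result,
                or None if the trial failed.
    respawn:    hypothetical callable that restarts the failed trial.
    sleep:      injectable so the wait can be skipped in tests.
    """
    result = get_result()
    respawns = 0
    while result is None:
        if respawns >= limit:
            # Repeated failures suggest the training container is broken.
            raise TrialFailedError(
                "no valid result after %d respawns" % limit)
        sleep(sleep_seconds)
        respawn()
        respawns += 1
        result = get_result()
    return result
```

Bounding the number of respawns keeps a transient failure (for example, a temporary lack of GPU memory) from failing the whole task, while a persistent failure still surfaces as an error instead of looping forever.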
An example of fault handling output