Add fault tolerance support for trial failure #424

jinan-zhou · 2019-03-08T02:54:42Z

In practice, even if the training container is correctly implemented, trials may fail for a variety of reasons, such as insufficient GPU memory. If a trial fails, GetEvaluationResult() returns None. The suggestion service should be able to handle that case. My solution is:

If at least one trial succeeded, use the metrics of the successful trials to do the update.
If all the trials failed:
1. First, try to respawn the previous trials after sleeping for RESPAWN_SLEEP seconds.
2. If respawning the trials RESPAWN_LIMIT times still yields no valid results, fail the task, because this may indicate that the training container itself has errors.
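
The strategy above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only RESPAWN_SLEEP, RESPAWN_LIMIT, and the convention that a failed trial reports None come from the description. The helper names (`average_successful`, `collect_with_respawn`, `run_trials`, `TrialsFailedError`) and the default constant values are hypothetical.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical defaults; the PR description does not state the real values.
RESPAWN_SLEEP = 20  # seconds to wait before respawning failed trials
RESPAWN_LIMIT = 3   # maximum number of respawn attempts

TrialResults = Dict[str, Optional[float]]


class TrialsFailedError(RuntimeError):
    """Raised when every trial still fails after all respawn attempts."""


def average_successful(results: TrialResults) -> Optional[float]:
    """Average the metrics of successful trials; None if all trials failed."""
    # A failed trial is represented by None, mirroring GetEvaluationResult().
    succeeded = [value for value in results.values() if value is not None]
    if not succeeded:
        return None
    return sum(succeeded) / len(succeeded)


def collect_with_respawn(
    run_trials: Callable[[], TrialResults],
    respawn_limit: int = RESPAWN_LIMIT,
    respawn_sleep: float = RESPAWN_SLEEP,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Run trials once, then respawn up to respawn_limit times if all fail.

    run_trials is a hypothetical callable that spawns the trials and returns
    a mapping from trial ID to metric (None for a failed trial).
    """
    results = run_trials()
    respawns = 0
    while True:
        average = average_successful(results)
        if average is not None:
            # At least one trial succeeded: update with the partial results.
            return average
        if respawns >= respawn_limit:
            # Repeated total failure suggests a broken training container.
            raise TrialsFailedError(
                f"all trials failed after {respawns} respawn attempts"
            )
        sleep(respawn_sleep)
        respawns += 1
        results = run_trials()
```

The `sleep` parameter is injectable only so that the waiting behaviour can be exercised without real delays. Passing a mapping with one None entry and the metrics 0.6475 and 0.6709 to `average_successful` yields approximately 0.6592, the average over the two successful trials.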

This PR also fixes some small bugs and adds some important TODOs.

An example of the fault handling output:

---------------------------------------------------------------------------
Suggestion Step 1 for StudyJob nas-example-1 (ID: e9850c4a885acb19)
---------------------------------------------------------------------------
>>> 2 Trials succeeded, 1 Trials failed:
p55d6ef3720fec95: Failed
n8f94938e123b38d: 0.6475
g30851e3c259d07a: 0.6709
The average is 0.6592

>>> Suggestion updated. LSTM Controller Reward: 44.43968200683594

jinan-zhou · 2019-03-08T22:28:22Z

/assign @hougangliu

pkg/suggestion/nasrl_service.py

andreyvelich · 2019-03-08T22:38:33Z

@hougangliu @YujiOshima @johnugeorge
Do we have any way to mark a StudyJob as Failed from the Suggestion?

andreyvelich · 2019-03-13T00:06:18Z

/lgtm

jinan-zhou · 2019-03-13T00:13:15Z

@hougangliu This PR is well tested and ready to merge. Please take a look.

hougangliu · 2019-03-13T23:11:21Z

/lgtm

jinan-zhou · 2019-03-14T01:29:59Z

Could you approve it so that I can go on, @hougangliu?

hougangliu · 2019-03-14T01:56:23Z

/approve
thanks!

k8s-ci-robot · 2019-03-14T01:56:26Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hougangliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [hougangliu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

add fault tolerance for trial failure

de54e9c

k8s-ci-robot requested review from andreyvelich and texasmichelle March 8, 2019 02:54

k8s-ci-robot added the size/M label Mar 8, 2019

fix a small typo

6a572f6

k8s-ci-robot assigned hougangliu Mar 8, 2019

andreyvelich reviewed Mar 8, 2019

pkg/suggestion/nasrl_service.py (outdated)

DeeperMind added 5 commits March 11, 2019 11:53

fix a typo

62031e5

improve fault processing strategy

ae902d7

add an important TODO

c7dcc0f

fix typo

fa19ab9

add some more TODOs

9e58236

k8s-ci-robot assigned andreyvelich Mar 13, 2019

k8s-ci-robot added the lgtm label Mar 13, 2019

k8s-ci-robot added the approved label Mar 14, 2019

k8s-ci-robot merged commit 06f955b into kubeflow:master Mar 14, 2019
