ReduceLROnPlateau with a naive Backtracking #2478

Open
Taha-Bahadori opened this issue Aug 17, 2017 · 8 comments
Assignees
Labels
feature A request for a proper, new feature. module: optimizer Related to torch.optim triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@Taha-Bahadori

Taha-Bahadori commented Aug 17, 2017

Is it possible to implement simple backtracking for the ReduceLROnPlateau scheduler?
That is, store the best model coefficients and reload them upon rate reduction.

In my experiments, this helps speed up learning, though it might be expensive for very large models.

cc @vincentqb

@Taha-Bahadori
Author

@soumith: I created a subclass to do this as follows. It works as I described above.


from torch.optim.lr_scheduler import ReduceLROnPlateau

class ReduceLROnPlateauBT(ReduceLROnPlateau):
    def __init__(self, optimizer, mode='min', factor=0.1, patience=10,
                 verbose=False, threshold=1e-4, threshold_mode='rel',
                 cooldown=0, min_lr=0, eps=1e-8, model=None):
        super(ReduceLROnPlateauBT, self).__init__(optimizer, mode=mode,
                                                  factor=factor, patience=patience,
                                                  verbose=verbose, threshold=threshold, 
                                                  threshold_mode=threshold_mode,
                                                  cooldown=cooldown, min_lr=min_lr, eps=eps)
        self.model = model
        self.model_state_dict = None if model is None else model.state_dict()

    def step(self, metrics, epoch=None):
        current = metrics
        if epoch is None:
            epoch = self.last_epoch = self.last_epoch + 1
        self.last_epoch = epoch

        if self.is_better(current, self.best):
            self.best = current
            self.num_bad_epochs = 0
            if self.model is not None: # Saving good models
                self.model_state_dict = self.model.state_dict()
        else:
            self.num_bad_epochs += 1

        if self.in_cooldown:
            self.cooldown_counter -= 1
            self.num_bad_epochs = 0  # ignore any bad epochs in cooldown

        if self.num_bad_epochs > self.patience:
            self._reduce_lr(epoch)
            self.cooldown_counter = self.cooldown
            self.num_bad_epochs = 0
            if self.model is not None: # Loading good models
                self.model.load_state_dict(self.model_state_dict)
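
For example (pseudo-code; Net, train_one_epoch and evaluate are placeholders for whatever model and loops you already have), it plugs into a training loop just like the stock scheduler:

model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateauBT(optimizer, factor=0.5, patience=5, model=model)

for epoch in range(100):
    train_one_epoch(model, optimizer)
    val_loss = evaluate(model)
    scheduler.step(val_loss)  # lowers the LR and reloads the best weights on a plateau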

@Jiaming-Liu
Contributor

Jiaming-Liu commented Aug 23, 2017

Hi, I am the author of #1370.
I currently do this with some code in the main training loop. Since ReduceLROnPlateau only has access to the optimizer, and optimizer.state_dict() does NOT include the parameter tensors themselves (I suspect this is a bug), backtracking cannot be done naturally from inside the scheduler.

# after optim.load_state_dict( ... )
In [17]: optim.state_dict()
Out[17]:
{'param_groups': [{'betas': (0.9, 0.999),
   'eps': 1e-08,
   'lr': 0.001,
   'params': [139683866380144,
    139683866381584,
    139683866379472,
    139683866381680],
   'weight_decay': 0}],
 'state': {}}

@soumith any idea?

Edit: I believe optim.state_dict does not contain all the parameters of a model either (e.g. BatchNorm's running mean and variance, which are buffers rather than optimizer parameters). Therefore we would still need access to the nn.Module even without this issue.
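
A quick check of what each state_dict actually contains (a sketch; the tiny Sequential model below is only illustrative):

import torch

# A Linear layer plus BatchNorm, which owns non-parameter buffers
# (running_mean / running_var).
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
optim = torch.optim.SGD(model.parameters(), lr=0.1)

# The module's state_dict holds every tensor needed to restore it,
# including the BatchNorm buffers.
print(sorted(model.state_dict().keys()))

# The optimizer's state_dict only references its trainable parameters
# plus hyperparameters; buffers never appear, and the parameter tensors
# themselves are not stored.
print(optim.state_dict()['param_groups'][0].keys())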

@Jiaming-Liu
Contributor

Jiaming-Liu commented Aug 23, 2017

@Taha-Bahadori Would you please make it a PR?
By the way, self.model_state_dict = self.model.state_dict() would NOT make a copy of the current state; the returned dict still aliases the live parameter tensors. You might have to pickle (or deep-copy) the state_dict for future loading.

You can see what happens by running this snippet:

import torch

m = torch.nn.Linear(1, 2)
optim = torch.optim.Adam(m.parameters())
state_dict = m.state_dict()
print(state_dict)
m.state_dict()['weight'][0] = 1000  # in-place edit through a fresh state_dict
print(state_dict)                   # the saved dict shows 1000 too: it was never a copy
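
For reference, one way to take a real snapshot is to deep-copy the state_dict (a sketch using copy.deepcopy; pickling would work as well):

import copy
import torch

m = torch.nn.Linear(1, 2)
snapshot = copy.deepcopy(m.state_dict())  # owns its own tensors, no aliasing

with torch.no_grad():
    m.weight.fill_(1000)                  # in-place update, like optimizer.step()

print(snapshot['weight'])                 # still the original values
m.load_state_dict(snapshot)               # restores the pre-update weights
print(m.state_dict()['weight'])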

@Taha-Bahadori
Author

@Jiaming-Liu I think there is a mistake in your code snippet. Here is an example showing that saving and loading the state_dict as above should work:

import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super(M, self).__init__()
        self.mem = nn.Parameter(torch.zeros(1))

m = M()
print(m)
print(m.state_dict())

# Saving the current state_dict
sd = m.state_dict()

# Changing the value of the parameter
m.mem.data = torch.ones(1)
print(m.state_dict())

# Now loading the original zero parameter back
m.load_state_dict(sd)
print(m.state_dict())

@Jiaming-Liu
Contributor

Jiaming-Liu commented Aug 24, 2017

@Taha-Bahadori I think this snippet is closer to the real use case. Note that optimizer.step() updates tensors in place. That is, m.mem.data = torch.ones(1) in your snippet should really be something like m.mem.data.copy_(torch.ones(1)) instead.

Also, your ReduceLROnPlateauBT ignores the optimizer's state_dict, which contains important history (like momentum buffers for SGD).

import torch

net = torch.nn.Linear(1, 2)
optim = torch.optim.Adam(net.parameters())
state_dict = net.state_dict()
print(state_dict)

x = torch.autograd.Variable(torch.FloatTensor([[1], [2]]))
y = torch.autograd.Variable(torch.FloatTensor([[0, 1], [2, 3]]))
loss = torch.nn.functional.mse_loss(net(x), y)
loss.backward()
optim.step()  # changes the parameter values in place

net.load_state_dict(state_dict)
print(net.state_dict())  # still shows the post-step values: the saved dict aliased the live tensors
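
If backtracking is handled outside the scheduler, a small pair of helpers along these lines (a sketch; copy.deepcopy keeps the snapshots independent of later in-place updates) keeps the optimizer history in sync with the weights:

import copy

def take_snapshot(model, optimizer):
    # Deep copies, so later optimizer.step() calls cannot mutate the snapshot.
    return copy.deepcopy(model.state_dict()), copy.deepcopy(optimizer.state_dict())

def restore_snapshot(model, optimizer, snapshot):
    model_state, optim_state = snapshot
    model.load_state_dict(model_state)
    optimizer.load_state_dict(optim_state)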

@yf225 yf225 added feature A request for a proper, new feature. module: optimizer Related to torch.optim needs research We need to decide whether or not this merits inclusion, based on research world triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Feb 19, 2020
@gchanan
Contributor

gchanan commented Mar 10, 2020

@vincentqb can you make a call on this?

@vincentqb
Contributor

Given the cost associated with backtracking, this mechanism should be left under the control of the user. The scheduler could have a flag indicating its status though, so the user can save/load based on that flag; see comment.
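
A sketch of what that user-side control could look like (detecting a reduction by comparing learning rates before and after scheduler.step(); model, optimizer, train_one_epoch and evaluate are placeholders):

import copy
from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
best_loss, best_state = float('inf'), None

for epoch in range(100):
    train_one_epoch(model, optimizer)
    val_loss = evaluate(model)

    if val_loss < best_loss:
        best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())

    lrs_before = [g['lr'] for g in optimizer.param_groups]
    scheduler.step(val_loss)
    lrs_after = [g['lr'] for g in optimizer.param_groups]

    if lrs_after != lrs_before and best_state is not None:
        # The scheduler just reduced the LR: backtrack to the best weights seen so far.
        model.load_state_dict(best_state)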

@vincentqb vincentqb removed the needs research We need to decide whether or not this merits inclusion, based on research world label Mar 10, 2020
@liuxiaotong15

This is really a nice issue. So, what was the final decision on "needs research"?
