python - model.eval() returns a NoneType object when using DeepSpeed

I am using DeepSpeed to speed up model training, and I ran into a problem when evaluating the model on the validation dataset. Here is the problematic code snippet:

def evaluate(self, epoch_num=None, keep_all=True):
    print("self.model:", self.model)

    self.model = self.model.eval()
    print("self.model after eval:", self.model)

This is the output log:

self.model: DeepSpeedEngine(
  (module): TSTransformerEncoder(
    (project_inp): Linear(in_features=6, out_features=128, bias=True)
    (pos_enc): LearnablePositionalEncoding(
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer_encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-2): 3 x TransformerBatchNormEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
          )
          (linear1): Linear(in_features=128, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=256, out_features=128, bias=True)
          (norm1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (norm2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (output_layer): Linear(in_features=128, out_features=6, bias=True)
    (dropout1): Dropout(p=0.1, inplace=False)
  )
)
self.model after eval: None

Without DeepSpeed, the model trains and evaluates normally. The problem appears only after wrapping the model with DeepSpeed.

This is how I initialize DeepSpeed:

    model, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        config_params=ds_config
    )

The ds_config file:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
 
    "optimizer": {
        "params": {
            "lr": 0.001,
            "weight_decay": 0,
            "optimizer_class": "optimizers.RAdam"
        }
    },
 
    "zero_optimization": {
        "stage": 1,
        "overlap_comm": true,
        "contiguous_gradients": true
    },


    "zero_allow_untested_optimizer": true,
    "train_batch_size": 256,
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

Problem Analysis

I originally expected that self.model.eval() would only set the model to evaluation mode, and the model itself would not become None. However, the actual output shows that self.model becomes None after calling the eval() method. I suspect that this might be related to the encapsulation or configuration of DeepSpeed, but I'm not sure about the specific cause.

Relevant Environment Information

Python Version: 3.8.20
PyTorch Version: 2.4.1
DeepSpeed Version: 0.16.4

Asked Mar 15 at 17:28 by external

1 Answer

From the DeepSpeed source code:

class DeepSpeedEngine(Module):
    r"""DeepSpeed engine for training."""
    ...

    def eval(self):
        r""""""

        self.warn_unscaled_loss = True
        self.module.train(False)

The eval method updates the training status of the wrapped module but does not return anything. This differs from the standard PyTorch eval method, which returns the model itself.

As a result, self.model.eval() puts the model into eval mode internally but returns None. When you assign that result back with self.model = self.model.eval(), you are effectively running self.model = None.
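The difference can be reproduced without PyTorch or DeepSpeed installed. The following is a minimal sketch in plain Python: PlainModule and EngineLike are hypothetical stand-ins written only for illustration, not the real classes. PlainModule mimics the return-self behaviour of a standard PyTorch module, and EngineLike mimics the eval method quoted above.

```python
class PlainModule:
    """Hypothetical stand-in: eval() returns self, like a standard PyTorch module."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


class EngineLike:
    """Hypothetical stand-in: eval() mirrors the DeepSpeedEngine snippet above."""

    def __init__(self, module):
        self.module = module

    def eval(self):
        # Switches the wrapped module to eval mode, but has no return
        # statement, so the call evaluates to None.
        self.module.train(False)


plain = PlainModule()
engine = EngineLike(PlainModule())

chained = plain.eval()   # the same object comes back, so reassignment is harmless
lost = engine.eval()     # None comes back, so reassignment would discard the engine

print(chained is plain)        # True
print(lost)                    # None
print(engine.module.training)  # False: eval mode was still applied
```

The last line shows that the mode switch itself succeeded. Only the return value differs, which is why the reassignment is the sole cause of the None.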

You can change your code to:

def evaluate(self, epoch_num=None, keep_all=True):
    print("self.model:", self.model)

    self.model.eval()  # simply call `eval`, no assignment necessary
    print("self.model after eval:", self.model)

Note that this also works for standard PyTorch models. eval primarily updates the internal state of the model object, so reassigning its result to the same variable is unnecessary for both DeepSpeedEngine and standard PyTorch models.
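If the mode switch happens in several places, one option is to wrap it in a small context manager that never touches the return value of eval. This is only a sketch under stated assumptions: eval_mode and FakeEngine are hypothetical names introduced here, and the helper assumes the model object exposes both eval() and train(), and that returning to training mode afterwards is what you want.

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(model):
    """Run a block with `model` in eval mode, then switch back to train mode.

    The return values of eval() and train() are ignored on purpose, so the
    helper behaves the same whether they return the model or None.
    Assumption: the caller wants training mode restored afterwards.
    """
    model.eval()
    try:
        yield model
    finally:
        model.train()


class FakeEngine:
    """Hypothetical stand-in whose eval() and train() both return None."""

    def __init__(self):
        self.mode = "train"

    def eval(self):
        self.mode = "eval"

    def train(self, mode=True):
        self.mode = "train" if mode else "eval"


engine = FakeEngine()
with eval_mode(engine) as m:
    mode_inside = m.mode   # "eval" while the block runs
mode_after = engine.mode   # "train" again once the block exits
```

Inside evaluate, the validation loop would go in the with block, and self.model keeps pointing at the same engine object throughout.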
