
python - model.eval() return a NoneType object when using deepspeed - Stack Overflow


When I try to speed up model training with DeepSpeed, a problem occurs when I evaluate the model on the validation dataset. Here is the problematic code snippet:

def evaluate(self, epoch_num=None, keep_all=True):
        print("self.model:", self.model)

        self.model = self.model.eval()
        print("self.model after eval:", self.model)

Then the output log:

self.model: DeepSpeedEngine(
  (module): TSTransformerEncoder(
    (project_inp): Linear(in_features=6, out_features=128, bias=True)
    (pos_enc): LearnablePositionalEncoding(
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer_encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-2): 3 x TransformerBatchNormEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
          )
          (linear1): Linear(in_features=128, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=256, out_features=128, bias=True)
          (norm1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (norm2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (output_layer): Linear(in_features=128, out_features=6, bias=True)
    (dropout1): Dropout(p=0.1, inplace=False)
  )
)
self.model after eval: None

Without using the DeepSpeed tool, the model can be trained and evaluated normally. However, after using DeepSpeed, the above problem occurs.

This is how I initialize DeepSpeed:

    model, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        config_params=ds_config
    )

The ds_config file:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
 
    "optimizer": {
        "params": {
            "lr": 0.001,
            "weight_decay": 0,
            "optimizer_class": "optimizers.RAdam"
        }
    },
 
    "zero_optimization": {
        "stage": 1,
        "overlap_comm": true,
        "contiguous_gradients": true
    },


    "zero_allow_untested_optimizer": true,
    "train_batch_size": 256,
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

Problem Analysis

I originally expected that self.model.eval() would only set the model to evaluation mode, and the model itself would not become None. However, the actual output shows that self.model becomes None after calling the eval() method. I suspect that this might be related to the encapsulation or configuration of DeepSpeed, but I'm not sure about the specific cause.

Relevant Environment Information

  • Python Version: 3.8.20

  • PyTorch Version: 2.4.1

  • DeepSpeed Version: 0.16.4

asked Mar 15 at 17:28 by external

1 Answer

From the source code:

class DeepSpeedEngine(Module):
    r"""DeepSpeed engine for training."""
    ...

    def eval(self):
        r""""""

        self.warn_unscaled_loss = True
        self.module.train(False)

The eval method updates the internal training state of the wrapped module but does not return anything. This differs from the standard PyTorch eval, which returns the model itself.

As a result, self.model.eval() puts the model in eval mode internally but returns None. When you assign that return value back with self.model = self.model.eval(), you are effectively running self.model = None.
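The difference can be reproduced without installing DeepSpeed or PyTorch. The sketch below uses two hypothetical stand-in classes (the names `TorchLikeModule` and `EngineLikeWrapper` are made up for illustration): the first mimics how torch.nn.Module.eval() returns the module, the second mimics the DeepSpeedEngine.eval quoted above, which has no return statement.

```python
class TorchLikeModule:
    """Hypothetical stand-in for torch.nn.Module (not the real class)."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self  # like nn.Module.train, hand back the module itself

    def eval(self):
        return self.train(False)


class EngineLikeWrapper:
    """Hypothetical stand-in mirroring the DeepSpeedEngine.eval quoted above."""

    def __init__(self, module):
        self.module = module

    def eval(self):
        # Switches the wrapped module to eval mode, but has no return
        # statement, so the call evaluates to None.
        self.module.train(False)


plain = TorchLikeModule()
plain_result = plain.eval()

engine = EngineLikeWrapper(TorchLikeModule())
engine_result = engine.eval()

print(plain_result is plain)    # True: the module is returned
print(engine_result)            # None: nothing is returned
print(engine.module.training)   # False: eval mode was still applied
```

In both cases the mode switch happens; only the return value differs, which is why the reassignment is harmless with a plain module and destructive with the wrapper.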

You can change your code to:

def evaluate(self, epoch_num=None, keep_all=True):
        print("self.model:", self.model)

        self.model.eval() # simply call `eval`, no assignment necessary
        print("self.model after eval:", self.model)

Note that this also works for standard PyTorch models. eval() updates the internal state of the model object in place, so reassigning the result to the same variable is unnecessary for both the DeepSpeedEngine wrapper and plain PyTorch models.
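If the same evaluation code has to accept either kind of model and you still want an expression that yields the model, one option is a small helper that ignores whatever eval() returns. This is only a sketch: `to_eval_mode` and the `FakeEngine` class used to exercise it are hypothetical names, not part of DeepSpeed or PyTorch.

```python
def to_eval_mode(model):
    """Put `model` in evaluation mode and always return the model itself."""
    model.eval()  # deliberately ignore the return value (may be None)
    return model


class FakeEngine:
    """Hypothetical wrapper whose eval() returns None, for demonstration only."""

    def __init__(self):
        self.training = True

    def eval(self):
        self.training = False  # no return statement


wrapped = FakeEngine()
same_object = to_eval_mode(wrapped)

print(same_object is wrapped)  # True
print(wrapped.training)        # False
```

With such a helper, a line like self.model = to_eval_mode(self.model) stays safe whether self.model is a plain module or an engine wrapper, although simply calling self.model.eval() without assignment remains the simpler fix.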
