I am trying to accelerate model training with DeepSpeed, and a problem occurred when I tried to evaluate the model on the validation dataset. Here is the problematic code snippet:
def evaluate(self, epoch_num=None, keep_all=True):
    print("self.model:", self.model)
    self.model = self.model.eval()
    print("self.model after eval:", self.model)
The output log:
self.model: DeepSpeedEngine(
  (module): TSTransformerEncoder(
    (project_inp): Linear(in_features=6, out_features=128, bias=True)
    (pos_enc): LearnablePositionalEncoding(
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer_encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-2): 3 x TransformerBatchNormEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
          )
          (linear1): Linear(in_features=128, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=256, out_features=128, bias=True)
          (norm1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (norm2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (output_layer): Linear(in_features=128, out_features=6, bias=True)
    (dropout1): Dropout(p=0.1, inplace=False)
  )
)
self.model after eval: None
Without DeepSpeed, the model trains and evaluates normally. The problem above appears only after switching to DeepSpeed.
This is how I initialize DeepSpeed:
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=ds_config
)
The ds_config file:
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "params": {
            "lr": 0.001,
            "weight_decay": 0,
            "optimizer_class": "optimizers.RAdam"
        }
    },
    "zero_optimization": {
        "stage": 1,
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "zero_allow_untested_optimizer": true,
    "train_batch_size": 256,
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
Problem Analysis
I originally expected that self.model.eval() would only set the model to evaluation mode, and that the model itself would not become None. However, the actual output shows that self.model becomes None after calling the eval() method. I suspect that this might be related to the encapsulation or configuration of DeepSpeed, but I'm not sure about the specific cause.
Relevant Environment Information
Python Version: 3.8.20
PyTorch Version: 2.4.1
DeepSpeed Version: 0.16.4
1 Answer
From the source code:
class DeepSpeedEngine(Module):
    r"""DeepSpeed engine for training."""
    ...
    def eval(self):
        r""""""
        self.warn_unscaled_loss = True
        self.module.train(False)
The eval method updates the internal training status of the wrapped module but has no return statement, so it returns None. This differs from the standard PyTorch eval method, which returns the model itself.
So self.model.eval() does set the model to eval mode internally, but when you assign its result back with self.model = self.model.eval(), you are effectively running self.model = None.
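The difference can be reproduced without installing either library. The sketch below uses two hypothetical stand-in classes (PlainModule and EngineLike are illustrative names, not real PyTorch or DeepSpeed classes): one mimics an eval() that returns the object, the other mimics the eval() shown in the excerpt above, which has no return statement.

```python
class PlainModule:
    """Hypothetical stand-in for a standard PyTorch module: eval() returns self."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


class EngineLike:
    """Hypothetical stand-in for the engine excerpt above: eval() returns nothing."""

    def __init__(self, module):
        self.module = module

    def eval(self):
        # Switches the wrapped module to eval mode; implicitly returns None.
        self.module.train(False)


plain = PlainModule()
plain = plain.eval()        # safe: eval() hands back the same object

engine = EngineLike(PlainModule())
result = engine.eval()      # the mode is switched, but the return value is None

print(plain.training, engine.module.training, result)
```

Assigning result back to the variable that held the engine would discard the engine, which is what happens to self.model in the question.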
You can change your code to:
def evaluate(self, epoch_num=None, keep_all=True):
    print("self.model:", self.model)
    self.model.eval()  # simply call `eval`, no assignment necessary
    print("self.model after eval:", self.model)
Note that this also works for standard PyTorch models: eval primarily updates the internal state of the model object, so reassigning the model object to the same variable name is unnecessary for both the DeepSpeedEngine model and standard PyTorch models.
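If some existing code relies on the chained style (model = model.eval()), one option is a small wrapper that ignores whatever eval() returns. This is only a sketch: to_eval is a hypothetical helper name, and NoReturnEngine is an illustrative stand-in for any wrapper whose eval() returns None.

```python
def to_eval(model):
    """Switch `model` to eval mode and always return the same object.

    Works whether model.eval() returns the model or returns None,
    because the return value is deliberately ignored.
    """
    model.eval()
    return model


class NoReturnEngine:
    """Hypothetical stand-in for a wrapper whose eval() returns None."""

    def __init__(self):
        self.training = True

    def eval(self):
        self.training = False  # no return statement


engine = NoReturnEngine()
same_engine = to_eval(engine)  # safe to assign: the engine itself comes back
print(same_engine is engine, engine.training)
```

In most cases the plain self.model.eval() call is simpler; the helper only matters where the assignment form has to be kept.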