I'm trying to train a small LLM on my local machine, which has a single GPU with 16 GB of VRAM. I kept running into CUDA out-of-memory errors, so I looked for ways to reduce VRAM usage. DeepSpeed seemed promising, so I tried it out, but just initializing the model already uses a lot of memory and I'm still getting OOM. Is there a way to initialize the model without this memory overhead? Below is the code I used.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# policy_model_name, policy_autoconfig, and bnb_config are defined elsewhere in my script
def create_policy_model():
    model = AutoModelForCausalLM.from_pretrained(
        policy_model_name,
        trust_remote_code=True,
        device_map="auto",
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        config=policy_autoconfig,
        attn_implementation="flash_attention_2",
        low_cpu_mem_usage=True,
    )
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()
    return model

policy_model_1 = create_policy_model()
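# Illustrative check, not part of my original run: with device_map="auto" the weights
# are already placed on the GPU at this point, so this shows how much VRAM they occupy
# before DeepSpeed is even involved (assumes a single CUDA device).
print(f"VRAM after from_pretrained: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")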
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 1,
    "distributed_type": "NO",
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 5e-5},
    },
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
    },
}
# this is where VRAM use spikes (at around 10 seconds in the screenshot below)
model_engine, _, _, _ = deepspeed.initialize(
    model=policy_model_1,
    model_parameters=policy_model_1.parameters(),
    config=deepspeed_config,
)
[screenshot: VRAM usage spiking about 10 seconds after the script starts]
I tried every available ZeRO optimization stage (0 through 3), but the result is the same.
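For reference, here is a minimal sketch of how I swept the stages. The only thing that changes between runs is zero_optimization.stage; create_policy_model and deepspeed_config are the ones above, and the memory printout is just illustrative, not an exact measurement of the spike:

import copy

for stage in range(4):
    cfg = copy.deepcopy(deepspeed_config)
    cfg["zero_optimization"]["stage"] = stage
    model = create_policy_model()
    # the spike happens here regardless of the stage
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=cfg,
    )
    print(f"stage {stage}: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated")
    del engine, model
    torch.cuda.empty_cache()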