I'm trying to train a small LLM on my local machine, which has a single GPU with 16 GB of VRAM. I kept running into CUDA out-of-memory errors, so I looked for ways to reduce VRAM usage. DeepSpeed seemed promising, so I tried it out, but just initializing the model already uses a lot of memory and I'm still getting OOM. Is there a way to initialize the model without this memory overhead? Below is the code I used.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# policy_model_name, policy_autoconfig, and bnb_config are defined elsewhere in my script
def create_policy_model():
    model = AutoModelForCausalLM.from_pretrained(
        policy_model_name,
        trust_remote_code=True,
        device_map="auto",
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        config=policy_autoconfig,
        attn_implementation="flash_attention_2",
        low_cpu_mem_usage=True,
    )
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()
    return model

policy_model_1 = create_policy_model()
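# Illustrative check, not part of my original run: with device_map="auto" the weights
# are already placed on the GPU at this point, so this shows how much VRAM they occupy
# before DeepSpeed is even involved (assumes a single CUDA device).
print(f"VRAM after from_pretrained: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")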
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 1,
    "distributed_type": "NO",
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 5e-5},
    },
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
    },
}
# this is where VRAM use spikes (at around 10 seconds in the screenshot below)
model_engine, _, _, _ = deepspeed.initialize(
    model=policy_model_1,
    model_parameters=policy_model_1.parameters(),
    config=deepspeed_config,
)
[screenshot: VRAM usage spiking about 10 seconds after the script starts]
I tried every available ZeRO optimization stage (0 through 3), but the result is the same.
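For reference, here is a minimal sketch of how I swept the stages. The only thing that changes between runs is zero_optimization.stage; create_policy_model and deepspeed_config are the ones above, and the memory printout is just illustrative, not an exact measurement of the spike:

import copy

for stage in range(4):
    cfg = copy.deepcopy(deepspeed_config)
    cfg["zero_optimization"]["stage"] = stage
    model = create_policy_model()
    # the spike happens here regardless of the stage
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=cfg,
    )
    print(f"stage {stage}: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated")
    del engine, model
    torch.cuda.empty_cache()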