I am trying to fine-tune a pre-trained pyannote model for the VAD task. Fine-tuning it for the Segmentation task works well and improves the model's results.
Here is how I fine-tune it:
from copy import deepcopy

from pyannote.audio import Model
from pyannote.audio.tasks import Segmentation
from pyannote.database import registry
from pytorch_lightning import Trainer

# config is a dict loaded elsewhere; callbacks and loggers are also defined elsewhere
pretrained = Model.from_pretrained(config["pretrained_model_path"])
registry.load_database(config["database_path"])
data = registry.get_protocol("MyProtocol.SpeakerDiarization.data")

# fine-tune a copy of the pre-trained model on the Segmentation task
finetuned = deepcopy(pretrained)
task = Segmentation(
    data,
    duration=config["duration"],
    # max_num_speakers=config["max_num_speakers"],
    batch_size=config["batch_size"],
    num_workers=config["num_workers"],
    loss=config["loss"],
    vad_loss=config["vad_loss"],
)
finetuned.task = task
finetuned.prepare_data()
finetuned.setup()

trainer = Trainer(
    accelerator=config["accelerator"],
    callbacks=callbacks,
    max_epochs=config["max_epochs"],
    gradient_clip_val=config["gradient_clip_val"],
    logger=[tensorboard_logger, csv_logger],
)
trainer.fit(finetuned)
My dataset contains only audio files with speakers. When I check VAD after training, it fails: the fine-tuned model no longer treats an all-zeros vector, or wav files with no speech at all, as non-speech.
This is how I run the VAD pipeline with my fine-tuned model:
from pyannote.audio.pipelines import VoiceActivityDetection
import torchaudio

pipeline = VoiceActivityDetection(segmentation=segmentation_model_path)
pipeline.instantiate({
    "onset": 0.5,
    "offset": 0.5,
    "min_duration_on": 0.3,
    "min_duration_off": 1.0,
})
waveform, sr = torchaudio.load(audio_file_path)
# pass the sample rate returned by torchaudio.load
vad_result = pipeline({"waveform": waveform, "sample_rate": sr})
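For completeness, the all-zeros check mentioned above is simply a silent tensor passed through the same pipeline (a minimal sketch; the 16 kHz sample rate and 10-second duration are arbitrary choices for illustration):

import torch

# a "recording" of pure silence: 10 seconds of zeros at 16 kHz (arbitrary values)
silence = torch.zeros(1, 16000 * 10)
vad_on_silence = pipeline({"waveform": silence, "sample_rate": 16000})
# a well-behaved VAD model should return an (almost) empty annotation here
print(vad_on_silence)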
When I use the pre-trained model, it correctly detects speech/non-speech regions. However, after fine-tuning the model (even for one epoch), it fails to detect these regions accurately.
I then tried fine-tuning specifically for the VAD task. My understanding is that .lab files with speech/non-speech labels are required for this. However, the pyannote VAD tutorial makes no mention of .lab files; only .rttm and .uem files are referenced. This has left me confused about the correct setup for fine-tuning a pyannote model specifically for VAD.
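For reference, my understanding of the .rttm/.uem-based configuration that the tutorial describes is a database.yml roughly like the following (the paths and list files below are placeholders, not my actual setup):

Databases:
  MyProtocol: /path/to/audio/{uri}.wav

Protocols:
  MyProtocol:
    SpeakerDiarization:
      data:
        train:
          uri: /path/to/train.lst          # list of file URIs
          annotation: /path/to/train.rttm  # who speaks when
          annotated: /path/to/train.uem    # regions that were actually annotated
        development:
          uri: /path/to/dev.lst
          annotation: /path/to/dev.rttm
          annotated: /path/to/dev.uem

As far as I can tell, there is no place for .lab files in this layout.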
My Questions:
- Why does the model fine-tuned for the Segmentation task fail to detect silence or non-speech regions when tested for VAD?
- For fine-tuning pyannote specifically for VAD, do I need .lab files, and if so, how can I generate them from my dataset? And how should the database.yml file be configured for this?
- How can I ensure that the fine-tuned model performs well for the VAD task?
Any guidance would be greatly appreciated!