I am trying to fine-tune a pre-trained pyannote model for the VAD task. Fine-tuning it for the Segmentation task works well and improves the model's results.
Here is how I fine-tune it:
from copy import deepcopy

from pyannote.audio import Model
from pyannote.audio.tasks import Segmentation
from pyannote.database import registry
from pytorch_lightning import Trainer

# config is a dict loaded elsewhere; callbacks and loggers are also defined elsewhere
pretrained = Model.from_pretrained(config["pretrained_model_path"])
registry.load_database(config["database_path"])
data = registry.get_protocol("MyProtocol.SpeakerDiarization.data")

# fine-tune a copy of the pre-trained model on the Segmentation task
finetuned = deepcopy(pretrained)
task = Segmentation(
    data,
    duration=config["duration"],
    # max_num_speakers=config["max_num_speakers"],
    batch_size=config["batch_size"],
    num_workers=config["num_workers"],
    loss=config["loss"],
    vad_loss=config["vad_loss"],
)
finetuned.task = task
finetuned.prepare_data()
finetuned.setup()

trainer = Trainer(
    accelerator=config["accelerator"],
    callbacks=callbacks,
    max_epochs=config["max_epochs"],
    gradient_clip_val=config["gradient_clip_val"],
    logger=[tensorboard_logger, csv_logger],
)
trainer.fit(finetuned)
My dataset contains only audio files with speakers. When I check VAD after training, it fails: the fine-tuned model no longer treats an all-zeros vector, or wav files with no speech at all, as non-speech.
This is how I run the VAD pipeline with my fine-tuned model:
from pyannote.audio.pipelines import VoiceActivityDetection
import torchaudio

pipeline = VoiceActivityDetection(segmentation=segmentation_model_path)
pipeline.instantiate({
    "onset": 0.5,
    "offset": 0.5,
    "min_duration_on": 0.3,
    "min_duration_off": 1.0,
})
waveform, sr = torchaudio.load(audio_file_path)
# pass the sample rate returned by torchaudio.load
vad_result = pipeline({"waveform": waveform, "sample_rate": sr})
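For completeness, the all-zeros check mentioned above is simply a silent tensor passed through the same pipeline (a minimal sketch; the 16 kHz sample rate and 10-second duration are arbitrary choices for illustration):

import torch

# a "recording" of pure silence: 10 seconds of zeros at 16 kHz (arbitrary values)
silence = torch.zeros(1, 16000 * 10)
vad_on_silence = pipeline({"waveform": silence, "sample_rate": 16000})
# a well-behaved VAD model should return an (almost) empty annotation here
print(vad_on_silence)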
When I use the pre-trained model, it correctly detects speech/non-speech regions. However, after fine-tuning the model (even for one epoch), it fails to detect these regions accurately.
I then tried fine-tuning specifically for the VAD task. My understanding is that .lab files with speech/non-speech labels are required for this. However, the pyannote VAD tutorial makes no mention of .lab files; only .rttm and .uem files are referenced. This has left me confused about the correct setup for fine-tuning a pyannote model specifically for VAD.
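For reference, my understanding of the .rttm/.uem-based configuration that the tutorial describes is a database.yml roughly like the following (the paths and list files below are placeholders, not my actual setup):

Databases:
  MyProtocol: /path/to/audio/{uri}.wav

Protocols:
  MyProtocol:
    SpeakerDiarization:
      data:
        train:
          uri: /path/to/train.lst          # list of file URIs
          annotation: /path/to/train.rttm  # who speaks when
          annotated: /path/to/train.uem    # regions that were actually annotated
        development:
          uri: /path/to/dev.lst
          annotation: /path/to/dev.rttm
          annotated: /path/to/dev.uem

As far as I can tell, there is no place for .lab files in this layout.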
My Questions:
- Why does the model fine-tuned for the Segmentation task fail to detect silence or non-speech regions when tested for VAD?
- For fine-tuning pyannote specifically for VAD, do I need .lab files, and if so, how can I generate them from my dataset? And how should the database.yml file be configured for this?
- How can I ensure that the fine-tuned model performs well for the VAD task?
Any guidance would be greatly appreciated!