silero_vad package#
Submodules#
silero_vad.silero_vad module#
- class silero_vad.silero_vad.BaseTask(audio_file: pathlib.Path)[source]#
Bases:
object
A base class for a task to complete after VAD.
- property audio_file: pathlib.Path#
Get the audio file of the task.
- Returns
The audio file of the task.
- do_task(speech_timestamps: Union[List[Dict[str, int]], List[List[Dict[str, int]]]])[source]#
Do the task on the given speech timestamps. The base task will simply save the speech timestamps as the result.
- Parameters
speech_timestamps – The speech timestamps to do the task on as outputted from the VAD.
- class silero_vad.silero_vad.SpeechDiarizationTask(audio_file: pathlib.Path, speaker_labels: List[str])[source]#
Bases:
silero_vad.silero_vad.BaseTask
A speech diarization task. The task will diarize the VAD speech timestamps into speakers.
- class silero_vad.silero_vad.TaskCreator(task_type: Type[silero_vad.silero_vad.BaseTask], task_kwargs: Optional[dict] = None)[source]#
Bases:
object
A task creator to create different tasks to run after the VAD.
- create_task(audio_file: pathlib.Path) → silero_vad.silero_vad.BaseTask[source]#
Create a task with the given audio file.
- Parameters
audio_file – The audio file to assign to the task.
- Returns
The created task.
- classmethod from_tuple(task_tuple: Tuple[str, dict]) → silero_vad.silero_vad.BaseTask[source]#
Create a task from a tuple of the audio file name and the task kwargs.
- Parameters
task_tuple – The task tuple to create the task from.
- Returns
The created task.
- class silero_vad.silero_vad.VoiceActivityDetector(use_onnx: bool = True, force_onnx_cpu: bool = True, threshold: float = 0.5, sampling_rate: int = 16000, min_speech_duration_ms: int = 250, max_speech_duration_s: float = inf, min_silence_duration_ms: int = 100, window_size_samples: int = 512, speech_pad_ms: int = 30, return_seconds: bool = False, per_channel: bool = False)[source]#
Bases:
object
A voice activity detection wrapper for the silero VAD model - https://github.com/snakers4/silero-vad.
- detect_voice(audio_file: pathlib.Path) → Union[List[Dict[str, int]], List[List[Dict[str, int]]]][source]#
Infer the audio through the VAD model and return the speech timestamps.
- Parameters
audio_file – The audio file to infer.
- Returns
The speech timestamps in the audio. A list of timestamps where each timestamp is a dictionary with the following keys:
”start”: The start sample index of the speech in the audio.
”end”: The end sample index of the speech in the audio.
If per_channel is True, a list of timestamps per channel will be returned.
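Because timestamps are returned as sample indices by default (return_seconds=False), converting them to seconds is a matter of dividing by the sampling rate. A minimal sketch, assuming a 16000 Hz sampling rate and an illustrative timestamps list:

```python
# Example speech timestamps in the shape returned by detect_voice:
# sample indices, assuming return_seconds=False and a 16000 Hz rate.
speech_timestamps = [
    {"start": 0, "end": 16000},
    {"start": 48000, "end": 80000},
]

sampling_rate = 16000  # must match the detector's sampling_rate

# Convert each timestamp from sample indices to seconds:
timestamps_in_seconds = [
    {"start": ts["start"] / sampling_rate, "end": ts["end"] / sampling_rate}
    for ts in speech_timestamps
]
# [{"start": 0.0, "end": 1.0}, {"start": 3.0, "end": 5.0}]
```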
- silero_vad.silero_vad.detect_voice(data_path: Union[str, pathlib.Path, List[Union[str, pathlib.Path]]], use_onnx: bool = True, force_onnx_cpu: bool = True, threshold: float = 0.5, sampling_rate: int = 16000, min_speech_duration_ms: int = 250, max_speech_duration_s: float = inf, min_silence_duration_ms: int = 100, window_size_samples: int = 512, speech_pad_ms: int = 30, return_seconds: bool = False, per_channel: bool = False, use_multiprocessing: int = 0, verbose: bool = False)[source]#
Perform voice activity detection on the given audio files using the silero VAD model - https://github.com/snakers4/silero-vad. The end result is a dictionary with the file names as keys and their VAD timestamp lists as values.
For example:
{
    "file_1.wav": [
        {"start": 0, "end": 16000},
        {"start": 16000, "end": 32000},
        {"start": 32000, "end": 48000},
        ...
    ],
    "file_2.wav": [
        {"start": 0, "end": 16000},
        {"start": 16000, "end": 32000},
        {"start": 32000, "end": 48000},
        ...
    ],
    ...
}
- Parameters
data_path – The path to the audio files to diarize. Can be a path to a single file, a path to a directory or a list of paths to files.
use_onnx – Whether to use ONNX for inference. Default is True.
force_onnx_cpu – Whether to force ONNX to use CPU for inference. Default is True.
threshold – Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH. It is better to tune this parameter for each dataset separately, but “lazy” 0.5 is pretty good for most datasets.
sampling_rate – Currently, silero VAD models support 8000 and 16000 sample rates.
min_speech_duration_ms – Final speech chunks shorter than min_speech_duration_ms are thrown out.
max_speech_duration_s – Maximum duration of speech chunks in seconds. Chunks longer than max_speech_duration_s will be split at the timestamp of the last silence that lasts more than 100ms (if any), to prevent aggressive cutting. Otherwise, they will be split aggressively just before max_speech_duration_s.
min_silence_duration_ms – At the end of each speech chunk, wait min_silence_duration_ms before separating it.
window_size_samples – Audio chunks of window_size_samples size are fed to the silero VAD model. WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for the 16000 sample rate and 256, 512, 768 samples for the 8000 sample rate. Values other than these may affect model performance!
speech_pad_ms – Final speech chunks are padded by speech_pad_ms on each side.
return_seconds – Whether to return timestamps in seconds. False means timestamps are returned in samples (default: False).
per_channel – Whether to return timestamps per channel (default - False). This will run VAD on each channel separately and return a list of timestamps per channel.
use_multiprocessing – The number of workers to use for multiprocessing. If 0, no multiprocessing will be used. Default is 0.
verbose – Verbosity.
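The returned mapping can be post-processed with plain Python, for example to sum the total detected speech per file. A minimal sketch, assuming sample-index timestamps at a 16000 Hz sampling rate; the vad_results values below are illustrative, not real model output:

```python
# Hypothetical output of detect_voice: sample-index timestamps
# (return_seconds=False) at a 16000 Hz sampling rate.
vad_results = {
    "file_1.wav": [{"start": 0, "end": 16000}, {"start": 32000, "end": 48000}],
    "file_2.wav": [{"start": 8000, "end": 24000}],
}

SAMPLING_RATE = 16000  # must match the sampling_rate passed to detect_voice

# Sum the span of each speech chunk and convert samples to seconds:
speech_seconds = {
    file_name: sum(ts["end"] - ts["start"] for ts in timestamps) / SAMPLING_RATE
    for file_name, timestamps in vad_results.items()
}
# file_1.wav: (16000 + 16000) / 16000 = 2.0 seconds
# file_2.wav: 16000 / 16000 = 1.0 second
```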
- silero_vad.silero_vad.diarize(data_path: Union[str, pathlib.Path, List[Union[str, pathlib.Path]]], use_onnx: bool = True, force_onnx_cpu: bool = True, threshold: float = 0.5, sampling_rate: int = 16000, min_speech_duration_ms: int = 250, max_speech_duration_s: float = inf, min_silence_duration_ms: int = 100, window_size_samples: int = 512, speech_pad_ms: int = 30, speaker_labels: Optional[List[str]] = None, use_multiprocessing: int = 0, verbose: bool = False)[source]#
Perform speech diarization on the given audio files using the silero VAD model - https://github.com/snakers4/silero-vad. The speech diarization is performed per channel, so that each channel in the audio belongs to a different speaker. The end result is a dictionary with the file names as keys and their diarizations as values. A diarization is a list of tuples: (start, end, speaker_label).
For example:
{
    "file_1.wav": [
        (0.0, 1.0, "speaker_0"),
        (1.0, 2.0, "speaker_1"),
        (2.0, 3.0, "speaker_0"),
        ...
    ],
    "file_2.wav": [
        (0.0, 1.0, "speaker_0"),
        (1.0, 2.0, "speaker_1"),
        (2.0, 3.0, "speaker_0"),
        ...
    ],
    ...
}
- Parameters
data_path – The path to the audio files to diarize. Can be a path to a single file, a path to a directory or a list of paths to files.
use_onnx – Whether to use ONNX for inference. Default is True.
force_onnx_cpu – Whether to force ONNX to use CPU for inference. Default is True.
threshold – Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH. It is better to tune this parameter for each dataset separately, but “lazy” 0.5 is pretty good for most datasets.
sampling_rate – Currently, silero VAD models support 8000 and 16000 sample rates.
min_speech_duration_ms – Final speech chunks shorter than min_speech_duration_ms are thrown out.
max_speech_duration_s – Maximum duration of speech chunks in seconds. Chunks longer than max_speech_duration_s will be split at the timestamp of the last silence that lasts more than 100ms (if any), to prevent aggressive cutting. Otherwise, they will be split aggressively just before max_speech_duration_s.
min_silence_duration_ms – At the end of each speech chunk, wait min_silence_duration_ms before separating it.
window_size_samples – Audio chunks of window_size_samples size are fed to the silero VAD model. WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for the 16000 sample rate and 256, 512, 768 samples for the 8000 sample rate. Values other than these may affect model performance!
speech_pad_ms – Final speech chunks are padded by speech_pad_ms on each side.
speaker_labels – The speaker labels to use for the diarization. If not given, the speakers will be named “speaker_0”, “speaker_1”, etc.
use_multiprocessing – The number of workers to use for multiprocessing. If 0, no multiprocessing will be used. Default is 0.
verbose – Verbosity.
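Since each diarization is a list of (start, end, speaker_label) tuples, total speaking time per speaker can be accumulated with plain Python. A sketch over a hypothetical single-file diarization (the tuple values below are illustrative):

```python
from collections import defaultdict

# Hypothetical diarization of one file, in the shape returned by diarize:
# (start, end, speaker_label) tuples with times in seconds.
diarization = [
    (0.0, 1.5, "speaker_0"),
    (1.5, 2.0, "speaker_1"),
    (2.0, 3.5, "speaker_0"),
]

# Accumulate total speaking time per speaker label:
speaking_time = defaultdict(float)
for start, end, speaker in diarization:
    speaking_time[speaker] += end - start
# speaker_0: 3.0 seconds, speaker_1: 0.5 seconds
```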