transcribe package#
Submodules#
transcribe.transcribe module#
- class transcribe.transcribe.BaseTask(audio_file: Path, transcription_output: dict | str, text_file: Path)[source]#
Bases:
object
A task to write the transcription to file.
- class transcribe.transcribe.BatchProcessor(audio_files: List[Path], output_directory: Path)[source]#
Bases:
object
A batch processor to process batches of transcriptions. The batch processor creates tasks and is designed to work alongside the transcriber. It can be used with a multiprocessing queue or can run the tasks directly using the associated methods.
- do_tasks()[source]#
Perform the tasks. Should be used when no multiprocessing queue is given to the transcriber.
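A minimal usage sketch (the file paths are placeholders), running the processor without a multiprocessing queue alongside the Transcriber documented further down this page:

```python
from pathlib import Path

from transcribe.transcribe import BatchProcessor, Transcriber

# Placeholder audio files - replace with real paths:
audio_files = [Path("calls/call_1.wav"), Path("calls/call_2.wav")]

# The processor collects the transcription outputs and turns them into write-to-file tasks:
batch_processor = BatchProcessor(
    audio_files=audio_files,
    output_directory=Path("./transcriptions"),
)

# No multiprocessing queue is given, so the tasks are performed directly after transcription:
transcriber = Transcriber(model_name="openai/whisper-tiny")
transcriber.transcribe(audio_files=audio_files, batch_processor=batch_processor)
batch_processor.do_tasks()
```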
- class transcribe.transcribe.PerChannelSpeechDiarizationBatchProcessor(audio_files: List[Path], output_directory: Path, n_channels: int, speakers: List[str])[source]#
Bases:
BatchProcessor
A batch processor to process batches of transcriptions per channel. The batch processor creates tasks for the given number of channels and is designed to work alongside the transcriber. It can be used with a multiprocessing queue or can run the tasks directly using the associated methods.
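A brief construction sketch for stereo recordings with one speaker per channel; the paths and labels below are placeholders:

```python
from pathlib import Path

from transcribe.transcribe import PerChannelSpeechDiarizationBatchProcessor

processor = PerChannelSpeechDiarizationBatchProcessor(
    audio_files=[Path("calls/call_1.wav")],      # placeholder stereo recording
    output_directory=Path("./transcriptions"),
    n_channels=2,                                # number of channels to create tasks for
    speakers=["Agent", "Client"],                # speaker labels, in channel order
)
```

When processing per channel, the Transcriber documented below presumably needs its per_channel_transcription parameter set to the same number of channels.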
- class transcribe.transcribe.SpeechDiarizationBatchProcessor(audio_files: List[Path], output_directory: Path, speech_diarization: dict)[source]#
Bases:
BatchProcessor
A batch processor to process batches of transcriptions with respect to a given speech diarization. The batch processor creates tasks and is designed to work alongside the transcriber. It can be used with a multiprocessing queue or can run the tasks directly using the associated methods.
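A short sketch, assuming the (start, end, speaker) tuple segments described under the module-level transcribe() parameters further below; file names and paths are placeholders:

```python
from pathlib import Path

from transcribe.transcribe import SpeechDiarizationBatchProcessor

# Placeholder diarization: file name -> list of (start, end, speaker) segments.
speech_diarization = {
    "call_1.wav": [(0.0, 2.0, "Agent"), (2.0, 4.0, "Client")],
}

processor = SpeechDiarizationBatchProcessor(
    audio_files=[Path("calls/call_1.wav")],
    output_directory=Path("./transcriptions"),
    speech_diarization=speech_diarization,
)
# The processor can then be passed to Transcriber.transcribe, or its tasks run via do_tasks().
```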
- class transcribe.transcribe.SpeechDiarizationPerChannelTask(audio_file: Path, text_file: Path)[source]#
Bases:
BaseTask
A task to write the transcription to file with respect to a given speech diarization per channel.
- to_tuple() Tuple[str, dict] [source]#
Convert the task to a tuple so it can be reconstructed later (used in multiprocessing to pass the task through a queue).
- Returns:
The converted task.
- property transcription_output_channels: List[Tuple[str, dict]]#
Get the transcription output channels.
- Returns:
The transcription output channels.
- class transcribe.transcribe.SpeechDiarizationTask(audio_file: Path, transcription_output: dict, text_file: Path, speech_diarization: List[Tuple[float, float, str]])[source]#
Bases:
BaseTask
A task to write the transcription to file with respect to a given speech diarization.
- class transcribe.transcribe.Transcriber(model_name: str, device: str | None = None, use_flash_attention_2: bool | None = None, use_better_transformers: bool | None = None, assistant_model: str | None = None, max_new_tokens: int = 128, chunk_length_s: int = 30, batch_size: int = 2, spoken_language: str | None = None, translate_to_english: bool = False, return_timestamps: bool | Literal['word'] = False, per_channel_transcription: int = 0)[source]#
Bases:
object
A transcription wrapper for Hugging Face's ASR pipeline - https://huggingface.co/transformers/main_classes/pipelines.html#transformers.AutomaticSpeechRecognitionPipeline - to use with OpenAI's Whisper models - https://huggingface.co/openai.
- transcribe(audio_files: List[Path], batch_processor: BatchProcessor | None = None, batches_queue: Queue | None = None, verbose: bool = False) List[List[dict]] | None [source]#
Transcribe the given audio files. The transcriptions will be sent to a queue or a batch processor for further processing, such as writing to text files. If no queue or batch processor is given, the transcription outputs from the pipeline will be returned. Otherwise, None is returned.
- Parameters:
audio_files – The audio files to transcribe.
batch_processor – A batch processor.
batches_queue – A multiprocessing queue to put the batches in.
verbose – Whether to show a progress bar. Default is False.
- Returns:
The transcription outputs from the pipeline if no queue or batch processor is given; otherwise, None.
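A minimal sketch of the returning-outputs path (the audio path is a placeholder): with no queue or batch processor, the raw pipeline outputs are returned to the caller:

```python
from pathlib import Path

from transcribe.transcribe import Transcriber

transcriber = Transcriber(
    model_name="openai/whisper-tiny",  # any Whisper checkpoint from https://huggingface.co/openai
    spoken_language="en",
    batch_size=2,
)

# No batch processor or queue is given, so the outputs are returned as a list of batches:
outputs = transcriber.transcribe(audio_files=[Path("meeting.wav")], verbose=True)
for batch in outputs:
    for result in batch:
        # Each result is a raw Hugging Face ASR pipeline output (typically a dict with a "text" key):
        print(result)
```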
- transcribe.transcribe.open_mpi_handler(worker_inputs: List[str], root_worker_inputs: Dict[str, Any] | None = None)[source]#
- transcribe.transcribe.transcribe(data_path: str | Path | List[str | Path], output_directory: str | None = None, model_name: str = 'openai/whisper-tiny', device: str | None = None, use_flash_attention_2: bool | None = None, use_better_transformers: bool | None = None, assistant_model: str | None = None, max_new_tokens: int = 128, chunk_length_s: int = 30, batch_size: int = 8, spoken_language: str | None = None, translate_to_english: bool = False, speech_diarization: Dict[str, List[Tuple[float, float, str]]] | None = None, speech_diarize_per_channel: int | None = None, speaker_labels: List[str] | None = None, use_multiprocessing: bool | int = False, verbose: bool = False)[source]#
Transcribe audio files into text files and collect additional data. The end result is a directory of transcribed text files and a dataframe containing the following columns:
audio_file - The audio file path.
transcription_file - The transcribed text file name in the output directory.
The transcription is based on Hugging Face's ASR pipeline - https://huggingface.co/transformers/main_classes/pipelines.html#transformers.AutomaticSpeechRecognitionPipeline and is tested with OpenAI's Whisper models - https://huggingface.co/openai.
If one of the speech diarization parameters is given (either speech_diarization or speech_diarize_per_channel), the transcription will be written in a conversation format, where each speaker's text is written on a separate line:
speaker_1: text
speaker_2: text
speaker_1: text
...
- Parameters:
data_path – A directory of audio files or a single file or a list of files to transcribe.
output_directory – Path to a directory to save all transcribed audio files. If not given, will save the transcribed files in a temporary directory.
model_name – The model name to use. Should be one of OpenAI's Whisper models for best results (for example "tiny", "base", "large", etc.). See here for more information: https://huggingface.co/openai?search_models=whisper.
device – The device to use for inference. If not given, will use GPU if available.
use_flash_attention_2 –
Whether to use the Flash Attention 2 implementation. It can be used only with one of the following GPUs: Nvidia H series and Nvidia A series. T4 support will be available soon.
Note: If both use_flash_attention_2 and use_better_transformers are None, the optimization will be chosen automatically according to the available resources.
use_better_transformers –
Whether to use the Better Transformers library to further optimize the model. Should be used for all use cases that do not support flash attention 2.
Note: If both use_flash_attention_2 and use_better_transformers are None, the optimization will be chosen automatically according to the available resources.
assistant_model –
The assistant model name to use for inference. Notice that the optimizations (flash attention 2 and better transformers) will be applied to the assistant as well. Should be a model from Hugging Face's distil-whisper (see here for more information: huggingface/distil-whisper).
Note: Currently an assistant model is only usable with batch size of 1.
max_new_tokens – The maximum number of new tokens to generate. This is used to limit the generation length. Default is 128 tokens.
chunk_length_s – The length of the chunks to split the audio into (in seconds). Default is 30 seconds.
batch_size – The batch size to use for inference. Default is 8.
spoken_language – Tell Whisper which language is spoken. If None, it will try to detect the spoken language.
translate_to_english – Whether to translate the transcriptions to English.
speech_diarization –
A speech diarization dictionary with the file names to transcribe as keys and their diarization as values. The diarization is a list of tuples: (start, end, speaker). An example of a diarization dictionary:
{
    "audio_file_name": [
        {
            "start": 0.0, "end": 2.0, "speaker": "Agent",
        },
        {
            "start": 2.0, "end": 4.0, "speaker": "Client",
        },
    ],
}
Note: The diarization must cover the entire duration of the audio file (or at least as far as Whisper predicts words).
speech_diarize_per_channel – Perform speech diarization per channel. Each speaker is expected to belong to a separate channel in the audio. Notice: This will make the transcription slower, as each channel will be transcribed separately. If a speech diarization is passed (via the speech_diarization parameter), this parameter is ignored.
speaker_labels – A list of speaker labels, by channel order, to use when writing the transcription with per-channel speech diarization. This parameter is not used together with a given speech diarization (via the speech_diarization parameter).
use_multiprocessing – Whether to use multiprocessing to transcribe the audio files. Can be either a boolean value or an integer. If True, the default number of workers (3) will be used: 1 for transcription, 1 for batch processing and 1 for task completion (such as speech diarization and writing to files). An integer can be provided to control the number of task completion workers. If False, a single process will be used. Default is False.
verbose – Whether to print the progress of the transcription. Default is False.
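A usage sketch of the handler with per-channel diarization; the directory, labels and model choice are placeholders rather than recommendations:

```python
from transcribe.transcribe import transcribe

transcribe(
    data_path="./calls",                  # a directory of audio files (placeholder)
    output_directory="./transcriptions",
    model_name="openai/whisper-tiny",
    spoken_language="en",
    speech_diarize_per_channel=2,         # each speaker sits on its own audio channel
    speaker_labels=["Agent", "Client"],   # labels by channel order
    use_multiprocessing=False,
    verbose=True,
)
```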