Transcribe tutorial

import tempfile
import mlrun

Importing the transcribe function from hub

To import the function directly from the hub, use:

transcribe_fn = mlrun.import_function("hub://transcribe")

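# Temporary local directory that will hold the artifacts produced by the run.
artifact_path = tempfile.mkdtemp()

# Alternatively, load the function from a local function.yaml file.
# This assignment replaces any previously imported transcribe_fn.
transcribe_fn = mlrun.import_function("function.yaml")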

Running transcribe

transcribe_run = transcribe_fn.run(
    handler="transcribe",
    params={
        "model_name": "tiny",
        "input_path": "./data",
        "decoding_options": {"fp16": False},
        "output_directory": "./output",
    },
    returns=[
        "transcriptions: path",
        "transcriptions_df: dataset",
        {"key": "transcriptions_errors", "artifact_type": "file", "file_format": "yaml"},
    ],
    local=True,
    artifact_path=artifact_path,
)
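The log below reports an error for a non-audio file in the input directory, so it can help to check what the directory contains before launching the run. The following is a minimal sketch using only the standard library; the helper name `split_audio_files` and the extension set are assumptions for illustration and are not part of the transcribe function. The demo builds a throwaway directory that mimics the file names seen in this tutorial.

```python
import tempfile
from pathlib import Path

# Hypothetical set of extensions to treat as audio; adjust to your data.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a"}


def split_audio_files(directory):
    """Return (audio, other) lists of file names found directly in `directory`."""
    audio, other = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        if path.suffix.lower() in AUDIO_EXTENSIONS:
            audio.append(path.name)
        else:
            other.append(path.name)
    return audio, other


# Demo on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("speech_01.mp3", "speech_02.mp3", "error_file.txt"):
        (Path(demo_dir) / name).touch()
    audio, other = split_audio_files(demo_dir)

print("audio files:", audio)
print("other files:", other)
```

In practice you would pass the same directory that you give to the run (here `./data`) and decide whether to move the non-audio files elsewhere first.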

> 2023-07-16 17:14:01,968 [info] Storing function: {'name': 'transcribe-transcribe', 'uid': 'd1384cb679bc4c178b0195d964b628a8', 'db': None}
> 2023-07-16 17:14:01,969 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:01,969 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:01,970 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:01,970 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:01,972 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:01,972 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:09,804 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:09,805 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:09,805 [info] Loading whisper model: 'tiny'

The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

> 2023-07-16 17:14:10,374 [info] Model loaded.

Transcribing:  67%|██████▋   | 2/3 [00:02<00:01,  1.04s/file]

> 2023-07-16 17:14:12,556 [warning] Error in file: '/Users/Yonatan_Shelach/projects/functions/transcribe/data/error_file.txt'

Transcribing: 100%|██████████| 3/3 [00:02<00:00,  1.39file/s]

> 2023-07-16 17:14:12,566 [info] Done:
      audio_file transcription_file language     length  rate_of_speech
0  speech_01.mp3      speech_01.txt       en   2.011333        3.480278
1  speech_02.mp3      speech_02.txt       en  20.793500        2.548873
> 2023-07-16 17:14:12,596 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,597 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:12,659 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,660 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:12,671 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,672 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect

> 2023-07-16 17:14:12,707 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,707 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:12,708 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,708 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect

project	uid	iter	start	state	name	labels	inputs	parameters	results	artifacts
default	...b628a8	0	Jul 16 14:14:01	completed	transcribe-transcribe	kind= owner=Yonatan_Shelach host=M-QWXQJK77Q0		model_name=tiny audio_files_directory=./data decoding_options={'fp16': False} output_directory=./output		transcriptions transcriptions_df transcriptions_errors

> to track results use the .show() or .logs() methods

> 2023-07-16 17:14:12,721 [info] Run execution finished: {'status': 'completed', 'name': 'transcribe-transcribe'}
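The repeated warnings in the log above say that MLRUN_DBPATH is not set, which is expected for a purely local run like this one. If you do have an MLRun API server, the warning text suggests setting that environment variable to the server's URL. A minimal sketch follows; the URL is a placeholder assumption rather than a real endpoint, and setting the variable before `mlrun` is imported is the conservative ordering.

```python
import os

# Placeholder URL (assumption): replace with the address of your MLRun API server.
MLRUN_API_URL = "http://localhost:8080"

# setdefault keeps any value that is already configured in the environment.
os.environ.setdefault("MLRUN_DBPATH", MLRUN_API_URL)

print("MLRUN_DBPATH =", os.environ["MLRUN_DBPATH"])
```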

transcribe_run.outputs

{'transcriptions': 'store://artifacts/default/transcribe-transcribe_transcriptions:d1384cb679bc4c178b0195d964b628a8',
 'transcriptions_df': 'store://artifacts/default/transcribe-transcribe_transcriptions_df:d1384cb679bc4c178b0195d964b628a8',
 'transcriptions_errors': 'store://artifacts/default/transcribe-transcribe_transcriptions_errors:d1384cb679bc4c178b0195d964b628a8'}
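Each output value above appears to follow the layout `store://artifacts/<project>/<key>:<uid>`. The sketch below splits one of these strings into its parts. The layout is inferred only from the three URIs shown here, the helper `parse_store_uri` is hypothetical and is not MLRun's own parser, and other URI forms are not handled.

```python
def parse_store_uri(uri):
    """Split 'store://artifacts/<project>/<key>:<uid>' into its three parts."""
    prefix = "store://artifacts/"
    if not uri.startswith(prefix):
        raise ValueError(f"not an artifact store URI: {uri!r}")
    # Everything up to the first '/' after the prefix is the project name.
    project, _, rest = uri[len(prefix):].partition("/")
    # The part after the last ':' is the run uid; it may be absent.
    if ":" in rest:
        key, _, uid = rest.rpartition(":")
    else:
        key, uid = rest, ""
    return {"project": project, "key": key, "uid": uid}


# URI copied from the outputs shown above.
uri = (
    "store://artifacts/default/"
    "transcribe-transcribe_transcriptions_df:d1384cb679bc4c178b0195d964b628a8"
)
parts = parse_store_uri(uri)
print(parts)
```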

Note: If you are connected to an MLRun server, you can load the dataset directly from the run:

df = transcribe_run.artifact("transcriptions_df").as_df()

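# Without a server connection, read the dataset from the local artifact directory.
# Artifacts are stored under <artifact_path>/<run name>/<iteration>/.
artifact_path += f"/{transcribe_run.metadata.name}/{transcribe_run.metadata.iteration}/"

df = mlrun.get_dataitem(artifact_path + "transcriptions_df.parquet").as_df()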

df.head()

	audio_file	transcription_file	language	length	rate_of_speech
0	speech_01.mp3	speech_01.txt	en	2.011333	3.480278
1	speech_02.mp3	speech_02.txt	en	20.793500	2.548873
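The numbers in the table are consistent with `rate_of_speech` being words per second and `length` being seconds. The source does not state the units, so treat this as an inference. Under that assumption, multiplying the two columns recovers an approximate word count per file. A small sketch with the values copied from the table:

```python
# Rows copied from the table above (only the columns needed here).
rows = [
    {"audio_file": "speech_01.mp3", "length": 2.011333, "rate_of_speech": 3.480278},
    {"audio_file": "speech_02.mp3", "length": 20.793500, "rate_of_speech": 2.548873},
]

# Assumption: rate_of_speech = words / length, so words = length * rate_of_speech.
word_counts = {
    row["audio_file"]: round(row["length"] * row["rate_of_speech"]) for row in rows
}
print(word_counts)
```

With these values the products land within rounding error of 7 and 53 words, which supports the words-per-second reading.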
