Transcribe tutorial

import tempfile
import mlrun

Importing the transcribe function from hub

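To import the function directly from hub, use:

transcribe_fn = mlrun.import_function("hub://transcribe")

This tutorial instead imports the function from a local function.yaml and writes its artifacts to a temporary directory:

artifact_path = tempfile.mkdtemp()
transcribe_fn = mlrun.import_function("function.yaml")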

Running transcribe

transcribe_run = transcribe_fn.run(
    handler="transcribe",
    params={
        "model_name": "tiny",
        "input_path": "./data",
        "decoding_options": {"fp16": False},
        "output_directory": "./output",
    },
    returns=[
        "transcriptions: path",
        "transcriptions_df: dataset",
        {"key": "transcriptions_errors", "artifact_type": "file", "file_format": "yaml"},
    ],
    local=True,
    artifact_path=artifact_path,
)
> 2023-07-16 17:14:01,968 [info] Storing function: {'name': 'transcribe-transcribe', 'uid': 'd1384cb679bc4c178b0195d964b628a8', 'db': None}
> 2023-07-16 17:14:01,969 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:01,969 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:01,970 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:01,970 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:01,972 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:01,972 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:09,804 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:09,805 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:09,805 [info] Loading whisper model: 'tiny'
The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
> 2023-07-16 17:14:10,374 [info] Model loaded.
Transcribing:  67%|██████▋   | 2/3 [00:02<00:01,  1.04s/file]
> 2023-07-16 17:14:12,556 [warning] Error in file: '/Users/Yonatan_Shelach/projects/functions/transcribe/data/error_file.txt'
Transcribing: 100%|██████████| 3/3 [00:02<00:00,  1.39file/s]
> 2023-07-16 17:14:12,566 [info] Done:
      audio_file transcription_file language     length  rate_of_speech
0  speech_01.mp3      speech_01.txt       en   2.011333        3.480278
1  speech_02.mp3      speech_02.txt       en  20.793500        2.548873
> 2023-07-16 17:14:12,596 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,597 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:12,659 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,660 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:12,671 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,672 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect

> 2023-07-16 17:14:12,707 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,707 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:12,708 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,708 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
project:    default
uid:        ...b628a8
iter:       0
start:      Jul 16 14:14:01
state:      completed
name:       transcribe-transcribe
labels:     kind=
            owner=Yonatan_Shelach
            host=M-QWXQJK77Q0
inputs:
parameters: model_name=tiny
            audio_files_directory=./data
            decoding_options={'fp16': False}
            output_directory=./output
results:
artifacts:  transcriptions
            transcriptions_df
            transcriptions_errors

> to track results use the .show() or .logs() methods
> 2023-07-16 17:14:12,721 [info] Run execution finished: {'status': 'completed', 'name': 'transcribe-transcribe'}
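The returns list passed to run above mixes two forms: short "key: artifact_type" strings and a full dictionary. The sketch below shows how the two forms relate by converting both into dictionaries. It is an illustration only, not MLRun's own parsing code, and normalize_log_hint is a hypothetical helper name.

```python
def normalize_log_hint(hint):
    """Turn a 'key: artifact_type' string or a dict into a dict log hint.

    Hypothetical helper for illustration; MLRun does its own parsing.
    """
    if isinstance(hint, dict):
        if "key" not in hint:
            raise ValueError(f"log hint is missing 'key': {hint!r}")
        return dict(hint)  # copy so the caller's dict is not modified
    # Split only on the first colon: "name: type" -> ("name", ":", " type")
    key, sep, artifact_type = hint.partition(":")
    normalized = {"key": key.strip()}
    if sep and artifact_type.strip():
        normalized["artifact_type"] = artifact_type.strip()
    return normalized


# The same returns list used in the run call above
returns = [
    "transcriptions: path",
    "transcriptions_df: dataset",
    {"key": "transcriptions_errors", "artifact_type": "file", "file_format": "yaml"},
]
hints = [normalize_log_hint(h) for h in returns]
for hint in hints:
    print(hint)
```

The dictionary form is the one to reach for when a hint needs an extra field, such as file_format here, that the short string form has no place for.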
transcribe_run.outputs
{'transcriptions': 'store://artifacts/default/transcribe-transcribe_transcriptions:d1384cb679bc4c178b0195d964b628a8',
 'transcriptions_df': 'store://artifacts/default/transcribe-transcribe_transcriptions_df:d1384cb679bc4c178b0195d964b628a8',
 'transcriptions_errors': 'store://artifacts/default/transcribe-transcribe_transcriptions_errors:d1384cb679bc4c178b0195d964b628a8'}
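
Each value in outputs is a store:// URI. As a rough sketch, the URIs shown above can be split into their visible parts with plain string operations. The field names below (kind, project, key, suffix) are labels chosen for this example, and the layout is inferred only from the three URIs printed here, where the suffix after the colon matches the run uid from the log.

```python
def split_store_uri(uri):
    """Split a 'store://<kind>/<project>/<key>:<suffix>' URI into its parts.

    Hypothetical helper; the layout is inferred from the outputs shown above.
    """
    prefix = "store://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a store URI: {uri!r}")
    # "artifacts/default/<key>:<suffix>" -> kind, project, remainder
    kind, project, rest = uri[len(prefix):].split("/", 2)
    key, _, suffix = rest.partition(":")
    return {"kind": kind, "project": project, "key": key, "suffix": suffix}


uri = (
    "store://artifacts/default/"
    "transcribe-transcribe_transcriptions_df:d1384cb679bc4c178b0195d964b628a8"
)
parts = split_store_uri(uri)
print(parts["project"], parts["key"], parts["suffix"])
```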

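Note: when connected to an MLRun server, you can load the dataset directly from the run:

df = transcribe_run.artifact("transcriptions_df").as_df()

This tutorial runs without a server connection, so read the parquet file from the local artifact path instead:

artifact_path += f"/{transcribe_run.metadata.name}/{transcribe_run.metadata.iteration}/"
df = mlrun.get_dataitem(artifact_path + "transcriptions_df.parquet").as_df()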
df.head()
      audio_file transcription_file language     length  rate_of_speech
0  speech_01.mp3      speech_01.txt       en   2.011333        3.480278
1  speech_02.mp3      speech_02.txt       en  20.793500        2.548873
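
The transcription_file column only holds file names. To get the text itself, one option is to read those files back from the output directory. The sketch below assumes the text files sit directly inside output_directory ("./output" in the run above). load_transcriptions is a hypothetical helper, and the demo uses a temporary directory with placeholder text rather than real transcriptions.

```python
import tempfile
from pathlib import Path


def load_transcriptions(output_directory, file_names):
    """Map each transcription file name to its text content.

    Hypothetical helper; names with no matching file are skipped.
    """
    directory = Path(output_directory)
    texts = {}
    for name in file_names:
        path = directory / name
        if path.is_file():
            texts[name] = path.read_text(encoding="utf-8").strip()
    return texts


# Demo with a throwaway directory and placeholder content
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "speech_01.txt").write_text("placeholder text\n", encoding="utf-8")
    texts = load_transcriptions(demo_dir, ["speech_01.txt", "missing.txt"])

print(texts)
```

With the DataFrame from this tutorial, the equivalent call would be load_transcriptions("./output", df["transcription_file"]).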