Transcribe tutorial
import tempfile
import mlrun
Importing the transcribe function from the hub
To import the function directly from the hub, use:
transcribe_fn = mlrun.import_function("hub://transcribe")
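# Temporary directory that will hold the artifacts produced by the run
artifact_path = tempfile.mkdtemp()
# In this notebook the function is loaded from the local function.yaml instead
transcribe_fn = mlrun.import_function("function.yaml")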
Running transcribe
transcribe_run = transcribe_fn.run(
    handler="transcribe",
    params={
        "model_name": "tiny",
        "input_path": "./data",
        "decoding_options": {"fp16": False},
        "output_directory": "./output",
    },
    returns=[
        "transcriptions: path",
        "transcriptions_df: dataset",
        {"key": "transcriptions_errors", "artifact_type": "file", "file_format": "yaml"},
    ],
    local=True,
    artifact_path=artifact_path,
)
> 2023-07-16 17:14:01,968 [info] Storing function: {'name': 'transcribe-transcribe', 'uid': 'd1384cb679bc4c178b0195d964b628a8', 'db': None}
> 2023-07-16 17:14:01,969 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:01,969 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:01,970 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:01,970 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:01,972 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:01,972 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:09,804 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:09,805 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:09,805 [info] Loading whisper model: 'tiny'
The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
> 2023-07-16 17:14:10,374 [info] Model loaded.
Transcribing: 67%|██████▋ | 2/3 [00:02<00:01, 1.04s/file]
> 2023-07-16 17:14:12,556 [warning] Error in file: '/Users/Yonatan_Shelach/projects/functions/transcribe/data/error_file.txt'
Transcribing: 100%|██████████| 3/3 [00:02<00:00, 1.39file/s]
> 2023-07-16 17:14:12,566 [info] Done:
audio_file transcription_file language length rate_of_speech
0 speech_01.mp3 speech_01.txt en 2.011333 3.480278
1 speech_02.mp3 speech_02.txt en 20.793500 2.548873
> 2023-07-16 17:14:12,596 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,597 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:12,659 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,660 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:12,671 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,672 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:12,707 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,707 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
> 2023-07-16 17:14:12,708 [warning] Could not detect path to API server, not connected to API server!
> 2023-07-16 17:14:12,708 [warning] MLRUN_DBPATH is not set. Set this environment variable to the URL of the API server in order to connect
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
default | ...b628a8 | 0 | Jul 16 14:14:01 | completed | transcribe-transcribe | kind= owner=Yonatan_Shelach host=M-QWXQJK77Q0 | | model_name=tiny audio_files_directory=./data decoding_options={'fp16': False} output_directory=./output | | transcriptions transcriptions_df transcriptions_errors |
> to track results use the .show() or .logs() methods
> 2023-07-16 17:14:12,721 [info] Run execution finished: {'status': 'completed', 'name': 'transcribe-transcribe'}
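The `returns` list passed to `run` above mixes two forms: strings such as `"transcriptions: path"` and a full dictionary. As a rough illustration of how the string shorthand lines up with the dictionary form, here is a minimal sketch. `normalize_return_spec` is a hypothetical helper written for this tutorial, not part of MLRun, and it assumes the shorthand reads as `key: artifact_type`.

```python
def normalize_return_spec(spec):
    """Return a dict with 'key' and, when given, 'artifact_type'.

    Hypothetical helper: it only illustrates the two forms used in the
    `returns` list of this tutorial and is not MLRun's own parser.
    """
    if isinstance(spec, dict):
        # Dictionary form: copy it unchanged, including extra fields
        # such as 'file_format'.
        return dict(spec)
    # String form: assumed to be "key" or "key: artifact_type".
    key, sep, artifact_type = spec.partition(":")
    result = {"key": key.strip()}
    if sep:
        result["artifact_type"] = artifact_type.strip()
    return result


returns = [
    "transcriptions: path",
    "transcriptions_df: dataset",
    {"key": "transcriptions_errors", "artifact_type": "file", "file_format": "yaml"},
]

for spec in returns:
    print(normalize_return_spec(spec))
```

Each entry ends up with a `key`, which is the name the artifact is logged under and the name that appears in the `artifacts` column of the run summary.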
transcribe_run.outputs
{'transcriptions': 'store://artifacts/default/transcribe-transcribe_transcriptions:d1384cb679bc4c178b0195d964b628a8',
'transcriptions_df': 'store://artifacts/default/transcribe-transcribe_transcriptions_df:d1384cb679bc4c178b0195d964b628a8',
'transcriptions_errors': 'store://artifacts/default/transcribe-transcribe_transcriptions_errors:d1384cb679bc4c178b0195d964b628a8'}
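Each value in `outputs` is a `store://` URI. Judging only from the three URIs printed above, the layout appears to be `store://artifacts/<project>/<key>:<uid>`, where the suffix matches the run `uid` from the first log line. The sketch below splits a URI under that assumption. `parse_store_uri` is a hypothetical helper, and MLRun URIs may carry components that this output does not show.

```python
def parse_store_uri(uri):
    """Split a store URI of the form seen in this tutorial's output.

    Assumed layout (inferred from the printed outputs, not from the
    MLRun specification): store://artifacts/<project>/<key>:<uid>
    """
    prefix = "store://artifacts/"
    if not uri.startswith(prefix):
        raise ValueError(f"not an artifact store URI: {uri!r}")
    # Everything up to the first "/" after the prefix is the project.
    project, _, rest = uri[len(prefix):].partition("/")
    # The part after the first ":" is the run uid in the output above.
    key, _, uid = rest.partition(":")
    return {"project": project, "key": key, "uid": uid}


uri = (
    "store://artifacts/default/"
    "transcribe-transcribe_transcriptions_df:d1384cb679bc4c178b0195d964b628a8"
)
print(parse_store_uri(uri))
```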
Note: when connected to an MLRun server, you can simply use:
df = transcribe_run.artifact("transcriptions_df").as_df()
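The run log repeats the warning that `MLRUN_DBPATH` is not set, which is why this run is not connected to an API server. A minimal sketch of setting the variable from Python follows. The URL is a placeholder assumption and `configure_dbpath` is a hypothetical helper, not an MLRun function. It is probably safest to set the variable before `import mlrun` runs.

```python
import os

# Placeholder address (assumption): replace with the URL of your own MLRun API server.
MLRUN_API_URL = "http://localhost:8080"


def configure_dbpath(url, environ=os.environ):
    """Set MLRUN_DBPATH unless it is already defined.

    Returns the value that is in effect afterwards, so an existing
    setting is never overwritten.
    """
    return environ.setdefault("MLRUN_DBPATH", url)


print("MLRUN_DBPATH =", configure_dbpath(MLRUN_API_URL))
```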
# Without a server connection, read the dataset directly from the local artifact path
artifact_path += f"/{transcribe_run.metadata.name}/{transcribe_run.metadata.iteration}/"
df = mlrun.get_dataitem(artifact_path + "transcriptions_df.parquet").as_df()
df.head()
| audio_file | transcription_file | language | length | rate_of_speech |
---|---|---|---|---|---|
0 | speech_01.mp3 | speech_01.txt | en | 2.011333 | 3.480278 |
1 | speech_02.mp3 | speech_02.txt | en | 20.793500 | 2.548873 |
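The parquet file read above sits under the artifact path, in a subdirectory named after the run and its iteration. The sketch below shows how that path is composed. The base directory `/tmp/tmpabc` is a made-up stand-in for the value returned by `tempfile.mkdtemp()`, and `local_artifact_file` is a hypothetical helper.

```python
import posixpath


def local_artifact_file(artifact_path, run_name, iteration, filename):
    """Compose <artifact_path>/<run_name>/<iteration>/<filename>.

    Mirrors the string concatenation used in this tutorial; the layout
    is taken from that code, not from an MLRun guarantee.
    """
    return posixpath.join(artifact_path, run_name, str(iteration), filename)


path = local_artifact_file(
    "/tmp/tmpabc",              # stand-in for tempfile.mkdtemp()
    "transcribe-transcribe",    # transcribe_run.metadata.name in the log above
    0,                          # the iteration shown in the run summary
    "transcriptions_df.parquet",
)
print(path)
```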