Translate tutorial#

Short description and explanation#

Machine translation has made huge strides in recent years thanks to advances in deep learning, and our translate function makes it even easier to use.
Simply tell it where your file is and which languages you’re working with (the one you’re translating from and the one you want),
and the function takes care of the rest. It picks the right pre-trained model for your language pair, ensuring high-quality translations.

No need to worry about finding the perfect model or dealing with complex setup – it’s all handled behind the scenes.

With this function, language translation becomes a breeze, making your documents accessible in any language without breaking a sweat.

Background#

The function takes two parameters: a model name or the source and target languages, and a path to one or more text files to translate.

It first checks if a model name was passed. If so, it loads that Helsinki-NLP model.
If not, it looks at the source and target languages and loads the appropriate Helsinki-NLP translation model.

It then reads in the text files and translates them using the loaded model.

Finally, it writes the translated text out to new files and returns the file name or directory name.
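The output-file naming can be sketched as follows. This is a hypothetical helper (the name `output_path` and the exact suffix scheme are assumptions for illustration), chosen to match the demo below, where translating `data.txt` produces `data_2.txt`:

```python
from pathlib import Path


def output_path(text_file: str, output_directory: str) -> Path:
    # Hypothetical sketch: derive the translated file's path by appending
    # a suffix to the source file's stem, as seen in the demo output
    # ('data.txt' -> 'data_2.txt'). Not the actual implementation.
    source = Path(text_file)
    return Path(output_directory) / f"{source.stem}_2{source.suffix}"


print(output_path("data.txt", "./"))  # data_2.txt
```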

This allows the user to easily translate a text file into another language using Helsinki-NLP’s pre-trained models, just by passing the model name or language pair and a source text file.

This function’s auto-model selection is based on the excellent translation models offered by Helsinki-NLP. Check them out at https://huggingface.co/Helsinki-NLP
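When no model name is passed, the model identifier can be assembled from the language pair following the Helsinki-NLP opus-mt naming convention on the Hugging Face Hub. A minimal sketch (the helper name is hypothetical, but the naming convention is the one used by these models):

```python
def build_model_name(source_language: str, target_language: str) -> str:
    # Helsinki-NLP publishes its opus-mt models under identifiers of the
    # form "Helsinki-NLP/opus-mt-<src>-<tgt>", e.g. Turkish-to-English
    # is "Helsinki-NLP/opus-mt-tr-en".
    return f"Helsinki-NLP/opus-mt-{source_language}-{target_language}"


print(build_model_name("tr", "en"))  # Helsinki-NLP/opus-mt-tr-en
```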

Requirements#

transformers
tqdm

Documentation#

data_path: A directory of text files, a single text file, or a list of files to translate.

output_directory: Directory where the translated files will be saved.

model_name: The name of a model to load. If None, the model name is constructed using the source and
target languages parameters from the “Helsinki-NLP” group.

source_language: The source language code (e.g., ‘en’ for English).

target_language: The target language code (e.g., ‘en’ for English).

model_kwargs: Keyword arguments to pass when loading the model via Hugging Face’s “pipeline” function.

device: The device index for transformers. By default, CUDA is preferred if available.

batch_size: The batch size to use during translation. Files are translated one at a time, but the sentences within each file can be batched.

translation_kwargs: Additional keyword arguments to pass to the “transformers.TranslationPipeline” during
translation inference. Note that the batch size is added automatically.

Demo#

The following demo shows an example of translating a text file written in Turkish into English using the translate function.

(1.) Import the function (import mlrun, set project and import function)#

import mlrun

We want to translate the following Turkish sentence into English, so we will write it to a text file.

%%writefile data.txt
Ali her gece bir kitap okur. # which means: "Ali reads a book every night."
Writing data.txt

Setting a project and importing the translate function

project = mlrun.new_project("test-translate")
translate_fn = project.set_function("hub://translate", "translate")
> 2023-12-06 14:44:05,223 [info] Created and saved project: {'name': 'test-translate', 'from_template': None, 'overwrite': False, 'context': './', 'save': True}

Usage#

(2.1.) Manual model selection#

Here we run the function that we’ve imported from the MLRun Function Hub.
We select a specific model, give the function the path to the file and an output directory, and choose to run on the CPU.

translate_run = translate_fn.run(
    handler="translate",
    inputs={"data_path": "data.txt"},
    params={
        "model_name": "Helsinki-NLP/opus-mt-tr-en",
        "device": "cpu",
        "output_directory": "./",
    },
    local=True,
    returns=[
        "files: path",
        "text_files_dataframe: dataset",
        "errors: dict",
    ],
)
> 2023-12-06 14:48:52,794 [info] Storing function: {'name': 'translate-translate', 'uid': '5768d0ddaf06469da053c85d47f61a47', 'db': 'http://mlrun-api:8080'}
Recommended: pip install sacremoses.
> 2023-12-06 14:48:56,190 [warning] Skipping logging an object with the log hint '{'key': 'errors', 'artifact_type': 'dict'}' due to the following error:
An exception was raised during the packing of '{}': No packager was found for the combination of 'object_type=builtins.dict' and 'artifact_type=dict'.
project: test-translate
iter: 0
start: Dec 06 14:48:52
state: completed
name: translate-translate
labels: v3io_user=yonis, kind=local, owner=yonis, host=jupyter-yonis-7c9bdbfb4d-9g2p2
inputs: data_path
parameters: model_name=Helsinki-NLP/opus-mt-tr-en, device=cpu, output_directory=./
artifacts: files, text_files_dataframe

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-12-06 14:48:56,409 [info] Run execution finished: {'status': 'completed', 'name': 'translate-translate'}

(2.2.) Auto model detection#

Here we run the function that we’ve imported from the MLRun Function Hub.
We specify the source and target languages used to choose the model, give the function the path to the file and an output directory, and choose to run on the CPU.

translate_run = translate_fn.run(
    handler="translate",
    inputs={"data_path": "data.txt"},
    params={
        "target_language": "en",
        "source_language": "tr",
        "device": "cpu",
        "output_directory": "./",
    },
    local=True,
    returns=[
        "files: path",
        "text_files_dataframe: dataset",
        "errors: dict",
    ],
)

We can take a look at the file created.

(3.) Review results#

We can look at the artifact returned, a dataframe mapping each source file to its translated file.

translate_run.artifact("text_files_dataframe").show()
   text_file  translation_file
0  data.txt   data_2.txt

To check that the translation is correct, we print the text file created by the function and can see the sentence is as expected.

with open("data_2.txt", "r") as f:
    print(f"Translated text:\n{f.read()}")
Translated text:
Ali reads a book every night.