translate package

Submodules

translate.translate module

translate.translate.open_mpi_handler(worker_inputs: List[str], root_worker_inputs: Dict[str, Any] | None = None)
translate.translate.translate(data_path: str | List[str] | Path, output_directory: str, model_name: str | None = None, source_language: str | None = None, target_language: str | None = None, device: str | None = None, model_kwargs: dict | None = None, batch_size: int = 1, translation_kwargs: dict | None = None, verbose: bool = False) → Tuple[str, DataFrame, dict]

Translate text files using a transformer model from the Hugging Face Hub, selected according to the given source and target languages (or by a directly provided model name). The result is a directory of translated text files and a dataframe containing the following columns:

  • text_file - The text file path.

  • translation_file - The translation text file name in the output directory.

Parameters:
  • data_path – A directory of text files or a single file or a list of files to translate.

  • output_directory – Directory where the translated files will be saved.

  • model_name – The name of a model to load. If None, the model name is constructed from the source_language and target_language parameters.

  • source_language – The source language code (e.g., ‘en’ for English).

  • target_language – The target language code (e.g., ‘en’ for English).

  • model_kwargs – Keyword arguments for loading the model, passed to Hugging Face’s pipeline function.

  • device – The device index to use with transformers. By default, CUDA is preferred if available.

  • batch_size – The number of sentences per batch during translation. The files are translated one by one, but the sentences within each file can be batched.

  • translation_kwargs – Additional keyword arguments to pass to a transformers.TranslationPipeline during translation inference. Note that the batch size is added automatically.

  • verbose – Whether to show a progress bar and log errors. Default: False.
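
The way data_path and model_name are interpreted can be illustrated with a minimal sketch. Both helpers below are hypothetical and are not part of this module. In particular, the Helsinki-NLP/opus-mt naming pattern is only an assumption about how a model name could be built from the two language codes; the documentation above does not state which pattern is actually used.

```python
from pathlib import Path
from typing import List, Optional, Union


def resolve_text_files(data_path: Union[str, List[str], Path]) -> List[Path]:
    """Hypothetical helper: normalize data_path into a list of file paths."""
    if isinstance(data_path, list):
        # A list of files is used as given.
        return [Path(p) for p in data_path]
    path = Path(data_path)
    if path.is_dir():
        # A directory contributes every regular file in it, in sorted order.
        return sorted(p for p in path.iterdir() if p.is_file())
    # Anything else is treated as a single file.
    return [path]


def resolve_model_name(
    model_name: Optional[str],
    source_language: Optional[str],
    target_language: Optional[str],
) -> str:
    """Hypothetical helper: prefer an explicit name, else build one."""
    if model_name is not None:
        return model_name
    if source_language is None or target_language is None:
        raise ValueError(
            "Provide model_name, or both source_language and target_language."
        )
    # Assumption: OPUS-MT style naming on the Hugging Face Hub.
    return f"Helsinki-NLP/opus-mt-{source_language}-{target_language}"
```

An explicit model_name takes precedence in this sketch, which mirrors the description of model_name above.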

Returns:

A tuple of:

  • Path to the output directory.

  • A dataframe of the source text files and their translated file names.

  • A dictionary of errored files that were not translated.
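
A call to translate and the handling of its three return values might look like the sketch below. The input and output paths are hypothetical, and the shape of the error dictionary (file name mapped to error message) is an assumption, since the documentation above only says it holds the files that were not translated.

```python
from typing import Dict, List


def format_errors(errors: Dict[str, str]) -> List[str]:
    """Turn the returned error dictionary into printable lines.

    Assumption: keys are file names and values are error messages.
    """
    return [f"{name}: {message}" for name, message in sorted(errors.items())]


def main() -> None:
    # Imported lazily so the helper above can be used without the package.
    from translate.translate import translate

    output_directory, dataframe, errors = translate(
        data_path="./texts",  # hypothetical input directory
        output_directory="./translated",  # hypothetical output directory
        source_language="en",
        target_language="fr",
        batch_size=8,
        verbose=True,
    )
    print(f"Translated files were written to {output_directory}")
    # The two columns documented above.
    print(dataframe[["text_file", "translation_file"]])
    for line in format_errors(errors):
        print(line)
```

Calling main() requires the translate package and a model that can be downloaded; format_errors has no such dependency.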

Module contents