translate package

Submodules

translate.translate module

translate.translate.open_mpi_handler(worker_inputs: List[str], root_worker_inputs: Optional[Dict[str, Any]] = None)
translate.translate.translate(data_path: Union[str, List[str], pathlib.Path], output_directory: str, model_name: Optional[str] = None, source_language: Optional[str] = None, target_language: Optional[str] = None, device: Optional[str] = None, model_kwargs: Optional[dict] = None, batch_size: int = 1, translation_kwargs: Optional[dict] = None, verbose: bool = False) → Tuple[str, pandas.core.frame.DataFrame, dict]

Translate text files using a transformer model from HuggingFace’s hub, selected either by the given source and target languages or by a directly provided model name. The result is a directory of translated text files and a dataframe containing the following columns:

  • text_file - The text file path.

  • translation_file - The translation text file name in the output directory.

Parameters
  • data_path – A directory of text files or a single file or a list of files to translate.

  • output_directory – Directory where the translated files will be saved.

  • model_name – The name of a model to load. If None, the model name is constructed from the source_language and target_language parameters.

  • source_language – The source language code (e.g., ‘en’ for English).

  • target_language – The target language code (e.g., ‘en’ for English).

  • model_kwargs – Keyword arguments for loading the model, passed to HuggingFace’s pipeline function.

  • device – The device index for transformers. By default, CUDA is preferred if available.

  • batch_size – The batch size to use in translation, that is, the number of sentences per batch. The files are translated one by one, but the sentences within a file can be batched. Default: 1.

  • translation_kwargs – Additional keyword arguments to pass to a transformers.TranslationPipeline when running the translation inference. Note that the batch size is added to these arguments automatically.

  • verbose – Whether to show a progress bar and log errors. Default: False.
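The batch_size parameter can be pictured with a small sketch. The helper below is hypothetical and is not taken from this module's implementation; it only illustrates how the sentences of a single file might be grouped into batches of batch_size before being passed to a translation pipeline.

```python
from typing import Iterator, List


def batch_sentences(sentences: List[str], batch_size: int = 1) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` sentences.

    Hypothetical helper for illustration only; the module's actual
    batching logic may differ.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


# Five sentences from one file, grouped two at a time.
sentences = ["One.", "Two.", "Three.", "Four.", "Five."]
batches = list(batch_sentences(sentences, batch_size=2))
print(batches)
```

With batch_size=1, the default in the signature, every sentence forms its own batch.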

Returns

A tuple of:

  • Path to the output directory.

  • A dataframe listing the text files and their translation file names.

  • A dictionary of errored files that were not translated.
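A minimal usage sketch follows, assuming the module is importable as translate.translate in your environment. The helper names write_sample_files and translate_samples, the sample texts, and the English-to-French language pair are illustrative choices. The layout of the errors dictionary (file name mapped to an error message) is also an assumption, since only its purpose is documented here.

```python
from pathlib import Path
from typing import List


def write_sample_files(directory: Path) -> List[str]:
    """Create two small English text files to use as translation input."""
    directory.mkdir(parents=True, exist_ok=True)
    samples = {
        "greeting.txt": "Hello, how are you?",
        "weather.txt": "It is raining today.",
    }
    paths = []
    for name, text in samples.items():
        path = directory / name
        path.write_text(text, encoding="utf-8")
        paths.append(str(path))
    return sorted(paths)


def translate_samples(workdir: Path):
    """Translate the sample files and print the documented return values."""
    # Assumption: the module documented on this page is importable like this.
    from translate.translate import translate

    data_paths = write_sample_files(workdir / "input")
    output_path = workdir / "output"
    output_path.mkdir(parents=True, exist_ok=True)

    output_directory, dataframe, errors = translate(
        data_path=data_paths,            # a list of files is one accepted form
        output_directory=str(output_path),
        source_language="en",            # model_name is left as None, so the
        target_language="fr",            # model is chosen from the languages
        batch_size=4,
        verbose=True,
    )

    # Each row pairs an input file with its translation file.
    for _, row in dataframe.iterrows():
        print(row["text_file"], "->", row["translation_file"])

    # Assumed layout: file name mapped to an error description.
    for file_name, error in errors.items():
        print("not translated:", file_name, error)

    return output_directory, dataframe, errors
```

Calling translate_samples(Path("translation_demo")) would load a model, so the first run may take a while and needs network access to HuggingFace’s hub.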

Module contents