arc_to_parquet package

Submodules

arc_to_parquet.arc_to_parquet module

arc_to_parquet.arc_to_parquet.arc_to_parquet(context: MLClientCtx, archive_url: DataItem, header: List[str] = [None], chunksize: int = 0, dtype=None, encoding: str = 'latin-1', key: str = 'data', dataset: str = 'None', part_cols=[], file_ext: str = 'parquet', index: bool = False, refresh_data: bool = False, stats: bool = False) -> None

Open a file/object archive and save as a parquet file or dataset

Notes

  • this function is typically used for large files; please be sure to check all settings

  • partitioning requires precise specification of column types.

  • the archive_url can be any file readable by pandas read_csv, which includes tar files

  • if the dataset parameter is not empty, then a partitioned dataset will be created instead of a single file in the folder dataset

  • if a key exists already then it will not be re-acquired unless the refresh_data param is set to True. This is in case the original file is corrupt, or a refresh is required.

Parameters:
  • context – the function context

  • archive_url – MLRun data input (DataItem object)

  • chunksize – (0) when > 0, row size (chunk) to retrieve per iteration

  • dtype – destination data type of specified columns

  • encoding – ("latin-1") file encoding

  • key – key in artifact store (when log_data=True)

  • dataset – (None) if not None then "target_path/dataset" is the folder for partitioned files
  • part_cols – ([]) list of partitioning columns

  • file_ext – (parquet) csv/parquet file extension

  • index – (False) pandas save index option

  • refresh_data – (False) overwrite existing data at that location

  • stats – (False) calculate table stats when logging artifact
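
Example (a minimal usage sketch, assuming an MLRun environment is configured and that the function is loaded from the MLRun function hub; the hub URI and the input URL below are illustrative, not part of this module):

    import mlrun

    # Load the function; "hub://arc_to_parquet" is assumed to be the published hub URI.
    fn = mlrun.import_function("hub://arc_to_parquet")

    # Convert a CSV archive into a single parquet artifact logged under the key "data".
    run = fn.run(
        handler="arc_to_parquet",
        inputs={"archive_url": "https://example.com/data/archive.csv.gz"},  # illustrative URL
        params={
            "key": "data",
            "file_ext": "parquet",
            "chunksize": 10000,      # read and write 10,000 rows per iteration
            "refresh_data": True,    # re-acquire the data even if the key already exists
        },
        local=True,
    )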

Module contents