arc_to_parquet package

Submodules

arc_to_parquet.arc_to_parquet module

arc_to_parquet.arc_to_parquet.arc_to_parquet(context: MLClientCtx, archive_url: DataItem, header: List[str] = [None], chunksize: int = 0, dtype=None, encoding: str = 'latin-1', key: str = 'data', dataset: str = 'None', part_cols=[], file_ext: str = 'parquet', index: bool = False, refresh_data: bool = False, stats: bool = False) -> None

Open a file/object archive and save as a parquet file or dataset

Notes

  • this function is typically used for large files; please be sure to check all settings

  • partitioning requires precise specification of column types.

  • the archive_url can be any file readable by pandas read_csv, which includes tar files

  • if the dataset parameter is not empty, then a partitioned dataset will be created instead of a single file in the folder dataset

  • if a key exists already then it will not be re-acquired unless the refresh_data param is set to True. This is in case the original file is corrupt, or a refresh is required.

Parameters:
  • context – the function context

  • archive_url – MLRun data input (DataItem object)

  • chunksize – (0) when > 0, row size (chunk) to retrieve per iteration

  • dtype – destination data type of specified columns

  • encoding – ("latin-1") file encoding

  • key – key in artifact store (when log_data=True)

  • dataset – (None) if not None then "target_path/dataset" is the folder for partitioned files
  • part_cols – ([]) list of partitioning columns

  • file_ext – (parquet) csv/parquet file extension

  • index – (False) pandas save index option

  • refresh_data – (False) overwrite existing data at that location

  • stats – (False) calculate table stats when logging artifact
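
Example (a minimal usage sketch, assuming an MLRun environment is configured and that the function is loaded from the MLRun function hub; the hub URI and the input URL below are illustrative, not part of this module):

    import mlrun

    # Load the function; "hub://arc_to_parquet" is assumed to be the published hub URI.
    fn = mlrun.import_function("hub://arc_to_parquet")

    # Convert a CSV archive into a single parquet artifact logged under the key "data".
    run = fn.run(
        handler="arc_to_parquet",
        inputs={"archive_url": "https://example.com/data/archive.csv.gz"},  # illustrative URL
        params={
            "key": "data",
            "file_ext": "parquet",
            "chunksize": 10000,      # read and write 10,000 rows per iteration
            "refresh_data": True,    # re-acquire the data even if the key already exists
        },
        local=True,
    )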

Module contents