arc_to_parquet package
Submodules
arc_to_parquet.arc_to_parquet module
- arc_to_parquet.arc_to_parquet.arc_to_parquet(context: MLClientCtx, archive_url: DataItem, header: List[str] = [None], chunksize: int = 0, dtype=None, encoding: str = 'latin-1', key: str = 'data', dataset: str = 'None', part_cols=[], file_ext: str = 'parquet', index: bool = False, refresh_data: bool = False, stats: bool = False) → None [source]
Open a file/object archive and save it as a parquet file or dataset.
Notes
- This function is typically used for large files; be sure to check all settings before running it.
- Partitioning requires precise specification of column types (see the sketch after these notes).
- The archive_url can be any file readable by pandas read_csv, which includes tar files.
- If the dataset parameter is not empty, a partitioned dataset will be created instead of a single file in the folder dataset.
- If a key already exists, it will not be re-acquired unless the refresh_data param is set to True, for example when the original file is corrupt or a refresh is required.
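To make the partitioning note concrete, here is a minimal sketch (not the function's internals) of writing a partitioned parquet dataset with pandas and pyarrow, analogous to passing dataset= and part_cols= to arc_to_parquet; the file name and column names are assumptions for illustration:

```python
import pandas as pd

# Declare partition column dtypes explicitly: partition values become
# directory names in the output dataset, so imprecise types cause surprises.
df = pd.read_csv(
    "archive.csv.gz",                             # hypothetical source file
    dtype={"year": "int32", "region": "string"},  # precise column types
)

df.to_parquet(
    "target_path/dataset",              # a folder, not a single file
    partition_cols=["year", "region"],  # one subfolder per column value
    index=False,                        # mirrors the index=False parameter
)
```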
- Parameters:
context – the function context
archive_url – MLRun data input (DataItem object)
chunksize – (0) when > 0, the number of rows (chunk size) to retrieve per iteration
dtype – destination data type of specified columns
encoding – ('latin-1') file encoding
key – key in the artifact store (when log_data=True)
dataset – (None) if not None, then "target_path/dataset" is the folder for partitioned files
part_cols – ([]) list of partitioning columns
file_ext – (parquet) csv/parquet file extension
index – (False) pandas save index option
refresh_data – (False) overwrite existing data at that location
stats – (False) calculate table stats when logging the artifact
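Below is a minimal usage sketch of running this handler through MLRun; the hub URL and archive URL are assumptions for illustration, and the parameter values are examples rather than recommended settings:

```python
import mlrun

# Load the function from the MLRun functions hub (assumed hub URL).
fn = mlrun.import_function("hub://arc_to_parquet")

run = fn.run(
    handler="arc_to_parquet",
    inputs={"archive_url": "https://example.com/data.csv.gz"},  # any pandas-readable file
    params={
        "key": "data",         # artifact key in the artifact store
        "chunksize": 10_000,   # fetch 10k rows per iteration
        "refresh_data": True,  # re-acquire even if the key already exists
    },
    local=True,  # run in the local environment for testing
)
```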