Archive to Parquet Function Example


The arc_to_parquet function is typically used for large files. It accepts an archive as input and stores the data in a file system. In this example we use arc_to_parquet to unarchive the higgs-sample data file stored on S3 and store it on the local file system in Parquet format.
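To make the conversion concrete, here is a minimal local sketch of the same idea: unpack a gzip-compressed CSV, then write it out as Parquet. It is not the hub function's implementation. The helper names `unarchive_gzip` and `csv_to_parquet` are hypothetical, and the Parquet step assumes pandas and a Parquet engine such as pyarrow are installed.

```python
import gzip
import shutil
from pathlib import Path


def unarchive_gzip(archive_path, target_path):
    """Decompress a .gz archive to target_path, streaming to limit memory use."""
    with gzip.open(archive_path, "rb") as src, open(target_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return Path(target_path)


def csv_to_parquet(csv_path, parquet_path, header="infer"):
    """Load a CSV with pandas and store it as Parquet (hypothetical helper)."""
    # Assumption: pandas and pyarrow (or fastparquet) are installed
    import pandas as pd

    df = pd.read_csv(csv_path, header=header)
    # Use string column names, which Parquet writers expect
    df.columns = [str(col) for col in df.columns]
    df.to_parquet(parquet_path)
    return df.shape
```

Calling `unarchive_gzip("higgs-sample.csv.gz", "higgs-sample.csv")` and then `csv_to_parquet("higgs-sample.csv", "higgs-sample.parquet")` would mirror the two stages locally. The hub function additionally runs as an MLRun job and logs the result as an artifact, as shown in the rest of this example.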

# load environment variables from an env file, if it exists
import os

import mlrun

# path to the env file
path = "/tmp/examples_ci.env"

if os.path.exists(path):
    env_dict = mlrun.set_env_from_file(path, return_dict=True)
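The env file is a plain text file of KEY=VALUE lines, with comment lines starting with #. As an illustration of the format only, the sketch below parses such a file with the standard library and exports the values to `os.environ`. It is a stand-in, not how `mlrun.set_env_from_file` is implemented, and the helper name `load_env_file` is hypothetical.

```python
import os


def load_env_file(path):
    """Parse KEY=VALUE lines, export them to os.environ, and return them as a dict."""
    values = {}
    with open(path) as env_file:
        for raw_line in env_file:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without a key/value separator
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value
            values[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(values)
    return values
```

For a hypothetical file containing the single line `MLRUN_DBPATH=http://mlrun-api:8080`, the helper returns `{"MLRUN_DBPATH": "http://mlrun-api:8080"}` and sets the same variable in the process environment.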
# create the new project
project_name = 'arch-to-parquet-example'

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)
> 2022-12-25 11:14:04,646 [info] loaded project arch-to-parquet-example from MLRun DB
# import packages
import mlrun
from mlrun import import_function
# declare the dataset
DATA_URL = "https://s3.wasabisys.com/iguazio/data/market-palce/arc_to_parquet/higgs-sample.csv.gz"
# import the function
arc_to_parquet_function = import_function("hub://arc_to_parquet")
# run the function
arc_to_parquet_run = arc_to_parquet_function.run(
    params={"key": "higgs-sample"},
    handler="arc_to_parquet",
    inputs={"archive_url": DATA_URL},
)
> 2022-12-25 11:14:05,030 [warning] it is recommended to use k8s secret (specify secret_name), specifying the aws_access_key/aws_secret_key directly is unsafe
> 2022-12-25 11:14:05,046 [info] starting run arc-to-parquet-arc_to_parquet uid=cb1962a5333f4f9f9c16faabfd1e94c1 DB=http://mlrun-api:8080
> 2022-12-25 11:14:05,203 [info] Job is running in the background, pod: arc-to-parquet-arc-to-parquet-8kz4b
> 2022-12-25 11:14:44,126 [info] downloading https://s3.wasabisys.com/iguazio/data/market-palce/arc_to_parquet/higgs-sample.csv.gz to local temp file
> 2022-12-25 11:14:44,793 [info] destination file does not exist, downloading
> 2022-12-25 11:14:45,143 [info] To track results use the CLI: {'info_cmd': 'mlrun get run cb1962a5333f4f9f9c16faabfd1e94c1 -p arch-to-parquet-example-jovyan', 'logs_cmd': 'mlrun logs cb1962a5333f4f9f9c16faabfd1e94c1 -p arch-to-parquet-example-jovyan'}
> 2022-12-25 11:14:45,144 [info] run executed, status=completed
final state: completed
project: arch-to-parquet-example-jovyan
uid:
iter: 0
start: Dec 25 11:14:44
state: completed
name: arc-to-parquet-arc_to_parquet
labels: kind=job, owner=jovyan, mlrun/client_version=1.2.1-rc7, host=arc-to-parquet-arc-to-parquet-8kz4b
inputs: archive_url
parameters: key=higgs-sample
results:
artifacts: higgs-sample

> to track results use the .show() or .logs() methods
> 2022-12-25 11:14:47,549 [info] run executed, status=completed
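Besides `.show()` and `.logs()`, a run's state and outputs can be read programmatically. The sketch below assumes the run object exposes `status.state` and `outputs`; treat those attribute names as an assumption to check against your MLRun version. The helper `summarize_run` is hypothetical, and the stand-in object exists only so the snippet runs without an MLRun cluster.

```python
from types import SimpleNamespace


def summarize_run(run):
    """Collect the state and outputs of a run-like object into a plain dict."""
    return {
        "state": run.status.state,
        "outputs": dict(run.outputs),
    }


# Stand-in with the same attribute shape, used only to exercise the helper
fake_run = SimpleNamespace(
    status=SimpleNamespace(state="completed"),
    outputs={"higgs-sample": "<artifact uri>"},
)
summary = summarize_run(fake_run)
```

In this example the call would be `summarize_run(arc_to_parquet_run)`. Returning a plain dict keeps the result easy to log or compare in a test.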

Show the results

arc_to_parquet_run.artifact('higgs-sample').show()
Unnamed: 0 1.000000000000000000e+00 8.692932128906250000e-01 -6.350818276405334473e-01 2.256902605295181274e-01 3.274700641632080078e-01 -6.899932026863098145e-01 7.542022466659545898e-01 -2.485731393098831177e-01 -1.092063903808593750e+00 ... -1.045456994324922562e-02 -4.576716944575309753e-02 3.101961374282836914e+00 1.353760004043579102e+00 9.795631170272827148e-01 9.780761599540710449e-01 9.200048446655273438e-01 7.216574549674987793e-01 9.887509346008300781e-01 8.766783475875854492e-01
0 0 1.0 0.907542 0.329147 0.359412 1.497970 -0.313010 1.095531 -0.557525 -1.588230 ... -1.138930 -0.000819 0.000000 0.302220 0.833048 0.985700 0.978098 0.779732 0.992356 0.798343
1 1 1.0 0.798835 1.470639 -1.635975 0.453773 0.425629 1.104875 1.282322 1.381664 ... 1.128848 0.900461 0.000000 0.909753 1.108330 0.985692 0.951331 0.803252 0.865924 0.780118
2 2 0.0 1.344385 -0.876626 0.935913 1.992050 0.882454 1.786066 -1.646778 -0.942383 ... -0.678379 -1.360356 0.000000 0.946652 1.028704 0.998656 0.728281 0.869200 1.026736 0.957904
3 3 1.0 1.105009 0.321356 1.522401 0.882808 -1.205349 0.681466 -1.070464 -0.921871 ... -0.373566 0.113041 0.000000 0.755856 1.361057 0.986610 0.838085 1.133295 0.872245 0.808487
4 4 0.0 1.595839 -0.607811 0.007075 1.818450 -0.111906 0.847550 -0.566437 1.581239 ... -0.654227 -1.274345 3.101961 0.823761 0.938191 0.971758 0.789176 0.430553 0.961357 0.957818
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 95 1.0 0.708794 0.850221 0.672354 0.948589 -1.137755 1.240911 0.416861 1.581794 ... 1.461144 -0.758832 0.000000 0.971662 0.856350 1.134024 0.949969 1.594826 1.048655 0.922793
96 96 0.0 1.135022 0.285319 -1.109411 1.088544 -0.896261 1.103134 0.126724 0.964220 ... -1.183070 -0.956380 1.550981 0.883162 0.925714 0.986575 1.057785 0.599632 0.887197 0.970676
97 97 1.0 1.124042 0.354470 0.039812 1.132499 1.620306 0.955921 1.375404 0.415942 ... -0.175354 1.561916 0.000000 0.851553 1.251061 1.546395 0.743475 0.138550 0.717625 0.746045
98 98 1.0 0.341495 -1.223359 -1.372971 0.993666 0.691938 1.086187 0.318829 -1.185753 ... 1.305406 0.426011 0.000000 1.429510 0.975100 0.988090 1.257337 1.353208 1.040413 0.962988
99 99 0.0 1.217926 -0.307828 -1.601573 1.532369 -1.006824 0.555781 -0.059439 0.819528 ... -1.487883 0.811120 0.000000 0.627298 0.812112 0.989371 0.704444 0.573487 0.708875 0.764996

100 rows × 30 columns
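The column labels in the output above look like data values, which suggests that the source CSV has no header row and its first record was used as the header. That reading is an assumption based on the output alone. Under that assumption, a short sketch for applying placeholder names to a loaded pandas DataFrame follows; the helper names and the `label` / `feature_N` naming scheme are hypothetical, not part of the dataset or of arc_to_parquet.

```python
def placeholder_column_names(n_columns, label="label", prefix="feature_"):
    """Return placeholder names: one label column followed by numbered features."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    return [label] + [f"{prefix}{i}" for i in range(1, n_columns)]


def relabel(df):
    """Drop a leftover index column, if present, and apply placeholder names.

    Expects a pandas DataFrame; returns a new frame and leaves the input unchanged.
    """
    df = df.drop(columns=["Unnamed: 0"], errors="ignore")
    df.columns = placeholder_column_names(len(df.columns))
    return df
```

Relabeling does not recover a record that was consumed as the header. If that row matters, re-reading the original CSV with `header=None` is the safer route.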