Archive to Parquet Function Example
The arc_to_parquet function is typically used for large files. It accepts an archive as input and stores the data in a file system. In this example we use arc_to_parquet to unarchive the higgs-sample data file stored on S3 and store it on the local file system in Parquet format.
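As a rough mental model of what an archive-to-Parquet step involves, the sketch below decompresses a gzip-compressed CSV and rewrites it as Parquet. It is an illustration under stated assumptions, not the hub function's actual implementation: the helper names `extract_gzip` and `csv_to_parquet` are hypothetical, and the Parquet step assumes pandas with a Parquet engine such as pyarrow is installed.

```python
import gzip
import shutil
from pathlib import Path


def extract_gzip(archive_path, target_path):
    """Decompress a .gz archive to target_path, skipping work if it already exists."""
    target = Path(target_path)
    if target.exists():
        return target  # reuse the previously extracted file
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream in chunks so a large archive never has to fit in memory.
    with gzip.open(archive_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def csv_to_parquet(csv_path, parquet_path):
    """Rewrite a CSV file as Parquet (assumes pandas and pyarrow are installed)."""
    import pandas as pd  # imported lazily so extract_gzip works without pandas

    df = pd.read_csv(csv_path)  # loads the whole file into memory
    df.to_parquet(parquet_path, index=False)
    return parquet_path
```

This sketch reads the whole CSV at once. For files that do not fit in memory, reading in chunks (the `chunksize` argument of `pandas.read_csv`) would be the natural extension.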
# load environment variables from an env file, if it exists
import os
import mlrun
# Specify path
path = "/tmp/examples_ci.env"
if os.path.exists(path):
    env_dict = mlrun.set_env_from_file(path, return_dict=True)
# create the new project
project_name = 'arch-to-parquet-example'
# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)
> 2022-12-25 11:14:04,646 [info] loaded project arch-to-parquet-example from MLRun DB
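The setup code above loads settings from an env file before creating the project. Such files are conventionally plain `KEY=VALUE` lines, and the sketch below shows one way this kind of loading can work using only the standard library. The helper `load_env_file` is hypothetical and is not MLRun's implementation of `set_env_from_file`; it ignores quoting and other features that a full dotenv parser handles.

```python
import os


def load_env_file(path):
    """Parse KEY=VALUE lines from path, export them, and return them as a dict."""
    loaded = {}
    with open(path, encoding="utf-8") as env_file:
        for raw_line in env_file:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip()
    os.environ.update(loaded)
    return loaded
```

Returning the parsed values as a dict mirrors the `return_dict=True` usage in the example, which makes it easy to inspect what was set.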
# import packages
import mlrun
from mlrun import import_function
# declare the dataset
DATA_URL = "https://s3.wasabisys.com/iguazio/data/market-palce/arc_to_parquet/higgs-sample.csv.gz"
# import the function
arc_to_parquet_function = import_function("hub://arc_to_parquet")
# run the function
arc_to_parquet_run = arc_to_parquet_function.run(
    params={"key": "higgs-sample"},
    handler="arc_to_parquet",
    inputs={"archive_url": DATA_URL},
)
> 2022-12-25 11:14:05,030 [warning] it is recommended to use k8s secret (specify secret_name), specifying the aws_access_key/aws_secret_key directly is unsafe
> 2022-12-25 11:14:05,046 [info] starting run arc-to-parquet-arc_to_parquet uid=cb1962a5333f4f9f9c16faabfd1e94c1 DB=http://mlrun-api:8080
> 2022-12-25 11:14:05,203 [info] Job is running in the background, pod: arc-to-parquet-arc-to-parquet-8kz4b
> 2022-12-25 11:14:44,126 [info] downloading https://s3.wasabisys.com/iguazio/data/market-palce/arc_to_parquet/higgs-sample.csv.gz to local temp file
> 2022-12-25 11:14:44,793 [info] destination file does not exist, downloading
> 2022-12-25 11:14:45,143 [info] To track results use the CLI: {'info_cmd': 'mlrun get run cb1962a5333f4f9f9c16faabfd1e94c1 -p arch-to-parquet-example-jovyan', 'logs_cmd': 'mlrun logs cb1962a5333f4f9f9c16faabfd1e94c1 -p arch-to-parquet-example-jovyan'}
> 2022-12-25 11:14:45,144 [info] run executed, status=completed
final state: completed
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
arch-to-parquet-example-jovyan | cb1962a5333f4f9f9c16faabfd1e94c1 | 0 | Dec 25 11:14:44 | completed | arc-to-parquet-arc_to_parquet | kind=job owner=jovyan mlrun/client_version=1.2.1-rc7 host=arc-to-parquet-arc-to-parquet-8kz4b | archive_url | key=higgs-sample | | higgs-sample |
> to track results use the .show() or .logs() methods
> 2022-12-25 11:14:47,549 [info] run executed, status=completed
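The log messages above mention downloading the archive to a local temp file and checking whether the destination already exists. A common pattern behind messages like these is "download only if missing", sketched below with the standard library. The helper `download_if_missing` is a hypothetical name, and this is an assumption about the general pattern, not the function's actual code.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_if_missing(url, dest_path):
    """Download url to dest_path unless the file is already there.

    Returns (path, downloaded), where downloaded says whether a fetch happened.
    """
    dest = Path(dest_path)
    if dest.exists():
        return dest, False  # nothing downloaded
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file first so an interrupted download never leaves
    # a partial file at the final destination.
    with tempfile.NamedTemporaryFile(dir=dest.parent, delete=False) as tmp:
        with urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
    Path(tmp.name).replace(dest)
    return dest, True
```

Skipping the download when the destination exists makes reruns cheap, which matters most for the large archives this function is aimed at.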
Show the results
arc_to_parquet_run.artifact('higgs-sample').show()
| | Unnamed: 0 | 1.000000000000000000e+00 | 8.692932128906250000e-01 | -6.350818276405334473e-01 | 2.256902605295181274e-01 | 3.274700641632080078e-01 | -6.899932026863098145e-01 | 7.542022466659545898e-01 | -2.485731393098831177e-01 | -1.092063903808593750e+00 | ... | -1.045456994324922562e-02 | -4.576716944575309753e-02 | 3.101961374282836914e+00 | 1.353760004043579102e+00 | 9.795631170272827148e-01 | 9.780761599540710449e-01 | 9.200048446655273438e-01 | 7.216574549674987793e-01 | 9.887509346008300781e-01 | 8.766783475875854492e-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1.0 | 0.907542 | 0.329147 | 0.359412 | 1.497970 | -0.313010 | 1.095531 | -0.557525 | -1.588230 | ... | -1.138930 | -0.000819 | 0.000000 | 0.302220 | 0.833048 | 0.985700 | 0.978098 | 0.779732 | 0.992356 | 0.798343 |
1 | 1 | 1.0 | 0.798835 | 1.470639 | -1.635975 | 0.453773 | 0.425629 | 1.104875 | 1.282322 | 1.381664 | ... | 1.128848 | 0.900461 | 0.000000 | 0.909753 | 1.108330 | 0.985692 | 0.951331 | 0.803252 | 0.865924 | 0.780118 |
2 | 2 | 0.0 | 1.344385 | -0.876626 | 0.935913 | 1.992050 | 0.882454 | 1.786066 | -1.646778 | -0.942383 | ... | -0.678379 | -1.360356 | 0.000000 | 0.946652 | 1.028704 | 0.998656 | 0.728281 | 0.869200 | 1.026736 | 0.957904 |
3 | 3 | 1.0 | 1.105009 | 0.321356 | 1.522401 | 0.882808 | -1.205349 | 0.681466 | -1.070464 | -0.921871 | ... | -0.373566 | 0.113041 | 0.000000 | 0.755856 | 1.361057 | 0.986610 | 0.838085 | 1.133295 | 0.872245 | 0.808487 |
4 | 4 | 0.0 | 1.595839 | -0.607811 | 0.007075 | 1.818450 | -0.111906 | 0.847550 | -0.566437 | 1.581239 | ... | -0.654227 | -1.274345 | 3.101961 | 0.823761 | 0.938191 | 0.971758 | 0.789176 | 0.430553 | 0.961357 | 0.957818 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 95 | 1.0 | 0.708794 | 0.850221 | 0.672354 | 0.948589 | -1.137755 | 1.240911 | 0.416861 | 1.581794 | ... | 1.461144 | -0.758832 | 0.000000 | 0.971662 | 0.856350 | 1.134024 | 0.949969 | 1.594826 | 1.048655 | 0.922793 |
96 | 96 | 0.0 | 1.135022 | 0.285319 | -1.109411 | 1.088544 | -0.896261 | 1.103134 | 0.126724 | 0.964220 | ... | -1.183070 | -0.956380 | 1.550981 | 0.883162 | 0.925714 | 0.986575 | 1.057785 | 0.599632 | 0.887197 | 0.970676 |
97 | 97 | 1.0 | 1.124042 | 0.354470 | 0.039812 | 1.132499 | 1.620306 | 0.955921 | 1.375404 | 0.415942 | ... | -0.175354 | 1.561916 | 0.000000 | 0.851553 | 1.251061 | 1.546395 | 0.743475 | 0.138550 | 0.717625 | 0.746045 |
98 | 98 | 1.0 | 0.341495 | -1.223359 | -1.372971 | 0.993666 | 0.691938 | 1.086187 | 0.318829 | -1.185753 | ... | 1.305406 | 0.426011 | 0.000000 | 1.429510 | 0.975100 | 0.988090 | 1.257337 | 1.353208 | 1.040413 | 0.962988 |
99 | 99 | 0.0 | 1.217926 | -0.307828 | -1.601573 | 1.532369 | -1.006824 | 0.555781 | -0.059439 | 0.819528 | ... | -1.487883 | 0.811120 | 0.000000 | 0.627298 | 0.812112 | 0.989371 | 0.704444 | 0.573487 | 0.708875 | 0.764996 |
100 rows × 30 columns