Describe
Contents
Describe#
This function analyzes the data and outputs the following artifacts per column of the data frame (based on data types):
describe csv
histogram matrix
violin chart
correlation-matrix chart
correlation-matrix csv
imbalance pie chart
imbalance-weights-vec csv
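The imbalance-weights-vec artifact holds per-class weights derived from the label distribution. As an illustrative sketch (the actual formula the function uses may differ), a common convention weights each class inversely to its frequency:

```python
from collections import Counter

def imbalance_weights(labels):
    """Per-class weights inversely proportional to class frequency.

    Illustrative sketch only; the describe function's actual
    'imbalance-weights-vec' computation may use a different formula.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    # weight_c = total / (n_classes * count_c): rare classes get weights > 1
    return {cls: total / (n_classes * cnt) for cls, cnt in counts.items()}

# The majority class gets a weight below 1, the minority class above 1:
print(imbalance_weights([0, 0, 0, 1]))
```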
analyze#
Docs#
Parameters:#
context
:mlrun.MLClientCtx
- The MLRun function execution context.

name
:str
- Key of the dataset in the database ("dataset" by default).

table
:DataItem = None
- MLRun input pointing to a pandas dataframe (csv/parquet file path).

label_column
:str = None
- Ground-truth column label.

plots_dest
:str = "plots"
- Destination folder for summary plots (relative to artifact_path).

random_state
:int = 1
- Random seed for sampling; when the table has more than 500,000 samples, 500,000 samples are drawn randomly.

dask_key
:str = "datasets"
- Key of the dataframe in the dask client's "datasets" attribute.

dask_function
:str = None
- Dask function URL (db://..).

dask_client
:str = None
- Dask client object.
DEMO#
Set-up#
import pandas as pd
import mlrun
import os
from sklearn.datasets import make_classification
# Set our project's name:
project_name = "new-describe-project"
# Create the project:
project = mlrun.get_or_create_project(name=project_name, context="./", user_project=True)
> 2022-04-26 07:25:24,033 [info] loaded project new-describe-project from MLRun DB
Loading random dataset#
We will use make_classification to generate a random dataset:
n_features=5
X, y = make_classification(n_samples=100, n_features=n_features, n_classes=3, random_state = 18,
class_sep=2, n_informative=3)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(n_features)])
df['label'] = y
try:
    os.mkdir('artifacts')
except FileExistsError:
    pass
df.to_parquet("artifacts/random_dataset.parquet")
Import the describe MLRun function, which provides the analyze handler:
describe_func = mlrun.import_function("hub://describe")
describe_func.apply(mlrun.platforms.auto_mount())
<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f0402b15510>
Run the function on a new dataset#
Run the describe function.
After the run completes, you can view the created artifacts by clicking the run UID and going to the artifacts tab.
describe_run = describe_func.run(
name="task-describe",
handler='analyze',
inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
params={"label_column": "label"},
local=True
)
> 2022-04-26 07:25:24,124 [info] starting run task-describe uid=4290cd324f784a60b226461f22750fe1 DB=http://mlrun-api:8080
> 2022-04-26 07:25:30,557 [info] The data set is logged to the project under dataset name
| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| new-describe-project-davids |  | 0 | Apr 26 07:25:24 | completed | task-describe | v3io_user=davids kind= owner=davids host=jupyter-davids-5d6fdc4597-4tpss | table | label_column=label |  | describe-csv histograms-matrix histograms violin imbalance imbalance-weights-vec correlation-matrix-csv correlation dataset |
> 2022-04-26 07:25:30,768 [info] run executed, status=completed
describe_run.artifact('imbalance').show()
describe_run.artifact('scatter-2d').show()
Run the function on an already-loaded dataset#
Log a new dataset to the project:
context = mlrun.get_or_create_ctx(project_name)
df = pd.read_parquet(os.path.abspath("artifacts/random_dataset.parquet"))
context.log_dataset(key="dataset", db_key="dataset1", stats=True, df=df)
<mlrun.artifacts.dataset.DatasetArtifact at 0x7f0402a96fd0>
describe_run = describe_func.run(
name="task-describe",
handler='analyze',
inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
params={"name": "dataset1", "label_column": "label"},
local=True
)
> 2022-04-26 07:25:31,096 [info] starting run task-describe uid=0789bdac0aa54605bc4cc298060affa6 DB=http://mlrun-api:8080
> 2022-04-26 07:25:33,154 [info] The data set is logged to the project under dataset1 name
| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| new-describe-project-davids |  | 0 | Apr 26 07:25:31 | completed | task-describe | v3io_user=davids kind= owner=davids host=jupyter-davids-5d6fdc4597-4tpss | table | name=dataset1 label_column=label |  | describe-csv histograms-matrix histograms violin imbalance imbalance-weights-vec correlation-matrix-csv correlation dataset |
> 2022-04-26 07:25:33,340 [info] run executed, status=completed
describe_run.artifact('correlation').show()
describe_run.artifact('histograms').show()
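The correlation artifact visualizes pairwise Pearson correlation coefficients between columns (the describe function itself computes these via pandas). For reference, a pure-Python sketch of the underlying statistic:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Sketch of the statistic behind the correlation-matrix artifact.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly correlated, ~1.0
```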
Run the function with dask#
Create a dask test cluster (dask function):
dask_cluster = mlrun.new_function('dask_tests', kind='dask', image='mlrun/ml-models')
dask_cluster.apply(mlrun.mount_v3io())
dask_cluster.spec.remote = True
dask_cluster.with_requests(mem='2G')
dask_cluster_name = dask_cluster.save()
Run the describe function. After the run completes, you can view the created artifacts by clicking the run UID and going to the artifacts tab.
describe_run = describe_func.run(
name="task-describe",
handler='analyze',
inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
params={"name": "dataset1", "label_column": "label", "dask_function": dask_cluster_name},
local=True
)
> 2022-04-26 07:25:37,924 [info] starting run task-describe uid=9ffc9f61b9c745248a39301e0d9c8a8a DB=http://mlrun-api:8080
> 2022-04-26 07:25:55,516 [info] to get a dashboard link, use NodePort service_type
> 2022-04-26 07:25:55,517 [info] trying dask client at: tcp://mlrun-dask-tests-e2bed324-4.default-tenant:8786
> 2022-04-26 07:25:55,572 [info] using remote dask scheduler (mlrun-dask-tests-e2bed324-4) at: tcp://mlrun-dask-tests-e2bed324-4.default-tenant:8786
/User/.pythonlibs/jupyter-davids/lib/python3.7/site-packages/distributed/client.py:1131: VersionMismatchWarning:
Mismatched versions found
+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| blosc | 1.7.0 | 1.10.6 | None |
| lz4 | 3.1.0 | 3.1.10 | None |
| msgpack | 1.0.0 | 1.0.3 | None |
| toolz | 0.11.1 | 0.11.2 | None |
| tornado | 6.0.4 | 6.1 | None |
+---------+--------+-----------+---------+
Notes:
- msgpack: Variation is ok, as long as everything is above 0.6
> 2022-04-26 07:26:00,340 [info] The data set is logged to the project under dataset1 name
| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| new-describe-project-davids |  | 0 | Apr 26 07:25:38 | completed | task-describe | v3io_user=davids kind= owner=davids host=jupyter-davids-5d6fdc4597-4tpss | table | name=dataset1 label_column=label dask_function=db://new-describe-project-davids/dask_tests |  | describe-csv histograms-matrix histograms violin imbalance imbalance-weights-vec correlation-matrix-csv correlation dataset |
> 2022-04-26 07:26:00,657 [info] run executed, status=completed
describe_run.artifact('violin').show()