Describe
Contents
Describe#
This function analyzes the data and outputs the following artifacts per column of the data frame (based on data types):
describe csv
histogram matrix
violin chart
correlation-matrix chart
correlation-matrix csv
imbalance pie chart
imbalance-weights-vec csv
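The imbalance-weights-vec artifact holds per-class weights derived from the label distribution. As an illustrative sketch (the actual formula the function uses may differ), a common convention weights each class inversely to its frequency:

```python
from collections import Counter

def imbalance_weights(labels):
    """Per-class weights inversely proportional to class frequency.

    Illustrative sketch only; the describe function's actual
    'imbalance-weights-vec' computation may use a different formula.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    # weight_c = total / (n_classes * count_c): rare classes get weights > 1
    return {cls: total / (n_classes * cnt) for cls, cnt in counts.items()}

# The majority class gets a weight below 1, the minority class above 1:
print(imbalance_weights([0, 0, 0, 1]))
```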
analyze#
Docs#
Parameters:#
context
:mlrun.MLClientCtx
- The MLRun function execution context.

name
:str
- Key of the dataset in the database ("dataset" by default).

table
:DataItem = None
- MLRun input pointing to a pandas dataframe (csv/parquet file path).

label_column
:str = None
- Ground-truth column label.

plots_dest
:str = "plots"
- Destination folder for summary plots (relative to artifact_path).

random_state
:int = 1
- Random seed for sampling; when the table has more than 500,000 samples, 500,000 samples are drawn randomly.

dask_key
:str = "datasets"
- Key of the dataframe in the dask client's "datasets" attribute.

dask_function
:str = None
- Dask function URL (db://..).

dask_client
:str = None
- Dask client object.
DEMO#
Set-up#
import pandas as pd
import mlrun
import os
from sklearn.datasets import make_classification
# Set our project's name:
project_name = "new-describe-project"
# Create the project:
project = mlrun.get_or_create_project(name=project_name, context="./", user_project=True)
> 2022-04-26 07:25:24,033 [info] loaded project new-describe-project from MLRun DB
Loading random dataset#
We will use make_classification to generate a random dataset:
n_features=5
X, y = make_classification(n_samples=100, n_features=n_features, n_classes=3, random_state = 18,
class_sep=2, n_informative=3)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(n_features)])
df['label'] = y
try:
    os.mkdir('artifacts')
except FileExistsError:
    pass
df.to_parquet("artifacts/random_dataset.parquet")
Import the describe MLRun function, which provides the analyze handler:
describe_func = mlrun.import_function("hub://describe")
describe_func.apply(mlrun.platforms.auto_mount())
<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f0402b15510>
Run the function on a new dataset#
Run the describe function.
After the run completes, you can view the created artifacts by clicking the run UID and going to the artifacts tab.
describe_run = describe_func.run(
name="task-describe",
handler='analyze',
inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
params={"label_column": "label"},
local=True
)
> 2022-04-26 07:25:24,124 [info] starting run task-describe uid=4290cd324f784a60b226461f22750fe1 DB=http://mlrun-api:8080
> 2022-04-26 07:25:30,557 [info] The data set is logged to the project under dataset name
| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| new-describe-project-davids |  | 0 | Apr 26 07:25:24 | completed | task-describe | v3io_user=davids kind= owner=davids host=jupyter-davids-5d6fdc4597-4tpss | table | label_column=label |  | describe-csv histograms-matrix histograms violin imbalance imbalance-weights-vec correlation-matrix-csv correlation dataset |
> 2022-04-26 07:25:30,768 [info] run executed, status=completed
describe_run.artifact('imbalance').show()
describe_run.artifact('scatter-2d').show()
Run the function on an already-loaded dataset#
Log a new dataset to the project:
context = mlrun.get_or_create_ctx(project_name)
df = pd.read_parquet(os.path.abspath("artifacts/random_dataset.parquet"))
context.log_dataset(key="dataset", db_key="dataset1", stats=True, df=df)
<mlrun.artifacts.dataset.DatasetArtifact at 0x7f0402a96fd0>
describe_run = describe_func.run(
name="task-describe",
handler='analyze',
inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
params={"name": "dataset1", "label_column": "label"},
local=True
)
> 2022-04-26 07:25:31,096 [info] starting run task-describe uid=0789bdac0aa54605bc4cc298060affa6 DB=http://mlrun-api:8080
> 2022-04-26 07:25:33,154 [info] The data set is logged to the project under dataset1 name
| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| new-describe-project-davids |  | 0 | Apr 26 07:25:31 | completed | task-describe | v3io_user=davids kind= owner=davids host=jupyter-davids-5d6fdc4597-4tpss | table | name=dataset1 label_column=label |  | describe-csv histograms-matrix histograms violin imbalance imbalance-weights-vec correlation-matrix-csv correlation dataset |
> 2022-04-26 07:25:33,340 [info] run executed, status=completed
describe_run.artifact('correlation').show()
describe_run.artifact('histograms').show()
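The correlation artifact visualizes pairwise Pearson correlation coefficients between columns (the describe function itself computes these via pandas). For reference, a pure-Python sketch of the underlying statistic:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Sketch of the statistic behind the correlation-matrix artifact.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly correlated, ~1.0
```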
Run the function with dask#
Create a dask test cluster (dask function):
dask_cluster = mlrun.new_function('dask_tests', kind='dask', image='mlrun/ml-models')
dask_cluster.apply(mlrun.mount_v3io())
dask_cluster.spec.remote = True
dask_cluster.with_requests(mem='2G')
dask_cluster_name = dask_cluster.save()
Run the describe function. After the run completes, you can view the created artifacts by clicking the run UID and going to the artifacts tab.
describe_run = describe_func.run(
name="task-describe",
handler='analyze',
inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
params={"name": "dataset1", "label_column": "label", "dask_function": dask_cluster_name},
local=True
)
> 2022-04-26 07:25:37,924 [info] starting run task-describe uid=9ffc9f61b9c745248a39301e0d9c8a8a DB=http://mlrun-api:8080
> 2022-04-26 07:25:55,516 [info] to get a dashboard link, use NodePort service_type
> 2022-04-26 07:25:55,517 [info] trying dask client at: tcp://mlrun-dask-tests-e2bed324-4.default-tenant:8786
> 2022-04-26 07:25:55,572 [info] using remote dask scheduler (mlrun-dask-tests-e2bed324-4) at: tcp://mlrun-dask-tests-e2bed324-4.default-tenant:8786
/User/.pythonlibs/jupyter-davids/lib/python3.7/site-packages/distributed/client.py:1131: VersionMismatchWarning:
Mismatched versions found
+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| blosc | 1.7.0 | 1.10.6 | None |
| lz4 | 3.1.0 | 3.1.10 | None |
| msgpack | 1.0.0 | 1.0.3 | None |
| toolz | 0.11.1 | 0.11.2 | None |
| tornado | 6.0.4 | 6.1 | None |
+---------+--------+-----------+---------+
Notes:
- msgpack: Variation is ok, as long as everything is above 0.6
> 2022-04-26 07:26:00,340 [info] The data set is logged to the project under dataset1 name
| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| new-describe-project-davids |  | 0 | Apr 26 07:25:38 | completed | task-describe | v3io_user=davids kind= owner=davids host=jupyter-davids-5d6fdc4597-4tpss | table | name=dataset1 label_column=label dask_function=db://new-describe-project-davids/dask_tests |  | describe-csv histograms-matrix histograms violin imbalance imbalance-weights-vec correlation-matrix-csv correlation dataset |
> 2022-04-26 07:26:00,657 [info] run executed, status=completed
describe_run.artifact('violin').show()