Describe#

This function analyzes the data and outputs the following artifacts for each column in the dataframe (based on data types); a short retrieval sketch follows the list:

  • describe csv
  • histogram matrix
  • violin chart
  • correlation-matrix chart
  • correlation-matrix csv
  • imbalance pie chart
  • imbalance-weights-vec csv
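
Once a run of the function completes, these artifacts are available on the run object. A minimal sketch, assuming a finished run stored in describe_run (as created in the demo below):

# Keys of all artifacts produced by the run
print(describe_run.outputs)

# Render one of the plot artifacts (e.g. the imbalance pie chart) inline
describe_run.artifact("imbalance").show()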

analyze#

Docs#

Parameters:#

  • context: mlrun.MLClientCtx - The MLRun function execution context

  • name: str - Key under which the dataset is stored in the database ("dataset" by default).

  • table: DataItem = None - MLRun input pointing to a pandas dataframe (CSV/Parquet file path).

  • label_column: str = None - Ground truth column label

  • plots_dest: str = "plots" - Destination folder of summary plots (relative to artifact_path)

  • random_state: int = 1 - Random seed; when the table has more than 500,000 samples, 500,000 samples are drawn at random using this seed.

  • dask_key: str = "datasets" - Key of the dataframe in the dask client's "datasets" attribute.

  • dask_function: str = None - Dask function URL (db://..).

  • dask_client: str = None - Dask client object.
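
For reference, here is a hedged sketch of a fully parameterized call to the analyze handler; the file path and parameter values are illustrative assumptions, not defaults read from the function itself:

describe_func = mlrun.import_function("hub://describe")

describe_run = describe_func.run(
    name="task-describe",
    handler="analyze",
    # any CSV/Parquet path or logged dataset can be used here
    inputs={"table": "artifacts/random_dataset.parquet"},
    params={
        "name": "dataset",          # key under which the dataset is logged
        "label_column": "label",    # ground-truth column
        "plots_dest": "plots",      # plots folder, relative to the artifact path
        "random_state": 1,          # seed used when sampling tables above 500,000 rows
    },
    local=True,
)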

DEMO#

Set-up#

import pandas as pd
import mlrun
import os
from sklearn.datasets import make_classification
# Set our project's name:
project_name = "new-describe-project"

# Create the project:
project = mlrun.get_or_create_project(name=project_name, context="./", user_project=True)
> 2022-04-26 07:25:24,033 [info] loaded project new-describe-project from MLRun DB

Loading a random dataset#

We will use make_classification to generate a random dataset:

n_features = 5
X, y = make_classification(n_samples=100, n_features=n_features, n_classes=3, random_state=18,
                           class_sep=2, n_informative=3)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(n_features)])
df['label'] = y
os.makedirs('artifacts', exist_ok=True)
df.to_parquet("artifacts/random_dataset.parquet")
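
Optionally, a quick sanity check on the generated data (not part of the original flow) to confirm the feature columns and the class distribution of the label:

print(df.head())
print(df["label"].value_counts())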

Import the describe MLRun function, which exposes the analyze handler:

describe_func = mlrun.import_function("hub://describe")
describe_func.apply(mlrun.platforms.auto_mount())
<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f0402b15510>

Run the function on a new dataset#

Run the describe function.

After the run completes, you can see the created artifacts by clicking the run UID in the UI and going to the artifacts tab.

describe_run = describe_func.run(
            name="task-describe",
            handler='analyze',
            inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
            params={"label_column": "label"},
            local=True
        )
> 2022-04-26 07:25:24,124 [info] starting run task-describe uid=4290cd324f784a60b226461f22750fe1 DB=http://mlrun-api:8080
> 2022-04-26 07:25:30,557 [info] The data set is logged to the project under dataset name
Run summary:
project: new-describe-project-davids
uid: 4290cd324f784a60b226461f22750fe1
iter: 0
start: Apr 26 07:25:24
state: completed
name: task-describe
labels: v3io_user=davids, kind=, owner=davids, host=jupyter-davids-5d6fdc4597-4tpss
inputs: table
parameters: label_column=label
artifacts: describe-csv, histograms-matrix, histograms, violin, imbalance, imbalance-weights-vec, correlation-matrix-csv, correlation, dataset

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-04-26 07:25:30,768 [info] run executed, status=completed
describe_run.artifact('imbalance').show()
describe_run.artifact('histograms-matrix').show()
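
The tabular artifacts can also be loaded back as dataframes. A minimal sketch, assuming the run above and the standard DataItem.as_df() accessor:

# Per-column summary statistics as a pandas dataframe
summary_df = describe_run.artifact("describe-csv").as_df()

# Class-imbalance weights vector, loaded the same way
weights_df = describe_run.artifact("imbalance-weights-vec").as_df()
print(summary_df.head())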

Run the function on an already loaded dataset#

Log a new dataset to the project:

context = mlrun.get_or_create_ctx(project_name)
df = pd.read_parquet(os.path.abspath("artifacts/random_dataset.parquet"))
context.log_dataset(key="dataset", db_key="dataset1", stats=True, df=df)
<mlrun.artifacts.dataset.DatasetArtifact at 0x7f0402a96fd0>
describe_run = describe_func.run(
            name="task-describe",
            handler='analyze',
            inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
            params={"name": "dataset1", "label_column": "label"},
            local=True
        )
> 2022-04-26 07:25:31,096 [info] starting run task-describe uid=0789bdac0aa54605bc4cc298060affa6 DB=http://mlrun-api:8080
> 2022-04-26 07:25:33,154 [info] The data set is logged to the project under dataset1 name
Run summary:
project: new-describe-project-davids
uid: 0789bdac0aa54605bc4cc298060affa6
iter: 0
start: Apr 26 07:25:31
state: completed
name: task-describe
labels: v3io_user=davids, kind=, owner=davids, host=jupyter-davids-5d6fdc4597-4tpss
inputs: table
parameters: name=dataset1, label_column=label
artifacts: describe-csv, histograms-matrix, histograms, violin, imbalance, imbalance-weights-vec, correlation-matrix-csv, correlation, dataset

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-04-26 07:25:33,340 [info] run executed, status=completed
describe_run.artifact('correlation').show()
describe_run.artifact('histograms').show()

Run the function with Dask#

Create a Dask test cluster (a Dask MLRun function):

dask_cluster = mlrun.new_function('dask_tests', kind='dask', image='mlrun/ml-models')
dask_cluster.apply(mlrun.mount_v3io())
dask_cluster.spec.remote = True
dask_cluster.with_requests(mem='2G')
dask_cluster_name = dask_cluster.save()
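
save() returns the function's database URI; this is the value passed below as the dask_function parameter (it also appears in the run summary). A quick check, assuming the cluster was saved as above:

print(dask_cluster_name)
# e.g. db://new-describe-project-davids/dask_tests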

Run the describe function. After the run completes, you can see the created artifacts by clicking the run UID in the UI and going to the artifacts tab.

describe_run = describe_func.run(
            name="task-describe",
            handler='analyze',
            inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
            params={"name": "dataset1", "label_column": "label", "dask_function": dask_cluster_name},
            local=True
        )
> 2022-04-26 07:25:37,924 [info] starting run task-describe uid=9ffc9f61b9c745248a39301e0d9c8a8a DB=http://mlrun-api:8080
> 2022-04-26 07:25:55,516 [info] to get a dashboard link, use NodePort service_type
> 2022-04-26 07:25:55,517 [info] trying dask client at: tcp://mlrun-dask-tests-e2bed324-4.default-tenant:8786
> 2022-04-26 07:25:55,572 [info] using remote dask scheduler (mlrun-dask-tests-e2bed324-4) at: tcp://mlrun-dask-tests-e2bed324-4.default-tenant:8786
/User/.pythonlibs/jupyter-davids/lib/python3.7/site-packages/distributed/client.py:1131: VersionMismatchWarning:

Mismatched versions found

+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| blosc   | 1.7.0  | 1.10.6    | None    |
| lz4     | 3.1.0  | 3.1.10    | None    |
| msgpack | 1.0.0  | 1.0.3     | None    |
| toolz   | 0.11.1 | 0.11.2    | None    |
| tornado | 6.0.4  | 6.1       | None    |
+---------+--------+-----------+---------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6
> 2022-04-26 07:26:00,340 [info] The data set is logged to the project under dataset1 name
Run summary:
project: new-describe-project-davids
uid: 9ffc9f61b9c745248a39301e0d9c8a8a
iter: 0
start: Apr 26 07:25:38
state: completed
name: task-describe
labels: v3io_user=davids, kind=, owner=davids, host=jupyter-davids-5d6fdc4597-4tpss
inputs: table
parameters: name=dataset1, label_column=label, dask_function=db://new-describe-project-davids/dask_tests
artifacts: describe-csv, histograms-matrix, histograms, violin, imbalance, imbalance-weights-vec, correlation-matrix-csv, correlation, dataset

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-04-26 07:26:00,657 [info] run executed, status=completed
describe_run.artifact('violin').show()