Describe#

This function analyzes the data and outputs the following artifacts per column in the data frame (based on data types):

  • describe csv

  • histogram matrix

  • violin chart

  • correlation-matrix chart

  • correlation-matrix csv

  • imbalance pie chart

  • imbalance-weights-vec csv
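
After a run completes, these artifacts are attached to the run object. As a rough sketch (assuming a finished run object named describe_run, as produced in the demo below), the generated artifact keys can be listed like this:

# List the artifact keys and paths produced by a completed describe run
# (`describe_run` is the RunObject returned by describe_func.run(...))
for key, path in describe_run.outputs.items():
    print(key, "->", path)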

analyze#

Docs#

Parameters:#

  • context: mlrun.MLClientCtx - The MLRun function execution context

  • name: str - Key of the dataset in the database (“dataset” by default).

  • table: DataItem = None - MLRun input pointing to a pandas dataframe (CSV/Parquet file path)

  • label_column: str = None - Ground truth column label

  • plots_dest: str = "plots" - Destination folder of summary plots (relative to artifact_path)

  • random_state: int = 1 - Random seed for sampling; when the table has more than 500,000 rows, 500,000 rows are sampled randomly.

  • dask_key: str = "datasets" - Key of the dataframe in the Dask client “datasets” attribute.

  • dask_function: str = None - URL of the Dask function (db://...).

  • dask_client: str = None - Dask client object.

DEMO#

Set-up#

import pandas as pd
import mlrun
import os
from sklearn.datasets import make_classification
# Set our project's name:
project_name = "new-describe-project"

# Create the project:
project = mlrun.get_or_create_project(name=project_name, context="./", user_project=True)

Loading random dataset#

We will use make_classification to generate a random dataset.

# Generate a random classification dataset and save it as a Parquet file
n_features = 5
X, y = make_classification(n_samples=100, n_features=n_features, n_classes=3, random_state=18,
                           class_sep=2, n_informative=3)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(n_features)])
df['label'] = y
os.makedirs('artifacts', exist_ok=True)
df.to_parquet("artifacts/random_dataset.parquet")
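
Optionally, take a quick look at the generated data and its class balance before running describe; this is just a sanity check and not required:

# Peek at the generated features and the label distribution
# (the label distribution is what the imbalance artifacts will visualize)
print(df.head())
print(df['label'].value_counts())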

Import the describe MLRun function from the hub (we will call its analyze handler)

describe_func = mlrun.import_function("hub://describe")
describe_func.apply(mlrun.platforms.auto_mount())
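
As an optional quick check, you can print the imported function's documentation, including its handlers and parameters:

# Print the hub function's documentation (handlers, parameters)
describe_func.doc()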

Run the function on a new dataset#

Run the describe function.

After the run completes, you can view the created artifacts by clicking the run UID and navigating to the artifacts tab.

describe_run = describe_func.run(
    name="task-describe",
    handler='analyze',
    inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
    params={"label_column": "label"},
    local=True
)
describe_run.artifact('imbalance').show()
describe_run.artifact('scatter-2d').show()
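
Tabular artifacts can also be loaded back into pandas with as_df(); the artifact key used below ('describe-csv') is an assumption, so check describe_run.outputs for the exact key on your run:

# Load the describe summary table back as a DataFrame
# ('describe-csv' is an assumed key - verify it via describe_run.outputs)
summary_df = describe_run.artifact('describe-csv').as_df()
summary_df.head()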

Run the function on an already loaded dataset#

Log a new dataset to the project

context = mlrun.get_or_create_ctx(project_name)
df = pd.read_parquet(os.path.abspath("artifacts/random_dataset.parquet"))
context.log_dataset(key="dataset", db_key="dataset1", stats=True, df=df)
describe_run = describe_func.run(
    name="task-describe",
    handler='analyze',
    inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
    params={"name": "dataset1", "label_column": "label"},
    local=True
)
describe_run.artifact('correlation').show()
describe_run.artifact('histograms').show()

Run the function with Dask#

Create a Dask test cluster (Dask function)

# Create a remote Dask cluster runtime for the describe function to use
dask_cluster = mlrun.new_function('dask_tests', kind='dask', image='mlrun/ml-models')
dask_cluster.apply(mlrun.mount_v3io())   # mount the v3io data fabric
dask_cluster.spec.remote = True          # deploy the cluster remotely rather than in-process
dask_cluster.with_requests(mem='2G')     # request 2G of memory for the cluster pods
dask_cluster_name = dask_cluster.save()  # save the function and get its URI (db://...)
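
Optionally, you can initialize the cluster and get a handle to its Dask client before running the function; a sketch, assuming the remote environment configured above:

# Accessing the client property deploys the Dask cluster (this may take a minute)
# and returns a distributed.Client connected to it
client = dask_cluster.client
print(client)  # shows the scheduler address and dashboard link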

Run the describe function. After the run completes, you can view the created artifacts by clicking the run UID and navigating to the artifacts tab.

describe_run = describe_func.run(
    name="task-describe",
    handler='analyze',
    inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
    params={"name": "dataset1", "label_column": "label", "dask_function": dask_cluster_name},
    local=True
)
describe_run.artifact('violin').show()