Describe#
This function analyzes the data and outputs the following artifacts per column of the dataframe (based on data types):

- describe csv
- histogram matrix
- violin chart
- correlation-matrix chart
- correlation-matrix csv
- imbalance pie chart
- imbalance-weights-vec csv
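As a quick sketch of how these artifacts are consumed (assuming `describe_run` is a completed run object, as produced in the demo below), each one is retrieved from the run by its key:

# Hedged sketch: `describe_run` is assumed to be a completed describe run.
# The artifact keys used here are the ones shown later in this demo.
describe_run.artifact('histograms').show()   # histogram matrix
describe_run.artifact('violin').show()       # violin chart
describe_run.artifact('correlation').show()  # correlation-matrix chart
describe_run.artifact('imbalance').show()    # imbalance pie chart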
analyze#
Docs#
Parameters:#
- context: mlrun.MLClientCtx - The MLRun function execution context.
- name: str - Key of the dataset in the database ("dataset" by default).
- table: DataItem = None - MLRun input pointing to a pandas dataframe (csv/parquet file path).
- label_column: str = None - Ground-truth column label.
- plots_dest: str = "plots" - Destination folder for the summary plots (relative to artifact_path).
- random_state: int = 1 - Seed for the random sampling applied when the table has more than 500,000 samples (500,000 rows are sampled).
- dask_key: str = "datasets" - Key of the dataframe in the dask client's "datasets" attribute.
- dask_function: str = None - Dask function URL (db://..).
- dask_client: str = None - Dask client object.
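A minimal sketch of how these parameters map onto a run call (the dataset path here is hypothetical; the demo below walks through a real run):

# Hedged sketch, not part of the original demo: import the hub function
# and run the analyze handler on a parquet file.
import mlrun

describe_func = mlrun.import_function("hub://describe")
describe_run = describe_func.run(
    handler="analyze",
    inputs={"table": "path/to/data.parquet"},  # hypothetical path
    params={"label_column": "label",           # ground-truth column
            "plots_dest": "plots"},            # plots folder under artifact_path
    local=True,
)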
DEMO#
Set-up#
import pandas as pd
import mlrun
import os
from sklearn.datasets import make_classification
# Set our project's name:
project_name = "new-describe-project"
# Create the project:
project = mlrun.get_or_create_project(name=project_name, context="./", user_project=True)
Loading random dataset#
We will use make_classification to generate a random dataset:
n_features = 5

# Generate a random 3-class dataset with 5 features.
X, y = make_classification(n_samples=100, n_features=n_features, n_classes=3,
                           random_state=18, class_sep=2, n_informative=3)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(n_features)])
df['label'] = y

# Save the dataset to a local artifacts folder.
os.makedirs('artifacts', exist_ok=True)
df.to_parquet("artifacts/random_dataset.parquet")
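Since the function produces imbalance artifacts, a quick look at the label distribution of the generated data can be useful (a plain pandas check, not part of the original flow):

# Show how many samples fall into each class; the imbalance pie chart
# produced by describe visualizes this same distribution.
print(df['label'].value_counts())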
Import the describe MLRun function with the analyze handler:
describe_func = mlrun.import_function("hub://describe")
describe_func.apply(mlrun.platforms.auto_mount())
Run the function on a new dataset#
Run the describe function. After the run completes, you can see the created artifacts by clicking the run UID and navigating to the artifacts tab.
describe_run = describe_func.run(
name="task-describe",
handler='analyze',
inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
params={"label_column": "label"},
local=True
)
describe_run.artifact('imbalance').show()
describe_run.artifact('scatter-2d').show()
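Artifacts can also be pulled to local files instead of being displayed inline (a side note, not part of the original flow; a run artifact is a DataItem, and DataItem.local() downloads it and returns a local path):

# Download the imbalance artifact and print where it landed locally.
local_path = describe_run.artifact('imbalance').local()
print(local_path)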
Run the function on an already loaded dataset#
Log a new dataset to the project:
context = mlrun.get_or_create_ctx(project_name)
df = pd.read_parquet(os.path.abspath("artifacts/random_dataset.parquet"))
context.log_dataset(key="dataset", db_key="dataset1", stats=True, df=df)
describe_run = describe_func.run(
name="task-describe",
handler='analyze',
inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
params={"name": "dataset1", "label_column": "label"},
local=True
)
describe_run.artifact('correlation').show()
describe_run.artifact('histograms').show()
Run the function with Dask#
Create a Dask test cluster (a Dask function):
dask_cluster = mlrun.new_function('dask_tests', kind='dask', image='mlrun/ml-models')
dask_cluster.apply(mlrun.mount_v3io())
dask_cluster.spec.remote = True
dask_cluster.with_requests(mem='2G')
dask_cluster_name = dask_cluster.save()
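Optionally, you can initialize the Dask client up front to verify that the cluster comes up (an extra step, not required by the demo, since the describe run connects through the dask_function parameter; accessing the client attribute of a Dask function deploys the cluster if it is not already running):

# Optional sanity check: this deploys the Dask cluster (if needed)
# and returns a dask.distributed Client.
client = dask_cluster.client
print(client)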
Run the describe function. After the run completes, you can see the created artifacts by clicking the run UID and navigating to the artifacts tab.
describe_run = describe_func.run(
name="task-describe",
handler='analyze',
inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
params={"name": "dataset1", "label_column": "label", "dask_function": dask_cluster_name},
local=True
)
describe_run.artifact('violin').show()
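Finally, every artifact recorded by a run can be listed from the run object (runs return a RunObject whose outputs property maps artifact keys to their targets):

# List all artifact keys/targets produced by the describe run.
print(describe_run.outputs)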