Histogram Data Drift Demo#

The Histogram Data Drift monitoring app is MLRun’s default data drift application for model monitoring. It’s considered a built-in app within the model monitoring flow and is deployed by default when model monitoring is enabled for a project. For more information, see the MLRun documentation.
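
As an aside, when model monitoring is enabled on a project the app is deployed automatically. This is not required for this demo, which calls the app directly with evaluate(), but a minimal sketch of that flow, assuming the enable_model_monitoring project method available in recent MLRun versions, looks like this (the project name here is hypothetical):

import mlrun

# Hypothetical project used only for illustration; enabling model monitoring deploys
# the built-in monitoring apps, including the histogram data drift application.
monitored_project = mlrun.get_or_create_project("my-monitored-project", "./my-monitored-project")
monitored_project.enable_model_monitoring()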

This notebook walks through a simple example of using this app from the hub to monitor data drift between a baseline dataset and a new dataset, using the evaluate() method.

Set up a project and prepare the data#

import mlrun
project = mlrun.get_or_create_project("histogram-data-drift-demo", "./histogram-data-drift-demo")
sample_data = mlrun.get_sample_path("data/batch-predict/training_set.parquet")
reference_data = mlrun.get_sample_path("data/batch-predict/prediction_set.parquet")
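
Optionally, you can preview the baseline dataset to see which feature columns it contains. This is an extra sanity check and is not required for the rest of the demo:

# Optional: load the baseline parquet file into pandas and inspect the first rows
mlrun.get_dataitem(sample_data).as_df().head()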

Get the module from the hub and call evaluate()#

hub_mod = mlrun.get_hub_module("hub://histogram_data_drift", download_files=True)
mod = hub_mod.module()
hist_app = mod.HistogramDataDriftApplication

Since the histogram data drift application doesn’t generate artifacts by default, we need to pass class arguments to the evaluate() method to instruct it to produce artifacts during the run.

Note that passing class arguments is supported only in MLRun version 1.10.1 and higher. Alternatively, you can modify the class defaults directly in the downloaded source code file.

run_result = hist_app.evaluate(
    func_path=hub_mod.get_module_file_path(),
    sample_data=sample_data,
    reference_data=reference_data,
    run_local=True,
    class_arguments={
        "produce_json_artifact": True,
        "produce_plotly_artifact": True,
    },
)

Note that the run is linked to the current (active) project.
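
You can verify this by checking the run's metadata:

# The run is logged under the active project
print(run_result.metadata.project)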

Examine the results#

First, we’ll print the averaged metrics along with the general drift result:

# The first three entries are the mean drift metrics; the fourth is the general drift result
for i in range(3):
    metric = run_result.status.results["return"][i]
    print(metric["metric_name"], ": ", metric["metric_value"])
result = run_result.status.results["return"][3]
print(result["result_name"], ": ", result["result_value"])
hellinger_mean :  0.34211088243167637
kld_mean :  2.2839485090490426
tvd_mean :  0.30536
general_drift :  0.3237354412158382

We can also examine these metrics per feature, along with additional metrics, using the artifacts the app generated for us.

The rightmost column indicates whether the feature has drifted. The drift decision rule is based on the per-feature mean of the Total Variance Distance (TVD) and Hellinger distance scores. In the histogram data drift application, the “Drift detected” threshold is 0.7 and the “Drift suspected” threshold is 0.5.
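
As an illustrative sketch (not the application’s actual source code), the decision rule described above can be expressed as:

# Illustration of the drift decision rule: the drift value is the mean of the
# per-feature TVD and Hellinger scores, compared against the app's thresholds.
def drift_status(tvd: float, hellinger: float) -> str:
    drift_value = (tvd + hellinger) / 2
    if drift_value >= 0.7:
        return "Drift detected"
    if drift_value >= 0.5:
        return "Drift suspected"
    return "No drift"

Consistent with this, the general_drift value printed above is the average of hellinger_mean and tvd_mean: (0.342 + 0.305) / 2 ≈ 0.324.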

# The artifact is logged with the run's name
artifact_key = f"{run_result.metadata.name}_drift_table_plot"
artifact = project.get_artifact(artifact_key)
artifact.to_dataitem().show()
print("Drift value per feature:")
artifact_key = f"{run_result.metadata.name}_features_drift_results"
artifact = project.get_artifact(artifact_key)
artifact.to_dataitem().show()
Drift value per feature:
<IPython.core.display.JSON object>
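
Since the displayed JSON object is not easy to read in a static notebook, you can also load it into a Python dictionary, assuming the artifact body is plain JSON:

import json

# Load the per-feature drift values into a dict for programmatic use
features_drift = json.loads(artifact.to_dataitem().get())
print(features_drift)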