Histogram Data Drift Demo#

The Histogram Data Drift monitoring app is MLRun’s default data drift application for model monitoring. It’s considered a built-in app within the model monitoring flow and is deployed by default when model monitoring is enabled for a project. For more information, see the MLRun documentation.
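For context, when working against a full MLRun deployment (rather than evaluating the app standalone as in this notebook), the app is rolled out as part of enabling model monitoring on the project. A minimal sketch, assuming an MLRun version that exposes enable_model_monitoring() on the project object (exact parameters may vary between versions):

import mlrun

# Hypothetical project used only for illustration
project = mlrun.get_or_create_project("my-project", "./my-project")

# Deploys the model monitoring infrastructure, including the built-in
# histogram data drift application (sketch; check your MLRun version's docs)
project.enable_model_monitoring()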

This notebook walks through a simple example of using this app from the hub to monitor data drift between a baseline dataset and a new dataset, using the evaluate() method.

Set up a project and prepare the data#

import mlrun
project = mlrun.get_or_create_project("histogram-data-drift-demo", context="./histogram-data-drift-demo")
sample_data = mlrun.get_sample_path("data/batch-predict/training_set.parquet")
reference_data = mlrun.get_sample_path("data/batch-predict/prediction_set.parquet")
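
If you'd like a quick look at the two datasets before running the app, you can load them as dataframes. A short optional sketch, assuming the sample files are reachable from your environment:

# Optional: inspect the two datasets passed to the app
sample_df = mlrun.get_dataitem(sample_data).as_df()
reference_df = mlrun.get_dataitem(reference_data).as_df()
print(sample_df.shape, reference_df.shape)
print(sample_df.columns.tolist())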

Get the module from the hub and edit its defaults#

hub_mod = mlrun.get_hub_module("hub://histogram_data_drift", download_files=True)
src_file_path = hub_mod.get_module_file_path()

Since the histogram data drift application doesn’t produce artifacts by default, we need to modify the class defaults. This can be done in one of two ways: either by editing the downloaded source file directly and then evaluating with the standard class, or - as we’ll do now - by adding an inheriting class to the same file and evaluating using that new class.

# add a declaration of an inheriting class to change the default parameters
wrapper_code = """
class HistogramDataDriftApplicationWithArtifacts(HistogramDataDriftApplication):
    # The same histogram application but with artifacts

    def __init__(self) -> None:
        super().__init__(produce_json_artifact=True, produce_plotly_artifact=True)
"""
with open(src_file_path, "a") as f:
    f.write(wrapper_code)

Now we can import it as a module, using the module() method:

app_module = hub_mod.module()
hist_app = app_module.HistogramDataDriftApplicationWithArtifacts # or the standard class if you chose to modify its code

Now we're ready to call evaluate(). Note that the run is linked to the current (active) project:

run_result = hist_app.evaluate(
    func_path=hub_mod.get_module_file_path(),
    sample_data=sample_data,
    reference_data=reference_data,
    run_local=True
)

Examine the results#

First, we'll print the averaged drift metrics and the general drift result:

# The first three entries hold the averaged drift metrics
for i in range(3):
    metric = run_result.status.results["return"][i]
    print(metric["metric_name"], ": ", metric["metric_value"])
# The fourth entry holds the general drift result
result = run_result.status.results["return"][3]
print(result["result_name"], ": ", result["result_value"])
hellinger_mean :  0.34211088243167637
kld_mean :  2.2839485090490426
tvd_mean :  0.30536
general_drift :  0.3237354412158382

And we can also examine these metrics per feature, along with other metrics, using the artifacts the app generated for us.

The rightmost column indicates whether the feature has drifted or not. The drift decision rule is based on the per-feature mean of the Total Variation Distance (TVD) and Hellinger distance scores. In the histogram-data-drift application, the “Drift detected” threshold is 0.7 and the “Drift suspected” threshold is 0.5.
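
As a rough illustration of this decision rule (an illustrative sketch, not the application's actual code), the drift value and resulting status for a single feature could be computed like this:

def drift_status(tvd: float, hellinger: float) -> str:
    # The drift value is the mean of the TVD and Hellinger distance scores
    drift_value = (tvd + hellinger) / 2
    if drift_value >= 0.7:
        return "Drift detected"
    if drift_value >= 0.5:
        return "Drift suspected"
    return "No drift"

# Using the averaged scores printed above: (0.30536 + 0.34211) / 2 ≈ 0.324
print(drift_status(tvd=0.30536, hellinger=0.34211))  # No drift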

# The artifact is logged with the run's name
artifact_key = f"{run_result.metadata.name}_drift_table_plot"
artifact = project.get_artifact(artifact_key)
artifact.to_dataitem().show()
print("Drift value per feature:")
artifact_key = f"{run_result.metadata.name}_features_drift_results"
artifact = project.get_artifact(artifact_key)
artifact.to_dataitem().show()
Drift value per feature:
<IPython.core.display.JSON object>
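
If you prefer to work with the per-feature results programmatically rather than through the notebook display, you can read the JSON artifact's body directly. A small sketch, assuming the artifact body is the JSON mapping produced by the app:

import json

features_drift_item = project.get_artifact(
    f"{run_result.metadata.name}_features_drift_results"
).to_dataitem()

# DataItem.get() returns the raw body, here a JSON mapping of
# feature name -> drift value
features_drift = json.loads(features_drift_item.get())
print(features_drift)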