 # Part 1: MLRun Basics

 Part 1 of the getting-started tutorial introduces you to the basics of working with functions by using the MLRun open-source MLOps orchestration framework.
 
 The tutorial takes you through the following steps:

 1. [Installation and Setup](#gs-tutorial-1-step-setup)
 2. [Creating a basic function and running it locally](#gs-tutorial-1-step-create-basic-function)
 3. [Running the function on the cluster](#gs-tutorial-1-step-run-func-on-cluster)
 4. [Viewing jobs on the dashboard (UI)](#gs-tutorial-1-step-ui-jobs-view)
 5. [Scheduling jobs](#gs-tutorial-1-step-schedule-jobs)

By the end of this tutorial you'll learn how to:

- Create a basic data-preparation MLRun function.
- Store data artifacts to be used and managed in a central database.
- Run your code on a distributed Kubernetes cluster without any DevOps overhead.
- Schedule jobs to run on the cluster.

<a id="gs-tutorial-1-setup-remote-env"></a>
> **Using MLRun Remotely**<br>
> This tutorial is aimed at running your project from a local Jupyter Notebook service in the same environment in which MLRun is installed and running.
> However, as a developer you might want to develop your project from a remote location using your own IDE (such as a local Jupyter Notebook or PyCharm), and connect to the MLRun environment remotely.
> To learn how to use MLRun from a remote IDE, see [Setting a Remote Environment](https://mlrun.readthedocs.io/en/release-v0.6.x-latest/howto/remote.html).

<a id="gs-tutorial-1-mlrun-intro"></a>

## Introduction to MLRun

[MLRun](https://github.com/mlrun/mlrun) is an open-source MLOps framework that offers an integrative approach to managing your machine-learning pipelines from early development through model development to full pipeline deployment in production.
MLRun offers a convenient abstraction layer to a wide variety of technology stacks while empowering data engineers and data scientists to define the features and models.

MLRun provides the following key benefits:

- **Rapid deployment** of code to production pipelines
- **Elastic scaling** of batch and real-time workloads
- **Feature management** &mdash; ingestion, preparation, and monitoring
- **Works anywhere** &mdash; your local IDE, multi-cloud, or on-prem

MLRun can be installed over Kubernetes, and it is also available as a managed service in the [Iguazio Data Science Platform](https://www.iguazio.com/).

&#x25B6; For more information about MLRun, see the MLRun [**Architecture and Vision**](https://mlrun.readthedocs.io/en/release-v0.6.x-latest/architecture.html) documentation.

<a id="gs-tutorial-1-step-setup"></a>

## Step 1: Installation and Setup

For information on how to install and configure MLRun over Kubernetes, see the MLRun [installation guide](https://mlrun.readthedocs.io/en/release-v0.6.x-latest/install.html).
To install the MLRun package, run `pip install mlrun` with the MLRun version that matches your MLRun service.

> **For Iguazio Data Science Platform Users**<br>
> If you are using the Iguazio Data Science Platform, MLRun is available as a default (pre-deployed) shared service.<br>
> You can run `!/User/align_mlrun.sh` to install the MLRun package or upgrade the version of an installed package.
> By default, the script attempts to download the latest version of the MLRun Python package that matches the version of the running MLRun service.

> **Kernel Restart**<br>
> After installing or updating the MLRun package, restart the notebook kernel in your environment!

<a id="gs-tutorial-1-mlrun-envr-init"></a>

### Initializing Your MLRun Environment

MLRun projects are used for packaging multiple runs, functions, workflows, and artifacts.
Projects are created when you run a job or save an object (such as a function or artifact) to a specific project.
For more information about MLRun projects, see the MLRun [projects documentation](https://mlrun.readthedocs.io/en/release-v0.6.x-latest/projects.html).

Use the `set_environment` MLRun method to configure the working environment and default configuration. 
This method returns a tuple with the current project name and artifacts path.

Set the method's `project` parameter to your selected project name.
You can also optionally set the `user_project` parameter to `True` to automatically append the username of the running user to the project name specified in the `project` parameter, resulting in a `<project>-<username>` project name;
this is useful for avoiding project-name conflicts among different users.
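
As a rough illustration of the naming rule, the following sketch builds a `<project>-<username>` name with the standard library. The `user_project_name` helper is hypothetical and is not part of MLRun; how MLRun resolves the username internally (for example, from environment variables) is not covered here and might differ.

```python
import getpass


def user_project_name(project, username=None):
    """Append a username to a base project name (hypothetical helper, not an MLRun API)."""
    # Assumption: fall back to the operating-system login name when no name is given
    username = username or getpass.getuser()
    return f"{project}-{username}"


print(user_project_name("getting-started-tutorial", "jovyan"))
# prints: getting-started-tutorial-jovyan
```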

You can optionally pass additional parameters to `set_environment`, as detailed in the MLRun API reference.
For example:
- You can set the `artifact_path` parameter to override the default path for storing project artifacts, as explained in the MLRun [artifacts documentation](https://mlrun.readthedocs.io/en/release-v0.6.x-latest/store/artifacts.html).
- When using a remote MLRun or Kubernetes cluster, you can set the `api_path` parameter to the URL of your remote environment, and set the `access_key` parameter to an authentication key for this environment.

Run the following code to initialize your MLRun environment to use a "getting-started-tutorial-&lt;username&gt;"
project and store the project artifacts in the default artifacts path:

In [1]:
from os import path
import mlrun

# Set the base project name
project_name_base = 'getting-started-tutorial'
# Initialize the MLRun environment and save the project name and artifacts path
project_name, artifact_path = mlrun.set_environment(project=project_name_base,
                                                    user_project=True)
                                                    
# Display the current project name and artifacts path
print(f'Project name: {project_name}')
print(f'Artifacts path: {artifact_path}')

Project name: getting-started-tutorial-jovyan
Artifacts path: /home/jovyan/data


<a id="gs-tutorial-1-step-create-basic-function"></a>

## Step 2: Creating a Basic Function

This step introduces you to MLRun functions and artifacts and walks you through the process of converting a local function to an MLRun function.

<a id="gs-tutorial-1-define-local-func"></a>

### Defining a Local Function

The following example code defines a data-preparation function (`prep_data`) that reads (ingests) a CSV file from the provided source URL into a pandas DataFrame;
prepares ("cleans") the data by changing the type of the categorical data in the specified label column;
and returns the DataFrame and its length.
In the next sub-step you'll redefine this function and convert it to an MLRun function that leverages MLRun to perform the following tasks:

- Reading the data
- Logging the data to the MLRun database

In [2]:
import pandas as pd

# Ingest a data set
def prep_data(source_url, label_column):

    df = pd.read_csv(source_url)
    df[label_column] = df[label_column].astype('category').cat.codes    
    return df, df.shape[0]
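
To make the label-encoding step concrete, the sketch below mimics what `astype('category').cat.codes` does for plain string labels without missing values: each distinct label is assigned an integer code according to its sorted position. The `category_codes` helper is a hypothetical pure-Python stand-in for illustration, not pandas code.

```python
def category_codes(labels):
    """Map each label to the index of its value in the sorted set of distinct labels."""
    categories = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(categories)}
    return [code_of[label] for label in labels]


print(category_codes(["setosa", "virginica", "versicolor", "setosa"]))
# prints: [0, 2, 1, 0]
```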

<a id="gs-tutorial-1-create-and-run-an-mlrun-function"></a>

### Creating and Running Your First MLRun Function

#### MLRun Functions

MLRun jobs and pipelines run over serverless functions.
These functions can include the function code and specification ("spec").
The spec contains metadata for configuring related operational aspects, such as the image, required packages, CPU/memory/GPU resources, storage, and the environment.
The different serverless runtime engines automatically transform the function code and spec into fully managed and elastic services that run over Kubernetes.
Functions are versioned and can be generated from code or notebooks, or loaded from a marketplace.

To work with functions you need to be familiar with the following function components:

- **Context** &mdash; a function-context object.
    The code can be set up to get parameters, secrets, and inputs from the context, as well as log run outputs, artifacts, tags, and metrics in the context.
- **Parameters** &mdash; the parameters (arguments) that are passed to the functions.
- **Inputs** &mdash; MLRun functions have a special `inputs` parameter for passing data objects (such as data sets, models, or files) as input to a function.
    Use this parameter to pass data items to a function.
    An MLRun **data item** (`DataItem`) represents either a single data item or a collection of data items (such as files, directories, and tables) for any type of data that is produced or consumed by functions or jobs.
    MLRun **artifacts** are versioned and contain metadata that describes one or more data items.

For more information see the following MLRun documentation:

- [Functions, Runs, and Context](https://mlrun.readthedocs.io/en/release-v0.6.x-latest/runtimes/functions.html)
- [Data Items](https://mlrun.readthedocs.io/en/release-v0.6.x-latest/store/datastore.html)
- [Artifacts](https://mlrun.readthedocs.io/en/release-v0.6.x-latest/store/artifacts.html)

#### MLRun Function Code

The following code demonstrates how to redefine your local data-preparation function to make it compatible with MLRun, and then convert the local notebook code into an MLRun function.

In [3]:
# mlrun: start-code

In [4]:
import mlrun
def prep_data(context, source_url: mlrun.DataItem, label_column='label'):

    # Convert the DataItem to a pandas DataFrame
    df = source_url.as_df()
    df[label_column] = df[label_column].astype('category').cat.codes    
    
    # Record the DataFrame length after the run
    context.log_result('num_rows', df.shape[0])

    # Store the data set in your artifacts database
    context.log_dataset('cleaned_data', df=df, index=False, format='csv')

In [5]:
# mlrun: end-code

The MLRun function has the following parameter changes compared to the original local function:

- To effectively run your code in MLRun, you need to add a `context` parameter to your function (or alternatively, get the context by using {py:func}`~mlrun.run.get_or_create_ctx`).
    This allows you to log and retrieve information related to the function's execution.
- The tutorial example sets the `source_url` parameter to `mlrun.DataItem` to send a data item as input when the function is called (using the `inputs` parameter).

The example tutorial function code works as follows:

- Obtain a pandas DataFrame from the `source_url` data item, by calling the `as_df` method.
- Prepare (clean) the data, as done in the local-function implementation in the previous step.
- Record the data length (number of rows) using the `log_result` method. 
    This method records (logs) the values of standard function variables (such as int, float, string, and list).
- Log the data-set artifact using the `log_dataset` method.
    This method saves and logs function data items and related metadata (i.e., logs function artifacts).
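
Because the handler only touches the context through `log_result` and `log_dataset`, the logging pattern can be tried out without an MLRun service by passing a stand-in object. The following is a minimal sketch under that assumption: `StubContext` and `count_rows` are hypothetical names, the stub merely stores what it is given in dictionaries, and it does not reproduce the behavior of a real MLRun context (no database, versioning, or file output).

```python
class StubContext:
    """Hypothetical stand-in that records what a handler logs (not an MLRun class)."""

    def __init__(self):
        self.results = {}
        self.datasets = {}

    def log_result(self, key, value):
        # Keep simple values (int, float, str, list) by key
        self.results[key] = value

    def log_dataset(self, key, df=None, **options):
        # Keep the data object and the options that were passed along with it
        self.datasets[key] = {"data": df, "options": options}


def count_rows(context, rows):
    """Simplified handler that follows the same logging pattern as prep_data."""
    context.log_result("num_rows", len(rows))
    context.log_dataset("cleaned_data", df=rows, index=False, format="csv")


ctx = StubContext()
count_rows(ctx, [{"label": 0}, {"label": 1}, {"label": 0}])
print(ctx.results)
# prints: {'num_rows': 3}
```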

#### Converting the Notebook Code to a Function

Use the `# mlrun: ...` comment annotations at the beginning of relevant code cells to identify the code that needs to be converted into an MLRun function.
These annotations provide non-intrusive hints as to how you want to convert the notebook into a full function and function specification:

- The `# mlrun: ignore` annotation identifies code that shouldn't be included in the MLRun function (such as prints, plots, tests, and debug code).
- The `# mlrun: start-code` and `# mlrun: end-code` annotations identify code to be converted to an MLRun function:
    everything before the `start-code` annotation and after the `end-code` annotation is ignored, and only code between these two annotations is converted.
    These annotations are used in the tutorial notebook instead of adding the `ignore` annotation to all cells that shouldn't be converted.
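
The selection rules above can be sketched as a small filter over a list of notebook cell sources. This is a simplified illustration only, not the conversion code that MLRun actually uses: the `select_function_code` helper is hypothetical, it looks only at the first line of each cell, and it drops the marker cells themselves.

```python
def select_function_code(cells):
    """Return the cell sources that the annotations mark for conversion."""

    def annotation(cell):
        # Only the first line of a cell is checked for an annotation
        lines = cell.strip().splitlines()
        first = lines[0].strip() if lines else ""
        return first if first.startswith("# mlrun:") else None

    # Without a start-code marker, cells are candidates from the first one
    collecting = not any(annotation(c) == "# mlrun: start-code" for c in cells)
    kept = []
    for cell in cells:
        tag = annotation(cell)
        if tag == "# mlrun: start-code":
            collecting = True
        elif tag == "# mlrun: end-code":
            collecting = False
        elif tag == "# mlrun: ignore":
            continue
        elif collecting:
            kept.append(cell)
    return kept


cells = [
    "print('setup')",
    "# mlrun: start-code",
    "def prep_data(context): pass",
    "# mlrun: end-code",
    "prep_data(None)",
]
print(select_function_code(cells))
# prints: ['def prep_data(context): pass']
```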

The following code uses the `code_to_function` MLRun method to convert your local `prep_data` function code to a `data_prep_func` MLRun function.

The `kind` parameter of the `code_to_function` method determines the engine for running the code.
MLRun allows running function code using different engines &mdash; such as Python, Spark, MPI, Nuclio, and Dask.
The following example sets the `kind` parameter to `job` to run the code as a Python process ("job").

In [6]:
# Convert the local prep_data function to an MLRun project function
data_prep_func = mlrun.code_to_function(name='prep_data', kind='job', image='mlrun/mlrun')

<a id="gs-tutorial-1-run-mlrun-function-locally"></a>

#### Running the MLRun Function Locally

Now you're ready to run your MLRun function (`data_prep_func`).
The following example uses the `run` MLRun method and sets its `local` parameter to `True` to run the function code locally within your Jupyter pod, meaning that the function uses the environment variables, volumes, and image that are running in this pod.

> **Note:** When running a function locally, the function code is saved only in a temporary local directory and not in your project's ML functions code repository.
> In the next step of this tutorial you'll run the function on a cluster, which automatically saves the function object in the project.

The execution results are stored in the MLRun database.
The tutorial example sets the following function parameters:

- `name` &mdash; the job name
- `handler` &mdash; the name of the function handler
- `inputs` &mdash; the data-set URL

As input for the function, the example uses a CSV file from a cloud object-store service named wasabisys.
> **Note:** You can also use the function to ingest data in formats other than CSV, such as Parquet, without modifying the code.

In [7]:
# Set the source-data URL
source_url = 'https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv'

In [8]:
# Run the `data_prep_func` MLRun function locally
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler=prep_data,
                                   inputs={'source_url': source_url},
                                   local=True)

> 2021-05-21 05:27:04,430 [info] starting run prep_data uid=2fc4e1404f1848708608532720f17bc8 DB=http://mlrun-api:8080


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
getting-started-tutorial-jovyan,...20f17bc8,0,May 21 05:27:04,completed,prep_data,kind= owner=jovyan host=mlrun-kit-jupyter-5848c5c9f9-mt8jd,source_url,,num_rows=150,cleaned_data


to track results use .show() or .logs() or in CLI: 
!mlrun get run 2fc4e1404f1848708608532720f17bc8 --project getting-started-tutorial-jovyan , !mlrun logs 2fc4e1404f1848708608532720f17bc8 --project getting-started-tutorial-jovyan
> 2021-05-21 05:27:05,771 [info] run executed, status=completed


<a id="gs-tutorial-1-get-run-object-info"></a>

### Getting Information About the Run Object

Every run object that's returned by the MLRun `run` method has the following methods:

- `uid` &mdash; returns the unique ID.
- `state` &mdash; returns the last known state.
- `show` &mdash; shows the latest job state and data in a visual widget (with hyperlinks and hints).
- `outputs` &mdash; returns a dictionary of the run results and artifact paths.
- `logs` &mdash; returns the latest logs.
    Use `watch=False` to disable the interactive mode in running jobs.
- `artifact` &mdash; returns full artifact details for the provided key.
- `output` &mdash; returns a specific result or an artifact path for the provided key.
- `to_dict`, `to_yaml`, `to_json` &mdash; converts the run object to a dictionary, YAML, or JSON format (respectively).

In [9]:
# example
prep_data_run.state()

'completed'

In [10]:
prep_data_run.outputs['cleaned_data']

'store://artifacts/getting-started-tutorial-jovyan/prep_data_cleaned_data:2fc4e1404f1848708608532720f17bc8'
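
The value above appears to follow a `store://artifacts/<project>/<key>:<uid>` layout. The following sketch splits such a string into its parts with plain string operations; the layout is inferred only from the output shown in this tutorial, `parse_store_uri` is a hypothetical helper rather than an MLRun function, and other URI forms that MLRun may support are not handled.

```python
def parse_store_uri(uri):
    """Split a store://artifacts/<project>/<key>:<uid> string into its parts."""
    prefix = "store://artifacts/"
    if not uri.startswith(prefix):
        raise ValueError(f"unsupported artifact URI: {uri}")
    # The first path segment is the project; the remainder is "<key>:<uid>"
    project, _, rest = uri[len(prefix):].partition("/")
    key, _, uid = rest.partition(":")
    return {"project": project, "key": key, "uid": uid}


parts = parse_store_uri(
    "store://artifacts/getting-started-tutorial-jovyan/"
    "prep_data_cleaned_data:2fc4e1404f1848708608532720f17bc8"
)
print(parts["key"])
# prints: prep_data_cleaned_data
```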

<a id="gs-tutorial-1-read-output"></a>

### Reading the Output

The data-set location is returned in the `outputs` field.
Therefore, you can get the location from `prep_data_run.outputs['cleaned_data']` and pass it to `mlrun.run.get_dataitem` to get the data set itself.

In [11]:
dataset = mlrun.run.get_dataitem(prep_data_run.outputs['cleaned_data'])

You can also get the data as a pandas DataFrame by calling the `dataset.as_df` method:

In [12]:
dataset.as_df()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


<a id="gs-tutorial-1-save-artifcats-in-run-specific-paths"></a>

### Saving the Artifacts in Run-Specific Paths

In the previous steps, each time the function was executed its artifacts were saved to the same directory, overwriting the existing artifacts in this directory.
You can also choose to save the run results (source-data file) to a different directory for each job execution.
This is done by setting the artifacts path and using the unique run-ID parameter (`{{run.uid}}`) in the path.
Now, under the artifact path you should be able to see the source-data file in a new directory whose name is derived from the unique run ID.

In [13]:
out = artifact_path 

prep_data_run = data_prep_func.run(name='prep_data',
                         handler=prep_data,
                         inputs={'source_url': source_url},
                         local=True,
                         artifact_path=path.join(out, '{{run.uid}}'))

> 2021-05-21 05:27:05,844 [info] starting run prep_data uid=47c7f2d1962e41ad843eef9636768e91 DB=http://mlrun-api:8080


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
getting-started-tutorial-jovyan,...36768e91,0,May 21 05:27:05,completed,prep_data,kind= owner=jovyan host=mlrun-kit-jupyter-5848c5c9f9-mt8jd,source_url,,num_rows=150,cleaned_data


to track results use .show() or .logs() or in CLI: 
!mlrun get run 47c7f2d1962e41ad843eef9636768e91 --project getting-started-tutorial-jovyan , !mlrun logs 47c7f2d1962e41ad843eef9636768e91 --project getting-started-tutorial-jovyan
> 2021-05-21 05:27:06,457 [info] run executed, status=completed
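
To illustrate the resulting directory layout, the sketch below expands the `{{run.uid}}` placeholder by hand, using the artifacts path and run ID from the outputs in this tutorial. MLRun performs this substitution itself when the job runs; `expand_artifact_path` is a hypothetical helper, and `posixpath` is used only so that the result does not depend on the operating system running the example.

```python
import posixpath


def expand_artifact_path(template, run_uid):
    """Replace the {{run.uid}} placeholder with a concrete run ID (illustration only)."""
    return template.replace("{{run.uid}}", run_uid)


template = posixpath.join("/home/jovyan/data", "{{run.uid}}")
print(expand_artifact_path(template, "47c7f2d1962e41ad843eef9636768e91"))
# prints: /home/jovyan/data/47c7f2d1962e41ad843eef9636768e91
```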


<a id="gs-tutorial-1-step-run-func-on-cluster"></a>

## Step 3: Running the Function on a Cluster

You can also run MLRun functions on the cluster itself, as opposed to running them locally in the Jupyter pod, as done in the previous steps.
Running a function on the cluster allows you to leverage the cluster's resources and run more resource-intensive workloads.
MLRun helps you to easily run your code without the hassle of creating configuration files and building images.
To run an MLRun function on a cluster, just change the value of the `local` flag in the call to the `run` method to `False`.

In [14]:
from mlrun.platforms import auto_mount

In [15]:
data_prep_func.apply(auto_mount())
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler='prep_data',
                                   inputs={'source_url': source_url},
                                   local=False)

> 2021-05-21 05:27:06,473 [info] starting run prep_data uid=02c536481e6843a294a74baca1b37578 DB=http://mlrun-api:8080
> 2021-05-21 05:27:06,523 [info] Job is running in the background, pod: prep-data-b9mdj
> 2021-05-21 05:27:11,165 [info] run executed, status=completed
final state: completed


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
getting-started-tutorial-jovyan,...a1b37578,0,May 21 05:27:10,completed,prep_data,kind=job owner=jovyan host=prep-data-b9mdj,source_url,,num_rows=150,cleaned_data


to track results use .show() or .logs() or in CLI: 
!mlrun get run 02c536481e6843a294a74baca1b37578 --project getting-started-tutorial-jovyan , !mlrun logs 02c536481e6843a294a74baca1b37578 --project getting-started-tutorial-jovyan
> 2021-05-21 05:27:12,649 [info] run executed, status=completed


In [16]:
print(prep_data_run.outputs)

{'num_rows': 150, 'cleaned_data': 'store://artifacts/getting-started-tutorial-jovyan/prep_data_cleaned_data:02c536481e6843a294a74baca1b37578'}


<a id="gs-tutorial-1-step-ui-jobs-view"></a>

## Step 4: Viewing Jobs on the Dashboard (UI)

On the **Projects** dashboard page, select your project and then navigate to the project's jobs and workflow page by selecting the relevant link.
For this tutorial, after running the `prep_data` function twice locally and once on the cluster, you should see three records with types local (**&lt;&gt;**) and job.
In this view you can track all jobs running in your project and view detailed job information.
Select a job name to display tabs with additional information such as an input data set, artifacts that were generated by the job, and execution results and logs. 

<img src="./images/jobs.png" alt="Jobs" width="800"/>

<a id="gs-tutorial-1-step-schedule-jobs"></a>

## Step 5: Scheduling Jobs

To schedule a job, you can set the `schedule` parameter of the `run` method.
The scheduling is done by using a crontab format.

You can also schedule jobs from the dashboard: on the jobs and monitoring project page, you can create a new job using the **New Job** wizard.
At the end of the wizard flow you can set the job scheduling.
In the following example, the job is set to run every 30 minutes.

In [17]:
data_prep_func.apply(auto_mount())
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler='prep_data',
                                   inputs={'source_url': source_url},
                                   local=False,
                                   schedule='*/30 * * * *')

> 2021-05-21 05:27:12,670 [info] starting run prep_data uid=5a8d1c01680647d996c581aec239bed4 DB=http://mlrun-api:8080
> 2021-05-21 05:27:12,715 [info] task scheduled, {'schedule': '*/30 * * * *', 'project': 'getting-started-tutorial-jovyan', 'name': 'prep_data'}
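
For reference, a crontab expression has five space-separated fields: minute, hour, day of month, month, and day of week. The sketch below names the fields of the `*/30 * * * *` expression and lists the minutes that the `*/30` step value selects. It is a simplified illustration, not MLRun's scheduler: the `split_cron` and `minute_matches` helpers are hypothetical and support only `*`, `*/N`, and single numbers.

```python
CRON_FIELDS = ("minute", "hour", "day", "month", "day_of_week")


def split_cron(expression):
    """Name the five fields of a crontab expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def minute_matches(field, minute):
    """Check a minute value against "*", "*/N", or a single number."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return minute % int(field[2:]) == 0
    return minute == int(field)


fields = split_cron("*/30 * * * *")
print([m for m in range(60) if minute_matches(fields["minute"], m)])
# prints: [0, 30]
```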


<a id="gs-tutorial-1-scheduled-jobs-list"></a>

### List Scheduled Jobs

Use the `get_run_db().list_schedules` MLRun method to list your project's scheduled jobs, and display the results.

In [18]:
print(mlrun.get_run_db().list_schedules(project_name))

schedules=[ScheduleOutput(name='prep_data', kind=<ScheduleKinds.job: 'job'>, scheduled_object={'task': {'spec': {'inputs': {'source_url': 'https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv'}, 'output_path': '/home/jovyan/data', 'function': 'getting-started-tutorial-jovyan/prep-data@ffa9ed7dc47cf880098f4a154b3e3a4e4f0863a1', 'secret_sources': [], 'scrape_metrics': True, 'handler': 'prep_data'}, 'metadata': {'uid': '5a8d1c01680647d996c581aec239bed4', 'name': 'prep_data', 'project': 'getting-started-tutorial-jovyan', 'labels': {'kind': 'job', 'owner': 'jovyan'}, 'iteration': 0}, 'status': {'state': 'created'}}, 'schedule': '*/30 * * * *'}, cron_trigger=ScheduleCronTrigger(year=None, month='*', day='*', week=None, day_of_week='*', hour='*', minute='*/30', second=None, start_date=None, end_date=None, timezone=None, jitter=None), desired_state=None, labels={'kind': 'job', 'owner': 'jovyan'}, concurrency_limit=1, creation_time=datetime.datetime(2021, 5, 21, 5, 27, 12, 710633, tzinf

<a id="gs-tutorial-1-scheduled-jobs-get"></a>

### Get Scheduled Jobs

Use the `get_run_db().get_schedule` MLRun method to get the job schedule for a scheduled job.

In [19]:
mlrun.get_run_db().get_schedule(project_name, 'prep_data')

ScheduleOutput(name='prep_data', kind=<ScheduleKinds.job: 'job'>, scheduled_object={'task': {'spec': {'inputs': {'source_url': 'https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv'}, 'output_path': '/home/jovyan/data', 'function': 'getting-started-tutorial-jovyan/prep-data@ffa9ed7dc47cf880098f4a154b3e3a4e4f0863a1', 'secret_sources': [], 'scrape_metrics': True, 'handler': 'prep_data'}, 'metadata': {'uid': '5a8d1c01680647d996c581aec239bed4', 'name': 'prep_data', 'project': 'getting-started-tutorial-jovyan', 'labels': {'kind': 'job', 'owner': 'jovyan'}, 'iteration': 0}, 'status': {'state': 'created'}}, 'schedule': '*/30 * * * *'}, cron_trigger=ScheduleCronTrigger(year=None, month='*', day='*', week=None, day_of_week='*', hour='*', minute='*/30', second=None, start_date=None, end_date=None, timezone=None, jitter=None), desired_state=None, labels={'kind': 'job', 'owner': 'jovyan'}, concurrency_limit=1, creation_time=datetime.datetime(2021, 5, 21, 5, 27, 12, 710633, tzinfo=datetime.

<a id="gs-tutorial-1-scheduled-jobs-ui-view"></a>

### View Scheduled Jobs on the Dashboard (UI)

You can also see your scheduled jobs on your project's **Jobs | Schedule** dashboard page. 

<img src="./images/func-schedule.jpg" alt="scheduled-jobs" width="1400"/>

<a id="gs-tutorial-1-scheduled-jobs-delete"></a>

### Deleting Scheduled Jobs

When you no longer need to run the scheduled jobs, remove them by using the `get_run_db().delete_schedule` MLRun method to delete the job-schedule objects that you created.

In [20]:
mlrun.get_run_db().delete_schedule(project_name, 'prep_data')

You can verify that a scheduled job has been deleted by calling `get_schedule` to get the job schedule.
If the delete operation was successful, this call should fail.

In [21]:
#mlrun.get_run_db().get_schedule(project_name,'prep_data')
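
One way to script this check is to wrap the call in a `try`/`except` block. The sketch below is written against any object that exposes a `get_schedule(project, name)` method, such as the one returned by `mlrun.get_run_db()`; the `schedule_exists` helper and the `StubDB` class are hypothetical, and because this tutorial does not state which exception a missing schedule raises, the helper catches `Exception` broadly.

```python
def schedule_exists(db, project, name):
    """Return True if db.get_schedule succeeds, and False if it raises."""
    try:
        db.get_schedule(project, name)
    except Exception:
        # Assumption: any failure is treated as "the schedule does not exist"
        return False
    return True


class StubDB:
    """Hypothetical in-memory stand-in used only to exercise the helper."""

    def __init__(self, schedules):
        self._schedules = set(schedules)

    def get_schedule(self, project, name):
        if (project, name) not in self._schedules:
            raise LookupError(f"schedule {name} not found in {project}")
        return {"project": project, "name": name}


db = StubDB({("demo", "prep_data")})
print(schedule_exists(db, "demo", "prep_data"), schedule_exists(db, "demo", "other"))
# prints: True False
```

With a live MLRun service, you would pass the object returned by `mlrun.get_run_db()` as the `db` argument instead of the stub.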

<a id="gs-tutorial-1-done"></a>

## Done!

Congratulations! You've completed Part 1 of the MLRun getting-started tutorial.
Proceed to [Part 2](02-model-training.ipynb) to learn how to train an ML model.