PII Recognizer#

A function that detects PII (personally identifiable information) in text files and anonymizes the detected entities.

In this notebook we will go over the function’s docs and outputs and see an end-to-end example of running it.

  1. Documentation

  2. Results

  3. End-to-end Demo

1. Documentation#

The function receives the path of a directory that contains the text files to process. It walks through the directory and collects all the text files, detects the PII entities in each file, and applies the configured operator to each entity. It can also generate an HTML file with all the PII entities highlighted, and a JSON report that explains how each entity was detected.
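
The flow described above can be sketched with the standard library alone. This is a simplified illustration, not the function's implementation: anonymize_directory is a hypothetical helper, a single regular expression (the EMAIL pattern that appears in the demo's JSON report) stands in for the Presidio recognizers, and only top-level .txt files are read.

```python
import re
import tempfile
from pathlib import Path

# Stand-in detector: the EMAIL pattern shown in the demo's JSON report.
# The real function relies on Presidio recognizers instead.
EMAIL_PATTERN = re.compile(r"\S+@\S+")


def anonymize_directory(input_path: str, output_path: str, output_suffix: str) -> dict:
    """Hypothetical helper: replace e-mail addresses and write suffixed copies."""
    out_dir = Path(output_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    report = {}
    for txt_file in sorted(Path(input_path).glob("*.txt")):
        text = txt_file.read_text(encoding="utf-8")
        # Record where each entity was found (start/end offsets, as in the report)
        report[txt_file.name] = [
            {"entity_type": "EMAIL", "start": m.start(), "end": m.end()}
            for m in EMAIL_PATTERN.finditer(text)
        ]
        anonymized = EMAIL_PATTERN.sub("<EMAIL>", text)
        # pii.txt with suffix "anonymized" becomes pii_anonymized.txt
        out_file = out_dir / f"{txt_file.stem}_{output_suffix}{txt_file.suffix}"
        out_file.write_text(anonymized, encoding="utf-8")
    return report


# Run the sketch on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data"
    src.mkdir()
    (src / "pii.txt").write_text("connect with john_doe@gmail.com today", encoding="utf-8")
    demo_report = anonymize_directory(str(src), str(Path(tmp) / "out"), "anonymized")
    demo_output = (Path(tmp) / "out" / "pii_anonymized.txt").read_text(encoding="utf-8")

print(demo_output)  # connect with <EMAIL> today
print(demo_report)
```

The sketch returns only the offsets of each match; the real report also carries scores and recognizer details, as shown in the Results section.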

1.1. Parameters:#

  • context: mlrun.MLClientCtx

    The MLRun context

  • input_path: str

    The input directory with all the text files

  • output_path: str

    The directory used to store the anonymized text files. MLRun also uses it to log the artifact as a zip file.

  • output_suffix: str

    The suffix added to the name of each input file. For example, if the input text file is pii.txt and output_suffix is “anonymized”, the output file is pii_anonymized.txt.

  • html_key: str

    The artifact name of the html file

  • entities: List[str]

    The list of the entities to recognize. Please make sure the model you choose can recognize the entities.

  • entity_operator_map: dict

    A mapping from entity type to the operator that is applied to it, so that different entities can be handled differently. The supported operators are Keep, Mask, Replace, Redact, and Hash. For example:

       entity_operator_map = {
          "PERSON": ("keep", {}),
          "EMAIL": ("mask", {"masking_char": "#", "chars_to_mask": 5, "from_end": False}),
          "PHONE": ("hash", {}),
          "LOCATION": ("redact", {}),
          "ORGANIZATION": ("replace", {"new_value": "Company XYZ"})
          }
       

    In this example:

    • “PERSON” entities are kept as they are using the “keep” operator.

    • “EMAIL” entities are masked with the “#” character, masking the first five characters.

    • “PHONE” entities are replaced with their hashed value using the “hash” operator.

    • “LOCATION” entities are completely removed using the “redact” operator.

    • “ORGANIZATION” entities are replaced with the string “Company XYZ” using the “replace” operator.

  • model: str

    The model to use: “whole”, “spacy”, “pattern”, or “flair”. The default is “whole”.

    Each model can detect a specific set of entities. The “whole” model combines all three models and can detect all the entities listed below.

    • “spacy”: [“LOCATION”, “PERSON”, “NRP”, “ORGANIZATION”, “DATE_TIME”]

    • “pattern”: [“CREDIT_CARD”, “SSN”, “PHONE”, “EMAIL”]

    • “flair”: [“LOCATION”, “PERSON”, “NRP”, “GPE”, “ORGANIZATION”, “MAC_ADDRESS”, “US_BANK_NUMBER”, “IMEI”, “TITLE”, “LICENSE_PLATE”, “US_PASSPORT”, “CURRENCY”, “ROUTING_NUMBER”, “US_ITIN”, “US_DRIVER_LICENSE”, “AGE”, “PASSWORD”, “SWIFT_CODE”]

  • score_threshold:

    The minimum confidence score for a detection to be kept. The default is 0, to align with presidio.AnalyzerEngine.

  • generate_json_rpt:

    Whether to generate the JSON report that explains the detections

  • generate_html_rpt:

    Whether to generate the HTML file with highlighted PII entities

  • is_full_text:

    Whether to return the full text or just the sentences with pii entities.

  • is_full_html: bool

    Whether to return the full html or just the annotated html

  • is_full_report: bool

    Whether to return the full JSON report or only the score and the start and end indices
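
The operators accepted by entity_operator_map can be pictured with a few lines of plain Python. This is only an illustrative sketch: apply_operator is a hypothetical helper, not part of the function, which presumably relies on Presidio's own operators. The use of SHA-256 for “hash” is an assumption; it is consistent with the 64-character hex digests that appear in the demo output, but it was not checked against the library.

```python
import hashlib


def apply_operator(value: str, operator: str, params: dict) -> str:
    """Hypothetical helper: apply one anonymization operator to an entity value."""
    if operator == "keep":
        return value  # leave the entity untouched
    if operator == "redact":
        return ""  # remove the entity completely
    if operator == "replace":
        return params["new_value"]
    if operator == "hash":
        # Assumption: SHA-256 hex digest of the UTF-8 encoded value
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if operator == "mask":
        count = min(params["chars_to_mask"], len(value))
        char = params["masking_char"]
        if params.get("from_end", False):
            return value[: len(value) - count] + char * count
        return char * count + value[count:]
    raise ValueError(f"Unsupported operator: {operator}")


entity_operator_map = {
    "PERSON": ("keep", {}),
    "EMAIL": ("mask", {"masking_char": "#", "chars_to_mask": 5, "from_end": False}),
    "ORGANIZATION": ("replace", {"new_value": "Company XYZ"}),
}

operator, params = entity_operator_map["EMAIL"]
masked_email = apply_operator("john_doe@gmail.com", operator, params)
print(masked_email)  # #####doe@gmail.com
```

When chars_to_mask is larger than the entity, the sketch masks the whole value, which matches the fully masked e-mail address in the demo that uses chars_to_mask = 100.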

1.2. Outputs:#

There are three outputs of this function.

  • output_path: str

    The directory that stores all the anonymized text files

  • rpt_json: dict

    A report that explains how the model detected each PII entity

  • errors: dict

    The errors encountered while processing the text files, if any

2. Results#

The result of the function looks like the following:

For example if the input string is

John Doe 's ssn is 182838483, connect john doe with john_doe@gmail.com or 6288389029, he can pay you with 41482929939393

The anonymized_text is

<PERSON>'s <ORGANIZATION> is <SSN>, connect <PERSON> with <PERSON> <EMAIL> or <PHONE>, he can pay you with <CREDIT_CARD>

The html_str is

John Doe'sPERSON ssnORGANIZATION is 182838483SSN, connect me with john_doe@gmail.comPERSONjohn_doe@gmail.comEMAIL or 6288389029PHONE, he can pay you with 41482929939393CREDIT_CARD

The JSON report that explains the output is

[
  {
    "entity_type": "PERSON", # result of the labeling
    "start": 0, # start position of the entity
    "end": 9,  # end position of the entity
    "score": 0.99, # the model's confidence score + context improvement
    "analysis_explanation": {
      "recognizer": "FlairRecognizer", # the recognizer that detected this entity
      "pattern_name": null,
      "pattern": null,
      "original_score": 0.99, # the original confidence score from the pre-trained model
      "score": 0.99, # the final score = original_score + score_context_improvement
      "textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
      "score_context_improvement": 0, # the improvement from the context
      "supportive_context_word": "",
      "validation_result": null
    },
    "recognition_metadata": {
      "recognizer_identifier": "Flair Analytics_5577088640",
      "recognizer_name": "Flair Analytics"
    }
  },
  ....
]
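
The start and end offsets in the report can be used to recover the detected text. The sketch below assumes that end is exclusive, as in a Python slice; that assumption agrees with the offsets in the demo of section 3, from which the sample entry (start 9, end 17 in "Dear Mr. John Doe,") is taken. entity_spans is a hypothetical helper name.

```python
sample_text = "Dear Mr. John Doe,"
sample_report = [
    {"entity_type": "PERSON", "start": 9, "end": 17, "score": 1.0},
]


def entity_spans(text: str, report: list) -> list:
    """Hypothetical helper: return (entity_type, matched_text, score) for each entry."""
    return [
        (entry["entity_type"], text[entry["start"]:entry["end"]], entry["score"])
        for entry in report
    ]


spans = entity_spans(sample_text, sample_report)
print(spans)  # [('PERSON', 'John Doe', 1.0)]
```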

3. End-to-end Demo#

3.1. Recognition configurations#

  • model: Which model to use.

  • entities: Which entities to recognize.

  • score_threshold: The minimum score at which a recognition is considered trusted.
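
Before launching a run, these three settings can be sanity-checked with a small sketch like the following. The entity sets are copied from the parameter documentation above; unsupported_entities and filter_by_score are hypothetical helpers, and keeping results whose score is equal to the threshold is an assumption (it is consistent with the demo, where entities scored 0.5 are kept at a threshold of 0.5).

```python
SUPPORTED_ENTITIES = {
    "spacy": {"LOCATION", "PERSON", "NRP", "ORGANIZATION", "DATE_TIME"},
    "pattern": {"CREDIT_CARD", "SSN", "PHONE", "EMAIL"},
    "flair": {
        "LOCATION", "PERSON", "NRP", "GPE", "ORGANIZATION", "MAC_ADDRESS",
        "US_BANK_NUMBER", "IMEI", "TITLE", "LICENSE_PLATE", "US_PASSPORT",
        "CURRENCY", "ROUTING_NUMBER", "US_ITIN", "US_DRIVER_LICENSE", "AGE",
        "PASSWORD", "SWIFT_CODE",
    },
}


def unsupported_entities(model: str, entities: list) -> list:
    """Hypothetical helper: list the requested entities the model cannot detect."""
    if model == "whole":
        # "whole" combines the three models
        supported = set().union(*SUPPORTED_ENTITIES.values())
    else:
        supported = SUPPORTED_ENTITIES[model]
    return [entity for entity in entities if entity not in supported]


def filter_by_score(results: list, score_threshold: float) -> list:
    """Hypothetical helper: keep results at or above the threshold (assumed inclusive)."""
    return [result for result in results if result["score"] >= score_threshold]


missing = unsupported_entities("spacy", ["PERSON", "EMAIL"])
print(missing)  # ['EMAIL']

trusted = filter_by_score(
    [
        {"entity_type": "PERSON", "score": 0.87},
        {"entity_type": "EMAIL", "score": 0.5},
        {"entity_type": "PHONE", "score": 0.3},
    ],
    score_threshold=0.5,
)
print([result["entity_type"] for result in trusted])  # ['PERSON', 'EMAIL']
```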

import mlrun
artifact_path = "./"
proj = mlrun.get_or_create_project("pii", "./")
fn = mlrun.code_to_function(
    project="pii",
    name="pii_recognizer",
    filename="pii_recognizer.py",
    kind="job",
    image="mlrun/mlrun",
    handler="recognize_pii",
    description="This function is used to recognize PII in a given text",
)
run_obj = fn.run(
    artifact_path = artifact_path,
    params= {
        'model': "whole", 
        'input_path': "./data/",
        'output_path': "./data/output1/",
        "entities": ['PERSON', "EMAIL", "PHONE", "LOCATION", "ORGANIZATION"], # the entities to recognize
        "output_suffix": "output",
        "html_key": "highlighted",
        "score_threshold" : 0.5, # the score threshold to mark the recognition as trusted
    },
    returns = ["output_path: path", "rpt_json: file", "errors: file"],
    local=True,
)
> 2023-07-31 02:17:04,305 [info] Project loaded successfully: {'project_name': 'pii'}
> 2023-07-31 02:17:04,312 [warning] Failed to add git metadata, ignore if path is not part of a git repo.: {'path': './', 'error': '/User/pii_recognizer'}
> 2023-07-31 02:17:04,408 [warning] artifact/output path is not defined or is local and relative, artifacts will not be visible in the UI: {'output_path': './'}
> 2023-07-31 02:17:04,409 [info] Storing function: {'name': 'pii-recognizer-recognize-pii', 'uid': '51b5ad8144004e52a1008c08850842c8', 'db': None}
2023-07-31 02:17:04,567 loading file /User/.flair/models/flair-pii-distilbert/models--beki--flair-pii-distilbert/snapshots/20fb59f1762edcf253bce67716a94a43cb075ae6/pytorch_model.bin
2023-07-31 02:17:07,730 SequenceTagger predicts: Dictionary with 21 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-DATE_TIME, B-DATE_TIME, E-DATE_TIME, I-DATE_TIME, S-ORG, B-ORG, E-ORG, I-ORG, S-NRP, B-NRP, E-NRP, I-NRP
Model loaded
project: pii
iter: 0
start: Jul 31 02:17:04
state: completed
name: pii-recognizer-recognize-pii
labels: v3io_user=pengw, kind=, owner=pengw, host=jupyter-pengw-5f99fb678d-mnvxl
parameters: model=whole, input_path=./data/, output_path=./data/output1/, entities=['PERSON', 'EMAIL', 'PHONE', 'LOCATION', 'ORGANIZATION'], output_suffix=output, html_key=highlighted, score_threshold=0.5
artifacts: highlighted, output_path, rpt_json, errors

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-07-31 02:17:12,403 [info] Run execution finished: {'status': 'completed', 'name': 'pii-recognizer-recognize-pii'}
# Get the MLRun context
context = mlrun.get_or_create_ctx('pii_ctx1')
import pathlib
from tqdm.auto import tqdm
for txt_file in tqdm(
    list(pathlib.Path("./data/output1/").glob("*.txt")),
    desc="Processing files",
    unit="file",
):
    # Load the anonymized text from the output file and print it
    text = txt_file.read_text()
    print(text)
Dear Mr. <PERSON>,

We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway! Congratulations! You, along with your family, will enjoy a luxurious stay at our resort in the beautiful city of <ORGANIZATION>. Your flight tickets have been booked, and you will be departing on July 15th, 2023.

Please provide us with the necessary details to finalize your travel arrangements. We kindly request your full name, date of birth, passport number, and contact information. Rest assured that all provided information will be handled with utmost confidentiality and in compliance with data protection regulations.

We look forward to creating unforgettable memories for you and your loved ones during your stay with us. If you have any questions or require further assistance, please don't hesitate to contact our customer support team.

<PERSON> <ORGANIZATION> is 182838483, connect him with <EMAIL> or <PHONE>, he can pay you with <PHONE>9393
# Check the highlighted HTML
html_output = context.get_cached_artifact("highlighted")
html_str = mlrun.get_dataitem(html_output.get_target_path()).get().decode("utf-8")
from IPython.display import display, HTML
display(HTML(html_str))
Highlighted Pii Entities

  • data/letter.txt

    Dear Mr. John DoePERSON, We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway! Congratulations! You, along with your family, will enjoy a luxurious stay at our resort in the beautiful city of RivieraLOCATIONRivieraORGANIZATION. Your flight tickets have been booked, and you will be departing on July 15th, 2023. Please provide us with the necessary details to finalize your travel arrangements. We kindly request your full name, date of birth, passport number, and contact information. Rest assured that all provided information will be handled with utmost confidentiality and in compliance with data protection regulations. We look forward to creating unforgettable memories for you and your loved ones during your stay with us. If you have any questions or require further assistance, please don't hesitate to contact our customer support team.

  • data/pii_data.txt

    John smith'sPERSON ssnORGANIZATION is 182838483, connect him with JohnPERSONJohn_smith@gmail.comEMAILsmithPERSON@gmail.com or 6288389029PHONE, he can pay you with 4148292993PHONE9393

    # Check the JSON report with the explanation.
    rpt_output1 = context.get_cached_artifact("rpt_json")
    rpt_str1 = mlrun.get_dataitem(rpt_output1.get_target_path()).get().decode("utf-8")
    import json
    obj = json.loads(rpt_str1)
     
    # Pretty Print JSON
    json_formatted_str1 = json.dumps(obj, indent=4)
    print(json_formatted_str1)
    
    {
        "data/letter.txt": [
            {
                "entity_type": "PERSON",
                "start": 9,
                "end": 17,
                "score": 1,
                "analysis_explanation": {
                    "recognizer": "CustomSpacyRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1,
                    "score": 1,
                    "textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition (Privy-trained)",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "CustomSpacyRecognizer",
                    "recognizer_identifier": "CustomSpacyRecognizer_139944219101744"
                }
            },
            {
                "entity_type": "LOCATION",
                "start": 248,
                "end": 255,
                "score": 1.0,
                "analysis_explanation": {
                    "recognizer": "FlairRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1.0,
                    "score": 1.0,
                    "textual_explanation": "Identified as LOC by Flair's Named Entity Recognition",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_identifier": "Flair Analytics_139944219101936",
                    "recognizer_name": "Flair Analytics"
                }
            },
            {
                "entity_type": "ORGANIZATION",
                "start": 248,
                "end": 255,
                "score": 1,
                "analysis_explanation": {
                    "recognizer": "CustomSpacyRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1,
                    "score": 1,
                    "textual_explanation": "Identified as ORG by Spacy's Named Entity Recognition (Privy-trained)",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "CustomSpacyRecognizer",
                    "recognizer_identifier": "CustomSpacyRecognizer_139944219101744"
                }
            }
        ],
        "data/pii_data.txt": [
            {
                "entity_type": "PERSON",
                "start": 0,
                "end": 12,
                "score": 1,
                "analysis_explanation": {
                    "recognizer": "CustomSpacyRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1,
                    "score": 1,
                    "textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition (Privy-trained)",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "CustomSpacyRecognizer",
                    "recognizer_identifier": "CustomSpacyRecognizer_139944219101744"
                }
            },
            {
                "entity_type": "ORGANIZATION",
                "start": 13,
                "end": 16,
                "score": 1,
                "analysis_explanation": {
                    "recognizer": "CustomSpacyRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1,
                    "score": 1,
                    "textual_explanation": "Identified as ORG by Spacy's Named Entity Recognition (Privy-trained)",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "CustomSpacyRecognizer",
                    "recognizer_identifier": "CustomSpacyRecognizer_139944219101744"
                }
            },
            {
                "entity_type": "PERSON",
                "start": 53,
                "end": 58,
                "score": 1.0,
                "analysis_explanation": {
                    "recognizer": "FlairRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1.0,
                    "score": 1.0,
                    "textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_identifier": "Flair Analytics_139944219101936",
                    "recognizer_name": "Flair Analytics"
                }
            },
            {
                "entity_type": "PERSON",
                "start": 48,
                "end": 52,
                "score": 0.87,
                "analysis_explanation": {
                    "recognizer": "FlairRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 0.87,
                    "score": 0.87,
                    "textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_identifier": "Flair Analytics_139944219101936",
                    "recognizer_name": "Flair Analytics"
                }
            },
            {
                "entity_type": "EMAIL",
                "start": 48,
                "end": 68,
                "score": 0.5,
                "analysis_explanation": {
                    "recognizer": "PatternRecognizer",
                    "pattern_name": "EMAIL",
                    "pattern": "\\S+@\\S+",
                    "original_score": 0.5,
                    "score": 0.5,
                    "textual_explanation": null,
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "PatternRecognizer",
                    "recognizer_identifier": "PatternRecognizer_139944352474640"
                }
            },
            {
                "entity_type": "PHONE",
                "start": 72,
                "end": 82,
                "score": 0.5,
                "analysis_explanation": {
                    "recognizer": "PatternRecognizer",
                    "pattern_name": "PHONE",
                    "pattern": "\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}",
                    "original_score": 0.5,
                    "score": 0.5,
                    "textual_explanation": null,
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "PatternRecognizer",
                    "recognizer_identifier": "PatternRecognizer_139944352476560"
                }
            },
            {
                "entity_type": "PHONE",
                "start": 104,
                "end": 114,
                "score": 0.5,
                "analysis_explanation": {
                    "recognizer": "PatternRecognizer",
                    "pattern_name": "PHONE",
                    "pattern": "\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}",
                    "original_score": 0.5,
                    "score": 0.5,
                    "textual_explanation": null,
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "PatternRecognizer",
                    "recognizer_identifier": "PatternRecognizer_139944352476560"
                }
            }
        ]
    }
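
The two PHONE entries in the report come from the regular expression listed in their pattern field. A quick check with Python's re module shows why the 14-digit number at the end of the sentence is only partly replaced in the anonymized text. The sentence below is reconstructed from the highlighted HTML of data/pii_data.txt; the variable names are illustrative.

```python
import re

# The PHONE pattern from the report, with the JSON escaping removed
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

sentence = (
    "John smith's ssn is 182838483, connect him with John_smith@gmail.com "
    "or 6288389029, he can pay you with 41482929939393"
)

phone_spans = [(m.start(), m.end(), m.group()) for m in PHONE_PATTERN.finditer(sentence)]
print(phone_spans)  # [(72, 82, '6288389029'), (104, 114, '4148292993')]

# The pattern consumes only ten digits, so four digits are left over
leftover = sentence[phone_spans[-1][1]:]
print(leftover)  # 9393
```

The offsets match the start and end values of the two PHONE entries in the report, and the leftover digits explain the "<PHONE>9393" fragment in the anonymized output.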
    

3.2. Masking configurations#

  • entity_operator_map: Defines what to do with the recognized entities: keep them, mask them (and with which character), redact them, replace them, or hash them.

         entity_operator_map = {
            "PERSON": ("keep", {}),
            "EMAIL": ("mask", {"masking_char": "😀", "chars_to_mask": 5, "from_end": False}),
            "PHONE": ("hash", {}),
            "LOCATION": ("redact", {}),
            "ORGANIZATION": ("replace", {"new_value": "Company XYZ"})
            }
         
    import mlrun
    artifact_path = "./"
    proj = mlrun.get_or_create_project("pii", "./")
    fn = mlrun.code_to_function(
        project="pii",
        name="pii_recognizer",
        filename="pii_recognizer.py",
        kind="job",
        image="mlrun/mlrun",
        handler="recognize_pii",
        description="This function is used to recognize PII in a given text",
    )
    
    entity_operator_map = {
            "PERSON": ("keep", {}),
            "EMAIL": ("mask", {"masking_char": "😀", "chars_to_mask" : 100, "from_end": False}),
            "PHONE": ("hash", {}),
            "LOCATION": ("redact", {}),
            "ORGANIZATION": ("replace", {"new_value": "Company XYZ"})
            }
    run_obj = fn.run(
        artifact_path = artifact_path,
        params= {
            'model': "whole", 
            'input_path': "./data/",
            'output_path': "./data/output2/",
            "entities": ['PERSON', "EMAIL", "PHONE", "LOCATION", "ORGANIZATION"],
            "output_suffix": "output",
            "html_key": "highlighted",
            "score_threshold" : 0.5,
            "entity_operator_map": entity_operator_map,
            
        },
        returns = ["output_path: path", "rpt_json: file", "errors: file"],
        local=True,
    )
    
    > 2023-07-31 02:20:40,550 [info] Project loaded successfully: {'project_name': 'pii'}
    > 2023-07-31 02:20:40,556 [warning] Failed to add git metadata, ignore if path is not part of a git repo.: {'path': './', 'error': '/User/pii_recognizer'}
    > 2023-07-31 02:20:40,649 [warning] artifact/output path is not defined or is local and relative, artifacts will not be visible in the UI: {'output_path': './'}
    > 2023-07-31 02:20:40,649 [info] Storing function: {'name': 'pii-recognizer-recognize-pii', 'uid': '2b43f80c7ca44b43b229760bb55f814d', 'db': None}
    2023-07-31 02:20:40,812 loading file /User/.flair/models/flair-pii-distilbert/models--beki--flair-pii-distilbert/snapshots/20fb59f1762edcf253bce67716a94a43cb075ae6/pytorch_model.bin
    2023-07-31 02:20:44,130 SequenceTagger predicts: Dictionary with 21 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-DATE_TIME, B-DATE_TIME, E-DATE_TIME, I-DATE_TIME, S-ORG, B-ORG, E-ORG, I-ORG, S-NRP, B-NRP, E-NRP, I-NRP
    Model loaded
    
    project: pii
    iter: 0
    start: Jul 31 02:20:40
    state: completed
    name: pii-recognizer-recognize-pii
    labels: v3io_user=pengw, kind=, owner=pengw, host=jupyter-pengw-5f99fb678d-mnvxl
    parameters: model=whole, input_path=./data/, output_path=./data/output2/, entities=['PERSON', 'EMAIL', 'PHONE', 'LOCATION', 'ORGANIZATION'], output_suffix=output, html_key=highlighted, score_threshold=0.5, entity_operator_map={'PERSON': ('keep', {}), 'EMAIL': ('mask', {'masking_char': '😀', 'chars_to_mask': 100, 'from_end': False, 'entity_type': 'EMAIL'}), 'PHONE': ('hash', {}), 'LOCATION': ('redact', {}), 'ORGANIZATION': ('replace', {'new_value': 'Company XYZ', 'entity_type': 'ORGANIZATION'})}
    artifacts: highlighted, output_path, rpt_json, errors
    
    
    > to track results use the .show() or .logs() methods or click here to open in UI
    > 2023-07-31 02:20:48,903 [info] Run execution finished: {'status': 'completed', 'name': 'pii-recognizer-recognize-pii'}
    
    # Get the MLRun context
    context = mlrun.get_or_create_ctx('pii_ctx1')
    import pathlib
    from tqdm.auto import tqdm
    for txt_file in tqdm(
        list(pathlib.Path("./data/output2/").glob("*.txt")),
        desc="Processing files",
        unit="file",
    ):
        # Load the anonymized text from the output file and print it
        text = txt_file.read_text()
        print(text)
    
    Dear Mr. John Doe,
    
    We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway! Congratulations! You, along with your family, will enjoy a luxurious stay at our resort in the beautiful city of Company XYZ. Your flight tickets have been booked, and you will be departing on July 15th, 2023.
    
    Please provide us with the necessary details to finalize your travel arrangements. We kindly request your full name, date of birth, passport number, and contact information. Rest assured that all provided information will be handled with utmost confidentiality and in compliance with data protection regulations.
    
    We look forward to creating unforgettable memories for you and your loved ones during your stay with us. If you have any questions or require further assistance, please don't hesitate to contact our customer support team.
    
    John smith's Company XYZ is 182838483, connect him with 😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀 or 3990096a212e92850c3b3c8e57ab398252d482444a32def6b030cbac2d51efa3, he can pay you with a6983d9477e93eab115305afd124bd096699e6cb7d2ce72ec6e29a6378a4e8059393
    
    # Check the highlighted HTML
    html_output = context.get_cached_artifact("highlighted")
    html_str = mlrun.get_dataitem(html_output.get_target_path()).get().decode("utf-8")
    from IPython.display import display, HTML
    display(HTML(html_str))
    
    Highlighted Pii Entities

  • data/letter.txt

    Dear Mr. John DoePERSON, We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway! Congratulations! You, along with your family, will enjoy a luxurious stay at our resort in the beautiful city of RivieraLOCATIONRivieraORGANIZATION. Your flight tickets have been booked, and you will be departing on July 15th, 2023. Please provide us with the necessary details to finalize your travel arrangements. We kindly request your full name, date of birth, passport number, and contact information. Rest assured that all provided information will be handled with utmost confidentiality and in compliance with data protection regulations. We look forward to creating unforgettable memories for you and your loved ones during your stay with us. If you have any questions or require further assistance, please don't hesitate to contact our customer support team.

  • data/pii_data.txt

    John smith'sPERSON ssnORGANIZATION is 182838483, connect him with JohnPERSONJohn_smith@gmail.comEMAILsmithPERSON@gmail.com or 6288389029PHONE, he can pay you with 4148292993PHONE9393

    # Check the JSON report with the explanation.
    rpt_output1 = context.get_cached_artifact("rpt_json")
    rpt_str1 = mlrun.get_dataitem(rpt_output1.get_target_path()).get().decode("utf-8")
    import json
    obj = json.loads(rpt_str1)
     
    # Pretty Print JSON
    json_formatted_str1 = json.dumps(obj, indent=4)
    print(json_formatted_str1)
    
    {
        "data/letter.txt": [
            {
                "entity_type": "PERSON",
                "start": 9,
                "end": 17,
                "score": 1.0,
                "analysis_explanation": {
                    "recognizer": "FlairRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1.0,
                    "score": 1.0,
                    "textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_identifier": "Flair Analytics_139944345555488",
                    "recognizer_name": "Flair Analytics"
                }
            },
            {
                "entity_type": "LOCATION",
                "start": 248,
                "end": 255,
                "score": 1.0,
                "analysis_explanation": {
                    "recognizer": "FlairRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1.0,
                    "score": 1.0,
                    "textual_explanation": "Identified as LOC by Flair's Named Entity Recognition",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_identifier": "Flair Analytics_139944345555488",
                    "recognizer_name": "Flair Analytics"
                }
            },
            {
                "entity_type": "ORGANIZATION",
                "start": 248,
                "end": 255,
                "score": 1,
                "analysis_explanation": {
                    "recognizer": "CustomSpacyRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1,
                    "score": 1,
                    "textual_explanation": "Identified as ORG by Spacy's Named Entity Recognition (Privy-trained)",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "CustomSpacyRecognizer",
                    "recognizer_identifier": "CustomSpacyRecognizer_139943499301312"
                }
            }
        ],
        "data/pii_data.txt": [
            {
                "entity_type": "PERSON",
                "start": 0,
                "end": 12,
                "score": 1,
                "analysis_explanation": {
                    "recognizer": "CustomSpacyRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1,
                    "score": 1,
                    "textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition (Privy-trained)",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "CustomSpacyRecognizer",
                    "recognizer_identifier": "CustomSpacyRecognizer_139943499301312"
                }
            },
            {
                "entity_type": "ORGANIZATION",
                "start": 13,
                "end": 16,
                "score": 1,
                "analysis_explanation": {
                    "recognizer": "CustomSpacyRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1,
                    "score": 1,
                    "textual_explanation": "Identified as ORG by Spacy's Named Entity Recognition (Privy-trained)",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "CustomSpacyRecognizer",
                    "recognizer_identifier": "CustomSpacyRecognizer_139943499301312"
                }
            },
            {
                "entity_type": "PERSON",
                "start": 53,
                "end": 58,
                "score": 1.0,
                "analysis_explanation": {
                    "recognizer": "FlairRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 1.0,
                    "score": 1.0,
                    "textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_identifier": "Flair Analytics_139944345555488",
                    "recognizer_name": "Flair Analytics"
                }
            },
            {
                "entity_type": "PERSON",
                "start": 48,
                "end": 52,
                "score": 0.87,
                "analysis_explanation": {
                    "recognizer": "FlairRecognizer",
                    "pattern_name": null,
                    "pattern": null,
                    "original_score": 0.87,
                    "score": 0.87,
                    "textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_identifier": "Flair Analytics_139944345555488",
                    "recognizer_name": "Flair Analytics"
                }
            },
            {
                "entity_type": "EMAIL",
                "start": 48,
                "end": 68,
                "score": 0.5,
                "analysis_explanation": {
                    "recognizer": "PatternRecognizer",
                    "pattern_name": "EMAIL",
                    "pattern": "\\S+@\\S+",
                    "original_score": 0.5,
                    "score": 0.5,
                    "textual_explanation": null,
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "PatternRecognizer",
                    "recognizer_identifier": "PatternRecognizer_139943864893792"
                }
            },
            {
                "entity_type": "PHONE",
                "start": 72,
                "end": 82,
                "score": 0.5,
                "analysis_explanation": {
                    "recognizer": "PatternRecognizer",
                    "pattern_name": "PHONE",
                    "pattern": "\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}",
                    "original_score": 0.5,
                    "score": 0.5,
                    "textual_explanation": null,
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "PatternRecognizer",
                    "recognizer_identifier": "PatternRecognizer_139943864894128"
                }
            },
            {
                "entity_type": "PHONE",
                "start": 104,
                "end": 114,
                "score": 0.5,
                "analysis_explanation": {
                    "recognizer": "PatternRecognizer",
                    "pattern_name": "PHONE",
                    "pattern": "\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}",
                    "original_score": 0.5,
                    "score": 0.5,
                    "textual_explanation": null,
                    "score_context_improvement": 0,
                    "supportive_context_word": "",
                    "validation_result": null
                },
                "recognition_metadata": {
                    "recognizer_name": "PatternRecognizer",
                    "recognizer_identifier": "PatternRecognizer_139943864894128"
                }
            }
        ]
    }
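Each detection in the detailed report names the recognizer that produced it under `analysis_explanation.recognizer`. As a minimal sketch (the helper `count_by_recognizer` and the trimmed `sample_report` are illustrative and not part of the function), the report can be summarized per recognizer and entity type like this:

```python
from collections import Counter


def count_by_recognizer(report):
    """Count detections per (recognizer, entity_type) pair across all files."""
    counts = Counter()
    for detections in report.values():
        for det in detections:
            recognizer = det["analysis_explanation"]["recognizer"]
            counts[(recognizer, det["entity_type"])] += 1
    return counts


# A trimmed sample with the same shape as the detailed report above.
sample_report = {
    "data/pii_data.txt": [
        {"entity_type": "PERSON", "start": 0, "end": 12, "score": 1,
         "analysis_explanation": {"recognizer": "CustomSpacyRecognizer"}},
        {"entity_type": "PERSON", "start": 53, "end": 58, "score": 1.0,
         "analysis_explanation": {"recognizer": "FlairRecognizer"}},
        {"entity_type": "EMAIL", "start": 48, "end": 68, "score": 0.5,
         "analysis_explanation": {"recognizer": "PatternRecognizer"}},
    ]
}

for (recognizer, entity_type), n in sorted(count_by_recognizer(sample_report).items()):
    print(f"{recognizer:24s} {entity_type:12s} {n}")
```

In a notebook, the same helper could be applied to the parsed `rpt_json` artifact instead of the sample dictionary.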
    

    3.3 Output configurations#

    • is_full_text: whether to produce the full text or only the sentences that contain PII entities

    • generate_html: whether to produce the HTML file with the PII entities highlighted

    • generate_json: whether to produce the JSON report that explains the process

    • is_full_html: whether to highlight the PII entities in the full text or only in the sentences that contain them

    • is_full_report: whether to produce the JSON report with detailed information or only the entity type, start index, end index, and score of each entity
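To illustrate the difference that `is_full_report` makes, the sketch below reduces a detailed report entry to the four fields that appear in the brief report. It is only an illustration of the two report shapes, not the function's actual code; `BRIEF_KEYS` and `to_brief_report` are hypothetical names.

```python
# Fields that remain in the brief report (is_full_report=False).
BRIEF_KEYS = ("entity_type", "start", "end", "score")


def to_brief_report(full_report):
    """Keep only the brief fields of every detection, file by file."""
    return {
        path: [{key: det[key] for key in BRIEF_KEYS} for det in detections]
        for path, detections in full_report.items()
    }


# One detection in the detailed shape, abbreviated for the example.
full_report = {
    "data/pii_data.txt": [
        {
            "entity_type": "EMAIL",
            "start": 48,
            "end": 68,
            "score": 0.5,
            "analysis_explanation": {"recognizer": "PatternRecognizer"},
            "recognition_metadata": {"recognizer_name": "PatternRecognizer"},
        }
    ]
}

brief_report = to_brief_report(full_report)
print(brief_report)
```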

    import mlrun
    artifact_path = "./"
    proj = mlrun.get_or_create_project("pii", "./")
    fn = mlrun.code_to_function(
        project="pii",
        name="pii_recognizer",
        filename="pii_recognizer.py",
        kind="job",
        image="mlrun/mlrun",
        handler="recognize_pii",
        description="This function is used to recognize PII in a given text",
    )
    
    entity_operator_map = {
            "PERSON": ("keep", {}),
            "EMAIL": ("mask", {"masking_char": "😀", "chars_to_mask" : 100, "from_end": False}),
            "PHONE": ("hash", {}),
            "LOCATION": ("redact", {}),
            "ORGANIZATION": ("replace", {"new_value": "Company XYZ"})
            }
    run_obj = fn.run(
        artifact_path = artifact_path,
        params= {
            'model': "whole", 
            'input_path': "./data/",
            'output_path': "./data/output3/",
            "entities": ['PERSON', "EMAIL", "PHONE", "LOCATION", "ORGANIZATION"],
            "output_suffix": "output",
            "html_key": "highlighted",
            "score_threshold" : 0.5,
            "entity_operator_map": entity_operator_map,
            "is_full_text": False,
            "is_full_html": False,
            "is_full_report": False,
        },
        returns = ["output_path: path", "rpt_json: file", "errors: file"],
        local=True,
    )
    
    > 2023-07-31 02:22:57,789 [info] Project loaded successfully: {'project_name': 'pii'}
    > 2023-07-31 02:22:57,799 [warning] Failed to add git metadata, ignore if path is not part of a git repo.: {'path': './', 'error': '/User/pii_recognizer'}
    > 2023-07-31 02:22:57,891 [warning] artifact/output path is not defined or is local and relative, artifacts will not be visible in the UI: {'output_path': './'}
    > 2023-07-31 02:22:57,892 [info] Storing function: {'name': 'pii-recognizer-recognize-pii', 'uid': '3f6d701e423346b39026dc365698c15c', 'db': None}
    2023-07-31 02:22:58,079 loading file /User/.flair/models/flair-pii-distilbert/models--beki--flair-pii-distilbert/snapshots/20fb59f1762edcf253bce67716a94a43cb075ae6/pytorch_model.bin
    2023-07-31 02:23:01,565 SequenceTagger predicts: Dictionary with 21 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-DATE_TIME, B-DATE_TIME, E-DATE_TIME, I-DATE_TIME, S-ORG, B-ORG, E-ORG, I-ORG, S-NRP, B-NRP, E-NRP, I-NRP
    Model loaded
    
    project      pii
    iter         0
    start        Jul 31 02:22:57
    state        completed
    name         pii-recognizer-recognize-pii
    labels       v3io_user=pengw
                 kind=
                 owner=pengw
                 host=jupyter-pengw-5f99fb678d-mnvxl
    parameters   model=whole
                 input_path=./data/
                 output_path=./data/output3/
                 entities=['PERSON', 'EMAIL', 'PHONE', 'LOCATION', 'ORGANIZATION']
                 output_suffix=output
                 html_key=highlighted
                 score_threshold=0.5
                 entity_operator_map={'PERSON': ('keep', {}), 'EMAIL': ('mask', {'masking_char': '😀', 'chars_to_mask': 100, 'from_end': False, 'entity_type': 'EMAIL'}), 'PHONE': ('hash', {}), 'LOCATION': ('redact', {}), 'ORGANIZATION': ('replace', {'new_value': 'Company XYZ', 'entity_type': 'ORGANIZATION'})}
                 is_full_text=False
                 is_full_html=False
                 is_full_report=False
    artifacts    highlighted
                 output_path
                 rpt_json
                 errors
    
    
    > to track results use the .show() or .logs() methods or click here to open in UI
    > 2023-07-31 02:23:06,096 [info] Run execution finished: {'status': 'completed', 'name': 'pii-recognizer-recognize-pii'}
    
    import pathlib
    from tqdm.auto import tqdm

    # Get the MLRun context
    context = mlrun.get_or_create_ctx("pii_ctx")

    # Print every anonymized text file written to the output directory
    for txt_file in tqdm(
        list(pathlib.Path("./data/output3/").glob("*.txt")),
        desc="Processing files",
        unit="file",
    ):
        # Load the string from the text file
        text = txt_file.read_text()
        print(text)
    
    Dear Mr. John Doe,
    
    We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway!
    John smith's Company XYZ is 182838483, connect him with 😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀 or 3990096a212e92850c3b3c8e57ab398252d482444a32def6b030cbac2d51efa3, he can pay you with a6983d9477e93eab115305afd124bd096699e6cb7d2ce72ec6e29a6378a4e8059393
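The anonymized text above follows from the `entity_operator_map` passed to the run: the organization is replaced with "Company XYZ", the 20-character email is fully masked because `chars_to_mask` (100) exceeds its length, and the phone numbers are hashed. The following is a simplified, standalone sketch of those five operators and not the library's implementation. The helper `apply_operator` is hypothetical, SHA-256 is assumed for `hash` because the hashed values shown are 64 hexadecimal characters long, and `redact` is assumed to remove the text entirely.

```python
import hashlib


def apply_operator(text, operator, params):
    """Apply one anonymization operator to a detected entity's text."""
    if operator == "keep":
        return text
    if operator == "redact":
        # Assumption: redaction removes the entity text entirely.
        return ""
    if operator == "replace":
        return params["new_value"]
    if operator == "hash":
        # Assumption: SHA-256, rendered as a hexadecimal digest.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    if operator == "mask":
        # Never mask more characters than the entity contains.
        n = min(params["chars_to_mask"], len(text))
        fill = params["masking_char"] * n
        if params["from_end"]:
            return text[: len(text) - n] + fill
        return fill + text[n:]
    raise ValueError(f"Unsupported operator: {operator}")


email = "John_smith@gmail.com"
masked = apply_operator(
    email, "mask", {"masking_char": "😀", "chars_to_mask": 100, "from_end": False}
)
print(masked)
print(apply_operator("ssn", "replace", {"new_value": "Company XYZ"}))
print(apply_operator("John smith's", "keep", {}))
```

Capping the mask length at the entity length is what lets a deliberately large `chars_to_mask` value act as "mask everything".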
    
    # Check the highlighted HTML
    from IPython.display import HTML, display

    html_output = context.get_cached_artifact("highlighted")
    html_str = mlrun.get_dataitem(html_output.get_target_path()).get().decode("utf-8")
    display(HTML(html_str))
    
    Highlighted Pii Entities


  • data/letter.txt

    Dear Mr. John Doe [PERSON], We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway! Congratulations! You, along with your family, will enjoy a luxurious stay at our resort in the beautiful city of Riviera [LOCATION] Riviera [ORGANIZATION]

  • data/pii_data.txt

    John smith's [PERSON] ssn [ORGANIZATION] is 182838483, connect him with John [PERSON] John_smith@gmail.com [EMAIL] smith [PERSON]

    # Check the JSON report that explains each detection
    import json

    rpt_output = context.get_cached_artifact("rpt_json")
    rpt_str = mlrun.get_dataitem(rpt_output.get_target_path()).get().decode("utf-8")
    obj = json.loads(rpt_str)

    # Pretty-print the JSON
    json_formatted_str = json.dumps(obj, indent=4)
    print(json_formatted_str)
    
    {
        "data/letter.txt": [
            {
                "entity_type": "PERSON",
                "start": 9,
                "end": 17,
                "score": 1
            },
            {
                "entity_type": "LOCATION",
                "start": 248,
                "end": 255,
                "score": 1.0
            },
            {
                "entity_type": "ORGANIZATION",
                "start": 248,
                "end": 255,
                "score": 1
            }
        ],
        "data/pii_data.txt": [
            {
                "entity_type": "PERSON",
                "start": 0,
                "end": 12,
                "score": 1
            },
            {
                "entity_type": "ORGANIZATION",
                "start": 13,
                "end": 16,
                "score": 1
            },
            {
                "entity_type": "PERSON",
                "start": 53,
                "end": 58,
                "score": 1.0
            },
            {
                "entity_type": "PERSON",
                "start": 48,
                "end": 52,
                "score": 0.87
            },
            {
                "entity_type": "EMAIL",
                "start": 48,
                "end": 68,
                "score": 0.5
            },
            {
                "entity_type": "PHONE",
                "start": 72,
                "end": 82,
                "score": 0.5
            },
            {
                "entity_type": "PHONE",
                "start": 104,
                "end": 114,
                "score": 0.5
            }
        ]
    }
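Several detections in the brief report cover overlapping character ranges, for example the two PERSON spans that fall inside the EMAIL span in `data/pii_data.txt`. As a small sketch for inspecting such cases (the helper `overlapping_pairs` is hypothetical, and `end` is treated as exclusive, which matches the 20-character email reported at 48 to 68), overlapping detections can be listed as follows:

```python
from itertools import combinations


def overlapping_pairs(detections):
    """Return every pair of detections whose [start, end) ranges intersect."""
    pairs = []
    for a, b in combinations(detections, 2):
        if a["start"] < b["end"] and b["start"] < a["end"]:
            pairs.append((a, b))
    return pairs


# Detections for data/pii_data.txt, copied from the brief report above.
detections = [
    {"entity_type": "PERSON", "start": 0, "end": 12, "score": 1},
    {"entity_type": "ORGANIZATION", "start": 13, "end": 16, "score": 1},
    {"entity_type": "PERSON", "start": 53, "end": 58, "score": 1.0},
    {"entity_type": "PERSON", "start": 48, "end": 52, "score": 0.87},
    {"entity_type": "EMAIL", "start": 48, "end": 68, "score": 0.5},
    {"entity_type": "PHONE", "start": 72, "end": 82, "score": 0.5},
    {"entity_type": "PHONE", "start": 104, "end": 114, "score": 0.5},
]

for a, b in overlapping_pairs(detections):
    print(
        f"{a['entity_type']} [{a['start']}, {a['end']}) overlaps "
        f"{b['entity_type']} [{b['start']}, {b['end']})"
    )
```

Overlaps like these explain why the highlighted HTML repeats parts of the email address under both the PERSON and EMAIL labels.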