PII Recognizer#
A function that detects PII (personally identifiable information) in text and anonymizes the detected entities.
This notebook goes over the function’s documentation and outputs and walks through an end-to-end example of running it.
Documentation
Results
End-to-end Demo
1. Documentation#
The function receives the path of a directory that contains text files. It walks through the directory and collects every text file, detects the PII entities in each file, and applies the configured operator to each detected entity. It can also generate an HTML file in which all PII entities are highlighted, and a JSON report that explains how each entity was detected.
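The flow described above can be sketched with the standard library alone. This is a minimal illustration, not the function's implementation: `anonymize_directory` and `EMAIL_PATTERN` are hypothetical names, and a single regular expression stands in for the trained recognizers.

```python
import re
import tempfile
from pathlib import Path

# Illustrative stand-in for the real recognizers (assumption: one regex only).
EMAIL_PATTERN = re.compile(r"\S+@\S+")


def anonymize_directory(input_path: str, output_path: str, output_suffix: str) -> dict:
    """Walk input_path, replace e-mail-like tokens, and write suffixed copies."""
    out_dir = Path(output_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    report = {}
    for txt_file in sorted(Path(input_path).rglob("*.txt")):
        text = txt_file.read_text(encoding="utf-8")
        # Record every detection before the text is rewritten.
        report[txt_file.name] = [
            {"entity_type": "EMAIL", "start": m.start(), "end": m.end()}
            for m in EMAIL_PATTERN.finditer(text)
        ]
        anonymized = EMAIL_PATTERN.sub("<EMAIL>", text)
        # pii.txt with suffix "anonymized" becomes pii_anonymized.txt.
        out_file = out_dir / f"{txt_file.stem}_{output_suffix}{txt_file.suffix}"
        out_file.write_text(anonymized, encoding="utf-8")
    return report


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in"
    src.mkdir()
    (src / "pii.txt").write_text("contact john_doe@gmail.com today", encoding="utf-8")
    demo_report = anonymize_directory(str(src), str(Path(tmp) / "out"), "anonymized")
    demo_output = (Path(tmp) / "out" / "pii_anonymized.txt").read_text(encoding="utf-8")
    print(demo_output)
    print(demo_report)
```

The sketch keeps the output directory separate from the input directory so that already anonymized files are not picked up again on the next walk.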
1.1. Parameters:#
context:
mlrun.MLClientCtx
The MLRun context
input_path:
str
The input directory with all the text files
output_path:
str
The directory in which the anonymized text files are stored. MLRun also uses it to log the output as a zip-file artifact.
output_suffix:
str
The suffix added to the name of each input file. For example, if the input file is pii.txt and output_suffix is “anonymized”, the output file is pii_anonymized.txt.
html_key:
str
The artifact name of the html file
entities:
List[str]
The list of the entities to recognize. Please make sure the model you choose can recognize the entities.
entity_operator_map:
dict
Maps each entity type to the operator applied to it and to that operator's parameters. The supported operators are Keep, Mask, Replace, Redact, and Hash. For example:
entity_operator_map = { "PERSON": ("keep", {}), "EMAIL": ("mask", {"masking_char": "#", "chars_to_mask": 5, "from_end": False}), "PHONE": ("hash", {}), "LOCATION": ("redact", {}), "ORGANIZATION": ("replace", {"new_value": "Company XYZ"}) }
In this example:
“PERSON” entities are kept as they are using the “keep” operator.
“EMAIL” entities have their first five characters masked with the “#” character using the “mask” operator.
“PHONE” entities are replaced with their hashed value using the “hash” operator.
“LOCATION” entities are completely removed using the “redact” operator.
“ORGANIZATION” entities are replaced with the string “Company XYZ” using the “replace” operator.
model:
str
One of “whole”, “spacy”, “pattern”, or “flair”. The default is “whole”.
Each model can detect a specific set of entities. The “whole” model combines all three models and can detect all of the entities listed below.
“spacy”: [“LOCATION”, “PERSON”, “NRP”, “ORGANIZATION”, “DATE_TIME”]
“pattern”: [“CREDIT_CARD”, “SSN”, “PHONE”, “EMAIL”]
“flair”: [ “LOCATION”, “PERSON”, “NRP”, “GPE”, “ORGANIZATION”, “MAC_ADDRESS”, “US_BANK_NUMBER”, “IMEI”, “TITLE”, “LICENSE_PLATE”, “US_PASSPORT”, “CURRENCY”, “ROUTING_NUMBER”, “US_ITIN”, “US_BANK_NUMBER”, “US_DRIVER_LICENSE”, “AGE”, “PASSWORD”, “SWIFT_CODE” ]
score_threshold:
The minimum confidence score for a detection to be kept. The default is 0, to align with presidio.AnalyzerEngine.
generate_json_rpt:
Whether to generate the JSON report that explains the detections
generate_html_rpt:
Whether to generate the html with highlighted pii entities or not
is_full_text:
Whether to return the full text or just the sentences with pii entities.
is_full_html:
bool
Whether to return the full html or just the annotated html
is_full_report:
bool
Whether to return the full json report or just the score and start, end index
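To make the five operators in `entity_operator_map` concrete, here is a plain-Python sketch of what each one does to a single detected value. `apply_operator` is a hypothetical helper written for illustration, SHA-256 is an assumed hash algorithm, and the real function delegates this work to its anonymization engine rather than to code like this.

```python
import hashlib


def apply_operator(value: str, operator: str, params: dict) -> str:
    """Return the anonymized form of value for one operator (illustrative semantics)."""
    if operator == "keep":
        return value
    if operator == "redact":
        return ""
    if operator == "replace":
        return params["new_value"]
    if operator == "hash":
        # Assumption: SHA-256, rendered as a hexadecimal string.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if operator == "mask":
        # Never mask more characters than the value contains.
        n = min(params["chars_to_mask"], len(value))
        mask = params["masking_char"] * n
        if params.get("from_end", False):
            return value[: len(value) - n] + mask
        return mask + value[n:]
    raise ValueError(f"Unsupported operator: {operator}")


entity_operator_map = {
    "PERSON": ("keep", {}),
    "EMAIL": ("mask", {"masking_char": "#", "chars_to_mask": 5, "from_end": False}),
    "PHONE": ("hash", {}),
    "LOCATION": ("redact", {}),
    "ORGANIZATION": ("replace", {"new_value": "Company XYZ"}),
}
samples = {
    "PERSON": "John Doe",
    "EMAIL": "john_doe@gmail.com",
    "PHONE": "6288389029",
    "LOCATION": "Paris",
    "ORGANIZATION": "Acme",
}
for entity, value in samples.items():
    operator, params = entity_operator_map[entity]
    print(entity, "->", repr(apply_operator(value, operator, params)))
```

Setting `chars_to_mask` higher than the length of the value masks the whole value, which is a convenient way to hide an entity completely while preserving its length.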
1.2. Outputs:#
This function has three outputs.
output_path:
str
The directory in which all the anonymized text files are stored
rpt_json:
dict
A report that explains how the model detected each PII entity
errors :
dict
The errors, if any, that occurred while processing the text files
2. Results#
The result of the function looks like the following:
For example if the input string is
John Doe 's ssn is 182838483, connect john doe with john_doe@gmail.com or 6288389029, he can pay you with 41482929939393
The anonymized_text is
<PERSON>'s <ORGANIZATION> is <SSN>, connect <PERSON> with <PERSON> <EMAIL> or <PHONE>, he can pay you with <CREDIT_CARD>
The html_str is
is , connect me with or , he can pay you with
The JSON report that explains the output is
[
{
"entity_type": "PERSON", # result of the labeling
"start": 0, # start position of the entity
"end": 9, # end position of the entity
"score": 0.99, # the model's confidence score plus the context improvement
"analysis_explanation": {
"recognizer": "FlairRecognizer", # which recognizer is used to recognize this entity
"pattern_name": null,
"pattern": null,
"original_score": 0.99, # the original confidence score from the pre-trained model
"score": 0.99, # the final score = original_score + score_context_improvement
"textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
"score_context_improvement": 0, # the improvement from the context
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_identifier": "Flair Analytics_5577088640",
"recognizer_name": "Flair Analytics"
}
},
....
]
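Since each report entry carries `start` and `end` offsets into the original text, the entries can be mapped back to the matched substrings or used to rebuild a placeholder version of the text. The sketch below assumes non-overlapping spans and uses a made-up sentence and report; `replace_with_placeholders` is a hypothetical helper, not part of the function.

```python
import json

text = "John Doe can be reached at john_doe@gmail.com"
report_json = """
[
  {"entity_type": "PERSON", "start": 0, "end": 8, "score": 0.99},
  {"entity_type": "EMAIL", "start": 27, "end": 45, "score": 0.5}
]
"""
findings = json.loads(report_json)

# Slice the original text with the reported offsets to see what was matched.
matched = [(f["entity_type"], text[f["start"]:f["end"]]) for f in findings]
print(matched)


def replace_with_placeholders(text: str, findings: list) -> str:
    """Replace each span with <ENTITY_TYPE>, assuming the spans do not overlap."""
    # Work from the end of the text so that earlier offsets stay valid.
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        text = text[: f["start"]] + f"<{f['entity_type']}>" + text[f["end"]:]
    return text


placeholder_text = replace_with_placeholders(text, findings)
print(placeholder_text)
```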
3. End-to-end Demo#
3.1. Recognition configurations#
model: which model to use.
entities: which entities to recognize.
score_threshold: the minimum score at which a recognition is treated as trusted.
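Before launching a run, it can help to check that the chosen model is able to recognize the requested entities, and to see how a score threshold filters detections. The sketch below is an illustration under assumptions: `SUPPORTED_ENTITIES` copies only the “spacy” and “pattern” lists from the documentation above, and `unsupported_entities` and `filter_by_score` are hypothetical helpers.

```python
# Entity lists copied from the documentation of the "model" parameter.
SUPPORTED_ENTITIES = {
    "spacy": {"LOCATION", "PERSON", "NRP", "ORGANIZATION", "DATE_TIME"},
    "pattern": {"CREDIT_CARD", "SSN", "PHONE", "EMAIL"},
}


def unsupported_entities(model: str, entities: list) -> list:
    """Return the requested entities that the given model cannot recognize."""
    return [e for e in entities if e not in SUPPORTED_ENTITIES[model]]


def filter_by_score(findings: list, score_threshold: float) -> list:
    """Keep detections whose score is at least the threshold."""
    return [f for f in findings if f["score"] >= score_threshold]


requested = ["PERSON", "EMAIL", "PHONE", "LOCATION", "ORGANIZATION"]
print(unsupported_entities("spacy", requested))

findings = [
    {"entity_type": "PERSON", "score": 0.87},
    {"entity_type": "EMAIL", "score": 0.5},
    {"entity_type": "PHONE", "score": 0.3},
]
print(filter_by_score(findings, 0.5))
```

The comparison is inclusive, which matches the demo runs in this notebook: detections with a score of exactly 0.5 are still reported when `score_threshold` is 0.5.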
import mlrun
artifact_path = "./"
proj = mlrun.get_or_create_project("pii", "./")
fn = mlrun.code_to_function(
    project="pii",
    name="pii_recognizer",
    filename="pii_recognizer.py",
    kind="job",
    image="mlrun/mlrun",
    handler="recognize_pii",
    description="This function is used to recognize PII in a given text",
)
run_obj = fn.run(
    artifact_path=artifact_path,
    params={
        "model": "whole",
        "input_path": "./data/",
        "output_path": "./data/output1/",
        "entities": ["PERSON", "EMAIL", "PHONE", "LOCATION", "ORGANIZATION"],  # the entities to recognize
        "output_suffix": "output",
        "html_key": "highlighted",
        "score_threshold": 0.5,  # the score threshold to mark the recognition as trusted
    },
    returns=["output_path: path", "rpt_json: file", "errors: file"],
    local=True,
)
> 2023-07-31 02:17:04,305 [info] Project loaded successfully: {'project_name': 'pii'}
> 2023-07-31 02:17:04,312 [warning] Failed to add git metadata, ignore if path is not part of a git repo.: {'path': './', 'error': '/User/pii_recognizer'}
> 2023-07-31 02:17:04,408 [warning] artifact/output path is not defined or is local and relative, artifacts will not be visible in the UI: {'output_path': './'}
> 2023-07-31 02:17:04,409 [info] Storing function: {'name': 'pii-recognizer-recognize-pii', 'uid': '51b5ad8144004e52a1008c08850842c8', 'db': None}
2023-07-31 02:17:04,567 loading file /User/.flair/models/flair-pii-distilbert/models--beki--flair-pii-distilbert/snapshots/20fb59f1762edcf253bce67716a94a43cb075ae6/pytorch_model.bin
2023-07-31 02:17:07,730 SequenceTagger predicts: Dictionary with 21 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-DATE_TIME, B-DATE_TIME, E-DATE_TIME, I-DATE_TIME, S-ORG, B-ORG, E-ORG, I-ORG, S-NRP, B-NRP, E-NRP, I-NRP
Model loaded
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
pii | 0 | Jul 31 02:17:04 | completed | pii-recognizer-recognize-pii | v3io_user=pengw kind= owner=pengw host=jupyter-pengw-5f99fb678d-mnvxl |
model=whole input_path=./data/ output_path=./data/output1/ entities=['PERSON', 'EMAIL', 'PHONE', 'LOCATION', 'ORGANIZATION'] output_suffix=output html_key=highlighted score_threshold=0.5 |
highlighted output_path rpt_json errors |
> 2023-07-31 02:17:12,403 [info] Run execution finished: {'status': 'completed', 'name': 'pii-recognizer-recognize-pii'}
#get the mlrun context
context = mlrun.get_or_create_ctx('pii_ctx1')
import pathlib
from tqdm.auto import tqdm
for i, txt_file in enumerate(
    tqdm(
        list(pathlib.Path("./data/output1/").glob("*.txt")),
        desc="Processing files",
        unit="file",
    )
):
    # Load the str from the text file
    text = txt_file.read_text()
    print(text)
Dear Mr. <PERSON>,
We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway! Congratulations! You, along with your family, will enjoy a luxurious stay at our resort in the beautiful city of <ORGANIZATION>. Your flight tickets have been booked, and you will be departing on July 15th, 2023.
Please provide us with the necessary details to finalize your travel arrangements. We kindly request your full name, date of birth, passport number, and contact information. Rest assured that all provided information will be handled with utmost confidentiality and in compliance with data protection regulations.
We look forward to creating unforgettable memories for you and your loved ones during your stay with us. If you have any questions or require further assistance, please don't hesitate to contact our customer support team.
<PERSON> <ORGANIZATION> is 182838483, connect him with <EMAIL> or <PHONE>, he can pay you with <PHONE>9393
#check the highlighted html
html_output = context.get_cached_artifact("highlighted")
html_str = mlrun.get_dataitem(html_output.get_target_path()).get().decode("utf-8")
from IPython.display import display, HTML
display(HTML(html_str))
Highlighted Pii Entities
data/letter.txt
Dear Mr. , We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway! Congratulations! You, along with your family, will enjoy a luxurious stay at our resort in the beautiful city of . Your flight tickets have been booked, and you will be departing on July 15th, 2023. Please provide us with the necessary details to finalize your travel arrangements. We kindly request your full name, date of birth, passport number, and contact information. Rest assured that all provided information will be handled with utmost confidentiality and in compliance with data protection regulations. We look forward to creating unforgettable memories for you and your loved ones during your stay with us. If you have any questions or require further assistance, please don't hesitate to contact our customer support team.
data/pii_data.txt
is 182838483, connect him with @gmail.com or , he can pay you with 9393
#check the json report about the explanation.
rpt_output1 = context.get_cached_artifact("rpt_json")
rpt_str1 = mlrun.get_dataitem(rpt_output1.get_target_path()).get().decode("utf-8")
import json
obj = json.loads(rpt_str1)
# Pretty Print JSON
json_formatted_str1 = json.dumps(obj, indent=4)
print(json_formatted_str1)
{
"data/letter.txt": [
{
"entity_type": "PERSON",
"start": 9,
"end": 17,
"score": 1,
"analysis_explanation": {
"recognizer": "CustomSpacyRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1,
"score": 1,
"textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition (Privy-trained)",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "CustomSpacyRecognizer",
"recognizer_identifier": "CustomSpacyRecognizer_139944219101744"
}
},
{
"entity_type": "LOCATION",
"start": 248,
"end": 255,
"score": 1.0,
"analysis_explanation": {
"recognizer": "FlairRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1.0,
"score": 1.0,
"textual_explanation": "Identified as LOC by Flair's Named Entity Recognition",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_identifier": "Flair Analytics_139944219101936",
"recognizer_name": "Flair Analytics"
}
},
{
"entity_type": "ORGANIZATION",
"start": 248,
"end": 255,
"score": 1,
"analysis_explanation": {
"recognizer": "CustomSpacyRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1,
"score": 1,
"textual_explanation": "Identified as ORG by Spacy's Named Entity Recognition (Privy-trained)",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "CustomSpacyRecognizer",
"recognizer_identifier": "CustomSpacyRecognizer_139944219101744"
}
}
],
"data/pii_data.txt": [
{
"entity_type": "PERSON",
"start": 0,
"end": 12,
"score": 1,
"analysis_explanation": {
"recognizer": "CustomSpacyRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1,
"score": 1,
"textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition (Privy-trained)",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "CustomSpacyRecognizer",
"recognizer_identifier": "CustomSpacyRecognizer_139944219101744"
}
},
{
"entity_type": "ORGANIZATION",
"start": 13,
"end": 16,
"score": 1,
"analysis_explanation": {
"recognizer": "CustomSpacyRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1,
"score": 1,
"textual_explanation": "Identified as ORG by Spacy's Named Entity Recognition (Privy-trained)",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "CustomSpacyRecognizer",
"recognizer_identifier": "CustomSpacyRecognizer_139944219101744"
}
},
{
"entity_type": "PERSON",
"start": 53,
"end": 58,
"score": 1.0,
"analysis_explanation": {
"recognizer": "FlairRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1.0,
"score": 1.0,
"textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_identifier": "Flair Analytics_139944219101936",
"recognizer_name": "Flair Analytics"
}
},
{
"entity_type": "PERSON",
"start": 48,
"end": 52,
"score": 0.87,
"analysis_explanation": {
"recognizer": "FlairRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 0.87,
"score": 0.87,
"textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_identifier": "Flair Analytics_139944219101936",
"recognizer_name": "Flair Analytics"
}
},
{
"entity_type": "EMAIL",
"start": 48,
"end": 68,
"score": 0.5,
"analysis_explanation": {
"recognizer": "PatternRecognizer",
"pattern_name": "EMAIL",
"pattern": "\\S+@\\S+",
"original_score": 0.5,
"score": 0.5,
"textual_explanation": null,
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "PatternRecognizer",
"recognizer_identifier": "PatternRecognizer_139944352474640"
}
},
{
"entity_type": "PHONE",
"start": 72,
"end": 82,
"score": 0.5,
"analysis_explanation": {
"recognizer": "PatternRecognizer",
"pattern_name": "PHONE",
"pattern": "\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}",
"original_score": 0.5,
"score": 0.5,
"textual_explanation": null,
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "PatternRecognizer",
"recognizer_identifier": "PatternRecognizer_139944352476560"
}
},
{
"entity_type": "PHONE",
"start": 104,
"end": 114,
"score": 0.5,
"analysis_explanation": {
"recognizer": "PatternRecognizer",
"pattern_name": "PHONE",
"pattern": "\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}",
"original_score": 0.5,
"score": 0.5,
"textual_explanation": null,
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "PatternRecognizer",
"recognizer_identifier": "PatternRecognizer_139944352476560"
}
}
]
}
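The second PHONE entry in the report (offsets 104 to 114) explains the `<PHONE>9393` fragment in the anonymized text: the PHONE pattern shown in the report matches the first ten digits of a longer number. The sketch below reproduces that behavior on a made-up sample string. The stricter pattern with digit lookarounds is only an illustration of one possible mitigation and is not something the function is known to do.

```python
import re

# Pattern copied from the "pattern" field of the PHONE entries in the report.
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")
# Illustrative variant (assumption): refuse matches that touch other digits.
STRICT_PHONE_PATTERN = re.compile(r"(?<!\d)\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?!\d)")

sample = "call 6288389029 or pay with 41482929939393"

loose = [(m.start(), m.end(), m.group()) for m in PHONE_PATTERN.finditer(sample)]
strict = [(m.start(), m.end(), m.group()) for m in STRICT_PHONE_PATTERN.finditer(sample)]

print(loose)   # includes a match inside the 14-digit number
print(strict)  # only the standalone 10-digit number
```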
3.2. Masking configurations#
entity_operator_map: defines what to do with each recognized entity: keep it, mask it (and with which character), hash it, redact it, or replace it with a new value.
entity_operator_map = { "PERSON": ("keep", {}), "EMAIL": ("mask", {"masking_char": "😀", "chars_to_mask": 5, "from_end": False}), "PHONE": ("hash", {}), "LOCATION": ("redact", {}), "ORGANIZATION": ("replace", {"new_value": "Company XYZ"}) }
import mlrun
artifact_path = "./"
proj = mlrun.get_or_create_project("pii", "./")
fn = mlrun.code_to_function(
    project="pii",
    name="pii_recognizer",
    filename="pii_recognizer.py",
    kind="job",
    image="mlrun/mlrun",
    handler="recognize_pii",
    description="This function is used to recognize PII in a given text",
)
entity_operator_map = {
    "PERSON": ("keep", {}),
    "EMAIL": ("mask", {"masking_char": "😀", "chars_to_mask": 100, "from_end": False}),
    "PHONE": ("hash", {}),
    "LOCATION": ("redact", {}),
    "ORGANIZATION": ("replace", {"new_value": "Company XYZ"}),
}
run_obj = fn.run(
    artifact_path=artifact_path,
    params={
        "model": "whole",
        "input_path": "./data/",
        "output_path": "./data/output2/",
        "entities": ["PERSON", "EMAIL", "PHONE", "LOCATION", "ORGANIZATION"],
        "output_suffix": "output",
        "html_key": "highlighted",
        "score_threshold": 0.5,
        "entity_operator_map": entity_operator_map,
    },
    returns=["output_path: path", "rpt_json: file", "errors: file"],
    local=True,
)
> 2023-07-31 02:20:40,550 [info] Project loaded successfully: {'project_name': 'pii'}
> 2023-07-31 02:20:40,556 [warning] Failed to add git metadata, ignore if path is not part of a git repo.: {'path': './', 'error': '/User/pii_recognizer'}
> 2023-07-31 02:20:40,649 [warning] artifact/output path is not defined or is local and relative, artifacts will not be visible in the UI: {'output_path': './'}
> 2023-07-31 02:20:40,649 [info] Storing function: {'name': 'pii-recognizer-recognize-pii', 'uid': '2b43f80c7ca44b43b229760bb55f814d', 'db': None}
2023-07-31 02:20:40,812 loading file /User/.flair/models/flair-pii-distilbert/models--beki--flair-pii-distilbert/snapshots/20fb59f1762edcf253bce67716a94a43cb075ae6/pytorch_model.bin
2023-07-31 02:20:44,130 SequenceTagger predicts: Dictionary with 21 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-DATE_TIME, B-DATE_TIME, E-DATE_TIME, I-DATE_TIME, S-ORG, B-ORG, E-ORG, I-ORG, S-NRP, B-NRP, E-NRP, I-NRP
Model loaded
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
pii | 0 | Jul 31 02:20:40 | completed | pii-recognizer-recognize-pii | v3io_user=pengw kind= owner=pengw host=jupyter-pengw-5f99fb678d-mnvxl |
model=whole input_path=./data/ output_path=./data/output2/ entities=['PERSON', 'EMAIL', 'PHONE', 'LOCATION', 'ORGANIZATION'] output_suffix=output html_key=highlighted score_threshold=0.5 entity_operator_map={'PERSON': ('keep', {}), 'EMAIL': ('mask', {'masking_char': '😀', 'chars_to_mask': 100, 'from_end': False, 'entity_type': 'EMAIL'}), 'PHONE': ('hash', {}), 'LOCATION': ('redact', {}), 'ORGANIZATION': ('replace', {'new_value': 'Company XYZ', 'entity_type': 'ORGANIZATION'})} |
highlighted output_path rpt_json errors |
> 2023-07-31 02:20:48,903 [info] Run execution finished: {'status': 'completed', 'name': 'pii-recognizer-recognize-pii'}
#get the mlrun context
context = mlrun.get_or_create_ctx('pii_ctx1')
import pathlib
from tqdm.auto import tqdm
for i, txt_file in enumerate(
    tqdm(
        list(pathlib.Path("./data/output2/").glob("*.txt")),
        desc="Processing files",
        unit="file",
    )
):
    # Load the str from the text file
    text = txt_file.read_text()
    print(text)
Dear Mr. John Doe,
We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway! Congratulations! You, along with your family, will enjoy a luxurious stay at our resort in the beautiful city of Company XYZ. Your flight tickets have been booked, and you will be departing on July 15th, 2023.
Please provide us with the necessary details to finalize your travel arrangements. We kindly request your full name, date of birth, passport number, and contact information. Rest assured that all provided information will be handled with utmost confidentiality and in compliance with data protection regulations.
We look forward to creating unforgettable memories for you and your loved ones during your stay with us. If you have any questions or require further assistance, please don't hesitate to contact our customer support team.
John smith's Company XYZ is 182838483, connect him with 😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀 or 3990096a212e92850c3b3c8e57ab398252d482444a32def6b030cbac2d51efa3, he can pay you with a6983d9477e93eab115305afd124bd096699e6cb7d2ce72ec6e29a6378a4e8059393
#check the highlighted html
html_output = context.get_cached_artifact("highlighted")
html_str = mlrun.get_dataitem(html_output.get_target_path()).get().decode("utf-8")
from IPython.display import display, HTML
display(HTML(html_str))
Highlighted Pii Entities
data/letter.txt
Dear Mr. , We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway! Congratulations! You, along with your family, will enjoy a luxurious stay at our resort in the beautiful city of . Your flight tickets have been booked, and you will be departing on July 15th, 2023. Please provide us with the necessary details to finalize your travel arrangements. We kindly request your full name, date of birth, passport number, and contact information. Rest assured that all provided information will be handled with utmost confidentiality and in compliance with data protection regulations. We look forward to creating unforgettable memories for you and your loved ones during your stay with us. If you have any questions or require further assistance, please don't hesitate to contact our customer support team.
data/pii_data.txt
is 182838483, connect him with @gmail.com or , he can pay you with 9393
#check the json report about the explanation.
rpt_output1 = context.get_cached_artifact("rpt_json")
rpt_str1 = mlrun.get_dataitem(rpt_output1.get_target_path()).get().decode("utf-8")
import json
obj = json.loads(rpt_str1)
# Pretty Print JSON
json_formatted_str1 = json.dumps(obj, indent=4)
print(json_formatted_str1)
{
"data/letter.txt": [
{
"entity_type": "PERSON",
"start": 9,
"end": 17,
"score": 1.0,
"analysis_explanation": {
"recognizer": "FlairRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1.0,
"score": 1.0,
"textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_identifier": "Flair Analytics_139944345555488",
"recognizer_name": "Flair Analytics"
}
},
{
"entity_type": "LOCATION",
"start": 248,
"end": 255,
"score": 1.0,
"analysis_explanation": {
"recognizer": "FlairRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1.0,
"score": 1.0,
"textual_explanation": "Identified as LOC by Flair's Named Entity Recognition",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_identifier": "Flair Analytics_139944345555488",
"recognizer_name": "Flair Analytics"
}
},
{
"entity_type": "ORGANIZATION",
"start": 248,
"end": 255,
"score": 1,
"analysis_explanation": {
"recognizer": "CustomSpacyRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1,
"score": 1,
"textual_explanation": "Identified as ORG by Spacy's Named Entity Recognition (Privy-trained)",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "CustomSpacyRecognizer",
"recognizer_identifier": "CustomSpacyRecognizer_139943499301312"
}
}
],
"data/pii_data.txt": [
{
"entity_type": "PERSON",
"start": 0,
"end": 12,
"score": 1,
"analysis_explanation": {
"recognizer": "CustomSpacyRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1,
"score": 1,
"textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition (Privy-trained)",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "CustomSpacyRecognizer",
"recognizer_identifier": "CustomSpacyRecognizer_139943499301312"
}
},
{
"entity_type": "ORGANIZATION",
"start": 13,
"end": 16,
"score": 1,
"analysis_explanation": {
"recognizer": "CustomSpacyRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1,
"score": 1,
"textual_explanation": "Identified as ORG by Spacy's Named Entity Recognition (Privy-trained)",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "CustomSpacyRecognizer",
"recognizer_identifier": "CustomSpacyRecognizer_139943499301312"
}
},
{
"entity_type": "PERSON",
"start": 53,
"end": 58,
"score": 1.0,
"analysis_explanation": {
"recognizer": "FlairRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 1.0,
"score": 1.0,
"textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_identifier": "Flair Analytics_139944345555488",
"recognizer_name": "Flair Analytics"
}
},
{
"entity_type": "PERSON",
"start": 48,
"end": 52,
"score": 0.87,
"analysis_explanation": {
"recognizer": "FlairRecognizer",
"pattern_name": null,
"pattern": null,
"original_score": 0.87,
"score": 0.87,
"textual_explanation": "Identified as PER by Flair's Named Entity Recognition",
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_identifier": "Flair Analytics_139944345555488",
"recognizer_name": "Flair Analytics"
}
},
{
"entity_type": "EMAIL",
"start": 48,
"end": 68,
"score": 0.5,
"analysis_explanation": {
"recognizer": "PatternRecognizer",
"pattern_name": "EMAIL",
"pattern": "\\S+@\\S+",
"original_score": 0.5,
"score": 0.5,
"textual_explanation": null,
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "PatternRecognizer",
"recognizer_identifier": "PatternRecognizer_139943864893792"
}
},
{
"entity_type": "PHONE",
"start": 72,
"end": 82,
"score": 0.5,
"analysis_explanation": {
"recognizer": "PatternRecognizer",
"pattern_name": "PHONE",
"pattern": "\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}",
"original_score": 0.5,
"score": 0.5,
"textual_explanation": null,
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "PatternRecognizer",
"recognizer_identifier": "PatternRecognizer_139943864894128"
}
},
{
"entity_type": "PHONE",
"start": 104,
"end": 114,
"score": 0.5,
"analysis_explanation": {
"recognizer": "PatternRecognizer",
"pattern_name": "PHONE",
"pattern": "\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}",
"original_score": 0.5,
"score": 0.5,
"textual_explanation": null,
"score_context_improvement": 0,
"supportive_context_word": "",
"validation_result": null
},
"recognition_metadata": {
"recognizer_name": "PatternRecognizer",
"recognizer_identifier": "PatternRecognizer_139943864894128"
}
}
]
}
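The report above contains detections whose spans overlap, for example the two PERSON spans that fall inside the EMAIL span, or LOCATION and ORGANIZATION sharing offsets 248 to 255. A small helper can surface such cases when reviewing a report. `find_overlaps` is a hypothetical name, and the sample entries copy only the offsets and scores from the report for data/pii_data.txt.

```python
from itertools import combinations


def find_overlaps(findings: list) -> list:
    """Return (type_a, type_b, start, end) for every pair of intersecting spans."""
    overlaps = []
    for a, b in combinations(findings, 2):
        # Half-open intervals [start, end) intersect when each starts before the other ends.
        if a["start"] < b["end"] and b["start"] < a["end"]:
            overlaps.append(
                (
                    a["entity_type"],
                    b["entity_type"],
                    max(a["start"], b["start"]),
                    min(a["end"], b["end"]),
                )
            )
    return overlaps


findings = [
    {"entity_type": "PERSON", "start": 53, "end": 58, "score": 1.0},
    {"entity_type": "PERSON", "start": 48, "end": 52, "score": 0.87},
    {"entity_type": "EMAIL", "start": 48, "end": 68, "score": 0.5},
    {"entity_type": "PHONE", "start": 72, "end": 82, "score": 0.5},
]
overlaps = find_overlaps(findings)
print(overlaps)
```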
3.3. Output configurations#
is_full_text: whether to produce the full text or only the sentences that contain PII entities.
generate_html: whether to produce the HTML with highlighted PII entities.
generate_json: whether to produce the JSON report that explains the process.
is_full_html: whether to produce the full text with the PII entities highlighted or only the sentences that contain PII entities.
is_full_report: whether to produce the JSON report with detailed information or only the start and end indices and the scores.
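The effect of `is_full_text=False` can be pictured as keeping only the sentences that intersect at least one detected span. The sketch below is an assumption about the idea, not the function's code: `sentences_with_pii` is a hypothetical helper, and its sentence splitter is deliberately naive (a sentence ends at `.`, `!`, or `?` followed by whitespace or the end of the text).

```python
import re

# Naive sentence matcher: starts at a non-space character and ends at
# terminal punctuation followed by whitespace, or at the end of the text.
SENTENCE_PATTERN = re.compile(r"\S.*?(?:[.!?](?=\s|$)|$)", re.DOTALL)


def sentences_with_pii(text: str, findings: list) -> list:
    """Return the sentences that overlap at least one detected span."""
    kept = []
    for m in SENTENCE_PATTERN.finditer(text):
        if any(f["start"] < m.end() and f["end"] > m.start() for f in findings):
            kept.append(m.group())
    return kept


text = "Hello there. Contact John Doe today! Thanks."
findings = [{"entity_type": "PERSON", "start": 21, "end": 29, "score": 1.0}]
kept_sentences = sentences_with_pii(text, findings)
print(kept_sentences)
```

Because the splitter requires whitespace after the punctuation mark, a dot inside an e-mail address does not end a sentence.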
import mlrun
artifact_path = "./"
proj = mlrun.get_or_create_project("pii", "./")
fn = mlrun.code_to_function(
    project="pii",
    name="pii_recognizer",
    filename="pii_recognizer.py",
    kind="job",
    image="mlrun/mlrun",
    handler="recognize_pii",
    description="This function is used to recognize PII in a given text",
)
entity_operator_map = {
    "PERSON": ("keep", {}),
    "EMAIL": ("mask", {"masking_char": "😀", "chars_to_mask": 100, "from_end": False}),
    "PHONE": ("hash", {}),
    "LOCATION": ("redact", {}),
    "ORGANIZATION": ("replace", {"new_value": "Company XYZ"}),
}
run_obj = fn.run(
    artifact_path=artifact_path,
    params={
        "model": "whole",
        "input_path": "./data/",
        "output_path": "./data/output3/",
        "entities": ["PERSON", "EMAIL", "PHONE", "LOCATION", "ORGANIZATION"],
        "output_suffix": "output",
        "html_key": "highlighted",
        "score_threshold": 0.5,
        "entity_operator_map": entity_operator_map,
        "is_full_text": False,
        "is_full_html": False,
        "is_full_report": False,
    },
    returns=["output_path: path", "rpt_json: file", "errors: file"],
    local=True,
)
> 2023-07-31 02:22:57,789 [info] Project loaded successfully: {'project_name': 'pii'}
> 2023-07-31 02:22:57,799 [warning] Failed to add git metadata, ignore if path is not part of a git repo.: {'path': './', 'error': '/User/pii_recognizer'}
> 2023-07-31 02:22:57,891 [warning] artifact/output path is not defined or is local and relative, artifacts will not be visible in the UI: {'output_path': './'}
> 2023-07-31 02:22:57,892 [info] Storing function: {'name': 'pii-recognizer-recognize-pii', 'uid': '3f6d701e423346b39026dc365698c15c', 'db': None}
2023-07-31 02:22:58,079 loading file /User/.flair/models/flair-pii-distilbert/models--beki--flair-pii-distilbert/snapshots/20fb59f1762edcf253bce67716a94a43cb075ae6/pytorch_model.bin
2023-07-31 02:23:01,565 SequenceTagger predicts: Dictionary with 21 tags: O, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-DATE_TIME, B-DATE_TIME, E-DATE_TIME, I-DATE_TIME, S-ORG, B-ORG, E-ORG, I-ORG, S-NRP, B-NRP, E-NRP, I-NRP
Model loaded
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
pii | 0 | Jul 31 02:22:57 | completed | pii-recognizer-recognize-pii | v3io_user=pengw kind= owner=pengw host=jupyter-pengw-5f99fb678d-mnvxl |
model=whole input_path=./data/ output_path=./data/output3/ entities=['PERSON', 'EMAIL', 'PHONE', 'LOCATION', 'ORGANIZATION'] output_suffix=output html_key=highlighted score_threshold=0.5 entity_operator_map={'PERSON': ('keep', {}), 'EMAIL': ('mask', {'masking_char': '😀', 'chars_to_mask': 100, 'from_end': False, 'entity_type': 'EMAIL'}), 'PHONE': ('hash', {}), 'LOCATION': ('redact', {}), 'ORGANIZATION': ('replace', {'new_value': 'Company XYZ', 'entity_type': 'ORGANIZATION'})} is_full_text=False is_full_html=False is_full_report=False |
highlighted output_path rpt_json errors |
> 2023-07-31 02:23:06,096 [info] Run execution finished: {'status': 'completed', 'name': 'pii-recognizer-recognize-pii'}
#get the mlrun context
context = mlrun.get_or_create_ctx('pii_ctx')
import pathlib
from tqdm.auto import tqdm
for i, txt_file in enumerate(
    tqdm(
        list(pathlib.Path("./data/output3/").glob("*.txt")),
        desc="Processing files",
        unit="file",
    )
):
    # Load the str from the text file
    text = txt_file.read_text()
    print(text)
Dear Mr. John Doe,
We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway!
John smith's Company XYZ is 182838483, connect him with 😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀 or 3990096a212e92850c3b3c8e57ab398252d482444a32def6b030cbac2d51efa3, he can pay you with a6983d9477e93eab115305afd124bd096699e6cb7d2ce72ec6e29a6378a4e8059393
#check the highlighted html
html_output = context.get_cached_artifact("highlighted")
html_str = mlrun.get_dataitem(html_output.get_target_path()).get().decode("utf-8")
from IPython.display import display, HTML
display(HTML(html_str))
Highlighted Pii Entities
data/letter.txt
Dear Mr. , We are pleased to inform you that you have been selected as the winner of our exclusive vacation package giveaway! Congratulations! You, along with your family, will enjoy a luxurious stay at our resort in the beautiful city of
data/pii_data.txt
is 182838483, connect him with
#check the json report about the explanation.
rpt_output = context.get_cached_artifact("rpt_json")
rpt_str = mlrun.get_dataitem(rpt_output.get_target_path()).get().decode("utf-8")
import json
obj = json.loads(rpt_str)
# Pretty Print JSON
json_formatted_str = json.dumps(obj, indent=4)
print(json_formatted_str)
{
"data/letter.txt": [
{
"entity_type": "PERSON",
"start": 9,
"end": 17,
"score": 1
},
{
"entity_type": "LOCATION",
"start": 248,
"end": 255,
"score": 1.0
},
{
"entity_type": "ORGANIZATION",
"start": 248,
"end": 255,
"score": 1
}
],
"data/pii_data.txt": [
{
"entity_type": "PERSON",
"start": 0,
"end": 12,
"score": 1
},
{
"entity_type": "ORGANIZATION",
"start": 13,
"end": 16,
"score": 1
},
{
"entity_type": "PERSON",
"start": 53,
"end": 58,
"score": 1.0
},
{
"entity_type": "PERSON",
"start": 48,
"end": 52,
"score": 0.87
},
{
"entity_type": "EMAIL",
"start": 48,
"end": 68,
"score": 0.5
},
{
"entity_type": "PHONE",
"start": 72,
"end": 82,
"score": 0.5
},
{
"entity_type": "PHONE",
"start": 104,
"end": 114,
"score": 0.5
}
]
}
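The compact report above keeps only four fields per detection. If a full report is already available, the same shape can be derived from it after the fact, as in this sketch. `compact_report` is a hypothetical helper, and the sample entry is abbreviated from the full report shown earlier.

```python
COMPACT_KEYS = ("entity_type", "start", "end", "score")


def compact_report(full_report: dict) -> dict:
    """Reduce every detection to its entity type, offsets, and score."""
    return {
        file_name: [{key: finding[key] for key in COMPACT_KEYS} for finding in findings]
        for file_name, findings in full_report.items()
    }


full_report = {
    "data/letter.txt": [
        {
            "entity_type": "PERSON",
            "start": 9,
            "end": 17,
            "score": 1,
            "analysis_explanation": {"recognizer": "CustomSpacyRecognizer"},
            "recognition_metadata": {"recognizer_name": "CustomSpacyRecognizer"},
        }
    ]
}
compact = compact_report(full_report)
print(compact)
```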