pii_recognizer package

Submodules

pii_recognizer.pii_recognizer module

class pii_recognizer.pii_recognizer.CustomSpacyRecognizer(*args: Any, **kwargs: Any)

Bases: presidio_analyzer.

Custom spaCy recognizer for Presidio Analyzer, trained on Privy data. The Privy data is generated with https://github.com/pixie-io/pixie/tree/main/src/datagen/pii/privy. It can be used to recognize custom entities. Because Presidio's registries are used to build the AnalyzerEngine, it inherits from Presidio Analyzer's LocalRecognizer class.

RECOGNIZABLE_ENTITIES = {'DATE_TIME', 'LOCATION', 'NRP', 'ORGANIZATION', 'PERSON'}

analyze(text: str, entities: List[str], nlp_artifacts=None)

Analyze text using spaCy.

Parameters
  • text – Text to analyze

  • entities – Entities to analyze

  • nlp_artifacts – NLP artifacts to use

Returns

List of Presidio RecognizerResult objects

class pii_recognizer.pii_recognizer.Entities

Bases: object

AGE = 'AGE'
CREDIT_CARD = 'CREDIT_CARD'
CURRENCY = 'CURRENCY'
DATE_TIME = 'DATE_TIME'
EMAIL = 'EMAIL'
GPE = ('GPE',)
IMEI = 'IMEI'
LICENSE_PLATE = 'LICENSE_PLATE'
LOCATION = 'LOCATION'
MAC_ADDRESS = 'MAC_ADDRESS'
NRP = 'NRP'
ORGANIZATION = 'ORGANIZATION'
PASSWORD = 'PASSWORD'
PERSON = 'PERSON'
PHONE = 'PHONE'
ROUTING_NUMBER = 'ROUTING_NUMBER'
SSN = 'SSN'
SWIFT_CODE = 'SWIFT_CODE'
TITLE = 'TITLE'
US_BANK_NUMBER = 'US_BANK_NUMBER'
US_DRIVER_LICENSE = 'US_DRIVER_LICENSE'
US_ITIN = 'US_ITIN'
US_PASSPORT = 'US_PASSPORT'

class pii_recognizer.pii_recognizer.FlairRecognizer(*args: Any, **kwargs: Any)

Bases: presidio_analyzer.

Wrapper for a Flair model so that it can be used within Presidio Analyzer. The wrapper ensures that the recognizer can be registered with the Presidio registry.

RECOGNIZABLE_ENTITIES = {'AGE', 'CURRENCY', 'GPE', 'IMEI', 'LICENSE_PLATE', 'LOCATION', 'MAC_ADDRESS', 'NRP', 'ORGANIZATION', 'PASSWORD', 'PERSON', 'ROUTING_NUMBER', 'SWIFT_CODE', 'TITLE', 'US_BANK_NUMBER', 'US_DRIVER_LICENSE', 'US_ITIN', 'US_PASSPORT'}

analyze(text: str, entities: List[str], nlp_artifacts: Optional[presidio_analyzer.nlp_engine.NlpArtifacts] = None) → List[presidio_analyzer.RecognizerResult]

Analyze text and return the results.

Parameters
  • text – The text for analysis.

  • entities – The list of entities to recognize.

  • nlp_artifacts – Not used by this recognizer but needed for the interface.

Returns

The list of Presidio RecognizerResult constructed from the recognized Flair detections.
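
The conversion from raw model detections to result objects, restricted to the requested entities, might look roughly like the sketch below. It uses only the standard library: the `Result` dataclass is a stand-in for presidio_analyzer.RecognizerResult, and the `to_results` helper and the `(label, start, end, score)` tuple layout are hypothetical, not the recognizer's actual code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Result:
    """Stand-in for presidio_analyzer.RecognizerResult (illustrative only)."""
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical raw detection layout: (label, start offset, end offset, score).
Detection = Tuple[str, int, int, float]


def to_results(detections: List[Detection], entities: List[str]) -> List[Result]:
    """Keep only detections whose label was requested and wrap them as results."""
    requested = set(entities)
    return [
        Result(label, start, end, score)
        for label, start, end, score in detections
        if label in requested
    ]
```

Detections whose label is not in `entities` are simply dropped, which mirrors the role of the `entities` parameter described above.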

class pii_recognizer.pii_recognizer.Models

Bases: object

FLAIR = 'flair'
PATTERN = 'pattern'
SPACY = 'spacy'
WHOLE = 'whole'
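
A minimal sketch of how these values could select which recognizers to register. The `recognizers_for` helper is hypothetical, and the reading of 'whole' as "use every recognizer" is an assumption rather than something this page states.

```python
from typing import List

# String values of the Models attributes listed above.
FLAIR, PATTERN, SPACY, WHOLE = "flair", "pattern", "spacy", "whole"


def recognizers_for(model: str) -> List[str]:
    """Return the recognizer names to register for a model choice (illustrative)."""
    if model == WHOLE:
        # Assumption: "whole" combines all available recognizers.
        return [SPACY, FLAIR, PATTERN]
    if model in (SPACY, FLAIR, PATTERN):
        return [model]
    raise ValueError(f"Unknown model: {model!r}")
```

Rejecting unknown names early keeps a typo in the model argument from silently producing an empty analyzer.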

class pii_recognizer.pii_recognizer.PatternRecognizerFactory

Bases: object

Factory for creating pattern recognizers. It can be extended in the future with more regex patterns for additional entities. For a pattern recognizer to work, a list of regex patterns must be constructed for each entity.

RECOGNIZABLE_ENTITIES = {'CREDIT_CARD': [presidio_analyzer.Pattern], 'EMAIL': [presidio_analyzer.Pattern], 'PHONE': [presidio_analyzer.Pattern], 'SSN': [presidio_analyzer.Pattern]}
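
The idea of keeping a list of regex patterns per entity can be sketched with the standard library alone. The expressions below are simplified, hypothetical examples, not the patterns shipped with the package, and `find_entities` is an illustrative helper rather than part of the factory.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical, simplified patterns: one list of compiled regexes per entity.
PATTERNS: Dict[str, List[re.Pattern]] = {
    "EMAIL": [re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")],
    "SSN": [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")],
}


def find_entities(text: str, entities: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity, start, end) for each match of the requested entities."""
    found = []
    for entity in entities:
        for pattern in PATTERNS.get(entity, []):
            for match in pattern.finditer(text):
                found.append((entity, match.start(), match.end()))
    # Sort by start offset so results follow the order of the text.
    return sorted(found, key=lambda item: item[1])
```

Supporting a new entity then amounts to adding another key with its own list of patterns.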

pii_recognizer.pii_recognizer.recognize_pii(context: mlrun.execution.MLClientCtx, input_path: Union[str, pathlib.Path], html_key: str, score_threshold: float, output_directory: Optional[str] = None, entities: Optional[List[str]] = None, entity_operator_map: Optional[dict] = None, model: Optional[str] = None, generate_json: bool = True, generate_html: bool = True, is_full_text: bool = True, is_full_html: bool = True, is_full_report: bool = True) → Union[Tuple[str, pandas.core.frame.DataFrame, dict, dict], Tuple[str, pandas.core.frame.DataFrame, dict]]

Walk through the input path, recognize PII in the text files, and store the anonymized text in the output directory. Optionally generate an HTML report that highlights each entity in a different color and a JSON report of the explanation.

Parameters
  • context – The MLRun context, needed to log the artifacts.

  • input_path – The input path of the text files to analyze.

  • html_key – The html key for the artifact.

  • score_threshold – The score threshold to mark the recognition as trusted.

  • output_directory – The output directory path to store the anonymized text.

  • entities – The list of entities to recognize.

  • entity_operator_map – A map from entity to operator (mask, redact, replace, keep, or hash) and its parameters.

  • model – The model to use. Can be “spacy”, “flair”, “pattern” or “whole”.

  • generate_json – Whether to generate the json report of the explanation.

  • generate_html – Whether to generate the html report of the explanation.

  • is_full_text – Whether to return the full text or only the masked text.

  • is_full_html – Whether to return the full html or just the annotated text.

  • is_full_report – Whether to return the full report or just the score and the start and end indices.

Returns

A tuple of:

  • Path to the output directory

  • A pandas DataFrame (the second element in the return annotation)

  • The json report of the explanation (if generate_json is True)

  • A dictionary of errors for files that were not processed
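
How score_threshold and entity_operator_map interact can be illustrated with a minimal, standard-library sketch. It is not the package's implementation, which relies on Presidio. The span tuples, the `anonymize` helper, the `>=` comparison, the `<ENTITY>` placeholder, and the simplified map from entity to operator name (without parameters, and without the hash operator) are all assumptions, and spans are assumed not to overlap.

```python
from typing import Dict, List, Tuple

# Hypothetical detection layout: (entity, start offset, end offset, score).
Span = Tuple[str, int, int, float]


def anonymize(
    text: str,
    spans: List[Span],
    score_threshold: float,
    entity_operator_map: Dict[str, str],
) -> str:
    """Apply a per-entity operator to every span that meets the threshold."""
    # Assumption: a detection is trusted when score >= score_threshold.
    trusted = [span for span in spans if span[3] >= score_threshold]
    # Work from right to left so earlier offsets stay valid after each edit.
    for entity, start, end, _ in sorted(trusted, key=lambda s: s[1], reverse=True):
        operator = entity_operator_map.get(entity, "replace")
        if operator == "keep":
            continue
        if operator == "redact":
            replacement = ""
        elif operator == "mask":
            replacement = "*" * (end - start)
        else:  # "replace": substitute a placeholder naming the entity
            replacement = f"<{entity}>"
        text = text[:start] + replacement + text[end:]
    return text
```

For example, with a threshold of 0.5, a PERSON span mapped to "replace" becomes `<PERSON>`, a PHONE span mapped to "mask" becomes asterisks of the same length, and any span scoring below 0.5 is left untouched.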

Module contents