pii_recognizer package
Submodules#
pii_recognizer.pii_recognizer module#
- class pii_recognizer.pii_recognizer.CustomSpacyRecognizer(*args: Any, **kwargs: Any)[source]#
Bases:
presidio_analyzer.LocalRecognizer
Custom spaCy recognizer for Presidio Analyzer, trained on Privy data. The Privy data is generated using https://github.com/pixie-io/pixie/tree/main/src/datagen/pii/privy and the recognizer can be used to recognize custom entities. Because Presidio’s registries are used to build the AnalyzerEngine, this class inherits from Presidio Analyzer’s LocalRecognizer class.
- RECOGNIZABLE_ENTITIES = {'DATE_TIME', 'LOCATION', 'NRP', 'ORGANIZATION', 'PERSON'}#
- class pii_recognizer.pii_recognizer.Entities[source]#
Bases:
object
- AGE = 'AGE'#
- CREDIT_CARD = 'CREDIT_CARD'#
- CURRENCY = 'CURRENCY'#
- DATE_TIME = 'DATE_TIME'#
- EMAIL = 'EMAIL'#
- GPE = ('GPE',)#
- IMEI = 'IMEI'#
- LICENSE_PLATE = 'LICENSE_PLATE'#
- LOCATION = 'LOCATION'#
- MAC_ADDRESS = 'MAC_ADDRESS'#
- NRP = 'NRP'#
- ORGANIZATION = 'ORGANIZATION'#
- PASSWORD = 'PASSWORD'#
- PERSON = 'PERSON'#
- PHONE = 'PHONE'#
- ROUTING_NUMBER = 'ROUTING_NUMBER'#
- SSN = 'SSN'#
- SWIFT_CODE = 'SWIFT_CODE'#
- TITLE = 'TITLE'#
- US_BANK_NUMBER = 'US_BANK_NUMBER'#
- US_DRIVER_LICENSE = 'US_DRIVER_LICENSE'#
- US_ITIN = 'US_ITIN'#
- US_PASSPORT = 'US_PASSPORT'#
- class pii_recognizer.pii_recognizer.FlairRecognizer(*args: Any, **kwargs: Any)[source]#
Bases:
presidio_analyzer.
Wrapper for a Flair model so that it can be used within Presidio Analyzer. The wrapper ensures that the recognizer can be registered with the Presidio registry.
- RECOGNIZABLE_ENTITIES = {'AGE', 'CURRENCY', 'GPE', 'IMEI', 'LICENSE_PLATE', 'LOCATION', 'MAC_ADDRESS', 'NRP', 'ORGANIZATION', 'PASSWORD', 'PERSON', 'ROUTING_NUMBER', 'SWIFT_CODE', 'TITLE', 'US_BANK_NUMBER', 'US_DRIVER_LICENSE', 'US_ITIN', 'US_PASSPORT'}#
- analyze(text: str, entities: List[str], nlp_artifacts: Optional[presidio_analyzer.nlp_engine.NlpArtifacts] = None) → List[presidio_analyzer.RecognizerResult][source]#
Analyze text and return the results.
- Parameters
text – The text for analysis.
entities – The list of entities to recognize.
nlp_artifacts – Not used by this recognizer but needed for the interface.
- Returns
The list of Presidio RecognizerResult constructed from the recognized Flair detections.
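The conversion described in the Returns entry can be sketched as follows. This is a stdlib-only illustration, not the package's code: `Detection` and `Result` are hypothetical stand-ins for a tagger span and for Presidio's `RecognizerResult`, and filtering by the `entities` argument is an assumption about how the requested list is applied.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in for one span predicted by a tagger."""
    label: str
    start: int
    end: int
    score: float


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in mirroring the fields of a recognizer result."""
    entity_type: str
    start: int
    end: int
    score: float


def to_results(detections: List[Detection], entities: List[str]) -> List[Result]:
    """Keep only detections whose label was requested and convert them."""
    requested = set(entities)
    return [
        Result(d.label, d.start, d.end, round(d.score, 2))
        for d in detections
        if d.label in requested
    ]


detections = [
    Detection("PERSON", 0, 8, 0.991),
    Detection("LOCATION", 18, 23, 0.874),
]
# Only PERSON was requested, so the LOCATION detection is dropped.
results = to_results(detections, entities=["PERSON"])
```

In the real recognizer the detections would come from the Flair model and the results would be Presidio objects; the sketch only shows the shape of the mapping.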
- class pii_recognizer.pii_recognizer.Models[source]#
Bases:
object
- FLAIR = 'flair'#
- PATTERN = 'pattern'#
- SPACY = 'spacy'#
- WHOLE = 'whole'#
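One way to picture how these model names could drive recognizer selection is the sketch below. It is an assumption-laden illustration: the `RECOGNIZERS` registry and `select_recognizers` helper are hypothetical, the recognizer names are plain strings rather than real objects, and treating "whole" as "all recognizers combined" is an assumption that this page does not state.

```python
from typing import Dict, List

# Values copied from the Models class documented above.
FLAIR, PATTERN, SPACY, WHOLE = "flair", "pattern", "spacy", "whole"

# Hypothetical registry: the names are placeholders, not recognizer instances.
RECOGNIZERS: Dict[str, List[str]] = {
    SPACY: ["CustomSpacyRecognizer"],
    FLAIR: ["FlairRecognizer"],
    PATTERN: ["PatternRecognizer"],
}


def select_recognizers(model: str) -> List[str]:
    """Return recognizer names for a model name.

    Assumption: 'whole' selects every registered recognizer.
    """
    if model == WHOLE:
        return [name for names in RECOGNIZERS.values() for name in names]
    if model not in RECOGNIZERS:
        raise ValueError(f"Unknown model: {model!r}")
    return list(RECOGNIZERS[model])
```

Raising on an unknown name keeps a typo in the `model` argument from silently selecting nothing.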
- class pii_recognizer.pii_recognizer.PatternRecognizerFactory[source]#
Bases:
object
Factory for creating pattern recognizers. It can be extended in the future with more regex patterns for different entities. For the pattern recognizer to work, a list of regex patterns must be constructed for each entity.
- RECOGNIZABLE_ENTITIES = {'CREDIT_CARD': [presidio_analyzer.Pattern], 'EMAIL': [presidio_analyzer.Pattern], 'PHONE': [presidio_analyzer.Pattern], 'SSN': [presidio_analyzer.Pattern]}#
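The entity-to-patterns mapping can be illustrated with a small stdlib sketch. Everything here is hypothetical: the `Pattern` tuple is a stand-in for `presidio_analyzer.Pattern`, the two regexes and their scores are invented for the example (the package's actual patterns are not shown on this page), and `find_entities` is not part of the package.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class Pattern(NamedTuple):
    """Hypothetical stand-in for a named, scored regex pattern."""
    name: str
    regex: str
    score: float


# Illustrative patterns only; not the regexes used by the package.
PATTERNS: Dict[str, List[Pattern]] = {
    "EMAIL": [Pattern("email", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", 0.8)],
    "SSN": [Pattern("ssn", r"\b\d{3}-\d{2}-\d{4}\b", 0.8)],
}


def find_entities(text: str) -> List[Tuple[str, int, int, float]]:
    """Return (entity, start, end, score) for every pattern match, ordered by start."""
    hits = []
    for entity, patterns in PATTERNS.items():
        for pattern in patterns:
            for match in re.finditer(pattern.regex, text):
                hits.append((entity, match.start(), match.end(), pattern.score))
    return sorted(hits, key=lambda hit: hit[1])


text = "Mail jane@example.com, SSN 123-45-6789."
hits = find_entities(text)
```

Extending the mapping to a new entity then amounts to adding another key with its own list of patterns.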
- pii_recognizer.pii_recognizer.recognize_pii(context: mlrun.execution.MLClientCtx, input_path: Union[str, pathlib.Path], html_key: str, score_threshold: float, output_directory: Optional[str] = None, entities: Optional[List[str]] = None, entity_operator_map: Optional[dict] = None, model: Optional[str] = None, generate_json: bool = True, generate_html: bool = True, is_full_text: bool = True, is_full_html: bool = True, is_full_report: bool = True) → Union[Tuple[str, pandas.core.frame.DataFrame, dict, dict], Tuple[str, pandas.core.frame.DataFrame, dict]][source]#
Walk through the input path, recognize PII in the text files, and store the anonymized text in the output directory. Optionally generate an HTML report that highlights each entity in a different color, and a JSON report of the explanation.
- Parameters
context – The MLRun context, needed for logging the artifacts.
input_path – The path of the text files to be analyzed.
html_key – The html key for the artifact.
score_threshold – The score threshold to mark the recognition as trusted.
output_directory – The output directory path to store the anonymized text.
entities – The list of entities to recognize.
entity_operator_map – The map of each entity to an operator (mask, redact, replace, keep, or hash) and that operator's parameters.
model – The model to use. Can be “spacy”, “flair”, “pattern” or “whole”.
generate_json – Whether to generate the json report of the explanation.
generate_html – Whether to generate the html report of the explanation.
is_full_text – Whether to return the full text or only the masked text.
is_full_html – Whether to return the full html or just the annotated text.
is_full_report – Whether to return the full report or just the score and the start and end indices.
- Returns
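A tuple of:
Path to the output directory.
A pandas DataFrame (listed in the return annotation; its contents are not described here).
The JSON report of the explanation (only if generate_json is True).
A dictionary of the files that raised errors and were not processed.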
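The walk, anonymize, and collect-errors flow described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the function's implementation: `anonymize_directory` is a hypothetical helper, a single SSN-like regex stands in for the selected recognizers, only `.txt` files are considered, and the DataFrame and the HTML and JSON reports are omitted.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, Tuple

# Illustrative pattern; the real function delegates detection to recognizers.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def anonymize_directory(input_path: Path, output_directory: Path) -> Tuple[str, Dict[str, str]]:
    """Mask SSN-like strings in every .txt file and collect per-file errors."""
    output_directory.mkdir(parents=True, exist_ok=True)
    errors: Dict[str, str] = {}
    for file in sorted(input_path.rglob("*.txt")):
        try:
            text = file.read_text(encoding="utf-8")
            masked_text = SSN.sub("<SSN>", text)
            (output_directory / file.name).write_text(masked_text, encoding="utf-8")
        except (OSError, UnicodeDecodeError) as error:
            # Record the failure and continue with the remaining files.
            errors[str(file)] = str(error)
    return str(output_directory), errors


with tempfile.TemporaryDirectory() as tmp:
    source, target = Path(tmp) / "in", Path(tmp) / "out"
    source.mkdir()
    (source / "note.txt").write_text("SSN 123-45-6789", encoding="utf-8")
    (source / "bad.txt").write_bytes(b"\xff\xfe\xfa")  # not valid UTF-8
    output_path, errors = anonymize_directory(source, target)
    masked = (target / "note.txt").read_text(encoding="utf-8")
```

Collecting errors per file rather than raising on the first failure mirrors the documented return value, which reports unprocessed files alongside the output path.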