I'm using spaCy with the pl_core_news_lg model to extract named entities from Polish text. It correctly detects both organizations (ORG) and people's names (PER):
import spacy
nlp = spacy.load("pl_core_news_lg")
text = "Jan Kowalski pracuje w IBM i współpracuje z Microsoft oraz Google."
doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
Output:
[('Jan Kowalski', 'persName'), ('IBM', 'orgName'), ('Microsoft', 'orgName'), ('Google', 'orgName')]
However, when I use Presidio with the pl_core_news_lg model and a configuration file, the recognizers do not detect organizations (ORG) or PESEL numbers, even though both appear in the list of supported entities.
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
provider = NlpEngineProvider(conf_file="path_to_my_file/nlp_config.yaml")
nlp_engine = provider.create_engine()
print(f"Supported recognizers (from NLP engine): {nlp_engine.get_supported_entities()}")
supported_languages = list(nlp_engine.get_supported_languages())
registry = RecognizerRegistry(supported_languages=["pl"])
registry.load_predefined_recognizers(["pl"])
print(f"Supported recognizers (from registry): {registry.get_supported_entities(['pl'])}")
analyzer = AnalyzerEngine(
    registry=registry, supported_languages=supported_languages, nlp_engine=nlp_engine
)
results = analyzer.analyze(text, "pl")
for entity in results:
    print(f"Found entity: {entity.entity_type} with score {entity.score}")
Output:
Supported recognizers (from NLP engine): ['ID', 'NRP', 'DATE_TIME', 'PERSON', 'LOCATION']
Supported recognizers (from registry): ['IN_VOTER', 'URL', 'IBAN_CODE', 'CREDIT_CARD', 'DATE_TIME', 'NRP', 'PHONE_NUMBER', 'MEDICAL_LICENSE', 'PERSON', 'IP_ADDRESS', 'ORGANIZATION', 'CRYPTO', 'LOCATION', 'PL_PESEL', 'EMAIL_ADDRESS']
Even though 'ORGANIZATION' and 'PL_PESEL' are listed as supported by the registry (and I would expect 'ORGANIZATION' to be listed by the NLP engine as well), Presidio does not detect them in the text.
My config file:
nlp_engine_name: spacy
models:
  - lang_code: pl
    model_name: pl_core_news_lg
ner_model_configuration:
  model_to_presidio_entity_mapping:
    persName: PERSON
    orgName: ORGANIZATION
    # orgName: ORG
    placeName: LOCATION
    geogName: LOCATION
    LOC: LOCATION
    GPE: LOCATION
    FAC: LOCATION
    DATE: DATE_TIME
    TIME: DATE_TIME
    NORP: NRP
    ID: ID
Why does Presidio fail to detect organizations (ORG) and PESEL numbers (PL_PESEL), while spaCy on its own detects the organizations correctly?
asked Apr 2 at 5:56 by Maltion
- Comparing two modules makes no sense. They were created by different people and use different code, so they can work in different ways and may have different problems. Maybe you need to ask the authors of Presidio why their module doesn't work as you expect, or send it to them as an issue. (PL: powodzenia, i.e. good luck) – furas, Apr 2 at 11:36
- Maybe send this to Issues · microsoft/presidio – furas, Apr 2 at 11:58
- @furas I am comparing them just to show that spaCy itself works as expected. My title clearly states that the issue is with Presidio. Moreover, something might be missing in my code, so I am not sure whether Presidio is failing or I am misusing it. – Maltion, Apr 2 at 12:18
1 Answer
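The configuration file is missing the 'labels_to_ignore' field. When it is absent, Presidio falls back to a default ignore list, which appears to include ORGANIZATION (note that ORGANIZATION is mapped in your file but missing from the NLP engine's supported entities). Setting the field explicitly, with only the non-entity label O, tells the NLP engine that no entities should be ignored:
labels_to_ignore:
  - O
In your configuration it would look like this: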
nlp_engine_name: spacy
models:
  - lang_code: pl
    model_name: pl_core_news_lg
ner_model_configuration:
  labels_to_ignore:
    - O
  model_to_presidio_entity_mapping:
    persName: PERSON
    orgName: ORGANIZATION
    # orgName: ORG
    placeName: LOCATION
    geogName: LOCATION
    LOC: LOCATION
    GPE: LOCATION
    FAC: LOCATION
    DATE: DATE_TIME
    TIME: DATE_TIME
    NORP: NRP
    ID: ID
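To make the effect of the setting concrete, here is a minimal, standard-library-only sketch of how the entity mapping and the ignore list could interact. It is a simplified model for illustration, not Presidio's actual code: the function name `supported_entities` is hypothetical, `ASSUMED_DEFAULT_IGNORE` is only a stand-in for whatever default the library applies when the key is absent, and the mapping is an abbreviated copy of the one in the question (with `orgName` as the model's organization label).

```python
from typing import Dict, Iterable, List

# Abbreviated copy of the question's mapping: spaCy label -> Presidio entity.
MAPPING: Dict[str, str] = {
    "persName": "PERSON",
    "orgName": "ORGANIZATION",
    "placeName": "LOCATION",
    "geogName": "LOCATION",
    "DATE": "DATE_TIME",
    "NORP": "NRP",
    "ID": "ID",
}

# ASSUMPTION: stand-in for the library default used when labels_to_ignore
# is missing from the config; the real default list may differ.
ASSUMED_DEFAULT_IGNORE = {"O", "ORG", "ORGANIZATION"}


def supported_entities(
    mapping: Dict[str, str], labels_to_ignore: Iterable[str]
) -> List[str]:
    """Hypothetical helper: mapped entity names minus the ignored ones."""
    ignored = set(labels_to_ignore)
    return sorted({entity for entity in mapping.values() if entity not in ignored})


# Without an explicit labels_to_ignore, ORGANIZATION is filtered out.
print(supported_entities(MAPPING, ASSUMED_DEFAULT_IGNORE))
# -> ['DATE_TIME', 'ID', 'LOCATION', 'NRP', 'PERSON']

# With labels_to_ignore set to just "O", ORGANIZATION survives.
print(supported_entities(MAPPING, ["O"]))
# -> ['DATE_TIME', 'ID', 'LOCATION', 'NRP', 'ORGANIZATION', 'PERSON']
```

The first result matches the NLP engine list printed in the question, which is what suggests ORGANIZATION is being dropped by an ignore list rather than by the model. PL_PESEL is a separate case: as far as I know it is a pattern-based recognizer in the registry and is not affected by this setting, and the sample sentence contains no 11-digit number for it to match.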