python - Why does Presidio with spacy nlp engine not recognize organizations and PESEL while spaCy does? - Stack Overflow

I'm using spaCy with the pl_core_news_lg model to extract named entities from Polish text. It correctly detects both organizations (ORG) and people's names (PER):

import spacy

nlp = spacy.load("pl_core_news_lg")
text = "Jan Kowalski pracuje w IBM i współpracuje z Microsoft oraz Google."

doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(entities)

Output:

[('Jan Kowalski', 'persName'), ('IBM', 'orgName'), ('Microsoft', 'orgName'), ('Google', 'orgName')]

However, when I use Presidio with the pl_core_news_lg model and a configuration file, the recognizers do not detect organizations (ORG) or PESEL numbers, even though both appear in the list of supported entities.

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

provider = NlpEngineProvider(conf_file="path_to_my_file/nlp_config.yaml") 
nlp_engine = provider.create_engine()

print(f"Supported recognizers (from NLP engine): {nlp_engine.get_supported_entities()}")

supported_languages = list(nlp_engine.get_supported_languages())
registry = RecognizerRegistry(supported_languages=["pl"])
registry.load_predefined_recognizers(["pl"])

print(f"Supported recognizers (from registry): {registry.get_supported_entities(['pl'])}")

analyzer = AnalyzerEngine(
    registry=registry, supported_languages=supported_languages, nlp_engine=nlp_engine
)

results = analyzer.analyze(text, "pl")

for entity in results:
    print(f"Found entity: {entity.entity_type} with score {entity.score}")

Output:

Supported recognizers (from NLP engine): ['ID', 'NRP', 'DATE_TIME', 'PERSON', 'LOCATION']
Supported recognizers (from registry): ['IN_VOTER', 'URL', 'IBAN_CODE', 'CREDIT_CARD', 'DATE_TIME', 'NRP', 'PHONE_NUMBER', 'MEDICAL_LICENSE', 'PERSON', 'IP_ADDRESS', 'ORGANIZATION', 'CRYPTO', 'LOCATION', 'PL_PESEL', 'EMAIL_ADDRESS']

Even though 'ORGANIZATION' and 'PL_PESEL' are listed as supported by the registry (and I would expect ORGANIZATION to be listed by the NLP engine as well), Presidio does not detect them in the text.

My config file:

nlp_engine_name: spacy
models:
  - lang_code: pl
    model_name: pl_core_news_lg

ner_model_configuration:
  model_to_presidio_entity_mapping:
    persName: PERSON
    orgName: ORGANIZATION
#    orgName: ORG
    placeName: LOCATION
    geogName: LOCATION
    LOC: LOCATION
    GPE: LOCATION
    FAC: LOCATION
    DATE: DATE_TIME
    TIME: DATE_TIME
    NORP: NRP
    ID: ID

Why does Presidio fail to detect organizations (ORG) and PESEL numbers (PL_PESEL), while spaCy correctly detects them?

asked Apr 2 at 5:56 by Maltion
  • Comparing two modules makes no sense. They were created by different people and use different code, so they can work in different ways and have different problems. Maybe you need to ask the authors of Presidio why their module doesn't work as you expect, or send it to them as an issue. (PL: powodzenia, "good luck") – furas, Apr 2 at 11:36
  • Maybe send this to Issues · microsoft/presidio – furas, Apr 2 at 11:58
  • @furas I am comparing them just to show that spaCy itself works as expected. The title clearly states that this is an issue with Presidio. Moreover, something might be missing in my code, so I am not sure whether Presidio really fails to work as I expect. – Maltion, Apr 2 at 12:18

1 Answer

The configuration file is missing the 'labels_to_ignore' field. Setting it to only O tells the NLP engine not to ignore any entity labels:

  labels_to_ignore:
    - O
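To illustrate why this field matters, here is a minimal pure-Python sketch of how a label mapping and an ignore list could interact. It is a simplified model, not Presidio's actual code. The function `to_presidio_entities` is hypothetical, and the contents of `assumed_default_ignore` are an assumption about what Presidio falls back to when the field is absent, so check them against your installed version.

```python
def to_presidio_entities(spans, mapping, labels_to_ignore):
    """Map (text, model_label) pairs to Presidio entity names.

    Hypothetical helper for illustration only: a span is dropped when either
    its model label or its mapped entity name is in the ignore list.
    """
    results = []
    for text, label in spans:
        entity = mapping.get(label, label)
        if label in labels_to_ignore or entity in labels_to_ignore:
            continue
        results.append((text, entity))
    return results


# Spans in the shape spaCy reports them in the question.
spans = [("Jan Kowalski", "persName"), ("IBM", "orgName")]
mapping = {"persName": "PERSON", "orgName": "ORGANIZATION"}

# ASSUMPTION: a default-style ignore list that happens to contain ORGANIZATION.
assumed_default_ignore = {"O", "ORG", "ORGANIZATION"}
# The explicit setting from this answer: ignore nothing but "O".
explicit_ignore = {"O"}

with_default = to_presidio_entities(spans, mapping, assumed_default_ignore)
with_explicit = to_presidio_entities(spans, mapping, explicit_ignore)

print(with_default)
print(with_explicit)
```

Under that assumption, the organization span survives only when the ignore list is set explicitly, which would match the behaviour described in the question.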

In your configuration, it would look like this:

nlp_engine_name: spacy
models:
  - lang_code: pl
    model_name: pl_core_news_lg

ner_model_configuration:
  labels_to_ignore:
    - O
  model_to_presidio_entity_mapping:
    persName: PERSON
    orgName: ORGANIZATION
#    orgName: ORG
    placeName: LOCATION
    geogName: LOCATION
    LOC: LOCATION
    GPE: LOCATION
    FAC: LOCATION
    DATE: DATE_TIME
    TIME: DATE_TIME
    NORP: NRP
    ID: ID
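If you prefer to keep the settings in code, the same configuration can probably be passed as a dict through the `nlp_configuration` argument of `NlpEngineProvider`. The sketch below assumes a reasonably recent presidio-analyzer release and an installed pl_core_news_lg model. The names `NLP_CONFIGURATION` and `build_analyzer` are mine, and the organization label is written as orgName on the assumption that this is what the model emits. Note also that the sample sentence in the question contains no PESEL number, so PL_PESEL has nothing to match there regardless of the NLP configuration.

```python
# The YAML settings from this answer, expressed as a Python dict.
NLP_CONFIGURATION = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "pl", "model_name": "pl_core_news_lg"}],
    "ner_model_configuration": {
        "labels_to_ignore": ["O"],
        "model_to_presidio_entity_mapping": {
            "persName": "PERSON",
            "orgName": "ORGANIZATION",  # assumed model label for organizations
            "placeName": "LOCATION",
            "geogName": "LOCATION",
            "LOC": "LOCATION",
            "GPE": "LOCATION",
            "FAC": "LOCATION",
            "DATE": "DATE_TIME",
            "TIME": "DATE_TIME",
            "NORP": "NRP",
            "ID": "ID",
        },
    },
}


def build_analyzer(configuration=NLP_CONFIGURATION):
    """Build an AnalyzerEngine for Polish from an in-memory configuration."""
    # Imported lazily so the dict can be inspected without Presidio installed.
    from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()

    # Pass the engine to the registry so the NLP-based recognizer matches it.
    registry = RecognizerRegistry(supported_languages=["pl"])
    registry.load_predefined_recognizers(languages=["pl"], nlp_engine=nlp_engine)

    return AnalyzerEngine(
        registry=registry, supported_languages=["pl"], nlp_engine=nlp_engine
    )
```

Calling `build_analyzer().analyze(text=..., language="pl")` on a sentence that contains both an organization and a PESEL number is a quick way to check both recognizers at once.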
