
python 3.x - How to use Hugging Face model with 512 max tokens on longer text (for Named Entity Recognition)


I have been using the Named Entity Recognition (NER) model https://huggingface.co/cahya/bert-base-indonesian-NER on Indonesian text as follows:

text = "..."
model_name = "cahya/bert-base-indonesian-NER"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
entities = nlp(text)

This works great, but when text contains more than 512 tokens I get the error:

The size of tensor a (1098) must match the size of tensor b (512) at non-singleton dimension 1

What is the best way to check if text contains more than 512 tokens, and then split it into manageable chunks that I can use for NER?


Counting the number of tokens seems straightforward:

n_tokens = len(tokenizer.encode(text, add_special_tokens=True, truncation=False))
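
To check whether chunking is needed at all, comparing that count against the limit the tokenizer reports seems sufficient (a minimal sketch; I assume model_max_length is 512 for this checkpoint):

# Compare the token count against the limit reported by the tokenizer
# (assumed to be 512 for this checkpoint).
max_len = tokenizer.model_max_length
if n_tokens > max_len:
    print(f"Text has {n_tokens} tokens, which exceeds the limit of {max_len}; chunking is needed")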

However, it is unclear what the best way to split the text is. Surely there is a suite of functions for this?

asked Nov 20, 2024 at 15:03 by Mauro Escudero

1 Answer


If your text is too long for cahya/bert-base-indonesian-NER, just split it into overlapping chunks before running NER.

from transformers import BertTokenizer, BertForTokenClassification, pipeline

# Initialize tokenizer and model
model_name = "cahya/bert-base-indonesian-NER"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Sample text
text = "Pemerintah Indonesia sedang berupaya meningkatkan infrastruktur digital di seluruh negeri. Dalam beberapa tahun terakhir, investasi dalam jaringan internet dan teknologi telah meningkat secara signifikan. Banyak perusahaan rintisan (startup) bermunculan, terutama di sektor e-commerce dan fintech. Selain itu, pendidikan digital juga menjadi fokus utama untuk memastikan masyarakat memiliki keterampilan yang dibutuhkan di era teknologi ini."

# Tokenize text
tokens = tokenizer.tokenize(text)

# Split tokens into overlapping chunks so entities near a boundary are not cut off
max_length = 512 - 2  # reserve room for the [CLS] and [SEP] special tokens
overlap = 50          # number of tokens shared between consecutive chunks
chunks = []
for i in range(0, len(tokens), max_length - overlap):
    chunks.append(tokens[i:i + max_length])

# Detokenize chunks back to text
chunk_texts = [tokenizer.convert_tokens_to_string(chunk) for chunk in chunks]

# Perform NER on each chunk
all_entities = []
for chunk in chunk_texts:
    entities = nlp(chunk) 
    all_entities.extend(entities)
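
One caveat (my assumption, not something verified against the pipeline internals): because the chunks are detokenized and then re-tokenized, the start/end offsets in the results are relative to each chunk, and entities that fall inside the 50-token overlap can be reported twice. A rough deduplication sketch, keying only on the surface form and label:

# Drop duplicate entities produced by the overlapping chunks.
# Keying on (word, entity_group) is a simplification: genuinely repeated
# entities in different parts of the text will be collapsed as well.
seen = set()
unique_entities = []
for ent in all_entities:
    key = (ent["word"], ent["entity_group"])
    if key not in seen:
        seen.add(key)
        unique_entities.append(ent)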

I found that this model's performance is not very good, so treat this process as a reference and adjust it as needed.
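
As an alternative to manual chunking (an assumption on my part: this needs a reasonably recent transformers release and a fast tokenizer, so treat it as a sketch rather than a drop-in replacement), the token-classification pipeline can also window long inputs itself when given a stride:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "cahya/bert-base-indonesian-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # loads the fast tokenizer when available
model = AutoModelForTokenClassification.from_pretrained(model_name)

# stride is the token overlap between the windows the pipeline creates internally;
# it requires a fast tokenizer and an aggregation strategy other than "none".
nlp = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    stride=50,
)
entities = nlp(text)  # text may be longer than 512 tokens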
