
python 3.x - How to use Hugging Face model with 512 max tokens on longer text (for Named Entity Recognition)


I have been using the Named Entity Recognition (NER) model https://huggingface.co/cahya/bert-base-indonesian-NER on Indonesian text as follows:

text = "..."
model_name = "cahya/bert-base-indonesian-NER"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
entities = nlp(text)

This works great, but when text contains more than 512 tokens I get the error:

The size of tensor a (1098) must match the size of tensor b (512) at non-singleton dimension 1

What is the best way to check if text contains more than 512 tokens, and then split it into manageable chunks that I can use for NER?


Counting the number of tokens seems straightforward:

n_tokens = len(tokenizer.encode(text, add_special_tokens=True, truncation=False))
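
To check whether chunking is needed at all, comparing that count against the limit the tokenizer reports seems sufficient (a minimal sketch; I assume model_max_length is 512 for this checkpoint):

# Compare the token count against the limit reported by the tokenizer
# (assumed to be 512 for this checkpoint).
max_len = tokenizer.model_max_length
if n_tokens > max_len:
    print(f"Text has {n_tokens} tokens, which exceeds the limit of {max_len}; chunking is needed")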

However, it is unclear what the best way to split the text is. Surely there is a suite of functions for this?

asked Nov 20, 2024 at 15:03 by Mauro Escudero

1 Answer


If your text is too long for cahya/bert-base-indonesian-NER, just split it into overlapping chunks before running NER.

from transformers import BertTokenizer, BertForTokenClassification, pipeline

# Initialize tokenizer and model
model_name = "cahya/bert-base-indonesian-NER"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Sample text
text = "Pemerintah Indonesia sedang berupaya meningkatkan infrastruktur digital di seluruh negeri. Dalam beberapa tahun terakhir, investasi dalam jaringan internet dan teknologi telah meningkat secara signifikan. Banyak perusahaan rintisan (startup) bermunculan, terutama di sektor e-commerce dan fintech. Selain itu, pendidikan digital juga menjadi fokus utama untuk memastikan masyarakat memiliki keterampilan yang dibutuhkan di era teknologi ini."

# Tokenize text
tokens = tokenizer.tokenize(text)

# Split tokens into overlapping chunks so entities near a boundary are not cut off
max_length = 512 - 2  # reserve room for the [CLS] and [SEP] special tokens
overlap = 50          # number of tokens shared between consecutive chunks
chunks = []
for i in range(0, len(tokens), max_length - overlap):
    chunks.append(tokens[i:i + max_length])

# Detokenize chunks back to text
chunk_texts = [tokenizer.convert_tokens_to_string(chunk) for chunk in chunks]

# Perform NER on each chunk
all_entities = []
for chunk in chunk_texts:
    entities = nlp(chunk) 
    all_entities.extend(entities)
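
One caveat (my assumption, not something verified against the pipeline internals): because the chunks are detokenized and then re-tokenized, the start/end offsets in the results are relative to each chunk, and entities that fall inside the 50-token overlap can be reported twice. A rough deduplication sketch, keying only on the surface form and label:

# Drop duplicate entities produced by the overlapping chunks.
# Keying on (word, entity_group) is a simplification: genuinely repeated
# entities in different parts of the text will be collapsed as well.
seen = set()
unique_entities = []
for ent in all_entities:
    key = (ent["word"], ent["entity_group"])
    if key not in seen:
        seen.add(key)
        unique_entities.append(ent)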

I found that this model's performance is not very good, so treat this process as a reference and adjust it as needed.
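
As an alternative to manual chunking (an assumption on my part: this needs a reasonably recent transformers release and a fast tokenizer, so treat it as a sketch rather than a drop-in replacement), the token-classification pipeline can also window long inputs itself when given a stride:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "cahya/bert-base-indonesian-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # loads the fast tokenizer when available
model = AutoModelForTokenClassification.from_pretrained(model_name)

# stride is the token overlap between the windows the pipeline creates internally;
# it requires a fast tokenizer and an aggregation strategy other than "none".
nlp = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    stride=50,
)
entities = nlp(text)  # text may be longer than 512 tokens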
