When splitting Russian paragraphs into sentences, I am observing a special case in which a sequence is not treated as the end of a sentence. It happens with я. (the letter я followed by a period) at the end of the sentence. See the runnable example:
import nltk
# Load the pre-trained Punkt sentence tokenizer for Russian.
tok = nltk.tokenize.PunktTokenizer('russian')
# Case 1: no dash after 'я.'; the text is not split there.
print('-----------------')
line = 'Родилась заново, стал размышлять я. Она не застрелена, а это дело упрощает.'
lst = tok.tokenize(line)
for n, s in enumerate(lst, 1):
    print(f'{n}: {s!r}')
# Case 2: a dash follows 'я.'; the text is split as expected.
print('-----------------')
line = 'Родилась заново, стал размышлять я. - Она не застрелена, а это дело упрощает.'
lst = tok.tokenize(line)
for n, s in enumerate(lst, 1):
    print(f'{n}: {s!r}')
The second case works as expected. It differs only in the added dash (a kind of marker that introduces the speaker's note; sorry for my lack of proper terminology).
The я is not listed among the Russian abbreviations (c:\nltk_data\tokenizers\punkt_tab\russian\abbrev_types.txt). However, adding it to that file makes no difference.
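One way to double-check the abbreviation claim is to compare the file on disk with what the tokenizer actually holds in memory. The following is only a minimal sketch, not a confirmed diagnosis. The helper names abbrevs_on_disk, abbrevs_in_memory and compare are hypothetical. It assumes that abbrev_types.txt holds one UTF-8 entry per line, and that the tokenizer keeps its loaded parameters in the private _params attribute, which may differ between nltk versions.

```python
from pathlib import Path


def abbrevs_on_disk(path):
    """Return the set of entries found in an abbreviation file.

    Assumption: one abbreviation per line, UTF-8 encoded.
    'utf-8-sig' also tolerates a byte order mark left by an editor.
    """
    text = Path(path).read_text(encoding='utf-8-sig')
    return {line.strip() for line in text.splitlines() if line.strip()}


def abbrevs_in_memory(tok):
    """Return the abbreviation set held by a tokenizer object.

    Assumption: the parameters live in the private attribute _params
    and expose an abbrev_types collection; otherwise an empty set
    is returned.
    """
    params = getattr(tok, '_params', None)
    return set(getattr(params, 'abbrev_types', ()))


def compare(path, tok, word):
    """Report whether word is known on disk and in memory."""
    return {
        'on_disk': word in abbrevs_on_disk(path),
        'in_memory': word in abbrevs_in_memory(tok),
    }
```

With the tok object from the example above, calling compare with the path of abbrev_types.txt and the word 'я' would show whether the edited file was actually picked up by the tokenizer.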
How should this be fixed?
I am using nltk 3.9.1. The nltk_data are shared, stored in c:\nltk_data, and freshly downloaded (30 January 2025). The script ran under Python 3.12 on Windows 10.
Update:
(After noticing downvotes.) Please note that I am very new to nltk and that I tried my best to find the answer myself. If you downvote, please leave a comment explaining why. I have searched various sources without success.
I am using UTF-8 encoding both for the test script here and for the file that is actually being processed (the image was taken from Notepad++).
2nd Update: As Joop Eggen suggested, I have tried with other single letters:
import nltk
tok = nltk.tokenize.PunktTokenizer('russian')
print('\n================= я in the original question (я not in the abbrev_types.txt)\n')
line = 'Родилась заново, стал размышлять я. Она не застрелена, а это дело упрощает.'
lst = tok.tokenize(line)
for n, s in enumerate(lst, 1):
    print(f'{n}: {s!r}')
print('-----------------')
line = 'Родилась заново, стал размышлять я. - Она не застрелена, а это дело упрощает.'
lst = tok.tokenize(line)
for n, s in enumerate(lst, 1):
    print(f'{n}: {s!r}')
print('\n================= г is in the abbrev_types.txt\n')
line = 'Родилась заново, стал размышлять г. Она не застрелена, а это дело упрощает.'
lst = tok.tokenize(line)
for n, s in enumerate(lst, 1):
    print(f'{n}: {s!r}')
print('-----------------')
line = 'Родилась заново, стал размышлять г. - Она не застрелена, а это дело упрощает.'
lst = tok.tokenize(line)
for n, s in enumerate(lst, 1):
    print(f'{n}: {s!r}')
print('\n================= а IS NOT in the abbrev_types.txt\n')
line = 'Родилась заново, стал размышлять а. Она не застрелена, а это дело упрощает.'
lst = tok.tokenize(line)
for n, s in enumerate(lst, 1):
    print(f'{n}: {s!r}')
print('-----------------')
line = 'Родилась заново, стал размышлять а. - Она не застрелена, а это дело упрощает.'
lst = tok.tokenize(line)
for n, s in enumerate(lst, 1):
    print(f'{n}: {s!r}')
The а is also not in the abbreviations; however, it is treated as if it were (unlike я). The г is in the abbreviations, so here the behavior is expected.
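To see why the tokenizer decides differently for я, а and г, the individual decisions could be printed. The sketch below relies on the debug_decisions method of the Punkt sentence tokenizer, which yields one dictionary per candidate sentence end. The helper name explain_breaks is hypothetical, and the exact dictionary keys are an assumption that may vary between nltk versions, so absent keys are shown as None.

```python
def explain_breaks(tok, line):
    """Return one summary string per candidate sentence end in line.

    tok is expected to offer debug_decisions(text), as the Punkt
    sentence tokenizer does. The key names below are assumptions;
    keys that are absent are reported as None.
    """
    keys = ('type1', 'type2', 'type1_in_abbrs', 'type1_is_initial',
            'break_decision', 'reason_for_break')
    rows = []
    for decision in tok.debug_decisions(line):
        rows.append(', '.join(f'{k}={decision.get(k)!r}' for k in keys))
    return rows


def main():
    # Imported here so the helper above stays usable without nltk.
    import nltk
    tok = nltk.tokenize.PunktTokenizer('russian')
    for letter in ('я', 'г', 'а'):
        line = (f'Родилась заново, стал размышлять {letter}. '
                'Она не застрелена, а это дело упрощает.')
        print(f'--- {letter}')
        for row in explain_breaks(tok, line):
            print(row)
```

Calling main() requires nltk and the punkt_tab data. Comparing type1_is_initial and reason_for_break across the three letters may show whether a single-letter token is handled as an initial rather than through the abbreviation list; that is a hypothesis to verify, not a confirmed explanation.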