I want to configure Python so that it raises an error when encountering non-ASCII characters in identifiers (e.g., variable names, function names) but still accepts UTF-8 encoded strings (e.g., "Привет, мир!"). For example:
# This should raise an error
def тест():
    pass
# This should work
text = "Привет, мир!"
I know about # -*- coding: ascii -*-, but it blocks non-ASCII characters everywhere in the source code, including in string literals.
(The same question applies to Jupyter notebooks.)
asked Jan 20 at 8:09 by Филя Усков
3 Answers
This is easily checked with static code analysis. Pylint will report an issue in its default configuration:
foo.py:2:0: C2401: Function name "тест" contains a non-ASCII character, consider renaming it. (non-ascii-name)
You should configure your VCS to run pylint and accept only commits that produce no warnings, or at least no C2401.
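As a rough sketch of how such a gate might look, the helper below builds a pylint invocation that reports only C2401 and returns pylint's exit status. The function names `build_pylint_command` and `run_check` are hypothetical, and the sketch assumes pylint is installed in the same interpreter that runs the hook.

```python
import subprocess
import sys


def build_pylint_command(paths):
    """Build a pylint invocation that reports only non-ASCII identifiers."""
    return [
        sys.executable, "-m", "pylint",
        "--disable=all",            # switch every check off ...
        "--enable=non-ascii-name",  # ... then re-enable only C2401
        *paths,
    ]


def run_check(paths):
    """Run pylint on the given files and return its exit status."""
    # pylint exits with a non-zero status when it emits any message,
    # which a commit hook can use to reject the commit.
    return subprocess.run(build_pylint_command(paths)).returncode
```

A hook script could then end with `sys.exit(run_check(sys.argv[1:]))`, with the VCS passing the staged file names as arguments.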
While @Friedrich's answer using pylint works for most cases, it should be noted that pylint is a third-party library that is prone to falling out of sync with each major release of Python. For example, pylint to this date still does not recognize the match statement, which became part of Python's syntax with the release of Python 3.10 back in 2021. You can run pylint against the code below to see that it does not warn about a non-ASCII name:
match тест:
    case _:
        pass
And @globglogabgalab's answer using dir() works only for names defined in the module's global namespace, and only for those that happen to be defined in the current execution path.
An arguably more robust approach would be to take advantage of the convention that all names are parsed into AST as either the id attribute of an ast.Name node or the name attribute of other name-including node types, derived from the base class ast.AST:
import ast
with open(__file__, encoding='utf-8') as source:
    for node in ast.walk(ast.parse(source.read())):
        match node:
            case ast.Name(id=name):
                pass
            case ast.AST(name=name) if name:
                pass
            case _:
                continue
        if not name.isascii():
            raise RuntimeError(f'{name} not an ASCII identifier.')
def тест():
    pass
text = "Привет, мир!"
This produces:
RuntimeError: тест not an ASCII identifier.
This approach is more future-proof because it is highly unlikely that Python developers will stop following this convention in any future syntax changes.
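As a hedged extension of the check above, the sketch below wraps it in a function that takes source text instead of reading __file__, and also looks at a few other fields where identifiers can appear, such as the arg field of function parameters and the attr field of attribute access. The function name `find_non_ascii_names` and the `IDENTIFIER_FIELDS` tuple are assumptions made for illustration; the tuple is not an official or exhaustive list.

```python
import ast

# Fields that hold identifier text on various AST node types.
# This selection is an assumption of the sketch, not an official API.
IDENTIFIER_FIELDS = ("id", "name", "arg", "attr", "asname", "module", "rest", "names")


def find_non_ascii_names(source: str) -> list[tuple[int, str]]:
    """Return (line, identifier) pairs for identifiers that are not pure ASCII."""
    found = []
    for node in ast.walk(ast.parse(source)):
        for field in IDENTIFIER_FIELDS:
            value = getattr(node, field, None)
            # global/nonlocal statements store a list of strings;
            # most other nodes store a single string or None.
            candidates = value if isinstance(value, list) else [value]
            for name in candidates:
                if isinstance(name, str) and not name.isascii():
                    found.append((getattr(node, "lineno", 0), name))
    return found


sample = 'def тест(параметр):\n    pass\ntext = "Привет, мир!"\n'
for line, name in find_non_ascii_names(sample):
    print(f"line {line}: {name} is not an ASCII identifier")
```

String literals are stored in the value field of ast.Constant, which the sketch never inspects, so the Cyrillic text in the sample is not reported. Returning a list instead of raising lets a caller report every offending name in one pass.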
I second @Friedrich's answer, as it is good practice for "real" projects, but for completeness you can try working from the output of the dir() function:
def check_ascii(args: list[str]) -> bool:
    for s in args:
        if not s.isascii():
            return False
    return True
a, b, c = 5, 7, 0
print(dir())
# ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'a', 'b', 'c', 'check_ascii']
check_ascii(dir())
# True
èé = True
print(dir())
# ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'a', 'b', 'c', 'check_ascii', 'èé']
check_ascii(dir())
# False
Edit: you could be tempted to put the dir() inside check_ascii's body, but it won't work as the scope would not be the same:
def check_dir():
    print(dir())
check_dir()
# []
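One possible way around the scope issue, sketched here on the assumption that passing the namespace in explicitly is acceptable, is to hand the function a mapping such as globals() rather than calling dir() inside it. The name `non_ascii_names` and the `example_namespace` dictionary are hypothetical and exist only for this illustration.

```python
def non_ascii_names(namespace: dict) -> list[str]:
    """Return the names in `namespace` that contain non-ASCII characters."""
    return sorted(name for name in namespace if not name.isascii())


# Hypothetical namespace standing in for a module's globals().
example_namespace = {"a": 5, "b": 7, "èé": True}
print(non_ascii_names(example_namespace))  # ['èé']
```

At module level the call would be `non_ascii_names(globals())`. Like the dir() approach, this sees only names that have already been bound in that namespace.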
"Привет, мир!" is just a Unicode string in Python 3, not UTF8. The actual implementation can vary based on the contents according to PEP 393 and in 3.13, that's UTF16. If you use b = array.array('u', text).tobytes() you'll get 2 bytes per character. The answers to this question explain this – Panagiotis Kanavos Commented Jan 20 at 9:13
# coding: utf-8 or in presence of UTF-8 BOM, "Привет, мир!" would indeed be in UTF-8 in source code (just like the rest of the file, along with def тест():); the fact that the string itself will be stored as UTF-16 during execution is not related to the OP's question. – Amadan Commented Jan 21 at 1:46
UTF-8 encoded strings (e.g., "Привет, мир!") – Panagiotis Kanavos Commented Jan 21 at 7:49