How to enforce ASCII-only identifiers in Python while allowing UTF-8 strings?

I want to configure Python so that it raises an error when encountering non-ASCII characters in identifiers (e.g., variable names, function names) but still accepts UTF-8 encoded strings (e.g., "Привет, мир!"). For example:

# This should raise an error
def тест(): 
    pass

# This should work
text = "Привет, мир!"

I know about # -*- coding: ascii -*-, but it blocks non-ASCII characters everywhere in the source code, including in string literals.

(I have the same question for Jupyter notebooks.)

asked Jan 20 at 8:09 by Филя Усков

Nitpicking: "Привет, мир!" is just a Unicode string in Python 3, not UTF8. The actual implementation can vary based on the contents according to PEP 393 and in 3.13, that's UTF16. If you use b = array.array('u', text).tobytes() you'll get 2 bytes per character. The answers to this question explain this – Panagiotis Kanavos Commented Jan 20 at 9:13
@PanagiotisKanavos You might be mistaking runtime encoding with source file encoding. The PEP relevant to the OP's question is PEP 263. Assuming there is # coding: utf-8 or in presence of UTF-8 BOM, "Привет, мир!" would indeed be in UTF-8 in source code (just like the rest of the file, along with def тест():); the fact that the string itself will be stored as UTF-16 during execution is not related to the OP's question. – Amadan Commented Jan 21 at 1:46
I don't, and I refer to the OP saying UTF-8 encoded strings (e.g., "Привет, мир!") – Panagiotis Kanavos Commented Jan 21 at 7:49

3 Answers

This is easily checked with static code analysis. Pylint will report an issue in its default configuration:

foo.py:2:0: C2401: Function name "тест" contains a non-ASCII character, consider renaming it. (non-ascii-name)

You should configure your VCS to run pylint and only accept commits without warnings; or at least without C2401.

While @Friedrich's answer using pylint works for most cases, note that pylint is a third-party tool that can fall out of sync with new Python syntax. For example, at the time of writing, pylint does not flag the non-ASCII name used in the match statement below (the match statement became part of Python's syntax with Python 3.10, released in 2021). You can run pylint against this code and see that it reports no non-ASCII name:

match тест:
    case _:
        pass

And @globglogabgalab's answer using dir() works only for names defined in the module's global namespace and only those that happen to be defined in the current execution path.

An arguably more robust approach is to take advantage of the convention that names are stored in the AST as either the id attribute of an ast.Name node or the name attribute of other name-carrying node types, all derived from the base class ast.AST. (Function parameters and attribute names live in other fields, arg and attr, so the code below does not cover them.)

import ast

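# Parse this very file and inspect every node of its syntax tree.
with open(__file__, encoding='utf-8') as source:
    for node in ast.walk(ast.parse(source.read())):
        match node:
            case ast.Name(id=name):  # variable references and assignments
                pass
            case ast.AST(name=name) if name:  # def, class, import, except ... as
                pass
            case _:
                continue  # node carries no name in these fields
        if not name.isascii():
            raise RuntimeError(f'{name} not an ASCII identifier.')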

def тест():
    pass

text = "Привет, мир!"

This produces:

RuntimeError: тест not an ASCII identifier.

This approach is more future-proof because it is highly unlikely that Python's developers will stop following this convention in future syntax changes.
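The same idea can be extended to identifiers stored in other AST fields. The following is only a sketch: the function name find_non_ascii_identifiers and the field list are assumptions made for illustration, and the list may be incomplete for newer syntax. It takes source text instead of reading __file__, so it can also be run against other files.

```python
import ast

# Node fields that hold identifier strings. This list is an assumption
# and may be incomplete for newer syntax.
_NAME_FIELDS = ("id", "name", "asname", "arg", "attr", "rest")


def find_non_ascii_identifiers(source: str) -> list[str]:
    """Return the non-ASCII identifiers found in source, sorted, without duplicates."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        for field in _NAME_FIELDS:
            value = getattr(node, field, None)
            # Only string-valued fields are identifiers; skip None and child nodes.
            if isinstance(value, str) and not value.isascii():
                found.add(value)
        # global / nonlocal statements keep their names in a list.
        if isinstance(node, (ast.Global, ast.Nonlocal)):
            found.update(n for n in node.names if not n.isascii())
    return sorted(found)


if __name__ == "__main__":
    sample = 'def тест(пар):\n    pass\n\ntext = "Привет, мир!"\n'
    print(find_non_ascii_identifiers(sample))
```

String literals are stored in the value field of ast.Constant nodes, which is not in the field list, so text such as "Привет, мир!" is not reported.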

I second @Friedrich's answer, as it is good practice for "real" projects, but for completeness you can also work from the output of the dir() function:

def check_ascii(args: list[str]) -> bool:
    for s in args:
        if not s.isascii():
            return False
    return True

a, b, c = 5, 7, 0
print(dir())
# ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'a', 'b', 'c', 'check_ascii']
check_ascii(dir())
# True

èé = True
print(dir())
# ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'a', 'b', 'c', 'check_ascii', 'èé']
check_ascii(dir())
# False

Edit: you might be tempted to call dir() inside check_ascii's body, but that won't work, because dir() would then list the function's local scope rather than the module's:

def check_dir():
    print(dir())

check_dir()
# []
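One possible way around the scoping issue is to pass the namespace in explicitly, for example globals() from the calling module. This is a minimal sketch, and the helper name non_ascii_names is hypothetical:

```python
def non_ascii_names(namespace) -> list[str]:
    """Return the non-ASCII names in a namespace (e.g. globals() or dir())."""
    # Iterating a dict yields its keys, so both dicts and lists of names work.
    return sorted(name for name in namespace if not name.isascii())


if __name__ == "__main__":
    èé = True
    # globals() is evaluated in the caller's scope, so module-level names are visible.
    print(non_ascii_names(globals()))
```

Like the dir() approach, this only sees names that already exist at the moment of the call.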
