python - Check if string only contains characters from a certain ISO specification - Stack Overflow

Short question: What is the most efficient way to check whether a .TXT file contains only characters defined in a selected ISO specification?

Question with full context: In the German energy market, EDIFACT is used to exchange information automatically. Each exchanged file has a header segment that contains information about the file's contents.

Please find an example of this segment below.

UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++

As you can see, after UNB+ we find the value UNOC. This tells us which character set is used in the file; in this case it is ISO/IEC 8859-1.
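
For illustration, here is a minimal sketch of pulling the syntax identifier out of the UNB segment; the function name is made up, and it assumes the default EDIFACT separators ('+' for elements, ':' for components):

def syntax_identifier(unb_segment: str) -> str:
    # "UNB+UNOC:3+..." -> "UNOC"
    if not unb_segment.startswith("UNB+"):
        raise ValueError("not a UNB segment")
    return unb_segment[4:].split(":", 1)[0]

print(syntax_identifier("UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++"))  # UNOC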

I would like a Python method that checks whether the EDIFACT file contains only characters specified in ISO/IEC 8859-1.

The simplest solution I can think of is something like this (pseudocode).

ISO_string = "All characters contained in ISO/IEC 8859-1"
EDIFACT_string = "Contents of EDIFACT file"
is_iso_char = False

for EDIFACT_char in EDIFACT_string:
    for ISO_char in ISO_string:
        if EDIFACT_char == ISO_char:
            is_iso_char = True
            break
    if not is_iso_char:
        # raise_error and do_error_handling are placeholders
        raise_error("File contains char not contained in ISO/IEC 8859-1 and needs to be rejected")
        do_error_handling()
    is_iso_char = False

I studied business informatics and lack the theoretical background in algorithm theory. This feels like a very inefficient method, and since EDIFACT needs to be processed quickly, I don't want this functionality to become a bottleneck.

Is there a built-in Python way to do what I want to achieve more efficiently?
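
For comparison, the nested inner loop above can be replaced by a set, which makes each membership test O(1) on average instead of a scan over ISO_string — a minimal sketch, assuming both strings are already in memory:

iso_chars = set(ISO_string)  # one-time setup, O(len(ISO_string))

for edifact_char in EDIFACT_string:
    if edifact_char not in iso_chars:  # average O(1) per character
        raise ValueError("File contains char not contained in ISO/IEC 8859-1 and needs to be rejected")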

Update #1:

I wrote this code as suggested by Barmar. To test it, I added the Chinese characters for "world" (世界) to the file. I expected .decode to throw an error; however, it just decodes the byte string and adds some strange characters at the beginning.

File Contents: 世界UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++

with open(Filename, "rb") as edifact_file:
    edifact_bytes = edifact_file.read()

try:
    verified_edifact_string = edifact_bytes.decode(encoding='latin_1', errors='strict')
except UnicodeDecodeError:
    # decoding failed, so the file contains bytes outside the codec
    print("String does not conform to ISO specification")
else:
    print(verified_edifact_string)

Prints: the decoded string, with strange characters at the beginning (if I paste it here, Stack Overflow cuts away some of the characters).

Edit #2: According to the Python documentation, the ISO/IEC 8859-1 codec is called latin_1 when using Python's .decode() and .encode() methods.
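
As for why no error is thrown in Update #1: latin_1 assigns a character to every byte value 0x00–0xFF, so strict decoding can never fail, no matter what bytes the file contains. The UTF-8 bytes of 世界 simply decode to a short run of Latin-1 characters (including some control characters):

chinese_bytes = "世界".encode("utf-8")                     # b'\xe4\xb8\x96\xe7\x95\x8c'
print(chinese_bytes.decode("latin_1", errors="strict"))  # 'ä¸\x96ç\x95\x8c' - no exception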

asked Feb 7 at 20:01 by Merlin Nestler (edited Feb 7 at 21:37)
  • @Barmar I updated my question with a test I did. Is this what you suggested? – Merlin Nestler
  • That's the general idea. – Barmar
  • I think the strange characters at the beginning may be the BOM sequence. – Barmar
  • @Barmar I believe the BOM sequence looks different. It seems to me like Python tried to translate the bytes of the Chinese characters with the latin_1 encoding. For some bytes it worked; for some there is no character defined in the ISO spec, and a placeholder char is returned. I saw this a lot when working with Chinese characters when I was in Japan. – Merlin Nestler
  • Any idea why no error is thrown? – Merlin Nestler
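
For reference: if the file had been saved with a UTF-8 byte-order mark, those three bytes would decode under latin_1 to the classic 'ï»¿' mojibake — a quick check:

import codecs

print(codecs.BOM_UTF8)                    # b'\xef\xbb\xbf'
print(codecs.BOM_UTF8.decode("latin_1"))  # 'ï»¿'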

1 Answer

Credit to Barmar for suggesting the use of .decode().

I found a solution that looks clean to me.

If I encode the string with the latin_1 codec, the Chinese characters are not encoded into bytes: because errors='ignore' is passed, .encode() drops every character that has no latin_1 mapping. If I then decode the resulting bytes back with .decode(), I get a string without the Chinese characters. Comparing the original string with this round-tripped string tells me whether the file contained any characters outside latin_1.

with open(Filename, "r", encoding="utf-8") as edifact_file:
    edifact_string = edifact_file.read()

# errors='ignore' silently drops every character without a latin_1 mapping
encoded_edifact_string = edifact_string.encode('latin_1', 'ignore')
round_tripped_string = encoded_edifact_string.decode('latin_1')

if round_tripped_string == edifact_string:
    print('Is latin_1')
else:
    print('Is not latin_1')
print(edifact_string)
print(round_tripped_string)

The next question is whether looping over the strings and comparing each character is faster or slower than encoding, decoding, and comparing afterwards. But I can check that myself.
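
For what it's worth, a strict encode should make both the round trip and the comparison unnecessary, since it raises on the first offending character — a minimal alternative sketch:

def is_latin_1(text: str) -> bool:
    # errors='strict' (the default) raises UnicodeEncodeError on the
    # first character that has no latin_1 mapping
    try:
        text.encode('latin_1')
        return True
    except UnicodeEncodeError:
        return False

Timing both variants with timeit on real EDIFACT files would settle which is faster.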
