Short question:
What is the most efficient way to check whether a .TXT
file contains only characters defined in a selected ISO specification?
Question with full context: In the German energy market EDIFACT is used to automatically exchange information. Each file exchanged has a header segment which contains information about the contents of the file.
Please find an example of this segment below.
UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++
As you can see, after the UNB+ we find the content UNOC. This tells us which character set is used in the file. In this case it is ISO/IEC 8859-1.
I would like a Python method which checks whether the EDIFACT file contains only characters specified in ISO/IEC 8859-1.
The simplest solution I can think of is something like this (pseudocode):

ISO_string = "All characters contained in ISO/IEC 8859-1"
EDIFACT_string = "Contents of EDIFACT file"

for EDIFACT_char in EDIFACT_string:
    is_iso_char = False
    for ISO_char in ISO_string:
        if EDIFACT_char == ISO_char:
            is_iso_char = True
            break
    if not is_iso_char:
        raise_error("File contains char not contained in ISO/IEC 8859-1 and needs to be rejected")
        do_error_handling()
I studied business informatics and lack the theoretical background for algorithm theory. This feels like a very inefficient method and since EDIFACT needs to be processed quickly I don't want this functionality to be a bottleneck.
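As a sketch of how the nested loop above can be sped up: membership in a Python set is an O(1) lookup, so the whole check becomes O(n) instead of O(n·m). The names ISO_CHARS and contains_only_latin_1 are illustrative, not from the original question.

```python
# Build the set of all 256 characters representable in ISO/IEC 8859-1.
# latin-1 maps every byte value 0-255 to exactly one character.
ISO_CHARS = set(bytes(range(256)).decode("latin_1"))

def contains_only_latin_1(text: str) -> bool:
    # Set lookup is O(1) per character, so the whole check is O(n).
    return all(char in ISO_CHARS for char in text)

print(contains_only_latin_1("UNB+UNOC:3"))    # True
print(contains_only_latin_1("世界UNB+UNOC"))  # False
```

Since ISO/IEC 8859-1 characters are exactly the code points 0 to 255, `ord(char) <= 0xFF` would work just as well as the set lookup.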
Is there a built-in Python way to do what I want to achieve better?
Update #1:
I wrote this code as suggested by Barmar. To check it I added the Chinese characters for "World" to the file (世界). I expected .decode to throw an error. However, it just decodes the byte string and adds some strange characters at the beginning.
File Contents: 世界UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++
with open(Filename, "rb") as edifact_file:
    edifact_bytes = edifact_file.read()

try:
    verified_edifact_string = edifact_bytes.decode(encoding='latin_1', errors='strict')
except UnicodeDecodeError:
    print("String does not conform to ISO specification")
else:
    print(verified_edifact_string)
Prints: if I paste the output here, Stack Overflow cuts away some of the characters.
Edit #2:
According to the Python documentation, the ISO/IEC 8859-1 specification is called latin_1 when using Python's .decode() and .encode() methods.
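As a small sketch of that note (not from the original question): Python's codecs module accepts several spellings for the same encoding and normalizes them all to one canonical codec name, so 'latin_1' and 'ISO-8859-1' refer to the same codec.

```python
import codecs

# 'latin_1', 'latin-1', 'iso-8859-1', and 'L1' are all aliases for
# the same codec; codecs.lookup() normalizes them to one canonical name.
print(codecs.lookup("latin_1").name)
print(codecs.lookup("ISO-8859-1").name)
# Both lines print the same canonical codec name.
```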
- @Barmar I updated my question with a test I did. Is this what you suggested? – Merlin Nestler Commented Feb 7 at 20:32
- That's the general idea. – Barmar Commented Feb 7 at 20:32
- I think the strange characters at the beginning may be the BOM sequence. – Barmar Commented Feb 7 at 20:33
- @Barmar I believe the BOM sequence looks different. It seems to me like Python tried to translate the bytes of the Chinese characters with the latin_1 encoding. For some bytes it worked; for some there is no character defined in the ISO specification and a placeholder char is returned. I saw this a lot when working with Chinese characters when I was in Japan. – Merlin Nestler Commented Feb 7 at 20:40
- Any idea why no error is thrown? – Merlin Nestler Commented Feb 7 at 20:40
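The behaviour discussed in the comments can be demonstrated directly: latin-1 assigns a character to every byte value 0 to 255, so decoding bytes with it can never fail, and the UTF-8 bytes of the Chinese characters are simply decoded byte by byte into mojibake. A minimal sketch:

```python
# latin-1 defines a character for every byte value 0-255, so decoding
# arbitrary bytes with it never raises -- even with errors='strict'
# there is nothing to reject.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin_1", errors="strict")
print(len(decoded))  # 256 -- every single byte decoded

# The six UTF-8 bytes of 世界 are therefore decoded into six latin-1
# characters (mojibake) instead of raising an exception.
print(len("世界".encode("utf-8").decode("latin_1")))  # 6
```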
1 Answer
Credits to Barmar for suggesting the use of .decode().
I found a solution which looks smooth to me.
If I encode the string using the latin_1 encoding with errors='ignore', the Chinese characters are not encoded into bytes: the 'ignore' error handler drops every character that does not belong to latin_1. If I then convert the encoded bytes back using .decode(), I get a string without the Chinese characters. Comparing the original string with the round-tripped string then answers my question of whether the file contained any characters that don't belong to latin_1.
with open(Filename, "r", encoding="utf-8") as edifact_file:
    edifact_string = edifact_file.read()

encoded_edifact_string = edifact_string.encode('latin_1', 'ignore')
decoded_edifact_string = encoded_edifact_string.decode('latin_1')

if decoded_edifact_string == edifact_string:
    print('Is latin_1')
else:
    print('Is no latin_1')
print(edifact_string)
print(decoded_edifact_string)
The next question is whether looping over the strings and comparing each character is faster or slower than encoding, decoding, and comparing afterwards. But I can check that myself.
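A possible third variant, as a sketch (not part of the original answer): encoding with errors='strict' raises UnicodeEncodeError at the first offending character, so neither a round trip nor a full-string comparison is needed. The name is_latin_1 is illustrative.

```python
def is_latin_1(text: str) -> bool:
    # errors='strict' makes .encode() raise UnicodeEncodeError as soon
    # as it hits a character outside ISO/IEC 8859-1, so the check can
    # stop early instead of round-tripping the whole string.
    try:
        text.encode("latin_1", errors="strict")
        return True
    except UnicodeEncodeError:
        return False

print(is_latin_1("UNB+UNOC:3+9903323000007"))  # True
print(is_latin_1("世界UNB+UNOC"))              # False
```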