I'm running this code on my Python code with version 3.10.6:
def downloadRevistaXML():
url= f".zip"
try:
response = requests.get(url)
response.raise_for_status()
if response.headers.get("Content-Type") == "application/zip":
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
for file_name in z.namelist():
if file_name.endswith(".xml"):
xml_file = z.open(file_name)
xml_content = xml_file.read()
xml_content = xml_content.decode("utf-8")
#This doesn't print right
print(xml_content)
except requests.exceptions.RequestException as e:
raise RuntimeError(f"Error during download: {e}")
except zipfile.BadZipFile:
raise ValueError("The file downloaded is not a valid ZIP archive.")
I'm running this on Windows 10 with python 3.10.6.
When I directly run the code, it prints the xml content with the correct string encoding utf-8. One node looks like this:
<despacho codigo="IPAS009" nome="Publicação de pedido de registro para oposição (exame formal concluído)"/>
But, if I run the same code LOCALLY using an azure function, it prints like this:
<despacho codigo="IPAS009" nome="PublicaþÒo de pedido de registro para oposiþÒo (exame formal concluÝdo)"/>
I was thinking this was some problem with my code encoding something in the wrong way, I tried every way of decoding this later but didn't find any solution. Then I tested without the azure function and it works perfectly.
I'm using a http trigger Azure Function. Right now it's only hosting a server and when I call an endpoint it runs the function without any parameters.
I've tried forcing other encodings like xml_content = xml_content.decode("UTF-8")
but it doesn't change anything.
I've also tried encoding and decoding again but not luck.
-------- EDIT I'll add my comment here to make it more clear:
I'm printing on console.
This is what it looks like when I run "python myscript.py" without the ".decode("utf-8")"
<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)">
This below is what shows up when I run "func host start --tail --port 5083
"
<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)">
However, as in the original post, when I run "python myscript.py" and decode utf8 it works fine.
Printing "print(ascii(xml_content))" using the Azure function:
<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)"/>
Now "print(ascii(xml_content))" on "python myscript.py"
<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)"/>
I'm running this code on my Python code with version 3.10.6:
def downloadRevistaXML():
url= f"https://revistas.inpi.gov.br/txt/RM2823.zip"
try:
response = requests.get(url)
response.raise_for_status()
if response.headers.get("Content-Type") == "application/zip":
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
for file_name in z.namelist():
if file_name.endswith(".xml"):
xml_file = z.open(file_name)
xml_content = xml_file.read()
xml_content = xml_content.decode("utf-8")
#This doesn't print right
print(xml_content)
except requests.exceptions.RequestException as e:
raise RuntimeError(f"Error during download: {e}")
except zipfile.BadZipFile:
raise ValueError("The file downloaded is not a valid ZIP archive.")
I'm running this on Windows 10 with python 3.10.6.
When I directly run the code, it prints the xml content with the correct string encoding utf-8. One node looks like this:
<despacho codigo="IPAS009" nome="Publicação de pedido de registro para oposição (exame formal concluído)"/>
But, if I run the same code LOCALLY using an azure function, it prints like this:
<despacho codigo="IPAS009" nome="PublicaþÒo de pedido de registro para oposiþÒo (exame formal concluÝdo)"/>
I was thinking this was some problem with my code encoding something in the wrong way, I tried every way of decoding this later but didn't find any solution. Then I tested without the azure function and it works perfectly.
I'm using a http trigger Azure Function. Right now it's only hosting a server and when I call an endpoint it runs the function without any parameters.
I've tried forcing other encodings like xml_content = xml_content.decode("UTF-8")
but it doesn't change anything.
I've also tried encoding and decoding again but not luck.
-------- EDIT I'll add my comment here to make it more clear:
I'm printing on console.
This is what it looks like when I run "python myscript.py" without the ".decode("utf-8")"
<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)">
This below is what shows up when I run "func host start --tail --port 5083
"
<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)">
However, as in the original post, when I run "python myscript.py" and decode utf8 it works fine.
Printing "print(ascii(xml_content))" using the Azure function:
<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)"/>
Now "print(ascii(xml_content))" on "python myscript.py"
<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)"/>
Share
Improve this question
edited Mar 19 at 1:02
SFG
asked Mar 18 at 17:19
SFGSFG
11 bronze badge
5
|
1 Answer
Reset to default 0I have modified your code to explicitly set the encoding to UTF-8 while reading and writing the file by adding the below lines.
sys.stdout.reconfigure(encoding="utf-8")
locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
xml_content = xml_file.read().decode("utf-8", errors="replace")
function_app.py :
import azure.functions as func
import requests
import zipfile
import io
import json
import sys
import locale
app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)
sys.stdout.reconfigure(encoding="utf-8")
locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
def download_revista_xml():
url = "https://revistas.inpi.gov.br/txt/RM2823.zip"
try:
response = requests.get(url)
response.raise_for_status()
if response.headers.get("Content-Type") == "application/zip":
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
for file_name in z.namelist():
if file_name.endswith(".xml"):
xml_file = z.open(file_name)
xml_content = xml_file.read().decode("utf-8", errors="replace")
return xml_content
except requests.exceptions.RequestException as e:
return f"Error during download: {e}"
except zipfile.BadZipFile:
return "The file downloaded is not a valid ZIP archive."
@app.function_name(name="DownloadXML")
@app.route(route="download-xml", methods=["GET"])
def main(req: func.HttpRequest) -> func.HttpResponse:
xml_content = download_revista_xml()
if isinstance(xml_content, str):
return func.HttpResponse(
json.dumps({"xml": xml_content}, ensure_ascii=False),
mimetype="application/json",
charset="utf-8",
)
return func.HttpResponse("Failed to retrieve XML", status_code=500)
I can see the correct UTF-8 encoded output both locally and in production.
Local Output :
Azure Function App Output :
.decode('utf8')
would work without actually being encoded in UTF-8 due to it's decoding algorithm. It would be interesting to see a debug representation of the Unicode string that is decoded. Addprint(ascii(xml_content))
and report the results in both cases (should be the same). It seems likely Python is encoding the Unicode string to the detected terminal's encoding (some single byte encoding) but the terminal is actually using a different encoding. If it was UTF-8-encoded, each accented glyph would be two or more characters if a single-byte encoding was used. – Mark Tolonen Commented Mar 18 at 20:03sys.stdout.reconfigure(encoding='utf-8')
at the start of your script to fix console output encoding in Azure Functions. – Dasari Kamali Commented Mar 19 at 3:39'Publicação'.encode( 'cp1252').decode( 'cp850')
returns'PublicaþÒo'
… – JosefZ Commented Mar 19 at 9:46