
Azure Functions decoding UTF-8 strings in Python - Stack Overflow


I'm running this code with Python 3.10.6:

import io
import zipfile

import requests


def downloadRevistaXML():
    url = "https://revistas.inpi.gov.br/txt/RM2823.zip"

    try:
        response = requests.get(url)
        response.raise_for_status()

        if response.headers.get("Content-Type") == "application/zip":
            with zipfile.ZipFile(io.BytesIO(response.content)) as z:
                for file_name in z.namelist():
                    if file_name.endswith(".xml"):
                        # Read the raw bytes of the XML member, then decode them
                        with z.open(file_name) as xml_file:
                            xml_content = xml_file.read()

                        xml_content = xml_content.decode("utf-8")

                        # This doesn't print right
                        print(xml_content)

    except requests.exceptions.RequestException as e:
        raise RuntimeError(f"Error during download: {e}")
    except zipfile.BadZipFile:
        raise ValueError("The file downloaded is not a valid ZIP archive.")

I'm running this on Windows 10 with Python 3.10.6.

When I run the code directly, it prints the XML content correctly as UTF-8. One node looks like this:

<despacho codigo="IPAS009" nome="Publicação de pedido de registro para oposição (exame formal concluído)"/>

But if I run the same code locally through an Azure Function, it prints like this:

<despacho codigo="IPAS009" nome="PublicaþÒo de pedido de registro para oposiþÒo (exame formal concluÝdo)"/>

I thought my code was encoding something the wrong way, and I tried every way of decoding the content afterwards, but I didn't find a solution. Then I tested without the Azure Function and it works perfectly.

I'm using an HTTP-triggered Azure Function. Right now it only hosts a server, and when I call an endpoint it runs the function without any parameters.

I've tried forcing other spellings of the encoding, like xml_content = xml_content.decode("UTF-8"), but it doesn't change anything.

I've also tried encoding and decoding again, but no luck.

EDIT: I'll add my comment here to make it clearer:

I'm printing to the console.

This is what I get when I run "python myscript.py" without the .decode("utf-8") call:

<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)">

This is what shows up when I run "func host start --tail --port 5083":

<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)">

However, as in the original post, when I run "python myscript.py" and decode as UTF-8, it works fine.

Output of print(ascii(xml_content)) using the Azure Function:

<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)"/>

Output of print(ascii(xml_content)) with "python myscript.py":

<despacho codigo="IPAS009" nome="Publica\xc3\xa7\xc3\xa3o de pedido de registro para oposi\xc3\xa7\xc3\xa3o (exame formal conclu\xc3\xaddo)"/>
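
A small check that may help when reading the escaped output above: `ascii()` renders raw UTF-8 bytes and a decoded `str` differently. This is only a sketch using one sample word from the question. The variable names are illustrative and not part of the original script.

```python
# Sample bytes: the UTF-8 encoding of one word from the XML shown above.
raw = b"Publica\xc3\xa7\xc3\xa3o"
text = raw.decode("utf-8")

# ascii() on bytes shows one \x escape per byte, so two per accented letter.
raw_repr = ascii(raw)
# ascii() on the decoded str shows one escape per code point.
text_repr = ascii(text)

print(raw_repr)
print(text_repr)
```

If `ascii(xml_content)` still shows `\xc3\xa7` after the `.decode("utf-8")` call, it may be worth checking whether the value being printed is the decoded string or the original bytes.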
Asked Mar 18 at 17:19 by SFG, edited Mar 19 at 1:02
  • How and where do you print? To a console? Do the console and the font it uses support UTF-8? – Robert Commented Mar 18 at 17:19
  • It's unlikely that .decode('utf8') would work without the data actually being encoded in UTF-8, due to its decoding algorithm. It would be interesting to see a debug representation of the decoded Unicode string. Add print(ascii(xml_content)) and report the results in both cases (they should be the same). It seems likely that Python is encoding the Unicode string to the detected terminal encoding (some single-byte encoding) while the terminal is actually using a different encoding. If the output were UTF-8 bytes shown in a single-byte encoding, each accented glyph would appear as two or more characters. – Mark Tolonen Commented Mar 18 at 20:03
  • I added the extra info you asked for to my original post. @Robert I first noticed this problem when I sent the data as a string to a webhook I'm using. I'm saving data from the same source to a database, but when I run json.loads on an array containing that string, it gets decoded incorrectly. I later tested by fetching the same data from appscripts and it works fine. The wrong decoding only happens when I run via the Azure Function. – SFG Commented Mar 19 at 1:08
  • @SFG Set sys.stdout.reconfigure(encoding='utf-8') at the start of your script to fix console output encoding in Azure Functions. – Dasari Kamali Commented Mar 19 at 3:39
  • This is a case of mojibake: 'Publicação'.encode('cp1252').decode('cp850') returns 'PublicaþÒo' – JosefZ Commented Mar 19 at 9:46
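
The round trip described in the last comment can be reproduced in a few lines. This is a minimal sketch, assuming (as that comment suggests) that the text was written as Windows-1252 bytes and displayed by a console using code page 850. The variable names are illustrative.

```python
original = "Publicação"

# Bytes as written by Python when stdout is assumed to be Windows-1252.
sent_bytes = original.encode("cp1252")
# The same bytes as displayed by a console that reads them as code page 850.
garbled = sent_bytes.decode("cp850")

# Reversing the two steps recovers the original text.
repaired = garbled.encode("cp850").decode("cp1252")

# For comparison: UTF-8 bytes misread as cp850 give two characters per accent,
# which is not what the question shows.
utf8_misread = original.encode("utf-8").decode("cp850")

# Print escaped forms so the result is readable on any console.
print(ascii(garbled), ascii(repaired), len(utf8_misread))
```

Because the garbled form has one wrong character per accented letter rather than two, the decoded string itself looks fine and the mismatch is more likely between the encoding Python writes with and the code page the console displays.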

1 Answer

I modified your code to set the encoding to UTF-8 explicitly, both when decoding the XML and when writing output, by adding the lines below.

sys.stdout.reconfigure(encoding="utf-8")
locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
xml_content = xml_file.read().decode("utf-8", errors="replace")
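
To illustrate what the `reconfigure` call changes, here is a small sketch that uses an in-memory stream as a stand-in for a console opened with code page 850. The helper `force_utf8` and the stand-in stream are hypothetical. Whether `sys.stdout` supports `reconfigure` inside a given host process is an assumption, so the helper checks for it first.

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure()."""
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False  # e.g. stdout was replaced by an object without reconfigure()
    reconfigure(encoding="utf-8", errors="replace")
    return True


# Stand-in for a console stream that was opened with code page 850.
raw = io.BytesIO()
console = io.TextIOWrapper(raw, encoding="cp850")

console.write("ç")
console.flush()
before = raw.getvalue()  # bytes written with the cp850 encoder

changed = force_utf8(console)
console.write("ç")
console.flush()
after = raw.getvalue()[len(before):]  # bytes written after switching to UTF-8

print(before, after, changed)
```

In a real script the call would be `force_utf8(sys.stdout)`. The terminal still has to display UTF-8 for the output to look right.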

function_app.py:

import azure.functions as func
import requests
import zipfile
import io
import json
import sys
import locale

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

sys.stdout.reconfigure(encoding="utf-8")
locale.setlocale(locale.LC_ALL, "en_US.UTF-8")

def download_revista_xml():
    url = "https://revistas.inpi.gov.br/txt/RM2823.zip"

    try:
        response = requests.get(url)
        response.raise_for_status()

        if response.headers.get("Content-Type") == "application/zip":
            with zipfile.ZipFile(io.BytesIO(response.content)) as z:
                for file_name in z.namelist():
                    if file_name.endswith(".xml"):
                        xml_file = z.open(file_name)
                        xml_content = xml_file.read().decode("utf-8", errors="replace")
                        return xml_content

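    except requests.exceptions.RequestException as e:
        # Report the failure and return None so the caller responds with HTTP 500
        print(f"Error during download: {e}", file=sys.stderr)
    except zipfile.BadZipFile:
        print("The file downloaded is not a valid ZIP archive.", file=sys.stderr)

    return None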

@app.function_name(name="DownloadXML")
@app.route(route="download-xml", methods=["GET"])
def main(req: func.HttpRequest) -> func.HttpResponse:
    xml_content = download_revista_xml()

    if isinstance(xml_content, str):
        return func.HttpResponse(
            json.dumps({"xml": xml_content}, ensure_ascii=False),
            mimetype="application/json",
            charset="utf-8",
        )
    return func.HttpResponse("Failed to retrieve XML", status_code=500)

I can see the correct UTF-8 encoded output both locally and in production.
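
Since the question's comments mention `json.loads` on the receiving side, here is a short sketch of how `ensure_ascii` affects the JSON text. The payload key `nome` is just a sample taken from the XML attribute. Nothing here is specific to Azure Functions.

```python
import json

payload = {"nome": "Publicação"}

# Default: non-ASCII characters are written as \uXXXX escapes (pure ASCII text).
escaped = json.dumps(payload)
# ensure_ascii=False keeps the characters as they are, so the body must be
# sent as UTF-8 and declared as such.
literal = json.dumps(payload, ensure_ascii=False)
body = literal.encode("utf-8")

# Both forms decode back to the same data.
from_escaped = json.loads(escaped)
from_body = json.loads(body.decode("utf-8"))

print(escaped)
print(ascii(literal))
```

The escaped form survives a receiver that guesses the wrong charset, while the literal form is more readable but depends on the `charset="utf-8"` declaration being honored.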
