python - UnicodeDecodeError codec can't decode error using pandas read_csv - Stack Overflow

I'm opening a csv file using pandas.

import pandas as pd 
df = pd.read_csv('/file/planned.csv') 

The file contains about 2,000 records collected from all over the world. When I try to open it with pandas, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 34: invalid continuation byte

After searching the web, I tried the following encoding options, hoping one of them would open the file. However, I still get an error for each encoding I tried.

utf-8

df_planned = pd.read_csv('/content/sample_data/planned.csv', encoding='utf-8')
    > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 34: invalid continuation byte

utf-16

df_planned = pd.read_csv('/content/sample_data/planned.csv', encoding='utf-16') 
    > UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 234-235: illegal encoding

euc-kr

df_planned = pd.read_csv('/content/sample_data/planned.csv', encoding='euc-kr')
    > UnicodeDecodeError: 'euc_kr' codec can't decode byte 0x84 in position 37: illegal multibyte sequence

I'm still not able to load the file into a DataFrame with pandas.

cp949

df_planned = pd.read_csv('/content/sample_data/planned.csv', encoding='cp949')
    > UnicodeDecodeError: 'cp949' codec can't decode byte 0xe8 in position 43: illegal multibyte sequence

Could anyone help? Thank you so much.

Asked 13 hours ago by headfat
  • csv files are supposed to be text files, and your file does not appear to be a text file. What exactly is the data you are storing in the CSV file? – President James K. Polk Commented 12 hours ago
  • If you open the file in an editor which lets you choose the encoding, can you figure out what encoding the file is actually in? Second question, if the data is from all over the world, is it possible sub-files of different encodings were concatenated together? In that case, you might have a bigger problem, you'd have to split the file back out to parts with a single encoding each. – joanis Commented 12 hours ago
  • I think the CSV file contains non-text characters too. I also tried to open the file in an editor (Visual Studio Code), which displays the message "The file is not displayed in the text editor because it is either binary or uses an unsupported text encoding." It seems I'm dealing with something that pandas or Python can't handle... – headfat Commented 10 hours ago
  • The file contains social media data, and I assume the problem is caused by influencer IDs that consist of special characters or emoticons. These IDs are also in different languages, such as Japanese, Chinese, Arabic, Spanish, and Vietnamese. Am I dealing with a dataset that cannot be handled with Python in CSV format? – headfat Commented 10 hours ago
  • So my approach here would be to try to determine the encoding line by line. You can read the file as bytes instead of text, and then maybe for each line try to convert it to various encoding and see which one passes? open("file", "rb") will open the file in binary mode. Then if you have a line in line, you can try to decode it with line.decode(encoding) for various encodings, and see which encodings give you exceptions or not. – joanis Commented 1 hour ago
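
The line-by-line approach from the comment above could be sketched roughly as follows. This is only an illustration: the candidate list `CANDIDATES`, the helper names `decode_line` and `decode_file`, and the choice of codecs are assumptions, not details from the question. A successful decode does not prove the guess is right, since some byte sequences are valid in several encodings, so the order of the candidates matters.

```python
# Hypothetical candidate list (an assumption for illustration). Order matters:
# the first codec that decodes without an exception wins.
CANDIDATES = ["utf-8", "cp949", "shift_jis", "gb18030"]


def decode_line(raw, candidates=CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes raw.

    Returns (None, None) if every candidate raises UnicodeDecodeError.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None, None


def decode_file(path, candidates=CANDIDATES):
    """Decode a file line by line and record lines no candidate accepts.

    Splitting on b"\n" is only safe for ASCII-compatible encodings,
    so UTF-16 is deliberately not among the candidates.
    """
    decoded, failed = [], []
    with open(path, "rb") as f:  # binary mode: nothing is decoded on read
        for lineno, raw in enumerate(f, start=1):
            text, enc = decode_line(raw, candidates)
            if text is None:
                failed.append(lineno)
            else:
                decoded.append((lineno, enc, text))
    return decoded, failed
```

Calling `decode_file` on the CSV path would return the decoded lines together with the encoding that accepted each one, plus the numbers of the lines that none of the candidates could decode. Those failing lines are the ones worth inspecting by hand.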

1 Answer

You will first have to detect the encoding of the CSV file, for example with chardet:

import chardet
import pandas as pd

with open('your_file.csv', 'rb') as f:
    # Reads the whole file; for a large file, a sample such as f.read(100000) may be enough
    enc = chardet.detect(f.read())

encoding = enc['encoding']

Once you know the encoding, you can pass it to read_csv as you did before:

df_planned = pd.read_csv('/content/sample_data/planned.csv', encoding=encoding)

(replace the path with your own file; encoding holds the detected encoding)
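
If no single encoding fits, for instance because chardet reports `None` or a low-confidence guess, one possible fallback is to decode lossily, so that undecodable bytes become U+FFFD instead of raising an error. Below is a minimal sketch using only the standard library. The helper name `read_text_lossy` and the default of UTF-8 are assumptions for illustration, not part of the answer above.

```python
import io


def read_text_lossy(path, encoding="utf-8"):
    """Decode a whole file, replacing undecodable bytes with U+FFFD.

    Returns an in-memory text buffer that CSV readers accepting a
    file-like object can consume.
    """
    with open(path, "rb") as f:
        raw = f.read()
    # errors="replace" never raises; invalid byte sequences become "\ufffd"
    text = raw.decode(encoding, errors="replace")
    return io.StringIO(text)
```

The returned buffer can be passed to `pd.read_csv` in place of a path. Recent pandas versions (1.3 and later, as far as I know) also accept `encoding_errors='replace'` in `read_csv` directly. Either way the replaced characters are lost, so this is a way to get the file open and see which rows are affected, not a way to recover the original IDs.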
