I'm trying to open a CSV file using pandas.

import pandas as pd 
df = pd.read_csv('/file/planned.csv')

The file contains about 2,000 records collected from all over the world. When I try to open it with pandas, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 34: invalid continuation byte

After searching the web, I tried the following encoding options, hoping one of them would let me open the file. However, I still get an error with each encoding I tried.

utf-8

df_planned = pd.read_csv('/content/sample_data/planned.csv', encoding='utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 34: invalid continuation byte

utf-16

df_planned = pd.read_csv('/content/sample_data/planned.csv', encoding='utf-16') 
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 234-235: illegal encoding

euc-kr

df_planned = pd.read_csv('/content/sample_data/planned.csv', encoding='euc-kr')
UnicodeDecodeError: 'euc_kr' codec can't decode byte 0x84 in position 37: illegal multibyte sequence

cp949

df_planned = pd.read_csv('/content/sample_data/planned.csv', encoding='cp949')
UnicodeDecodeError: 'cp949' codec can't decode byte 0xe8 in position 43: illegal multibyte sequence

I'm still not able to load the file into a DataFrame with pandas.

Could anyone help? Thank you so much.

asked 13 hours ago by headfat

CSV files are supposed to be text files, and your file does not appear to be a text file. What exactly is the data you are storing in the CSV file? – President James K. Polk Commented 12 hours ago
If you open the file in an editor which lets you choose the encoding, can you figure out what encoding the file is actually in? Second question: if the data is from all over the world, is it possible that sub-files in different encodings were concatenated together? In that case, you might have a bigger problem: you'd have to split the file back into parts with a single encoding each. – joanis Commented 12 hours ago
I think I found that the CSV file contains non-text characters too. Also, I tried opening the file in an editor (Visual Studio Code), which displays the following message: "The file is not displayed in the text editor because it is either binary or uses an unsupported text encoding." Seems like I'm dealing with something that can't be handled with pandas or Python... – headfat Commented 10 hours ago
The file contains social media data, so I assume the problem is caused by influencer IDs that consist of special characters or emoticons. These IDs are also in different languages, such as Japanese, Chinese, Arabic, Spanish, and Vietnamese. Am I dealing with a dataset that cannot be handled with Python in CSV format? – headfat Commented 10 hours ago
So my approach here would be to try to determine the encoding line by line. You can read the file as bytes instead of text, and then for each line try to decode it with various encodings and see which one passes. open("file", "rb") will open the file in binary mode. Then, if you have a line in line, you can try line.decode(encoding) for various encodings and see which ones raise exceptions. – joanis Commented 1 hour ago
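
The line-by-line approach from the last comment could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the candidate list and the helper names `probe_line` and `probe_file` are assumptions, and a decode that succeeds does not prove the encoding is the right one, since several legacy codecs accept almost any byte sequence. Splitting on newlines in binary mode also assumes the file is not UTF-16.

```python
from collections import Counter

# Candidate encodings to try, in order. This list is an assumption for
# illustration; adjust it to the languages you expect in the data.
CANDIDATES = ["utf-8", "cp949", "shift_jis", "gb18030", "cp1256", "cp1252"]


def probe_line(raw, candidates=CANDIDATES):
    """Return the first candidate that decodes `raw` without error, else None."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


def probe_file(path, candidates=CANDIDATES):
    """Count, per encoding, how many lines it was the first to decode cleanly."""
    counts = Counter()
    with open(path, "rb") as f:  # binary mode: nothing is decoded on read
        for raw in f:
            counts[probe_line(raw, candidates)] += 1
    return counts
```

Calling `probe_file('/content/sample_data/planned.csv')` returns a `Counter`; a large count under the `None` key would suggest that many lines are not text in any of the candidate encodings.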

1 Answer

You will first have to find the encoding of the CSV file, for example with chardet:

import chardet
import pandas as pd

# Read the raw bytes and let chardet guess the encoding
with open('your_file.csv', 'rb') as f:
    enc = chardet.detect(f.read())  # or f.readline() if the file is large

encoding = enc['encoding']  # may be None if chardet cannot guess

Once you know the encoding, you can read the file the same way as before, passing it to read_csv:

df_planned = pd.read_csv('/content/sample_data/planned.csv', encoding=encoding)

(replace encoding with the encoding that was found)
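
If chardet reports `None` or a low-confidence guess, it may be worth checking whether the file is text at all, as the comments under the question suggest. Below is a minimal sketch using only the standard library. The names `sniff_bytes`, `context_around`, and `inspect_file` are hypothetical, and the signature table covers just a few common formats, so a `None` result does not prove the file is plain text.

```python
# Leading byte signatures of a few common non-text formats (not exhaustive).
SIGNATURES = {
    b"PK\x03\x04": "ZIP container (e.g. .xlsx)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE2 container (e.g. legacy .xls)",
    b"\x1f\x8b": "gzip-compressed data",
}


def sniff_bytes(head):
    """Return a description if `head` starts with a known signature, else None."""
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return None


def context_around(data, position, width=16):
    """Return the raw bytes surrounding an offset reported in a UnicodeDecodeError."""
    start = max(0, position - width)
    return data[start:position + width]


def inspect_file(path, position=34):
    """Sniff the file header and show the bytes near `position`.

    The default of 34 is the offset from the utf-8 error in the question.
    """
    with open(path, "rb") as f:
        data = f.read(max(4096, position + 64))
    return sniff_bytes(data[:8]), context_around(data, position)
```

For example, `inspect_file('/content/sample_data/planned.csv')` returns a pair: a format description (or `None`) and the bytes around offset 34. If the first element names a spreadsheet container, the file was probably saved as an Excel workbook with a `.csv` extension, and no text encoding will open it.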

python - UnicodeDecodeError codec can't decode error using pandas read_csv - Stack Overflow
