I'm using something similar to this as input.txt
040525 $$$$$ 9999 12345
040525 $$$$$ 8888 12345
040525 $$$$$ 7777 12345
040525 $$$$$ 6666 12345
Because of the way this input is pre-processed, I cannot use pd.read_csv correctly. I must first create a list from the input, then create a DataFrame from that list:
import pandas as pd

data_list = []
with open('input.txt', 'r') as data:
    for line in data:
        # each element is one whole line, stripped of the trailing newline
        data_list.append(line.strip())
df = pd.DataFrame(data_list)
This results in each row being treated as a single column:
print(df.shape)
print(df)
print(df.columns.tolist())
(4, 1)
                         0
0  040525 $$$$$ 9999 12345
1  040525 $$$$$ 8888 12345
2  040525 $$$$$ 7777 12345
3  040525 $$$$$ 6666 12345
[0]
How can I create 4 columns in this DataFrame? Desired output would be:
(4, 4)
       a      b     c      d
0  40525  $$$$$  9999  12345
1  40525  $$$$$  8888  12345
2  40525  $$$$$  7777  12345
3  40525  $$$$$  6666  12345
['a', 'b', 'c', 'd']
asked Feb 17 at 15:23 by yodish
1 Answer
In your loop, split each string into a list of substrings, one per field:
for line in data:
    data_list.append(line.strip().split())
Passing this list of lists to pd.DataFrame gives you the correct number of columns.
Alternatively, keep your loop as it is, but create a Series and use str.split with expand=True. This might be less efficient, but could be more robust if you don't have a consistent number of fields:
data_list = []
with open('input.txt', 'r') as data:
    for line in data:
        data_list.append(line.strip())
# split every line on whitespace and expand the pieces into columns
df = pd.Series(data_list).str.split(expand=True)
Output:
        0      1     2      3
0  040525  $$$$$  9999  12345
1  040525  $$$$$  8888  12345
2  040525  $$$$$  7777  12345
3  040525  $$$$$  6666  12345
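To illustrate the robustness point, here is a minimal sketch of how str.split with expand=True behaves when lines have different numbers of fields. The ragged lines are hypothetical and not part of the original input.

```python
import pandas as pd

# Hypothetical lines with differing field counts (assumption, not from input.txt).
lines = [
    "040525 $$$$$ 9999 12345",
    "040525 $$$$$ 8888",
    "040525 $$$$$ 7777 12345 extra",
]

# expand=True creates one column per field; rows with fewer fields
# are padded with missing values instead of raising an error.
ragged = pd.Series(lines).str.split(expand=True)
print(ragged.shape)
print(ragged)
```

The frame is as wide as the longest line, so unexpected extra fields show up as additional columns that can be inspected afterwards.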
For the first approach, if you want column names:
df = pd.DataFrame(data_list, columns=['a', 'b', 'c', 'd'])
Output:
        a      b     c      d
0  040525  $$$$$  9999  12345
1  040525  $$$$$  8888  12345
2  040525  $$$$$  7777  12345
3  040525  $$$$$  6666  12345
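Putting the first approach together, a self-contained sketch might look like the following. It assumes an io.StringIO object as a stand-in for the open file so it can run without input.txt; the names sample, rows, named, and a_as_int are illustrative only.

```python
import io
import pandas as pd

# Stand-in for open('input.txt', 'r'); rows copied from the question.
sample = io.StringIO(
    "040525 $$$$$ 9999 12345\n"
    "040525 $$$$$ 8888 12345\n"
    "040525 $$$$$ 7777 12345\n"
    "040525 $$$$$ 6666 12345\n"
)

# Split each non-empty line on whitespace into a list of fields.
rows = [line.split() for line in sample if line.strip()]

# A list of lists becomes one column per field.
named = pd.DataFrame(rows, columns=["a", "b", "c", "d"])
print(named.shape)
print(named.columns.tolist())

# All values are still strings, so '040525' keeps its leading zero.
# Converting to int drops it, as in the question's desired output.
a_as_int = named["a"].astype(int)
```

Whether to convert column a depends on whether the leading zero is meaningful in your data.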
Comments:
[...] pd.read_csv. If you think it is a problem, that's of course fine, but then please edit your original question with more details and a proper minimal reproducible example so that users can run your code. This question has nothing new that should alter users' opinion. – ouroboros1 Commented Feb 17 at 15:31
[...] 'input.txt' as provided, pd.read_csv('input.txt', sep=r'\s+', names=[*'abcd']) should normally work perfectly fine. If it doesn't, please update the original question with the result of trying that, together with 1) any error message, 2) unexpected output vs expected output, 3) version of pd. – ouroboros1 Commented Feb 17 at 15:41
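The read_csv call suggested in the comments can be tried on the sample rows with a sketch like this one. It assumes io.StringIO in place of the real file; the dtype=str variant is an addition for illustration, not something the commenter proposed.

```python
import io
import pandas as pd

# Sample rows copied from the question, used in place of 'input.txt'.
text = (
    "040525 $$$$$ 9999 12345\n"
    "040525 $$$$$ 8888 12345\n"
    "040525 $$$$$ 7777 12345\n"
    "040525 $$$$$ 6666 12345\n"
)

# Whitespace-separated parsing with explicit column names.
# Numeric-looking columns are inferred as integers, so 040525 becomes 40525.
df_default = pd.read_csv(io.StringIO(text), sep=r"\s+", names=[*"abcd"])

# Reading everything as strings keeps the leading zero.
df_str = pd.read_csv(io.StringIO(text), sep=r"\s+", names=[*"abcd"], dtype=str)

print(df_default.shape)
print(df_default.dtypes)
```

On this sample the default type inference matches the question's desired output, where column a reads 40525.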