dataframe - Julia: equivalent of Python's selection by MultiIndex level (especially columns) - Stack Overflow

I am new to Julia and Julia DataFrames. My understanding is that DataFrames does not support MultiIndexing, which generally does not pose many problems, but translating some Pythonic habits to Julia is difficult. I wonder how one could load the data and subselect features by column level, as in the pandas example below.

import numpy as np
import pandas as pd

#generating sample data
nsmpls = 10
smpls = [f'smpl{j}' for j in range(nsmpls)]

nfeats = 5
feats = [f'feat{j}' for j in range(nfeats)]

data = np.random.rand(nfeats, nsmpls)

countries = ['France'] * 2 + ['UK'] * 3 + ['US'] * 5

df = pd.DataFrame(data, index=feats, columns=pd.MultiIndex.from_tuples(zip(countries, smpls)))
df.to_csv('./data.tsv', sep='\t')

#---------------------------------------------------------------------
#loading dataset
df = pd.read_csv('./data.tsv', sep='\t', index_col=0, header=[0,1])

#extracting subset
dg = df.xs('France', level=0, axis=1)
print(dg.shape)

#iterating
for country, group in df.groupby(level=0, axis=1):
    print('#samples: {}'.format(group.shape[1]))

Asked 2 days ago by Roger V.; edited 2 days ago by Christoph Rackwitz.

1 Answer

Something like this?

using DataFrames, CSV

# Used your sample data
df = DataFrame(CSV.File("data.tsv"))

# Filter the columns by country name
france_cols = findall(x -> occursin("France", x), names(df))

# Subset the df
dg = select(df, france_cols)

# Optional: use the sample names ("smplX") as column names instead of the country names
rename!(dg, collect(dg[1, :]))
dg = dg[2:end, :]

display(dg)
println(size(dg))

By default, duplicate column names get a numeric suffix when the file is loaded (France, France_1, etc.), so I selected all the columns whose names contain "France".
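
To see where the repeated names come from, it can help to look at the raw file: as far as I can tell, pandas writes the two column levels as two header rows, so the first row repeats each country name once per sample. Below is a minimal sketch in Python (the question's language) that uses only the standard library. The inline SAMPLE_TSV and the helper columns_by_country are made up for illustration and stand in for data.tsv; this is not part of the Julia solution above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt standing in for data.tsv: row 1 holds the country
# level, row 2 the sample level. The first cell of each header row is
# empty because the index (feature names) has no name.
SAMPLE_TSV = (
    "\tFrance\tFrance\tUK\n"
    "\tsmpl0\tsmpl1\tsmpl2\n"
    "feat0\t0.1\t0.2\t0.3\n"
    "feat1\t0.4\t0.5\t0.6\n"
)


def columns_by_country(text):
    """Group column positions by the first header row (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    countries = next(reader)  # first header row: country level
    samples = next(reader)    # second header row: sample level
    groups = defaultdict(list)
    for pos, (country, sample) in enumerate(zip(countries, samples)):
        if pos == 0:
            continue  # skip the feature-name column
        groups[country].append((pos, sample))
    # The remaining rows are the data rows.
    return dict(groups), list(reader)


groups, rows = columns_by_country(SAMPLE_TSV)

# Keep only the columns whose first-level header is "France".
france = {row[0]: [float(row[pos]) for pos, _ in groups["France"]] for row in rows}

print([sample for _, sample in groups["France"]])
print(france)
```

Because the second header row is read as ordinary data unless the reader is told otherwise, the Julia code above renames the columns from the first row and then drops that row.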
