I am working with data from 174 subjects, stored in a dataframe (df_behavioral) where one row represents one subject. Some subjects are related to one another, as indicated by a variable called 'Family_ID', which assigns each subject to a family.
I need to split the sample into two subsamples of approximately equal size while ensuring that there are only unrelated subjects in one subsample. In other words: Subjects from the same family cannot be in the same subsample.
Additionally, the split should be stratified by Neuroticism scores (variable 'NEOFAC_N'), so that the distribution of Neuroticism scores is approximately equal in the two subsamples.
I would greatly appreciate your help!
This is my code so far. I was able to obtain two subsamples of unrelated subjects, but I still need help implementing the stratification by Neuroticism scores.
import numpy as np
import pandas as pd
# Set seed for reproducibility
np.random.seed(42)
# Generate sample dataset
num_subjects = 174
num_families = 87 # 87 unique families
# Create Family_IDs (maximally two subjects per family)
family_ids = np.repeat(np.arange(num_families), 2)[:num_subjects]
np.random.shuffle(family_ids)
# Generate Neuroticism scores (normally distributed, scale 0-40, integers)
neuroticism_scores = np.clip(np.random.normal(loc=20, scale=5, size=num_subjects), 0, 40).astype(int)
# Create random subject names
subject_ids = [f'Subject_{i}' for i in range(num_subjects)]
# Create dataframe
df_behavioral = pd.DataFrame({
    'Subject_ID': subject_ids,
    'Family_ID': family_ids,
    'NEOFAC_N': neuroticism_scores
})
print(df_behavioral.head()) # Preview dataset
siblings_list = []
for family in df_behavioral["Family_ID"].unique():
    siblings_list.append(df_behavioral.query("Family_ID==@family").index.values)
print("There are {} unique families in the dataset".format(len(siblings_list)))
subj_set1 = [family[0] for family in siblings_list]
subj_set2 = [family[1] for family in siblings_list if len(family)>1]
# Make sure subjects are sorted
subj_set1.sort()
subj_set2.sort()
print("There are {} unrelated subjects in set 1".format(len(subj_set1)))
print("There are {} unrelated subjects in set 2".format(len(subj_set2)))
# Save subsamples as DataFrames
sample_1 = df_behavioral.loc[subj_set1].reset_index(drop=True)
sample_2 = df_behavioral.loc[subj_set2].reset_index(drop=True)
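One possible way to add the missing stratification, sketched under the assumption (matching the generated data) that every family has at most two members: bin the scores into quantiles, then decide family by family which sibling goes into which subsample so that the bin counts of the two subsamples stay balanced. The function name stratified_family_split, the greedy assignment rule, and the choice of four quantile bins are illustrative, not part of the original code.

```python
import numpy as np
import pandas as pd

def stratified_family_split(df, score_col="NEOFAC_N", family_col="Family_ID",
                            n_bins=4, seed=42):
    """Split df into two subsamples so that siblings never share a subsample
    and the score-bin distributions stay approximately balanced.
    Assumes families of at most two members (as in the question)."""
    rng = np.random.default_rng(seed)
    # Quantile-bin the near-continuous scores for stratification
    bins = pd.qcut(df[score_col], q=n_bins, labels=False, duplicates="drop")
    counts1 = np.zeros(n_bins, dtype=int)  # bin counts in subsample 1
    counts2 = np.zeros(n_bins, dtype=int)  # bin counts in subsample 2
    set1, set2 = [], []
    families = [list(idx) for idx in df.groupby(family_col).groups.values()]
    rng.shuffle(families)  # random order so no family is systematically first
    for members in families:
        if len(members) == 1:
            s = members[0]
            b = bins.loc[s]
            # Put the singleton where its bin is under-represented
            if counts1[b] <= counts2[b]:
                set1.append(s); counts1[b] += 1
            else:
                set2.append(s); counts2[b] += 1
        else:
            a, c = members[0], members[1]
            ba, bc = bins.loc[a], bins.loc[c]
            # Try both orientations; keep the one with smaller bin imbalance
            fwd1 = counts1.copy(); fwd1[ba] += 1
            fwd2 = counts2.copy(); fwd2[bc] += 1
            rev1 = counts1.copy(); rev1[bc] += 1
            rev2 = counts2.copy(); rev2[ba] += 1
            if np.abs(fwd1 - fwd2).sum() <= np.abs(rev1 - rev2).sum():
                set1.append(a); set2.append(c)
                counts1, counts2 = fwd1, fwd2
            else:
                set1.append(c); set2.append(a)
                counts1, counts2 = rev1, rev2
    return (df.loc[set1].reset_index(drop=True),
            df.loc[set2].reset_index(drop=True))
```

Because every two-member family contributes exactly one subject to each subsample, the family constraint holds by construction; the greedy choice only steers which sibling lands where.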
asked Mar 6 at 11:16 by Johanna Popp, edited Mar 6 at 11:16 by ouroboros1
2 Answers
A possible solution is to use sklearn's sklearn.model_selection.StratifiedKFold, which seems to do exactly what you are looking for and integrates seamlessly with pandas and numpy.
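A minimal sketch of that idea: with n_splits=2, one fold's (train, test) pair already gives two halves of equal size. Since NEOFAC_N is near-continuous, the scores are binned first so stratification has a small number of classes. The toy scores and the four quantile bins are illustrative; note that StratifiedKFold on its own does not enforce the family constraint, so it would still have to be applied at the family level.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy scores standing in for NEOFAC_N (deterministic for illustration)
df = pd.DataFrame({"NEOFAC_N": np.arange(20)})

# Bin the near-continuous scores into quartiles for stratification
bins = pd.qcut(df["NEOFAC_N"], q=4, labels=False)

# n_splits=2: the first fold's (train, test) indices are the two halves
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
idx1, idx2 = next(skf.split(df, bins))
half1, half2 = df.iloc[idx1], df.iloc[idx2]
```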
You can try the function train_test_split, stratifying by whichever column you need. See the docs of sklearn.model_selection.train_test_split.
import pandas as pd
from sklearn.model_selection import train_test_split
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Age': [25, 30, 35, 40, 28, 33, 38, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Salary': [70000, 80000, 90000, 100000, 75000, 85000, 95000, 110000]
}
df = pd.DataFrame(data)
# Split the dataset into two parts, stratifying by the 'City' column
part1, part2 = train_test_split(df, test_size=0.5, random_state=42, stratify=df['City'], shuffle=True)
print("Part 1:")
print(part1)
print("\nPart 2:")
print(part2)
Results:
Part 1:
Name Age City Salary
3 David 40 Houston 100000
4 Eve 28 New York 75000
6 Grace 38 Chicago 95000
5 Frank 33 Los Angeles 85000
Part 2:
Name Age City Salary
0 Alice 25 New York 70000
2 Charlie 35 Chicago 90000
7 Henry 45 Houston 110000
1 Bob 30 Los Angeles 80000
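One caveat when carrying this over to the question's data: passing stratify=df['NEOFAC_N'] directly will typically raise a ValueError, because a near-continuous score has values that occur only once, and every stratum needs at least two members. A common workaround, sketched here with the question's column name plus a hypothetical helper column N_bin, is to bin the scores before splitting:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Simulated NEOFAC_N scores, generated the same way as in the question
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "NEOFAC_N": np.clip(rng.normal(20, 5, 174), 0, 40).astype(int)
})

# Quartile bins guarantee each stratum is large enough to split in half
df["N_bin"] = pd.qcut(df["NEOFAC_N"], q=4, labels=False, duplicates="drop")

half1, half2 = train_test_split(df, test_size=0.5, random_state=42,
                                stratify=df["N_bin"])
```

This balances the score distribution between the halves but, like the City example above, does not by itself keep siblings apart; for that, the split has to be applied per family.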