I am working with data from 174 subjects, stored in a dataframe (df_behavioral) where one row represents one subject. Some subjects are related to one another, as indicated by a variable called 'Family_ID', which assigns each subject to a family.
I need to split the sample into two subsamples of approximately equal size while ensuring that there are only unrelated subjects in one subsample. In other words: Subjects from the same family cannot be in the same subsample.
Additionally, the split should be stratified by Neuroticism scores (variable 'NEOFAC_N'), so that the distribution of Neuroticism scores is approximately equal in the two subsamples.
I would greatly appreciate your help!
This is my code so far. I was able to obtain two subsamples of unrelated subjects, but I still need help implementing the stratification by Neuroticism scores.
import numpy as np
import pandas as pd
# Set seed for reproducibility
np.random.seed(42)
# Generate sample dataset
num_subjects = 174
num_families = 87 # 87 unique families
# Create Family_IDs (maximally two subjects per family)
family_ids = np.repeat(np.arange(num_families), 2)[:num_subjects]
np.random.shuffle(family_ids)
# Generate Neuroticism scores (normally distributed, scale 0-40, integers)
neuroticism_scores = np.clip(np.random.normal(loc=20, scale=5, size=num_subjects), 0, 40).astype(int)
# Create random subject names
subject_ids = [f'Subject_{i}' for i in range(num_subjects)]
# Create dataframe
df_behavioral = pd.DataFrame({
    'Subject_ID': subject_ids,
    'Family_ID': family_ids,
    'NEOFAC_N': neuroticism_scores
})
print(df_behavioral.head()) # Preview dataset
siblings_list = []
for family in df_behavioral["Family_ID"].unique():
    siblings_list.append(df_behavioral.query("Family_ID==@family").index.values)
print("There are {} unique families in the dataset".format(len(siblings_list)))
subj_set1 = [family[0] for family in siblings_list]
subj_set2 = [family[1] for family in siblings_list if len(family)>1]
# Make sure subjects are sorted
subj_set1.sort()
subj_set2.sort()
print("There are {} unrelated subjects in set 1".format(len(subj_set1)))
print("There are {} unrelated subjects in set 2".format(len(subj_set2)))
# Save subsamples as DataFrames
sample_1 = df_behavioral.loc[subj_set1].reset_index(drop=True)
sample_2 = df_behavioral.loc[subj_set2].reset_index(drop=True)
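One possible way to add the missing stratification, sketched under the assumption (matching the generated data) that every family has at most two members: bin the scores into quantiles, then decide family by family which sibling goes into which subsample so that the bin counts of the two subsamples stay balanced. The function name stratified_family_split, the greedy assignment rule, and the choice of four quantile bins are illustrative, not part of the original code.

```python
import numpy as np
import pandas as pd

def stratified_family_split(df, score_col="NEOFAC_N", family_col="Family_ID",
                            n_bins=4, seed=42):
    """Split df into two subsamples so that siblings never share a subsample
    and the score-bin distributions stay approximately balanced.
    Assumes families of at most two members (as in the question)."""
    rng = np.random.default_rng(seed)
    # Quantile-bin the near-continuous scores for stratification
    bins = pd.qcut(df[score_col], q=n_bins, labels=False, duplicates="drop")
    counts1 = np.zeros(n_bins, dtype=int)  # bin counts in subsample 1
    counts2 = np.zeros(n_bins, dtype=int)  # bin counts in subsample 2
    set1, set2 = [], []
    families = [list(idx) for idx in df.groupby(family_col).groups.values()]
    rng.shuffle(families)  # random order so no family is systematically first
    for members in families:
        if len(members) == 1:
            s = members[0]
            b = bins.loc[s]
            # Put the singleton where its bin is under-represented
            if counts1[b] <= counts2[b]:
                set1.append(s); counts1[b] += 1
            else:
                set2.append(s); counts2[b] += 1
        else:
            a, c = members[0], members[1]
            ba, bc = bins.loc[a], bins.loc[c]
            # Try both orientations; keep the one with smaller bin imbalance
            fwd1 = counts1.copy(); fwd1[ba] += 1
            fwd2 = counts2.copy(); fwd2[bc] += 1
            rev1 = counts1.copy(); rev1[bc] += 1
            rev2 = counts2.copy(); rev2[ba] += 1
            if np.abs(fwd1 - fwd2).sum() <= np.abs(rev1 - rev2).sum():
                set1.append(a); set2.append(c)
                counts1, counts2 = fwd1, fwd2
            else:
                set1.append(c); set2.append(a)
                counts1, counts2 = rev1, rev2
    return (df.loc[set1].reset_index(drop=True),
            df.loc[set2].reset_index(drop=True))
```

Because every two-member family contributes exactly one subject to each subsample, the family constraint holds by construction; the greedy choice only steers which sibling lands where.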
asked Mar 6 at 11:16 by Johanna Popp, edited Mar 6 at 11:16 by ouroboros1
2 Answers
A possible solution is to use sklearn's sklearn.model_selection.StratifiedKFold, which seems to do exactly what you are looking for and integrates seamlessly with pandas and numpy.
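A minimal sketch of that idea: with n_splits=2, one fold's (train, test) pair already gives two halves of equal size. Since NEOFAC_N is near-continuous, the scores are binned first so stratification has a small number of classes. The toy scores and the four quantile bins are illustrative; note that StratifiedKFold on its own does not enforce the family constraint, so it would still have to be applied at the family level.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy scores standing in for NEOFAC_N (deterministic for illustration)
df = pd.DataFrame({"NEOFAC_N": np.arange(20)})

# Bin the near-continuous scores into quartiles for stratification
bins = pd.qcut(df["NEOFAC_N"], q=4, labels=False)

# n_splits=2: the first fold's (train, test) indices are the two halves
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
idx1, idx2 = next(skf.split(df, bins))
half1, half2 = df.iloc[idx1], df.iloc[idx2]
```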
You can try the function train_test_split, stratifying by whichever column you need. See the docs of sklearn.model_selection.train_test_split.
import pandas as pd
from sklearn.model_selection import train_test_split
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Age': [25, 30, 35, 40, 28, 33, 38, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Salary': [70000, 80000, 90000, 100000, 75000, 85000, 95000, 110000]
}
df = pd.DataFrame(data)
# Split the dataset into two parts, stratifying by the 'City' column
part1, part2 = train_test_split(df, test_size=0.5, random_state=42, stratify=df['City'], shuffle=True)
print("Part 1:")
print(part1)
print("\nPart 2:")
print(part2)
Results:
Part 1:
Name Age City Salary
3 David 40 Houston 100000
4 Eve 28 New York 75000
6 Grace 38 Chicago 95000
5 Frank 33 Los Angeles 85000
Part 2:
Name Age City Salary
0 Alice 25 New York 70000
2 Charlie 35 Chicago 90000
7 Henry 45 Houston 110000
1 Bob 30 Los Angeles 80000
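One caveat when carrying this over to the question's data: passing stratify=df['NEOFAC_N'] directly will typically raise a ValueError, because a near-continuous score has values that occur only once, and every stratum needs at least two members. A common workaround, sketched here with the question's column name plus a hypothetical helper column N_bin, is to bin the scores before splitting:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Simulated NEOFAC_N scores, generated the same way as in the question
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "NEOFAC_N": np.clip(rng.normal(20, 5, 174), 0, 40).astype(int)
})

# Quartile bins guarantee each stratum is large enough to split in half
df["N_bin"] = pd.qcut(df["NEOFAC_N"], q=4, labels=False, duplicates="drop")

half1, half2 = train_test_split(df, test_size=0.5, random_state=42,
                                stratify=df["N_bin"])
```

This balances the score distribution between the halves but, like the City example above, does not by itself keep siblings apart; for that, the split has to be applied per family.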