I have a database/table with this structure:
Year | Age | Gender | OrderID |
---|---|---|---|
2012 | 18 | M | 4268 |
2021 | 75 | M | 7569 |
2015 | 56 | F | 5381 |
2018 | 29 | M | 2876 |
2014 | 33 | F | 3749 |
I have a database/table with this structure:
Year | Age | Gender | OrderID |
---|---|---|---|
2012 | 18 | M | 4268 |
2021 | 75 | M | 7569 |
2015 | 56 | F | 5381 |
2018 | 29 | M | 2876 |
2014 | 33 | F | 3749 |
What I am trying to acheive is that I want 400 records/table rows pulled at random to form a smaller sample, but I need 200 male and 200 female records. On top of this I need each male record to give a female record with the same Age value, so I essentially end up with 200 pairs of results - each pair having a male and female of the same Age.
I have already produced and tried the following code:
DROP TABLE IF EXISTS #SampleTableM
DROP TABLE IF EXISTS #SampleTableF
SELECT TOP (200) [Year],[Age],[Gender],[OrderID]
INTO #SampleTableM
FROM [database.name]
WHERE Age <=90 AND Sex = 'M'
ORDER BY NEWID()
SELECT TOP (200) [Year],[Age],[Gender],[OrderID]
INTO #SampleTableF
FROM [database.name]
WHERE Age <=90 AND Sex = 'F'
ORDER BY NEWID()
SELECT * FROM #SampleTableM
UNION
SELECT * FROM #SampleTableF;
However, this just gets me 200 random Male results and 200 random Female results without each result being matched to one of the opposite Gender with the same age.
Share Improve this question edited Feb 5 at 13:09 Thom A 95.6k11 gold badges60 silver badges92 bronze badges asked Feb 5 at 12:37 ArsinqArsinq 134 bronze badges 4- 1 This question is similar to: Select n random rows from SQL Server table. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. – Filburt Commented Feb 5 at 12:40
- The existing answers don't give me any idea on the matching pairs of results side of things. – Arsinq Commented Feb 5 at 12:46
- Do you want to prioritize the age distribution somehow? Or is it better to prioritize that you have pairs – siggemannen Commented Feb 5 at 13:07
- The priority is the matching pairs, age distribution shouldn't matter too much. – Arsinq Commented Feb 5 at 14:10
1 Answer
Reset to default 1Based on the comments, I first selected random 200 male sample and then matched the age for female gender corresponding to the collected 200 male samples.Unsure how your output should look like as you did not share the expected output, you can adjust the columns in output as required.
Here is an example
WITH male_sample AS (
SELECT TOP (200) [Year], [Age], [Gender], [OrderID]
FROM test
WHERE Age <= 90 AND Gender = 'M'
),
female_sample AS (
SELECT TOP (200) m.[Year] AS male_year, m.[Age] AS age, m.[Gender] AS male_gender, m.[OrderID] AS male_OrderID,
f.[Year] AS female_year, f.[Gender] AS female_gender, f.[OrderID] AS female_OrderID
FROM male_sample m
INNER JOIN test f ON m.Age = f.Age
WHERE f.Gender = 'F'
)
SELECT *
FROM female_sample;
Fiddle
Male_Year | Age | Male_Gender | Male_OrderID | Female_Year | Female_Gender | Female_OrderID |
---|---|---|---|---|---|---|
2012 | 18 | M | 4268 | 2013 | F | 4269 |
2021 | 75 | M | 4269 | 2020 | F | 4270 |
2018 | 29 | M | 4271 | 2019 | F | 4272 |
2016 | 56 | M | 4273 | 2015 | F | 4270 |
2014 | 33 | M | 4274 | 2014 | F | 4272 |
2022 | 40 | M | 4001 | 2017 | F | 4002 |
2023 | 50 | M | 5001 | 2011 | F | 5002 |
2024 | 60 | M | 6001 | 2010 | F | 6002 |
As per comment we can get the matching pairs in subsequent rows using UNION ALL and then order by age, gender. There may or not be equal distribution of age for each gender but atleast similar age will appear together.
Note : Visually this might still look uneven for when the number of male and female are more than 2 for an age.
WITH male_sample AS (
SELECT TOP (200) [Year], [Age], [Gender], [OrderID]
FROM test
WHERE Age <= 90 AND Gender = 'M'
),
female_sample AS (
SELECT TOP (200) m.[Year] AS male_year, m.[Age] AS age, m.[Gender] AS male_gender, m.[OrderID] AS male_OrderID,
f.[Year] AS female_year, f.[Gender] AS female_gender, f.[OrderID] AS female_OrderID
FROM male_sample m
INNER JOIN test f ON m.Age = f.Age
WHERE f.Gender = 'F'
),
combined_sample AS (
SELECT [male_year] AS [Year], [Age], [male_gender] AS [Gender], [male_OrderID] AS [OrderID]
FROM female_sample
UNION ALL
SELECT [female_year] AS [Year], [Age], [female_gender] AS [Gender], [female_OrderID] AS [OrderID]
FROM female_sample
)
SELECT [Year], [Age], [Gender], [OrderID]
FROM combined_sample
ORDER BY age, Gender ;
Fiddle
Output
Year | Age | Gender | OrderID |
---|---|---|---|
2013 | 18 | F | 4269 |
2012 | 18 | M | 4268 |
2019 | 29 | F | 4272 |
2018 | 29 | M | 4271 |
2014 | 33 | F | 4272 |
2014 | 33 | M | 4274 |
2017 | 40 | F | 4002 |
2022 | 40 | M | 4001 |
2011 | 50 | F | 5002 |
2023 | 50 | M | 5001 |
2015 | 56 | F | 4270 |
2016 | 56 | M | 4273 |
2010 | 60 | F | 6002 |
2024 | 60 | M | 6001 |
2020 | 75 | F | 4270 |
2021 | 75 | M | 4269 |