r - Subset data to keep only paired sample IDs within a larger dataset with unpaired samples (control = 1 row, sample =1

A text version of what my dataset looks like

ID  Sample_Type Species
1   Sample      A
1   Sample      B
1   Sample      C
1   Sample      D
2   Sample      A
2   Sample      B
2   Sample      C
2   Sample      D
2   Sample      E
2   Sample      F
2   control     NA
3   sample      B
3   sample      C
3   sample      D
3   sample      E
3   sample      G
3   control     B
4   sample      A
4   sample      B
4   sample      C
4   sample      G
4   sample      J
4   sample      K
5   control     NA

Description

This is a basic example of my dataset. In this example I want to retain all rows for IDs 2 and 3 because they contain both a sample and a control. IDs 1 and 4 do not contain controls, so I do not need to analyze them at this time. Similarly, IDs like 5 are independent controls that are not paired with any samples; I don't want these included either.

I intend to compare the results of paired samples and controls. The issue is that the paired IDs sit within a much larger dataset, and the paired samples/controls represent less than 10% of it.

I am not looking to randomly sample this dataset or to keep one row per group, like many of the other questions on Stack Overflow. I only want to clean it so that I keep every row that meets the criteria of the if statement below and remove all observations that don't. For example, I want to exclude samples with no controls and independent controls that have no samples associated with them.

data2<- data %>% 
  select(ID, Sample_Type, Species) %>% 
  filter(if (data$Sample_Type = "Control" & "Sample") {SiteID = TRUE})

asked Feb 1 at 10:33 by mbasista

2 Answers

To filter, check whether both "sample" and "control" are present within each ID group. %in% tests each value on its left against the values on its right, all() requires every one of those tests to be TRUE, and .by = ID runs the check separately for each ID (the .by argument requires dplyr 1.1.0 or later). tolower() handles the inconsistent capitalization in the input data.

data |> 
  dplyr::filter(all(c("sample", "control") %in% tolower(Sample_Type)), .by = ID)
#>    ID Sample_Type Species
#> 1   2      Sample       A
#> 2   2      Sample       B
#> 3   2      Sample       C
#> 4   2      Sample       D
#> 5   2      Sample       E
#> 6   2      Sample       F
#> 7   2     control    <NA>
#> 8   3      sample       B
#> 9   3      sample       C
#> 10  3      sample       D
#> 11  3      sample       E
#> 12  3      sample       G
#> 13  3     control       B
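
For readers on an older dplyr without .by, the same per-ID check can be sketched in base R. This is an illustrative equivalent rather than part of the answer: `has_both` and `paired_ids` are names chosen here, and the small data frame is a trimmed stand-in for the example data.

```r
# Trimmed stand-in for the example data (same mixed capitalization)
data <- data.frame(
  ID          = c(1, 1, 2, 2, 2, 3, 3, 4, 5),
  Sample_Type = c("Sample", "Sample", "Sample", "Sample", "control",
                  "sample", "control", "sample", "control"),
  Species     = c("A", "B", "A", "B", NA, "B", "B", "A", NA)
)

# For each ID: are both "sample" and "control" present (case-insensitive)?
has_both <- tapply(tolower(data$Sample_Type), data$ID,
                   function(x) all(c("sample", "control") %in% x))

# IDs that pass the check, as character labels taken from the tapply names
paired_ids <- names(has_both)[has_both]

# Keep every row that belongs to a paired ID
data2 <- data[as.character(data$ID) %in% paired_ids, ]
```

Comparing IDs as character avoids assuming that the ID column is numeric in the full dataset.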

Example data:

data <- read.table(header = TRUE, text = "
ID  Sample_Type Species
1   Sample      A
1   Sample      B
1   Sample      C
1   Sample      D
2   Sample      A
2   Sample      B
2   Sample      C
2   Sample      D
2   Sample      E
2   Sample      F
2   control     NA
3   sample      B
3   sample      C
3   sample      D
3   sample      E
3   sample      G
3   control     B
4   sample      A
4   sample      B
4   sample      C
4   sample      G
4   sample      J
4   sample      K
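5   control     NA")

library(dplyr)

# Lower-case Sample_Type before building the factor; with levels "Control"/"Sample"
# the mixed-case values in the example data would become NA and no ID would pass.
# Note that Sample_Type is returned as a lower-case factor.
data2 <- data %>%
  group_by(ID) %>%
  mutate(Sample_Type = factor(tolower(Sample_Type), levels = c("control", "sample"))) %>%
  filter(all(levels(Sample_Type) %in% Sample_Type)) %>%
  ungroup()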
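
Whichever approach is used, it may be worth confirming that every retained ID really has both types before comparing them. A minimal base R check, assuming the filtered result is called `data2`; the data frame below is a hand-built stand-in for that result, and `type_counts` is a name chosen here.

```r
# Hand-built stand-in for a filtered result (IDs 2 and 3 only)
data2 <- data.frame(
  ID          = c(2, 2, 2, 3, 3),
  Sample_Type = c("Sample", "Sample", "control", "sample", "control")
)

# Rows per ID and (lower-cased) sample type
type_counts <- table(data2$ID, tolower(data2$Sample_Type))
print(type_counts)

# Every retained ID should have at least one control and one sample
stopifnot(all(type_counts[, c("control", "sample")] > 0))
```

If the stopifnot() call raises an error, at least one ID in the filtered data is missing a sample or a control.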
