
snakemake - Is there any way to use wildcards in output? - Stack Overflow


My workflow has 3 samples to process. Since the sample files are very large, I use config.yaml to specify how many parts each sample should be split into, like below.

split2nfiles: 
  SRR23538290: 12
  SRR23538291: 15
  SRR23538292: 14

In my Snakefile, I use SPLIT2NFILES to store the number of splits per sample, and SPLIT_IDS to store, for each sample, the list of suffix IDs of the parts it should be split into.

from collections import defaultdict

SPLIT2NFILES = config["split2nfiles"]
SPLIT_IDS = defaultdict(list)
for k in SAMPLE2DATA.keys():
    # e.g. SPLIT_IDS["SRR23538290"] == ["0001", "0002", ..., "0012"]
    SPLIT_IDS[k] = [f"{num:04d}" for num in range(1, SPLIT2NFILES[k] + 1)]

And this is my split_part rule. I want it to split each sample into a different number of parts according to the config file, using SPLIT2NFILES and SPLIT_IDS. My input files are all named like {sample}.{ref}.sam, where {sample} is in SAMPLE2DATA.keys() and {ref} is just the genome. Each sample is split into a different number of parts, so the IDs differ: e.g. if sample1 is split into 10 parts it gets IDs "0001" to "0010", while if sample2 is split into 13 parts it gets IDs "0001" to "0013".

I want this rule to generate the parts for each sample in the DAG (I have 3 samples, so the rule should be run just 3 times). But when I generate the DAG graph with snakemake --dag | dot -Tpng > dag_dyn_ids.png, the graph begins at rule revert_split_bam; all the rules before split_part (including split_part itself) are gone.

rule split_part:
    input:
        ready=TEMPDIR / "bam2sam_barrier/ready",
        sams=TEMPDIR / "split_part/sams/{sample}.{ref}.sam",
    output:
        expand(TEMPDIR / "split_part/splits/{{sample}}/{{sample}}.{{ref}}_{id}.sam", id=SPLIT_IDS["{{sample}}"]),
    params:
        n_file=lambda wildcards: SPLIT2NFILES[wildcards.sample],
        tmp=os.environ["TMPDIR"],
        sample_name=lambda wildcards: wildcards.sample,
    shell:
        ........

rule revert_split_bam:
    input:
        rules.split_part.output
    output:
        temp(TEMPDIR / "split_part/bams/{sample}/{sample}.{ref}_{id}.bam"),

Is there any way to dynamically specify split_part's output so that each {sample} is matched with its own set of IDs?

I have tried id=lambda wildcards: SPLIT_IDS[wildcards.sample], but it seems I shouldn't use a lambda in expand; the error is "Only input files can be specified as functions".
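
Roughly, the attempt looks like the sketch below (reconstructed from the description above, so the exact form is a guess); the underlying restriction is that, unlike inputs, outputs cannot be computed by a function of the wildcards.

# NB: a callable in "output" is rejected by Snakemake with
# "Only input files can be specified as functions".
rule split_part:
    input:
        sams=TEMPDIR / "split_part/sams/{sample}.{ref}.sam",
    output:
        lambda wildcards: expand(
            TEMPDIR / "split_part/splits/{{sample}}/{{sample}}.{{ref}}_{id}.sam",
            id=SPLIT_IDS[wildcards.sample],
        ),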

I checked this issue (https://github.com/snakemake/snakemake/issues/1122) and found that using params in output does not work.

1 Answer


The short answer is "no": you cannot have a variable number of outputs for a rule. The outputs of a rule are always a fixed list of template strings. However, you can still achieve what you want, i.e. process your SAM files in chunks.

One approach would be to have your split_part rule always produce max(split2nfiles.values()) outputs, and for the cases where the actual number of chunks is lower, just create empty files to pad out the output (but never actually try to process them). It will work, but it's a bit icky.
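
For illustration only, a minimal sketch of that padding idea might look like the following (reusing names from the question; your_split_tool is a placeholder for whatever command actually does the splitting). Every sample declares the same fixed list of MAX_SPLITS outputs, and downstream rules must tolerate the empty padding files:

# Pad every sample up to the largest requested number of chunks.
MAX_SPLITS = max(config["split2nfiles"].values())
PAD_IDS = [f"{n:04d}" for n in range(1, MAX_SPLITS + 1)]

rule split_part:
    input:
        sams=TEMPDIR / "split_part/sams/{sample}.{ref}.sam",
    output:
        expand(TEMPDIR / "split_part/splits/{{sample}}/{{sample}}.{{ref}}_{id}.sam",
               id=PAD_IDS),
    params:
        n_real=lambda wc: config["split2nfiles"][wc.sample],
    shell:
        """
        # placeholder for the real split command, producing {params.n_real} parts
        your_split_tool --parts {params.n_real} {input.sams}
        # create any missing (padding) outputs as empty files
        touch {output}
        """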

Or you could just re-run the whole workflow for each sample rather than processing everything at once. This might actually be the simplest way. There used to be a "subworkflows" feature that supported this, but it was removed because it's cleaner and simpler to just run snakemake multiple times from a loop in a shell script.
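
The answer means a plain shell loop here, but just to make the idea concrete, an equivalent driver sketched in Python might look like this (the "sample" config key is an assumption about how the Snakefile would select a single sample per run):

import subprocess

SAMPLES = ["SRR23538290", "SRR23538291", "SRR23538292"]

for sample in SAMPLES:
    # one complete Snakemake run per sample; the Snakefile would read
    # config["sample"] instead of looping over all samples internally
    subprocess.run(
        ["snakemake", "--cores", "8", "--config", f"sample={sample}"],
        check=True,
    )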

Another approach is to use a checkpoint to add new jobs to the DAG after the split has happened. In this case, though, a checkpoint is unnecessary, because you already know in advance how many chunks to expect per SAM file. You do still need the split rule to output a directory of files (which is what you would typically do with a checkpoint rule anyway). Here's a self-contained example:

# Say I have three input files:

FILE_SIZES = {"a": 12, "b": 15, "c": 14}

# Ensure that f and p wildcards only match to letters and numbers respectively.
wildcard_constraints:
    f = r"[a-z]",
    p = r"\d+",

rule main:
    input: expand("doubled_{f}.txt", f=FILE_SIZES.keys())

rule combine_output:
    output: "doubled_{f}.txt"
    input:
        lambda wc: expand( "doubled_{f}_{p:02d}.txt",
                                f = wc.f,
                                p = range(FILE_SIZES[wc.f]) )
    shell:
        "cat {input} > {output}"

# As a proxy for a rule that processes a chunk of a BAM file, this just
# doubles the content of a line, eg: "foo" -> "foo foo"
rule double_up:
    output: "doubled_{f}_{p}.txt"
    input:  "infile_{f}_{p}.txt"
    shell:
        "paste {input} {input} > {output}"

# You could remove this rule and avoid making the symlinks if you just have the
# "double_up" rule read the input directly, but I think it's cleaner to keep
# this "shim" rule.
# The explicit "test -e" avoids making dangling links if a file is somehow missing,
# which is quite possible as we are relying on the filenames created by "split".
rule link_a_chunk:
    output: "infile_{f}_{p}.txt"
    input:  "infile_{f}_split"
    shell:
       """test -e {input}/{output}
          ln -srn {input}/{output} {output}
       """

# This rule has a directory output, as that's the only way to have a rule with a
# variable number of output files.
rule split_input:
    output: directory("infile_{f}_split")
    input:  "infile_{f}.txt"
    shell:
       """mkdir {output}
          split -l1 {input} --numeric=0 {output}/infile_{wildcards.f}_ --add .txt
       """

rule gen_inputs:
    output: "infile_{f}.txt"
    params:
        num_lines = lambda wc: FILE_SIZES[wc.f]
    shell:
        "seq {params.num_lines} > {output}"

In the approach above, conceptually what we are doing is converting each input file into a directory of split-up files, then extracting one chunk of this converted data at a time. The extraction step in rule link_a_chunk simply entails making a symlink to the file within the directory.

A final option may be possible, depending on exactly what is in those SAM files. You could have a rule that runs once per chunk, like link_a_chunk above, but rather than pre-splitting the inputs, each job would read the big SAM file directly and extract just one chunk. Of course, if this entails scanning through the whole big SAM file 15 times then it is going to be slow, but maybe if the files were converted to BAM, indexed, and extracted with "samtools view -M" it could be efficient? It depends on how you are making the split.
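
If that route were taken, one possible shape of it is sketched below. This is an assumption-heavy illustration, not part of the answer: it presumes a coordinate-sorted BAM per sample and hypothetical per-chunk BED files (chunks/{sample}_{id}.bed) listing the regions that belong to each chunk. The -M flag enables samtools' multi-region iterator, so each job uses the index to seek to its own regions rather than streaming the whole file.

rule index_bam:
    output: "{sample}.{ref}.bam.bai"
    input:  "{sample}.{ref}.bam"
    shell:  "samtools index {input}"

rule extract_chunk:
    output: "splits/{sample}.{ref}_{id}.bam"
    input:
        bam="{sample}.{ref}.bam",
        bai="{sample}.{ref}.bam.bai",
        bed="chunks/{sample}_{id}.bed",
    shell:
        "samtools view -b -M -L {input.bed} -o {output} {input.bam}"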
