My workflow handles 3 samples. Since the sample files are very large, I use config.yaml to specify how many parts each sample should be split into, like below.
split2nfiles:
  SRR23538290: 12
  SRR23538291: 15
  SRR23538292: 14
In my Snakefile, I use SPLIT2NFILES to store the number of splits per sample, and SPLIT_IDS to store, for each sample, the list of suffix ids it should be split into.
from collections import defaultdict

SPLIT2NFILES = config["split2nfiles"]
SPLIT_IDS = defaultdict(list)
for k in SAMPLE2DATA.keys():
    SPLIT_IDS[k] = [f"{num:04d}" for num in range(1, SPLIT2NFILES[k] + 1)]
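For the config above (assuming SAMPLE2DATA is keyed by the same three accessions), SPLIT_IDS therefore ends up holding one zero-padded id list per sample, for example:

# Resulting id lists (illustration):
# SPLIT_IDS["SRR23538290"] == ["0001", "0002", ..., "0012"]
# SPLIT_IDS["SRR23538291"] == ["0001", "0002", ..., "0015"]
# SPLIT_IDS["SRR23538292"] == ["0001", "0002", ..., "0014"]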
And this is my split_part rule. I want this rule to split each sample into a different number of parts according to the config file, using SPLIT2NFILES and SPLIT_IDS. My input files are all named like {sample}.{ref}.sam, where {sample} is in SAMPLE2DATA.keys() and {ref} is just genome. Each sample should be split into a different number of parts, so the id range differs per sample (e.g. if sample1 is split into 10 parts it gets ids "0001" to "0010", but if sample2 is split into 13 parts it gets ids "0001" to "0013").
I want this rule to generate the parts for each sample in the DAG (I have 3 samples, so this rule should be invoked only 3 times). But when I generate the DAG graph with snakemake --dag | dot -Tpng > dag_dyn_ids.png, the graph begins with rule revert_split_bam; all the rules up to and including split_part are gone.
rule split_part:
    input:
        ready=TEMPDIR / "bam2sam_barrier/ready",
        sams=TEMPDIR / "split_part/sams/{sample}.{ref}.sam",
    output:
        expand(TEMPDIR / "split_part/splits/{{sample}}/{{sample}}.{{ref}}_{id}.sam", id=SPLIT_IDS["{{sample}}"]),
    params:
        n_file=lambda wildcards: SPLIT2NFILES[wildcards.sample],
        tmp=os.environ["TMPDIR"],
        sample_name=lambda wildcards: wildcards.sample,
    shell:
        ........
rule revert_split_bam:
    input:
        rules.split_part.output
    output:
        temp(TEMPDIR / "split_part/bams/{sample}/{sample}.{ref}_{id}.bam"),
Is there any way to dynamically specify split_part's output so that each {sample} is matched with its own set of ids?
I have tried id=lambda wildcards: SPLIT_IDS[wildcards.sample], but it seems I cannot use a lambda inside expand: the error is "Only input files can be specified as functions".
I also checked this issue https://github.com/snakemake/snakemake/issues/1122 and found that using params in output does not work either.
1 Answer
The short answer is "no": you cannot have a variable number of outputs for a rule. The outputs of a rule are always a fixed list of template strings. However, you can achieve what you want, i.e. to process your SAM files in chunks.
One approach would be to have your split_part rule always produce max(split2nfiles.values()) outputs, and then, for all the cases where the actual number of chunks is lower, just create empty files to pad the output (but never actually try to process them). It will work, but it's a bit icky.
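As a rough sketch of that padding idea (the rule name and the splitter command below are placeholders of mine, not from the question), it could look something like this:

# SPLIT2NFILES is the dict from the question's config.
MAX_SPLITS = max(SPLIT2NFILES.values())
ALL_IDS = [f"{n:04d}" for n in range(1, MAX_SPLITS + 1)]

rule split_part_padded:
    input:
        sam="sams/{sample}.{ref}.sam",
    output:
        # Every sample claims the same, maximal set of chunk files.
        expand("splits/{{sample}}/{{sample}}.{{ref}}_{id}.sam", id=ALL_IDS),
    params:
        n_file=lambda wc: SPLIT2NFILES[wc.sample],
    shell:
        # "your_split_tool" stands in for whatever splitter you actually use;
        # the trailing "touch" ensures every declared output exists, so the
        # surplus files are simply left empty.
        "your_split_tool --parts {params.n_file} {input.sam} "
        "&& touch {output}"

Downstream rules then have to tolerate the empty padding files, which is the "icky" part.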
Or you could just re-run the whole workflow for each sample rather than processing everything at once. This might actually be the simplest way. There used to be a "subworkflows" feature that supported this, but it was removed because it's actually just cleaner and simpler to run snakemake multiple times from a loop in a shell script.
Another approach is to use a checkpoint to add new jobs to the DAG after the split has happened. In this case, though, a checkpoint is superfluous, because you already know in advance the number of chunks to expect per SAM file. However, you do still need to have the split rule output a directory of files (which is what you would typically do with a checkpoint rule). Here's a self-contained example:
# Say I have three input files:
FILE_SIZES = {"a": 12, "b": 15, "c": 14}

# Ensure that f and p wildcards only match to letters and numbers respectively.
wildcard_constraints:
    f = r"[a-z]",
    p = r"\d+",

rule main:
    input: expand("doubled_{f}.txt", f=FILE_SIZES.keys())

rule combine_output:
    output: "doubled_{f}.txt"
    input:
        lambda wc: expand( "doubled_{f}_{p:02d}.txt",
                           f = wc.f,
                           p = range(FILE_SIZES[wc.f]) )
    shell:
        "cat {input} > {output}"

# As a proxy for a rule that processes a chunk of a BAM file, this just
# doubles the content of a line, eg: "foo" -> "foo foo"
rule double_up:
    output: "doubled_{f}_{p}.txt"
    input: "infile_{f}_{p}.txt"
    shell:
        "paste {input} {input} > {output}"

# You could remove this rule and avoid making the symlinks if you just have the
# "double_up" rule read the input directly, but I think it's cleaner to keep
# this "shim" rule.
# The explicit "test -e" avoids making dangling links if a file is somehow missing,
# which is quite possible as we are relying on the filenames created by "split".
rule link_a_chunk:
    output: "infile_{f}_{p}.txt"
    input: "infile_{f}_split"
    shell:
        """test -e {input}/{output}
           ln -srn {input}/{output} {output}
        """

# This rule has a directory output, as that's the only way to have a rule with a
# variable number of output files.
rule split_input:
    output: directory("infile_{f}_split")
    input: "infile_{f}.txt"
    shell:
        """mkdir {output}
           split -l1 {input} --numeric=0 {output}/infile_{wildcards.f}_ --add .txt
        """

rule gen_inputs:
    output: "infile_{f}.txt"
    params:
        num_lines = lambda wc: FILE_SIZES[wc.f]
    shell:
        "seq {params.num_lines} > {output}"
In the approach above, conceptually what we are doing is converting each input file into a directory of split-up files, then extracting one chunk of this converted data at a time. The extraction step in rule link_a_chunk simply entails making a symlink to the file within the directory.
A final option may be possible, depending on exactly what is in those SAM files. You could have a rule that runs once per chunk, like link_a_chunk above, but rather than pre-splitting the inputs, each job would read the big SAM file directly and extract just one chunk at a time. Of course, if this entails scanning through the whole big SAM file 15 times then it is going to be slow, but maybe if the files were converted to BAM, indexed, and extracted with "samtools view -M" it could be efficient? It depends on how you are making the split.
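To make that last idea concrete, here is a minimal sketch (the file layout and the per-chunk BED files are my own assumptions, not something from the question): it presumes each {sample}.bam has been coordinate-sorted and indexed, and that you have already produced one BED file of target regions per chunk, e.g. by dividing the reference into windows.

rule extract_chunk:
    output:
        "chunks/{sample}_{id}.bam"
    input:
        bam="{sample}.bam",
        bai="{sample}.bam.bai",
        # One BED file of target regions per chunk; generating these is up to you.
        bed="regions/{sample}_{id}.bed",
    shell:
        # -M enables the multi-region iterator, -L restricts output to the BED
        # regions, and -b writes BAM output.
        "samtools view -b -M -L {input.bed} -o {output} {input.bam}"

Whether this beats pre-splitting depends on how evenly the regions divide the data, but it avoids writing the intermediate split files entirely.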