
Files are not created with snakemake even though they are created if manually run - Stack Overflow


I am using Snakemake as a pipeline to run simulations written in C++ and to do some data analysis. In the Snakefile below I show only the rule concerning data generation. Unexpectedly, a few files are not created, even though they are created if I run the C++ executable manually through a bash script. For each beta and b pair (read from a CSV file) the simulation should generate 500 CSV files. To illustrate the problem, I reduced the beta-b combinations file to contain only 2 such pairs, so we expect a total of 1000 jobs. What is the reason for a few of these files being missing? Given the warning in the output, it seems as if there is a lack of coordination between the time the DAG is created and the moment the CSV files are generated. What should I do?

#Read beta-b combinations from CSV
combinations_file = "beta_b_combinations.csv"
bash_script = "run_extended_metropolis.sh"
executable = "metropolis_extended"
      
# Read the CSV file
beta_b_combinations = []
with open(combinations_file, 'r') as f:
    next(f)  # Skip header
    for line in f:
        beta, b = line.strip().split(',')
        beta_b_combinations.append((float(beta), float(b)))  # Convert to float for safety

# Define the rule 'all'
rule all:
    input:
        expand(
            "beta_{beta}_b_{b}/data_1_first500/replica_{replica}.csv",
            beta=[beta for beta, _ in beta_b_combinations],
            b=[b for _, b in beta_b_combinations],
            replica=range(1, 501)  # 500 replicas
        )

# Rule for running the metropolis simulation
rule run_metropolis:
    input:
        bash_script=bash_script,
        executable=executable
    output:
        "beta_{beta}_b_{b}/data_1_first500/replica_{replica}.csv"
    params:
        beta="{beta}",
        b="{b}",
        outdir="beta_{beta}_b_{b}/data_1_first500"
    shell:
        """
            mkdir -p {params.outdir}
            echo "Running script with beta={params.beta}, b={params.b},    outdir={params.outdir}" > {params.outdir}/debug.log
           ./{input.bash_script} {input.executable} >> {params.outdir}/debug.log 2>&1

           # Log existing files
           echo "Files in {params.outdir}:" >> {params.outdir}/debug.log
           ls -lh {params.outdir} >> {params.outdir}/debug.log

           # Check if output file exists
           if [ ! -f {output} ]; then
               echo "ERROR: Output file {output} was not created!" >> {params.outdir}/ debug.log
               exit 1
           fi
        """

Here is the output:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       all
    1000    run_metropolis
    1001

 

[Tue Mar  4 22:03:43 2025]
rule run_metropolis:
input: run_extended_metropolis.sh, metropolis_extended
output: beta_0.1_b_0.01/data_1_first500/replica_309.csv
jobid: 309
wildcards: beta=0.1, b=0.01, replica=309

[Tue Mar  4 22:05:23 2025]
Finished job 309.
1 of 1001 steps (0.10%) done

[Tue Mar  4 22:05:23 2025]
rule run_metropolis:
input: run_extended_metropolis.sh, metropolis_extended
output: beta_0.1_b_0.001/data_1_first500/replica_262.csv
jobid: 762
wildcards: beta=0.1, b=0.001, replica=262

Warning: the following output files of rule run_metropolis were not present when the DAG was created:
{'beta_0.1_b_0.001/data_1_first500/replica_262.csv'}
[Tue Mar  4 22:07:04 2025]
Finished job 762.
2 of 1001 steps (0.20%) done

[Tue Mar  4 22:07:04 2025]
rule run_metropolis:
input: run_extended_metropolis.sh, metropolis_extended
output: beta_0.1_b_0.001/data_1_first500/replica_382.csv
jobid: 882
wildcards: beta=0.1, b=0.001, replica=382

Warning: the following output files of rule run_metropolis were not present when the DAG was created:
{'beta_0.1_b_0.001/data_1_first500/replica_382.csv'}

Here is an improved version of the Snakefile, in which I run the C++ executable directly from the Snakefile, without the bash script. Nevertheless, each created CSV file is present in two copies. Why does this happen?

combinations_file = "beta_b_combinations.csv"
executable = "metropolis_extended"

beta_b_combinations = []
with open(combinations_file, 'r') as f:
    next(f)
    for line in f:
        beta, b = line.strip().split(',')
        beta_b_combinations.append((float(beta), float(b)))

rule all:
    input:
        expand(
            "beta_{beta}_b_{b}/data_1_first500/replica_{replica}.csv",
            beta=[beta for beta, _ in beta_b_combinations],
            b=[b for _, b in beta_b_combinations],
            replica=range(1, 501)
        )

rule run_metropolis:
    output:
        "beta_{beta}_b_{b}/data_1_first500/replica_{replica}.csv"
    log:
        "beta_{beta}_b_{b}/data_1_first500/replica_{replica}.log"
    shell:
        """
        ./{executable} {wildcards.replica} {output} {wildcards.beta} {wildcards.b} > {log} 2>&1
        """

I also show the main part of the C++ code:

void metropolis(int D, int N, double beta, double b, int seed, int N_D, int num_config, const std::string& filename) {
    Xoroshiro128Plus rng(seed);
    std::vector<int> lattice(N_D);
    for (int n = 0; n < N_D; ++n) {
        lattice[n] = (rng.next() < 0.5) ? 1 : -1;
    }

    std::vector<int> powers = precompute_powers(N, D);
    std::vector<std::vector<int>> neighbors(N_D);
    for (int i = 0; i < N_D; ++i) {
        std::vector<int> coords = convert_flat_index_to_lattice_coord(N, D, i, powers);
        std::vector<std::vector<int>> neighbor_coords = compute_neighbors(coords, N, D);
        neighbors[i] = convert_lattice_coord_to_flat_index(neighbor_coords, N, D, powers);
    }

    std::ofstream outfile(filename); // Open the file only once
    if (!outfile.is_open()) {
        std::cerr << "Error: Could not open file " << filename << std::endl;
        return;
    }

    outfile << "Configuration,Magnetization,Hamiltonian\n"; // Write header

    for (int step = 1; step <= num_config; ++step) {
        for (int n = 0; n < N_D; ++n) {
            double dH = deltaH(lattice, neighbors, n, beta, b);
            if (dH < -std::log(rng.next())) {
                lattice[n] = -lattice[n];
            }
        }

        int M = computeMagnetization(lattice);
        double H = computeHamiltonian(lattice, neighbors, beta, b);

        std::ostringstream oss;
        oss << std::fixed << std::setprecision(2) << step << "," << M << "," << H << "\n";
        outfile << oss.str();
    }

    outfile.close(); // Close the file
}

int main(int argc, char** argv) {
    if (argc != 5) {
        std::cerr << "Usage: " << argv[0] << " <seed> <filename>  <beta> <b>" << std::endl;
        return 1;
    }

    int seed = std::stoi(argv[1]);
    std::string filename = argv[2];
    double beta = std::stod(argv[3]);
    double b = std::stod(argv[4]);

    // Add logging to C++ code
    std::ofstream cpp_log("cpp_execution.log", std::ios_base::app);
    if (cpp_log.is_open()) {
        cpp_log << "Execution started: seed=" << seed << ",  filename=" << filename << ", beta=" << beta << ", b=" << b <<  std::endl;
        cpp_log.close();
    }

    int D = 2;
    int N = 100;
    int num_config = 1000;

    std::vector<int> powers = precompute_powers(N, D);
    int N_D = powers[D];

    metropolis(D, N, beta, b, seed, N_D, num_config, filename);

    // Add logging to C++ code
    cpp_log.open("cpp_execution.log", std::ios_base::app);
    if (cpp_log.is_open()) {
        cpp_log << "Execution finished: seed=" << seed << ",  filename=" << filename << ", beta=" << beta << ", b=" << b <<  std::endl;
        cpp_log.close();
    }


    return 0;
}

asked Mar 4 at 21:31 by user249018, edited Mar 11 at 17:22
  • Please show the logs after adding --printshellcmds to the snakemake invocation. This way we can see the actual shell CMD executed. I think you are setting the parameters wrongly. There's no need for the params section. Use wildcards directly in the shell section. – Cornelius Roemer Commented Mar 5 at 15:55

1 Answer


Your main command being run within the run_metropolis rule is:

./run_extended_metropolis.sh metropolis_extended

The replica number is not specified here, so how does this program know what output file name to create? My assumption is that within the shell script it looks for the existing files matching beta_0.1_b_0.01/data_1_first500/replica_*.csv and then creates a file with the next number in sequence.
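
For illustration, the assumed anti-pattern would look something like this (a hypothetical reconstruction — the actual run_extended_metropolis.sh was not shown):

# Hypothetical sketch of run_extended_metropolis.sh, NOT the real script:
# it derives the next replica number from the files already on disk,
# which races with Snakemake's out-of-order scheduling.
outdir="beta_0.1_b_0.01/data_1_first500"
count=$(ls "$outdir"/replica_*.csv 2>/dev/null | wc -l)
next=$(( count + 1 ))
./"$1" "$next" "$outdir/replica_${next}.csv" 0.1 0.01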

This is not going to wash with Snakemake, since Snakemake is not running the jobs in order of the {replica} wildcard. See how it starts with replica 309. Even if you forced it to run in order, presumably you want your workflow to be able to run jobs in parallel, or retry failed jobs.

You are going to need to modify your bash_script so that the output file name can be explicitly supplied by Snakemake, not picked by the script. Depending on what that script does, you may even find it easier to invoke metropolis_extended directly.
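
For example, the shell command in the rule could pass everything explicitly (a sketch — it assumes you change run_extended_metropolis.sh to accept the replica number, output path, beta and b as arguments):

./{input.bash_script} {input.executable} {wildcards.replica} {output} {wildcards.beta} {wildcards.b}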

If you get this working, then as pointed out by @cornelius-roemer in the comment above, you also have some extraneous stuff in your Snakefile:

  • The {b} and {beta} params are doing nothing useful
  • There is no need to mkdir -p {params.outdir}; Snakemake does this for you
  • There is no need to check that the output file is created; Snakemake does this for you

And you should also use an explicit log: directive to specify the log file, rather than {params.outdir}/debug.log.
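
Putting all of that together, a minimal version of the rule might look like the sketch below. It assumes metropolis_extended takes <seed> <filename> <beta> <b> (as its main() shows) and that the replica number doubles as the seed, as in your second Snakefile:

rule run_metropolis:
    input:
        executable="metropolis_extended"
    output:
        "beta_{beta}_b_{b}/data_1_first500/replica_{replica}.csv"
    log:
        "beta_{beta}_b_{b}/data_1_first500/replica_{replica}.log"
    shell:
        # No mkdir or existence check needed: Snakemake creates the output
        # directory and verifies the output file itself.
        "./{input.executable} {wildcards.replica} {output} "
        "{wildcards.beta} {wildcards.b} > {log} 2>&1"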
