TRON Kelly authored

m2reprod


Overview

This project conducts a reproducible data analysis using R and containerized environments to ensure consistent package versions and dependencies. We download and process a dataset, manage dependencies with micromamba, and run analysis scripts organized via a Makefile workflow. This README will guide users through setting up, executing, and understanding each component of the analysis pipeline.

Setup Instructions

Step 1: Install Micromamba

To install micromamba, please refer to the "01install.md" documentation, which provides detailed installation and setup instructions.

Step 2: Data Download and Integrity Check

The data is sourced from an external URL and checked for integrity using an MD5 checksum to ensure reproducibility. Follow the instructions in src/download_data.R to download and verify the data.
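The idea behind the integrity check can be sketched in shell. This is an illustration only, not the code in src/download_data.R (which is an R script): the function name verify_md5 is hypothetical, the expected digest has to come from the data provider, and md5sum from GNU coreutils is assumed to be installed.

```shell
# verify_md5 FILE EXPECTED_MD5   (hypothetical helper, not part of this project)
# Returns 0 when the MD5 digest of FILE equals EXPECTED_MD5, 1 otherwise.
verify_md5() {
    file=$1
    expected=$2
    [ -f "$file" ] || { echo "missing file: $file" >&2; return 1; }
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "checksum OK: $file"
    else
        echo "checksum mismatch for $file: expected $expected, got $actual" >&2
        return 1
    fi
}
```

A caller would typically extract the archive only when the function succeeds, for example `verify_md5 "$ARCHIVE" "$EXPECTED_MD5" && tar -xzf "$ARCHIVE"`, where both variables are placeholders.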

Step 3: Creating the Virtual Image

Before running the workflow, refer to "01install.md" for complete instructions on how to build the container image (the .sif file). Packaging the environment in this image ensures that all dependencies are properly encapsulated.

Step 4: Analysis Pipeline Setup

The analysis is organized into four sequential scripts (tp1.R to tp4.R) managed via a Makefile. The Makefile declares the dependencies between the steps, so each step executes only if necessary and redundant execution is avoided.
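As a rough sketch of what "only if necessary" means, the function below mirrors the timestamp rule that make applies to a target: rebuild when the target is missing or older than one of its prerequisites. The function name needs_rebuild is hypothetical, and the actual rules live in workflows/makefile, which is not reproduced here.

```shell
# needs_rebuild TARGET PREREQ...   (hypothetical helper, for illustration)
# Returns 0 (true) when TARGET is missing or older than any prerequisite,
# which is the condition under which make reruns a rule.
needs_rebuild() {
    target=$1
    shift
    # A missing target always has to be built.
    [ -e "$target" ] || return 0
    for prereq in "$@"; do
        # -nt is true when the left file is newer than the right one.
        [ "$prereq" -nt "$target" ] && return 0
    done
    return 1
}
```

To see which steps make itself considers out of date without running them, `make -n -f workflows/makefile` prints the commands it would execute.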

Step 5: Execute the Workflow

To run the workflow, refer to the "02run" documentation, which provides detailed instructions. The workflow runs inside the container image with the following command:

apptainer exec results/containers/m2bsgreprod3.sif make -f workflows/makefile

This command will:

  1. Download and prepare data files.
  2. Execute analysis scripts in sequence, as defined in the Makefile.
  3. Generate output in the results/ directory for each analysis stage.
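The command above can be wrapped in a small helper that fails early when the container image or the Makefile is missing. This is a minimal sketch: the function names check_inputs and run_workflow are hypothetical, and the two paths are simply the ones used in the command above.

```shell
# Paths taken from the command shown in this README.
SIF=results/containers/m2bsgreprod3.sif
MAKEFILE=workflows/makefile

# check_inputs FILE...   (hypothetical helper)
# Returns 1 and reports the first file that does not exist.
check_inputs() {
    for f in "$@"; do
        [ -f "$f" ] || { echo "not found: $f" >&2; return 1; }
    done
}

# run_workflow [MAKE_ARGS...]   (hypothetical helper)
# Runs the Makefile inside the container after basic pre-flight checks.
run_workflow() {
    check_inputs "$SIF" "$MAKEFILE" || return 1
    command -v apptainer >/dev/null 2>&1 || {
        echo "apptainer is not on PATH" >&2
        return 1
    }
    # Extra arguments are forwarded to make unchanged.
    apptainer exec "$SIF" make -f "$MAKEFILE" "$@"
}
```

Calling `run_workflow -n` forwards `-n` to make, which lists the commands that would run without executing them.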

Scripts

  1. download_data.R: Downloads the dataset, verifies the MD5 checksum, and extracts the files if valid.

  2. tp1.R:

    • Loads required libraries.
    • Reads and processes genotype data.
    • Selects a random subset of 250,000 SNPs for analysis.
    • Outputs processed data to results/tp1.
  3. tp2.R:

    • Loads data from the previous script.
    • Initializes additional libraries.
    • Saves the intermediate processed data to results/tp2.
  4. tp3.R:

    • Loads data from tp2.
    • Performs additional processing and saves to results/tp3.
  5. tp4.R:

    • Loads data from tp3.
    • Finalizes data processing and saves results to results/tp4.
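The random selection of SNPs in tp1.R only gives the same subset on every run if the random number generator is seeded; whether tp1.R fixes a seed is not stated here. The sketch below illustrates seeded sampling in shell with awk (reservoir sampling) and is not the project's R code. The function name sample_lines and the seed value are hypothetical, and the input is assumed to hold one SNP identifier per line.

```shell
# sample_lines N SEED < input > output   (hypothetical helper)
# Prints N lines chosen uniformly at random from standard input,
# using SEED so that reruns with the same awk give the same subset.
sample_lines() {
    awk -v n="$1" -v seed="$2" '
        BEGIN { srand(seed) }
        # Fill the reservoir with the first n lines.
        NR <= n { keep[NR] = $0; next }
        {
            # Replace a kept line with probability n / NR.
            j = int(rand() * NR) + 1
            if (j <= n) keep[j] = $0
        }
        END {
            # If the input is shorter than n, print everything.
            m = (NR < n) ? NR : n
            for (i = 1; i <= m; i++) print keep[i]
        }
    '
}
```

For the subset size used in tp1.R, a call would look like `sample_lines 250000 42 < "$SNP_LIST" > "$SUBSET"`, where the seed and both file variables are placeholders.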

This README covers the setup and execution of the project; see "01install.md" and "02run" for the detailed installation and run instructions.