m2reprod

    TRON Kelly authored
    f88123e7

    m2reprod


    Overview

    This project performs a reproducible data analysis in R, using a containerized environment (an Apptainer .sif image) to keep package versions and dependencies consistent. It downloads and processes a dataset, manages dependencies with micromamba, and runs the analysis scripts through a Makefile workflow. This README walks through setting up, executing, and understanding each component of the analysis pipeline.

    Setup Instructions

    Step 1: Install Micromamba

    To install micromamba, please refer to the "01install.md" documentation, which provides detailed installation and setup instructions.

    Step 2: Data Download and Integrity Check

    The data is sourced from an external URL and checked for integrity using an MD5 checksum to ensure reproducibility. Follow the instructions in src/download_data.R to download and verify the data.
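
    The integrity check described above can be sketched in shell. This is a minimal illustration, not the contents of src/download_data.R: the function name verify_md5 and any path or digest passed to it are hypothetical, and md5sum (GNU coreutils) is assumed to be available.

```shell
#!/usr/bin/env bash
# Sketch of an MD5 integrity check (hypothetical helper, not the project's script).

# verify_md5 FILE EXPECTED_MD5
# Returns 0 if FILE exists and its MD5 digest equals EXPECTED_MD5, 1 otherwise.
verify_md5() {
    local file="$1" expected="$2" actual
    [ -f "$file" ] || { echo "missing file: $file" >&2; return 1; }
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual="$(md5sum "$file" | cut -d' ' -f1)"
    if [ "$actual" = "$expected" ]; then
        echo "checksum OK: $file"
    else
        echo "checksum mismatch for $file: got $actual, expected $expected" >&2
        return 1
    fi
}
```

    Extraction would then be gated on the return status, so a corrupted or changed download stops the pipeline before any analysis runs.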

    Step 3: Create the Container Image

    Before running the workflow, build the container image (the .sif file) by following the complete instructions in the "01install.md" file. This ensures that all dependencies are encapsulated in the image.

    Step 4: Analysis Pipeline Setup

    The analysis is organized into four sequential scripts (tp1.R to tp4.R) managed via a Makefile. The Makefile declares the dependencies between the steps, so each step runs only when necessary and redundant execution is avoided.
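
    The rule make applies to each step can be sketched in shell: a target is rebuilt when it is missing or older than any of its prerequisites. The sketch below only illustrates that timestamp check. It is not the project's workflows/makefile, and the function name needs_rebuild and any file names passed to it are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of make's out-of-date test (illustration only).

# needs_rebuild TARGET PREREQ...
# Returns 0 (rebuild) if TARGET is missing or any PREREQ is newer than it,
# 1 (up to date) otherwise.
needs_rebuild() {
    local target="$1" prereq
    shift
    [ -e "$target" ] || return 0
    for prereq in "$@"; do
        # -nt is true when the left file has a later modification time.
        if [ "$prereq" -nt "$target" ]; then
            return 0
        fi
    done
    return 1
}
```

    Applied to a chain of steps, this means that touching the inputs of one step marks that step and every later step as out of date, while earlier steps are left alone.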

    Step 5: Execute the Workflow

    To run the workflow, refer to the "02run" documentation, which provides detailed instructions on executing the workflow within the micromamba environment. The command is run through the container image:

    apptainer exec results/containers/m2bsgreprod3.sif make -f workflows/makefile

    Scripts

    1. download_data.R: Downloads the dataset, verifies the MD5 checksum, and extracts the files if valid.

    2. tp1.R:

      • Loads required libraries.
      • Reads and processes genotype data.
      • Selects a random subset of 250,000 SNPs for analysis.
      • Outputs processed data to results/tp1.

    3. tp2.R:

      • Loads data from the previous script.
      • Initializes additional libraries.
      • Saves the intermediate processed data to results/tp2.

    4. tp3.R:

      • Loads data from tp2.
      • Performs additional processing and saves to results/tp3.

    5. tp4.R:

      • Loads data from tp3.
      • Finalizes data processing and saves results to results/tp4.

    This README covers the setup and execution of the project so that each step of the analysis can be reproduced.