# m2reprod

---

## Overview

This project conducts a reproducible data analysis using R and containerized environments to ensure consistent package versions and dependencies. We download and process a dataset, manage dependencies with micromamba, and run analysis scripts organized via a Makefile workflow. This README will guide users through setting up, executing, and understanding each component of the analysis pipeline.

## Setup Instructions

### Step 1: Install Micromamba
To install micromamba, follow the detailed installation and setup instructions in `01install.md`.

### Step 2: Data Download and Integrity Check
The data is sourced from an external URL and checked for integrity using an MD5 checksum to ensure reproducibility. Follow the instructions in `src/download_data.R` to download and verify the data.
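
For illustration, the same kind of integrity check can be sketched in shell. This is a minimal sketch, not the project's implementation: it assumes the GNU coreutils `md5sum` command is available, and the function name `verify_md5` and its arguments are hypothetical. The actual download URL and expected checksum are defined in `src/download_data.R`.

```shell
# Hypothetical helper: compare a file's MD5 digest with an expected value.
# Usage: verify_md5 <file> <expected_md5>
verify_md5() {
  file="$1"
  expected="$2"
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(md5sum "$file" | cut -d ' ' -f 1)"
  if [ "$actual" = "$expected" ]; then
    echo "checksum OK: $file"
  else
    echo "checksum MISMATCH: $file (got $actual)" >&2
    return 1
  fi
}
```

A non-zero return status signals that the file should not be used, which is the point at which a download script would stop instead of extracting the data.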

### Step 3: Creating the Virtual Image
Before running the workflow, follow the instructions in `01install.md` to create the virtual image (an Apptainer `.sif` container file). The image encapsulates all dependencies, so every run uses the same package versions.

### Step 4: Analysis Pipeline Setup
The analysis is organized into four sequential scripts (`tp1.R` to `tp4.R`) and managed via a Makefile, ensuring that each step only executes if necessary. The Makefile enforces dependencies, avoiding redundant execution.
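
The "only executes if necessary" behavior comes from make comparing file timestamps. As a rough shell analogue (a sketch under assumptions, not a replacement for the Makefile), the check looks like the function below. The name `needs_rebuild` is hypothetical, and the real targets and prerequisites are whatever `workflows/makefile` declares. The `-nt` ("newer than") test is supported by bash and dash, although it is not required by POSIX.

```shell
# Hypothetical helper mimicking make's rule: a target is stale when it is
# missing or older than its prerequisite.
# Usage: needs_rebuild <target> <prerequisite>
needs_rebuild() {
  target="$1"
  prereq="$2"
  # True if the target does not exist, or the prerequisite is newer than it.
  [ ! -e "$target" ] || [ "$prereq" -nt "$target" ]
}
```

With this logic, a step whose output is newer than its input is skipped, and a changed input triggers a rerun of that step; make then applies the same check to each later step in the chain.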

### Step 5: Execute the Workflow
To run the workflow, refer to the "02run" documentation, which provides detailed instructions for executing it within the containerized micromamba environment:

```
apptainer exec results/containers/m2bsgreprod3.sif make -f workflows/makefile
```
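
If the image or the Makefile is missing, the command above fails with an error that may not point at the cause. A minimal sketch of a wrapper that checks both paths first is shown here; the function name `run_workflow` is hypothetical, and the default paths are simply the ones used in the command above.

```shell
# Hypothetical wrapper: fail early with a clear message if inputs are missing.
# Usage: run_workflow [sif_path] [makefile_path]
run_workflow() {
  sif="${1:-results/containers/m2bsgreprod3.sif}"
  makefile="${2:-workflows/makefile}"
  if [ ! -f "$sif" ]; then
    echo "container image not found: $sif (see Step 3)" >&2
    return 1
  fi
  if [ ! -f "$makefile" ]; then
    echo "makefile not found: $makefile" >&2
    return 1
  fi
  # Same command as in Step 5, with the checked paths substituted.
  apptainer exec "$sif" make -f "$makefile"
}
```

Calling `run_workflow` with no arguments from the repository root is intended to behave like the plain command, with the two existence checks added in front.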

## Scripts

1. **download_data.R**: Downloads the dataset, verifies the MD5 checksum, and extracts the files if valid.
   
2. **tp1.R**:
   - Loads required libraries.
   - Reads and processes genotype data.
   - Selects a random subset of 250,000 SNPs for analysis.
   - Outputs processed data to `results/tp1`.

3. **tp2.R**:
   - Loads data from `tp1`.
   - Initializes additional libraries.
   - Saves the intermediate processed data to `results/tp2`.

4. **tp3.R**:
   - Loads data from `tp2`.
   - Performs additional processing and saves to `results/tp3`.

5. **tp4.R**:
   - Loads data from `tp3`.
   - Finalizes data processing and saves results to `results/tp4`.
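
After a run, a quick way to confirm that every step produced output is to check the four result locations listed above. The following is a minimal sketch: the function name `check_outputs` is hypothetical, and it only tests that `results/tp1` through `results/tp4` exist, not that their contents are correct.

```shell
# Hypothetical helper: report which expected output locations are missing.
# Usage: check_outputs [results_dir]
check_outputs() {
  results_dir="${1:-results}"
  missing=0
  for step in tp1 tp2 tp3 tp4; do
    if [ -e "$results_dir/$step" ]; then
      echo "found:   $results_dir/$step"
    else
      echo "missing: $results_dir/$step" >&2
      missing=1
    fi
  done
  # Non-zero status if at least one location is absent.
  return "$missing"
}
```

Because the scripts run sequentially, the first missing location usually indicates the step where the pipeline stopped.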
   
--- 

This README covers setting up and executing the analysis pipeline. Detailed installation and run instructions are in the `01install.md` and "02run" documentation.