Skip to content

An efficient, documented, reproducible Snakemake methylation analysis pipeline for BS-seq and EM-seq samples, including cfDNA.

License

Notifications You must be signed in to change notification settings

semenko/serpent-methylation-pipeline

Repository files navigation

Serpent Methylation Pipeline (for Snakemake)

Snakemake License: MIT Code style: black Platform: Linux Super-Linter

Serpent Pipeline Logo

A standardized, reproducible pipeline to process WGBS bisulfite & EM-seq data. This goes from .fastq to methylation calls (via biscuit) and includes extensive QC and plotting, using a Snakemake pipeline.

At a high level, this pipeline reproducibly:

Getting Started

This pipeline is designed to be straightforward:

  1. Clone this repository and open the directory:
    git clone https://github.com/semenko/serpent-methylation-pipeline.git
    cd serpent-methylation-pipeline
    
  2. Install Snakemake via mamba (or conda)
    mamba install -c bioconda -c conda-forge snakemake snakemake-storage-plugin-http
    
  3. (Optional) Create a separate conda environment for pipeline dependencies:
    mamba env create -n serpent_pipeline_env -f workflow/envs/env.yaml
    
    Then activate it with:
    conda activate serpent_pipeline_env
    

Test Run

Use:

nice snakemake --cores 4 --use-conda --printshellcmds --rerun-incomplete --rerun-triggers mtime --keep-going --dry-run

to quickly validate the pipeline and see what would be executed. Remove --dry-run to run the full process.

After removing the --dry-run flag, this will download reference genomes and build indices.

Mac OS X Compatibility

Unfortunately, critical dependencies don't support Mac OS architectures yet (e.g., bwa-mem2 only supports linux-aarch64), though this might be supported using brew. Please open an issue or submit a PR if you're able to improve this situation! 😃

Data Definition

Expected Output

Raw data files from data are processed and analyzed by this snakemake workflow. Within each project directory, the output is (roughly) structured as:

SAMPLE_01/                  # e.g. melanoma / crc / healthy, etc.
│   SAMPLE.bam              # The final alignment file 
|   SAMPLE.bam.bai          #   (and its index)
|── biscuit_qc/             # biscuit QC.sh text files
|── epibeds/                # epibed files (bgzip-compressed & tabix-indexed)
├── fastp/                  # fastp statistics & logs
├── fastqc/                 # fastqc graphs 
├── goleft/                 # goleft coverage plots
├── logs/                   # runlogs from each pipeline component
├── methyldackel/           # mbias plots
├── raw/
│   ├── ...fastq.gz         # Raw reads
|   ├── ...md5.txt          # Checksums and validation
├── samtools/               # samtools statistics
SAMPLE_02/
...
...
multiqc/                    # A project-level multiqc stats across all data

Note each project also has a _subsampled directory with identical structure, which is the result of the pipeline run on only 10M reads/sample.

Production Runs

Pipeline Details

This pipeline was designed for highly reproducible, explainable alignments and analysis of epigenetic sequencing data.

Reference Genome

I chose GRCh38, with these specifics:

You can see a good explanation of the rationale for some of these components at this NCBI explainer.

Requirements

All software requirements are specified in env.yaml.

Most are relatively common, but a few are semi-unique:

biscuit was chosen after a comparison with bwa-meth and bismark — its latest version was the most flexible with extremely well annotated .bams (some critical tags are missing from bwa-meth for identifying read level methylation, and would require patching MethylDackel to extract data).

I briefly experimented with wgbs_tools (which defines nice .pat/.beta formats) but its licensing is too restrictive to use.

Trimming Approach

I chose a relatively conservative approach to trimming -- which is needed due to end-repair bias, adaptase bias, and more.

For EMseq, I trim 10 bp everywhere, after personal QC and offline discussions with NEB. See my notes here.

For BSseq, I trim 15 bp 5' R2, and 10 bp everywhere else due to adaptase bias.

For all reads, I set --trim_poly_g (due to two color bias) and set a --length_required (minimum read length) of 10 bp.

No Quality Filtering

Notably I do NOT do quality filtering here (I set --disable_quality_filtering), and save this for downstream analyses as desired.

I experimented with more stringent quality filtering early on, and found it had little yield / performance benefit.

Background & Inspiration

I strongly suggest reading work from Felix Krueger (author of Bismark) as background. In particular:

For similar pipelines and inspiration, see:

Pipeline Graph

Here's a high-level overview of the Snakemake pipeline (generated via snakemake --rulegraph | dot -Tpng > rules.png)

image

About

An efficient, documented, reproducible Snakemake methylation analysis pipeline for BS-seq and EM-seq samples, including cfDNA.

Topics

Resources

License

Stars

Watchers

Forks