Serpent Methylation Pipeline (for Snakemake)

A standardized, reproducible pipeline to process WGBS bisulfite & EM-seq data. This goes from .fastq to methylation calls (via biscuit) and includes extensive QC and plotting, using a Snakemake pipeline.

At a high level, this pipeline reproducibly:

Builds a reference genome
Trims & (minimally) filters reads
Aligns & calls methylation using biscuit
Flags non-converted reads
Generates standardized outputs & QC including
- FastQC
- fastp
- Biscuit QC
- samtools stats
- MethylDackel mbias plots
- Goleft covplots
- epibed/epiread files
Runs multiqc across entire projects

Getting Started

This pipeline is designed to be straightforward:

Clone this repository and open the directory:

git clone https://github.com/semenko/serpent-methylation-pipeline.git
cd serpent-methylation-pipeline

Install Snakemake via mamba (or conda)

mamba install -c bioconda -c conda-forge snakemake snakemake-storage-plugin-http

(Optional) Create a separate conda environment for pipeline dependencies:

mamba env create -n serpent_pipeline_env -f workflow/envs/env.yaml

Then activate it with:

conda activate serpent_pipeline_env

Test Run

Use:

nice snakemake --cores 4 --use-conda --printshellcmds --rerun-incomplete --rerun-triggers mtime --keep-going --dry-run

to quickly validate the pipeline and see what would be executed. Remove --dry-run to run the full process.

After removing the --dry-run flag, this will download reference genomes and build indices.

Mac OS X Compatibility

Unfortunately, critical dependencies don't support Mac OS architectures yet (e.g., bwa-mem2 only supports linux-aarch64), though this might be supported using brew. Please open an issue or submit a PR if you're able to improve this situation! 😃

Data Definition

Expected Output

Raw data files from data are processed and analyzed by this snakemake workflow. Within each project directory, the output is (roughly) structured as:

SAMPLE_01/                  # e.g. melanoma / crc / healthy, etc.
│   SAMPLE.bam              # The final alignment file 
|   SAMPLE.bam.bai          #   (and its index)
|── biscuit_qc/             # biscuit QC.sh text files
|── epibeds/                # epibed files (bgzip-compressed & tabix-indexed)
├── fastp/                  # fastp statistics & logs
├── fastqc/                 # fastqc graphs 
├── goleft/                 # goleft coverage plots
├── logs/                   # runlogs from each pipeline component
├── methyldackel/           # mbias plots
├── raw/
│   ├── ...fastq.gz         # Raw reads
|   ├── ...md5.txt          # Checksums and validation
├── samtools/               # samtools statistics
SAMPLE_02/
...
...
multiqc/                    # A project-level multiqc stats across all data

Note each project also has a _subsampled directory with identical structure, which is the result of the pipeline run on only 10M reads/sample.

Production Runs

Pipeline Details

This pipeline was designed for highly reproducible, explainable alignments and analysis of epigenetic sequencing data.

Reference Genome

I chose GRCh38, with these specifics:

No patches
Includes the hs38d1 decoy
Includes Alt chromosomes
Applies the U2AF1 masking file
Applies the Encode DAC exclusion

You can see a good explanation of the rationale for some of these components at this NCBI explainer.

Requirements

All software requirements are specified in env.yaml.

Most are relatively common, but a few are semi-unique:

biscuit (for alignment)
NEB's mark-nonconverted-reads (to mark partially converted reads)

biscuit was chosen after a comparison with bwa-meth and bismark — its latest version was the most flexible with extremely well annotated .bams (some critical tags are missing from bwa-meth for identifying read level methylation, and would require patching MethylDackel to extract data).

I briefly experimented with wgbs_tools (which defines nice .pat/.beta formats) but its licensing is too restrictive to use.

Trimming Approach

I chose a relatively conservative approach to trimming -- which is needed due to end-repair bias, adaptase bias, and more.

For EMseq, I trim 10 bp everywhere, after personal QC and offline discussions with NEB. See my notes here.

For BSseq, I trim 15 bp 5' R2, and 10 bp everywhere else due to adaptase bias.

For all reads, I set --trim_poly_g (due to two color bias) and set a --length_required (minimum read length) of 10 bp.

No Quality Filtering

Notably I do NOT do quality filtering here (I set --disable_quality_filtering), and save this for downstream analyses as desired.

I experimented with more stringent quality filtering early on, and found it had little yield / performance benefit.

Background & Inspiration

I strongly suggest reading work from Felix Krueger (author of Bismark) as background. In particular:

TrimGalore's RRBS guide
The Babraham WGBS/RRBS tutorials

For similar pipelines and inspiration, see:

NEB's EM-seq pipeline
Felix Krueger's Nextflow WGBS Pipeline
The Snakepipes WGBS pipeline

Pipeline Graph

Here's a high-level overview of the Snakemake pipeline (generated via snakemake --rulegraph | dot -Tpng > rules.png)

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
.github		.github
config		config
data		data
workflow		workflow
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.snakemake-workflow-catalog.yml		.snakemake-workflow-catalog.yml
LICENSE		LICENSE
README.md		README.md
serpent-logo.png		serpent-logo.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Serpent Methylation Pipeline (for Snakemake)

Getting Started

Test Run

Mac OS X Compatibility

Data Definition

Expected Output

Production Runs

Pipeline Details

Reference Genome

Requirements

Trimming Approach

No Quality Filtering

Background & Inspiration

Pipeline Graph

About

Languages

License

semenko/serpent-methylation-pipeline

Folders and files

Latest commit

History

Repository files navigation

Serpent Methylation Pipeline (for Snakemake)

Getting Started

Test Run

Mac OS X Compatibility

Data Definition

Expected Output

Production Runs

Pipeline Details

Reference Genome

Requirements

Trimming Approach

No Quality Filtering

Background & Inspiration

Pipeline Graph

About

Topics

Resources

License

Stars

Watchers

Forks

Languages