A standardized, reproducible pipeline to process WGBS bisulfite & EM-seq data. This goes from .fastq to methylation calls (via biscuit) and includes extensive QC and plotting, using a Snakemake pipeline.
At a high level, this pipeline reproducibly:
- Builds a reference genome
- Trims & (minimally) filters reads
- Aligns & calls methylation using biscuit
- Flags non-converted reads
- Generates standardized outputs & QC including
- FastQC
- fastp
- Biscuit QC
- samtools stats
- MethylDackel mbias plots
- Goleft covplots
- epibed/epiread files
- Runs multiqc across entire projects
This pipeline is designed to be straightforward:
- Clone this repository and open the directory:
git clone https://github.com/semenko/serpent-methylation-pipeline.git cd serpent-methylation-pipeline
- Install Snakemake via mamba (or conda)
mamba install -c bioconda -c conda-forge snakemake snakemake-storage-plugin-http
- (Optional) Create a separate conda environment for pipeline dependencies:
Then activate it with:
mamba env create -n serpent_pipeline_env -f workflow/envs/env.yaml
conda activate serpent_pipeline_env
Use:
nice snakemake --cores 4 --use-conda --printshellcmds --rerun-incomplete --rerun-triggers mtime --keep-going --dry-run
to quickly validate the pipeline and see what would be executed. Remove --dry-run
to run the full process.
After removing the --dry-run
flag, this will download reference genomes and build indices.
Unfortunately, critical dependencies don't support Mac OS architectures yet (e.g., bwa-mem2
only supports linux-aarch64
), though this might be supported using brew
. Please open an issue or submit a PR if you're able to improve this situation! 😃
Raw data files from data are processed and analyzed by this snakemake workflow. Within each project directory, the output is (roughly) structured as:
SAMPLE_01/ # e.g. melanoma / crc / healthy, etc.
│ SAMPLE.bam # The final alignment file
| SAMPLE.bam.bai # (and its index)
|── biscuit_qc/ # biscuit QC.sh text files
|── epibeds/ # epibed files (bgzip-compressed & tabix-indexed)
├── fastp/ # fastp statistics & logs
├── fastqc/ # fastqc graphs
├── goleft/ # goleft coverage plots
├── logs/ # runlogs from each pipeline component
├── methyldackel/ # mbias plots
├── raw/
│ ├── ...fastq.gz # Raw reads
| ├── ...md5.txt # Checksums and validation
├── samtools/ # samtools statistics
SAMPLE_02/
...
...
multiqc/ # A project-level multiqc stats across all data
Note each project also has a _subsampled
directory with identical structure, which is the result of the pipeline run on only 10M reads/sample.
This pipeline was designed for highly reproducible, explainable alignments and analysis of epigenetic sequencing data.
I chose GRCh38, with these specifics:
- No patches
- Includes the hs38d1 decoy
- Includes Alt chromosomes
- Applies the U2AF1 masking file
- Applies the Encode DAC exclusion
You can see a good explanation of the rationale for some of these components at this NCBI explainer.
All software requirements are specified in env.yaml.
Most are relatively common, but a few are semi-unique:
- biscuit (for alignment)
- NEB's mark-nonconverted-reads (to mark partially converted reads)
biscuit was chosen after a comparison with bwa-meth and bismark — its latest version was the most flexible with extremely well annotated .bams (some critical tags are missing from bwa-meth for identifying read level methylation, and would require patching MethylDackel to extract data).
I briefly experimented with wgbs_tools (which defines nice .pat/.beta formats) but its licensing is too restrictive to use.
I chose a relatively conservative approach to trimming -- which is needed due to end-repair bias, adaptase bias, and more.
For EMseq, I trim 10 bp everywhere, after personal QC and offline discussions with NEB. See my notes here.
For BSseq, I trim 15 bp 5' R2, and 10 bp everywhere else due to adaptase bias.
For all reads, I set --trim_poly_g
(due to two color bias) and set a --length_required
(minimum read length) of 10 bp.
Notably I do NOT do quality filtering here (I set --disable_quality_filtering
), and save this for downstream analyses as desired.
I experimented with more stringent quality filtering early on, and found it had little yield / performance benefit.
I strongly suggest reading work from Felix Krueger (author of Bismark) as background. In particular:
- TrimGalore's RRBS guide
- The Babraham WGBS/RRBS tutorials
For similar pipelines and inspiration, see:
- NEB's EM-seq pipeline
- Felix Krueger's Nextflow WGBS Pipeline
- The Snakepipes WGBS pipeline
Here's a high-level overview of the Snakemake pipeline (generated via snakemake --rulegraph | dot -Tpng > rules.png
)