Oslo University Hospital

HMST-Seq-Analyzer: A New Python Tool for Differential Methylation and Hydroxymethylation Analysis in Various DNA Methylation Sequencing Data

Authors:

Amna Farooq 1, Sindre Grønmyr 2, Omer Ali 1, Torbjørn Rognes 2,4, Katja Scheffler 5,6, Magnar Bjørås 3,4, Junbai Wang 1*

1. Department of Pathology, Oslo University Hospital - Norwegian Radium Hospital, Oslo, Norway

2. Department of Informatics, University of Oslo, Oslo, Norway

3. Institute for Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway.

4. Department of Microbiology, Oslo University Hospital and University of Oslo, Oslo, Norway.

5. Department of Neuromedicine and Movement Science and Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway.

6. Department of Neurology and Department of Laboratory Medicine, St. Olavs Hospital, Trondheim, Norway.

*To whom correspondence should be addressed. Email: junbai.wang@rr-research.no

Abstract:

DNA methylation (5mC) and hydroxymethylation (5hmC) are chemical modifications of cytosine bases that play a crucial role in epigenetic gene regulation. However, cost, data complexity and the lack of comprehensive analytical tools are among the major challenges in exploring these epigenetic marks. Hydroxymethylation- and Methylation-Sensitive Tag sequencing (HMST-seq) is one of the most cost-effective techniques that enables simultaneous detection of 5mC and 5hmC at single-base-pair resolution. We present HMST-Seq-Analyzer as a comprehensive and robust method for performing simultaneous differential methylation analysis on 5mC and 5hmC data sets. HMST-Seq-Analyzer can detect Differentially Methylated Regions (DMRs), annotate them, give a visual overview of methylation status and perform a preliminary quality check on the data. In addition to HMST-seq data, the tool can be used on whole-genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) data sets. The tool is written in Python, can process data in parallel, and is available online (https://hmst-seq.github.io/hmst/).

How to start:

HMST-Seq-Analyzer is written in Python. It is installed and run from the command line and is available for both Linux and macOS. The package can be downloaded here.

Before installing the package, its dependencies must be satisfied. The list of dependencies is as follows:

  • bedtools
  • setuptools
  • itertools
  • pandas
  • numpy
  • argparse
  • os
  • shutil
  • multiprocessing
  • matplotlib
  • seaborn
  • datetime
  • scipy
  • tempfile
  • time
  • matlab.engine
  • numba

It is advised to install dependencies using miniconda.

The package contains a file, requirments.txt, which can be used for automatic installation of the dependencies with conda or pip.
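Before installing, it can help to check which of the listed dependencies are already present. The following is a minimal sketch, not part of the package: the helper names module_available and missing_dependencies are hypothetical, and the check only tests whether each module or executable can be located, not whether its version is suitable.

```python
import importlib.util
import shutil

# Python modules named in the dependency list above.
PYTHON_MODULES = [
    "setuptools", "itertools", "pandas", "numpy", "argparse", "os",
    "shutil", "multiprocessing", "matplotlib", "seaborn", "datetime",
    "scipy", "tempfile", "time", "matlab.engine", "numba",
]

# Command-line tools named in the dependency list above.
EXECUTABLES = ["bedtools"]


def module_available(name):
    """Return True if Python can locate the module (hypothetical helper)."""
    try:
        # For dotted names such as "matlab.engine", the parent package
        # is imported in order to locate the submodule.
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "matlab") is itself missing.
        return False


def missing_dependencies(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return the names from both lists that cannot be found."""
    missing = [m for m in modules if not module_available(m)]
    missing += [e for e in executables if shutil.which(e) is None]
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing dependencies: " + ", ".join(missing))
    else:
        print("All listed dependencies were found.")
```

Several names in the list (for example os, time and itertools) belong to the Python standard library, so they should always be reported as found.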

To install the package, go to the HMST-Seq-Analyzer directory and type: python setup.py install

For more details, follow the readme file in the package.

Contents of the package:

The package folder contains the following:

  • demo : Contains demo data sets.
  • hmst-seq-analyzer : Contains the Python source code of the pipeline.
  • readme.txt : Instructions about usage of package.
  • requirments.txt : List of requirements. Can be used for automatic installation with miniconda or pip.
  • setup.py: Setup file for package.

Pipeline Tasks:

The pipeline consists of the following 8 tasks. To run a task, type hmst_seq_analyzer <task> [<args>]. To see the options for each task of the pipeline, run: hmst_seq_analyzer -h

  • gene_annotation : Cleans reference file and creates genomic region files (TSS, geneBody, TES, 5dist and intergenic) from the reference
  • data_preprocessing : Creation of 5mC and 5hmC files, quantile normalization
  • find_MRs : Extracts genomic regions from 5mC/5hmC-files and finds methylated regions
  • prepare_for_DMR_finding : Finds overlapping methylated regions between MRs in WT condition samples and KO condition samples
  • DMR_search : Finds differentially methylated regions
  • prep4plot : Prepares files for plotting
  • plot_all : Plots hyper versus hypo differentially methylated regions, and relative density of significantly modified sites in MRs
  • clean_files : Removes some unwanted files. Please only use after prep4plot is already done
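The task list above can be turned into command lines programmatically. The following is a minimal sketch, assuming the tasks are meant to be run in the order listed; build_command and print_plan are hypothetical helper names, and the arguments each task needs are not shown here (see hmst_seq_analyzer -h for those).

```python
import shlex

# Task names as listed above, in the order given there.
TASKS = [
    "gene_annotation",
    "data_preprocessing",
    "find_MRs",
    "prepare_for_DMR_finding",
    "DMR_search",
    "prep4plot",
    "plot_all",
    "clean_files",
]


def build_command(task, extra_args=()):
    """Return the argv list for one task: hmst_seq_analyzer <task> [<args>]."""
    if task not in TASKS:
        raise ValueError("unknown task: " + task)
    return ["hmst_seq_analyzer", task, *extra_args]


def print_plan(args_by_task=None):
    """Print one shell-quoted command line per task, in the listed order."""
    args_by_task = args_by_task or {}
    for task in TASKS:
        print(shlex.join(build_command(task, args_by_task.get(task, ()))))
```

To execute a step rather than print it, each list returned by build_command can be passed to subprocess.run(cmd, check=True), which stops the run if a task fails.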

Demo:

A test run is available on public hg19 data, located in the demo folder.

The folder HMST-Seq-analyzer/demo contains an sbatch file, job_demo_HMST.sbatch. To run the demo automatically, enter the following on the command line: sbatch job_demo_HMST.sbatch

Tool Box:

The Tool Box can be used for cleaning and sorting input methylation data, extracting data for a single chromosome or splitting input methylation files chromosome-wise, and calculating the average read count within the ranges 1-5, 6-10, 11-15 and 16-20 as a quality check of the input data.

The tool box can be found in: HMST-Seq-Analyzer_amna/demo

    For cleaning and sorting input: python avg_read_coverage.py --In_path data-sample.txt --Out_path out --Org hum

    For cleaning and sorting input and extracting data for one chromosome : python avg_read_coverage.py --In_path data-sample.txt --Out_path out --Org hum --Chr 17

    For cleaning and sorting input and splitting data chromosome-wise : python avg_read_coverage.py --In_path data-sample.txt --Out_path out --Org hum --Chr all

    For cleaning and sorting input and calculating average read count : python avg_read_coverage.py --In_path data-sample.txt --Out_path out --Org hum --Avg_read y
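To illustrate the quality check, the following minimal sketch shows one possible reading of "average read count within the ranges 1-5, 6-10, 11-15 and 16-20": sites are grouped by read count and the mean is taken per group. This is an assumption for illustration only and is not taken from avg_read_coverage.py; the function name average_read_count_by_range is hypothetical.

```python
# Coverage ranges named in the Tool Box description (inclusive bounds).
RANGES = [(1, 5), (6, 10), (11, 15), (16, 20)]


def average_read_count_by_range(read_counts, ranges=RANGES):
    """Group read counts by range and return the mean of each group.

    Returns a dict mapping "low-high" to the mean read count of the sites
    whose count falls in that range, or None when no site falls in it.
    Counts outside every range (0 or above 20 here) are ignored.
    """
    result = {}
    for low, high in ranges:
        in_range = [c for c in read_counts if low <= c <= high]
        label = f"{low}-{high}"
        result[label] = sum(in_range) / len(in_range) if in_range else None
    return result


# Example with made-up read counts for seven sites.
print(average_read_count_by_range([2, 4, 7, 12, 18, 20, 35]))
# prints {'1-5': 3.0, '6-10': 7.0, '11-15': 12.0, '16-20': 19.0}
```

A sample in which most sites fall in the lowest range may warrant a closer look before it is used for DMR finding.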

References:

  1. Gao, F., et al., Integrated detection of both 5-mC and 5-hmC by high-throughput tag sequencing technology highlights methylation reprogramming of bivalent genes during cellular differentiation. Epigenetics, 2013. 8(4): p. 421-430.