In this guide, we illustrate here two common downstream analysis workflows for ChIP-seq experiments, one is for comparing and combining peaks for single transcription factor (TF) with replicates, and the other is for comparing binding profiles from ChIP-seq experiments with multiple TFs.

Workflow for ChIP-seq experiments of single transcription factor with replicates

This workflow shows how to convert BED/GFF files to GRanges, find overlapping peaks between two peak sets, and visualize the number of common and specific peaks with Venn diagram.

Import data and obtain overlapping peaks from replicates

The input for ChIPpeakAnno1 is a list of called peaks identified from ChIP-seq experiments or any other experiments that yield a set of chromosome coordinates. Although peaks are represented as GRanges in ChIPpeakAnno, other common peak formats such as BED, GFF and MACS can be converted to GRanges easily using a conversion toGRanges method. For detailed information on how to use this method, please type ?toGRanges.

The following examples illustrate the usage of this method to convert BED and GFF file to GRanges, add metadata from orignal peaks to the overlap GRanges using function addMetadata, and visualize the overlapping using function makeVennDiagram.

library(ChIPpeakAnno)
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE) 
## one can also try import from rtracklayer
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
## must keep the class exactly same as gr1$score, i.e., numeric.
gr2$score <- as.numeric(gr2$score) 
ol <- findOverlapsOfPeaks(gr1, gr2)
## add metadata (mean of score) to the overlapping peaks
ol <- addMetadata(ol, colNames="score", FUN=mean) 
ol$peaklist[["gr1///gr2"]][1:2]
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames        ranges strand |                           peakNames
##          <Rle>     <IRanges>  <Rle> |                     <CharacterList>
##   [1]     chr1 713791-715578      * | gr1__MACS_peak_13,gr2__001,gr2__002
##   [2]     chr1 724851-727191      * |          gr2__003,gr1__MACS_peak_14
##           score
##       <numeric>
##   [1]   850.203
##   [2]    29.170
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
makeVennDiagram(ol, fill=c("#009E73", "#F0E442"), # circle fill color
                col=c("#D55E00", "#0072B2"), #circle border color
                cat.col=c("#D55E00", "#0072B2")) # label color, keep same as circle border color
Venn diagram of overlaps for replicated experiments

Venn diagram of overlaps for replicated experiments

## $p.value
##      gr1 gr2 pval
## [1,]   1   1    0
## 
## $vennCounts
##      gr1 gr2 Counts count.gr1 count.gr2
## [1,]   0   0      0         0         0
## [2,]   0   1     61         0        61
## [3,]   1   0     62        62         0
## [4,]   1   1    166       168       169
## attr(,"class")
## [1] "VennCounts"

Prepare annotation data

Annotation data should be an object of GRanges. Same as import peaks, we use the method toGRanges, which can return an object of GRanges, to represent the annotation data. An annotation data be constructed from not only BED, GFF or user defined readable text files, but also EnsDb or TxDb object, by calling the toGRanges method. Please type ?toGRanges for more information.

Note that the version of the annotation data must match with the genome used for mapping because the coordinates may differ for different genome releases. For example, if you are using Mus_musculus.v103 for mapping, you’d best also use EnsDb.Mmusculus.v103 for annotation. For more information about how to prepare the annotation data, please refer ?getAnnotation.

library(EnsDb.Hsapiens.v75) ##(hg19)
## create annotation file from EnsDb or TxDb
annoData <- toGRanges(EnsDb.Hsapiens.v75, feature="gene")
annoData[1:2]
## GRanges object with 2 ranges and 1 metadata column:
##                   seqnames      ranges strand |   gene_name
##                      <Rle>   <IRanges>  <Rle> | <character>
##   ENSG00000223972     chr1 11869-14412      + |     DDX11L1
##   ENSG00000227232     chr1 14363-29806      - |      WASH7P
##   -------
##   seqinfo: 273 sequences (1 circular) from 2 genomes (hg19, GRCh37)

Visualize binding site distribution relative to features

After finding the overlapping peaks, the distribution of the distance of overlapped peaks to the nearest feature such as the transcription start sites (TSS) can be plotted by binOverFeature function. The sample code here plots the distribution of peaks around the TSS.

overlaps <- ol$peaklist[["gr1///gr2"]]
binOverFeature(overlaps, annotationData=annoData,
               radius=5000, nbins=20, FUN=length, errFun=0,
               xlab="distance from TSS (bp)", ylab="count", 
               main="Distribution of aggregated peak numbers around TSS")
Distribution of peaks around transcript start sites.

Distribution of peaks around transcript start sites.

In addition, genomicElementDistribution can be used to summarize the distribution of peaks over different type of features such as exon, intron, enhancer, proximal promoter, 5’ UTR and 3’ UTR. This distribution can be summarized in peak centric or nucleotide centric view using the function genomicElementDistribution. Please note that one peak might span multiple type of features, leading to the number of annotated features greater than the total number of input peaks. At the peak centric view, precedence will dictate the annotation order when peaks span multiple type of features.

## check the genomic element distribution of the duplicates
## the genomic element distribution will indicates the 
## the correlation between duplicates.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
peaks <- GRangesList(rep1=gr1,
                     rep2=gr2)
genomicElementDistribution(peaks, 
                           TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene,
                           promoterRegion=c(upstream=2000, downstream=500),
                           geneDownstream=c(upstream=0, downstream=2000))
Peak distribution over different genomic features.

Peak distribution over different genomic features.

## check the genomic element distribution for the overlaps
## the genomic element distribution will indicates the 
## the best methods for annotation.
## The percentages in the legend show the percentage of peaks in 
## each category.
out <- genomicElementDistribution(overlaps, 
                           TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene,
                           promoterRegion=c(upstream=2000, downstream=500),
                           geneDownstream=c(upstream=0, downstream=2000),
                           promoterLevel=list(
                         # from 5' -> 3', fixed precedence 3' -> 5'
                             breaks = c(-2000, -1000, -500, 0, 500),
                             labels = c("upstream 1-2Kb", "upstream 0.5-1Kb", 
                                        "upstream <500b", "TSS - 500b"),
                             colors = c("#FFE5CC", "#FFCA99", 
                                        "#FFAD65", "#FF8E32")))
Peak distribution over different genomic features.

Peak distribution over different genomic features.

## check the genomic element distribution by upset plot.
## by function genomicElementUpSetR, no precedence will be considered.
library(UpSetR)
x <- genomicElementUpSetR(overlaps, 
                          TxDb.Hsapiens.UCSC.hg19.knownGene)
upset(x$plotData, nsets=13, nintersects=NA)
Peak distribution over different genomic features.

Peak distribution over different genomic features.

Metagene plot may also provide information for annotation.

metagenePlot(peaks, TxDb.Hsapiens.UCSC.hg19.knownGene)

Annotate peaks

As shown from the distribution of aggregated peak numbers around TSS and the distribution of peaks in different of chromosome regions, most of the peaks locate around TSS. Therefore, it is reasonable to use annotatePeakInBatch or annoPeaks to annotate the peaks to the promoter regions of Hg19 genes. Promoters can be specified with bindingRegion. For the following example, promoter region is defined as upstream 2000 and downstream 500 from TSS (bindingRegion=c(-2000, 500)).

overlaps.anno <- annotatePeakInBatch(overlaps, 
                                     AnnotationData=annoData, 
                                     output="nearestBiDirectionalPromoters",
                                     bindingRegion=c(-2000, 500))
library(org.Hs.eg.db)
overlaps.anno <- addGeneIDs(overlaps.anno,
                            "org.Hs.eg.db",
                            IDs2Add = "entrez_id")
head(overlaps.anno)
## GRanges object with 6 ranges and 11 metadata columns:
##       seqnames        ranges strand |                           peakNames
##          <Rle>     <IRanges>  <Rle> |                     <CharacterList>
##    X1     chr1 713791-715578      * | gr1__MACS_peak_13,gr2__001,gr2__002
##    X1     chr1 713791-715578      * | gr1__MACS_peak_13,gr2__001,gr2__002
##    X3     chr1 839467-840090      * |          gr1__MACS_peak_16,gr2__004
##    X4     chr1 856361-856999      * |          gr1__MACS_peak_17,gr2__005
##    X5     chr1 859315-860144      * |          gr2__006,gr1__MACS_peak_18
##   X10     chr1 901118-902861      * |          gr2__012,gr1__MACS_peak_23
##           score        peak         feature feature.ranges feature.strand
##       <numeric> <character>     <character>      <IRanges>          <Rle>
##    X1   850.203          X1 ENSG00000228327  700237-714006              -
##    X1   850.203          X1 ENSG00000237491  714150-745440              +
##    X3    73.120          X3 ENSG00000272438  840214-851356              +
##    X4    54.690          X4 ENSG00000223764  852245-856396              -
##    X5    81.485          X5 ENSG00000187634  860260-879955              +
##   X10   119.910         X10 ENSG00000187583  901877-911245              +
##        distance insideFeature distanceToSite     gene_name   entrez_id
##       <integer>   <character>      <integer>   <character> <character>
##    X1         0  overlapStart              0 RP11-206L10.2        <NA>
##    X1         0  overlapStart              0 RP11-206L10.9   105378580
##    X3       123      upstream            123  RP11-54O7.16        <NA>
##    X4         0  overlapStart              0   RP11-54O7.3   100130417
##    X5       115      upstream            115        SAMD11      148398
##   X10         0  overlapStart              0       PLEKHN1       84069
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
ol.anno.out <- unname(overlaps.anno)
ol.anno.out$peakNames <- NULL # remove the CharacterList to avoid error message.
write.csv(as.data.frame(ol.anno.out), "anno.csv")

The distribution of the common peaks around features can be visual