Reporter gene assays and chromatin-level assays define substantially non-overlapping sets of sequences as enhancers
Authors: Daniel Lindhorst 1; Marc S. Halfon 1,2
Affiliations: 1) University at Buffalo-State University of New York; 2) NY State Center of Excellence in Bioinformatics & Life Sciences
Keywords: e. enhancers; n. other (regulatory sequences)
Enhancers are critical for eukaryotic transcriptional regulation. However, just how enhancers should be defined remains unclear. While function-based reporter gene assays have been the traditional benchmark for enhancer identification, next-generation sequencing-based techniques that scan for open chromatin, histone modifications, or specific transcription factors (e.g. ATAC-Seq, ChIP-Seq) have become a new source of powerful, high-throughput methods for defining enhancers. Whether these various enhancer definitions consistently lead to identification of the same sequences is unknown. To compare the functional and the chromatin-level enhancer definitions, we analyzed the overlap between enhancers defined in two enhancer databases, REDfly (Rivera et al. 2019, NAR 47:D828) and EnhancerAtlas 2.0 (Gao and Quan 2019, NAR 48:D58). REDfly uses primarily a functional definition based on reporter gene analysis, while EnhancerAtlas integrates the results of chromatin-level assays using a supervised learning model. We used REDfly’s search capabilities to build tissue-specific enhancer datasets and compared these with tissue-specific EnhancerAtlas datasets. Surprisingly, we found that only 4 of 11 sets (36%) showed statistically significant overlap. From this, we hypothesized that the observed discrepancies could be caused by the way data from multiple techniques/assays are integrated by the EnhancerAtlas method. To test this, we took the underlying EnhancerAtlas data subsets and compared them individually with their matched REDfly sets. Of these EnhancerAtlas subsets, 66% had significant overlap, a substantial increase over the full-set comparisons, although still limited. However, the median intersection of these EnhancerAtlas subsets with REDfly enhancers was only 39%. Thus, even the sets with significant overlap include fewer than half of the expected reporter gene-defined enhancers from the corresponding REDfly set.
We derive two conclusions from our findings. First, during the integration of the EnhancerAtlas datasets, enhancers present in the underlying EnhancerAtlas data are being lost, suggesting that a more sensitive learning model may be required. Second, and more importantly, the poor overlap between the reporter gene-defined enhancers and the chromatin assay-defined enhancers suggests that one or both of these approaches carry a high error rate. Further investigation will be required to determine which approaches lead to the most accurate enhancer definition.
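The kind of set-overlap comparison described above can be illustrated with a minimal sketch. The abstract does not state which statistical test or overlap criterion was used, so everything here is an assumption made for illustration: the one-base overlap rule, the single-chromosome coordinates, the permutation scheme, and the function names `overlaps`, `fraction_overlapping`, and `permutation_p_value` are hypothetical and are not taken from REDfly or EnhancerAtlas.

```python
import random

def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def fraction_overlapping(reference, candidates):
    """Fraction of reference intervals hit by at least one candidate interval."""
    if not reference:
        return 0.0
    hits = sum(1 for r in reference if any(overlaps(r, c) for c in candidates))
    return hits / len(reference)

def permutation_p_value(reference, candidates, genome_length, n_perm=1000, seed=0):
    """Empirical p-value: share of random placements of the candidates that
    overlap the reference set at least as well as the observed placement."""
    rng = random.Random(seed)
    observed = fraction_overlapping(reference, candidates)
    at_least = 0
    for _ in range(n_perm):
        shuffled = []
        for start, end in candidates:
            length = end - start
            # Assumed null model: uniform placement on one chromosome, length preserved.
            new_start = rng.randrange(0, genome_length - length + 1)
            shuffled.append((new_start, new_start + length))
        if fraction_overlapping(reference, shuffled) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (n_perm + 1)

# Toy, made-up coordinates on a single hypothetical 10 kb chromosome.
reporter_defined = [(100, 200), (500, 600), (900, 1000), (1500, 1600)]
chromatin_defined = [(150, 250), (580, 620), (3000, 3100)]

observed_fraction = fraction_overlapping(reporter_defined, chromatin_defined)
p_value = permutation_p_value(reporter_defined, chromatin_defined, genome_length=10_000)
print(f"fraction of reporter-defined intervals overlapped: {observed_fraction:.2f}")
print(f"empirical p-value: {p_value:.3f}")
```

In this toy case two of the four reporter-defined intervals are overlapped. A real analysis would need per-chromosome handling and a null model that respects genome structure, for which dedicated interval tools are likely a better fit than this sketch.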