337W Poster - Quantitative Genetics
Wednesday June 08, 9:15 PM - 10:00 PM

Assessing the impacts of single-end and paired-end RNA-seq on gene expression estimates and eQTL detection.


Authors:
Sam Ardery 1; Selcan Aydin 1; Daniel A. Skelly 1; Matthew Pankratz 2; Devin K. Porter 2; Ted Choi 2; Laura G. Reinholdt 1; Christopher L. Baker 1; Gary A. Churchill 1; Steven C. Munger 1

Affiliations:
1) The Jackson Laboratory, Bar Harbor, ME; 2) Predictive Biology, Inc., Carlsbad, CA

Keywords:
Complex traits

Short-read RNA sequencing (RNA-seq) has become the prevailing method for quantifying gene expression levels genome-wide. The most common RNA-seq platforms sequence 75-100 bp reads from one or both ends of RNA fragments, termed single-end (SE) or paired-end (PE), respectively. While SE and PE reads yield similar expression estimates for most genes in isogenic samples, for the subset of genes that do differ substantially, PE reads have generally been found to yield more accurate estimates owing to better alignment specificity. However, this finding has not been verified in genetically diverse samples, nor is it clear how these differences affect our ability to detect expression quantitative trait loci (eQTLs). To address these questions, we analyzed a large 2x75 bp PE RNA-seq dataset from 185 genetically diverse mouse embryonic stem cell lines (mESCs). We compared gene-level estimates of transcript abundance from aligning just the forward reads (SE) to those from the full paired-read (PE) alignments, and then used both sets of expression estimates along with mESC genotyping data to map eQTLs. We identified nearly 1,500 genes as expressed in one of the analyses but not the other (1,065 in SE, 427 in PE), and gene annotation showed that pseudogenes were overrepresented in the SE list while protein-coding genes were overrepresented in the PE list. Analysis of uniquely aligning reads in the SE data shows that many are likely transcribed from protein-coding genes but misalign to pseudogenes, a problem exacerbated by the high genetic diversity in the mESC lines. These alignment errors appear to affect eQTL detection in two related ways: by causing spurious genetic signals (false positives) and by missing real genetic signals (false negatives).
Importantly, by limiting spurious read alignment to pseudogenes and correctly assigning more reads to protein-coding genes, PE sequencing results in fewer false positive eQTLs for pseudogenes and fewer false negative eQTLs for protein-coding genes, and in so doing provides a more accurate understanding of gene regulatory variation than SE RNA-seq. Future efforts will focus on improving SE expression estimates using alternative read aligners or alignment strategies. We recommend that researchers use PE RNA-seq in eQTL mapping studies, especially when using samples with high levels of genetic divergence from the reference genome used for alignment.
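
The comparison of genes called expressed in only one of the two analyses can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the abstract does not state the criterion used to call a gene "expressed", so the threshold (`min_tpm`), the sample fraction (`min_fraction`), the function names, and the toy gene names and values are all assumptions made for the example.

```python
def expressed_genes(expr, min_tpm=1.0, min_fraction=0.5):
    """Return the set of genes called 'expressed'.

    expr maps gene name -> list of abundance estimates, one per sample.
    A gene is kept if at least min_fraction of samples reach min_tpm.
    Both cutoffs are assumed values, not taken from the abstract.
    """
    kept = set()
    for gene, values in expr.items():
        if not values:
            continue
        n_pass = sum(1 for v in values if v >= min_tpm)
        if n_pass / len(values) >= min_fraction:
            kept.add(gene)
    return kept


def compare_expressed(se_expr, pe_expr, **cutoffs):
    """Split genes into SE-only, PE-only, and shared expressed sets."""
    se = expressed_genes(se_expr, **cutoffs)
    pe = expressed_genes(pe_expr, **cutoffs)
    return {"se_only": se - pe, "pe_only": pe - se, "both": se & pe}


# Toy data (hypothetical): three samples per gene.
se_expr = {
    "GeneA": [5.0, 6.0, 4.0],
    "PseudoA": [2.0, 3.0, 1.5],   # reads misaligned to a pseudogene in SE
    "GeneB": [0.2, 0.1, 0.0],
}
pe_expr = {
    "GeneA": [7.0, 8.0, 6.5],
    "PseudoA": [0.1, 0.0, 0.2],
    "GeneB": [1.5, 2.0, 1.2],     # reads recovered by paired alignment
}

result = compare_expressed(se_expr, pe_expr)
for label in ("se_only", "pe_only", "both"):
    print(label, sorted(result[label]))
```

In this toy case the pseudogene lands in the SE-only set and the protein-coding gene in the PE-only set, mirroring the direction of the enrichment reported above; annotating each set by gene biotype would be the natural next step.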