Identification of Drosophila new genes using machine learning
Authors: Gabriel Goldstein 1,2; Maria Vibranovski 1,3; Yong Zhang 2
Affiliations: 1) Universidade de São Paulo, USP, São Paulo, Brazil; 2) Chinese Academy of Sciences, CAS, Beijing, China; 3) Arizona State University, ASU, Tempe, USA
Keywords: l. computational algorithms; a. genome evolution
New genes are defined by their presence in a taxon and absence in sibling taxa. These genes are biologically important: they are involved in processes under strong selective pressure and are expressed in tissues such as brain and testis. Several genetic mechanisms can generate new genes, including duplication and retrotransposition, although most new genes arise by duplication. The exact functions of these genes are still being studied, but some work has already linked them to the resolution of sexual conflict. Even so, several biological characteristics are known to differ between new and old genes. One example is the expression profile: new genes are mostly expressed in male gametogenesis, whereas old genes are broadly expressed. The main gene dating method for identifying new genes combines synteny, the conservation of gene order and gene content in a genomic region across related species, with parsimony, comparing the genomes of related species to date all genes of a focal species. Although accurate, this method depends heavily on the assembly and annotation of the genome of interest, which limits its application to model species with manually curated annotation. With this in mind, we propose a method that uses biological information and machine learning to separate new genes from old genes. Machine learning algorithms improve with experience and can identify complex patterns and assign classes from a variety of information. We trained a random forest model in the model species Drosophila melanogaster and obtained 0.508 precision and 0.718 recall with generated data. In addition, we identified 1523 new genes in D. pseudoobscura using the existing method, so that this species can serve as a second control point for our method.
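
As a reading aid for the reported metrics, the following minimal sketch shows how precision and recall are computed for a two-class (new versus old) prediction. It assumes that "new" is treated as the positive class, which the abstract does not state explicitly; the function name `precision_recall` and the toy labels are hypothetical and are not the study's data. Under that assumption, a precision of 0.508 would mean that about half of the genes predicted as new are truly new, and a recall of 0.718 would mean that about 72% of the truly new genes are recovered.

```python
def precision_recall(true_labels, predicted_labels, positive="new"):
    """Return (precision, recall) for the chosen positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: genes that are new and were predicted as new.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: old genes that were predicted as new.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: new genes that were predicted as old.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels for ten genes (illustration only, not real data).
true_labels = ["new", "new", "new", "new", "old", "old", "old", "old", "old", "old"]
predicted = ["new", "new", "new", "old", "new", "new", "old", "old", "old", "old"]

precision, recall = precision_recall(true_labels, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case three of the five genes predicted as new are truly new, and three of the four truly new genes are recovered.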