329A Poster - 03. Evolution
Thursday April 07, 2:00 PM - 4:00 PM

Genomic Benchmarks: A Collection of Datasets For DNA Sequence Classification


Authors:
Petr Simecek; Katarina Gresova; Vlastimil Martinek; David Cechak; Panagiotis Alexiou

Affiliation: Central European Institute of Technology, Masaryk University

Keywords:
t. bioinformatic and genome tools; a. genome evolution

Recently, deep neural network models have been successfully applied to identify functional elements in the genomes of D. melanogaster and other organisms, e.g., promoters [1], chromatin folding [2], and splice sites [3]. Unfortunately, the quality of these methods is hard to compare because they use different data preprocessing approaches. In other fields, benchmark datasets have been established as a gold standard for comparison, e.g., ImageNet for image recognition, IMDB Sentiment for text classification, SQuAD for question answering, and CASP data for protein folding prediction [4].

We propose a collection of datasets that may serve as a benchmark for the classification of genomic sequences. The collection is distributed as a Python package, 'genomic-benchmarks', available from the Python Package Index (PyPI). Each dataset is stored both as a list of genomic interval coordinates and as DNA sequences. The package provides utilities for converting between these two formats, along with data cleaning procedures and checks. It also contains functions that simplify the training of a neural network classifier, such as PyTorch and TensorFlow data loaders. We hope other researchers will use our datasets to evaluate the quality of their algorithms.
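
To illustrate the two storage formats, the following is a minimal sketch of converting genomic interval coordinates into DNA sequences. It is not the package's own API: the names `Interval`, `reverse_complement`, and `intervals_to_sequences` are hypothetical, the reference genome is assumed to be an in-memory dictionary of chromosome strings, and coordinates are assumed to be 0-based and half-open.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    """A hypothetical genomic interval record (not the package's own type)."""
    chrom: str
    start: int   # assumed 0-based, inclusive
    end: int     # assumed exclusive
    strand: str  # '+' or '-'


# Translation table for complementing nucleotides (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def intervals_to_sequences(intervals: List[Interval],
                           genome: Dict[str, str]) -> List[str]:
    """Extract the DNA sequence for each interval from a reference genome.

    `genome` maps chromosome names to sequence strings; intervals on the
    minus strand are returned as reverse complements.
    """
    sequences = []
    for iv in intervals:
        chrom_seq = genome[iv.chrom]
        if not (0 <= iv.start < iv.end <= len(chrom_seq)):
            raise ValueError(f"interval out of bounds: {iv}")
        seq = chrom_seq[iv.start:iv.end]
        if iv.strand == "-":
            seq = reverse_complement(seq)
        sequences.append(seq)
    return sequences


# Toy example with a made-up 10 bp "chromosome".
toy_genome = {"chrToy": "ACGTTTGGAC"}
toy_intervals = [
    Interval("chrToy", 0, 4, "+"),
    Interval("chrToy", 4, 8, "-"),
]
toy_sequences = intervals_to_sequences(toy_intervals, toy_genome)
```

Storing coordinates rather than sequences keeps a dataset compact and tied to a specific reference assembly, while the sequence form is what a classifier consumes directly; a conversion step like the one sketched here connects the two.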

The package 'genomic-benchmarks' and demo notebooks showing how to use it are available on GitHub:

https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks

[1] Umarov, Ramzan Kh., and Victor V. Solovyev. "Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks." PLoS ONE 12.2 (2017): e0171410.
[2] Rozenwald, Michal B., et al. "A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features." PeerJ Computer Science 6 (2020): e307.
[3] Albaradei, Somayah, et al. "Splice2Deep: An ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA." Gene: X 5 (2020): 100035.
[4] Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.