Datasets

Here you can find all the datasets used by the Bioinformatics research branch of GISED.


MEvoLib


These are the sequence files used in the biological examples included in the manual and in the performance study of the Genes method.

 

Biological examples

MT-ATP6 and MT-ATP8 genes

A total of 32626 human mitochondrial DNA (hmtDNA) sequences containing information about the MT-ATP6 and MT-ATP8 genes were downloaded from GenBank on 07/Jul/2016. These sequences match the following query:

"homo sapiens"[porgn] AND mitochondrion[Filter] NOT mRNA[Filter] AND (atp6 OR atpase6 OR "atpase 6" OR "atp synthase 6" OR "atpase subunit 6" OR "atp synthetase subunit 6" OR "atp synthase f0 subunit 6" OR "atp synthase fo subunit 6" OR atp8 OR atpase8 OR "atpase 8" OR "atp synthase 8" OR "atpase subunit 8" OR "atp synthetase subunit 8" OR "atp synthase f0 subunit 8" OR "atp synthase fo subunit 8")
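As a rough illustration, the query above can be assembled programmatically from its list of gene-name synonyms and turned into an NCBI E-utilities search URL. This is only a sketch: the helper names are hypothetical, the choice of the `nuccore` database is an assumption, and it is not necessarily how the original download was performed. A search run today will also return a different number of records than the one obtained on 07/Jul/2016.

```python
from urllib.parse import urlencode

# Synonym patterns copied from the query above; {n} is the subunit number.
SYNONYM_TEMPLATES = [
    "atp{n}",
    "atpase{n}",
    '"atpase {n}"',
    '"atp synthase {n}"',
    '"atpase subunit {n}"',
    '"atp synthetase subunit {n}"',
    '"atp synthase f0 subunit {n}"',
    '"atp synthase fo subunit {n}"',
]


def build_query():
    """Rebuild the MT-ATP6/MT-ATP8 query string (hypothetical helper)."""
    terms = [t.format(n=n) for n in (6, 8) for t in SYNONYM_TEMPLATES]
    return (
        '"homo sapiens"[porgn] AND mitochondrion[Filter] '
        "NOT mRNA[Filter] AND (" + " OR ".join(terms) + ")"
    )


def esearch_url(query, retmax=20):
    """Build an E-utilities esearch URL; db="nuccore" is an assumption."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {"db": "nuccore", "term": query, "retmax": retmax}
    return base + "?" + urlencode(params)
```

Building the term list from templates keeps the synonyms for subunits 6 and 8 in step, which makes an unbalanced or missing term easier to spot than in a hand-typed query.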

The GENBANK and report files saved with MEvoLib, containing the 32626 sequences, can be downloaded here. The report file is also provided separately here, since the .tar.gz file is larger than 400 MB.

A second dataset was created from the previous one by removing the sequences FR695060.1 and DQ862537.1, whose metadata for the MT-ATP6 and MT-ATP8 genes contains errors. The GENBANK and report files saved with MEvoLib, containing the 32624 sequences, can be downloaded here. The report file is also provided separately here, since the .tar.gz file is larger than 400 MB.
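The removal step can be sketched with the Python standard library alone, assuming the sequences are stored in a GenBank flat file in which every record ends with a `//` line and carries a `VERSION` line. The function names and file names below are hypothetical, and this is not necessarily the procedure used to build the published dataset.

```python
# Accession.version identifiers named in the text above.
EXCLUDED = {"FR695060.1", "DQ862537.1"}


def iter_genbank_records(lines):
    """Yield each record as a list of lines; a record ends with '//'."""
    record = []
    for line in lines:
        record.append(line)
        if line.strip() == "//":
            yield record
            record = []


def accession_version(record):
    """Return the identifier on the VERSION line, or None if absent."""
    for line in record:
        if line.startswith("VERSION"):
            parts = line.split()
            return parts[1] if len(parts) > 1 else None
    return None


def drop_excluded(lines, excluded=EXCLUDED):
    """Yield the lines of every record whose identifier is not excluded."""
    for record in iter_genbank_records(lines):
        if accession_version(record) not in excluded:
            yield from record


# Example usage (file names are placeholders):
# with open("atp6_atp8_32626.gb") as src, open("atp6_atp8_32624.gb", "w") as dst:
#     dst.writelines(drop_excluded(src))
```

Because the records are streamed one at a time, the sketch does not need to hold the whole file in memory.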

 

Borrelia burgdorferi bacteria

The sequence files used are the ones provided by the PubMLST website for the Borrelia burgdorferi s.l. (BB) alleles. We generated a tar.gz file with all the files available on 15/Jul/2016, which can be downloaded here.

The example presented in the manual requires the file mlst_info.py, which stores in a Python list the same multilocus sequence typing (MLST) information for BB that is available at PubMLST.

 

Genes method performance

A total of 31755 complete human mitochondrial DNA (hmtDNA) sequences were downloaded from GenBank on 07/Jul/2016. These sequences match the following query:

"homo sapiens"[porgn] AND mitochondrion[Filter] NOT mRNA[Filter] AND "complete genome"

The GENBANK and report files saved with MEvoLib, containing the 31755 sequences, can be downloaded here. The report file is also provided separately here, since the .tar.gz file is larger than 400 MB.

A second dataset was created from the previous one by removing the sequences KP702293.1, FR695060.1 and DQ862537.1 due to errors in their metadata. The GENBANK and report files saved with MEvoLib, containing the 31752 sequences, can be downloaded here. The report file is also provided separately here, since the .tar.gz file is larger than 400 MB.

For a scalability test, three more datasets were generated from the one with 31752 complete hmtDNA sequences. They consist of the first 100, 1000 and 10000 sequences of the source dataset, respectively.
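Taking the first N records of the source dataset can be sketched as follows, again assuming a GenBank flat file in which each record ends with a `//` line. The function names and file names are hypothetical; the published subsets may have been produced with other tools.

```python
from itertools import islice

# Subset sizes named in the text above.
SUBSET_SIZES = (100, 1000, 10000)


def iter_genbank_records(lines):
    """Yield each record as a list of lines; a record ends with '//'."""
    record = []
    for line in lines:
        record.append(line)
        if line.strip() == "//":
            yield record
            record = []


def first_n_records(lines, n):
    """Yield the lines of the first n records, in source order."""
    for record in islice(iter_genbank_records(lines), n):
        yield from record


# Example usage (file names are placeholders):
# for n in SUBSET_SIZES:
#     with open("hmtDNA_31752.gb") as src, open(f"hmtDNA_{n}.gb", "w") as dst:
#         dst.writelines(first_n_records(src, n))
```

Since `islice` stops after n records, the smaller subsets are written without reading the rest of the source file.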