Databases
Many research projects use reference or external databases. This page lists the databases maintained on Mahuika and gives recommendations for obtaining data from selected external databases.
Maintained databases on Mahuika
Some databases are readable for all users on Mahuika.
These databases can be found at /opt/nesi/db.
Some environment modules depend on these databases and point to these directories automatically.
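As a minimal sketch, the shell function below prints the top-level directories under the shared database path so you can see what is installed. The function name `list_databases` is hypothetical and used only for illustration; the default path is the one given above.

```shell
# List the top-level database directories under a root path.
# list_databases is a hypothetical helper name, not a command provided on Mahuika.
list_databases() {
    db_root="${1:-/opt/nesi/db}"    # default to the shared location named above
    if [ ! -d "$db_root" ]; then
        echo "No database directory found at $db_root" >&2
        return 1
    fi
    for db_entry in "$db_root"/*/; do
        # Skip the unexpanded pattern when the root contains no directories.
        [ -d "$db_entry" ] || continue
        basename "$db_entry"
    done
}
```

Calling `list_databases` with no argument inspects `/opt/nesi/db`; passing another path as the first argument inspects that location instead.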
| Dataset | Path | Licence Status | Notes |
|---|---|---|---|
| AlphaFold | /opt/nesi/db/alphafold_db | CC-BY-4.0 | Predicted protein structures generated by AlphaFold |
| BLAST | /opt/nesi/db/blast | Public | NCBI BLAST nucleotide and protein databases |
| cartopy | /opt/nesi/db/cartopy | BSD-3-Clause | Databases for cartopy module |
| centrifuge | /opt/nesi/db/centrifuge | GPL-3.0 | Databases for centrifuge module |
| CheckM2 | /opt/nesi/db/CheckM2_DB | GPL-3.0 | Database for CheckM2 module |
| CheckM | /opt/nesi/db/CheckM_DB | GPL-3.0 | Database for CheckM module |
| DAS Tool | /opt/nesi/db/DAS_DB | Public | Database for DAS_Tool module |
| dammit | /opt/nesi/db/dammit_db | BSD | Databases for dammit module |
| dfam 3.9 | /opt/nesi/db/dfam_3.9 | CC0 1.0 | Transposable element profile HMM database |
| DRAM 1.3.5 | /opt/nesi/db/DRAM_1.3.5 | GPL-3.0 | Databases for DRAM module |
| eggnogdb | /opt/nesi/db/eggnog_db | Unspecified | Orthologous group and functional annotation database |
| FCS-GX | /opt/nesi/db/FCS-GX | United States Government Work | Database for FCS-GX module |
| gtdbtk_202 | /opt/nesi/db/gtdbtk_202 | CC-BY-SA 4.0 | Genome Taxonomy Database release 202 used by GTDB-Tk module |
| gtdbtk_207_v2 | /opt/nesi/db/gtdbtk_207_v2 | CC-BY-SA 4.0 | Genome Taxonomy Database release 207 v2 used by GTDB-Tk module |
| gtdbtk_214 | /opt/nesi/db/gtdbtk_214 | CC-BY-SA 4.0 | Genome Taxonomy Database release 214 used by GTDB-Tk module |
| gtdbtk_220 | /opt/nesi/db/gtdbtk_220 | CC-BY-SA 4.0 | Genome Taxonomy Database release 220 used by GTDB-Tk module |
| HUMAnN | /opt/nesi/db/Humann | MIT | Databases for HUMAnN module |
| Kaiju | /opt/nesi/db/Kaiju | GPL-3.0 | Database index for Kaiju module |
| Kraken2 | /opt/nesi/db/Kraken2 | MIT | Databases for Kraken2 module |
| megaX | /opt/nesi/db/megaX | Free for academics | Evolutionary analysis reference data for MegaX module |
| nullarbor | /opt/nesi/db/nullarbor_db | GPL-2.0 | Reference databases used by the Nullarbor module |
| Pfam | /opt/nesi/db/Pfam | CC0 | Protein family HMM database |
| PhyloPhlAn | /opt/nesi/db/PhyloPhlAn | MIT | Databases of universal markers for prokaryotes |
| prokka | /opt/nesi/db/prokka | GPL-3.0 | Databases needed for prokka environment module |
| ProteinDataBank | /opt/nesi/db/ProteinDataBank | CC0 1.0 | 3D structures of proteins and nucleic acids |
| RQCFilterData | /opt/nesi/db/RQCFilterData | Unspecified | Reference data for read quality control |
| sortmerna | /opt/nesi/db/sortmerna_db | GPL-3.0 | rRNA reference databases for SortMeRNA module |
| SqueezeMeta | /opt/nesi/db/SqueezeMeta | GPL-3.0 | Databases needed for SqueezeMeta module |
| StrVCTVRE | /opt/nesi/db/StrVCTVRE | MIT | Training and reference data for StrVCTVRE module |
| Trinotate | /opt/nesi/db/Trinotate | Public | Databases needed for Trinotate module |
| Uniprot | /opt/nesi/db/Uniprot | CC-BY 4.0 | Protein sequence and functional annotation database |
| VEP | /opt/nesi/db/VariantEffectPredictor | No restrictions | Ensembl annotation data for variant effect prediction |
| VIBRANT | /opt/nesi/db/VIBRANT_v1.2.1_databases | GPL-3.0 | Viral genome and HMM databases used by VIBRANT environment module |
| VirSorter | /opt/nesi/db/VirSorter | GPL-2.0 | Viral hallmark gene and profile databases |
| waafle | /opt/nesi/db/waafle | MIT | Reference sets for gene neighborhood analysis for waafle environment module |
| checkv | /opt/nesi/db/checkv-db-v0.6 | BSD 3-Clause-style | Viral genome completeness and contamination database for CheckV environment module |
There are also some versioned databases which are accessible through environment modules, specifically:
- AlphaFold2DB: `module avail AlphaFold2DB`
- AlphaFold3DB: `module avail AlphaFold3DB`
- BLASTDB: `module avail BLASTDB`
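The three lookups above can be run in one pass. This is a sketch only: the function name `check_versioned_dbs` is hypothetical, and it assumes the `module` command is already initialised in your shell, which may not be the case in every non-interactive script.

```shell
# Show the available versions of each versioned database module.
# check_versioned_dbs is a hypothetical helper name used for illustration.
check_versioned_dbs() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    for db_module in AlphaFold2DB AlphaFold3DB BLASTDB; do
        # Module systems commonly print listings on stderr, so merge it into stdout.
        module avail "$db_module" 2>&1
    done
}
```

Once you know which version you need, load it by the full name that the listing reports.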
Requesting new or updated databases
If there is a database you think may be useful to many Mahuika users, or if you would like an updated version of one of the maintained databases, please contact our support team with details about the source and version of the database of interest.
Recommendations for obtaining data from selected external databases
JGI Portals
The Joint Genome Institute has many databases and data portals available. To download or access files from JGI you will need to register for an account. We recommend you use the Globus endpoint provided by JGI to transfer files directly from the JGI servers to Mahuika. For more information about using Globus on Mahuika see the Globus docs section.
NCBI
We have several environment modules to aid in finding and downloading data from NCBI:
- the NCBI Datasets command-line tools: `module avail datasets`
- the EDirect command line interface with the Entrez search system: `module avail entrez-direct`
- the SRA Toolkit: `module avail sratoolkit`
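As an illustration of using one of these tools once it is loaded, the sketch below downloads a single SRA run and converts it to FASTQ. It assumes the SRA Toolkit commands `prefetch` and `fasterq-dump` are already on your `PATH`, for example after loading a version reported by `module avail sratoolkit`. The function name `fetch_sra_run` and the output directory argument are hypothetical.

```shell
# Download one SRA run and convert it to FASTQ files.
# fetch_sra_run is a hypothetical helper; it assumes prefetch and fasterq-dump
# (SRA Toolkit) are already available on PATH.
fetch_sra_run() {
    sra_accession="$1"       # run accession supplied by the caller
    sra_outdir="${2:-.}"     # destination directory, defaults to the current one
    if [ -z "$sra_accession" ]; then
        echo "usage: fetch_sra_run ACCESSION [OUTDIR]" >&2
        return 2
    fi
    mkdir -p "$sra_outdir" || return 1
    # Work inside a subshell so the caller's working directory is unchanged.
    # prefetch downloads the run archive; fasterq-dump converts it to FASTQ.
    ( cd "$sra_outdir" && prefetch "$sra_accession" && fasterq-dump "$sra_accession" )
}
```

Large downloads are usually better directed to project or scratch storage than to your home directory, so pass a suitable path as the second argument.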