Other Databases

Drug Databases

DrugBank The DrugBank database is a blended bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains nearly 4800 drug entries including >1,350 FDA-approved small molecule drugs, 123 FDA-approved biotech (protein/peptide) drugs, 71 nutraceuticals and >3,243 experimental drugs. DrugBank also contains extensive SNP-drug data that is useful for pharmacogenomics studies.
Therapeutic Target DB The Therapeutic Target Database (TTD) is a drug database designed to provide information about the known therapeutic protein and nucleic acid targets described in the literature, the targeted disease conditions, the pathway information and the corresponding drugs/ligands directed at each of these targets. The database currently contains 1535 targets and 2107 drugs/ligands.
PharmGKB The PharmGKB database is a central repository for genetic, genomic, molecular and cellular phenotype data and clinical information about people who have participated in pharmacogenomics research studies. The data includes, but is not limited to, clinical and basic pharmacokinetic and pharmacogenomic research in the cardiovascular, pulmonary, cancer, pathways, metabolic and transporter domains. Its aim is to aid researchers in understanding how genetic variation among individuals contributes to differences in reactions to drugs. PharmGKB contains searchable data on genes (>20,000), diseases (>3000), drugs (>2500) and pathways (53). It also has detailed information on 470 genetic variants (SNP data) affecting drug metabolism.
STITCH STITCH (‘search tool for interactions of chemicals’) is a searchable database that integrates information about interactions from metabolic pathways, crystal structures, binding experiments and drug–target relationships. Text mining and chemical structure similarity are used to predict relations between chemicals. Each proposed interaction can be traced back to the original data sources. The database contains interaction information for over 68 000 different chemicals, including 2200 drugs, and connects them to 1.5 million genes across 373 genomes.
SuperTarget SuperTarget is a database that contains a core dataset of about 7300 drug-target relations of which 4900 interactions have been subjected to a more extensive manual annotation effort. SuperTarget provides tools for 2D drug screening and sequence comparison of the targets. The database contains more than 2500 target proteins, which are annotated with about 7300 relations to 1500 drugs; the vast majority of entries have pointers to the respective literature source. A subset of 775 more extensively annotated drugs is provided separately through the Matador database (Manually Annotated Targets And Drugs Online Resource).

Metabolic Pathway Databases

SMPDB SMPDB (Small Molecule Pathway Database) is a comprehensive database of metabolic, drug, and disease pathways.
KEGG KEGG (Kyoto Encyclopedia of Genes and Genomes) is one of the most complete and widely used databases containing metabolic pathways (372 reference pathways) from a wide variety of organisms (>700). These pathways are hyperlinked to metabolite and protein/enzyme information. Currently KEGG has >15,000 compounds (from animals, plants and bacteria), 7742 drugs (including different salt forms and drug carriers) and nearly 11,000 glycan structures.
MetaCyc MetaCyc is a database of nonredundant, experimentally elucidated metabolic pathways. MetaCyc contains more than 1,100 pathways from more than 1,500 different organisms. MetaCyc is curated from the scientific experimental literature and contains pathways involved in both primary and secondary metabolism, as well as associated compounds, enzymes, and genes.
HumanCyc HumanCyc is a bioinformatics database that describes the human metabolic pathways and the human genome. The current version of HumanCyc was constructed using Build 31 of the human genome. The resulting pathway/genome database (PGDB) includes information on 28,783 genes, their products and the metabolic reactions and pathways they catalyze.
BioCyc BioCyc is a collection of 371 Pathway/Genome Databases. Each database in the BioCyc collection describes the genome and metabolic pathways of a single organism. The databases within the BioCyc collection are organized into tiers according to the amount of manual review and updating they have received. Tier 1 DBs have been created through intensive manual efforts and include EcoCyc, MetaCyc and the BioCyc Open Compounds Database (BOCD). BOCD includes metabolites, enzyme activators, inhibitors, and cofactors derived from hundreds of organisms. Tier 2 and Tier 3 databases contain computationally predicted metabolic pathways, as well as predictions as to which genes code for missing enzymes in metabolic pathways, and predicted operons.
Reactome Reactome is a curated, peer-reviewed knowledgebase of biological pathways, including metabolic pathways as well as protein trafficking and signaling pathways. Reactome includes several types of reactions in its pathway diagram collection including experimentally confirmed, manually inferred and electronically inferred reactions. Reactome has pathway data on more than 20 different organisms but the primary organism of interest is Homo sapiens. Reactome has data and pathway diagrams for >2700 proteins, 2800 reactions and 860 pathways for humans.

Compound or Compound-Specific Databases

PubChem PubChem is a freely available database of chemical structures of small organic molecules and information on their biological activities. It contains structure, nomenclature and calculated physico-chemical data and is linked with NIH PubMed/Entrez information. PubChem is organized as three linked databases within the NCBI’s Entrez information retrieval system. These are PubChem Substance, PubChem Compound, and PubChem BioAssay. PubChem also provides a fast chemical structure similarity search tool. PubChem has >19 million unique chemical structures.
ChEBI Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds. The chemical entities in ChEBI are either products of nature (metabolites) or synthetic products used to intervene in the processes of living organisms (drugs or toxins). ChEBI contains structure and nomenclature information along with hyperlinks to many well-regarded databases. ChEBI uses a carefully developed ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are precisely specified. ChEBI has >15,500 chemical entities in its database.
ChemSpider ChemSpider is an aggregated database of organic molecules containing more than 20 million compounds from many different providers. At present the database contains information from such diverse sources as a marine natural products database, ACD-Labs chemical databases, the EPA’s DSSTox databases and from a series of chemical vendors. It has extensive search utilities and most compounds have a large number of calculated physico-chemical property values.
KEGG Glycan The KEGG GLYCAN database is a collection of experimentally determined glycan structures. It contains all unique structures taken from CarbBank, structures entered from recent publications, and structures present in KEGG pathways. KEGG Glycan has >11,000 glycan structures from a large number of eukaryotic and prokaryotic sources.
Toxin, Toxin Target Database (T3DB) T3DB is a database of common toxins and their associated toxin targets. It contains detailed target and compound information similar to DrugBank, including structure, properties, mechanism of action, gene/protein sequence, and associated SNPs.

Spectral Databases

HMDB The Human Metabolome Database (HMDB) is a freely available electronic database containing detailed information about small molecule metabolites found in the human body. It contains experimental MS/MS data for 800 compounds, experimental 1H and 13C NMR data (and assignments) for 790 compounds and GC/MS spectral and retention index data for 260 compounds. Additionally, predicted 1H and 13C NMR spectra have been generated for 3100 compounds. All spectral databases are downloadable and searchable.
BMRB The BioMagResBank (BMRB) is the central repository for experimental NMR spectral data, primarily for macromolecules. The BMRB also contains a recently established subsection for metabolite data. The current metabolomics database contains structures, structure viewing applets, nomenclature data, extensive 1D and 2D spectral peak lists (from 1D, TOCSY, DEPT, HSQC experiments), raw spectra and FIDs for nearly 500 molecules. The data is both searchable and downloadable.
MMCD The Madison Metabolomics Consortium Database (MMCD) is a database on small molecules of biological interest gathered from electronic databases and the scientific literature. It contains approximately 10,000 metabolite entries and experimental spectral data on about 500 compounds. Each metabolite entry in the MMCD is supported by information in an average of 50 separate data fields, which provide the chemical formula, names and synonyms, structure, physical and chemical properties, NMR and MS data on pure compounds under defined conditions where available, NMR chemical shifts determined by empirical and/or theoretical approaches, information on the presence of the metabolite in different biological species, and extensive links to images, references, and other public databases.
MassBank MassBank is a mass spectral database of experimentally acquired high resolution MS spectra of metabolites. Maintained and supported by the JST-BIRD project, it offers various query methods for standard spectra obtained from Keio University, RIKEN PSC, and other Japanese research institutions. It is officially sanctioned by the Mass Spectrometry Society of Japan. The database has very detailed MS data and excellent spectral/structure searching utilities. More than 13,000 spectra from 1900 different compounds are available.
Golm Metabolome Database The Golm Metabolome Database provides public access to custom GC/MS libraries which are stored as Mass Spectral (MS) and Retention Time Index (RI) Libraries (MSRI). These libraries of mass spectral and retention time indices can be used with the NIST/AMDIS software to identify metabolites according to their spectral tags and RIs. The libraries are both searchable and downloadable and have been carefully collected under defined conditions on several types of GC/MS instruments (quadrupole and TOF).
Metlin The METLIN Metabolite Database is a repository for mass spectral metabolite data. All metabolites are neutral or free acids. It is a collaborative effort between the Siuzdak and Abagyan groups and the Center for Mass Spectrometry at The Scripps Research Institute. METLIN is searchable by compound name, mass, formula or structure. It contains 15,000 structures, including more than 8000 di- and tripeptides. METLIN contains MS/MS, LC/MS and FTMS data that can be searched by peak lists, mass range, biological source or disease.
Fiehn GC-MS Database This library contains data on 713 compounds (name, structure, CAS ID, other links) for which GC/MS data (spectra and retention indices) have been collected by the Fiehn laboratory. A locally maintained program called BinBase/Bellerophon filters input GC/MS spectra and uses the spectral library to identify compounds. The actual GC/MS library is available from several different GC/MS vendors.

Disease & Physiology Databases

OMIM Online Mendelian Inheritance in Man (OMIM) is a comprehensive compendium of human genes and genetic phenotypes. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype. It is updated daily, and the entries contain many links to other genetics resources. OMIM contains 379 diseases with associated gene sequence data as well as 2385 conditions with a disease phenotype and a known genetic cause.
METAGENE METAGENE is a knowledgebase for inborn errors of metabolism providing information about the disease, genetic cause, treatment and the characteristic metabolite concentrations or clinical tests that may be used to diagnose or monitor the condition. It has data on 431 genetic diseases.
OMMBID OMMBID, or the On-Line Metabolic and Molecular Basis to Inherited Disease, is a web-accessible book/encyclopedia describing the genetics, metabolism, diagnosis and treatment of hundreds of metabolic disorders, contributed by hundreds of experts. It also contains extensive reviews, detailed pathways, chemical structures, physiological data and tables that are particularly useful for clinical biochemists. Most university libraries have subscriptions to this resource. OMMBID was originally developed by Charles Scriver at McGill.

Comprehensive Metabolomic Databases

HMDB The Human Metabolome Database (HMDB) is a freely available electronic database containing detailed information about small molecule metabolites found (and experimentally verified) in the human body. The database contains three kinds of data: 1) chemical data, 2) clinical data, and 3) molecular biology/biochemistry data. HMDB contains information on more than 6500 metabolites. Additionally, approximately 1500 protein (and DNA) sequences are linked to these metabolite entries. Each MetaboCard entry contains more than 100 data fields with 2/3 of the information being devoted to chemical/clinical data and the other 1/3 devoted to enzymatic or biochemical data. Many data fields are hyperlinked to other databases (KEGG, PubChem, MetaCyc, ChEBI, PDB, Swiss-Prot, and GenBank) and a variety of structure and pathway viewing applets.
BiGG The BiGG database is a metabolic reconstruction of human metabolism designed for systems biology simulation and metabolic flux balance modeling. It is a comprehensive literature-based genome-scale metabolic reconstruction that accounts for the functions of 1,496 ORFs, 2,004 proteins, 2,766 metabolites, and 3,311 metabolic and transport reactions. It was assembled from build 35 of the human genome.
SYSTOMONAS SYSTOMONAS (SYSTems biology of pseudOMONAS) is a database for systems biology studies of Pseudomonas species. It contains extensive transcriptomic, proteomic and metabolomic data as well as metabolic reconstructions of this pathogen. Reconstruction of metabolic networks in SYSTOMONAS was achieved via comparative genomics. Broad data integration with well established databases BRENDA, KEGG and PRODORIC is also maintained. Several tools for the analysis of stored data and for the visualization of the corresponding results are provided, enabling a quick understanding of metabolic pathways, genomic arrangements or promoter structures of interest.
HMDB Serum Metabolome The Human Metabolome Database (HMDB) is a freely available database containing detailed information about small molecule metabolites found in the human body. It is intended to be used for applications in metabolomics, clinical chemistry, biomarker discovery and general education. The database is designed to contain or link three kinds of data: 1) chemical data, 2) clinical data, and 3) molecular biology/biochemistry data. HMDB contains over 7900 metabolite entries including both water-soluble and lipid soluble metabolites as well as metabolites that would be regarded as either abundant (> 1 uM) or relatively rare (< 1 nM). Additionally, approximately 7200 protein (and DNA) sequences are linked to these metabolite entries.
PPT-DB The Protein Property Prediction and Testing Database (PPT-DB) is a collection of protein property databases for over 20 different protein properties including secondary structure, trans-membrane helices and beta barrels, accessible surface area, signal peptides, and more.
HMDB CSF Metabolome The CSF Metabolome database is a freely available electronic database containing detailed information about 468 small molecule metabolites found in human CSF along with 1650 concentration values. The data tables may be sorted and searched by concentration values and ranges. The information includes literature and experimentally derived chemical data, clinical data and molecular/biochemistry data.
CyberCell Database (CCDB) The CyberCell Database (CCDB) is a comprehensive, web-accessible database designed to support and coordinate international efforts in modeling an Escherichia coli cell on a computer. The CCDB brings together both observed and derived quantitative data from numerous independent sources covering many aspects of the genomic, proteomic and metabolomic character of E. coli (strain K12).
Yeast Metabolome Database The Yeast Metabolome Database (YMDB) is a manually curated database of small molecule metabolites found in or produced by Saccharomyces cerevisiae (also known as Baker’s yeast and Brewer’s yeast). This database covers metabolites described in textbooks, scientific journals, metabolic reconstructions and other electronic databases. YMDB contains metabolites arising from normal S. cerevisiae metabolism under defined laboratory conditions as well as metabolites generated by S. cerevisiae when used in baking and in the production of wines, beers and spirits. YMDB currently contains 2010 small molecules with 857 associated enzymes and 138 associated transporters.
Bovine Metabolome Database (BMDB) The Bovine Metabolome Database (BMDB) is a freely available electronic database containing detailed information about small molecule metabolites found in beef and dairy cattle. The information includes literature and experimentally derived information on bovine meat, bovine serum, bovine milk, bovine urine and bovine ruminal fluid.
E. coli Metabolome Database (ECMDB) The E. coli Metabolome Database (ECMDB) is a freely available electronic database containing detailed information about the >1620 metabolites found in E. coli (strain K12, MG1655). The information includes literature and experimentally derived chemical data, spectral data and molecular/biochemistry data.
MarkerDB MarkerDB will be a freely available resource that attempts to consolidate information on all known clinical biomarkers into a single source. Multiple types of markers are covered including metabolite based, genetic based, protein based and cell based markers.

Web Servers

PepMake generates a PDB coordinate file for polypeptide backbones using only the sequence and backbone dihedral angles as input.

VADAR (Volume, Area, Dihedral Angle Reporter) is a compilation of more than 15 different algorithms and programs for analyzing and assessing peptide and protein structures from their PDB coordinate data.

MetaboAnalyst MetaboAnalyst is a comprehensive, Web-based tool designed for processing, analyzing, and interpreting metabolomic data. It handles most of the common metabolomic data types including compound concentration lists, spectral bin lists, peak lists, and raw MS spectra.

MetATT is an easy-to-use, web-based tool designed for time-series and two-factor metabolomics data analysis. MetATT offers a number of complementary approaches including 3D interactive principal component analysis, two-way heatmap visualization, two-way ANOVA, ANOVA-simultaneous component analysis and multivariate empirical Bayes time-series analysis.

MetPA (Metabolomics Pathway Analysis) is a free and easy-to-use web application designed to perform pathway analysis and visualization of quantitative metabolomic data.

MSEA is a web-based tool to help identify and interpret patterns of metabolite concentration changes in a biologically meaningful context for human and mammalian metabolomic studies.

MetaboMiner is a tool which can be used to automatically or semi-automatically identify metabolites in complex biofluids from 2D NMR spectra. MetaboMiner is able to handle both 1H-1H total correlation spectroscopy (TOCSY) and 1H-13C heteronuclear single quantum correlation (HSQC) data. It identifies compounds by comparing 2D spectral patterns in the NMR spectrum of the biofluid mixture with specially constructed libraries containing reference spectra of approximately 500 pure compounds.

PolySearch supports >50 different classes of queries against nearly a dozen different types of text, scientific abstract or bioinformatic databases. The typical query supported by PolySearch is ‘Given X, find all Y’s’ where X or Y can be diseases, tissues, cell compartments, gene/protein names, SNPs, mutations, drugs and metabolites.

Receiver Operating Characteristic (ROC) curves are generally considered the method of choice for evaluating the performance of potential biomarkers. ROCCET is a freely available web-based tool designed to assist clinicians and bench biologists in performing common ROC based analyses on their metabolomic data using both classical univariate and more recently developed multivariate approaches.
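
As a rough illustration of what a univariate ROC analysis measures, the area under the ROC curve (AUC) equals the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half. The sketch below is a minimal pure-Python illustration of that idea; it is not ROCCET's implementation, and the function name `roc_auc` and the example values are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    scores: biomarker values (higher is assumed to indicate disease).
    labels: 1 for a case, 0 for a control.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0   # case correctly ranked above control
            elif c == k:
                wins += 0.5   # ties count as half
    return wins / (len(cases) * len(controls))


# Hypothetical metabolite concentrations for three cases and two controls.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
print(round(roc_auc(scores, labels), 3))  # prints 0.833
```

An AUC of 0.5 corresponds to a marker that ranks cases and controls no better than chance, while 1.0 corresponds to perfect separation.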

User-friendly, web-based analytical pipeline for comparative metagenomic studies. Input can be derived from either 16S rRNA data or NextGen shotgun sequencing.

Proteus is an integrated web server and stand-alone application that combines three high-performing de novo structure prediction methods (PSIPRED, JNET and TRANSSEC [a locally developed predictor]), a jury-of-experts consensus tool and a robust PDB-based structure alignment process to generate all of its secondary structure predictions. For water-soluble proteins, Proteus is able to achieve a very high level of accuracy (Q3=88%, SOV=90%). In the rare situation (20-30%) where a query protein shows no similarity whatsoever to any known structure, PROTEUS is still able to achieve a Q3 score of 79%. Proteus is not restricted to generating accurate secondary structures for water-soluble proteins, as it appears to perform well for integral membrane proteins (both helix-containing proteins and beta-sheet containing porins) that have remote homologues or a portion of a homologue in the PDB.
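
For readers unfamiliar with the Q3 score quoted above: Q3 is the percentage of residues whose predicted three-state secondary structure (helix H, strand E, coil C) matches the observed state. The sketch below is a minimal Python illustration of that definition, not Proteus's own code; the function name `q3_score` and the example strings are made up for illustration.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state label (H, E or C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example: 8 of the 10 positions agree.
predicted = "HHHHCCEEEC"
observed = "HHHCCCEEEE"
print(q3_score(predicted, observed))  # prints 80.0
```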

PROTEUS2 is a web server designed to support comprehensive protein structure prediction and structure-based annotation. PROTEUS2 accepts either single sequences (for directed studies) or multiple sequences (for whole proteome annotation) and predicts the secondary and, if possible, tertiary structure of the query protein(s). Unlike most other tools or servers, PROTEUS2 bundles signal peptide identification, transmembrane helix prediction, transmembrane β-strand prediction, secondary structure prediction (for soluble proteins) and homology modeling (i.e. 3D structure generation) into a single prediction pipeline.

BASys (Bacterial Annotation System) is a web server that supports automated, in-depth annotation of bacterial genomic (chromosomal and plasmid) sequences.

BacMap is an interactive visual database containing all publicly available bacterial genomes. A fully labeled and zoomable genome map is provided for each genome. Sequence and text queries can be used to identify genes of interest, or maps can be navigated using a simple interface. BacMap is designed to serve as an intuitive and convenient tool for identifying orthologues and paralogues, studying operon conservation, and determining gene function.

ResProx (Resolution-by-proxy or Res(p)) is a web server that predicts the atomic resolution of NMR protein structures using only PDB coordinate data as input. More specifically, ResProx uses machine learning techniques to accurately estimate (with a correlation coefficient of 0.92 between observed and calculated) the atomic resolution of a protein structure from 25 measurable features that can be derived from its atomic coordinates. Because atomic resolution is a simple and near-universal measure of structure quality (i.e. < 2.0 Å is good, > 4.0 Å is bad), ResProx offers X-ray crystallographers and NMR spectroscopists the opportunity to easily assess the accuracy and quality of their 3D protein structures. It also allows them to assess whether their refinement methods have made their structures better (or worse) than what the experimental data suggests. Furthermore, since coordinate data is common to both X-ray and NMR, ResProx should allow structural biologists to use a single, easily understood number to compare the structures determined by NMR with those determined by X-ray crystallography.

CS23D 2.0 is a web server for rapidly generating accurate 3D protein structures using only assigned NMR chemical shifts as input. Unlike conventional NMR methods, which require NOE and/or J-coupling data, CS23D2.0 uses only chemical shift information to generate a 3D structure of the protein of interest. CS23D2.0 accepts chemical shift files in either SHIFTY or BMRB formats and produces a set of PDB coordinates for the protein in about 10-15 minutes. CS23D2.0 uses a combination of maximal subfragment assembly, chemical shift threading, shift-based torsion angle prediction and chemical shift refinement to generate and refine the protein coordinates. Tests indicate that CS23D2.0 converges (i.e. finds a solution) for about 90% of protein queries. The performance is dependent on the completeness of the chemical shift assignments and the similarity of the query protein to known 3D folds.

SHIFTX2 predicts both the backbone and side chain 1H, 13C and 15N chemical shifts for proteins using their structural (PDB) coordinates as input. SHIFTX2 combines ensemble machine learning methods with sequence alignment-based methods to calculate protein chemical shifts for backbone and side chain atoms.

The Re-referenced Protein Chemical shift Database (RefDB) is a database of carefully corrected or re-referenced chemical shifts, derived from the BioMagResBank (BMRB). The process involves predicting protein 1H, 13C and 15N chemical shifts using X-ray or NMR coordinate data via SHIFTX and then comparing those predictions to the observed shifts reported in the BMRB (via SHIFTCOR). RefDB provides a standard chemical shift resource for NMR spectroscopists wishing to derive or compute chemical shift trends in peptides and proteins.

PHAST (PHAge Search Tool) is a web server designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage “cornerstone” feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and GenBank files, provide richly annotated tables on prophage features and prophage “quality” and distinguish between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views. Furthermore, tests indicate that PHAST is as accurate or slightly more accurate than all available phage finding tools, with sensitivity of 85.4% and positive predictive value of 94.2%.
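
The two accuracy measures quoted above follow the standard definitions: sensitivity is the fraction of true prophage regions that are found, and positive predictive value (PPV) is the fraction of reported regions that are real. The sketch below simply spells out those formulas in Python; the function names and the counts are hypothetical and are not taken from the PHAST benchmark.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real positives that were detected: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Fraction of reported positives that are real: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts: 85 prophage regions found, 15 missed, 5 false calls.
tp, fn, fp = 85, 15, 5
print(f"sensitivity = {sensitivity(tp, fn):.3f}")            # sensitivity = 0.850
print(f"PPV         = {positive_predictive_value(tp, fp):.3f}")  # PPV         = 0.944
```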
