MicrobesOnline: Tools and Downloads for Programmers

This page gives technical advice on how to programmatically obtain data from the MicrobesOnline data. Please see the tutorial for background on MicrobesOnline and on the data it contains.

How to connect to pub.microbesonline.org

We have set up a public read-only SQL server at pub.microbesonline.org. (This SQL server may move to another host in the future.) This server will include only data from public genomes. Data about genes and genomes will be in the "genomics" database. Functional genomics data, such as microarray data, will ultimately be in the "microarray" database. (The microarray database will not be available initially.) For both databases, use the username "guest" and the password "guest".

To connect, use the mysql command-line tool, e.g.

mysql -h pub.microbesonline.org -u guest -pguest genomics
or
mysql -h pub.microbesonline.org -u guest -pguest genomics -B -e 'my query' > output_file

Please limit your active queries to two, and active connections to three. If a query of yours is killed, it may be because it was running for too long, using too much memory, or using too many connections. If this happens to you, please email us at help@microbesonline.org and we will help you debug any problematic queries.

How to download auxiliary data: BLAST and FastBLAST databases

Nucleotide and protein BLAST databases and a FastBLAST database of the predicted proteins will be available for download at a later date.

Example SQL queries

Get all VIMSS locus Ids and genbank ids for protein-coding genes in E. coli K12 (NCBI taxonomyId 511145):

SELECT Locus.locusId, Synonym.name AS genbankId
FROM Scaffold JOIN Locus USING (scaffoldId) 
JOIN Synonym USING (locusId,version)
WHERE Locus.priority=1
AND Locus.type=1
AND Scaffold.isActive=1
AND Scaffold.taxonomyId=511145
AND Synonym.type=2;

Get all taxonomy Ids for sequenced delta-Proteobacteria (NCBI taxonomyId 28221), and their names:

SELECT DISTINCT Taxonomy.taxonomyId, Taxonomy.shortName
FROM Scaffold JOIN Taxonomy USING (taxonomyId)
JOIN TaxParentChild ON TaxParentChild.childId=Taxonomy.taxonomyId
WHERE Scaffold.isActive=1 AND Scaffold.isGenomic=1 AND TaxParentChild.parentId=28221;

Schema documentation

Groups of tables:

Alphabetic list of tables:

AASeq
ACL
Annotation
AnnotationDetail
COG
COGCount
COGFun
COGInfo
COGrpsblast
Carts
CrisprFamily
Description
DomainInfo
ECInfo
FastBLASTDomains
FasthmmFamily2HMM
FasthmmRawHits
GoCount
GroupUsers
Groups
HitInfo
IPR2Go
IPRInfo
InterPro
Jobs
KEGG2Taxonomy
KEGGCompound
KEGGConf
KEGGInfo
KEGGMap
KEGGblast
Locus
Locus2Domain
Locus2Ec
Locus2Go
Locus2Ipr
Locus2Operon
Locus2Pdb
Locus2RegPrecise
Locus2RTB
Locus2RTBArticles
Locus2SwissProt
Locus2TigrFunCat
Locus2Tree
LocusCount
LocusSeq
LocusType
MOG
MOGComponent
MOGMember
MOGNeighborScores
ObjectDescr
OGMember
Operon
Ortholog
OrthologGroup
PdbEntries
PdbReps
PdbSeq
PfamClan
Position
RegulonCluster
RegulonLinks
Scaffold
ScaffoldIsActiveFlag
ScaffoldPosChunks
ScaffoldSeq
SwissProt
SwissProt2Locus
SwissProt2Pubmed
SwissProtBlast
Synonym
SynonymType
TIGRInfo
TIGRroles
TaxName
TaxNode
TaxParentChild
Taxonomy
Taxonomy2Tree
Tree
UserSessions
UserSessionsURLTracking
Users
graph_path
term
term2term
term_definition
term_synonym

Deprecated tables: AutoAnno, BLASTp, BLASTp_dddd where dddd is a taxonomyId, TMHMM, FamilyAlignments, FastBLASTDomains

MicrobesOnline also includes microarray data, in the "microarray" database, but this is not in the public SQL database yet.

AASeq

locusIdint(10) unsigned
versionint(2) unsigned
sequencelongblob
The amino acid sequence of a protein-coding gene.

ACL

requesterIdint(10) unsigned
requesterTypeenum('user','group')
resourceIdint(10) unsigned
resourceTypeenum('cart','job','uarray','scaffold')
readtinyint(1)
writetinyint(1)
admintinyint(1)
Permissions to access scaffolds, carts, jobs, and microarray experiments, either by user or by group. All of the data in pub.microbesonline.org is accessible by group 1 (public / all users).

Annotation

annotationIdint(10) unsigned
sourcevarchar(64)
dateint(10) unsigned
This table stores the history of all the annotations, including both MicrobesOnline automated annotations, annotations from RefSeq, and manual annotations created by MicrobesOnline users, sorted by date. (This table may not exist in pub.microbesonline.org.) The source field is a userId. The actual content of the annotation is in the AnnotationDetail table.

AnnotationDetail

annotationIdint(10) unsigned
locusIdint(10) unsigned
typeenum('name','synonym','description','ecNum','go','comment')
actionenum('append','replace','delete')
annotationtext
AnnotationDetail contains the actual data for each annotation. There may be multiple changes with the same annotationId.

COG

cogIdint(10) unsigned
locusIdint(10) unsigned
versionint(2) unsigned
cogInfoIdint(10) unsigned
The COG assignment for the gene is in cogInfoId. (Please ignore cogId.) Genes are only assigned to the best-matching COG, and the gene must cover at least 60% of the COG.

COGCount

count(l.locusId)bigint(21)
funCodevarchar(5)
taxonomyIdint(10)
scaffoldIdint(10) unsigned
Number of genes in each COG functional category on each scaffold.

COGFun

funCodechar(1)
descriptionvarchar(70)
funGroupvarchar(50)
Descriptions of the COG function codes.

COGInfo

cogInfoIdint(10) unsigned
funCodevarchar(5)
descriptionvarchar(255)
geneNamevarchar(20)
cddIdvarchar(255)
lengthint(10) unsigned
Metadata on each COG id (cogInfoId). cddId is the conserved domain database identifier for this COG. funCode can be multiple characters, each of which has its own row in COGFun.

COGrpsblast

locusIdint(10) unsigned
versionint(2) unsigned
subjectvarchar(20)
identityfloat unsigned
alignLengthint(10) unsigned
mismatchint(10) unsigned
gapint(10) unsigned
qBeginint(10) unsigned
qEndint(10) unsigned
sBeginint(10) unsigned
sEndint(10) unsigned
evaluedouble
scorefloat
All hits, from rpsblast, to each COG. The subject is the COG number (as in COGInfo.cogInfoId). A gene may have hits to multiple COGs, even though it is only assigned to one of them (or to none of them if the hits are too short).

Carts

cartIdint(10) unsigned
userIdint(10) unsigned
namevarchar(32)
seqDatalongtext
seqCountint(10) unsigned
timeint(10) unsigned
activeint(1) unsigned
User-created lists of genes, or "carts." This table is empty in pub.microbesonline.org.

CrisprFamily

locusIdint(10) unsigned
familyIdint(10) unsigned
typesmallint(5) unsigned
Clusters of CRISPR repeats and spacers based on BLAST hit connectivity.

Description

descriptionIdint(10) unsigned
descriptiontext
sourcevarchar(255)
createddate
locusIdint(10) unsigned
versionint(2) unsigned
The gene descriptions, e.g., "DNA-binding transcriptional activator, 3HPP-binding". These are often incorrect or misleading.

DomainInfo

domainDbvarchar(20)
domainIdvarchar(20)
domainNamevarchar(50)
iprIdvarchar(10)
iprNamevarchar(100)
domainLenint(5) unsigned
fileNamevarchar(20)
Meta-data on interpro domains and sites. iprId is defined in the IPRInfo table.

ECInfo

ecNumvarchar(20)
namevarchar(255)
Metadata on enzyme classification numbers.

FastBLASTDomains

domainIdvarchar(20)
locusIdint(10) unsigned
versionint(10) unsigned
seqBeginint(5) unsigned
seqEndint(5) unsigned
Associations of genes with ad-hoc domains from FastBLAST. This table has been deprecated.

FasthmmFamily2HMM

domainDbvarchar(20)
domainIdvarchar(20)
hmmNamevarchar(10)
domainLenint(5) unsigned
Data on each HMM for the InterPro families, with domainDb and domainId linked to hmmName (the same as hmmId in FasthmmRawHits). Many InterPro families are described by more than one HMM.

FasthmmRawHits

hmmIdvarchar(20)
locusIdint(10) unsigned
versionint(2) unsigned
seqBeginint(5) unsigned
seqEndint(5) unsigned
domainBeginint(5) unsigned
domainEndint(5) unsigned
scorefloat
evaluefloat
The raw FastHMM hits for each InterPro domain (Gene3D, PANTHER, Pfam, PIRSF, SMART, Superfam, TIGRFAM). The hmmIds are described in FasthmmFamily2HMM.hmmName.

GoCount

goIdint(10) unsigned
goCountint(10) unsigned
taxIdint(10)
Numbers of genes (goCount) in each gene ontology category (goId) for each genome (taxId). goId maps to id in the term table and taxId maps to taxonomyId in the Taxonomy table.

GroupUsers

groupIdint(10) unsigned
userIdint(10) unsigned
activetinyint(1) unsigned
timeint(10) unsigned
Which users are members of which groups.

Groups

groupIdint(10) unsigned
namevarchar(32)
descriptionvarchar(255)
adminUserIdint(10) unsigned
Groups of users are defined for access control. Group 1 is special: all users belong to it.

HitInfo

hitIdvarchar(100)
synIdvarchar(100)
sLengthint(10) unsigned
descriptionlongtext
Deprecated. This table is empty on pub.microbesonline.org.

IPR2Go

iprIdvarchar(10)
goIdvarchar(10)
Which interpro id (iprId, defined in the IPRInfo table) maps to which goId (defined by id in the term table).

IPRInfo

iprIdvarchar(9)
typevarchar(16)
shortNamevarchar(50)
proteinCountint(5)
iprNamevarchar(255)
Each interpro id defines a conserved gene function (e.g., "IPR005471" corresponds to "Transcriptional regulator IclR, N-terminal"). proteinCount is from the InterPro database and does not reflect the number of genes in MicrobesOnline that is associated with that interpro id.

InterPro

locusIdint(10) unsigned
versionint(2) unsigned
checksumvarchar(16)
lengthint(5) unsigned
domainDbvarchar(30)
domainIdvarchar(30)
domainDescvarchar(255)
domainStartint(5)
domainEndint(5)
evaluefloat
statusvarchar(10)
datevarchar(50)
iprIdvarchar(9)
iprNamevarchar(255)
geneOntologylongtext
Assignments of genes to InterPro sites, such as from PRINTS or PROSITE. Assignments to InterPro families (e.g. PFams, TIGRFAMs, etc.) are in the Locus2Domain table, not in this table.

Jobs

jobIdint(10) unsigned
parentJobIdint(10) unsigned
userIdint(10) unsigned
cartIdint(10) unsigned
jobNamevarchar(32)
jobTypevarchar(32)
jobDatatext
jobCmdtext
statusint(2) unsigned
timeint(10) unsigned
doneTimeint(10) unsigned
savedint(1) unsigned
Computations requested by users. This table is empty in pub.microbesonline.org.

KEGG2Taxonomy

keggOrgIdvarchar(5)
taxonomyIdint(10)
The mapping between NCBI taxonomy ids for genomes and KEGG organism ids.

KEGGCompound

compoundvarchar(7)
nametinytext
formulatinytext
massdouble
Small molecules (metabolic compounds) from KEGG.

KEGGConf

mapIdvarchar(20)
objectvarchar(20)
typeint(1)
urlvarchar(255)
coordvarchar(100)
Where the objects are on the KEGG maps, to support clicking through to compounds or enyzymes.

KEGGInfo

hitIdvarchar(100)
sLengthint(10) unsigned
descriptionlongtext
Descriptions of genes from the KEGG database. hitId is a KEGG identifier.

KEGGMap

mapIdvarchar(20)
titlevarchar(255)
Titles of the KEGG maps -- the actual maps are in image files.

KEGGblast

locusIdint(10) unsigned
versionint(2) unsigned
subjectvarchar(30)
identityfloat unsigned
alignLengthint(10) unsigned
mismatchint(10) unsigned
gapint(10) unsigned
qBeginint(10) unsigned
qEndint(10) unsigned
sBeginint(10) unsigned
sEndint(10) unsigned
evaluedouble
scorefloat
Hits of MicrobesOnline genes to KEGG genes. Subjects are KEGG ids, e.g. "aae:aq_001". Despite the name, these hits are from BLAT, not from BLAST. Due to size limitations, this table currently does not exist.

Locus

locusIdint(10) unsigned
versionint(2) unsigned
prioritytinyint(3) unsigned
createddate
posIdint(10) unsigned
evidencevarchar(50)
scaffoldIdint(10) unsigned
typesmallint(5) unsigned

The central object in the MicrobesOnline database is a gene or "Locus". Locus includes protein-coding genes, non-coding RNA genes and pseudogenes. The Locus table includes a locusId, which is not unique, a version number, and a priority flag. The combination of a locusId and a version number is unique, and priority=1 for the current version. To avoid using stale and/oor redundant data, please make sure that you require "Locus.priority=1" whenever you query this table. Locus ids are also referred to as VIMSS ids and MicrobesOnline gene ids. (They are also referred to as orfIds in some of our older code.) Locus.type gives the type of the locus, e.g. type=1 for protein-coding gene. See the LocusType table for more information.

Locus also links to Scaffold.scaffoldId and to Position.posId. The Scaffold table includes a taxonomyId, so you can use that to determine what genome the Locus is from. The Position table stores the location of the gene. For protein-coding genes, the Position stores the range from the predicted start codon to the stop codon.

A large number of tables give additional information about a locus. Most of these tables include both locusId and version. Some of the tables includes only locusId but no version -- those entries implicitly refer to the active version of the locus (the one with priority=1).

Locus2Domain

domainIdvarchar(20)
locusIdint(10) unsigned
versionint(2) unsigned
seqBeginint(5) unsigned
seqEndint(5) unsigned
domainBeginint(5) unsigned
domainEndint(5) unsigned
scorefloat
evaluefloat
Assignments of genes to InterPro families. Within each database (e.g., within Pfam), these assignments are based on the best-hitting family for each region. There may be many more hits, or multiple hits to different HMMs for the same domainId, in FasthmmRawHits.

Locus2Ec

locusIdint(10) unsigned
versionint(2) unsigned
scaffoldIdint(10) unsigned
ecNumvarchar(20)
evidencevarchar(50)
Assignments of genes to enzyme classification numbers such as "1.4.1.15". These assignments are from KEGG or by best BLAST hit to KEGG, and are not very reliable. Via the ecNum field, this also links to descriptions of enzymes in the ECInfo table.

Locus2Go

locusIdint(10) unsigned
goIdint(10) unsigned
evidencevarchar(255)
Assignments of genes to Gene Ontology identifiers. These assignments have many false positives, partly because they are based on individual domains rather than the entire gene. goId maps to id from the term table.

Locus2Ipr

locusIdint(10) unsigned
iprIdvarchar(9)
taxonomyIdint(10)
Assignment of genes to InterPro ids, which are described in the IPRInfo table.

Locus2Operon

locusIdint(10) unsigned
tuIdint(10) unsigned

MicrobesOnline operon predictions. Genes predicted to be transcribed by themselves will still have an operon, but with only one gene in it. For E. coli, we also have known operons, as explained in the Operon table. The tuId field links to the Operon table.

Plasmids, viruses, incomplete genomes, and eukaryotes usually do not have operon predictions.

Locus2Pdb

pdbIdvarchar(6)
pdbChainchar(1)
locusIdint(10) unsigned
versionint(2) unsigned
seqBeginint(5) unsigned
seqEndint(5) unsigned
pdbBeginint(5) unsigned
pdbEndint(5) unsigned
identitydecimal(5,2) unsigned
alignLengthint(5) unsigned
mismatchint(5) unsigned
gapint(5) unsigned
evaluedouble
evalueDisptext
scoredecimal(10,2) unsigned
BLAST hits of MicrobesOnline genes to proteins of known structure in the protein data bank ("PDB").

Locus2RegPrecise

locusIdint(11)
geneInOperonIndexint(11)
sourceTypeTermIdint(11)
regulonIdint(11)
regulatorNamevarchar(255)
regulatorLocusIdint(11)

Regulatory predictions from RegPrecise. regulatorLocusId is the regulator and locusId is the regulated gene. sourceTypeTermId is 1 for manually curated predictions and 2 for predictions that are automatically propagated to orthologs that have a conserved match to the weight matrix. regulonId is from RegPrecise itself and should be used together with sourceTypeTermId. geneInOperonIndex is the position in the operon (e.g., 1 for the 1st gene in the operon).

Locus2RTB

locusIdint(10) unsigned
rtb_seqfeature_idint(10) unsigned
rtb_characterizedint(1) unsigned
Mapping of MicrobesOnline genes to RegTransBase genes. If rtb_characterized is set then the regulation of this gene has been characterized by a paper.

Locus2RTBArticles

locusIdint(10) unsigned
pubMedIdvarchar(20)
wetExptinyint(1) unsigned
Mapping of MicrobesOnline genes to papers about them from the RegTransBase database. The RegTransBase database stores data, collected from published papers by curators, about gene regulation.

Locus2SwissProt

locusIdint(10) unsigned
versionint(2) unsigned
idvarchar(20)
accessionvarchar(20)
identityint(3) unsigned
bidirint(2) unsigned

Locus2SwissProt stores a mapping between MicrobesOnline genes and UniProt (formerly known as SwissProt) accesssions and identifiers. This table only stores the best hit(s) of each MicrobesOnline gene in UniProt. The SwissProt2Locus table stores a best-hit mapping in the other direction. The Locus2SwissProt.bidir flag is set if this relation is also in the SwissProt2Locus table.

Redundant mappings are common if multiple strains of a genome have been sequenced, because all of the copies of the protein sequence will usually be identical. Although UniProt contains organism information, it generally does not specify the strain. So, we do not use the UniProt organism information to make the mapping -- we only use the protein sequences.

Usually, a gene from a complete genome will have an exact or nearly-exact match in UniProt, but hits down to 80% identity are used if no closer match is available.

Locus2TigrFunCat

locusIdint(10) unsigned
mainRolevarchar(255)
subRolevarchar(255)
evidencevarchar(64)
Links genes to TIGR functional categories. This is based on annotations made by CMR, and does not reference TIGRFam hits from HMMER. For those hits you should use the the Locus2Domain table, the metadata on TIGRFams in the TIGRInfo table, and the descriptions of the roles in the TIGRroles table. The mainRole and subRole columns in the Locus2TigrFunCat table are text descriptions not role ids.

Locus2Tree

treeIdint(10) unsigned
locusIdint(10) unsigned
versionint(2) unsigned
aaTreetinyint(1)
beginint(10) unsigned
endint(10) unsigned
nAlignedint(10) unsigned
scoredecimal(5,1) unsigned
scaffoldIdint(10) unsigned
Links genes to gene trees. aaTree is 1 if the tree is based on protein sequences. begin and end is the region of the gene that is in the alignment that was used to build the tree. (These are 1-based positions within the gene sequence.) Note that because of domain duplications, a gene can occur in a tree more than once. nAligned is the number of aligned positions, and may be less than end-begin+1 because of how profile-based alignment works and because of trimming. score is deprecated. scaffoldId is the scaffoldId of that locus.

LocusCount

scaffoldIdint(10) unsigned
taxonomyIdint(10)
nGenesint(10) unsigned
Counts the number of genes on each scaffold. Also links to Taxonomy.taxonomyId.

LocusSeq

locusIdint(10) unsigned
versionint(2) unsigned
sequencelongblob
The nucleotide sequence of each gene. For protein-coding genes, this includes only the open reading frame (including the stop codon).

LocusType

typesmallint(5) unsigned
descriptionvarchar(100)
This table explains the values of the type field in the Locus table. This is a controlled vocabulary, e.g. "protein-coding gene" (type=1), large subunit ribosomal RNA (type=2), pseudogene of a protein-coding gene (type=7), etc.

MOG

mogIdint(10) unsigned
nComponentsint(10) unsigned
nLociint (10) unsigned
metricfloat

A MicrobesOnline Ortholog Group, or MOG, is the union of maximal ortholog groups, from different trees, that have similar membership. The "components" of a MOG are the maximal ortholog groups, see MOGComponent. The loci that belong to a MOG are stored in MOGMember. Note that a genome can have more than one representative in a MOG, but a locus belongs to at most one MOG. While constructing MOGs, we use the "best" maximal ortholog groups first, as determined by the metric of the average alignment length of the ogmembers. (We may use another metric, such as alignment score, in the future.) A MOG's metric is that of the first ortholog group that was placed in this MOG.

Also see the orthologs overview or the OrthologGroup or COG tables.

MOGComponent

mogIdint(10) unsigned
treeIdint(10) unsigned
ogIdint(10) unsigned
metricfloat
nMembersint(10) unsigned
nMembersBestint(10) unsigned
A MOG component is a maximal ortholog-group (an ortholog group for a tree that is not a duplication and is not contained within any any other non-duplication ortholog group) that was added to a MOG. nMembers is the number of members, and nMembersBest is the number of members for which this was the best ortholog group.

MOGMember

mogIdint(10) unsigned
locusIdint(10) unsigned
versionint(2) unsigned
minBeginint(10) unsigned
maxEndint(10) unsigned
treeIdint(10) unsigned
ogIdint(10) unsigned
metricfloat
nAlignedint(10) unsigned
taxonomyIdint(10)
MOGMember reports which MOG a protein-coding gene belongs to (if any) and which ortholog group was used to add this locus to that MOG. minBegin and maxEnd are the range of positions in the protein that are covered by the trees/ortholog-groups that are included in this MOG. metric is average metric for the ortholog group, while nAligned describes the number of aligned positions for this gene in that tree (this ultimately is from Locus2Tree.nAligned).

MOGNeighborScores

mog1int(10) unsigned
mog2int(10) unsigned
scorefloat
nTaxGroupsBothint(10) unsigned
nTaxGroups1int(10) unsigned
nTaxGroups2int(10) unsigned

MOGNeighborScores reports which MOGs are conserved near each other across more than one group of closely-related genomes. Only pairs with mog1 < mog2 are included in the table. nTaxGroupsBoth is the number of genome groups for which the genes are near each other. nTaxGroups1 is the number of genome groups that contain at least one representative of mog1. score = nTaxGroupsBoth / max(nTaxGroups1, nTaxGroups2). Genes are considered to be near each other if they are adjacent and spaced by less than 200 nucleotides, or if they are within a run of genes on the same strand for which each adjacent pair of which is spaced by less than 200 nucleotides. Groups of closely-related genomes are defined by clades from the MicrobesOnline species tree that have diverged by at most 0.02 from their common ancestor. (For example, this threshold groups strains of Escherichia coli and various Shigella species with each other but separately from Salmonella species.)

ObjectDescr

objectIdint(2) unsigned
descriptionvarchar(255)
This table explains the values of the objectId fields in the Position table, e.g. if the position is for a Locus or an Operon.

OGMember

treeIdint(10) unsigned
ogIdint(10) unsigned
locusIdint(10) unsigned
versionint(2) unsigned
beginint(5) unsigned
endint(5) unsigned
taxonomyIdint(10)
nMemberThisGenomeint(10) unsigned
aaLengthint(10) unsigned

This table describes the members of each ortholog group. A gene with a domain duplication can be in a tree more than once, which is why begin and end are necessary to guarantee uniqueness. nMemberThisGenomes helps determine if this is a 1:1 ortholog. aaLength (the length of this protein in amino acids) helps determine if the orthology relation covers the full length of the gene.

Operon

tuIdint(10) unsigned
posIdint(10) unsigned
evidencevarchar(255)
Each entry in the Operon links together one or more genes that are known or predicted to be transcribed together. The evidence field is either "VIMSS Predicted" for our operon predictions or "Confirmed" for experimentally characterized operons. Operon also links to the Position table, which stores the extent of the operon. Those positions are based on the extents of the genes, not on predicting transcription starts or stops.

Ortholog

locus1int(10) unsigned
locus2int(10) unsigned

Ortholog stores putative orthology relationships between pairs of genes. Any gene will map to at most 1 gene in another genome. These "orthologs" are from a FastBLAST-based variant of bidirectional best BLASTp hits and are often incorrect. Also, the table is growing very rapidly (it grows as the square of the number of genomes). So, these will be discontinued soon. Use the OrthologGroup or MOG tables instead.

OrthologGroup

treeIdint(10) unsigned
ogIdint(10) unsigned
parentOGint(10) unsigned
isDuplicationtinyint(1)
nGenesint(10) unsigned
nGenomesint(10) unsigned
nNonUniqueGenomesint(10) unsigned
splitTaxIdint(10)

The primary key of this table is the combination of treeId and ogId. ogIds themselves are not unique.

Our definition of ortholog group is a clade in a gene tree that is mostly one copy per genome, or, if isDuplication is set, a clade of genes from a closely-related group of genomes that have undergone duplication. These are not evolutionary orthologs as originally defined by Fitch (evolutionary orthologs are genes that diverged by a speciation event). Nor is there any guaranatee that these orthologs have the same function. However, be believe that these tree-based orthologs, if used carefully (see below), are more accurate than bidirectional-best hits, and they are faster to compute and smaller to store. The members of the ortholog groups are in the OGMember table. Every gene in a gene tree will be in an ortholog group, even if the ortholog "group" contains only that gene.

Ortholog groups can be nested inside larger ortholog groups, as described by parentOG (which may be NULL). For example, suppose that two closely related genomes 1 and 2 have a duplication of a gene A, into A1/A2 and a1/a2. There will be ortholog groups for A1/A2, for a1/a2, another with isDuplication=1 for A1/A2/a1/a2, and then (potentially) a larger ortholog group with isDuplication=0 that contains A1/A2/a1/a2 as well as orthologs of these genes from other genomes that did not undergo duplication. splitTaxId is a genome that contains paralogs that are just outside this ortholog group and is hence at least partially responsible for this ortholog group being split here.

nGenes and nGenomes are the total number of genes, and nNonUniqueGenomes is the number of genomes with more than one gene in this ortholog group.

Complications with using tree-based orthologs. These are handled by Gene::treeOrthologs().

Also see the orthologs overview or the MOG or COG tables.

PdbEntries

pdbIdvarchar(6)
headertext
ascessionDatedate
compoundtext
sourcetext
authorListtext
resolutionfloat
experimentTypetext
dbSourcetext
Data on proteins of known structure from the protein data bank ("PDB").

PdbReps

pdbIdRepvarchar(6)
pdbChainRepchar(1)
pdbIdvarchar(6)
pdbChainchar(1)
Data on chains within proteins from PDB. pdbChain is not unique -- only the combination with pdbId is unique.

PdbSeq

pdbIdvarchar(6)
pdbChainchar(1)
versionint(2) unsigned
sequencelongblob
The sequence of each chain of a protein in PDB.

PfamClan

clanIdvarchar(10)
domainIdvarchar(10)
For some diverse families, which are difficult to represent with a single HMM, PFam contains multiple models with overlapping specificity, or "Clans." Thus, PfamClan table stores the mapping between PFam clans and domainIds.

Position

posIdint(10) unsigned
strandenum('+','-')
beginint(10) unsigned
endint(10) unsigned
scaffoldIdint(10) unsigned
objectIdint(2) unsigned

Position stores the position of objects such as genes or operons. begin and end are 1-based, and strand is either "+" or "-". For objects on linear chromosomes, begin is always less than end and strand, so for a protein-coding gene on the "-" strand, the start codon will be at the end position.

Objects on circular chromosomes (see Scaffold.isCircular) can wrap around the origin. In this case, both begin and end will be within the range [1,Scaffold.length], and end will be less than begin. In other words, end < begin indicates that the object wraps around the origin.

RegulonCluster

clusterIdint(10) unsigned
locusIdint(10) unsigned
updatedtimestamp
linkvarchar(50)

RegulonLinks

cluster1int(10) unsigned
cluster2int(10) unsigned
linkvarchar(50)
scoredecimal(10,3) unsigned
updatedtimestamp
RegulonCluster and RegulonLinks store predicted "regulons", or genes that are likely to have related functions and/or regulation. Each clusterId links several locusId in the same genome together, and RegulonLinks gives a higher-level clustering of those clusters. At this time, all predictions are based on conserved proximity, or the "gene neighbor" method, and the link field is "GNScore". In the future, gene expression-based links may also be included. Also, at this time, each cluster contains only genes that are in the same operon, and the links link the operons together.

Scaffold

scaffoldIdint(10) unsigned
chr_numint(10) unsigned
isCircularint(1)
lengthint(10) unsigned
filevarchar(32)
isGenomicint(1)
giint(10) unsigned
taxonomyIdint(10)
commentvarchar(255)
isActiveint(1)
isPartialint(1)
createddate
allowUpdatestinyint(3) unsigned

This table represents chromosomes, plasmids, or contigs. A genome may have multiple scaffolds and each scaffold may have thousands of genes. To make sure that you do not use out-of-date data, please require "Scaffold.isActive=1" whenever you query this table. If you don't want data from plasmids, require "isGenomic=1". Require "isPartial=0" to consider only complete genomes. isCircular is 1 for circular chromosomes and 0 for linear chromosomes.

For scaffolds from RefSeq, file will be the NC_nnnnnn identifier and gi should be a genbank id.

ScaffoldIsActiveFlag

isActiveint(10) unsigned
descriptionvarchar(255)
This table describes the levels of the Scaffold.isActive flag.

ScaffoldPosChunks

locusIdint(10) unsigned
versionint(2) unsigned
posIdint(10) unsigned
scaffoldIdint(10) unsigned
kbtint(10) unsigned

This table indexes the positions of genes on scaffolds. This way you can quickly find genes at between, say, 110 and 120 kB on the E. coli chromosome. The posId field contains the gene's position (the same as Locus.posId), and the kbt value is the gene's position divided by 10,000 and rounded down. If a gene overlaps more than one 10 kB chunk of the genome then it will appear multiple times in this table.

ScaffoldSeq

scaffoldIdint(10) unsigned
sequencelongblob
This table stores the DNA sequence of each scaffold.

SwissProt

idvarchar(20)
accessionvarchar(20)
lengthint(10)
descriptiontext
This table stores data about each UniProt protein.

SwissProt2Locus

locusIdint(10) unsigned
versionint(2) unsigned
idvarchar(20)
accessionvarchar(20)
identityint(3) unsigned
SwissProt2Locus stores a mapping from a UniProt protein to a MicrobesOnline gene. This table only stores the best hit of each UniProt gene in MicrobesOnline. The Locus2SwissProt table stores a best-hit mapping in the other direction. See its description for more explanation of this mapping.

SwissProt2Pubmed

idvarchar(20)
accessionvarchar(20)
PubMedIdint(20) unsigned
isDetailedtinyint(1)
SwissProt2Pubmed stores a mapping between UniProt proteins and papers at PubMed. isDetailed is set if the paper links to 5 genes or less. Most of the entries with isDetailed=0 are from genome papers; most of the entries with isDetailed=1 contain experimental data about the function or regulation of a gene.

SwissProtBlast

locusIdint(10) unsigned
versionint(2) unsigned
subjectvarchar(20)
identityfloat unsigned
alignLengthint(10) unsigned
mismatchint(10) unsigned
gapint(10) unsigned
qBeginint(10) unsigned
qEndint(10) unsigned
sBeginint(10) unsigned
sEndint(10) unsigned
evaluedouble
scorefloat
This paper stores the homology relations between SwissProt genes and MicrobesOnline genes. Despite the name, the hits are from BLAT, not BLAST. The subjects are of the form accession|id, e.g. "Q31A92|PSBA_PROM9". Due to size limitations, this table currently does not exist.

Synonym

namevarchar(100)
locusIdint(10) unsigned
versionint(2) unsigned
typeint(2) unsigned
createddate
sourcevarchar(50)
This table contains names and identifiers for genes, such as gene names (type 0: "cheA"), locus tags (type 1: "b1888"), genbank ids (type 2: "16129840"), or genbank accessions (type 3: "NP_416402.1").

SynonymType

typeint(10) unsigned
descriptionvarchar(255)
This table should document the types of identifiers or synonyms for genes.

TIGRInfo

idvarchar(16)
tigrIdvarchar(30)
typevarchar(32)
roleIdint(5)
geneSymbolvarchar(10)
ecvarchar(16)
definitionlongtext
Metadata on each TIGRFam. TIGRFam tigrId are the same as domain Ids that appear in the Locus2Domain and DomainInfo tables, and the roleIds are described in the TIGRroles table.

TIGRroles

roleIdint(5)
levelenum('main','sub1')
descriptionvarchar(255)
Data on TIGR "roles", a functional classification of many of the TIGRFAM gene families.

TaxName

taxonomyIdint(20)
nametext
uniqueNametext
classvarchar(255)
From the NCBI taxonomy database: names for taxonomic groups. Note that the Taxonomy table only contains data for organisms with genomes (i.e., strains or species), while this table contains all taxa from the NCBI database. See TaxName for names for these taxa, and TaxParentChild for genomes of organisms within each taxonomic group.

TaxNode

taxonomyIdint(10) unsigned
parentIdint(10) unsigned
rankvarchar(50)
emblvarchar(10)
divIdint(3) unsigned
flag1int(1) unsigned
flag2int(1) unsigned
flag3int(1) unsigned
flag4int(1) unsigned
flag5int(1) unsigned
flag6int(1) unsigned
flag7int(1) unsigned
commentsvarchar(255)
From the NCBI taxonomy database: the parent/child relationship between taxonomic groups. Also see TaxName and TaxParentChild.

TaxParentChild

parentIdint(10)
childIdint(10)
nDistanceint(10) unsigned
This table expands the parent-child relationships for all genomes that are in our database and their higher-level taxa. For example, this table lets you select all sequenced genomes within a taxonomic group with a single query. parentId and childId are NCBI taxonomyIds from the TaxNode table. The childId may be in the Taxonomy table as well. nDistances is the number of links on the path from parent down to child.

Taxonomy

taxonomyIdint(10)
namevarchar(255)
placementint(10)
shortNamevarchar(100)
taxDispGroupIdint(10)
createddate
PMIDvarchar(50)
Publicationvarchar(255)
Uniprotvarchar(10)

This table includes only organisms (or viruses or plasmids) whose genome sequences are in our database. The taxonomyId values are from the NCBI taxonomy database. These are usually species- or strain-level taxa.

Higher-level taxa are available from TaxParentChild, TaxNode, or TaxName. Note that if a plasmid from an organism is sequenced, then that organism may appear in this table, even though its genome is not sequenced. Use Scaffold.isGenomic=1 to avoid plasmid scaffolds, and use TaxParentChild to get genomes within a certain taxa. In particular, to avoid viruses, use parentId=131567, which corresponds to all cellular organisms.

Use Scaffold to get to complete chromosomes or contigs, and from there you can get to genes (the Locus table).

Taxonomy2Tree

treeIdint(10) unsigned
taxonomyIdint(10)
Links genomes to species trees. Plasmids, viruses, incomplete genomes, and mixed "genomes" do not appear in the species trees.

Tree

treeIdint(10) unsigned
namevarchar(64)
typevarchar(30)
modifiedtimestamp
newicklongblob

Gene trees and species trees. The newick field contains the actual tree, in Newick format. For gene trees, the leaf ids are of the form locusId_version_begin_end or locusId_begin_end. For species trees, the leaf ids are taxonomy ids. See the tree-browser documentation for more information on how these trees are constructed. The full species tree always has name="MOSpeciesTree". There are also other species trees for subgroups of genomes. The type field is "species" for species trees or otherwise the type of family that was used to build the tree: 16S, Adhoc, COG, FastBLAST, Gene3D, PFAM, PIRSF, SMART, SSF, TIGRFAMs. For gene trees, the name is some sort of name for the family (e.g., the cogInfoId for COG trees).

Because private genomes are removed from pub.microbesonline.org, these trees may contain genes or organisms that are not in the public versions of the tables. Use the TreeUtils::simplify() routine or the Genomics/stat/simplifyTree.pl program to remove these leaves from the trees.

UserSessions

userSessionIdint(10) unsigned
sessionIdvarchar(24)
ipvarchar(16)
uaHashvarchar(24)
userIdint(10) unsigned
browservarchar(32)
versionvarchar(16)
platformvarchar(16)
archvarchar(16)
unknownUavarchar(255)
Records cookies to keep track of user's sessions. This table is empty in pub.microbesonline.org.

UserSessionsURLTracking

userSessionIdint(10) unsigned
requestUrivarchar(255)
requestParamslongtext
methodenum('GET','POST')
accessTimeint(10) unsigned
Records HTTP requests. This table is empty in pub.microbesonline.org.

Users

userIdint(10) unsigned
namevarchar(40)
orgvarchar(40)
emailvarchar(80)
pwhashvarchar(32)
annotationTrustsmallint(6)
emailDataReleaseint(1)
emailSiteMaintenanceint(1)
emailSoftwareint(1)
isSysAdminint(1) unsigned
Records registered users. This table is empty in pub.microbesonline.org.

graph_path

idint(11)
term1_idint(11)
term2_idint(11)
distanceint(11)
This records parent-child relationships in the Gene Ontology hierarchy. term1_id and term2_id link to term.id, and term1_id is the parent id. distance is the number of steps between them.

term

idint(11)
namevarchar(255)
term_typevarchar(55)
accvarchar(255)
is_obsoleteint(11)
is_rootint(11)
Gene Ontology identifiers. id (goId or term_id in tables that refer to this one) is an internal identifier; most people use the accession ("acc"), such as GO:0000018, regulation of DNA recombination. Also see Locus2Go and GoCount tables, and the other GO tables below.

term2term

idint(11)
relationship_type_idint(11)
term1_idint(11)
term2_idint(11)
completeint(11)
Relationships between Gene Ontology identifiers. term1_id and term2_id are ids in the term table. The relationship_type_id is in the term_definition table -- usually it is 2, which means a parent-child relation.

term_definition

term_idint(11)
term_definitiontext
dbxref_idint(11)
term_commentmediumtext
referencevarchar(255)
Longer descriptions of term ids in the term table.

term_synonym

term_idint(11)
term_synonymtext
acc_synonymvarchar(255)
synonym_type_idint(11)
synonym_category_idint(11)
Alternate names for Gene Ontology identifiers.

Obtaining our source code

You can download our source code from here. Note that the executables in this download (in Genomics/browser/bin/) are all 64-bit Linux executables. Please double-check any results you get from using our scripts, as the database schema may have changed since the script was written.

Popular perl scripts

  • Genomics/util/getGenes.pl: Get names, descriptions, and other annotations for all genes in a genome or a given list of loci.
  • Genomics/util/matchGenes.pl: Given a tab-delimited file, map the gene names to locusIds ("VIMSS ids")
  • Genomics/util/scaffoldSubseq.pl: Get sequences for ranges within scaffolds or for ranges relative to a gene
  • Genomics/seq/getUpstream.pl: Get 500 nt upstream of all genes in a genome, truncating parts that overlap other genes

    Overview of the perl modules

    The key modules are in Genomics/Perl/modules. There are also additional modules, mainly to support the web site code, in Genomics/browser/lib. The key modules are:

    Perl Examples

    (Also see the source code of the above perl scripts.)

    Fetch all genes in E. coli K12 (taxonomyId 511145) and tree-orthologs in another strain of E. coli and in D. vulgaris Hildenborough (taxonomyId 83334 and 882). (This takes ~1/2 a second per gene, so doing the whole genome will take a little while.)

    use GenomicsUtils;
    use Genome;
    use Gene;
    GenomicsUtils::connect();
    my $genome = Genome::new("taxonomyId" => 511145);
    my @genes = $genome->genes();
    push @genes, $genome->nongenomicGenes();
    foreach my $gene (@genes) {
       if($gene->type()==1) {
           my $l = $gene->treeOrthologs("taxa" => [83334,882]);
           while (my ($orthId, $hash) = each %$l) {
               print join("\t",$gene->locusId(), $orthId, $hash->{taxonomyId})."\n";
           }
       }
    }
    

    Fetch all 1:1 MOG-based orthologs between two loci of interest from E. coli K12 (VIMSS 14484 and 14485) in another strain of E.c oli and in D. vulgaris Hildenborough. (Note: MOGs need not be 1:1, but Gene::MOGForTaxa removes non-1:1 hitsorthologs.)

    use GenomicsUtils;
    use Gene;
    GenomicsUtils::connect();
    my @loci = (14484,14485);
    my @taxa = (83334,882);
    my $mogOrthologs = Gene::MOGForTaxa(\@loci, \@taxa);
    while (my ($locus,$moghash) = each %$mogOrthologs) {
       while (my ($taxId,$orthId) = each %$moghash) {
          print join("\t", $locus, $orthId, $taxId)."\n";
       }
    }
    

    Installing your own web server

    You can install your own MicrobesOnline web server, if you want to modify the cgi scripts and/or add your own data to the database. The first step is to install your own copy of the database. We will provide mysqldump files suitable for loading into your own MySQL server at a later date; please contact us at help@microbesonline.org if you are interested. Very rough guideliness on installing your own web server and on what what other packages you need to install are in Genomics/config/README. Please contact us for further advice.

    Linking to the MicrobesOnline genome browser

    You can use the MicrobesOnline browser to view a set of genomes or to show where items that you have identified (e.g., BLAST hits) are located relative to genes. You can use URLs with http://www.microbesonline.org/cgi-bin/browser? and various mode= and data= options: