MicrobesOnline: Tools and Downloads for Programmers
- Access to data
- Code
- Links
This page gives technical advice on how to programmatically obtain data from the MicrobesOnline data. Please see the tutorial for background on MicrobesOnline and on the data it contains.
Due to security concerns, as of April 5, 2018, we are restricting access to our public SQL server.
Please contact us at help@microbesonline.org if you
wish to have access to our server.
We have set up a public read-only SQL server at
pub.microbesonline.org. (This SQL server may move to another host in
the future.)
This server will include only data from public genomes.
Data about genes and genomes will be in the
"genomics" database. Functional genomics data, such as microarray
data, will ultimately be in the "microarray" database. (The microarray database will not be available
initially.) For both databases, use the username "guest" and the password "guest".
To connect, use the mysql command-line tool, e.g.
mysql -h pub.microbesonline.org -u guest -pguest genomics
or
mysql -h pub.microbesonline.org -u guest -pguest genomics -B -e 'my query' > output_file
Please limit your active queries to two, and active connections to three. If a query of yours is killed, it may be because it was running for too long, using too much memory, or using too many connections.
If this happens to you, please email us at help@microbesonline.org
and we will help you debug any problematic queries.
Nucleotide and protein BLAST databases and a FastBLAST database of the predicted proteins will be available for download
at a later date.
Get all VIMSS locus Ids and genbank ids for protein-coding genes in E. coli K12 (NCBI taxonomyId 511145):
SELECT Locus.locusId, Synonym.name AS genbankId
FROM Scaffold JOIN Locus USING (scaffoldId)
JOIN Synonym USING (locusId,version)
WHERE Locus.priority=1
AND Locus.type=1
AND Scaffold.isActive=1
AND Scaffold.taxonomyId=511145
AND Synonym.type=2;
Get all taxonomy Ids for sequenced delta-Proteobacteria (NCBI taxonomyId 28221), and their names:
SELECT DISTINCT Taxonomy.taxonomyId, Taxonomy.shortName
FROM Scaffold JOIN Taxonomy USING (taxonomyId)
JOIN TaxParentChild ON TaxParentChild.childId=Taxonomy.taxonomyId
WHERE Scaffold.isActive=1 AND Scaffold.isGenomic=1 AND TaxParentChild.parentId=28221;
Groups of tables:
|
Alphabetic list of tables:
Deprecated tables: AutoAnno, BLASTp, BLASTp_dddd where dddd is a taxonomyId, TMHMM, FamilyAlignments, FastBLASTDomains
|
MicrobesOnline also includes microarray data, in the "microarray" database, but this is not in the public SQL database yet.
The amino acid sequence of a protein-coding gene.
requesterId | int(10) unsigned |
requesterType | enum('user','group') |
resourceId | int(10) unsigned |
resourceType | enum('cart','job','uarray','scaffold') |
read | tinyint(1) |
write | tinyint(1) |
admin | tinyint(1) |
Permissions to access scaffolds,
carts, jobs, and microarray experiments, either by
user or by group. All of the data in pub.microbesonline.org
is accessible by group 1 (public / all users).
annotationId | int(10) unsigned |
source | varchar(64) |
date | int(10) unsigned |
This table stores the history of all the annotations, including both MicrobesOnline automated annotations, annotations from RefSeq, and manual annotations created by MicrobesOnline users, sorted by date.
(This table may not exist in pub.microbesonline.org.)
The source field is a userId. The actual content of the annotation is in the AnnotationDetail table.
annotationId | int(10) unsigned |
locusId | int(10) unsigned |
type | enum('name','synonym','description','ecNum','go','comment') |
action | enum('append','replace','delete') |
annotation | text |
AnnotationDetail contains the actual data for each annotation. There may be multiple changes with the same annotationId.
The COG assignment for the gene is in cogInfoId. (Please ignore cogId.) Genes are only assigned to the best-matching COG, and the gene must cover at least 60% of the COG.
Number of genes in each COG functional category on each scaffold.
funCode | char(1) |
description | varchar(70) |
funGroup | varchar(50) |
Descriptions of the COG function codes.
cogInfoId | int(10) unsigned |
funCode | varchar(5) |
description | varchar(255) |
geneName | varchar(20) |
cddId | varchar(255) |
length | int(10) unsigned |
Metadata on each COG id (cogInfoId). cddId is the conserved domain database identifier for this COG. funCode can be multiple characters, each of which has its own row in COGFun.
locusId | int(10) unsigned |
version | int(2) unsigned |
subject | varchar(20) |
identity | float unsigned |
alignLength | int(10) unsigned |
mismatch | int(10) unsigned |
gap | int(10) unsigned |
qBegin | int(10) unsigned |
qEnd | int(10) unsigned |
sBegin | int(10) unsigned |
sEnd | int(10) unsigned |
evalue | double |
score | float |
All hits, from rpsblast, to each COG. The subject is the COG number (as in COGInfo.cogInfoId). A gene may have hits to multiple COGs, even though it is only assigned to one of them (or to none of them if the hits are too short).
cartId | int(10) unsigned |
userId | int(10) unsigned |
name | varchar(32) |
seqData | longtext |
seqCount | int(10) unsigned |
time | int(10) unsigned |
active | int(1) unsigned |
User-created lists of genes, or "carts." This table is empty in pub.microbesonline.org.
locusId | int(10) unsigned |
familyId | int(10) unsigned |
type | smallint(5) unsigned |
Clusters of CRISPR repeats and spacers based on BLAST hit connectivity.
descriptionId | int(10) unsigned |
description | text |
source | varchar(255) |
created | date |
locusId | int(10) unsigned |
version | int(2) unsigned |
The gene descriptions, e.g., "DNA-binding transcriptional activator, 3HPP-binding". These are often incorrect or misleading.
domainDb | varchar(20) |
domainId | varchar(20) |
domainName | varchar(50) |
iprId | varchar(10) |
iprName | varchar(100) |
domainLen | int(5) unsigned |
fileName | varchar(20) |
Meta-data on interpro domains and sites. iprId is defined in the IPRInfo table.
ecNum | varchar(20) |
name | varchar(255) |
Metadata on enzyme classification numbers.
domainId | varchar(20) |
locusId | int(10) unsigned |
version | int(10) unsigned |
seqBegin | int(5) unsigned |
seqEnd | int(5) unsigned |
Associations of genes with ad-hoc domains from FastBLAST. This table has been deprecated.
domainDb | varchar(20) |
domainId | varchar(20) |
hmmName | varchar(10) |
domainLen | int(5) unsigned |
Data on each HMM for the InterPro families, with domainDb and domainId linked to hmmName (the same as hmmId in FasthmmRawHits). Many InterPro families are described by more than one HMM.
hmmId | varchar(20) |
locusId | int(10) unsigned |
version | int(2) unsigned |
seqBegin | int(5) unsigned |
seqEnd | int(5) unsigned |
domainBegin | int(5) unsigned |
domainEnd | int(5) unsigned |
score | float |
evalue | float |
The raw FastHMM hits for each InterPro domain (Gene3D, PANTHER, Pfam, PIRSF, SMART, Superfam, TIGRFAM). The hmmIds are described in FasthmmFamily2HMM.hmmName.
goId | int(10) unsigned |
goCount | int(10) unsigned |
taxId | int(10) |
Numbers of genes (goCount) in each gene ontology category (goId) for each genome (taxId). goId maps to id in the term table and taxId maps to taxonomyId in the Taxonomy table.
groupId | int(10) unsigned |
userId | int(10) unsigned |
active | tinyint(1) unsigned |
time | int(10) unsigned |
Which users are members of which groups.
groupId | int(10) unsigned |
name | varchar(32) |
description | varchar(255) |
adminUserId | int(10) unsigned |
Groups of users are defined for access control. Group 1 is special: all users belong to it.
hitId | varchar(100) |
synId | varchar(100) |
sLength | int(10) unsigned |
description | longtext |
Deprecated. This table is empty on pub.microbesonline.org.
Which interpro id (iprId, defined in the IPRInfo table) maps to which goId (defined by id in the term table).
iprId | varchar(9) |
type | varchar(16) |
shortName | varchar(50) |
proteinCount | int(5) |
iprName | varchar(255) |
Each interpro id defines a conserved gene function (e.g., "IPR005471" corresponds to "Transcriptional regulator IclR, N-terminal"). proteinCount is from the InterPro database and does not reflect the number of genes in MicrobesOnline that is associated with that interpro id.
locusId | int(10) unsigned |
version | int(2) unsigned |
checksum | varchar(16) |
length | int(5) unsigned |
domainDb | varchar(30) |
domainId | varchar(30) |
domainDesc | varchar(255) |
domainStart | int(5) |
domainEnd | int(5) |
evalue | float |
status | varchar(10) |
date | varchar(50) |
iprId | varchar(9) |
iprName | varchar(255) |
geneOntology | longtext |
Assignments of genes to InterPro sites, such as from PRINTS or PROSITE.
Assignments to InterPro families (e.g. PFams, TIGRFAMs, etc.) are in the Locus2Domain table, not in this table.
jobId | int(10) unsigned |
parentJobId | int(10) unsigned |
userId | int(10) unsigned |
cartId | int(10) unsigned |
jobName | varchar(32) |
jobType | varchar(32) |
jobData | text |
jobCmd | text |
status | int(2) unsigned |
time | int(10) unsigned |
doneTime | int(10) unsigned |
saved | int(1) unsigned |
Computations requested by users. This table is empty in pub.microbesonline.org.
The mapping between NCBI taxonomy ids for genomes and KEGG organism ids.
compound | varchar(7) |
name | tinytext |
formula | tinytext |
mass | double |
Small molecules (metabolic compounds) from KEGG.
mapId | varchar(20) |
object | varchar(20) |
type | int(1) |
url | varchar(255) |
coord | varchar(100) |
Where the objects are on the KEGG maps, to support clicking through to compounds or enyzymes.
hitId | varchar(100) |
sLength | int(10) unsigned |
description | longtext |
Descriptions of genes from the KEGG database. hitId is a KEGG identifier.
mapId | varchar(20) |
title | varchar(255) |
Titles of the KEGG maps -- the actual maps are in image files.
locusId | int(10) unsigned |
version | int(2) unsigned |
subject | varchar(30) |
identity | float unsigned |
alignLength | int(10) unsigned |
mismatch | int(10) unsigned |
gap | int(10) unsigned |
qBegin | int(10) unsigned |
qEnd | int(10) unsigned |
sBegin | int(10) unsigned |
sEnd | int(10) unsigned |
evalue | double |
score | float |
Hits of MicrobesOnline genes to KEGG genes. Subjects are KEGG ids, e.g. "aae:aq_001". Despite the name, these hits are from BLAT, not from BLAST. Due to size limitations, this table currently does not exist.
locusId | int(10) unsigned |
version | int(2) unsigned |
priority | tinyint(3) unsigned |
created | date |
posId | int(10) unsigned |
evidence | varchar(50) |
scaffoldId | int(10) unsigned |
type | smallint(5) unsigned |
The central object in the MicrobesOnline database is a gene or "Locus". Locus includes protein-coding genes, non-coding RNA genes and pseudogenes. The Locus table includes a locusId, which is not unique, a version number, and a priority flag. The combination of a locusId and a version number is unique, and priority=1 for the current version. To avoid using stale and/oor redundant data, please make sure that you require "Locus.priority=1" whenever you query this table. Locus ids are also referred to as VIMSS ids and MicrobesOnline gene ids. (They are also referred to as orfIds in some of our older code.) Locus.type gives the type of the locus, e.g. type=1 for protein-coding gene. See the LocusType table for more information.
Locus also links to Scaffold.scaffoldId and to Position.posId. The Scaffold table includes a taxonomyId, so you can use that to determine what genome the Locus is from. The Position table stores the location of the gene. For protein-coding genes, the Position stores the range from the predicted start codon to the stop codon.
A large number of tables give additional information about a locus. Most of these tables include both locusId and version.
Some of the tables includes only locusId but no version -- those entries implicitly refer
to the active version of the locus (the one with priority=1).
domainId | varchar(20) |
locusId | int(10) unsigned |
version | int(2) unsigned |
seqBegin | int(5) unsigned |
seqEnd | int(5) unsigned |
domainBegin | int(5) unsigned |
domainEnd | int(5) unsigned |
score | float |
evalue | float |
Assignments of genes to InterPro families. Within each database (e.g., within Pfam), these assignments are based on the best-hitting family for each region. There may be many more hits, or multiple hits to different HMMs for the same domainId, in FasthmmRawHits.
Assignments of genes to enzyme classification numbers such as "1.4.1.15". These assignments are from KEGG or by best BLAST hit to KEGG, and are not very reliable. Via the ecNum field, this also links to descriptions of enzymes in the ECInfo table.
locusId | int(10) unsigned |
goId | int(10) unsigned |
evidence | varchar(255) |
Assignments of genes to Gene Ontology identifiers. These assignments have many false positives, partly because
they are based on individual domains rather than the entire gene.
goId maps to id from the term table.
Assignment of genes to InterPro ids, which are described in the IPRInfo table.
MicrobesOnline operon predictions. Genes
predicted to be transcribed by themselves will still have an operon,
but with only one gene in it. For E. coli, we also have known operons,
as explained in the Operon table. The tuId field links to the Operon table.
Plasmids, viruses, incomplete genomes, and eukaryotes usually do not have operon predictions.
pdbId | varchar(6) |
pdbChain | char(1) |
locusId | int(10) unsigned |
version | int(2) unsigned |
seqBegin | int(5) unsigned |
seqEnd | int(5) unsigned |
pdbBegin | int(5) unsigned |
pdbEnd | int(5) unsigned |
identity | decimal(5,2) unsigned |
alignLength | int(5) unsigned |
mismatch | int(5) unsigned |
gap | int(5) unsigned |
evalue | double |
evalueDisp | text |
score | decimal(10,2) unsigned |
BLAST hits of MicrobesOnline genes to proteins of known structure in the protein data bank ("PDB").
locusId | int(11) |
geneInOperonIndex | int(11) |
sourceTypeTermId | int(11) |
regulonId | int(11) |
regulatorName | varchar(255) |
regulatorLocusId | int(11) |
Regulatory predictions from RegPrecise. regulatorLocusId is the regulator and locusId is the regulated gene. sourceTypeTermId is 1 for manually curated predictions and 2 for predictions that are automatically propagated to orthologs that have a conserved match to the weight matrix. regulonId is from RegPrecise itself and should be used together with sourceTypeTermId. geneInOperonIndex is the position in the operon (e.g., 1 for the 1st gene in the operon).
locusId | int(10) unsigned |
rtb_seqfeature_id | int(10) unsigned |
rtb_characterized | int(1) unsigned |
Mapping of MicrobesOnline genes to RegTransBase genes. If rtb_characterized is set then the regulation of this gene has been characterized by a paper.
locusId | int(10) unsigned |
pubMedId | varchar(20) |
wetExp | tinyint(1) unsigned |
Mapping of MicrobesOnline genes to papers about them from the RegTransBase database. The RegTransBase database stores data, collected from published papers by curators, about gene regulation.
locusId | int(10) unsigned |
version | int(2) unsigned |
id | varchar(20) |
accession | varchar(20) |
identity | int(3) unsigned |
bidir | int(2) unsigned |
Locus2SwissProt stores a mapping between MicrobesOnline genes and UniProt (formerly known as SwissProt) accesssions and identifiers. This table only stores the best hit(s) of each MicrobesOnline gene in UniProt. The SwissProt2Locus table stores a best-hit mapping in the other direction. The Locus2SwissProt.bidir flag is set if this relation is also in the SwissProt2Locus table.
Redundant mappings are common if multiple strains of a genome have been sequenced, because all of the copies of the protein sequence will usually be identical. Although UniProt contains organism information, it generally does not specify the strain. So, we do not use the UniProt organism information to make the mapping -- we only use the protein sequences.
Usually, a gene from a complete genome will have an exact or nearly-exact match in UniProt, but hits down to 80% identity are used if no closer match is available.
locusId | int(10) unsigned |
mainRole | varchar(255) |
subRole | varchar(255) |
evidence | varchar(64) |
Links genes to TIGR functional categories. This is based on annotations made by CMR, and does not reference TIGRFam hits from HMMER. For those hits you should use the the Locus2Domain table, the metadata on TIGRFams in the TIGRInfo table, and the descriptions of the roles in the TIGRroles table. The mainRole and subRole columns in the Locus2TigrFunCat table are text descriptions not role ids.
treeId | int(10) unsigned |
locusId | int(10) unsigned |
version | int(2) unsigned |
aaTree | tinyint(1) |
begin | int(10) unsigned |
end | int(10) unsigned |
nAligned | int(10) unsigned |
score | decimal(5,1) unsigned |
scaffoldId | int(10) unsigned |
Links genes to gene trees. aaTree is 1 if the tree is based on protein sequences. begin and end is the region of the gene that is in the alignment that was used to build the tree. (These are 1-based positions within the gene sequence.) Note that because of domain duplications, a gene can occur in a tree more than once. nAligned is the number of aligned positions, and may be less than end-begin+1 because of how profile-based alignment works and because of trimming. score is deprecated. scaffoldId is the scaffoldId of that locus.
Counts the number of genes on each scaffold. Also links to Taxonomy.taxonomyId.
The nucleotide sequence of each gene. For protein-coding genes, this includes only the open reading frame (including the stop codon).
type | smallint(5) unsigned |
description | varchar(100) |
This table explains the values of the type field in the Locus table. This is a controlled vocabulary,
e.g. "protein-coding gene" (type=1), large subunit ribosomal RNA
(type=2), pseudogene of a protein-coding gene (type=7), etc.
mogId | int(10) unsigned |
nComponents | int(10) unsigned |
nLoci | int (10) unsigned |
metric | float |
A MicrobesOnline Ortholog Group, or MOG, is the union of maximal ortholog groups, from different trees, that have similar membership. The "components" of a MOG are the maximal ortholog groups, see MOGComponent. The loci that belong to a MOG are stored in MOGMember. Note that a genome can have more than one representative in a MOG, but a locus belongs to at most one MOG. While constructing MOGs, we use the "best" maximal ortholog groups first, as determined by the metric of the average alignment length of the ogmembers. (We may use another metric, such as alignment score, in the future.) A MOG's metric is that of the first ortholog group that was placed in this MOG.
Also see the orthologs overview or the OrthologGroup or COG tables.
mogId | int(10) unsigned |
treeId | int(10) unsigned |
ogId | int(10) unsigned |
metric | float |
nMembers | int(10) unsigned |
nMembersBest | int(10) unsigned |
A MOG component is a maximal ortholog-group (an ortholog group for a
tree that is not a duplication and is not contained within any any
other non-duplication ortholog group) that was added to a
MOG. nMembers is the number of members, and nMembersBest is the number
of members for which this was the best ortholog group.
mogId | int(10) unsigned |
locusId | int(10) unsigned |
version | int(2) unsigned |
minBegin | int(10) unsigned |
maxEnd | int(10) unsigned |
treeId | int(10) unsigned |
ogId | int(10) unsigned |
metric | float |
nAligned | int(10) unsigned |
taxonomyId | int(10) |
MOGMember reports which MOG a protein-coding gene belongs to (if any) and which ortholog group was used to add this locus to that MOG. minBegin and maxEnd are the range of positions in the protein that are covered by the trees/ortholog-groups that are included in this MOG. metric is average metric for the ortholog group, while nAligned describes the number of aligned positions for this gene in that tree (this ultimately is from Locus2Tree.nAligned).
mog1 | int(10) unsigned |
mog2 | int(10) unsigned |
score | float |
nTaxGroupsBoth | int(10) unsigned |
nTaxGroups1 | int(10) unsigned |
nTaxGroups2 | int(10) unsigned |
MOGNeighborScores reports which MOGs are conserved near each other across more than one group of closely-related genomes. Only pairs with mog1 < mog2 are included in the table. nTaxGroupsBoth is the number of genome groups for which the genes are near each other. nTaxGroups1 is the number of genome groups that contain at least one representative of mog1. score = nTaxGroupsBoth / max(nTaxGroups1, nTaxGroups2). Genes are considered to be near each other if they are adjacent and spaced by less than 200 nucleotides, or if they are within a run of genes on the same strand for which each adjacent pair of which is spaced by less than 200 nucleotides. Groups of closely-related genomes are defined by clades from the MicrobesOnline species tree that have diverged by at most 0.02 from their common ancestor. (For example, this threshold groups strains of Escherichia coli and various Shigella species with each other but separately from Salmonella species.)
objectId | int(2) unsigned |
description | varchar(255) |
This table explains the values of the objectId fields in the Position table, e.g. if the
position is for a Locus or an Operon.
This table describes the members of each
ortholog group. A gene with a domain
duplication can be in a tree more than once, which is why begin and
end are necessary to guarantee uniqueness. nMemberThisGenomes helps
determine if this is a 1:1 ortholog. aaLength (the length of this
protein in amino acids) helps determine if the orthology relation
covers the full length of the gene.
tuId | int(10) unsigned |
posId | int(10) unsigned |
evidence | varchar(255) |
Each entry in the Operon links together one or more genes that are known or predicted to be transcribed together. The evidence field is either "VIMSS Predicted" for our operon predictions or "Confirmed" for experimentally characterized operons. Operon also links to the Position table, which stores the extent of the operon. Those positions are based on the extents of the genes, not on predicting transcription starts or stops.
Ortholog stores putative orthology relationships between pairs of genes. Any gene will map to at most 1 gene in another genome. These "orthologs" are from a FastBLAST-based variant of bidirectional best BLASTp hits and are often incorrect. Also, the table is growing very rapidly (it grows as the square of the number of genomes). So, these will be discontinued soon. Use the OrthologGroup or MOG tables instead.
treeId | int(10) unsigned |
ogId | int(10) unsigned |
parentOG | int(10) unsigned |
isDuplication | tinyint(1) |
nGenes | int(10) unsigned |
nGenomes | int(10) unsigned |
nNonUniqueGenomes | int(10) unsigned |
splitTaxId | int(10) |
The primary key of this table is the combination of treeId and ogId. ogIds themselves are not unique.
Our definition of ortholog group is a clade in a gene tree that is mostly one copy
per genome, or, if isDuplication is set, a clade of genes from a
closely-related group of genomes that have undergone duplication.
These are not evolutionary orthologs as originally defined by
Fitch (evolutionary orthologs are genes that diverged by a speciation event).
Nor is there any guaranatee that these orthologs have the same function.
However, be believe that these tree-based orthologs, if used carefully (see below), are more accurate than bidirectional-best hits, and they are faster to compute and smaller to store.
The
members of the ortholog groups are in the OGMember table. Every gene in a gene tree will
be in an ortholog group, even if the ortholog "group" contains only that gene.
Ortholog groups can be nested inside larger ortholog groups, as
described by parentOG (which may be NULL). For example, suppose that two closely related genomes 1 and 2 have a duplication of a gene A, into A1/A2 and a1/a2. There will be ortholog groups for A1/A2, for a1/a2, another with isDuplication=1 for A1/A2/a1/a2, and then (potentially) a larger ortholog group with isDuplication=0 that contains A1/A2/a1/a2 as well as orthologs of these genes from other genomes that did not undergo duplication. splitTaxId is a genome that contains paralogs that are just outside this ortholog group and is hence at least partially responsible for this ortholog group being split here.
nGenes and nGenomes are the total number of genes, and
nNonUniqueGenomes is the number of genomes with more than one gene in
this ortholog group.
Complications with using tree-based orthologs. These are handled by Gene::treeOrthologs().
- A small number of duplications by horizontal gene transfer ("xenoparalogs") are also
allowed within an ortholog group. If you want a stricter definition of orthology, use only the
groups with no non-unique genomes. And, you probably want to examine the smaller orthology groups first.
- Genes typically belong to many trees, so you need to choose which tree(s) to use.
- Genes with very different domain structures, or truncated genes, may nevertheless be in some of the same gene trees, and you might not want to include these as orthologs.
Also see the orthologs overview or the MOG or COG tables.
pdbId | varchar(6) |
header | text |
ascessionDate | date |
compound | text |
source | text |
authorList | text |
resolution | float |
experimentType | text |
dbSource | text |
Data on proteins of known structure from the protein data bank ("PDB").
pdbIdRep | varchar(6) |
pdbChainRep | char(1) |
pdbId | varchar(6) |
pdbChain | char(1) |
Data on chains within proteins from PDB. pdbChain is not unique -- only the combination with pdbId is unique.
pdbId | varchar(6) |
pdbChain | char(1) |
version | int(2) unsigned |
sequence | longblob |
The sequence of each chain of a protein in PDB.
For some diverse families, which are difficult to represent with a single HMM,
PFam contains multiple models with overlapping specificity, or "Clans."
Thus, PfamClan table stores the mapping between PFam clans and domainIds.
posId | int(10) unsigned |
strand | enum('+','-') |
begin | int(10) unsigned |
end | int(10) unsigned |
scaffoldId | int(10) unsigned |
objectId | int(2) unsigned |
Position stores the position of objects such as genes or
operons. begin and end are 1-based, and strand is either "+" or "-".
For objects on linear chromosomes,
begin is always less than end and strand, so for
a protein-coding gene on the "-" strand, the start codon will be at
the end position.
Objects on circular chromosomes (see Scaffold.isCircular) can wrap
around the origin. In this case, both begin and end will be within the
range [1,Scaffold.length], and end will be less than begin. In other
words, end < begin indicates that the object wraps around the origin.
clusterId | int(10) unsigned |
locusId | int(10) unsigned |
updated | timestamp |
link | varchar(50) |
cluster1 | int(10) unsigned |
cluster2 | int(10) unsigned |
link | varchar(50) |
score | decimal(10,3) unsigned |
updated | timestamp |
RegulonCluster and RegulonLinks store predicted "regulons", or genes that are likely to
have related functions and/or regulation. Each clusterId links several
locusId in the same genome together, and RegulonLinks gives a
higher-level clustering of those clusters. At this time, all
predictions are based on conserved proximity, or the "gene neighbor"
method, and the link field is "GNScore". In the future, gene
expression-based links may also be included. Also, at this time, each
cluster contains only genes that are in the same operon, and the links
link the operons together.
scaffoldId | int(10) unsigned |
chr_num | int(10) unsigned |
isCircular | int(1) |
length | int(10) unsigned |
file | varchar(32) |
isGenomic | int(1) |
gi | int(10) unsigned |
taxonomyId | int(10) |
comment | varchar(255) |
isActive | int(1) |
isPartial | int(1) |
created | date |
allowUpdates | tinyint(3) unsigned |
This table represents chromosomes, plasmids, or contigs. A genome may have multiple
scaffolds and each scaffold may have thousands of genes. To make sure
that you do not use out-of-date data, please require "Scaffold.isActive=1"
whenever you query this table. If you don't want data from plasmids,
require "isGenomic=1". Require "isPartial=0" to consider only
complete genomes. isCircular is 1 for circular chromosomes and 0 for linear chromosomes.
For scaffolds from RefSeq, file will be the NC_nnnnnn identifier and gi should be a genbank id.
isActive | int(10) unsigned |
description | varchar(255) |
This table describes the levels of the Scaffold.isActive flag.
This table indexes the positions of genes on scaffolds. This way you can quickly find genes at between, say, 110 and 120 kB on the E. coli chromosome. The posId field contains the gene's position (the same as Locus.posId), and the kbt value is the gene's position divided by 10,000 and rounded down. If a gene overlaps more than one 10 kB chunk of the genome then it will appear multiple times in this table.
This table stores the DNA sequence of each scaffold.
id | varchar(20) |
accession | varchar(20) |
length | int(10) |
description | text |
This table stores data about each UniProt protein.
SwissProt2Locus stores a mapping from a UniProt protein to a MicrobesOnline gene. This table only stores the best hit of each UniProt gene in MicrobesOnline. The Locus2SwissProt table stores a best-hit mapping in the other direction. See its description for more explanation of this mapping.
id | varchar(20) |
accession | varchar(20) |
PubMedId | int(20) unsigned |
isDetailed | tinyint(1) |
SwissProt2Pubmed stores a mapping between UniProt proteins and papers at PubMed. isDetailed is set if the paper links to 5 genes or less. Most of the entries with isDetailed=0 are from genome papers; most of the entries with isDetailed=1 contain experimental data about the function or regulation of a gene.
locusId | int(10) unsigned |
version | int(2) unsigned |
subject | varchar(20) |
identity | float unsigned |
alignLength | int(10) unsigned |
mismatch | int(10) unsigned |
gap | int(10) unsigned |
qBegin | int(10) unsigned |
qEnd | int(10) unsigned |
sBegin | int(10) unsigned |
sEnd | int(10) unsigned |
evalue | double |
score | float |
This paper stores the homology relations between SwissProt genes and MicrobesOnline genes. Despite the name, the hits are from BLAT, not BLAST. The subjects are of the form accession|id, e.g. "Q31A92|PSBA_PROM9". Due to size limitations, this table currently does not exist.
name | varchar(100) |
locusId | int(10) unsigned |
version | int(2) unsigned |
type | int(2) unsigned |
created | date |
source | varchar(50) |
This table contains names and identifiers for genes, such as gene names (type 0: "cheA"), locus tags (type 1: "b1888"), genbank ids (type 2: "16129840"), or genbank accessions (type 3: "NP_416402.1").
type | int(10) unsigned |
description | varchar(255) |
This table should document the types of identifiers or synonyms for genes.
id | varchar(16) |
tigrId | varchar(30) |
type | varchar(32) |
roleId | int(5) |
geneSymbol | varchar(10) |
ec | varchar(16) |
definition | longtext |
Metadata on each TIGRFam. TIGRFam tigrId are the same as domain Ids that appear in the Locus2Domain and DomainInfo tables, and the roleIds are described in the TIGRroles table.
roleId | int(5) |
level | enum('main','sub1') |
description | varchar(255) |
Data on TIGR "roles", a functional classification of many of the TIGRFAM gene families.
taxonomyId | int(20) |
name | text |
uniqueName | text |
class | varchar(255) |
From the NCBI taxonomy database: names for taxonomic groups. Note that the Taxonomy table only contains data for organisms with genomes (i.e., strains or species), while this table contains all taxa from the NCBI database. See TaxName for names for these taxa, and TaxParentChild for genomes of organisms within each taxonomic group.
taxonomyId | int(10) unsigned |
parentId | int(10) unsigned |
rank | varchar(50) |
embl | varchar(10) |
divId | int(3) unsigned |
flag1 | int(1) unsigned |
flag2 | int(1) unsigned |
flag3 | int(1) unsigned |
flag4 | int(1) unsigned |
flag5 | int(1) unsigned |
flag6 | int(1) unsigned |
flag7 | int(1) unsigned |
comments | varchar(255) |
From the NCBI taxonomy database: the parent/child relationship between taxonomic groups. Also see TaxName and TaxParentChild.
This table expands the parent-child relationships for all genomes that
are in our database and their higher-level taxa.
For example, this table lets you select all sequenced genomes within
a taxonomic group with a single query.
parentId and childId are NCBI taxonomyIds from the TaxNode table.
The childId may be in the Taxonomy table as well.
nDistances is the number of
links on the path from parent down to child.
taxonomyId | int(10) |
name | varchar(255) |
placement | int(10) |
shortName | varchar(100) |
taxDispGroupId | int(10) |
created | date |
PMID | varchar(50) |
Publication | varchar(255) |
Uniprot | varchar(10) |
This table includes only organisms (or viruses or plasmids) whose genome sequences are in our database.
The taxonomyId values are from the
NCBI taxonomy database.
These are usually species- or strain-level taxa.
Higher-level taxa are available from TaxParentChild, TaxNode, or
TaxName. Note that if a plasmid from an organism is sequenced, then that organism may appear in this table, even though its genome is not sequenced. Use Scaffold.isGenomic=1 to avoid plasmid scaffolds, and use TaxParentChild to get genomes within a certain taxa. In particular, to avoid viruses, use parentId=131567, which corresponds to all cellular organisms.
Use Scaffold to get to complete chromosomes or contigs, and from there you can get to genes
(the Locus table).
Links genomes to species trees. Plasmids, viruses, incomplete genomes, and mixed "genomes" do not appear in the species trees.
treeId | int(10) unsigned |
name | varchar(64) |
type | varchar(30) |
modified | timestamp |
newick | longblob |
Gene trees and species trees. The newick field contains the actual tree, in Newick format.
For gene trees, the leaf ids are of the form locusId_version_begin_end or locusId_begin_end.
For species trees, the leaf ids are taxonomy ids.
See the tree-browser documentation for more information on how these trees are constructed. The full species tree always has name="MOSpeciesTree". There are also other species trees for subgroups of genomes.
The type field is "species" for species trees or otherwise the type of family that was used to build the tree: 16S, Adhoc, COG, FastBLAST, Gene3D, PFAM, PIRSF, SMART, SSF, TIGRFAMs. For gene trees, the name is some sort of name for the family (e.g., the cogInfoId for COG trees).
Because private genomes are removed from pub.microbesonline.org, these trees may contain genes or organisms that are not in the public versions of the tables. Use the TreeUtils::simplify() routine or the Genomics/stat/simplifyTree.pl program to remove these leaves from the trees.
userSessionId | int(10) unsigned |
sessionId | varchar(24) |
ip | varchar(16) |
uaHash | varchar(24) |
userId | int(10) unsigned |
browser | varchar(32) |
version | varchar(16) |
platform | varchar(16) |
arch | varchar(16) |
unknownUa | varchar(255) |
Records cookies to keep track of user's sessions. This table is empty in pub.microbesonline.org.
userSessionId | int(10) unsigned |
requestUri | varchar(255) |
requestParams | longtext |
method | enum('GET','POST') |
accessTime | int(10) unsigned |
Records HTTP requests. This table is empty in pub.microbesonline.org.
userId | int(10) unsigned |
name | varchar(40) |
org | varchar(40) |
email | varchar(80) |
pwhash | varchar(32) |
annotationTrust | smallint(6) |
emailDataRelease | int(1) |
emailSiteMaintenance | int(1) |
emailSoftware | int(1) |
isSysAdmin | int(1) unsigned |
Records registered users. This table is empty in pub.microbesonline.org.
This records parent-child relationships in the Gene Ontology hierarchy. term1_id and term2_id link to term.id, and term1_id is the parent id. distance is the number of steps between them.
id | int(11) |
name | varchar(255) |
term_type | varchar(55) |
acc | varchar(255) |
is_obsolete | int(11) |
is_root | int(11) |
Gene Ontology identifiers. id (goId or term_id in tables that refer to this one) is an internal identifier; most people use the accession ("acc"), such as GO:0000018, regulation of DNA recombination. Also see Locus2Go and GoCount tables, and the other GO tables below.
id | int(11) |
relationship_type_id | int(11) |
term1_id | int(11) |
term2_id | int(11) |
complete | int(11) |
Relationships between Gene Ontology identifiers. term1_id and term2_id are ids in the term table. The relationship_type_id is in the term_definition table -- usually it is 2, which means a parent-child relation.
term_id | int(11) |
term_definition | text |
dbxref_id | int(11) |
term_comment | mediumtext |
reference | varchar(255) |
Longer descriptions of term ids in the term table.
term_id | int(11) |
term_synonym | text |
acc_synonym | varchar(255) |
synonym_type_id | int(11) |
synonym_category_id | int(11) |
Alternate names for Gene Ontology identifiers.
You can download our source code from here.
Note that the executables in this download (in Genomics/browser/bin/)
are all 64-bit Linux executables.
Please double-check any results you get from using our scripts, as the database schema may have changed since the script was written.
Genomics/util/getGenes.pl: Get names, descriptions, and other annotations for all genes in a genome or a given list of loci.
Genomics/util/matchGenes.pl: Given a tab-delimited file, map the gene names to locusIds ("VIMSS ids")
Genomics/util/scaffoldSubseq.pl: Get sequences for ranges within scaffolds or for ranges relative to a gene
Genomics/seq/getUpstream.pl: Get 500 nt upstream of all genes in a genome, truncating parts that overlap other genes
The key modules are in Genomics/Perl/modules. There are also additional modules, mainly to support the web site code, in Genomics/browser/lib. The key modules are:
- GenomicsUtils -- Use GenomicsUtils::connect(hostname) to connect to a MicrobesOnline sql database, before you create genome, gene, or scaffold objects.
- Genome -- Use Genome::new(taxonomyId => 511145) to get a genome object for E. coli, and $genome->genes() or $genome->scaffolds() to get gene or scaffold objects. Use $genome->nongenomicGenes() or $genome->nongenomicScaffolds() to get any plasmid genes or scaffolds.
- Gene -- Has many methods to obtain attributes of genes such as begin(), end(), strand(), synonym(), description(), etc.
Perl Examples
(Also see the source code of the above perl scripts.)
Fetch all genes in E. coli K12 (taxonomyId 511145) and tree-orthologs in another strain of E. coli and in D. vulgaris Hildenborough (taxonomyId 83334 and 882). (This takes ~1/2 a second per gene, so doing the whole genome will take a little while.)
use GenomicsUtils;
use Genome;
use Gene;
GenomicsUtils::connect();
my $genome = Genome::new("taxonomyId" => 511145);
my @genes = $genome->genes();
push @genes, $genome->nongenomicGenes();
foreach my $gene (@genes) {
if($gene->type()==1) {
my $l = $gene->treeOrthologs("taxa" => [83334,882]);
while (my ($orthId, $hash) = each %$l) {
print join("\t",$gene->locusId(), $orthId, $hash->{taxonomyId})."\n";
}
}
}
Fetch all 1:1 MOG-based orthologs between two loci of interest from E. coli K12 (VIMSS 14484 and 14485) in another strain of E.c oli and in D. vulgaris Hildenborough. (Note: MOGs need not be 1:1, but Gene::MOGForTaxa removes non-1:1 hitsorthologs.)
use GenomicsUtils;
use Gene;
GenomicsUtils::connect();
my @loci = (14484,14485);
my @taxa = (83334,882);
my $mogOrthologs = Gene::MOGForTaxa(\@loci, \@taxa);
while (my ($locus,$moghash) = each %$mogOrthologs) {
while (my ($taxId,$orthId) = each %$moghash) {
print join("\t", $locus, $orthId, $taxId)."\n";
}
}
You can install your own MicrobesOnline web server, if you want to
modify the cgi scripts and/or add your own data to the database. The
first step is to install your own copy of the database. We will
provide mysqldump files suitable for loading into your own MySQL
server at a later date; please contact us at
help@microbesonline.org
if you are interested.
Very rough guideliness on installing your own web server and on what
what other packages you need to install are in
Genomics/config/README. Please contact us for further advice.
You can use the MicrobesOnline browser to view a set of genomes or to show where items that you have identified (e.g., BLAST hits) are located relative to genes. You can use URLs with http://www.microbesonline.org/cgi-bin/browser? and various mode= and data= options:
- To show a locus and its orthologs in other organisms, use mode=2 and data=locusId,tax1,tax2,...,taxN, e.g.
- To show how your objects relate to the genes in MicrobesOnline, use mode=4 and a colon-delimited set of object specifiers with the prefixes s for scaffold, c for center, f for from, n for label, and d for description. The scaffold specifier (e.g. "s7023") must be the first entry in each track specifier, and the "c" or "f" entry to specify where the object goes must be the second entry. This way will put each object in its own track, but you can also have multiple objects within a single track -- just leave out the later scaffold specifiers. Example track specifiers:
- "s7023:c20000" (position 20000 on scaffold 7023, centered at 7023)
- "s7023:c20000:-" (centered at position 20000 on the - strand)
- "s7023:f395t478:-:nLabel:dDescription" (feature from 395 to 478 on - strand, with name "Label" and popup-text "Description")
- "s7023:f478t395:nLabel:dDescription" (feature from 478 to 395 on the - strand; this is on the minus strand as implied by to < from)
- Also see an example URL with many tracks or with multiple objects within a track