Orthologs in MicrobesOnline:
Tree-orthologs, MOGs, and COGs

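There are two popular uses of the term "ortholog": evolutionary orthologs, which are genes that diverged from a common ancestor through speciation, and functional orthologs, which are homologous genes that perform the same function in different organisms.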

For example, two genes that are separated by a horizontal gene transfer may well be functional orthologs, but are not evolutionary orthologs.

MicrobesOnline uses several methods to identify potential functional orthologs: tree-orthologs, MOGs (MicrobesOnline Ortholog Groups), and COGs (Clusters of Orthologous Groups).

Unless specified otherwise, the "orthologs" on our web site are tree-orthologs. Another popular way to identify orthologs, which MicrobesOnline used until early 2009 but no longer uses today, is bidirectional best BLAST hits (BBHs). The MicrobesOnline tree-browser can help you identify evolutionary orthologs, but MicrobesOnline does not include a fully automated tool for this.

Tree-orthologs

MicrobesOnline computes "tree-orthologs" for a gene by examining the pre-computed gene trees. In principle, orthology relationships can be many-to-1, and MicrobesOnline's internal code can handle these relationships, but on the web site, tree-orthologs are designed to be 1:1. That is, a gene from genome A has either 0 or 1 orthologs in genome B. Conversely, the tree-orthologs for all the genes in genome A would ideally list a specific gene from genome B either 0 or 1 times, but there will be rare exceptions because of inconsistencies between trees.

To identify tree-orthologs, MicrobesOnline examines each gene tree. Within a tree, clades that are mostly present in one copy per genome are defined as ortholog groups. Clades that reflect duplications within a small group of bacteria (as indicated by the MicrobesOnline species tree) are flagged as lineage-specific expansions (LSEs). These can lie within other ortholog groups. In any other case in which a genome contains more than one gene within a clade, the genes are assumed to be too diverged to place within an ortholog group. These cases may reflect errors in the tree or more ancient duplications within a lineage, but the apparent paralogs are usually due to horizontal gene transfer of a related gene into the organism ("xenoparalogs"). A low proportion of these duplicates (in up to 5% of species groups) is also allowed within these "ortholog groups"; otherwise, small errors in the topology or rare horizontal gene transfer events would lead to unreasonably small ortholog groups. Furthermore, these ortholog groups are hierarchical and can lie inside each other, so that despite a duplication in one lineage, the rest of the genes in the clade can still be considered orthologs. Given these hierarchical ortholog groups, two genes are orthologs if each is the only representative of its genome in a shared (non-LSE) ortholog group.
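The final rule in the paragraph above can be illustrated with a minimal Python sketch. It is not MicrobesOnline's actual code: the function name are_tree_orthologs, the dictionary layout of a group ("genes", "is_lse"), and the example gene and genome names are all hypothetical, and the sketch assumes the ortholog groups of one tree have already been identified.

```python
from collections import Counter

def are_tree_orthologs(gene_a, gene_b, ortholog_groups, genome_of):
    """Return True if some non-LSE group contains both genes and each gene
    is the only representative of its genome in that group.

    ortholog_groups: list of dicts with "genes" (a set) and "is_lse" (a bool);
    groups may be nested inside each other (hypothetical layout).
    genome_of: dict mapping each gene to its genome.
    """
    for group in ortholog_groups:
        if group["is_lse"]:
            continue  # lineage-specific expansions never define orthologs
        members = group["genes"]
        if gene_a not in members or gene_b not in members:
            continue
        # Count how many genes each genome contributes to this group.
        copies = Counter(genome_of[g] for g in members)
        if copies[genome_of[gene_a]] == 1 and copies[genome_of[gene_b]] == 1:
            return True
    return False

# Hypothetical example: genome C carries a lineage-specific duplication.
example_genome_of = {"a1": "A", "b1": "B", "c1": "C", "c2": "C"}
example_groups = [
    {"genes": {"a1", "b1", "c1", "c2"}, "is_lse": False},
    {"genes": {"c1", "c2"}, "is_lse": True},
]
print(are_tree_orthologs("a1", "b1", example_groups, example_genome_of))  # True
print(are_tree_orthologs("a1", "c1", example_groups, example_genome_of))  # False
```

In this example the duplication in genome C does not prevent a1 and b1 from being orthologs, which is the purpose of keeping the groups hierarchical.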

Because most genes are present in many different trees, MicrobesOnline computes tree-orthologs for a gene by combining the results over the best trees for that gene. This allows two genes to be orthologs even if one of them is missing from the best tree. However, a gene that is present in a better tree but is not a tree-ortholog according to that tree is never called an ortholog on the basis of a weaker tree.
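One way to read this combination rule is that, for each candidate gene, the best tree containing both the query gene and the candidate decides the question. The Python sketch below illustrates that reading only; the function name combine_over_trees and the per-tree layout ("genes", "orthologs") are assumptions made for illustration, not MicrobesOnline's data model.

```python
def combine_over_trees(query, candidates, trees):
    """Combine per-tree ortholog calls for one query gene.

    trees: list ordered from best to worst; each tree is a dict with
      "genes": set of genes present in the tree, and
      "orthologs": dict mapping a gene to the set of its orthologs
                   according to that tree (hypothetical layout).
    Returns the set of candidates called orthologs of the query.
    """
    decided = {}
    for tree in trees:
        if query not in tree["genes"]:
            continue
        called = tree["orthologs"].get(query, set())
        for cand in candidates:
            # The first (best) tree containing the candidate decides;
            # weaker trees cannot overturn that decision.
            if cand in decided or cand not in tree["genes"]:
                continue
            decided[cand] = cand in called
    return {cand for cand, ok in decided.items() if ok}

# Hypothetical example: x is in the best tree but is not an ortholog there;
# y is missing from the best tree and is an ortholog in the weaker tree.
example_trees = [
    {"genes": {"q", "x"}, "orthologs": {"q": set()}},
    {"genes": {"q", "x", "y"}, "orthologs": {"q": {"x", "y"}}},
]
print(combine_over_trees("q", ["x", "y"], example_trees))  # {'y'}
```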

MOGs: MicrobesOnline Ortholog Groups

Because MicrobesOnline may include many trees for a gene, computing the tree-orthologs for a gene can take as long as a second. Thus, tree-orthologs are not suitable for large-scale analyses, such as phylogenetic profile search. To support these analyses, MicrobesOnline stores MOGs, which are clusters of ortholog groups.

To compute the MOGs, we consider the best ortholog groups (as determined by number of aligned positions) first. Overlapping ortholog groups are merged if they have majority overlap.
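The merging step can be sketched in Python as a greedy pass over the ortholog groups, best first. This is only an illustration under stated assumptions: the text does not define "majority overlap" precisely, so the sketch assumes it means that the shared genes make up more than half of the smaller group, and the function name build_mogs and the (aligned_positions, genes) tuples are hypothetical.

```python
def build_mogs(ortholog_groups):
    """Greedily cluster ortholog groups into MOGs.

    ortholog_groups: list of (aligned_positions, set_of_genes) tuples.
    Assumption: two groups have "majority overlap" when their shared genes
    are more than half of the smaller of the two.
    """
    mogs = []
    # Consider the best groups (most aligned positions) first.
    ranked = sorted(ortholog_groups, key=lambda item: item[0], reverse=True)
    for _, genes in ranked:
        for mog in mogs:
            shared = len(genes & mog)
            if shared * 2 > min(len(genes), len(mog)):
                mog |= genes  # merge into the existing MOG
                break
        else:
            mogs.append(set(genes))  # no majority overlap: start a new MOG
    return mogs

# Hypothetical example with three ortholog groups.
example_ortholog_groups = [
    (200, {"b", "c", "d"}),
    (300, {"a", "b", "c"}),
    (100, {"x", "y"}),
]
print(build_mogs(example_ortholog_groups))
```

Here the first two groups share two of their three genes and end up in one MOG, while the third group forms a MOG of its own.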

A genome may contain multiple representatives of a MOG. Usually this occurs because of recent duplications in that lineage. These genes will not have any 1:1 MOG-based orthologs, but this only matters for comparisons between closely related species, as these duplicated genes cannot have 1:1 orthologs in more distant genomes in any case.
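The consequence for duplicated genes can be made concrete with a short Python sketch. The function name mog_ortholog and the example data are hypothetical, and the sketch assumes that a 1:1 MOG-based ortholog exists only when both the query's genome and the target genome have exactly one member in the MOG.

```python
def mog_ortholog(gene, target_genome, mog_members, genome_of):
    """Return the 1:1 MOG-based ortholog of `gene` in `target_genome`, or None.

    mog_members: set of genes in the MOG that contains `gene`.
    genome_of: dict mapping each gene to its genome.
    """
    own = [g for g in mog_members if genome_of[g] == genome_of[gene]]
    target = [g for g in mog_members if genome_of[g] == target_genome]
    # Both genomes must have exactly one representative of the MOG.
    if len(own) == 1 and len(target) == 1:
        return target[0]
    return None

# Hypothetical example: genome C has two representatives of the MOG.
example_mog = {"a1", "b1", "c1", "c2"}
example_mog_genomes = {"a1": "A", "b1": "B", "c1": "C", "c2": "C"}
print(mog_ortholog("a1", "B", example_mog, example_mog_genomes))  # b1
print(mog_ortholog("c1", "A", example_mog, example_mog_genomes))  # None
```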

COGs: Clusters of Orthologous Groups

COGs, or clusters of orthologous groups, were originally defined as "triangles" of genes that were best hits of each other among a small set of genomes (roughly 60). Although many COGs are present in one copy in most of the genomes in which they are found, some COGs are often present in many copies per genome. Thus, COGs are often broader than ortholog groups. Also, many genes are not found in COGs, even though they have a significant number of homologs. MicrobesOnline assigns genes to COGs by running PSI-BLAST against the profiles of each COG from NCBI's Conserved Domain Database.


For more information about MicrobesOnline orthologs, please contact us at gtlweb@vimss.lbl.gov.