FastHMM

FastHMM uses profile hidden Markov models (profile HMMs) to find instances of known families and domains within a protein sequence database. HMMs are very sensitive but also rather slow, so to speed up the identification, FastHMM first uses BLAST with gaps, position-specific weights, and sensitive settings to quickly find candidate members of each family. Given these candidates, FastHMM then uses HMMer to select the true family members.

FastHMM works with all seven databases of HMMs that are part of InterPro. It runs about 40 times faster than InterProScan on genome-scale inputs and gives very similar results.

The seven databases of HMMs are Gene3D, PANTHER, Pfam, PIRSF, SMART, SUPERFAMILY (Superfam), and TIGRFAM.

Performance and Sensitivity

Because FastHMM uses the hmmsearch program of HMMer to validate the candidates found by BLAST's blastpgp program, it reports no false positives relative to hmmsearch itself. To increase sensitivity on the non-redundant databases (Pfam, PIRSF, SMART, and TIGRFAM), FastHMM runs hmmsearch directly on weak or "hard" models. For these weak families, sequences in the seed alignment are often missed by blastpgp (or even by hmmsearch).

With default settings, the sensitivity (the proportion of all hits meeting the e-value or score cutoffs that FastHMM recovers) is over 99% for six of the seven databases. The sensitivity for Superfam hits is lower (94%), but Superfam includes redundant models, which reduces the importance of the misses for annotation. Most of these missed hits are very weak: for Superfam, their median score is 12 bits, and their median e-value is 0.002.
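The sensitivity figure above can be reproduced in spirit by comparing a default run against an exhaustive run. The sketch below is only an illustration: it assumes each hit has already been reduced to a (gene, model) pair, and the function name `sensitivity` is hypothetical and not part of FastHMM.

```python
def sensitivity(fast_hits, exhaustive_hits):
    """Fraction of the exhaustive hits that the fast run also reported.

    Both arguments are iterables of (gene, model) pairs. This keying is an
    assumption made for illustration, not FastHMM's actual output format.
    """
    exhaustive = set(exhaustive_hits)
    if not exhaustive:
        raise ValueError("no exhaustive hits to compare against")
    recovered = exhaustive & set(fast_hits)
    return len(recovered) / len(exhaustive)


# Toy example: the fast run recovers 3 of the 4 exhaustive hits.
exhaustive = [("g1", "PF1"), ("g2", "PF1"), ("g3", "PF2"), ("g4", "PF2")]
fast = [("g1", "PF1"), ("g2", "PF1"), ("g3", "PF2")]
print(sensitivity(fast, exhaustive))  # 0.75
```

Hits that appear only in the fast run are ignored here, since every FastHMM hit has already been validated by hmmsearch.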

The speedup of FastHMM increases with the size of the input -- FastHMM issues several commands for each of the tens of thousands of HMMs in the databases, so it has considerable "fixed costs." Thus, we do not recommend the use of FastHMM on small numbers of sequences. For ~20,000 bacterial proteins, FastHMM is about 20 times faster than exhaustively running hmmsearch, and hmmsearch is itself about twice as fast as hmmpfam (which is the basis for InterPro). For the small genome of Archaeoglobus fulgidus (2,420 proteins and 669,597 residues), FastHMM is only about 4 times faster than hmmsearch. FastHMM is designed to use minimal memory even on large datasets, so you need not chop the input database into sections. Indeed, doing so will degrade performance.

System Requirements

FastHMM is written in Perl, and it requires the Devel::Size library if run in debugging mode. FastHMM also requires at least 15 gigabytes of free disk space if you download and install all seven family databases. Finally, FastHMM requires the following sequence analysis tools: HMMer, NCBI BLAST, MUSCLE, and CD-HIT.

FastHMM has been tested on x86 Linux machines (both 32-bit and 64-bit), with Perl 5.8.5, HMMer 2.3.2, BLAST 2.2.14, MUSCLE 3.52, and CD-HIT 2006. We do not know of any reason why it would not work on other platforms or with other versions of these packages, but we have not tested it. If you run it on a non-Unix machine, you also need to install Unix-like tools (e.g., Cygwin).

Downloading and Installing FastHMM

You can download the FastHMM code and processed versions of the HMM databases from here. See the installation guide for more advice on how to install it.

Running FastHMM on a Single Computer

Once FastHMM is installed, you can run it as follows:

export FASTHMM_DIR=~/fasthmm
$FASTHMM_DIR/bin/fastHmm.pl -i input_database -t all -j NCPus -o output_directory

The input database must be both a FASTA-format file and a valid BLASTp database. You can use the -f argument if you want fastHmm.pl to make the BLASTp database for you.

The -j and -o arguments are optional. For each of the databases of HMMs, fastHmm.pl produces two files: raw output in output_directory/results.input.hmmdb.hmmhits and processed output in output_directory/results.input.hmmdb.domains. The hmmhits files contain the raw hits to the models that met the thresholds, and the domains files have a non-redundant subset of those results, which corresponds roughly to predicted domains. These files are described in more detail in the output section.

fastHmm.pl writes many temporary files to /tmp and also to the output directory (-o, which defaults to the current working directory). Do not run multiple instances of fastHmm.pl into the same output directory at the same time.

To run fastHmm.pl on a single database of HMMs, specify gene3d, panther, pfam, pirsf, smart, superfam, or tigrfam with the -t argument.
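If you drive fastHmm.pl from a script, a small helper can assemble the command line and reject misspelled database names before anything is launched. The following is a minimal sketch, not part of FastHMM: the function `build_fasthmm_command` and the default install path are assumptions, and only the -i, -t, -j, and -o arguments described above are used.

```python
import os

# Database names accepted by -t, as listed above ("all" runs every database).
HMM_DATABASES = ("gene3d", "panther", "pfam", "pirsf", "smart", "superfam", "tigrfam")


def build_fasthmm_command(input_db, hmm_db="all", ncpus=None, outdir=None,
                          fasthmm_dir="~/fasthmm"):
    """Return the fastHmm.pl argument list for one run (hypothetical helper)."""
    if hmm_db != "all" and hmm_db not in HMM_DATABASES:
        raise ValueError(f"unknown HMM database: {hmm_db!r}")
    script = os.path.join(os.path.expanduser(fasthmm_dir), "bin", "fastHmm.pl")
    command = [script, "-i", input_db, "-t", hmm_db]
    if ncpus is not None:        # -j is optional
        command += ["-j", str(ncpus)]
    if outdir is not None:       # -o is optional; defaults to the working directory
        command += ["-o", outdir]
    return command


# The resulting list can be passed to subprocess.run(command, check=True).
```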

To run hmmsearch exhaustively for every model in the database, use the -ha option. This is useful for testing, but is very slow.

fastHmm.pl has many more options -- for information about them, run fastHmm.pl without any arguments.

Running FastHMM on a Cluster

To run FastHMM on a cluster, use the -b option to prepare sets of commands. These commands can run independently of each other, so you can submit them to a cluster using whatever scheduler you choose. Then run fastHmm.pl again with -m to "merge" the results. For example, to issue jobs as batches of 100 models, and to use 2 CPUs in each issued job:

export FASTHMM_DIR=~/fasthmm
$FASTHMM_DIR/bin/fastHmm.pl -t all -j 2 -i `pwd`/Aful.faa -b 100 -f -o `pwd` > fasthmm.cmds
# submit each line in fasthmm.cmds as a cluster job, and wait for them to finish
$FASTHMM_DIR/bin/fastHmm.pl -t all -j 2 -i `pwd`/Aful.faa -b 100 -f -o `pwd` -m

$FASTHMM_DIR/bin/fastHmm.pl -t all -j 2 -i `pwd`/Aful.faa -b 100 -f -o `pwd` -p
# submit each line in result.Aful.panther.hmmhits.cmds.1 as a cluster job, and wait for them to finish
# run the commands in result.Aful.panther.hmmhits.cmds.2 (will not take long)

The second set of cluster jobs, beginning with fastHmm.pl -p, is only required for making the domains calls for PANTHER, and can be omitted if you're running another database or do not need the domains calls for PANTHER.
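If no scheduler is at hand, the "submit each line" step can be approximated on one machine by running the lines of the command file concurrently. The sketch below is only a stand-in for a real cluster submission: `run_command_file` is a hypothetical helper, and it assumes each non-empty line of the file is a complete shell command.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command_file(path, workers=4):
    """Run every non-empty line of `path` as a shell command.

    Returns the exit codes in the same order as the lines in the file.
    """
    with open(path) as handle:
        commands = [line.strip() for line in handle if line.strip()]

    def run(command):
        # shell=True because each line is a full shell command line.
        return subprocess.run(command, shell=True).returncode

    # Threads are enough here: each worker just waits on a child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))


# Usage (assumed file name from the example above):
#   codes = run_command_file("fasthmm.cmds", workers=8)
#   failed = [i for i, code in enumerate(codes) if code != 0]
```

Check that every exit code is zero before running the merge step, since the merge needs the results of all jobs.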

FastHMM Output Formats

For each HMM database, FastHMM produces a results.input.hmmdb.hmmhits file and a results.input.hmmdb.domains file. The hmmhits files include all hits to the HMM that meet the cutoffs recommended by the curators of the HMM (e.g., for Pfam, the trusted cutoff). The domains files include a non-redundant subset of those hits, which correspond to predicted domains. Both files are tab-delimited with the following fields:

The aligned ranges in the domains file may not be interpretable without additional information, because a single domain can correspond to multiple models.

Cutoffs, Rules for Calling Domains, and Differences from InterProScan Results

The cutoffs and the rules for converting the raw hits into the non-redundant domain hits are different for each family. FastHMM includes its own code for these conversions, which can lead to differences from InterProScan output, even when FastHMM is run in exhaustive mode with the -ha option. Another (rare) source of differences in the output is that FastHMM uses hmmsearch, while InterPro uses hmmpfam. FastHMM adjusts the e-values by the size of the HMM database, so that the e-values it returns are on the same scale as those returned by hmmpfam. However, because hmmsearch and hmmpfam compute slightly different e-values, occasionally a hit will be above the cutoff with one tool and below the cutoff with the other.

Gene3D: Gene3D has multiple HMMs per superfamily. InterPro eliminates Gene3D hits that contain 10 residues or fewer, as well as hits with e-values above 0.001, and then chooses the best of overlapping hits. FastHMM uses the same e-value cutoff, but uses an alignment overlap criterion to choose which hits to keep (see Superfam).

PANTHER: For PANTHER, InterPro uses BLAST against consensus sequences as a pre-filter. The consensus sequences include one for each subfamily. Sequences that match a family or subfamily consensus are then run against the family HMM and against any matching subfamilies as well. Finally, an e-value cutoff of 10^-3 is applied, and only the best hit is kept. In contrast, FastHMM uses a PSI-BLAST prefilter. Both the FastHMM and the consensus-BLAST prefilters miss about 5% of hits, so the FastHMM results are different (but not worse). Once a gene is assigned to a family, FastHMM includes a post-processor that tests which subfamily HMM (if any) it matches, and reports the one it has the best hit to. The domains output file includes the best-hitting family and the best-hitting subfamily (if any), while the hmmhits output file includes all of the family hits.

Pfam: For each Pfam, InterPro includes two HMMs, a ".ls" model that is used to find complete domains and a ".fs" model that can find domain fragments. If a region hits both HMMs, then only the ".ls" hit is kept. In the raw (*.hmmhits) FastHMM output, the ".fs" hits have a domain name ending with ".fs" (e.g., PF05470.fs), and the ".ls" hits have no suffix (e.g., PF06470). After running the HMMs, FastHMM removes hits that are redundant, either because both families are related (they belong to the same "clan"), or because one hit is nested inside a higher-scoring hit. The "alignment mode" annotation for each Pfam determines whether global (ls) matches or fragment matches should be preferred, or whether the best-scoring hit should be used. InterProScan uses very similar rules. Also, both InterPro and FastHMM use the trusted cutoff defined by the Pfam curators when running HMMer.
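The simplest of the rules above, keeping only the ".ls" hit when a region hits both models of a family, can be sketched as follows. This is an illustration under stated assumptions, not FastHMM's code: hits are taken to be (name, start, end) tuples, "the same region" is read as any coordinate overlap, and the clan, nesting, and alignment-mode rules are left out. The function name `drop_redundant_fs` is hypothetical.

```python
FS_SUFFIX = ".fs"


def drop_redundant_fs(hits):
    """Drop a ".fs" hit when an ".ls" hit of the same family overlaps it.

    `hits` is a list of (name, start, end) tuples with inclusive coordinates.
    ".ls" hits carry no suffix, as in the raw hmmhits output described above.
    """
    ls_hits = [(n, s, e) for n, s, e in hits if not n.endswith(FS_SUFFIX)]
    kept = []
    for name, start, end in hits:
        if name.endswith(FS_SUFFIX):
            family = name[: -len(FS_SUFFIX)]
            # Overlap test (an assumption): the two ranges share a position.
            if any(n == family and s <= end and start <= e for n, s, e in ls_hits):
                continue
        kept.append((name, start, end))
    return kept


hits = [("PF00001", 10, 100), ("PF00001.fs", 20, 90), ("PF00002.fs", 150, 200)]
print(drop_redundant_fs(hits))
# [('PF00001', 10, 100), ('PF00002.fs', 150, 200)]
```

The fragment hit for PF00002 survives because no complete-domain hit of that family covers the same region.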

PIRSF: Both FastHMM and InterPro eliminate hits whose scores are below the thresholds in the pirsf.dat file, or whose length is different from expectation. The two implementations of these rules give almost identical results. InterPro then uses BLAST against the entire database to decide which subfamily to assign a sequence to. (PIRSF includes both family and subfamily HMMs.) In contrast, FastHMM uses the best bit score from HMMer, not BLAST, to choose the best hit, and FastHMM keeps the best hit for any region of the gene, not just one hit for the entire sequence.

SMART: InterPro reports artificially inflated (less-significant) e-values for hits, by about a factor of 1,000. We verified with reversed sequences that the e-values reported by FastHMM, which are obtained from hmmsearch and are corrected for searching against many families (using the -Z option), are reasonable: for the genome of Archaeoglobus fulgidus, we found 641 hits with FastHMM and 7 hits for the reversed sequences, for a false discovery rate of only about 1%. Also, InterPro has additional post-processing filters for SMART, including proprietary scoring thresholds. These are not included in the results of InterProScan if you run it locally, and are not included in FastHMM. Instead, FastHMM uses an e-value threshold of 2.04e-5, which corresponds to the actual threshold used by InterProScan. There is no post-processing for SMART (the *.domains and *.hmmhits files are identical).
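The false discovery rate quoted above is the ratio of hits on the reversed sequences to hits on the real sequences. A minimal sketch of that arithmetic (the function name `decoy_fdr` is hypothetical):

```python
def decoy_fdr(real_hits: int, reversed_hits: int) -> float:
    """Estimate the false discovery rate from a reversed-sequence control."""
    if real_hits <= 0:
        raise ValueError("need at least one hit on the real sequences")
    return reversed_hits / real_hits


# Counts reported above for Archaeoglobus fulgidus with the SMART models.
rate = decoy_fdr(real_hits=641, reversed_hits=7)
print(f"{rate:.1%}")  # 1.1%
```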

Superfam: Because SUPERFAMILY contains multiple HMMs for each superfamily, InterPro uses an assignment script to report a single superfamily for each domain. Unfortunately the assignment script is very slow, as it parses the alignment files from HMMer many times instead of just once. We wrote a faster script (SsfAssignFast.pl) that uses the same approach as the SUPERFAMILY script and gives identical results over 99.9% of the time, but parses the alignment files only once. For a given gene, SsfAssignFast.pl starts with the most significant hits, and ignores hits that share >35% of their aligned positions with a previously accepted hit. It also ignores all further hits for a protein once all but 15 amino acids are covered by hits.
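The greedy selection described above can be sketched as follows. This is not the code of SsfAssignFast.pl. It assumes each hit is summarized by a contiguous start-end range (the real script works from the aligned positions in the HMMer alignments), and the names `Hit` and `assign_domains` are hypothetical. The 35% and 15-residue limits are the ones stated above.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    model: str
    start: int      # first aligned residue, 1-based inclusive
    end: int        # last aligned residue, inclusive
    evalue: float


def assign_domains(hits: List[Hit], protein_length: int,
                   max_shared: float = 0.35, max_uncovered: int = 15) -> List[Hit]:
    """Greedily accept hits for one protein, most significant first."""
    accepted: List[Hit] = []
    covered = set()
    for hit in sorted(hits, key=lambda h: h.evalue):
        # Stop once all but `max_uncovered` residues are covered by hits.
        if protein_length - len(covered) <= max_uncovered:
            break
        positions = set(range(hit.start, hit.end + 1))
        # Skip the hit if more than 35% of its own positions are shared
        # with any single previously accepted hit.
        shares_too_much = any(
            len(positions & set(range(a.start, a.end + 1))) / len(positions) > max_shared
            for a in accepted
        )
        if shares_too_much:
            continue
        accepted.append(hit)
        covered |= positions
    return accepted


hits = [Hit("A", 1, 100, 1e-30), Hit("B", 10, 90, 1e-20), Hit("C", 95, 200, 1e-10)]
print([h.model for h in assign_domains(hits, protein_length=300)])  # ['A', 'C']
```

In the toy example, B lies entirely inside A and is skipped, while C shares only 6 of its 106 positions with A and is kept.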

TIGRFAM: For TIGRFAM, there is no post-processing -- the *.domains and *.hmmhits files are identical. Both InterPro and FastHMM use the gathering cutoff defined by the TIGRFAM curators when running HMMer.

Contact Us

FastHMM was developed by Morgan N. Price and Y. Wayne Huang. For more information, please contact us at fasthmm@microbesonline.org.