FastHMM Cutoffs and Rules

The cutoffs and the rules for converting raw hits into non-redundant domain hits differ for each domain database. FastHMM includes its own code for these conversions, which can lead to differences from InterProScan output, even when FastHMM is run in exhaustive mode with the -ha option. Another (rare) source of differences in the output is that FastHMM uses hmmsearch, while InterPro uses hmmpfam. FastHMM adjusts the e-values by the size of the HMM database, so that the e-values it returns are on the same scale as those returned by hmmpfam. However, because hmmsearch and hmmpfam compute slightly different e-values, occasionally a hit will be above the cutoff with one tool and below the cutoff with the other.
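
The rescaling described above can be illustrated with a minimal sketch. It assumes that an e-value is proportional to the search-space size Z (e-value = p-value * Z); the function names and the numbers are hypothetical and are not taken from the FastHMM code.

```python
def rescale_evalue(evalue, z_from, z_to):
    """Rescale an e-value from search-space size z_from to size z_to.

    Assumption of this sketch: the e-value is proportional to the
    search-space size Z, so changing Z multiplies it by z_to / z_from.
    """
    if z_from <= 0 or z_to <= 0:
        raise ValueError("search-space sizes must be positive")
    return evalue * z_to / z_from


def passes_cutoff(evalue, cutoff):
    """Return True if a hit's e-value is at or below the cutoff."""
    return evalue <= cutoff


# Hypothetical numbers: an e-value computed with Z = 2000 is put on the
# scale of a database of 10000 HMMs.
rescaled = rescale_evalue(1e-6, z_from=2000, z_to=10000)

# Two tools reporting slightly different e-values for the same hit can
# land on opposite sides of a cutoff of 0.001.
tool_a_keeps = passes_cutoff(0.9e-3, 1e-3)   # True
tool_b_keeps = passes_cutoff(1.1e-3, 1e-3)   # False
```

The last two lines show why a small disagreement between hmmsearch and hmmpfam matters only for hits that sit very close to the cutoff.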

Gene3D: Gene3D has multiple HMMs per superfamily. InterPro eliminates Gene3D hits that contain 10 residues or fewer, as well as hits with e-values above 0.001, and then chooses the best of overlapping hits. FastHMM uses the same e-value cutoff, but uses an alignment overlap criterion to choose which hits to keep (see Superfam).

PANTHER: For PANTHER, InterPro uses BLAST against consensus sequences as a pre-filter. The consensus sequences include one for each subfamily. Sequences that match a family or subfamily consensus are then run against the family HMM and against any matching subfamilies as well. Finally, an e-value cutoff of 10^-3 is applied, and only the best hit is kept. In contrast, FastHMM uses a PSI-BLAST prefilter. Both the FastHMM and the consensus-BLAST prefilters miss about 5% of hits, so the FastHMM results are different (but not worse). Once a gene is assigned to a family, FastHMM includes a post-processor that tests which subfamily HMM (if any) it matches, and reports the one it has the best hit to. The domains output file includes the best-hitting family and the best-hitting subfamily (if any), while the hmmhits output file includes all of the family hits.
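
The family-then-subfamily selection can be sketched as follows. This is only an illustration: the record layout (dictionaries with 'family', 'subfamily', and 'evalue' keys), the function name, and the choice of the lowest e-value as the measure of "best hit" are assumptions, not details of the FastHMM post-processor.

```python
def best_family_and_subfamily(hits):
    """Pick the best family hit and the best subfamily hit within it.

    hits: list of dicts with keys
      'family'    - family identifier
      'subfamily' - subfamily identifier, or None for a family-level hit
      'evalue'    - e-value of the hit
    Assumption of this sketch: "best" means lowest e-value.
    Returns (family_hit, subfamily_hit); either may be None.
    """
    family_hits = [h for h in hits if h["subfamily"] is None]
    if not family_hits:
        return None, None
    best_family = min(family_hits, key=lambda h: h["evalue"])

    # Only subfamilies of the chosen family are considered.
    subfamily_hits = [
        h for h in hits
        if h["subfamily"] is not None and h["family"] == best_family["family"]
    ]
    best_subfamily = (
        min(subfamily_hits, key=lambda h: h["evalue"]) if subfamily_hits else None
    )
    return best_family, best_subfamily


# Made-up identifiers and e-values for illustration only.
example_hits = [
    {"family": "famX", "subfamily": None, "evalue": 1e-40},
    {"family": "famX", "subfamily": "sub1", "evalue": 1e-35},
    {"family": "famX", "subfamily": "sub2", "evalue": 1e-50},
    {"family": "famY", "subfamily": None, "evalue": 1e-10},
]
family, subfamily = best_family_and_subfamily(example_hits)
```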

Pfam: For each Pfam, InterPro includes two HMMs, a ".ls" model that is used to find complete domains and a ".fs" model that can find domain fragments. If a region hits both HMMs, then only the ".ls" hit is kept. In the raw (*.hmmhits) FastHMM output, the ".fs" hits have a domain name ending with ".fs" (e.g., PF05470.fs), and the ".ls" hits have no suffix (e.g., PF06470). After running the HMMs, FastHMM removes hits that are redundant, either because both families are related (they belong to the same "clan"), or because one hit is nested inside a higher-scoring hit. The "alignment mode" annotation for each Pfam determines whether global (ls) matches or fragment matches should be preferred, or whether the best-scoring hit should be used. InterProScan uses very similar rules. Also, both InterPro and FastHMM use the gathering cutoff defined by the Pfam curators when running HMMer.
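
One of the redundancy rules above, dropping a hit that is nested inside a higher-scoring hit, can be sketched as below. This is a simplified illustration: the Hit record, the function names, and the example values are hypothetical, and the clan and alignment-mode rules are deliberately left out.

```python
from collections import namedtuple

# Hypothetical hit record: model name, start and end on the protein
# (inclusive), and bit score.
Hit = namedtuple("Hit", "model start end score")


def is_nested(inner, outer):
    """Return True if inner lies entirely within outer's coordinates."""
    return outer.start <= inner.start and inner.end <= outer.end


def drop_nested_hits(hits):
    """Keep hits that are not nested inside a higher-scoring kept hit."""
    kept = []
    # Visit hits from highest to lowest score, so every hit already in
    # 'kept' scores at least as high as the one being examined.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(is_nested(hit, k) for k in kept):
            kept.append(hit)
    return kept


# Made-up hits: famB.fs lies inside the higher-scoring famA hit.
example = [
    Hit("famA", 10, 200, 150.0),
    Hit("famB.fs", 50, 120, 40.0),
    Hit("famC", 250, 400, 80.0),
]
kept_models = [h.model for h in drop_nested_hits(example)]
```

Here kept_models contains famA and famC; famB.fs is removed because it is fully contained in the stronger famA hit.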

PIRSF: Both FastHMM and InterPro eliminate hits whose scores are below the thresholds in the pirsf.dat file, or whose length is different from expectation. The two implementations of these rules give almost identical results. InterPro then uses BLAST against the entire database to decide which subfamily to assign a sequence to. (PIRSF includes both family and subfamily HMMs.) In contrast, FastHMM uses the best bit score from HMMer, not BLAST, to choose the best hit, and FastHMM keeps the best hit for any region of the gene, not just one hit for the entire sequence.

SMART: InterPro reports artificially inflated (less-significant) E-values for hits, by about a factor of 1,000. We verified with reversed sequences that the E-values reported by FastHMM, which are obtained from hmmsearch and are corrected for searching against many families (using the -Z option), are reasonable: for the genome of Archaeoglobus fulgidus, at a (generous) e-value cutoff of 0.02, we found 641 hits for the true sequences but only 7 hits for the reversed sequences, for a false discovery rate of only 1%. Also, InterPro has additional post-processing filters for SMART, including proprietary scoring thresholds. These are not included in the results of InterProScan if you run it yourself, and are not included in FastHMM. Instead, FastHMM uses an e-value threshold of 2.04e-5, which corresponds to the actual threshold used by InterProScan. FastHMM does not do any post-processing for SMART (the *.domains and *.hmmhits files are identical).
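
The false discovery rate quoted above follows from dividing the number of reversed-sequence hits by the number of true-sequence hits. A small sketch of that arithmetic (the function name is hypothetical, and the simple ratio is assumed to be the estimate intended in the text):

```python
def estimated_fdr(true_hits, reversed_hits):
    """Estimate the false discovery rate from a reversed-sequence control.

    Hits to reversed sequences are taken as a count of false positives,
    so the estimate is reversed_hits / true_hits.
    """
    if true_hits <= 0:
        raise ValueError("true_hits must be positive")
    return reversed_hits / true_hits


# Counts from the Archaeoglobus fulgidus test at an e-value cutoff of 0.02.
fdr = estimated_fdr(true_hits=641, reversed_hits=7)
print(f"{fdr:.1%}")  # prints 1.1%
```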

Superfam: Because SUPERFAMILY contains multiple HMMs for each superfamily, InterPro uses an assignment script to report a single superfamily for each domain. Unfortunately the assignment script is very slow, as it parses the alignment files from HMMer many times instead of just once. We wrote a faster script (SsfAssignFast.pl) that uses the same approach as the SUPERFAMILY script and gives identical results over 99.9% of the time, but parses the alignment files only once. For a given gene, SsfAssignFast.pl starts with the most significant hits, and ignores hits that share >35% of their aligned positions with a previously accepted hit. It also ignores all further hits for a protein once all but 15 amino acids are covered by hits.
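
The greedy selection described above can be sketched in a few lines. This is not the SsfAssignFast.pl code: the function name, the input layout, the use of the e-value to rank significance, and the reading of "share >35% of their aligned positions" as a fraction of the candidate hit's own aligned positions are all assumptions made for illustration.

```python
def assign_domains(hits, protein_length, max_shared=0.35, max_uncovered=15):
    """Greedily accept hits for one protein, most significant first.

    hits: list of (name, evalue, positions), where positions is the set
          of residue indices aligned by the hit.
    A hit is ignored if more than max_shared of its aligned positions
    are shared with any previously accepted hit. Processing stops once
    at most max_uncovered residues remain uncovered.
    """
    accepted = []
    covered = set()
    for name, evalue, positions in sorted(hits, key=lambda h: h[1]):
        if protein_length - len(covered) <= max_uncovered:
            break  # nearly the whole protein is already covered
        if not positions:
            continue
        shares_too_much = any(
            len(positions & accepted_positions) / len(positions) > max_shared
            for _, _, accepted_positions in accepted
        )
        if shares_too_much:
            continue
        accepted.append((name, evalue, positions))
        covered |= positions
    return accepted


# Made-up hits on a hypothetical 200-residue protein.
example_hits = [
    ("sfA", 1e-30, set(range(1, 101))),    # residues 1-100
    ("sfB", 1e-20, set(range(50, 121))),   # overlaps sfA heavily
    ("sfC", 1e-10, set(range(101, 191))),  # residues 101-190
    ("sfD", 1e-5, set(range(191, 201))),   # residues 191-200
]
accepted_names = [name for name, _, _ in assign_domains(example_hits, 200)]
```

In this example sfA and sfC are accepted. sfB is ignored because most of its aligned positions fall inside sfA, and sfD is never examined because only 10 residues remain uncovered after sfC.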

TIGRFAM: For TIGRFAM, there is no post-processing -- the *.domains and *.hmmhits files are identical. Both InterPro and FastHMM use the trusted cutoff defined by the TIGRFAM curators when running HMMer.