Hidden Markov Models
These instructions distil my experience of creating a good, valid Hidden Markov Model (HMM) of a set of amino acid sequences with something in common (homologues), using the HMMER3 software.
Crap in equals crap out
It is very important to have knowledge about the proteins in question - you need a data set that is all true, i.e. one that only contains sequences which share the common feature(s), and that does not contain fragments/partial sequences. Also, even though HMMER has some sequence weighting capabilities, I always reduce the sequence identity to a maximum of 80% and a minimum of 40% in pairwise comparisons.
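As a sketch of that identity filter: the function below keeps a sequence only if it is neither redundant (above 80% identity to a kept sequence) nor an outlier (below 40% to all kept sequences). The function names and the exact filtering logic are my own illustration, not part of HMMER:

```python
# Sketch: filter aligned sequences by pairwise identity.
# The 40%/80% cut-offs follow the text; the logic is illustrative only.

def pairwise_identity(a, b):
    """Percent identity between two aligned sequences of equal length."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != '-')
    compared = sum(1 for x, y in zip(a, b) if x != '-' and y != '-')
    return 100.0 * matches / compared if compared else 0.0

def filter_by_identity(aligned_seqs, lo=40.0, hi=80.0):
    """Keep sequences that are neither redundant nor outliers."""
    kept = []
    for seq in aligned_seqs:
        ids = [pairwise_identity(seq, k) for k in kept]
        if any(i > hi for i in ids):            # too similar: redundant
            continue
        if kept and all(i < lo for i in ids):   # too dissimilar: outlier
            continue
        kept.append(seq)
    return kept
```

Note that the result depends on input order (the first sequence is always kept); real redundancy reduction tools use more careful clustering.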
Input - multiple alignment
Create a multiple alignment. I recommend mafft (especially mafft-linsi), but otherwise choose your favorite alignment software:
>mafft-linsi --ep 0.123 --quiet unaligned_sequences.fasta > aligned_sequences.fasta
View the resulting alignment in an editor; I tend to use JalView.
*Trim the N- and C-termini of the alignment if necessary, i.e. if the sequences are very diverse at the beginning and/or the end, don’t bother training an HMM to capture these parts.
*Are there any fragments or odd-looking sequences? -> Remove them.
*Sometimes it is necessary to manually correct the alignment; don’t overdo it, but use your knowledge about the group in terms of functions, active sites, 3D structure, etc.
*Make the set non-redundant (menu Edit > Remove Redundancy) and remove outliers (I have not seen this function in JalView unfortunately, but if you check which sequences would be left after a 40% redundancy reduction you get an idea). Honestly, I’ve written a program to remove fragments, make the set non-redundant and remove outliers.
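The trimming and fragment checks above can be sketched in code. A minimal, illustrative version (the 50% gaps-per-column and 70% coverage cut-offs are my own choices, not from JalView):

```python
# Sketch: trim ragged alignment ends and flag fragments.
# Cut-offs (50% gaps per column, 70% coverage) are illustrative only.

def trim_ends(aligned_seqs, max_gap_frac=0.5):
    """Drop leading/trailing columns where more than max_gap_frac are gaps."""
    ncols = len(aligned_seqs[0])

    def gappy(col):
        gaps = sum(1 for s in aligned_seqs if s[col] == '-')
        return gaps / len(aligned_seqs) > max_gap_frac

    start, end = 0, ncols
    while start < end and gappy(start):
        start += 1
    while end > start and gappy(end - 1):
        end -= 1
    return [s[start:end] for s in aligned_seqs]

def is_fragment(seq, min_coverage=0.7):
    """True if the sequence covers less than min_coverage of the alignment."""
    residues = sum(1 for c in seq if c != '-')
    return residues / len(seq) < min_coverage
```

In practice you would still eyeball the alignment afterwards; automatic trimming cannot use knowledge about active sites or 3D structure.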
How many sequences in your alignment?
An HMM can be over-trained, i.e. it will only be able to identify the sequences in the training set but not orthologues/homologues. One source of that problem is of course redundancy, but the number of sequences is important too. I’ve used a minimum of 20 sequences as a rule of thumb (not entirely empirically tested though; you can always try with fewer and check if it works), but the more the merrier…
Build the HMM
There are all sorts of options to set, but in general the default values take you a long way (except for the --informat option: the default alignment format is Stockholm, and my alignments are always in fasta format):
>hmmbuild --informat afa my_model.hmm my_aligned_sequences.fasta
Do a database search
This is the reason to create the HMM: to find sequences not in your training set that share the same feature. But also: a database search is the only way to see if the HMM is good. This command could definitely use some options (the default output is huge), but it is a matter of taste which output format options to set.
>hmmsearch --noali --domtblout my_search_results.txt my_model.hmm my_sequence_database
Where are the true positives and where are the false positives? Which threshold to use depends on the purpose: if some false hits are okay as long as you get as many true hits as possible, choose a generous threshold; if it’s important that all hits are true, a strict threshold must be used. I tend to always look at the bit score instead of the E-value (and to use the --domT threshold option in the hmmsearch command once I’ve decided on the threshold). There’s no shortcut: you need to be able to decide which hits are true and which are false, and the thresholds will be different for each HMM.
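The --domtblout file is whitespace-separated with '#' comment lines, and in the HMMER3 format the target name is the first field and the per-domain bit score the fourteenth. A sketch of pulling out domains above a chosen score (the threshold value here is just an example):

```python
# Sketch: parse hmmsearch --domtblout output and keep domains scoring
# at or above a chosen bit-score threshold (25.0 is illustrative).

def parse_domtbl(lines, score_threshold=25.0):
    """Yield (target_name, domain_score) pairs above the threshold.

    Field 0 is the target name; field 13 is the per-domain bit score
    in HMMER3's --domtblout layout.
    """
    hits = []
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue
        fields = line.split()
        target, dom_score = fields[0], float(fields[13])
        if dom_score >= score_threshold:
            hits.append((target, dom_score))
    return hits
```

Of course, once you have settled on the threshold, --domT does this filtering for you inside hmmsearch itself.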
Validate the HMM
It always makes sense to validate your HMM, and a good way to do that is a jack-knifing approach: remove one sequence at a time from the original training set, train an HMM on the remaining ones, do a database search and see if the left-out sequence is found before the false sequences. This means that your database should consist of the left-out sequence as well as definitely false sequences. The false ones should be similar enough in sequence to the true ones that it is a real challenge for the HMM to separate them. Another, easier, approach is to set a specific threshold (one that excludes false hits) and just score the left-out sequence against the HMM: if it is above the threshold, success; otherwise it is a false negative.
This process of leaving a sequence out, training an HMM, scoring and evaluating is repeated until every sequence in the training set has been left out once.
If your data set is solid, I think all sequences in the training set should pass this test; otherwise remove the failing ones from the final training set.
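The leave-one-out loop can be sketched generically. Here `train` and `score` are hypothetical stand-ins for the real steps (re-aligning plus hmmbuild, and an hmmsearch bit score, respectively); only the loop structure is the point:

```python
# Sketch of jack-knife validation. `train` and `score` are placeholder
# callables standing in for hmmbuild / hmmsearch runs on real data.

def jackknife(training_set, train, score, threshold):
    """Return the sequences that fail the leave-one-out test."""
    failures = []
    for i, left_out in enumerate(training_set):
        rest = training_set[:i] + training_set[i + 1:]
        model = train(rest)                      # e.g. align + hmmbuild
        if score(model, left_out) < threshold:   # e.g. hmmsearch bit score
            failures.append(left_out)
    return failures
```

Anything returned in `failures` is a candidate for removal from the final training set.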
Since ‘the more the merrier’ is true when it comes to training sets for HMM building, it is always good to see if the training set can be extended. Hence, if the database search finds sequences that are qualified to be part of the training set, include them. By qualified I mean above the threshold, not redundant to any sequence already in the training set, not an outlier, and not a fragment (and of course not a false hit ;-)). Take the fishing as far as possible, i.e. do this iteratively, if for no other reason than to see how far you can go until no more new sequences are found (ideally) or the HMM gets worthless/washed out and the hit list ‘explodes’.
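The iterative fishing can be sketched as a loop that stops when a round adds nothing new. Here `search` and `qualifies` are hypothetical stand-ins for an hmmsearch run and the manual quality checks (above threshold, not redundant, not an outlier, not a fragment):

```python
# Sketch of iterative training-set extension. `search` and `qualifies`
# are placeholders for hmmsearch and the quality checks from the text.

def extend_training_set(training_set, search, qualifies, max_rounds=10):
    """Grow the training set until a search round finds nothing new."""
    for _ in range(max_rounds):
        new = [h for h in search(training_set)
               if qualifies(h, training_set) and h not in training_set]
        if not new:
            break
        training_set = training_set + new
    return training_set
```

The `max_rounds` cap is a safeguard for the washed-out case, where the hit list keeps growing instead of converging.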
Yvonne Kallberg, BILS