1. General Information
This tool is free for both academic and non-academic use. No registration is required to access the full functionality of the server. However, if you need to process larger protein sets or perform motif searches frequently, please contact the authors to obtain a standalone version for Linux.
Please note that results are stored on the server for only 7 days. It is recommended to either download the plain text summary or save the complete result HTML page locally through your browser (right click -> Save page as...) for later access.
The website design is kept simple and lightweight. Neither cookies nor additional plugins are required. All modern browsers (starting from IE9) are supported. The website is fully mobile-friendly.
Security and privacy
The web server does not collect any information about its users. Data can be submitted completely anonymously; however, users have the option to provide an e-mail address to be notified when a job completes. Each accepted job gets a unique, randomly generated identifier (a 40-character hexadecimal string), which is required to access the results. Please share this identifier only with persons who are authorized to view the results. The results are permanently deleted from the server on the 8th day after job completion.
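As a small illustration of the identifier format, the following Python sketch checks whether a string has the shape of a job identifier before it is stored or shared. The helper name `looks_like_job_id` is hypothetical, and accepting both upper- and lower-case hexadecimal digits is an assumption. The sketch only checks the format and says nothing about how the server generates its identifiers.

```python
import re
import secrets

# Assumption: 40 hexadecimal characters, either letter case accepted.
JOB_ID_PATTERN = re.compile(r"[0-9a-fA-F]{40}")

def looks_like_job_id(text: str) -> bool:
    """Return True if text has the shape of a 40-character hex identifier."""
    return JOB_ID_PATTERN.fullmatch(text) is not None

# 20 random bytes render as 40 hexadecimal characters (illustration only).
example_id = secrets.token_hex(20)
```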
This guide addresses mainly the technical aspects of using the server. For information about the scope of use and internal workflow, please visit the About page.
2. De Novo Input
Input protein set
The input set must contain ONLY full protein sequences; otherwise the evolutionary conservation and solvent accessibility filters will be skewed! To restrict a motif search to specific domains, use the query regions file in advanced mode. The input set must contain at least 3 and at most 50 proteins.
A FASTA file containing sequences of all proteins in the set, one sequence per protein. Alignments as well as empty sequences are not allowed. Each sequence must be no shorter than 50 and no longer than 10,000 amino acids. Alternatively, the file content can be pasted into the corresponding text field. There is no difference between these input options; however, the user must use either the text field or the file, but not both simultaneously.
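The limits above can be checked locally before submission. The following Python sketch is not the server's own validation. The function names are hypothetical, and treating a `-` character as a sign of an alignment is an assumption.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def check_protein_set(records, min_proteins=3, max_proteins=50,
                      min_len=50, max_len=10_000):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if not min_proteins <= len(records) <= max_proteins:
        problems.append(
            f"set has {len(records)} proteins; expected {min_proteins} to {max_proteins}")
    for header, seq in records:
        if "-" in seq:  # assumption: gap characters indicate an alignment
            problems.append(f"{header}: gap characters suggest an alignment")
        if not min_len <= len(seq) <= max_len:  # also catches empty sequences
            problems.append(
                f"{header}: length {len(seq)} outside {min_len} to {max_len}")
    return problems
```

Calling `check_protein_set(parse_fasta(text))` on the content of the file or text field returns an empty list when none of the stated limits is violated.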
A set of FASTA files with input protein sequences, one file per protein. Each file can be a single or a multiple FASTA. A multiple FASTA is interpreted as a collection of orthologs of the same protein. If orthology search is activated, only the first FASTA record of each file is considered, and orthologs are re-detected with the standard 2-way-BLAST-based algorithm. Alignments as well as empty sequences are not allowed. Each sequence must be no shorter than 50 and no longer than 10,000 amino acids. The set must be provided as a ZIP archive containing either only FASTA files or a single directory with FASTA files; subdirectories are not allowed. All files and directories whose names do not start with a Roman letter or digit are ignored. Filenames must not contain colons or non-ANSI characters. A filename must not exceed 72 characters.
A FASTA file can also be submitted instead of the ZIP archive, as in standard mode. In this case all the sequences are treated as separate proteins, and it is no longer possible to submit user-defined orthologs or regions of interest.
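The filename rules can be sketched as a simple classifier. This Python example is only an approximation: the helper `classify_filename` is hypothetical, and "non-ANSI characters" is interpreted here as "non-ASCII characters", which is an assumption rather than the server's documented behaviour.

```python
import string

MAX_NAME_LENGTH = 72
ALLOWED_FIRST = set(string.ascii_letters + string.digits)

def classify_filename(name):
    """Return 'ignored', 'ok', or a message naming the violated rule."""
    # Names not starting with a letter or digit are skipped, not rejected.
    if not name or name[0] not in ALLOWED_FIRST:
        return "ignored"
    if ":" in name:
        return "invalid: contains a colon"
    if not name.isascii():  # assumption: "non-ANSI" approximated as non-ASCII
        return "invalid: contains non-ASCII characters"
    if len(name) > MAX_NAME_LENGTH:
        return "invalid: longer than 72 characters"
    return "ok"
```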
Query regions file (advanced mode only)
A file specifying the opening and closing residue numbers of the regions of interest in each protein in which to look for motifs. This option is useful for restricting a motif search to specified domains (never truncate the input sequences for this purpose, as this will skew the evolutionary conservation and solvent accessibility filters!). Each line of this file consists of a filename followed immediately by a colon and then the sequence regions (each given as opening and closing residues connected by a dash), separated by commas. If an input FASTA filename is not listed in this file, the whole sequence is considered. Filenames that do not correspond to files in the protein set are not allowed. Regions shorter than 20 residues should be avoided, as they would not contain enough information for reliable motif detection. The algorithm can move boundaries between masked and unmasked regions by 1 residue in either direction if one of them is too short. The regions file can be submitted only with a ZIP archive as data set input, not with a FASTA file.
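To make the line format concrete, a line for a hypothetical file `kinaseA.fasta` with two regions would read `kinaseA.fasta:10-120,200-350`. The Python sketch below parses such lines and flags regions shorter than 20 residues. The function names are hypothetical, and the sketch assumes that residue numbers are 1-based and that both boundaries are inclusive.

```python
def parse_regions_line(line):
    """Split 'name:start-end,start-end' into (name, [(start, end), ...])."""
    name, _, spec = line.strip().partition(":")
    if not name or not spec:
        raise ValueError(f"malformed line: {line!r}")
    regions = []
    for part in spec.split(","):
        start_text, _, end_text = part.strip().partition("-")
        start, end = int(start_text), int(end_text)  # ValueError if not numeric
        if start < 1 or end < start:
            raise ValueError(f"bad region {part!r} in {name}")
        regions.append((start, end))
    return name, regions

def short_regions(regions, min_length=20):
    """Return the regions shorter than min_length (boundaries inclusive)."""
    return [(s, e) for s, e in regions if e - s + 1 < min_length]
```

Running `short_regions` on each parsed line before upload gives an early warning about regions that are unlikely to carry enough information.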
Search for orthologs
Search for proteins with the same function and evolutionary origin in other species. These are used in the evolutionary conservation filtering: motifs found in non-conserved stretches receive lower scores or are removed from the results altogether. The search is performed against the NCBI non-redundant database. No orthologs will be found for sequences that do not correspond exactly to one of the database records. This means that no orthologs will be found for partial or mutant sequences either.
Orthology search is always on in standard mode.
Restrict gap length in found motif instances to 1 residue. This restriction conforms with almost all known motifs. However, if long flexible linkers are expected to be contained in the sought-for motifs, this option should be deactivated.
Gap restriction is always on in standard mode.
Check surface accessibility
Predict, resolve, and mask the regions that most likely lie inside globular domains and are not exposed on the surface. However, this may lead to the loss of motifs located in cavities. Transmembrane regions will also generally be predicted as not accessible and subsequently masked.
Surface accessibility check is always on in standard mode.
Predict, resolve, and mask the regions that are most likely too ordered or rigid to harbor functional short motifs. This option has been shown to result in a slight increase in motif prediction accuracy.
Disorder check is always off in standard mode.
Smart homology filtering
Discard motif trees that are located primarily in longer regions of homology and shared conserved domains. This filter works at the pairwise level within the context of individual motif trees and therefore does not by itself mask whole input sequences or parts of them. Motif discovery in non-homologous regions of the sequences is not prohibited. The search can therefore tolerate homology among input sequences to quite a large extent. However, correct motif prediction in very close homologs is still not possible. Deactivating this filter normally increases recall while decreasing precision.
The option is always on in standard mode.
Show best suboptimal if no motifs found
If no motif trees can be found with the input settings, relax the search parameters and show the best suboptimal motif tree. This option may be useful for evaluating potential motif candidates, even if they are not strong enough to pass the filters of HH-MOTiF. The suboptimal candidate is not shown if at least one sufficiently strong motif tree is found. If this option is activated and takes effect, a corresponding message appears at the top of the results page. In some extreme cases (for example, if the submitted sequences are too similar or too diverse, or the specified query regions are too narrow), however, no results may be found even with the option activated.
The option is always off in standard mode.
Maximal regex p-value
After predicting motif trees, the server tries to represent each region in the form of a classical motif by building the corresponding regular expression (regex). The quality of the resulting regex depends heavily on the degree of conservation of the underlying motif. The regex p-value is calculated on the basis of single amino acid frequencies, pair frequencies, and the number of proteins containing the putative motif. The p-value lies in the range [0, 1] and estimates the probability that this motif occurs in these particular proteins just by chance. However, a regex is a very poor representation of weakly conserved motifs and therefore often shows high p-values even for experimentally confirmed motifs. That is why the default value is set to 0.3; however, if the sought-for motifs are expected to be highly conserved, a value of 0.1 may be more appropriate.
Maximal regex p-value is always 0.3 in standard mode.
3. De Novo Output
All found motif trees are displayed in the form of a table, one line per motif tree. Protein number codes are provided in place of the user-defined FASTA headers or filenames (nevertheless, the full FASTA records are visible further down the page). Each tree represents a distinct motif, which is generally independent of the other found motifs. More information about a motif appears upon clicking.
The absence of results means that no motifs could be found. In this case a search with more relaxed parameters might be helpful.
The input sequences are shown in the HTML version of the output. All sequences are number-coded (starting from 0000). In standard mode the order is preserved from the input file. In advanced mode the sequences are ordered alphabetically according to input filenames; only the first sequence from each file is shown (although the others are considered - as orthologs - during the calculations). If query regions were specified, they are also indicated in the output (residues lying outside are masked and colored gray).
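When mapping number codes back to input files in advanced mode, a sketch like the following may help. The helper `number_code` is hypothetical, and the sketch assumes plain code-point sorting of filenames, which may differ from the server's ordering for mixed-case names or names containing punctuation.

```python
def number_code(filenames):
    """Map each filename to a four-digit code, assigned in sorted order from 0000."""
    return {name: f"{i:04d}" for i, name in enumerate(sorted(filenames))}
```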
Motif trees are the most important information on the output page. A motif tree is an evolution-oriented model of a putative sequence motif.
According to the model, each tree consists of two distinct components:
- The motif root in a particular protein
- Motif leaves in other proteins
Motif information panel
The panel appears upon clicking on a motif root within a protein sequence or the results overview section. It contains, besides the sequence logo, the information about the motif's regex, scores, leaves, as well as a link to the corresponding FASTA file.
The regular expression (regex) is the compact representation of conserved amino acids in a motif. Dots represent completely unconserved positions (a.k.a. columns). If there is at least one gap or a mismatched residue in the column, it is considered completely unconserved. For conserved or partially conserved columns, all occurring amino acids are listed in square brackets. The regex p-value is one of the estimates of a motif's strength.
The average alignment score is calculated from the alignment scores of all the root-leaf pairs, which are in turn based on the Viterbi algorithm scores. The higher the score, the stronger the motif tree.
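Because the notation uses only dots and square brackets, a reported regex can be applied directly with a standard regular expression engine. The Python sketch below scans a sequence for a made-up regex, `[ST]P.[KR]`, which is purely illustrative and not a server result. The helper `find_matches` is hypothetical and reports 1-based start positions of non-overlapping matches.

```python
import re

# Hypothetical regex in the notation described above:
# [ST] = S or T, P = proline only, . = any residue, [KR] = K or R.
motif_regex = "[ST]P.[KR]"
pattern = re.compile(motif_regex)

def find_matches(sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]
```

Such a scan only reproduces the regex view of a motif. As noted for the p-value, a regex can be a poor representation of weakly conserved motifs.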
The motif root appears as the first sequence stretch and is followed by the motif leaves ordered by the alignment score. Although true sequence alignments are guaranteed only for root-leaf pairs (and not between individual leaves), for visualization purposes the whole tree is shown as multiple sequence alignment (MSA). In this context it is called a pseudo-MSA.
The corresponding FASTA file contains the (pseudo-)alignment of the motif tree. It can be further submitted as an input motif file for a proteome-wide motif search in order to find the same motif in other proteins and/or organisms.
The motif leaves of each motif tree are listed together with their alignment scores (in brackets) in the motif information panel. They are also highlighted in purple upon clicking on the corresponding motif root. The red underlining of all the motif roots is kept, so that one can easily see where the leaves of a particular motif tree overlap with the roots of other trees. Strong and unambiguous results would imply mutually overlapping motif trees with clear reciprocity. However, owing to the weak conservation of real-world motifs as well as the presence of noise, this is not always the case. Therefore, motif leaves that fail to overlap with another motif's root by at least 3 residues are marked as 'nonreciprocal' in the pseudo-MSAs for further user consideration. Due to their potentially large number and overlapping nature, the motif leaves are not displayed by default (they appear upon clicking).
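The overlap criterion can be illustrated with a short interval calculation. This Python sketch is only an interpretation of the rule above, not the server's code. The function names are hypothetical, positions are assumed to be inclusive residue numbers, and the caller is expected to pass only roots of other trees that lie in the same protein as the leaf.

```python
def overlap_length(a, b):
    """Number of residues shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def is_reciprocal(leaf, other_roots, min_overlap=3):
    """True if the leaf overlaps any given root by at least min_overlap residues."""
    return any(overlap_length(leaf, root) >= min_overlap for root in other_roots)
```

For example, a leaf at residues 10 to 20 and a root at residues 18 to 30 share three residues (18, 19, 20), so under these assumptions the leaf would not be marked 'nonreciprocal'.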
Plain text summary
The output is alternatively provided in the plain text format, which is more suitable for machine parsing.
If provided, the job name is displayed at the beginning of the file.
The information in the file is divided into four sections.
The first section contains input parameters.
The second section contains the protein coding for the next two sections.
In the third section all the input proteins (including those without any found motifs) are listed, with all the predicted motif tree roots appearing in sequence order and the average alignment scores in brackets.
In the fourth section - detailed information - every found motif tree is listed together with its regex, scores and motif leaves in other proteins.
Motif trees with roots in one protein are separated with
-s. Different proteins are separated with
The information about the job submission (including all the input parameters) is provided at the bottom of the output page.
4. Proteome-wide Input
Input motif file
The aligned FASTA file with already known instances of this motif. For better performance, as many instances as possible should be provided. If there are several instances of the motif in the same protein (i.e., the motif is a tandem repeat), they should be provided separately. At least 2 motif instances must be provided, but many more are needed for accurate and reliable results. The alignment must be no shorter than 3 and no longer than 50 columns.
The results of a de novo motif search on this server can easily be submitted as the input for a proteome-wide search. To do this, download the FASTA file from the corresponding motif information panel.
The organism to search for the motif in. Human and the most popular model organisms are supported.
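The constraints on the input motif file can be checked before upload. The Python sketch below works on the aligned rows (the sequence lines of the FASTA records) and is not the server's validation. The helper `check_motif_alignment` is hypothetical, and requiring all rows to have equal length is an assumption that follows from the file being an alignment.

```python
def check_motif_alignment(rows, min_instances=2, min_cols=3, max_cols=50):
    """Return a list of problems found in the aligned motif instances."""
    problems = []
    if len(rows) < min_instances:
        problems.append(
            f"only {len(rows)} instance(s); at least {min_instances} required")
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        problems.append("rows differ in length, so the file is not an alignment")
    for width in sorted(widths):
        if not min_cols <= width <= max_cols:
            problems.append(
                f"{width} columns is outside the allowed {min_cols} to {max_cols}")
    return problems
```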
Only full-length matches
Keep only those matches that have the same length as the input motif. This option is useful when the input motif begins and ends with conserved residues. It must be turned off if less conserved flanking residues are included in the input motif.
5. Proteome-wide Output
The information about the job submission is partially provided at the beginning and is continued at the bottom of the output page. Besides the download link to the original input file, the calculated consensus sequence is provided.
Sequences with found motifs
The sequences from the selected organism's proteome in which motifs were found. The sequences are taken from the NCBI database; the corresponding identifiers (GI and accession numbers) are provided in the headers. The found motifs are underlined and highlighted in red. Upon clicking, a small information box with the actual underlying alignment appears on the right side.
If no motifs are found, a corresponding message is shown instead of the sequences. The failure to find any results usually indicates that the input motif is too short and/or too weakly conserved.
Plain text summary
The output is alternatively provided (at the bottom of the page) in plain text format, which is more suitable for machine parsing.
If provided, the job name is displayed at the beginning of the file.
The information in the file is divided into three sections.
The first section contains only the input motif consensus sequence.
The second section lists the input parameters.
The last section consists of all found motif matches.
Each motif match is provided in the form of an alignment of the found hit with the input motif consensus sequence.
Below the alignment its Viterbi algorithm score is indicated.
Motifs found in different proteins are separated with