Search pattern
From NMPDR Wiki
Basic Rules
- PatScan uses the standard codes
- {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} for the amino acids.
- {A, C, G, T} for the nucleotides.
- Upper and Lower case are equivalent.
- T and U can be used interchangeably.
- Ambiguity Codes may be used to represent unknown characters in a pattern.
- For proteins, only X is allowed (representing any possible amino acid).
- For nucleotides, you can use the standard ambiguity codes, each of which matches any one character from its nucleotide subset.
Rules for Patterns
- A pattern is a sequence of pattern units.
- A pattern unit is either a simple pattern unit or of the form (X | Y) where X and Y are pattern units. The construct (X | Y) will match successfully if either X matches or Y matches. (NOTE: The spaces before and after the vertical bar | are necessary.)
- A simple pattern unit is either
- a named pattern unit,
- a complementation rule pattern unit, or
- a basic pattern unit.
Named Pattern Units
A named pattern unit is of the form
pi=X
where pi is one of {p1,p2,p3,...} and X is a basic pattern unit. When a named simple pattern unit successfully matches a section of a sequence, that section can be later referred to in constructs such as
p1 p2[0,1,0] ~p3
and so forth. The name saves the value of the matched substring.
Complementation Rule Pattern Units
A complementation rule pattern unit is of the form
ri=complements
where ri is one of {r1,r2,r3,...} and complements is a set defining what is meant by a complement under the named rule. For example,
r1={au,ua,gc,cg,gu,ug,ga,ag} r2={au,ua,gc,cg}
shows two complementation rule pattern units defining two specialized notions of complementation.
Explicitly defined complementation rules are useful when scanning for helices in nucleotide sequences, especially when unusual constraints exist for specific positions.
Normally, one uses the standard complementation rule, i.e. the set
{at,ta,cg,gc}
The scan tool assumes this to be the default rule, and it does not need to be defined explicitly.
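To make the idea of a complementation rule concrete, here is a minimal Python sketch. It is not PatScan's code; the helper names `parse_rule` and `is_reverse_complement` are hypothetical, and the sketch assumes that each two-letter entry in a rule lists a base followed by its permitted partner.

```python
def parse_rule(text):
    """Turn '{au,ua,gc,cg}' into a set of (base, partner) pairs, with U read as T."""
    def norm(ch):
        return ch.upper().replace("U", "T")
    pairs = text.strip().strip("{}").split(",")
    return {(norm(p.strip()[0]), norm(p.strip()[1])) for p in pairs}

# The default rule described in the text.
DEFAULT_RULE = parse_rule("{at,ta,cg,gc}")

def is_reverse_complement(left, right, rule=DEFAULT_RULE):
    """True if `right`, read backwards, pairs with `left` at every position under `rule`."""
    if len(left) != len(right):
        return False
    left = left.upper().replace("U", "T")
    right = right.upper().replace("U", "T")
    return all((a, b) in rule for a, b in zip(left, reversed(right)))

# A rule that also permits G-U and G-A pairs, as in the r1 example.
r1 = parse_rule("{au,ua,gc,cg,gu,ug,ga,ag}")
print(is_reverse_complement("GGAC", "GTCC"))      # standard pairing
print(is_reverse_complement("GGGU", "GCUU", r1))  # needs the G-U pairs in r1
```

Under the default rule the second call would fail, because G-U pairs are not in {at,ta,cg,gc}.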
Basic Pattern Units
There are eight different types of basic pattern units, described in the sections below.
String Pattern Units
A string pattern unit is a string of characters, optionally followed by a match qualifier of the form
- [Mismatches,Deletions,Insertions]
For example, a string pattern of the form
RNYRNYRNYRNY[1,0,0]
would match 12 characters in a nucleotide string (where R stands for a purine and Y stands for a pyrimidine). A maximum of 1 mismatch is allowed, but no insertions or deletions.
A mismatch in this case is a character position where the matched sequence does not match the pattern. A deletion is a character in the pattern for which no character in the matched sequence corresponds, while an "insertion" is a character in the sequence which does not correspond to any character in the string pattern unit. The following are examples of string pattern units.
AGYGGT YCXXGA TATAA[1,0,0]
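The following Python sketch illustrates what a mismatch-only qualifier such as [1,0,0] means for a nucleotide string pattern. It is an illustration under stated assumptions, not the tool's algorithm: the function name `find_with_mismatches` is hypothetical, the table holds the standard IUPAC nucleotide codes, and insertions and deletions are deliberately not handled.

```python
# Standard IUPAC nucleotide codes, each mapped to the bases it stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_with_mismatches(pattern, sequence, max_mismatches=0):
    """Return (start, text) for each window with at most `max_mismatches` mismatches."""
    pattern = pattern.upper()
    sequence = sequence.upper().replace("U", "T")  # T and U are interchangeable
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        # A mismatch is a position whose base is not allowed by the pattern code.
        mismatches = sum(base not in IUPAC[code] for code, base in zip(pattern, window))
        if mismatches <= max_mismatches:
            hits.append((start, window))
    return hits

print(find_with_mismatches("TATAA", "GGTAGAAGG", max_mismatches=1))
# [(2, 'TAGAA')]
```

With `max_mismatches=0` the same call returns an empty list, because TAGAA differs from TATAA at one position.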
Range Pattern Units
A range pattern unit is of the form
- Min...Max
which indicates that it will match any subsequence with length between Min and Max. For example, the following pattern unit will match any sequence of 3 to 8 characters.
3...8
Complement Pattern Units
A complement pattern unit is used to match against the reverse complement of a string previously matched by a named pattern unit. Thus,
~p1
matches the reverse complement of whatever p1 represents. If a special rule of complementation is required, it precedes the ~, thus
r1~p2
matches the reverse complement of whatever p2 matched, where complementation is defined by complementation rule r1. You can also add a match qualifier of the sort used in string pattern units; thus,
~p2[1,0,1]
would allow a single mismatch and a single character bulge in the helix.
Complement pattern units can only be used for DNA searches.
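As an illustration of how a named unit, a range unit, and a complement unit combine, the sketch below searches for what a pattern such as p1=4...4 3...8 ~p1 describes: a 4-base stem, a loop of 3 to 8 bases, and the reverse complement of the stem. The pattern itself and the function names are chosen for illustration, only the default complementation rule is used, and no mismatches are allowed.

```python
# Default complementation rule {at,ta,cg,gc} as a translation table.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_hairpins(sequence, stem_len=4, loop_min=3, loop_max=8):
    """Return (start, text) for each stem / loop / reverse-complemented stem found."""
    sequence = sequence.upper().replace("U", "T")
    hits = []
    for start in range(len(sequence) - stem_len + 1):
        stem = sequence[start:start + stem_len]   # plays the role of p1
        target = reverse_complement(stem)         # what ~p1 must match
        for loop_len in range(loop_min, loop_max + 1):
            right = start + stem_len + loop_len   # where the second stem would begin
            if sequence[right:right + stem_len] == target:
                hits.append((start, sequence[start:right + stem_len]))
    return hits

print(find_hairpins("AAGGACTTTTGTCCAA"))
# [(2, 'GGACTTTTGTCC')]
```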
Repeat Pattern Units
A repeat pattern unit is used to match against the value saved in a previously matched named pattern unit. Thus,
p1=3...3 p1 p1
would match a 9-character string made up of a 3-mer repeated three times. You can qualify the matches; for example,
p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0] p1[1,0,0]
matches a 15-character string which might be thought of as 5 repetitions of a 3-mer that have experienced a few mutations.
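The repeat idea can be sketched in Python as follows. This is a simplified illustration, not the tool's implementation: `find_tandem_repeats` is a hypothetical name, the first unit stands in for p1, and each later copy is compared with it allowing mismatches only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_tandem_repeats(sequence, unit_len=3, copies=3, max_mismatches=0):
    """Return (start, text) for windows of `copies` consecutive units that repeat the first."""
    sequence = sequence.upper()
    total = unit_len * copies
    hits = []
    for start in range(len(sequence) - total + 1):
        window = sequence[start:start + total]
        first = window[:unit_len]  # the value saved by p1
        later = [window[i:i + unit_len] for i in range(unit_len, total, unit_len)]
        if all(hamming(first, unit) <= max_mismatches for unit in later):
            hits.append((start, window))
    return hits

print(find_tandem_repeats("TTACGACGACGTT"))
# [(2, 'ACGACGACG')]
```

Raising `max_mismatches` to 1 also accepts the neighbouring windows at positions 1 and 3, whose later copies each differ from the first unit by a single base.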
Any-Of Pattern Units
An any-of pattern unit is used in constructing protein patterns, since we do not allow a rich set of ambiguity codes for amino acids. It has the form any(AAs) where AAs is a string of acceptable characters. For example, the following unit matches either I or V.
any(IV)
Any-of pattern units can only be used for protein searches.
Not-Any-Of Pattern Units
A not-any-of pattern unit is similar to the any-of pattern unit, except that it matches any character not in the designated set.
notany(IVL)
Not-any-of pattern units can only be used for protein searches.
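Readers familiar with regular expressions may find it helpful to think of these two units as character classes. The Python sketch below draws that analogy only; it says nothing about how the scan tool evaluates them, and the sample protein string is made up.

```python
import re

# any(IV) behaves like the class [IV]; notany(IVL) behaves like [^IVL].
any_iv = re.compile(r"[IV]", re.IGNORECASE)
notany_ivl = re.compile(r"[^IVL]", re.IGNORECASE)

protein = "MKVLIAG"  # made-up example sequence
print([m.start() for m in any_iv.finditer(protein)])      # [2, 4]
print([m.start() for m in notany_ivl.finditer(protein)])  # [0, 1, 5, 6]
```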
Weight Pattern Units
A weight pattern unit is a somewhat clumsy way to represent a standard weight matrix (it was intended that patterns be generated by programs, and so we deemed the syntax minimally acceptable). The form of the pattern unit is
- { List of N-tuples } > MinValue
Suppose that you wanted to match a sequence of eight characters. The consensus of these eight characters is GRCACCGS, but the actual frequencies of occurrence are given in the matrix below. Thus, the first character is an A 16% of the time and a G 84% of the time. The second is an A 57% of the time, a C 10% of the time, a G 29% of the time, and a T 4% of the time.
| | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
| A | 16 | 57 | 0 | 95 | 0 | 18 | 0 | 0 |
| C | 0 | 10 | 80 | 0 | 100 | 60 | 0 | 50 |
| G | 84 | 29 | 0 | 0 | 0 | 20 | 100 | 50 |
| T | 0 | 4 | 20 | 5 | 0 | 2 | 0 | 0 |
One could use the following pattern unit to search for inexact matches related to such a weight matrix:
{(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
(0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450
This pattern unit will attempt to match exactly eight characters. For each character in the sequence, the entry in the corresponding tuple is added to an accumulated sum. If the sum is greater than 450, the match succeeds; otherwise it fails. Each nucleotide tuple lists the values for A, C, G, and T in that order. For protein sequences, you must use 20-tuples (with the entries corresponding to the amino acids in alphabetical order). This will be used only by the most serious aficionados.
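The scoring described above can be worked through in a few lines of Python. This is a sketch of the arithmetic only, using the matrix from the example; the names `weight_score` and `weight_matches` are hypothetical, and the sketch assumes the sequence contains only A, C, G, and T (or U).

```python
# One tuple per pattern position, with entries in the order (A, C, G, T).
MATRIX = [
    (16, 0, 84, 0), (57, 10, 29, 4), (0, 80, 0, 20), (95, 0, 0, 5),
    (0, 100, 0, 0), (18, 60, 20, 2), (0, 0, 100, 0), (0, 50, 50, 0),
]
COLUMN = {"A": 0, "C": 1, "G": 2, "T": 3}

def weight_score(window, matrix=MATRIX):
    """Sum the matrix entry selected by each base of the window."""
    return sum(row[COLUMN[base]] for row, base in zip(matrix, window.upper()))

def weight_matches(sequence, matrix=MATRIX, min_value=450):
    """Return (start, text, score) for every window scoring strictly above min_value."""
    sequence = sequence.upper().replace("U", "T")
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = weight_score(window, matrix)
        if score > min_value:
            hits.append((start, window, score))
    return hits

print(weight_score("GACACCGC"))        # 84+57+80+95+100+60+100+50 = 626
print(weight_matches("TTGACACCGCTT"))  # [(2, 'GACACCGC', 626)]
```

The threshold of 450 sits well below the best possible score of 626, which is what lets inexact matches through.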
Length-Limit Pattern Units
A length-limit pattern unit puts a bound on the sum of the lengths matched by previous named pattern units (which probably named a sequence matched by a range pattern unit). It does not "consume" any of the sequence; rather, it just succeeds or fails. Thus,
p1=5...5 p2=1...5 p3=3...7 ~p1[1,0,0] p4=3...8 ~p3 length(p2+p4) < 10
would match a pseudo-knot like structure, setting a maximum size on the two unpaired internal subsequences; the length of p2 can be 1 to 5 and the length of p4 can be 3 to 8, but the combined length of both must be less than 10.
Additional Rules
End of String
Use a dollar sign to indicate the end of the sequence. For a protein pattern, this is the end of the gene. For a DNA pattern, this is the end of the contig. For example,
TTF $
matches TTF at the end of a gene.
Beginning of String
Use a caret to indicate the beginning of the sequence. For a protein pattern, this is the start of the gene. For a DNA pattern, this is the beginning of the contig. For example,
^ TTF
matches TTF at the beginning of a gene.
Palindrome Sequence
A left angle bracket can be used to indicate a palindrome. For example,
p1=4...4 <p1
matches all of the following,
AGGD DGGA FAFL LFAF GSAP PASG SAPR RPAS
That is, it matches any four characters followed by their reverse. (This is the actual palindrome, not the biologically common meaning of reverse complement.)
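A short Python sketch of this literal-palindrome test is given below. It is an illustration, not the tool's code; `find_mirror_repeats` is a hypothetical name and the sample sequence is made up around the first example above.

```python
def find_mirror_repeats(sequence, half_len=4):
    """Return (start, text) where `half_len` characters are followed by their exact reverse."""
    sequence = sequence.upper()
    total = 2 * half_len
    hits = []
    for start in range(len(sequence) - total + 1):
        left = sequence[start:start + half_len]            # plays the role of p1
        right = sequence[start + half_len:start + total]   # must equal p1 reversed
        if right == left[::-1]:
            hits.append((start, left + right))
    return hits

print(find_mirror_repeats("MKAGGDDGGAKL"))
# [(2, 'AGGDDGGA')]
```

No complementation is involved here, which is why the check works on protein sequences as well as nucleotide ones.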
