Detect the presence of patterns in sequences
seq_detect_pattern(x, pattern, max_error = 0)
a DNA, RNA or AA vector.
a DNA, RNA or AA vector (of the same class as x), or a character vector of regular expressions, or a list.
See section Patterns.
numeric value ranging from 0 to 1 and giving the maximum error rate allowed between the target sequence and the pattern. Error rate is relative to the length of the pattern.
A logical vector.
It is important to understand how patterns are treated in bioseq.
Patterns are recycled along the sequences (usually the x argument).
This means that if a pattern (vector or list) is of length > 1, it will be replicated until it is the same length as x.
The reverse is not true and a vector of patterns longer than a vector of sequences will raise a warning.
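As a minimal sketch of the recycling rule, a two-element pattern vector is matched below against four sequences, so each pattern should be reused once. The sequences are made up for illustration, and `res` is a hypothetical variable name.

```r
library(bioseq)

# Four illustrative sequences (not taken from the package documentation)
x <- dna(c("ACGTTAGTGTAGCCGT", "CTCGAAATGA", "TTCCGTTA", "GGAAATTC"))

# Two patterns: per the recycling rule, "CCG" is tried against sequences
# 1 and 3, and "AAA" against sequences 2 and 4
res <- seq_detect_pattern(x, dna(c("CCG", "AAA")))
res
```

The result holds one logical value per element of x, whatever the length of the pattern vector.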
Patterns can be DNA, RNA or AA vectors (but they must be from the same class as the sequences they are matched against). If patterns are DNA, RNA or AA vectors, they are disambiguated prior to matching. For example pattern dna("ARG") will match AAG or AGG.
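The disambiguation step can be tried with a short sketch. The three sequences are invented for illustration and `res_ambiguous` is a hypothetical variable name. Following the rule above, the sequences containing AAG or AGG would be expected to match, and the one containing ACG would not.

```r
library(bioseq)

# Illustrative sequences containing AAG, AGG and ACG respectively
x <- dna(c("TTAAGTT", "TTAGGTT", "TTACGTT"))

# R is the IUPAC ambiguity code for A or G, so "ARG" stands for AAG or AGG
res_ambiguous <- seq_detect_pattern(x, dna("ARG"))
res_ambiguous
```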
Alternatively, patterns can be a simple character vector containing regular expressions.
Vectors of patterns (DNA, RNA, AA or regex) can also be provided in a list. In that case, each vector of the list will be collapsed prior to matching, which means that each vector element will be used as an alternative pattern. For example, pattern list(c("AAA", "CCC"), "GG") will match AAA or CCC in the first sequence, GG in the second sequence, AAA or CCC in the third, and so on following the recycling rule.
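The list form described above can be sketched as follows. The sequences are invented for illustration, and `patterns` and `res_list` are hypothetical variable names. Four sequences are used so that the two-element list is recycled exactly twice.

```r
library(bioseq)

# Illustrative sequences (not taken from the package documentation)
x <- dna(c("TTAAATT", "TTGGTT", "TTCCCTT", "TTAAATT"))

# First list element: AAA or CCC (alternatives), applied to sequences 1 and 3
# Second list element: GG, applied to sequences 2 and 4
patterns <- list(c("AAA", "CCC"), "GG")

res_list <- seq_detect_pattern(x, patterns)
res_list
```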
Fuzzy matching
When max_error is greater than zero, the function performs fuzzy matching. Fuzzy matching does not support regular expressions.
See stri_detect from stringi, str_detect from stringr and afind from stringdist for the underlying implementation.
Other string operations: seq-replace, seq_combine(), seq_count_pattern(), seq_crop_pattern(), seq_crop_position(), seq_extract_pattern(), seq_extract_position(), seq_remove_pattern(), seq_remove_position(), seq_replace_position(), seq_split_kmer(), seq_split_pattern()
x <- dna(c("ACGTTAGTGTAGCCGT", "CTCGAAATGA"))
seq_detect_pattern(x, dna(c("CCG", "AAA")))
#> [1] TRUE TRUE
# Regular expression
seq_detect_pattern(x, "^A.{2}T")
#> [1] TRUE FALSE
# Fuzzy matching
seq_detect_pattern(x, dna("AGG"), max_error = 0.2)
#> [1] FALSE FALSE
# No match. The pattern has three characters, the max_error
# has to be > 1/3 to allow one character difference.
seq_detect_pattern(x, dna("AGG"), max_error = 0.4)
#> [1] TRUE TRUE
# Match