Cluster sequences by similarity — seq_cluster • bioseq

Cluster sequences by similarity

seq_cluster(x, threshold = 0.05, method = "complete")

Arguments

x: a DNA, RNA or AA vector of sequences to be clustered.
threshold: the distance threshold used to delimit clusters (a value in [0, 1]).
method: the clustering method (see details).

Value

An integer vector with group memberships.

Details

The function uses the dist.dna and dist.aa functions from ape to compute pairwise distances among sequences, and hclust to cluster them.
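
As a rough illustration of the distance-then-cluster pipeline described above, the sketch below uses base R only. It is not the package's implementation: p_distance is a hypothetical helper standing in for ape's dist.dna (it simply counts the proportion of differing sites between aligned sequences of equal length), and cutting the tree with cutree at height threshold is an assumption about how the threshold is applied.

```r
# Hypothetical helper (not part of bioseq or ape): proportion of differing
# sites between every pair of aligned, equal-length sequences.
p_distance <- function(seqs) {
  chars <- strsplit(seqs, "")
  n <- length(chars)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- mean(chars[[i]] != chars[[j]])
    }
  }
  as.dist(d)
}

seqs <- c("ACGTACGTAC",
          "ACGTACGTAC",
          "ACGTACGTAT",
          "TTGTACGGAC")

threshold <- 0.15  # illustrative value, not the function's default

# Hierarchical clustering on the pairwise distances
tree <- hclust(p_distance(seqs), method = "complete")

# Assumption: groups are obtained by cutting the tree at the threshold height
clusters <- cutree(tree, h = threshold)
clusters
```

Here the first three sequences differ by at most one site in ten (distance 0.1) and fall into one group, while the fourth differs from them by at least three sites and forms its own group.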

Computing a full pairwise distance matrix can be computationally expensive. It is recommended to use this function only on moderately sized datasets.

Supported methods are:

"single" (= Nearest Neighbour Clustering)
"complete" (= Farthest Neighbour Clustering)
"average" (= UPGMA)
"mcquitty" (= WPGMA)
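
The choice of method can change the grouping for the same distances and threshold. The sketch below is a minimal illustration using base R's hclust directly on a small hand-made distance matrix; the distances and the 0.05 cut height are made up for the example, and cutting with cutree(h = ...) is an assumption about how the threshold is applied.

```r
# Made-up pairwise distances between three sequences:
# d(1,2) = 0.03, d(2,3) = 0.04, d(1,3) = 0.07
d <- as.dist(matrix(c(0.00, 0.03, 0.07,
                      0.03, 0.00, 0.04,
                      0.07, 0.04, 0.00), nrow = 3))

methods <- c("single", "complete", "average", "mcquitty")

# One column of group memberships per linkage method, cut at height 0.05
res <- sapply(methods, function(m) cutree(hclust(d, method = m), h = 0.05))
res
# "single" joins sequence 3 at height 0.04 (its nearest neighbour),
# so all three sequences share one group.
# "complete" joins it at 0.07, "average" and "mcquitty" at 0.055,
# so sequence 3 stays in its own group with these methods.
```

Single linkage tends to chain sequences together through intermediate neighbours, whereas complete linkage only groups sequences that are all within the threshold of one another.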

See also

Function seq_consensus to compute consensus and representative sequences for clusters.

Other aggregation operations: seq_consensus()

Examples


x <- c("-----TACGCAGTAAAAGCTACTGATG",
       "CGTCATACGCAGTAAAAACTACTGATG",
       "CTTCATACGCAGTAAAAACTACTGATG",
       "CTTCATATGCAGTAAAAACTACTGATG",
       "CTTCATACGCAGTAAAAACTACTGATG",
       "CGTCATACGCAGTAAAAGCTACTGATG",
       "CTTCATATGCAGTAAAAGCTACTGACG")
x <- dna(x)
seq_cluster(x)
#> [1] 1 1 1 2 1 1 3