This function assesses the genetic similarity among sequences within each taxon. It takes user-defined thresholds (one threshold per taxonomic level) and warns about sequences that differ markedly from the others, based on their median distance. Sequences in the reference database must be aligned.
refdb_check_seq_homogeneity(x, levels, min_n_seq = 3)
x: a reference database (sequences must be aligned).
levels: a named vector of genetic similarity thresholds.
Names must correspond to taxonomic levels (taxonomic fields)
and values must be included in the interval [0, 1].
For example, to assess homogeneity at 5 percent (within species) and
10 percent (within genus): levels = c(species = 0.05, genus = 0.1)
min_n_seq: the minimum number of sequences for a taxon to be tested.
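The constraints on the thresholds vector (named, names matching taxonomic fields, values in [0, 1]) can be checked before calling the function. The following is a minimal base R sketch of such a check; validate_levels and its fields argument are hypothetical names introduced for illustration and are not part of refdb.

```r
# Hypothetical helper (not part of refdb): checks a thresholds vector against
# the constraints described above.
validate_levels <- function(levels, fields) {
  # Thresholds must be numeric and carry names
  stopifnot(is.numeric(levels), !is.null(names(levels)))
  if (any(names(levels) == "")) {
    stop("every threshold must be named")
  }
  # Names must correspond to known taxonomic fields
  unknown <- setdiff(names(levels), fields)
  if (length(unknown) > 0) {
    stop("unknown taxonomic level(s): ", paste(unknown, collapse = ", "))
  }
  # Values must lie in the interval [0, 1]
  if (any(is.na(levels)) || any(levels < 0 | levels > 1)) {
    stop("thresholds must lie in [0, 1]")
  }
  invisible(TRUE)
}

# A valid specification: 5 percent within species, 10 percent within genus
ok <- validate_levels(c(species = 0.05, genus = 0.1),
                      fields = c("genus", "species"))

# An invalid specification (5 instead of 0.05) raises an error
bad <- tryCatch(
  validate_levels(c(species = 5), fields = "species"),
  error = function(e) conditionMessage(e)
)
```

A common mistake this catches is writing a percentage (5) where a proportion (0.05) is expected.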
A data frame reporting suspicious sequences whose median distance to other sequences of the same taxon is greater than the specified threshold. The first column "level_threshold_homogeneity" indicates the lowest taxonomic level for which the threshold has been exceeded and the second column "value_threshold_homogeneity" gives the computed median distance.
For every tested taxonomic level, the algorithm
checks all sequences in every taxon
(for which the total number of sequences is > min_n_seq).
In each taxon, the pairwise distance matrix among all the sequences
belonging to this taxon is computed. A sequence is tagged as suspicious
and returned by the function
if its median genetic distance from the other sequences is higher than
the threshold set by the user (levels argument).
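The procedure described above can be illustrated for a single taxon with a short base R sketch. It is not the package's implementation: the distance used here is a plain proportion of differing sites (p-distance) that skips gap positions, which is an assumption, and the helper names p_distance, median_distances and flag_suspicious are hypothetical.

```r
# Proportion of differing sites between two aligned sequences,
# ignoring positions where either sequence has a gap ("-").
# Assumption: the distance model used by refdb may differ.
p_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  keep <- a != "-" & b != "-"
  mean(a[keep] != b[keep])
}

# Median distance from each sequence to the other sequences of the taxon
median_distances <- function(seqs) {
  n <- length(seqs)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i != j) d[i, j] <- p_distance(seqs[i], seqs[j])
    }
  }
  diag(d) <- NA  # a sequence is not compared with itself
  apply(d, 1, median, na.rm = TRUE)
}

# A sequence is suspicious if its median distance exceeds the threshold
flag_suspicious <- function(seqs, threshold) {
  median_distances(seqs) > threshold
}

# Toy taxon: three near-identical sequences and one divergent sequence
seqs <- c("ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTACGGGG")
med <- median_distances(seqs)
flagged <- flag_suspicious(seqs, threshold = 0.2)
```

Using the median rather than the mean keeps a single divergent sequence from inflating the score of the sequences it is compared with: in the toy data only the fourth sequence exceeds the threshold.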
lib <- read.csv(system.file("extdata", "homogeneity.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_check_seq_homogeneity(lib, levels = c(species = 0.05, genus = 0.1))
#> level_threshold_homogeneity value_threshold_homogeneity source sequenceID
#> 1 species 0.1961806 BOLD 11680869
#> 2 species 0.1909722 BOLD 11680870
#> 3 species 0.1927083 BOLD 11680871
#> 4 species 0.1718750 BOLD 9698884
#> 5 species 0.1666667 BOLD 9698872
#> 6 species 0.1666667 BOLD 9698875
#> 7 species 0.1684028 BOLD 9698883
#> 8 genus 0.1631944 BOLD 9698885
#> 9 genus 0.1666667 BOLD 9698887
#> 10 genus 0.1649306 BOLD 9698886
#> markercode phylum_name class_name order_name family_name subfamily_name
#> 1 COI-5P Arthropoda Insecta Ephemeroptera Baetidae Baetinae
#> 2 COI-5P Arthropoda Insecta Ephemeroptera Baetidae Baetinae
#> 3 COI-5P Arthropoda Insecta Ephemeroptera Baetidae Baetinae
#> 4 COI-5P Arthropoda Insecta Ephemeroptera Baetidae Baetinae
#> 5 COI-5P Arthropoda Insecta Ephemeroptera Baetidae Baetinae
#> 6 COI-5P Arthropoda Insecta Ephemeroptera Baetidae Baetinae
#> 7 COI-5P Arthropoda Insecta Ephemeroptera Baetidae Baetinae
#> 8 COI-5P Arthropoda Insecta Ephemeroptera Baetidae Baetinae
#> 9 COI-5P Arthropoda Insecta Ephemeroptera Baetidae Baetinae
#> 10 COI-5P Arthropoda Insecta Ephemeroptera Baetidae Baetinae
#> genus_name species_name subspecies_name
#> 1 Baetis Baetis alpinus NA
#> 2 Baetis Baetis alpinus NA
#> 3 Baetis Baetis alpinus NA
#> 4 Baetis Baetis alpinus NA
#> 5 Baetis Baetis alpinus NA
#> 6 Baetis Baetis alpinus NA
#> 7 Baetis Baetis alpinus NA
#> 8 Baetis Baetis melanonyx NA
#> 9 Baetis Baetis melanonyx NA
#> 10 Baetis Baetis melanonyx NA
#> nucleotides
#> 1 TACCCTTTATTTTATTTTTGGTGCGTGGTCGGGTATGGTGGGGACTTCACTGAGCCTGCTAATTCGAGCCGAACTTGGAAACCCGGGTTCTTTAATTGGGGATGACCAGATTTACAACGTGATTGTTACTGCCCATGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATTGGTGGGTTTGGCAATTGGTTAGTGCCTTTAATACTAGGGGCCCCAGACATGGCCTTTCCTCGTATAAACAATATAAGTTTTTGGTTGTTGCCGCCTTCGTTAACCTTGCTTGTGTCAAGTAGAATTGTAGATGTTGGTGCGGGAACGGGATGGACGGTGTACCCACCGCTTTCAGCCAATATCGCTCATGGTGGGGCATCTGTCGATTTTGCTATTTTTTCTCTACATTTGGCAGGGGTGTCTTCTATCTTAGGGGCTGTAAATTTTATTACAACTGTTGTTAACATGCGTAGCCCTGGAATGACTCTGGACCGAATACCCCTGTTTGTATGGTCTGTAGTAATTACCGCTGTTCTCTTACTCCTGTCGCTGCCGGTGCTGGCAGGAGCTATTACTATGTTACTAACGGATCGTAATTTGAATACCTCGTTTTTTGACCCTGCAGGTGGGGGTGATCCTATTTTGTACCAACATTTATTT
#> 2 TACTTTATACTTCATTTTTGGTGCATGAGCTGGAATAGTGGGCACTTCTTTGAGTTTATTAATTCGTGCAGAGCTAGGTAATCCTGGTTCTTTAATTGGTGATGACCAGATTTATAATGTTATTGTTACTGCCCATGCTTTTATTATAATTTTCTTTATAGTTATACCAATTATGATTGGTGGTTTTGGTAATTGATTAGTTCCTCTTATATTAGGAGCTCCTGATATAGCTTTTCCTCGTATAAATAATATAAGTTTTTGATTATTGCCTCCTTCATTGACTTTATTAGTATCAAGTAGTTTAGTAGATATAGGAGCTGGTACGGGTTGAACTGTATATCCTCCTTTAGCGGCTAATATTGCTCATGGAGGGTCATCAGTTGATTACGCTATTTTTTCTTTACATTTAGCTGGGGTATCTTCTATTTTAGGTGCTGTAAATTTTATTACAACAGTAATTAACATGCGTAGCCCTGGTATAACTTTAGATCGTATTCCTTTATTTGTATGATCTGTTGTAATTACTGCTGTGTTATTGCTTTTATCGCTACCTGTATTAGCCGGTGCTATTACTATACTTTTAACCGATCGTAATCTTAATACTTCTTTTT-----------------------------------------------
#> 3 TACTTTATACTTCATTTTTGGTGCATGAGCTGGTATAGTGGGCACTTCATTGAGTTTATTAATTCGTGCGGAACTAGGGAATCCTGGATCTTTAATTGGTGATGATCAAATTTATAATGTTATTGTTACTGCTCACGCTTTTATTATAATTTTCTTTATGGTTATACCAATCATGATTGGAGGTTTTGGTAATTGATTAGTACCTCTTATATTAGGAGCACCTGATATGGCCTTCCCTCGTATAAATAATATAAGTTTTTGATTATTACCTCCTTCATTAACTTTATTAGTATCAAGTAGTTTAGTTGATATGGGAGCCGGTACAGGCTGAACAGTTTATCCTCCTTTAGCTGCAAATATTGCTCATGGAGGGTCATCCGTTGATTATGCTATTTTTTCTTTACATTTAGCTGGGGTATCTTCTATTTTAGGTGCTGTAAATTTTATTACAACAGTAATTAATATGCGTAGTCCGGGTATAACTTTAGATCGAATTCCTTTATTTGTATGATCTGTTGTAATTACTGCTGTTTTATTATTATTATCACTACCTGTGTTAGCTGGTGCTATTACTATACTTTTAACTGATCGTAATCTTAATACTTCTTTTTTTGATCCTGCTGGAGGGGGTGATCCTATTTTATACCAACATTTA---
#> 4 -----------------------------------TGGTGGGGACCTCCCTCAGTCTTTTAATTCGAGCCGAGCTTGGCAATCCTGGGTCTCTAATTGGTGACGACCAAATTTACAACGTTATTGTCACTGCTCATGCGTTTATCATAATTTTTTTTATAGTGATACCAATTATAATCGGCGGGTTTGGTAATTGGCTTGTTCCGCTCATGTTAGGAGCCCCAGATATGGCATTCCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCCCCTTCGCTCACACTGCTTGTATCAAGCAGTGTTGTTGATGTAGGGGCGGGCACTGGGTGGACCGTATATCCCCCCTTGGCTGCCAACATCGCACATGGGGGTGCTTCTGTGGATTTCGCAATTTTTTCGCTACACCTGGCGGGGGTGTCTTCAATTTTGGGCGCTGTAAACTTCATTACAACTGTAATTAATATGCGCAGCCCAGGAATAACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATTACTGCGGTTTTATTACTACTTTCTCTACCGGTTCTTGCCGGGGCAATTACTATACTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGATCCGGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 5 -----------------------------------TGGTGGGGACCTCCCTCAGTCTTTTAATTCGAGCCGAGCTTGGCAATCCTGGGTCTCTAATTGGTGACGACCAAATTTACAACGTTATTGTCACTGCTCATGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGCGGGTTTGGTAATTGGCTTGTTCCGCTCATGTTAGGAGCCCCAGATATGGCATTTCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCCCCTTCGCTCACACTGCTTGTATCAAGCAGTGTTGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCACATGGGGGTGCTTCTGTGGATTTCGCAATTTTTTCGCTACACCTGGCGGGGGTGTCTTCAATTTTGGGCGCTGTAAACTTCATTACAACTGTAATTAATATGCGCAGCCCAGGAATAACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATTACTGCGGTTTTATTACTACTTTCTCTACCGGTTCTTGCCGGGGCAATTACTATACTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGATCCGGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 6 -----------------------------------TGGTGGGGACCTCCCTCAGTCTTTTAATTCGAGCCGAGCTTGGCAATCCTGGGTCTCTAATTGGTGACGACCAAATTTACAACGTTATTGTCACTGCTCATGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGCGGGTTTGGTAATTGGCTTGTTCCGCTCATGTTAGGAGCCCCAGATATGGCATTCCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCCCCTTCGCTCACACTGCTTGTATCAAGCAGTGTTGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCACATGGGGGTGCTTCTGTGGATTTCGCAATTTTTTCACTACACCTGGCGGGGGTGTCTTCAATTTTGGGCGCTGTAAACTTCATTACAACTGTAATTAATATGCGCAGCCCAGGAATAACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATTACTGCGGTTTTATTACTACTTTCTCTACCGGTTCTTGCCGGGGCAATTACTATACTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGATCCGGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 7 -----------------------------------TGGTGGGGACCTCCCTCAGTCTTTTAATTCGAGCCGAGCTTGGCAATCCTGGGTCTCTAATTGGTGACGACCAAATTTACAACGTTATTGTCACTGCTCATGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGCGGGTTTGGTAATTGGCTTGTTCCGCTCATGTTAGGAGCCCCAGATATGGCATTCCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCCCCTTCGCTCACACTGCTTGTATCAAGCAGTGTTGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCACATGGGGGTGCTTCTGTGGATTTCGCAATTTTTTCGCTACACCTGGCGGGGGTGTCTTCAATTTTGGGCGCTGTAAACTTCATTACAACTGTAATTAATATGCGCAGCCCAGGAATAACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATTACTGCGGTTTTATTACTACTTTCTCTACCGGTTCTTGCCGGGGCAATTACTATACTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGATCCGGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 8 -----------------------------------TGGTGGGGACATCGCTCAGTCTTTTAATTCGAGCCGAGCTTGGTAACCCTGGGTCCTTAATTGGTGACGACCAGATTTACAACGTTATTGTTACTGCTCACGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGGGGGTTTGGTAATTGGCTTGTTCCCCTTATATTAGGGGCCCCCGACATAGCATTTCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCTCCTTCCCTTACACTACTTGTGTCGAGCAGTGTGGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCGCATGGGGGGGCTTCAGTGGATTTCGCAATTTTTTCGTTACACCTGGCGGGGGTGTCTTCAATTTTGGGTGCTGTAAACTTCATTACAACTGTAATTAATATGCGTAGCCCAGGAATGACGTTGGATCGCATACCGCTATTCGTTTGATCTGTTGTAATCACTGCGGTTTTGTTATTGCTTTCTCTCCCGGTTCTCGCGGGGGCGATTACTATGCTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGACCCAGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 9 -----------------------------------TGGTGGGGACATCGCTCAGTCTTTTAATTCGAGCCGAGCTTGGTAACCCTGGGTCCTTAATTGGTGACGACCAGATTTACAACGTTATTGTTACTGCTCACGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGGGGGTTTGGTAATTGGCTTGTTCCACTTATATTAGGGGCCCCCGACATAGCATTCCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCTCCTTCCCTTACACTACTTGTGTCGAGCAGTGTGGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCACATGGGGGGGCTTCAGTGGATTTCGCAATTTTTTCGTTACACCTGGCGGGGGTGTCTTCAATTTTGGGTGCTGTAAACTTCATTACAACTGTAATTAATATGCGTAGCCCAGGAATGACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATCACTGCGGTTTTGTTATTGCTTTCTCTCCCGGTTCTCGCGGGGGCGATTACCATGCTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGACCCAGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 10 -----------------------------------TGGTGGGGACATCGCTCAGTCTTTTAATTCGAGCCGAGCTTGGTAACCCTGGGTCCTTAATTGGTGACGACCAGATTTACAACGTTATTGTTACTGCTCACGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGGGGGTTTGGTAATTGGCTTGTTCCACTTATATTAGGGGCCCCCGACATAGCATTTCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCTCCTTCCCTTACACTACTTGTGTCGAGCAGTGTGGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCACATGGGGGGGCTTCAGTGGATTTCGCAATTTTTTCGTTACACCTGGCGGGGGTGTCTTCAATTTTGGGTGCTGTAAACTTCATTACAACTGTAATTAATATGCGTAGCCCAGGAATGACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATCACTGCGGTTTTGTTATTGCTTTCTCTCCCGGTTCTCGCGGGGGCGATTACCATGCTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGACCCAGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> country province_state lat lon
#> 1 Switzerland <NA> NA NA
#> 2 Switzerland <NA> NA NA
#> 3 Switzerland <NA> NA NA
#> 4 Switzerland <NA> NA NA
#> 5 Switzerland <NA> NA NA
#> 6 Switzerland <NA> NA NA
#> 7 Switzerland <NA> NA NA
#> 8 Switzerland <NA> NA NA
#> 9 Switzerland <NA> NA NA
#> 10 Switzerland <NA> NA NA