This function assesses the genetic similarity among sequences within each taxon. It takes user-defined thresholds (one threshold per taxonomic level) and warns about sequences that differ markedly (based on median distance) from the other sequences of the same taxon. Sequences in the reference database must be aligned.

refdb_check_seq_homogeneity(x, levels, min_n_seq = 3)

Arguments

x

a reference database (sequences must be aligned).

levels

a named vector of genetic similarity thresholds. Names must correspond to taxonomic levels (taxonomic fields) and values must lie in the interval [0, 1]. For example, to assess homogeneity at 5 percent (within species) and 10 percent (within genus): levels = c(species = 0.05, genus = 0.1)

min_n_seq

the minimum number of sequences for a taxon to be tested.

Value

A data frame reporting suspicious sequences whose median distance to the other sequences of the same taxon is greater than the specified threshold. The first column, "level_threshold_homogeneity", indicates the lowest taxonomic level for which the threshold has been exceeded, and the second column, "value_threshold_homogeneity", gives the computed median distance.
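As an illustration of how the returned data frame might be used, the sketch below works on a small mock-up rather than on real output. The three flagged rows reuse values from the example output on this page; the ref table and its fourth sequenceID are hypothetical and exist only to show the filtering step.

```r
# Mock-up of three columns of the returned data frame
# (values copied from the example output on this page).
flagged <- data.frame(
  level_threshold_homogeneity = c("species", "species", "genus"),
  value_threshold_homogeneity = c(0.1961806, 0.1718750, 0.1631944),
  sequenceID = c(11680869, 9698884, 9698885)
)

# Number of flagged sequences per taxonomic level.
n_by_level <- table(flagged$level_threshold_homogeneity)

# Most divergent sequences first.
flagged_sorted <- flagged[order(flagged$value_threshold_homogeneity,
                                decreasing = TRUE), ]

# Hypothetical reference table: drop the rows whose sequenceID was flagged.
ref <- data.frame(sequenceID = c(11680869, 9698884, 9698885, 1234567))
ref_clean <- ref[!ref$sequenceID %in% flagged$sequenceID, , drop = FALSE]
```

Whether flagged sequences should actually be removed is a judgment call; they may be worth inspecting first, since a high median distance can reflect a misidentification as well as genuine divergence.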

Details

For every tested taxonomic level, the algorithm checks all sequences in every taxon for which the total number of sequences is > min_n_seq. In each taxon, the pairwise distance matrix among all the sequences belonging to this taxon is computed. A sequence is tagged as suspicious and returned by the function if its median genetic distance from the other sequences is higher than the threshold set by the user (levels argument).
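The check described above can be sketched in base R for a single taxon. This is a minimal illustration, not the package's implementation: the toy sequences, the p_distance helper (proportion of differing sites, skipping gapped positions) and the 0.15 threshold are all assumptions made for the example, and the distance measure actually used by the function may differ.

```r
# Toy aligned sequences for one hypothetical taxon (equal length, "-" = gap).
seqs <- c(
  s1 = "ACGTACGTAC",
  s2 = "ACGTACGTAC",
  s3 = "ACGTACGTAT",
  s4 = "TTGTACCTGG"
)

# Hypothetical helper: proportion of differing sites between two aligned
# sequences, skipping positions where either sequence has a gap.
p_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  keep <- a != "-" & b != "-"
  mean(a[keep] != b[keep])
}

# Pairwise distance matrix among all sequences of the taxon.
n <- length(seqs)
d <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- p_distance(seqs[[i]], seqs[[j]])
  }
}

# Median distance from each sequence to the others (diagonal excluded).
diag(d) <- NA
med_dist <- apply(d, 1, median, na.rm = TRUE)

# Flag sequences whose median distance exceeds the (illustrative) threshold.
threshold <- 0.15
suspicious <- names(med_dist)[med_dist > threshold]
```

Here s4 differs from every other sequence at half of its sites, so its median distance (0.5) exceeds the threshold, while s1 to s3 each have a median distance of 0.1 and are not flagged. Using the median rather than the mean keeps a single outlier from inflating the score of the sequences it is compared against.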

Examples

lib <- read.csv(system.file("extdata", "homogeneity.csv", package = "refdb"))
lib <- refdb_set_fields_BOLD(lib)
refdb_check_seq_homogeneity(lib, levels = c(species = 0.05, genus = 0.1))
#>    level_threshold_homogeneity value_threshold_homogeneity source sequenceID
#> 1                      species                   0.1961806   BOLD   11680869
#> 2                      species                   0.1909722   BOLD   11680870
#> 3                      species                   0.1927083   BOLD   11680871
#> 4                      species                   0.1718750   BOLD    9698884
#> 5                      species                   0.1666667   BOLD    9698872
#> 6                      species                   0.1666667   BOLD    9698875
#> 7                      species                   0.1684028   BOLD    9698883
#> 8                        genus                   0.1631944   BOLD    9698885
#> 9                        genus                   0.1666667   BOLD    9698887
#> 10                       genus                   0.1649306   BOLD    9698886
#>    markercode phylum_name class_name    order_name family_name subfamily_name
#> 1      COI-5P  Arthropoda    Insecta Ephemeroptera    Baetidae       Baetinae
#> 2      COI-5P  Arthropoda    Insecta Ephemeroptera    Baetidae       Baetinae
#> 3      COI-5P  Arthropoda    Insecta Ephemeroptera    Baetidae       Baetinae
#> 4      COI-5P  Arthropoda    Insecta Ephemeroptera    Baetidae       Baetinae
#> 5      COI-5P  Arthropoda    Insecta Ephemeroptera    Baetidae       Baetinae
#> 6      COI-5P  Arthropoda    Insecta Ephemeroptera    Baetidae       Baetinae
#> 7      COI-5P  Arthropoda    Insecta Ephemeroptera    Baetidae       Baetinae
#> 8      COI-5P  Arthropoda    Insecta Ephemeroptera    Baetidae       Baetinae
#> 9      COI-5P  Arthropoda    Insecta Ephemeroptera    Baetidae       Baetinae
#> 10     COI-5P  Arthropoda    Insecta Ephemeroptera    Baetidae       Baetinae
#>    genus_name     species_name subspecies_name
#> 1      Baetis   Baetis alpinus              NA
#> 2      Baetis   Baetis alpinus              NA
#> 3      Baetis   Baetis alpinus              NA
#> 4      Baetis   Baetis alpinus              NA
#> 5      Baetis   Baetis alpinus              NA
#> 6      Baetis   Baetis alpinus              NA
#> 7      Baetis   Baetis alpinus              NA
#> 8      Baetis Baetis melanonyx              NA
#> 9      Baetis Baetis melanonyx              NA
#> 10     Baetis Baetis melanonyx              NA
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           nucleotides
#> 1  TACCCTTTATTTTATTTTTGGTGCGTGGTCGGGTATGGTGGGGACTTCACTGAGCCTGCTAATTCGAGCCGAACTTGGAAACCCGGGTTCTTTAATTGGGGATGACCAGATTTACAACGTGATTGTTACTGCCCATGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATTGGTGGGTTTGGCAATTGGTTAGTGCCTTTAATACTAGGGGCCCCAGACATGGCCTTTCCTCGTATAAACAATATAAGTTTTTGGTTGTTGCCGCCTTCGTTAACCTTGCTTGTGTCAAGTAGAATTGTAGATGTTGGTGCGGGAACGGGATGGACGGTGTACCCACCGCTTTCAGCCAATATCGCTCATGGTGGGGCATCTGTCGATTTTGCTATTTTTTCTCTACATTTGGCAGGGGTGTCTTCTATCTTAGGGGCTGTAAATTTTATTACAACTGTTGTTAACATGCGTAGCCCTGGAATGACTCTGGACCGAATACCCCTGTTTGTATGGTCTGTAGTAATTACCGCTGTTCTCTTACTCCTGTCGCTGCCGGTGCTGGCAGGAGCTATTACTATGTTACTAACGGATCGTAATTTGAATACCTCGTTTTTTGACCCTGCAGGTGGGGGTGATCCTATTTTGTACCAACATTTATTT
#> 2  TACTTTATACTTCATTTTTGGTGCATGAGCTGGAATAGTGGGCACTTCTTTGAGTTTATTAATTCGTGCAGAGCTAGGTAATCCTGGTTCTTTAATTGGTGATGACCAGATTTATAATGTTATTGTTACTGCCCATGCTTTTATTATAATTTTCTTTATAGTTATACCAATTATGATTGGTGGTTTTGGTAATTGATTAGTTCCTCTTATATTAGGAGCTCCTGATATAGCTTTTCCTCGTATAAATAATATAAGTTTTTGATTATTGCCTCCTTCATTGACTTTATTAGTATCAAGTAGTTTAGTAGATATAGGAGCTGGTACGGGTTGAACTGTATATCCTCCTTTAGCGGCTAATATTGCTCATGGAGGGTCATCAGTTGATTACGCTATTTTTTCTTTACATTTAGCTGGGGTATCTTCTATTTTAGGTGCTGTAAATTTTATTACAACAGTAATTAACATGCGTAGCCCTGGTATAACTTTAGATCGTATTCCTTTATTTGTATGATCTGTTGTAATTACTGCTGTGTTATTGCTTTTATCGCTACCTGTATTAGCCGGTGCTATTACTATACTTTTAACCGATCGTAATCTTAATACTTCTTTTT-----------------------------------------------
#> 3  TACTTTATACTTCATTTTTGGTGCATGAGCTGGTATAGTGGGCACTTCATTGAGTTTATTAATTCGTGCGGAACTAGGGAATCCTGGATCTTTAATTGGTGATGATCAAATTTATAATGTTATTGTTACTGCTCACGCTTTTATTATAATTTTCTTTATGGTTATACCAATCATGATTGGAGGTTTTGGTAATTGATTAGTACCTCTTATATTAGGAGCACCTGATATGGCCTTCCCTCGTATAAATAATATAAGTTTTTGATTATTACCTCCTTCATTAACTTTATTAGTATCAAGTAGTTTAGTTGATATGGGAGCCGGTACAGGCTGAACAGTTTATCCTCCTTTAGCTGCAAATATTGCTCATGGAGGGTCATCCGTTGATTATGCTATTTTTTCTTTACATTTAGCTGGGGTATCTTCTATTTTAGGTGCTGTAAATTTTATTACAACAGTAATTAATATGCGTAGTCCGGGTATAACTTTAGATCGAATTCCTTTATTTGTATGATCTGTTGTAATTACTGCTGTTTTATTATTATTATCACTACCTGTGTTAGCTGGTGCTATTACTATACTTTTAACTGATCGTAATCTTAATACTTCTTTTTTTGATCCTGCTGGAGGGGGTGATCCTATTTTATACCAACATTTA---
#> 4  -----------------------------------TGGTGGGGACCTCCCTCAGTCTTTTAATTCGAGCCGAGCTTGGCAATCCTGGGTCTCTAATTGGTGACGACCAAATTTACAACGTTATTGTCACTGCTCATGCGTTTATCATAATTTTTTTTATAGTGATACCAATTATAATCGGCGGGTTTGGTAATTGGCTTGTTCCGCTCATGTTAGGAGCCCCAGATATGGCATTCCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCCCCTTCGCTCACACTGCTTGTATCAAGCAGTGTTGTTGATGTAGGGGCGGGCACTGGGTGGACCGTATATCCCCCCTTGGCTGCCAACATCGCACATGGGGGTGCTTCTGTGGATTTCGCAATTTTTTCGCTACACCTGGCGGGGGTGTCTTCAATTTTGGGCGCTGTAAACTTCATTACAACTGTAATTAATATGCGCAGCCCAGGAATAACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATTACTGCGGTTTTATTACTACTTTCTCTACCGGTTCTTGCCGGGGCAATTACTATACTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGATCCGGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 5  -----------------------------------TGGTGGGGACCTCCCTCAGTCTTTTAATTCGAGCCGAGCTTGGCAATCCTGGGTCTCTAATTGGTGACGACCAAATTTACAACGTTATTGTCACTGCTCATGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGCGGGTTTGGTAATTGGCTTGTTCCGCTCATGTTAGGAGCCCCAGATATGGCATTTCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCCCCTTCGCTCACACTGCTTGTATCAAGCAGTGTTGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCACATGGGGGTGCTTCTGTGGATTTCGCAATTTTTTCGCTACACCTGGCGGGGGTGTCTTCAATTTTGGGCGCTGTAAACTTCATTACAACTGTAATTAATATGCGCAGCCCAGGAATAACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATTACTGCGGTTTTATTACTACTTTCTCTACCGGTTCTTGCCGGGGCAATTACTATACTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGATCCGGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 6  -----------------------------------TGGTGGGGACCTCCCTCAGTCTTTTAATTCGAGCCGAGCTTGGCAATCCTGGGTCTCTAATTGGTGACGACCAAATTTACAACGTTATTGTCACTGCTCATGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGCGGGTTTGGTAATTGGCTTGTTCCGCTCATGTTAGGAGCCCCAGATATGGCATTCCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCCCCTTCGCTCACACTGCTTGTATCAAGCAGTGTTGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCACATGGGGGTGCTTCTGTGGATTTCGCAATTTTTTCACTACACCTGGCGGGGGTGTCTTCAATTTTGGGCGCTGTAAACTTCATTACAACTGTAATTAATATGCGCAGCCCAGGAATAACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATTACTGCGGTTTTATTACTACTTTCTCTACCGGTTCTTGCCGGGGCAATTACTATACTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGATCCGGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 7  -----------------------------------TGGTGGGGACCTCCCTCAGTCTTTTAATTCGAGCCGAGCTTGGCAATCCTGGGTCTCTAATTGGTGACGACCAAATTTACAACGTTATTGTCACTGCTCATGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGCGGGTTTGGTAATTGGCTTGTTCCGCTCATGTTAGGAGCCCCAGATATGGCATTCCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCCCCTTCGCTCACACTGCTTGTATCAAGCAGTGTTGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCACATGGGGGTGCTTCTGTGGATTTCGCAATTTTTTCGCTACACCTGGCGGGGGTGTCTTCAATTTTGGGCGCTGTAAACTTCATTACAACTGTAATTAATATGCGCAGCCCAGGAATAACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATTACTGCGGTTTTATTACTACTTTCTCTACCGGTTCTTGCCGGGGCAATTACTATACTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGATCCGGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 8  -----------------------------------TGGTGGGGACATCGCTCAGTCTTTTAATTCGAGCCGAGCTTGGTAACCCTGGGTCCTTAATTGGTGACGACCAGATTTACAACGTTATTGTTACTGCTCACGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGGGGGTTTGGTAATTGGCTTGTTCCCCTTATATTAGGGGCCCCCGACATAGCATTTCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCTCCTTCCCTTACACTACTTGTGTCGAGCAGTGTGGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCGCATGGGGGGGCTTCAGTGGATTTCGCAATTTTTTCGTTACACCTGGCGGGGGTGTCTTCAATTTTGGGTGCTGTAAACTTCATTACAACTGTAATTAATATGCGTAGCCCAGGAATGACGTTGGATCGCATACCGCTATTCGTTTGATCTGTTGTAATCACTGCGGTTTTGTTATTGCTTTCTCTCCCGGTTCTCGCGGGGGCGATTACTATGCTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGACCCAGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 9  -----------------------------------TGGTGGGGACATCGCTCAGTCTTTTAATTCGAGCCGAGCTTGGTAACCCTGGGTCCTTAATTGGTGACGACCAGATTTACAACGTTATTGTTACTGCTCACGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGGGGGTTTGGTAATTGGCTTGTTCCACTTATATTAGGGGCCCCCGACATAGCATTCCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCTCCTTCCCTTACACTACTTGTGTCGAGCAGTGTGGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCACATGGGGGGGCTTCAGTGGATTTCGCAATTTTTTCGTTACACCTGGCGGGGGTGTCTTCAATTTTGGGTGCTGTAAACTTCATTACAACTGTAATTAATATGCGTAGCCCAGGAATGACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATCACTGCGGTTTTGTTATTGCTTTCTCTCCCGGTTCTCGCGGGGGCGATTACCATGCTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGACCCAGCGGGGGGAGGAGACCCAATTTTATACCA----------
#> 10 -----------------------------------TGGTGGGGACATCGCTCAGTCTTTTAATTCGAGCCGAGCTTGGTAACCCTGGGTCCTTAATTGGTGACGACCAGATTTACAACGTTATTGTTACTGCTCACGCGTTTATTATAATTTTTTTTATAGTGATACCAATTATAATCGGGGGGTTTGGTAATTGGCTTGTTCCACTTATATTAGGGGCCCCCGACATAGCATTTCCCCGCATAAATAATATAAGTTTCTGGTTGTTACCTCCTTCCCTTACACTACTTGTGTCGAGCAGTGTGGTTGATGTAGGGGCGGGCACTGGGTGGACCGTGTATCCCCCCTTGGCTGCCAACATCGCACATGGGGGGGCTTCAGTGGATTTCGCAATTTTTTCGTTACACCTGGCGGGGGTGTCTTCAATTTTGGGTGCTGTAAACTTCATTACAACTGTAATTAATATGCGTAGCCCAGGAATGACGTTGGATCGCATACCGCTATTCGTTTGGTCTGTTGTAATCACTGCGGTTTTGTTATTGCTTTCTCTCCCGGTTCTCGCGGGGGCGATTACCATGCTGCTAACTGACCGTAATTTAAATACTTCATTTTTTGACCCAGCGGGGGGAGGAGACCCAATTTTATACCA----------
#>        country province_state lat lon
#> 1  Switzerland           <NA>  NA  NA
#> 2  Switzerland           <NA>  NA  NA
#> 3  Switzerland           <NA>  NA  NA
#> 4  Switzerland           <NA>  NA  NA
#> 5  Switzerland           <NA>  NA  NA
#> 6  Switzerland           <NA>  NA  NA
#> 7  Switzerland           <NA>  NA  NA
#> 8  Switzerland           <NA>  NA  NA
#> 9  Switzerland           <NA>  NA  NA
#> 10 Switzerland           <NA>  NA  NA