vignettes/ncbi_bold.Rmd
ncbi_bold.Rmd
This short vignette will show you how you can use the functions of the refdb package to download sequence data from two popular repositories: NCBI Genbank and BOLD. We will also cover how these data can be combined into a single reference database. For an introduction to the package refdb see the vignette Introduction to the refdb package.
We start by downloading sequence data from NCBI. The refdb packages uses the rentrez package (Winter 2017) to interface with NCBI servers.
silo_ncbi <- refdb_import_NCBI("Silo COI")
#> Downloading 72 sequences from NCBI...
#>
|
| | 0%
|
|=================================================================================================================================| 100%
Data already have a set of fields defined.
refdb_get_fields(silo_ncbi)
#> source: source
#> id: id
#> taxonomy:
#> superkingdom: superkingdom
#> kingdom: kingdom
#> phylum: phylum
#> subphylum: subphylum
#> class: class
#> subclass: subclass
#> infraclass: infraclass
#> order: order
#> suborder: suborder
#> infraorder: infraorder
#> superfamily: superfamily
#> family: family
#> genus: genus
#> species: species
#> sequence: sequence
#> marker: gene
#> latitude: latitude
#> longitude: longitude
Now let’s download data for another taxon from the BOLD database.
goera_bold <- refdb_import_BOLD(taxon = "Goera pilosa", ncbi_taxo = FALSE)
#> Downloading 30 sequences from BOLD...
You may have noticed that the search interface is a bit different. This is because here we rely on another package (bold, Chamberlain 2020) to download the data. You can check the manual of refdb_import_BOLD
to see the different arguments available. For the purpose of this vignette we also have turned off the automatic conversion to the NCBI taxonomy.
Similarly to the NCBI data, data downloaded from BOLD have a set of fields defined automatically.
refdb_get_fields(goera_bold)
#> source: source
#> id: sequenceID
#> taxonomy:
#> phylum: phylum_name
#> class: class_name
#> order: order_name
#> family: family_name
#> subfamily: subfamily_name
#> genus: genus_name
#> species: species_name
#> sequence: nucleotides
#> marker: markercode
#> latitude: lat
#> longitude: lon
We can now use refdb_merge
to merge the two databases. To make things clearer we will keep only the first three sequences of goera_bold
.
# Extract the first three records of goera_bold
goera_bold <- goera_bold[1:3, ]
# Merging goera_bold and silo_ncbi into one database
bold_ncbi <- refdb_merge(goera_bold, silo_ncbi)
Note that you can merge more than two database with refdb_merge
.
Now, let’s take a look at the columns genus_name
, species_name
and nucleotides
of the merged database.
bold_ncbi[, c("source", "genus_name", "species_name", "nucleotides")]
#> # A tibble: 75 x 4
#> source genus_name species_name nucleotides
#> <chr> <chr> <chr> <DNA>
#> 1 BOLD Goera Goera pilosa AACAATTTATTTTATTTTTGGTATTTGATCAGGAATAGTCGGAACGTCCCTAAGTATAATTATTCGAATTGAATTAGGAACAGCTAATTCTTTAATTAAAAAT…
#> 2 BOLD Goera Goera pilosa AACAATTTATTTTATTTTTGGTATTTGATCAGGAATAGTCGGAACATCCCTAAGTATAATTATTCGAATTGAATTAGGAACAGCTAATTCTTTAATTAAAAAT…
#> 3 BOLD Goera Goera pilosa AACAATTTATTTTATTTTTGGTATTTGATCAGGAATAGTCGGAACGTCCCTAAGTATAATTATTCGAATTGAATTAGGAACAGCTAATTCTTTAATTAAAAAT…
#> 4 NCBI Silo Silo pallipes AACTATTTATTTTATTTTTGGAATTTGATCTGGAATAGTAGGAACTTCTCTAAGAATAATTATTCGAACTGAGCTAGGGACTGCTGAATCTTTAATTAAAAAT…
#> 5 NCBI Silo Silo pallipes AACTATTTATTTTATTTTTGGAATTTGATCTGGAATAGTAGGAACTTCTCTAAGAATAATTATTCGAACTGAGCTAGGGACTGCTGAATCTTTAATTAAAAAT…
#> 6 NCBI Silo Silo pallipes AACTATTTATTTTATTTTTGGAATTTGATCTGGAATAGTAGGAACTTCTCTAAGAATAATTATTCGAACTGAGCTAGGGACTGCTGAATCTTTAATTAAAAAT…
#> 7 NCBI Silo Silo pallipes AACTATTTATTTTATTTTTGGAATTTGATCTGGAATAGTAGGAACTTCTCTAAGAATAATTATTCGAACTGAGCTAGGGACTGCTGAATCTTTAATTAAAAAT…
#> 8 NCBI Silo Silo pallipes AACTATTTATTTTATTTTTGGAATTTGATCTGGAATAGTAGGAACTTCTCTAAGAATAATTATTCGAACTGAGCTAGGGACTGCTGAATCTTTAATTAAAAAT…
#> 9 NCBI Silo Silo pallipes AACTATTTATTTTATTTTTGGAATTTGATCTGGAATAGTAGGAACTTCTCTAAGAATAATTATTCGAACTGAGCTAGGGACTGCTGAATCTTTAATTAAAAAT…
#> 10 NCBI Silo Silo pallipes AACTATTTATTTTATTTTTGGAATTTGATCTGGAATAGTAGGAACTTCTCTAAGAATAATTATTCGAACTGAGCTAGGGACTGCTGAATCTTTAATTAAAAAT…
#> # … with 65 more rows
Despite those columns having different names in the two databases, they were merged successfully because they were associated to the same field. The names used in the merged database are inherited from the database used as first argument in the merge functions (here goera_bold
).
By default only the columns that are associated to a field in one of the two databases are kept. Of course, if there is no equivalent in the other database, NAs are produced. See the columns subfamily_name
(subfamily is a taxonomic rank which exist only BOLD):
bold_ncbi[, c("source", "subfamily_name")]
#> # A tibble: 75 x 2
#> source subfamily_name
#> <chr> <chr>
#> 1 BOLD Goerinae
#> 2 BOLD Goerinae
#> 3 BOLD Goerinae
#> 4 NCBI <NA>
#> 5 NCBI <NA>
#> 6 NCBI <NA>
#> 7 NCBI <NA>
#> 8 NCBI <NA>
#> 9 NCBI <NA>
#> 10 NCBI <NA>
#> # … with 65 more rows
Alternatively, we can request the function refdb_merge
to return only the fields shared by all the reference databases.
bold_ncbi <- refdb_merge(goera_bold, silo_ncbi, keep = "fields_shared")
colnames(bold_ncbi)
#> [1] "source" "sequenceID" "phylum_name" "class_name" "order_name" "family_name" "genus_name" "species_name" "nucleotides"
#> [10] "markercode" "lat" "lon"
NCBI and BOLD share several taxonomic ranks (phylum, class, order, family, genus, and species) that are a good basis for sequence taxonomic classification. But we may also use the default behavior of refdb_import_BOLD
to force BOLD data to conform to the NCBI taxonomic system. So let’s download these data again.
goera_bold <- refdb_import_BOLD(taxon = "Goera pilosa", ncbi_taxo = TRUE)
#> Downloading 30 sequences from BOLD...
#>
Processing: Goera
goera_bold <- goera_bold[1:3, ]
refdb_get_fields(silo_ncbi)
#> source: source
#> id: id
#> taxonomy:
#> superkingdom: superkingdom
#> kingdom: kingdom
#> phylum: phylum
#> subphylum: subphylum
#> class: class
#> subclass: subclass
#> infraclass: infraclass
#> order: order
#> suborder: suborder
#> infraorder: infraorder
#> superfamily: superfamily
#> family: family
#> genus: genus
#> species: species
#> sequence: sequence
#> marker: gene
#> latitude: latitude
#> longitude: longitude
refdb_get_fields(goera_bold)
#> source: source
#> id: sequenceID
#> taxonomy:
#> superkingdom: superkingdom
#> kingdom: kingdom
#> phylum: phylum
#> subphylum: subphylum
#> class: class
#> subclass: subclass
#> infraclass: infraclass
#> order: order
#> suborder: suborder
#> infraorder: infraorder
#> superfamily: superfamily
#> family: family
#> genus: genus
#> species: species_name
#> sequence: nucleotides
#> marker: markercode
#> latitude: lat
#> longitude: lon
Now we can observe many more matching taxonomic fields between the two database, which makes the merge operation much more straightforward. Note that you can operate NCBI taxonomic conversion with data from other sources using the function refdb_set_ncbitax
.
Winter, D. J. (2017) rentrez: an R package for the NCBI eUtils API The R Journal 9(2):520-526
Chamberlain, S. (2020). bold: Interface to Bold Systems API. R package version 1.1.0. https://CRAN.R-project.org/package=bold