Title: | Harmonise and Integrate Heterogeneous Areal Data |
---|---|
Description: | Many relevant applications in the environmental and socioeconomic sciences use areal data, such as biodiversity checklists, agricultural statistics, or socioeconomic surveys. For applications that surpass the spatial, temporal or thematic scope of any single data source, data must be integrated from several heterogeneous sources. Inconsistent concepts, definitions, or messy data tables make this a tedious and error-prone process. 'arealDB' tackles those problems and helps the user to build a harmonised database of areal data. Read the paper at Ehrmann, Seppelt & Meyer (2020) <doi:10.1016/j.envsoft.2020.104799>. |
Authors: | Steffen Ehrmann [aut, cre] |
Maintainer: | Steffen Ehrmann <[email protected]> |
License: | GPL-3 |
Version: | 0.9.5 |
Built: | 2025-02-11 22:20:57 UTC |
Source: | https://github.com/luckinet/arealdb |
Allows the user to match concepts with an already existing ontology, without actually writing into the ontology, but instead storing the resulting matching table as csv.
.editMatches( new, topLevel, source = NULL, ontology = NULL, matchDir = NULL, stringdist = TRUE, parentClasses = FALSE, beep = NULL, verbose = TRUE )
new |
|
topLevel |
|
source |
|
ontology |
|
matchDir |
|
stringdist |
|
parentClasses |
|
beep |
|
verbose |
|
In order to match new concepts into an already existing ontology, it may become necessary to carry out manual matches of the new concepts with already harmonised concepts, for example, when the new concepts are described with terms that are not yet in the ontology. This function puts together a table in which the user edits matches by hand. With the argument verbose = TRUE, detailed information about the edit process is shown to the user. After defining matches, and even if not all necessary matches are finished, the function stores a specific "matching table" with the name match_SOURCE.csv in the respective directory (matchDir), from where work can be picked up and continued at another time.
Fuzzy matching is carried out and matches with 0, 1 or 2 differing characters are presented in a respective column.
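The idea of sorting candidate matches by the number of differing characters can be sketched in base R. This is only an illustration, assuming generalised Levenshtein distance as computed by utils::adist(); the package may compute distances differently, and the concept labels, the helper labels_at() and the column names dist_0 to dist_2 are hypothetical.

```r
# Hypothetical new concepts and ontology labels, for illustration only.
new_concepts <- c("barley", "maize", "weat", "soy bean")
ontology_labels <- c("barley", "maize", "wheat", "soybean", "rye")

# utils::adist() returns a matrix of generalised Levenshtein distances:
# one row per new concept, one column per ontology label.
dists <- utils::adist(new_concepts, ontology_labels)

# For each new concept, collect the ontology labels at edit distance k.
# Concepts without a label at that distance get an empty string.
labels_at <- function(k) {
  apply(dists, 1, function(d) paste(ontology_labels[d == k], collapse = " | "))
}

# One column per distance, similar in spirit to the table described above.
candidates <- data.frame(
  new = new_concepts,
  dist_0 = labels_at(0),
  dist_1 = labels_at(1),
  dist_2 = labels_at(2),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(candidates)
```

In this sketch "weat" and "soy bean" have no exact match but each has one candidate at distance 1, which is the kind of suggestion a user would confirm or reject by hand.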
A table that contains all new matches or, if all of the new concepts were already in the ontology, a table of the already successful matches.
(internal function not for user interaction)
.getColTypes(input = NULL)
input |
data.frame |
This function takes a table and replaces the values of various columns with harmonised values listed in the project-specific gazetteer.
.matchOntology( table = NULL, columns = NULL, dataseries = NULL, ontology = NULL, colsAsClass = TRUE, groupMatches = FALSE, stringdist = TRUE, strictMatch = FALSE, parentClasses = FALSE, beep = NULL, verbose = FALSE )
table |
|
columns |
|
dataseries |
|
ontology |
|
colsAsClass |
|
groupMatches |
|
stringdist |
|
strictMatch |
|
parentClasses |
|
beep |
|
verbose |
|
Returns a table that resembles the input table where the target columns were translated according to the provided ontology.
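Translating a column according to a matching table can be pictured as a simple lookup. The following base R sketch is not the package's implementation; the matching table, its column names external and harmonised, and the helper translate_column() are hypothetical and only illustrate the input/output relationship described above.

```r
# Hypothetical matching table: external label -> harmonised label.
matches <- data.frame(
  external = c("Eesti", "Estland", "Lettland"),
  harmonised = c("Estonia", "Estonia", "Latvia"),
  stringsAsFactors = FALSE
)

# Hypothetical input table with one column of territorial names.
input <- data.frame(
  al1 = c("Eesti", "Lettland", "Atlantis"),
  year = c(2000, 2000, 2001),
  stringsAsFactors = FALSE
)

# Replace the values of one column by their harmonised counterparts.
# Values without a match become NA, so they are easy to spot afterwards.
translate_column <- function(table, column, matches) {
  idx <- match(table[[column]], matches$external)
  table[[column]] <- matches$harmonised[idx]
  table
}

output <- translate_column(input, "al1", matches)
print(output)
```

The output has the same shape as the input; only the target column is translated, and the unmatched value is left as NA for manual follow-up.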
This function takes a table (spatial) and updates all territorial concepts in the provided gazetteer.
.updateOntology( table = NULL, threshold = NULL, dataseries = NULL, ontology = NULL )
table |
|
threshold |
|
dataseries |
|
ontology |
onto |
called for its side-effect of updating a gazetteer
Archive the data from an areal database
adb_archive(pattern = NULL, variables = NULL, compress = FALSE, outPath = NULL)
pattern |
|
variables |
|
compress |
|
outPath |
|
This function prepares and packages the data into an archivable form. The archive contains GeoPackage files for geometries and csv files for all tables, such as inventory, matching and thematic data tables.
no return value, called for the side-effect of creating a database archive.
Backup the current state of an areal database
adb_backup()
This function creates a tag that is composed of the version and the date, appends it to all stage3 files (tables and geometries), the inventory and the ontology/gazetteer files and stores them in the backup folder of the current areal database.
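A tag built from version and date, appended to a file name, might look like the base R sketch below. The tag layout "<version>_<YYYYMMDD>", the helpers make_tag() and tag_file(), and all paths are assumptions made for illustration; the format actually used by adb_backup() may differ.

```r
# Hypothetical tag layout "<version>_<YYYYMMDD>"; the real format may differ.
make_tag <- function(version, date = Sys.Date()) {
  paste0(version, "_", format(as.Date(date), "%Y%m%d"))
}

# Insert the tag between the base name of a file and its extension.
tag_file <- function(file, tag) {
  ext <- tools::file_ext(file)
  base <- tools::file_path_sans_ext(basename(file))
  paste0(base, "_", tag, ".", ext)
}

# A throwaway file and backup folder inside the session's temporary directory.
backup_dir <- file.path(tempdir(), "backup_sketch")
dir.create(backup_dir, showWarnings = FALSE)
src <- file.path(tempdir(), "Estonia.csv")
write.csv(data.frame(al1 = "Estonia", harvested = 1), src, row.names = FALSE)

# Copy the file into the backup folder under its tagged name.
tag <- make_tag("1.0.0", "2025-02-11")
dest <- file.path(backup_dir, tag_file(src, tag))
file.copy(src, dest, overwrite = TRUE)

basename(dest)
```

Keeping the tag in the file name means that several backups of the same file can sit side by side in one folder.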
No return value, called for the side effect of saving the inventory, the stage3 files and modified ontology/gazetteer into the backup directory.
work in progress, not yet useable
adb_diagnose( territory = NULL, concept = NULL, variable = NULL, level = NULL, year = NULL )
territory |
description |
concept |
description |
variable |
description |
level |
description |
year |
description |
This function helps to set up an example database up to a certain step.
adb_example(path = NULL, until = NULL, verbose = FALSE)
path |
|
until |
|
verbose |
|
Setting up a database with an R-based tool can appear to be cumbersome and too complex and thus intimidating. By creating an example database, this function allows interested users to learn step by step how to build a database of areal data. Moreover, all functions in this package contain verbose information and ask for information that would be missing or lead to an inconsistent database, before a failure renders hours of work useless.
No return value, called for the side effect of creating an example database at the specified path.
if(dev.interactive()){
  # to build the full example database
  adb_example(path = paste0(tempdir(), "/newDB"))

  # to make the example database until a certain step
  adb_example(path = paste0(tempdir(), "/newDB"), until = "regDataseries")
}
Initiate a geospatial database or register a database that exists at the root path.
adb_init( root, version, author, licence, ontology, gazetteer = NULL, top = NULL, staged = TRUE )
root |
|
version |
|
author |
|
licence |
|
ontology |
|
gazetteer |
|
top |
|
staged |
|
This is the first function that is run in a project, as it initiates the areal database by creating the default sub-directories and initial inventory tables. When a database has already been set up, this function is used to register that path in the options of the current R session.
No return value, called for the side effect of creating the directory structure of the new areal database and tables that contain the database metadata.
adb_init(root = paste0(tempdir(), "/newDB"),
         version = "1.0.0",
         licence = "CC-BY-0.4",
         author = list(cre = "Gordon Freeman", aut = "Alyx Vance", ctb = "The G-Man"),
         gazetteer = paste0(tempdir(), "/newDB/territories.rds"),
         top = "al1",
         ontology = list(var = paste0(tempdir(), "/newDB/ontology.rds")))

getOption("adb_path"); getOption("gazetteer_path")
Load the inventory of the currently active areal database
adb_inventory(type = NULL)
type |
|
returns the table selected in type
Load the metadata from an areal database
adb_metadata()
Load the currently active ontology
adb_ontology(..., type = "ontology")
... |
combination of column name in the ontology and value to filter
that column by to build a tree of the concepts nested into it; see
|
type |
|
returns a tidy table of an ontology or gazetteer that is used in an areal database.
Extract database contents
adb_querry( territory = NULL, concept = NULL, variable = NULL, level = NULL, year = NULL )
territory |
'character(.) |
concept |
description |
variable |
description |
level |
description |
year |
description |
returns ...
if(dev.interactive()){
  adb_example(path = paste0(tempdir(), "/newDB"))

  adb_querry(territory = list(al1 = "a_nation"),
             concept = list(commodity = "barley"),
             variable = "harvested")
}
Reset an areal database to its unfilled state
adb_reset(what = "all")
what |
|
no return value, called for its side effect of reorganising an areal database into a state where no reg* or norm* functions have been run
Restore the database from a backup
adb_restore(version = NULL, date = NULL)
version |
'character(1) |
date |
|
This function searches for files that have the version and date tag, as it was defined in a previous run of adb_backup, to restore them to their original folders. This function overwrites by default, so use with care.
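Finding tagged files and copying them back over the live files could be sketched as follows in base R. The tag layout "_<version>_<YYYYMMDD>", the folder names and the helper restore_sketch() are hypothetical; the sketch only illustrates why restoring overwrites the current state.

```r
# Throwaway folders inside the session's temporary directory.
backup_dir <- file.path(tempdir(), "restore_sketch_backup")
target_dir <- file.path(tempdir(), "restore_sketch_stage3")
dir.create(backup_dir, showWarnings = FALSE)
dir.create(target_dir, showWarnings = FALSE)

# Two hypothetical backups of the same table, tagged "_<version>_<YYYYMMDD>".
writeLines("old", file.path(backup_dir, "Estonia_1.0.0_20250101.csv"))
writeLines("new", file.path(backup_dir, "Estonia_1.1.0_20250211.csv"))

# The live file that a restore would overwrite.
writeLines("current", file.path(target_dir, "Estonia.csv"))

# Copy every backup carrying the requested tag back under its untagged name.
restore_sketch <- function(version, date) {
  tag <- paste0("_", version, "_", date)
  files <- list.files(backup_dir)
  hits <- files[grepl(tag, files, fixed = TRUE)]
  originals <- sub(tag, "", hits, fixed = TRUE)
  for (i in seq_along(hits)) {
    file.copy(file.path(backup_dir, hits[i]),
              file.path(target_dir, originals[i]),
              overwrite = TRUE)
  }
  originals
}

restored <- restore_sketch(version = "1.0.0", date = "20250101")
readLines(file.path(target_dir, "Estonia.csv"))
```

After the call, the content "current" is gone and replaced by the content of the selected backup, which is why a fresh backup before restoring is a sensible precaution.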
No return value, called for the side effect of restoring files that were previously stored in a backup.
Load the schemas of the currently active areal database
adb_schemas(pattern = NULL)
pattern |
|
returns a list of schema descriptions
Load the translation tables of the currently active areal database
adb_translations(type = NULL, dataseries = NULL)
type |
|
dataseries |
|
returns the selected translation table
Harmonise and integrate geometries into a standardised format
normGeometry( input = NULL, pattern = NULL, query = NULL, thresh = 10, beep = NULL, simplify = FALSE, stringdist = TRUE, strictMatch = FALSE, verbose = FALSE )
input |
|
pattern |
|
query |
|
thresh |
|
beep |
|
simplify |
|
stringdist |
|
strictMatch |
|
verbose |
|
To normalise geometries, this function proceeds as follows:
1. Read in input and extract initial metadata from the file name.
2. In case filters are set, the new geometry is filtered by those.
3. The territorial names are matched with the gazetteer to harmonise new territorial names (at this step, the function might ask the user to edit the file 'matching.csv' to align new names with already harmonised names).
4. Loop through every nation potentially included in the file that shall be processed and carry out the following steps:
   - In case the geometries are provided as a list of simple feature POLYGONS, they are dissolved into a single MULTIPOLYGON per main polygon.
   - In case the nation to which a geometry belongs has not yet been created at stage three, the following steps are carried out:
     1. Store the current geometry as basis of the respective level (the user needs to make sure that all following levels of the same dataseries are perfectly nested into those parent territories, for example by using the GADM dataset).
   - In case the nation to which the geometry belongs has already been created, the following steps are carried out:
     1. Check whether the new geometries have the same coordinate reference system as the already existing database and re-project the new geometries if this is not the case.
     2. Check whether all new geometries are already exactly matched spatially and stop if that is the case.
     3. Check whether the new geometries are all within the already defined parents, and save those that are not as a new geometry.
     4. Calculate spatial overlap and distinguish the geometries into those that overlap with more and those with less than thresh.
     5. For all units that did match, copy gazID from the geometries they overlap.
     6. For all units that did not match, rebuild metadata and a new gazID.
   - Store the processed geometry at stage three.
5. Move the geometry to the folder '/processed', if it is fully processed.
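The overlap step can be illustrated with a toy example in base R. Axis-aligned rectangles stand in for real polygons here, and the sketch assumes that thresh is a percentage of the new unit's area; both are simplifications made for illustration, and the helpers rect_area() and overlap_pct() are hypothetical rather than part of the package.

```r
# Rectangles are given as named vectors: xmin, ymin, xmax, ymax.
rect_area <- function(r) {
  (r[["xmax"]] - r[["xmin"]]) * (r[["ymax"]] - r[["ymin"]])
}

# Share of the new rectangle (in percent) covered by the existing one.
overlap_pct <- function(new, old) {
  w <- min(new[["xmax"]], old[["xmax"]]) - max(new[["xmin"]], old[["xmin"]])
  h <- min(new[["ymax"]], old[["ymax"]]) - max(new[["ymin"]], old[["ymin"]])
  intersection <- max(w, 0) * max(h, 0)
  100 * intersection / rect_area(new)
}

# One unit that already exists in the database and three new candidates.
existing <- c(xmin = 0, ymin = 0, xmax = 10, ymax = 10)
new_units <- list(
  same_unit = c(xmin = 0, ymin = 0, xmax = 10, ymax = 9),
  sliver = c(xmin = 9.5, ymin = 0, xmax = 20, ymax = 10),
  elsewhere = c(xmin = 30, ymin = 30, xmax = 40, ymax = 40)
)

thresh <- 10  # default value of the thresh argument

# Units above the threshold count as matches, all others as new units.
pct <- vapply(new_units, overlap_pct, numeric(1), old = existing)
matched <- pct > thresh

print(round(pct, 1))
print(matched)
```

Only the first candidate exceeds the threshold; the sliver that merely touches the existing unit along its border is treated as a new unit, which is the purpose of a tolerance like thresh.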
This function harmonises and integrates so far unprocessed geometries at stage two into stage three of the geospatial database. It produces for each main polygon (e.g. nation) in the registered geometries a spatial file of the specified file-type.
Other normalise functions:
normTable()
if(dev.interactive()){
  library(sf)

  # build the example database
  adb_example(until = "regGeometry", path = tempdir())

  # normalise all geometries ...
  normGeometry(pattern = "estonia")

  # ... and check the result
  st_layers(paste0(tempdir(), "/geometries/stage3/Estonia.gpkg"))
  output <- st_read(paste0(tempdir(), "/geometries/stage3/Estonia.gpkg"))
}
Harmonise and integrate data tables into standardised format
normTable( input = NULL, pattern = NULL, query = NULL, ontoMatch = NULL, beep = NULL, verbose = FALSE )
input |
|
pattern |
|
query |
|
ontoMatch |
|
beep |
|
verbose |
|
To normalise data tables, this function proceeds as follows:
1. Read in input and extract initial metadata from the file name.
2. Employ the function tabshiftr::reorganise() to reshape input according to the respective schema description.
3. The territorial names are matched with the gazetteer to harmonise new territorial names (at this step, the function might ask the user to edit the file 'matching.csv' to align new names with already harmonised names).
4. Harmonise territorial unit names.
5. Store the processed data table at stage three.
This function harmonises and integrates so far unprocessed data tables at stage two into stage three of the areal database. It produces for each main polygon (e.g. nation) in the registered data tables a file that includes all thematic areal data.
Other normalise functions:
normGeometry()
if(dev.interactive()){
  # build the example database
  adb_example(until = "normGeometry", path = tempdir())

  # normalise all available data tables ...
  normTable()

  # ... and check the result
  output <- readRDS(paste0(tempdir(), "/tables/stage3/Estonia.rds"))
}
This function registers a new dataseries of either geometries or areal data into the geospatial database. The entry contains the name and relevant meta-data of a dataseries to enable provenance tracking and reproducibility.
regDataseries( name = NULL, description = NULL, homepage = NULL, version = NULL, licence_link = NULL, reference = NULL, notes = NULL, overwrite = FALSE )
name |
|
description |
|
homepage |
|
version |
|
licence_link |
|
reference |
|
notes |
|
overwrite |
|
Returns a tibble of the new entry that is appended to 'inv_dataseries.csv'.
Other register functions:
regGeometry(), regTable()
if(dev.interactive()){
  # start the example database
  adb_example(until = "match_gazetteer", path = tempdir())

  regDataseries(name = "gadm",
                description = "Database of Global Administrative Areas",
                version = "3.6",
                homepage = "https://gadm.org/index.html",
                licence_link = "https://gadm.org/license.html")
}
This function registers a new geometry of territorial units into the geospatial database.
regGeometry( ..., subset = NULL, gSeries = NULL, label = NULL, ancillary = NULL, layer = NULL, archive = NULL, archiveLink = NULL, downloadDate = NULL, updateFrequency = NULL, notes = NULL, overwrite = FALSE )
... |
|
subset |
|
gSeries |
|
label |
|
ancillary |
|
layer |
|
archive |
|
archiveLink |
|
downloadDate |
|
updateFrequency |
|
notes |
|
overwrite |
|
When processing geometries to which areal data shall be linked, carry out the following steps:
1. Determine the main territory (such as a nation, or any other polygon), a subset (if applicable), the dataseries of the geometry and the ontology label, and provide them as arguments to this function.
2. Run the function.
3. Export the shapefile with the following properties:
   - Format: GeoPackage
   - File name: What is provided as message by this function
   - CRS: EPSG:4326 - WGS 84
   - make sure that 'all fields are exported'
4. Confirm that you have saved the file.
Returns a tibble of the entry that is appended to 'inv_geometries.csv'.
Other register functions:
regDataseries(), regTable()
if(dev.interactive()){
  # build the example database
  adb_example(until = "regDataseries", path = tempdir())

  # The GADM dataset comes as *.7z archive
  regGeometry(gSeries = "gadm",
              label = list(al1 = "NAME_0"),
              layer = "example_geom1",
              archive = "example_geom.7z|example_geom1.gpkg",
              archiveLink = "https://gadm.org/",
              nextUpdate = "2019-10-01",
              updateFrequency = "quarterly")

  # The second administrative level in GADM contains names in the columns
  # NAME_0 and NAME_1
  regGeometry(gSeries = "gadm",
              label = list(al1 = "NAME_0", al2 = "NAME_1"),
              ancillary = list(name_lcl = "VARNAME_1", code = "GID_1", type = "TYPE_1"),
              layer = "example_geom2",
              archive = "example_geom.7z|example_geom2.gpkg",
              archiveLink = "https://gadm.org/",
              nextUpdate = "2019-10-01",
              updateFrequency = "quarterly")
}
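The archive values in the example appear to combine an archive name and a file inside that archive, separated by "|". That reading is an assumption based on the example strings only. Under this assumption, the two parts could be separated in base R as follows; the variable names are hypothetical.

```r
# Archive specification copied from the example above.
archive <- "example_geom.7z|example_geom1.gpkg"

# Split at the literal "|" (fixed = TRUE avoids treating it as regex "or").
parts <- strsplit(archive, "|", fixed = TRUE)[[1]]

# First part: the archive; second part (if present): the file inside it.
archive_file <- parts[1]
inner_file <- if (length(parts) > 1) parts[2] else NA_character_

archive_file
inner_file
```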
This function registers a new areal data table into the geospatial database.
regTable( ..., subset = NULL, dSeries = NULL, gSeries = NULL, label = NULL, begin = NULL, end = NULL, schema = NULL, archive = NULL, archiveLink = NULL, downloadDate = NULL, updateFrequency = NULL, metadataLink = NULL, metadataPath = NULL, notes = NULL, diagnose = FALSE, overwrite = FALSE )
... |
|
subset |
|
dSeries |
|
gSeries |
|
label |
|
begin |
|
end |
|
schema |
|
archive |
|
archiveLink |
|
downloadDate |
|
updateFrequency |
|
metadataLink |
|
metadataPath |
|
notes |
|
diagnose |
|
overwrite |
|
When processing areal data tables, carry out the following steps:
1. Determine the main territory (such as a nation, or any other polygon), a subset (if applicable), the ontology label and the dataseries of the areal data and of the geometry, and provide them as arguments to this function.
2. Provide a begin and end date for the areal data.
3. Run the function.
4. (Re)Save the table with the following properties:
   - Format: csv
   - Encoding: UTF-8
   - File name: What is provided as message by this function
   - make sure that the file is not modified or reshaped. This will happen during data normalisation via the schema description, which expects the original table.
5. Confirm that you have saved the file.
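If the original table is already loaded in R, re-saving it as UTF-8 encoded csv without changing its layout could look like this base R sketch. The table content and the file name are placeholders; in practice the file name is the one that regTable() reports in its message.

```r
# Hypothetical raw table as downloaded; its layout is deliberately left as is.
raw <- data.frame(
  territory = c("Estonia", "Estonia"),
  year = c(1990, 1991),
  barley = c(12.5, 13.1),
  stringsAsFactors = FALSE
)

# Placeholder file name; use the name printed by regTable() instead.
target <- file.path(tempdir(), "placeholder_table_name.csv")

# Write as csv with UTF-8 encoding and without an extra row-name column.
write.csv(raw, file = target, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to confirm that columns and rows are unchanged.
check <- read.csv(target, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
print(check)
```

Setting row.names = FALSE matters here because an added row-name column would shift all column positions that the schema description refers to.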
Every areal data dataseries (dSeries) may come as a slight permutation of a particular table arrangement. The function normTable expects internally a schema description (a list that describes the position of the data components) for each data table, which is saved as paste0("meta_", dSeries, TAB_NUMBER). See package tabshiftr.
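The naming rule for schema descriptions can be tried out in base R. In this sketch the schema is a plain list and the store is an ordinary environment, both hypothetical stand-ins; a real schema description is built with tabshiftr and stored by the package itself.

```r
# Hypothetical dataseries name and table number.
dSeries <- "madeUp"
tab_number <- 1

# Name under which the schema description of this table would be stored.
key <- paste0("meta_", dSeries, tab_number)

# Stand-in schema: a plain list of column positions, not a tabshiftr object.
placeholder_schema <- list(al1 = 1, year = 2, commodities = 3, harvested = 4)

# Store and retrieve the schema by its constructed name.
schemas <- new.env()
assign(key, placeholder_schema, envir = schemas)
retrieved <- get(key, envir = schemas)

key
retrieved$harvested
```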
Returns a tibble of the entry that is appended to 'inv_tables.csv' in case update = TRUE.
Other register functions:
regDataseries(), regGeometry()
if(dev.interactive()){
  # build the example database
  adb_example(until = "regGeometry", path = tempdir())

  # the schema description for this table
  library(tabshiftr)

  schema_madeUp <-
    setIDVar(name = "al1", columns = 1) %>%
    setIDVar(name = "year", columns = 2) %>%
    setIDVar(name = "commodities", columns = 3) %>%
    setObsVar(name = "harvested", factor = 1, columns = 4) %>%
    setObsVar(name = "production", factor = 1, columns = 5)

  regTable(nation = "Estonia",
           subset = "barleyMaize",
           label = "al1",
           dSeries = "madeUp",
           gSeries = "gadm",
           begin = 1990,
           end = 2017,
           schema = schema_madeUp,
           archive = "example_table.7z|example_table1.csv",
           archiveLink = "...",
           nextUpdate = "2024-10-01",
           updateFrequency = "quarterly",
           metadataLink = "...",
           metadataPath = "my/local/path")
}
gazetteer
An ontology of territory names (gazetteer)
territories
object of class onto for the example territories used in adb_example.