Create an ontology

library(ontologics)
library(dplyr, warn.conflicts = FALSE)
library(knitr)  # provides kable(), used below to print tables

Work with an ontology starts either by reading it in from an existing database or by creating a new ontology from scratch.

Although this package is still under development, it can already read an ontology from an *.rds file (a format optimized for use within R) and write to formats that are useful for triplestores or the semantic web. This vignette focuses on the basic building blocks for creating a new ontology. More information is available elsewhere in the package documentation on how to map new concepts from external ontologies, and on how to export an ontology so that it’s interoperable with the semantic web.

An existing ontology

# read in example ontology
crops <- load_ontology(path = system.file("extdata", "crops.rds", package = "ontologics"))

crops   # ... has a pretty show-method
#>   sources : 1
#>     -> 'harmonised' (73)
#> 
#>   classes : 3 
#>    ∟ group    20   Groups of crop or livestock commoditi...
#>     ∟ class   53   Classes of crop or livestock commodi...
#>      ∟ crop    0   Crop or livestock commodities
#> 
#>   top concepts: 73 
#>     -> group: 'CEREALS' (10), 'FRUIT' (8), 'VEGETABLES' (6), 'UNGULATES' (5), 'BIOENERGY CROPS' (4), ...
#>     -> class: 'Bioenergy herbaceous' (20), 'Barley' (20), 'Fibre crops' (20), 'Flower herbs' (20), 'Grass crops' (20), ...
#>     -> crop:

The onto class is an S4 class with the three slots @sources, @classes and @concepts, each of which is reflected by an entry in the show-method. Classes in an ontology often have a hierarchical order, but this is not obligatory. In any case, the show-method lists the first three levels of the hierarchical structure, together with the number of concepts at each level and the description of that level. It also shows the five most frequent concepts, together with a visual representation of the frequency distribution of all concepts at the first three levels.

Each of the three main slots has a function that adds new items to it (new_source(), new_class() and new_concept()), and an additional function, new_mapping(), creates mappings between your focal ontology and any external ontology. The vignette Ontology database description gives more detail on the architecture of the onto class.
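Slots that are accessed with @ are a feature of R's S4 object system. The following is a minimal sketch of a class with three slots and a show-method, written with the methods package only. The class name toy_onto, the slot types and the printed layout are assumptions made for illustration; the real onto class is more elaborate.

```r
library(methods)

# Hypothetical stand-in for the package's class; names and slot types are assumptions.
setClass(
  "toy_onto",
  slots = c(sources = "data.frame", classes = "data.frame", concepts = "data.frame")
)

# A show-method that prints one summary line per slot.
setMethod("show", "toy_onto", function(object) {
  cat("  sources :", nrow(object@sources), "\n")
  cat("  classes :", nrow(object@classes), "\n")
  cat("  concepts:", nrow(object@concepts), "\n")
})

toy <- new(
  "toy_onto",
  sources  = data.frame(id = 1L, label = "harmonised"),
  classes  = data.frame(id = ".xx", label = "landcover"),
  concepts = data.frame(id = c(".01", ".02"), label = c("Urban fabric", "Forests"))
)

isS4(toy)        # TRUE
slotNames(toy)   # the three slots, each accessible as toy@<slot>
toy              # auto-printing dispatches to the show-method
```

Typing the object's name at the console is enough to trigger the show-method, which is why simply printing an ontology gives the summary shown earlier.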

New ontology

A new ontology is built by calling start_ontology(). This requires some metadata, which is stored in the ontology and serves to link it properly to other linked open data.

lulc <- start_ontology(name = "land_surface_properties",
                       version = "0.0.1",
                       path = tempdir(), 
                       code = ".xx", 
                       description = "showcase of the ontologics R-package", 
                       homepage = "https://github.com/luckinet/ontologics", 
                       license = "CC-BY-4.0")

lulc # nothing included so far
#>   sources : 1
#>     -> 'harmonised' (0)
#> 
#>   classes : 0 
#> 
#>   top concepts: 0

This information is stored in the @sources slot, just like any other external data source. It is recommended to always set the code for building IDs with a leading symbol that can’t be transformed into a numeric/integer value. This avoids problems when the ontology is opened in a spreadsheet program, which may perform this transformation automatically without asking or informing the author.
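The kind of silent conversion meant here can be imitated in base R with type.convert(), which guesses column types much like a spreadsheet import does. This is only a sketch of the problem: the underscore prefix is a hypothetical choice for illustration, not the package's convention, and actual spreadsheet programs apply their own rules.

```r
# IDs made of digits only: type guessing turns them into integers,
# so the leading zeros are lost
plain <- c("01", "02", "10")
converted_plain <- type.convert(plain, as.is = TRUE)
class(converted_plain)

# The same IDs with a leading non-numeric symbol (hypothetical "_" prefix)
# cannot be read as numbers and stay character strings
prefixed <- paste0("_", plain)
converted_prefixed <- type.convert(prefixed, as.is = TRUE)
class(converted_prefixed)
```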

kable(lulc@sources)
| id | label | version | date | description | homepage | license | notes |
|:---|:---|:---|:---|:---|:---|:---|:---|
| 1 | harmonised | 0.0.1 | 2024-10-24 | showcase of the ontologics R-package | https://github.com/luckinet/ontologics | CC-BY-4.0 | |

Next, classes and their hierarchy need to be defined. Each concept is always a combination of a code, a label and a class. The code must be unique for each concept, but two concepts can share the same label or the same class. For instance, the concept football can have the class game or the class object and then means two different things, despite having the same label.
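The football example can be written down as a small table in base R. This is a sketch of the idea only; the codes .01 and .02 are made up for illustration and the table is a plain data frame, not an ontology object.

```r
concepts <- data.frame(
  code  = c(".01", ".02"),            # illustrative codes
  label = c("football", "football"),
  class = c("game", "object")
)

# codes identify concepts, so they must not repeat
stopifnot(anyDuplicated(concepts$code) == 0)

# labels may repeat; the class tells the two meanings apart
concepts[concepts$label == "football", ]
```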

# currently it is only possible to set one class at a time
lulc <- new_class(
  new = "landcover", 
  target = NA, 
  description = "A good definition of landcover",
  ontology = lulc)

lulc <- new_class(
  new = "land use", 
  target = "landcover", 
  description = "A good definition of land use",
  ontology = lulc)

# the class IDs are derived from the code that was previously specified 
kable(lulc@classes$harmonised[, 1:6])
| id | label | description | has_broader | has_close_match | has_narrower_match |
|:---|:---|:---|:---|:---|:---|
| .xx | landcover | A good definition of landcover | NA | NA | NA |
| .xx.xx | land use | A good definition of land use | landcover | NA | NA |
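In the table above, the ID of a class looks like the code repeated once per hierarchy level. A minimal base R sketch of that pattern follows; class_id() is a hypothetical helper and not a function of the package, whose internal logic may differ.

```r
# Hypothetical helper: repeat the code once per hierarchy level
class_id <- function(code, level) strrep(code, level)

class_id(".xx", 1)  # first level, as for 'landcover'
class_id(".xx", 2)  # second level, as for 'land use'
```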

Then, new concepts that have these classes can be defined. If you choose classes that are not yet defined, you’ll get a warning.

lc <- c(
  "Urban fabric", "Industrial, commercial and transport units",
  "Mine, dump and construction sites", "Artificial, non-agricultural vegetated areas",
  "Temporary cropland", "Permanent cropland", "Heterogeneous agricultural areas",
  "Forests", "Other Wooded Areas", "Shrubland", "Herbaceous associations",
  "Heterogeneous semi-natural areas", "Open spaces with little or no vegetation",
  "Inland wetlands", "Marine wetlands", "Inland waters", "Marine waters"
)

lulc <- new_concept(
  new = lc,
  class = "landcover",
  ontology = lulc
)

kable(lulc@concepts$harmonised[, 1:5])
| id | label | class | description | has_broader |
|:---|:---|:---|:---|:---|
| .01 | Urban fabric | landcover | NA | NA |
| .02 | Industrial, commercial and transport units | landcover | NA | NA |
| .03 | Mine, dump and construction sites | landcover | NA | NA |
| .04 | Artificial, non-agricultural vegetated areas | landcover | NA | NA |
| .05 | Temporary cropland | landcover | NA | NA |
| .06 | Permanent cropland | landcover | NA | NA |
| .07 | Heterogeneous agricultural areas | landcover | NA | NA |
| .08 | Forests | landcover | NA | NA |
| .09 | Other Wooded Areas | landcover | NA | NA |
| .10 | Shrubland | landcover | NA | NA |
| .11 | Herbaceous associations | landcover | NA | NA |
| .12 | Heterogeneous semi-natural areas | landcover | NA | NA |
| .13 | Open spaces with little or no vegetation | landcover | NA | NA |
| .14 | Inland wetlands | landcover | NA | NA |
| .15 | Marine wetlands | landcover | NA | NA |
| .16 | Inland waters | landcover | NA | NA |
| .17 | Marine waters | landcover | NA | NA |
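The IDs in this table appear to follow the code .xx that was set in start_ontology(): the leading symbol, then a running number padded with zeros to the width of the x placeholders. The base R sketch below reproduces that pattern under this assumption; it is not the package's actual implementation.

```r
code <- ".xx"

# split the code into its leading symbol and the number of 'x' placeholders
prefix <- sub("x+$", "", code)               # "."
width  <- nchar(gsub("[^x]", "", code))      # 2

# one zero-padded running number per concept (17 land-cover concepts above)
ids <- paste0(prefix, formatC(seq_len(17), width = width, flag = "0"))

head(ids, 3)
tail(ids, 1)
```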

An ontology differs from a vocabulary in that the concepts it contains are semantically related to one another. For example, concepts can be nested in other concepts. Hence, let’s also create a second level of concepts that depend on the first level.
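Nesting can be pictured as a column that points from each concept to its broader concept. The sketch below stores a few concepts of this vignette in a plain data frame and queries the relation with base R; narrower_of() is a hypothetical helper, not part of the package.

```r
# A few concepts with a pointer to their broader concept (NA = top level)
concepts <- data.frame(
  id          = c(".05", ".05.01", ".05.02", ".06", ".06.01"),
  label       = c("Temporary cropland", "Fallow", "Herbaceous crops",
                  "Permanent cropland", "Permanent grazing"),
  has_broader = c(NA, ".05", ".05", NA, ".06")
)

# Hypothetical helper: labels of all concepts nested directly in `id`
narrower_of <- function(id, tbl) tbl$label[which(tbl$has_broader == id)]

narrower_of(".05", concepts)
```

A flat vocabulary would only have the id and label columns; the has_broader column is what carries the semantic relation.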

lu <- tibble(
  concept = c(
    "Fallow", "Herbaceous crops", "Temporary grazing",
    "Permanent grazing", "Shrub orchards", "Palm plantations",
    "Tree orchards", "Woody plantation", "Protective cover",
    "Agroforestry", "Mosaic of agricultural-uses",
    "Mosaic of agriculture and natural vegetation",
    "Undisturbed Forest", "Naturally Regenerating Forest",
    "Planted Forest", "Temporally Unstocked Forest"
  ),
  broader = c(
    rep(lc[5], 3), rep(lc[6], 6),
    rep(lc[7], 3), rep(lc[8], 4)
  )
)



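# look up the broader concepts, align them with the rows of 'lu', and
# pass the result as 'broader' when creating the nested concepts
lulc <- get_concept(label = lu$broader, ontology = lulc) %>% 
  left_join(lu %>% select(label = broader), .) %>% 
  new_concept(
    new = lu$concept,
    broader = .,
    class = "land use",
    ontology = lulc
  )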
#> Joining with `by = join_by(label)`

kable(lulc@concepts$harmonised[, 1:5])
| id | label | class | description | has_broader |
|:---|:---|:---|:---|:---|
| .01 | Urban fabric | landcover | NA | NA |
| .02 | Industrial, commercial and transport units | landcover | NA | NA |
| .03 | Mine, dump and construction sites | landcover | NA | NA |
| .04 | Artificial, non-agricultural vegetated areas | landcover | NA | NA |
| .05 | Temporary cropland | landcover | NA | NA |
| .05.01 | Fallow | land use | NA | .05 |
| .05.02 | Herbaceous crops | land use | NA | .05 |
| .05.03 | Temporary grazing | land use | NA | .05 |
| .06 | Permanent cropland | landcover | NA | NA |
| .06.01 | Permanent grazing | land use | NA | .06 |
| .06.02 | Shrub orchards | land use | NA | .06 |
| .06.03 | Palm plantations | land use | NA | .06 |
| .06.04 | Tree orchards | land use | NA | .06 |
| .06.05 | Woody plantation | land use | NA | .06 |
| .06.06 | Protective cover | land use | NA | .06 |
| .07 | Heterogeneous agricultural areas | landcover | NA | NA |
| .07.01 | Agroforestry | land use | NA | .07 |
| .07.02 | Mosaic of agricultural-uses | land use | NA | .07 |
| .07.03 | Mosaic of agriculture and natural vegetation | land use | NA | .07 |
| .08 | Forests | landcover | NA | NA |
| .08.01 | Undisturbed Forest | land use | NA | .08 |
| .08.02 | Naturally Regenerating Forest | land use | NA | .08 |
| .08.03 | Planted Forest | land use | NA | .08 |
| .08.04 | Temporally Unstocked Forest | land use | NA | .08 |
| .09 | Other Wooded Areas | landcover | NA | NA |
| .10 | Shrubland | landcover | NA | NA |
| .11 | Herbaceous associations | landcover | NA | NA |
| .12 | Heterogeneous semi-natural areas | landcover | NA | NA |
| .13 | Open spaces with little or no vegetation | landcover | NA | NA |
| .14 | Inland wetlands | landcover | NA | NA |
| .15 | Marine wetlands | landcover | NA | NA |
| .16 | Inland waters | landcover | NA | NA |
| .17 | Marine waters | landcover | NA | NA |

Here, get_concept() was used to extract the broader concepts into which the new level is nested. This ensures that a valid concept is provided, i.e., one that has already been included in the ontology.
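What such a lookup guards against, and how the nested IDs in the table relate to their broader concepts, can be sketched with plain data frames in base R. The tables existing and additions and the numbering logic are assumptions for illustration; the package's own implementation may work differently.

```r
# concepts that are already part of the (toy) ontology
existing <- data.frame(
  id    = c(".05", ".06"),
  label = c("Temporary cropland", "Permanent cropland")
)

# concepts to add, each with the label of its broader concept
additions <- data.frame(
  label   = c("Fallow", "Herbaceous crops", "Permanent grazing"),
  broader = c("Temporary cropland", "Temporary cropland", "Permanent cropland")
)

# fail early if a broader concept has not been defined yet
undefined <- setdiff(additions$broader, existing$label)
if (length(undefined) > 0) {
  stop("undefined broader concept(s): ", toString(undefined))
}

# look up the ID of the broader concept for each new concept
additions$has_broader <- existing$id[match(additions$broader, existing$label)]

# number the new concepts within each broader concept and
# append that number to the ID of the broader concept
child_no <- ave(seq_along(additions$has_broader), additions$has_broader, FUN = seq_along)
additions$id <- paste0(additions$has_broader, ".", formatC(child_no, width = 2, flag = "0"))

additions[, c("id", "label", "has_broader")]
```

Checking the labels before building IDs means a typo in a broader concept stops the process instead of producing a concept without a valid parent.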