UMBEL - Annex K 20160510

UMBEL Annex K: UMBEL Generator

UMBEL Annex Document - 20 April 2015

Latest version: http://techwiki.umbel.org/index.php/UMBEL_-_Annex_K
Last update: $Date: 2015/4/20 14:28:36 $
Version: 1.20
Volume: TR 12-5-21-K
Authors: Michael Bergman - Structured Dynamics; Frédérick Giasson - Structured Dynamics

UMBEL: Upper Mapping and Binding Exchange Layer by Structured Dynamics LLC is provided under the
Creative Commons Attribution 3.0 license. See the attribution section for how to cite the effort.

Copyright © 2009-2015 by Structured Dynamics LLC.

Beginning with UMBEL version 1.20, statistics regarding numbers of reference concepts (RCs) in the ontology and splits between SuperTypes (STs) and modules have been moved to the statistics Annex Z document. As a result, earlier statistics in this and other annexes are no longer being updated, which means any statistics cited below may be out of date. Please consult Annex Z for the current UMBEL statistics.

This annex describes the scripts used to generate new versions of UMBEL. This method was introduced for UMBEL v.1.10. Note the Clojure generator build code is NOT available for public distribution.

Installation

Run the umbel-generator in a Clojure REPL

Usage

Create a new version of UMBEL using this command:

(use 'umbel-generator.core)
(generate-umbel-core-reference-concepts "1.10")

Once the new UMBEL structures have been generated, check the /logs/ folder for any reported issues.

Finally, once all the issues from the logs are fixed, make sure the reference structures are consistent and satisfiable. Run the following two checks with the Pellet command-line tool on the Core reference concepts file and on each module file:

; Check inconsistencies
pellet explain --inconsistent "file:/c:/Users/Proprietaire/Documents/Clojure/umbel-generator/umbel/1.10/target/umbel_reference_concepts.n3"
; Check for unsatisfiability
pellet explain --all-unsat "file:/c:/Users/Proprietaire/Documents/Clojure/umbel-generator/umbel/1.10/target/umbel_reference_concepts.n3"

Once everything is sound and free of issues, the new UMBEL version is ready to be released.
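When several files have to be checked, the two Pellet calls shown above can be assembled in a loop. The following is a minimal Python sketch, not part of the umbel-generator: the function name `pellet_commands` and the relative folder are hypothetical, and the sketch only builds and prints the command lines rather than running Pellet.

```python
from pathlib import Path

# The two checks named in this annex.
PELLET_CHECKS = ("--inconsistent", "--all-unsat")


def pellet_commands(target_dir, file_names):
    """Return one `pellet explain` argument list per (file, check) pair."""
    commands = []
    for name in file_names:
        # Pellet is given a file: URI, as in the commands shown in this annex.
        uri = (Path(target_dir) / name).resolve().as_uri()
        for check in PELLET_CHECKS:
            commands.append(["pellet", "explain", check, uri])
    return commands


# Hypothetical location of a version's /target folder.
commands = pellet_commands(
    "umbel/1.10/target",
    ["umbel_reference_concepts.n3", "umbel_geo.n3"],
)
for command in commands:
    print(" ".join(command))
```

Each printed line can be pasted into a shell, or each argument list can be passed to `subprocess.run` once Pellet is on the path.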

Structure of an UMBEL Version Folder

The creation of a new version of UMBEL is based on a series of files present within a version folder. In this section we will see how that folder is structured to create a new version of UMBEL.

  • /umbel
    • /1.10
      • /fixes
        • reference-concepts-fixes.csv
      • /indexes
        • opencyc_umbel.csv
        • super_types.csv
      • /logs
        • missing-definitions.csv
        • missing-new-concept-parents.csv
        • missing-opencyc-ids.csv
        • missing-preferred-labels.csv
        • new-preferred-labels.csv
        • non-distinct-preferred-labels.csv
        • missing-opencyc-parent-concepts-assign-supertype.csv
        • missing-opencyc-concepts-paths.csv
        • proposed-opencyc-concepts-additions-based-on-missing-paths-analysis.csv
      • /mappings
        • dbpedia.csv
        • geonames.csv
        • schema.org.csv
      • /modules
        • geo.csv
      • /new-concepts
        • dbpedia.csv
        • geonames.csv
        • schema_org.csv
      • /opencyc
        • owl-export-unversioned.owl
      • /owl
        • umbel_reference_concepts.n3
        • umbel_geo.n3
      • /target
        • dbpediaOntology.n3
        • geonames.n3
        • schema.org.n3
        • umbel_geo.n3
        • umbel_reference_concepts.n3

All the files in the /logs/ and /target/ folders are generated by the application. The sections below first describe the content and purpose of each file in the version folder, then explain how to use these files to create or update a version of UMBEL.

/indexes

File name | Description
opencyc_umbel.csv | This is the core UMBEL index file. It has two columns. The first column is the UMBEL-ID used to create the UMBEL URI in the reference structure. The second column is the OPENCYC-ID, which is used to reference the proper entity in the OpenCyc ontology. All the concepts that are added to the UMBEL Core Reference Structure, to any of the UMBEL modules, or as new concepts need to be present in that file.
super_types.csv | This is the Super Types index. It has two columns. The first column is the UMBEL-ID of the UMBEL reference concept to which a Super Type is assigned. The second column is the Super-Type ID of the Super Type to assign to the reference concept. This Super-Type ID is the URI ending of one of the Super Types.
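Because every concept has to be present in opencyc_umbel.csv, a quick cross-check of the two index files can catch a Super Type assignment that points at an unknown UMBEL-ID. The sketch below is illustrative only: it assumes comma-separated files without a header row, and the IDs in the sample data are invented.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into rows of stripped cells, skipping blank lines."""
    return [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(text)) if row]


def ids_missing_from_index(index_rows, super_type_rows):
    """Return UMBEL-IDs used in super_types.csv but absent from opencyc_umbel.csv."""
    known = {row[0] for row in index_rows}
    return sorted({row[0] for row in super_type_rows} - known)


# Invented sample content: UMBEL-ID,OPENCYC-ID and UMBEL-ID,Super-Type ID.
OPENCYC_UMBEL = "Dog,Mx-example-1\nCity,Mx-example-2\n"
SUPER_TYPES = "Dog,Animals\nRiver,NaturalPhenomena\n"

missing = ids_missing_from_index(read_rows(OPENCYC_UMBEL), read_rows(SUPER_TYPES))
print(missing)
```

Here `River` is reported because it has a Super Type assignment but no row in the index.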

/logs

File name | Description
missing-definitions.csv | This log file lists all the UMBEL-IDs that are missing a definition. Each ID in that file is a reference concept for which the OpenCyc ontology has no definition.
missing-new-concept-parents.csv | This log file lists all the new concepts that have a missing parent assignment. This means that the UMBEL-ID of the parent either contains a typo or does not (yet) exist in UMBEL.
missing-opencyc-ids.csv | This file lists all the OpenCyc IDs that are specified in the opencyc_umbel.csv index but do not exist in the OpenCyc ontology. This can happen when an OpenCyc ID changes between versions of OpenCyc; the affected concepts are then written to this file.
missing-preferred-labels.csv | This log file lists all the UMBEL-IDs that are missing a preferred label. A missing preferred label means that the label is missing in the OpenCyc ontology or in one of the new UMBEL concepts.
non-distinct-preferred-labels.csv | This log file lists all the UMBEL-IDs that share the same preferred label, since not all preferred labels are distinct in OpenCyc. The first column of the file is the label shared by multiple UMBEL-IDs. The subsequent columns (one per UMBEL-ID) list all the UMBEL-IDs that share that preferred label.
new-preferred-labels.csv | This file is derived from non-distinct-preferred-labels.csv and presents the same information in a different format. It needs to be edited to fix the non-distinct preferred labels in UMBEL. The file has three columns: UMBEL-ID, pref-label and alt-label. Initially, both the pref-label and the alt-label columns contain the shared label. Leave the alt-label column as-is and modify the pref-label column so that each concept has a distinct preferred label. If you want to keep the pref-label of a concept as-is, leave its pref-label column untouched and remove the label from its alt-label column (leave it blank). Once finalized, this file is appended to the /fixes/pref-labels.csv file. When UMBEL is re-generated, the pref-label and alt-label changes are taken into account.
missing-opencyc-parent-concepts-assign-supertype.csv | This is a list of UMBEL-IDs generated when creating the new concepts. Each reference concept in this list is missing a SuperType assignment. All the concepts in the list appear among the parent concepts of the new concepts that the generator is creating.
missing-opencyc-concepts-paths.csv | Each line of this file shows a UMBEL-ID that has no sub-class-of relationships. Concepts normally appear in this list because the structure of OpenCyc changed since the last version. The paths shown are the paths that exist between these "orphan" concepts and other UMBEL reference concepts. The paths are described in terms of OpenCyc class IDs. This file can be used to find OpenCyc classes to add to UMBEL. However, it is generated for information purposes only; use the file proposed-opencyc-concepts-additions-based-on-missing-paths-analysis.csv to add the new UMBEL reference concepts.
proposed-opencyc-concepts-additions-based-on-missing-paths-analysis.csv | This file is generated from the file missing-opencyc-concepts-paths.csv. It lists proposed new concepts, coming from OpenCyc, to add to UMBEL Core. They help tie in the untied concepts that currently exist in UMBEL.
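The relationship between non-distinct-preferred-labels.csv and new-preferred-labels.csv can be illustrated with a short Python sketch. It mirrors the column layouts described in this section but is not the generator's own code (which is not public); the function name and the sample IDs are hypothetical.

```python
def new_preferred_label_rows(non_distinct_rows):
    """Turn [shared-label, id1, id2, ...] rows into [UMBEL-ID, pref-label, alt-label] rows."""
    rows = []
    for shared_label, *umbel_ids in non_distinct_rows:
        for umbel_id in umbel_ids:
            if umbel_id:
                # Both label columns start out holding the shared label.
                rows.append([umbel_id, shared_label, shared_label])
    return rows


# Invented example: two concepts sharing the preferred label "bank".
non_distinct = [["bank", "Bank-River", "Bank-Finance"]]
rows = new_preferred_label_rows(non_distinct)
for row in rows:
    print(",".join(row))
```

An editor would then change the pref-label cell of at least one of the two rows so that the labels become distinct.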

/mappings

File name | Description
dbpedia.csv | This file lists all the DBpedia classes that should be mapped to UMBEL reference concepts. This mapping is used to generate the /target/dbpediaOntology.n3 file, which lists all the rdfs:subClassOf assignments between the UMBEL reference concept classes and the DBpedia Ontology classes.
geonames.csv | This file lists all the Geonames classes that should be mapped to UMBEL reference concepts. This mapping is used to generate the /target/geonames.n3 file, which lists all the umbel:correspondsTo assignments between the UMBEL reference concept classes and the Geonames Ontology classes.
schema.org.csv | This file lists all the Schema.org classes that should be mapped to UMBEL reference concepts. This mapping is used to generate the /target/schema.org.n3 file, which lists all the rdfs:subClassOf assignments between the UMBEL reference concept classes and the Schema.org Ontology classes.

/modules

File name | Description
geo.csv | This file lists all the UMBEL reference concepts that belong to the UMBEL Geo module. This file is used to generate the UMBEL Geo Ontology module file: /target/umbel_geo.n3

/new-concepts

All the CSV files in the /new-concepts folder share the same format. These files are used to create new concepts in UMBEL that do not exist in OpenCyc. The CSV files have 5 columns: id, pref-label, alt-labels, parents and definition. The id and parents columns expect UMBEL-ID references. The alt-labels and parents columns can use the double pipe control characters (||) to specify more than one value for that field.

File name | Description
dbpedia.csv | It lists all the new concepts required to create the DBpedia linkage
geonames.csv | It lists all the new concepts required to create the Geonames linkage
schema_org.csv | It lists all the new concepts required to create the Schema.org linkage
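The five-column layout of the /new-concepts files, including the double pipe separator, can be parsed as in the following Python sketch. It assumes comma-separated values with no header row, which this annex does not state, and the sample concept is invented.

```python
import csv
import io

FIELDS = ["id", "pref-label", "alt-labels", "parents", "definition"]
MULTI_VALUED = ("alt-labels", "parents")  # columns that may hold "a||b||c"


def parse_new_concepts(text):
    """Parse new-concepts CSV text into a list of dicts, splitting '||' fields."""
    concepts = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        concept = dict(zip(FIELDS, (cell.strip() for cell in row)))
        for field in MULTI_VALUED:
            concept[field] = [v.strip() for v in concept[field].split("||") if v.strip()]
        concepts.append(concept)
    return concepts


# Invented sample row; the IDs are not real UMBEL-IDs.
SAMPLE = ('Lighthouse-Example,lighthouse,beacon tower||light station,'
          'Building||Landmark,"A tower with a light, used as a navigational aid."\n')

concepts = parse_new_concepts(SAMPLE)
print(concepts[0]["parents"])
```

Each value in the `parents` list can then be checked against the opencyc_umbel.csv index before the concept is created.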

/opencyc

File name | Description
owl-export-unversioned.owl | This is the version of the OpenCyc Ontology file to use for generating a given version of UMBEL.

/owl

File name | Description
umbel_reference_concepts.n3 | This is the base ontology file to which all the Core reference concepts are added.
umbel_geo.n3 | This is the base ontology file to which all the Geo module reference concepts are added.

/target

File name | Description
dbpediaOntology.n3 | This is the generated linkage file between the DBpedia Ontology and UMBEL. It is generated using the /mappings/dbpedia.csv file.
geonames.n3 | This is the generated linkage file between the Geonames Ontology and UMBEL. It is generated using the /mappings/geonames.csv file.
schema.org.n3 | This is the generated linkage file between the Schema.org Ontology and UMBEL. It is generated using the /mappings/schema.org.csv file.
umbel_geo.n3 | This is the generated Geo Module ontology
umbel_reference_concepts.n3 | This is the generated UMBEL Core Reference Concepts ontology

/fixes

File name | Description
reference-concepts-fixes.csv | This file is used to fix preferred labels, alternative labels and definitions when generating UMBEL reference concepts from OpenCyc. It has four columns: umbel-id, pref-label, alt-label and definition. For each umbel-id specified in that file, the preferred label in the pref-label column replaces the existing preferred label of the concept. Every alternative label in the alt-label column is added to the alternative labels in the concept's description. Finally, the content of the definition column replaces the (possibly) existing definition of the reference concept. Note that you can separate multiple alternative labels using the double pipe control characters: ||
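The effect of one row of reference-concepts-fixes.csv on a concept can be sketched in Python as follows. This is an illustration of the rules described above and not the generator's code; in particular, treating an empty cell as "leave unchanged" is an assumption, and the sample data is invented.

```python
def apply_fix(concept, fix):
    """Return a copy of `concept` with one fixes row applied.

    concept: {"pref-label": str, "alt-labels": list, "definition": str}
    fix:     {"pref-label": str, "alt-label": str, "definition": str}
    """
    fixed = dict(concept)
    fixed["alt-labels"] = list(concept["alt-labels"])
    if fix.get("pref-label"):
        fixed["pref-label"] = fix["pref-label"]      # replaces the existing label
    for label in fix.get("alt-label", "").split("||"):
        label = label.strip()
        if label and label not in fixed["alt-labels"]:
            fixed["alt-labels"].append(label)        # added to the existing labels
    if fix.get("definition"):
        fixed["definition"] = fix["definition"]      # replaces the existing definition
    return fixed


# Invented example data.
concept = {"pref-label": "bank", "alt-labels": ["banking company"], "definition": ""}
fix = {"pref-label": "bank (finance)", "alt-label": "bank||financial bank", "definition": ""}
fixed = apply_fix(concept, fix)
print(fixed)
```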

Creating a New Version of UMBEL

To create a new version of UMBEL, you first have to create a new folder named after the version number under the /umbel/ folder, such as:

/umbel/1.10/

You should normally create the new folder by copying the folder of the previous version and renaming the copy with the new version's number. This creates the foundation of the new version.
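The copy-and-rename step can also be scripted. The Python sketch below is one possible way to do it; the function name and the version number "1.05" used as the previous version are hypothetical, and the demonstration runs in a temporary folder.

```python
import shutil
import tempfile
from pathlib import Path


def start_new_version(umbel_root, previous, new):
    """Copy /umbel/<previous> to /umbel/<new> and return the new folder."""
    source = Path(umbel_root) / previous
    destination = Path(umbel_root) / new
    if destination.exists():
        raise FileExistsError(f"version folder already exists: {destination}")
    shutil.copytree(source, destination)
    return destination


# Demonstration in a throwaway folder with a made-up previous version.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "1.05" / "indexes").mkdir(parents=True)
    (Path(root) / "1.05" / "indexes" / "opencyc_umbel.csv").write_text("")
    created = start_new_version(root, "1.05", "1.10")
    copied = sorted(p.name for p in (created / "indexes").iterdir())
print(copied)
```

Refusing to overwrite an existing folder is a deliberate choice here, so that work already done on a version is not lost by accident.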

Adding/Removing UMBEL reference concept (in OpenCyc)

To add a new UMBEL reference concept that exists in OpenCyc, you have to:

  1. Find the OpenCyc class that you want to add
  2. Note its ID
  3. Note its cycAnnot:label
  4. Open the /indexes/opencyc_umbel.csv index file
  5. Use the cycAnnot:label value in the first column. This will become the UMBEL reference concept's URI ending
  6. Use the ID in the second column. This makes the link to the OpenCyc class
  7. Save the file

To remove a UMBEL reference concept that exists in OpenCyc, you have to:

  1. Open the /indexes/opencyc_umbel.csv index file
  2. Find the UMBEL-ID you want to remove
  3. Remove the line
  4. Save the file
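Both procedures amount to adding or deleting one two-column row in the index. A minimal Python sketch follows, working on rows already read from the file; the function names and the sample IDs are hypothetical.

```python
def add_concept(rows, umbel_id, opencyc_id):
    """Return the index rows with a new [UMBEL-ID, OPENCYC-ID] row appended."""
    if any(row[0] == umbel_id for row in rows):
        raise ValueError(f"UMBEL-ID already present: {umbel_id}")
    return rows + [[umbel_id, opencyc_id]]


def remove_concept(rows, umbel_id):
    """Return the index rows without the line for the given UMBEL-ID."""
    return [row for row in rows if row[0] != umbel_id]


# Invented index content.
index = [["Dog", "Mx-example-1"], ["City", "Mx-example-2"]]
index = add_concept(index, "Lake", "Mx-example-3")
index = remove_concept(index, "Dog")
print(index)
```

Rejecting a duplicate UMBEL-ID keeps each reference concept's URI ending unique within the index.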

Adding/Modifying UMBEL reference concepts (not in OpenCyc)

To add a new UMBEL reference concept that does not exist in OpenCyc, you have to:

  1. Open one of the existing files in the /new-concepts/ folder. You can create a new one, but it will need to be taken into account in the umbel-generator code.
  2. Create an ID for the new reference concept (make sure it doesn't already exist), then specify a pref-label, one or multiple alt-labels, a definition and one or multiple parent IDs.
  3. Save the file

Fixing the preferred label, alternative labels and definition of a UMBEL reference concept (that exists in OpenCyc)

To fix a preferred label of an UMBEL reference concept that exists in OpenCyc you have to:

  1. Open the file /fixes/reference-concepts-fixes.csv
  2. Add the UMBEL-ID of the UMBEL reference concept to fix in the first column
  3. Add the preferred label that should be used by that reference concept in the second column
  4. Save the file

To fix a definition of an UMBEL reference concept that exists in OpenCyc you have to:

  1. Open the file /fixes/reference-concepts-fixes.csv
  2. Add the UMBEL-ID of the UMBEL reference concept to fix in the first column
  3. Add the definition that should be used by that reference concept in the last column
  4. Save the file

To add one or multiple alternative labels to an UMBEL reference concept that exists in OpenCyc, you have to:

  1. Open the file /fixes/reference-concepts-fixes.csv
  2. Add the UMBEL-ID of the UMBEL reference concept to fix in the first column
  3. Add one or multiple alternative labels in the third column (alt-label). These alternative labels will be added to the other alternative labels that exist in OpenCyc. Also note that you can add more than one alternative label by using the double pipe control characters (||) between each alternative label
  4. Save the file

Assigning a super-type relationship for a UMBEL reference concept

To assign a new super-type relationship between a UMBEL super type and an UMBEL reference concept, you have to:

  1. Open the /indexes/super_types.csv file
  2. Put the UMBEL-ID of the UMBEL reference concept to which you want to assign a UMBEL Super Type in the first column
  3. Put the SUPER-TYPE-ID of the UMBEL Super Type in the second column
  4. Save the file

Note that this file is used to assign supertypes to the Core reference concepts and any of the UMBEL modules concepts.

Updating an existing mapping

To update the mapping between UMBEL reference concepts and external ontologies you have to:

  1. Open one of the mapping files in the /mappings/ folder
  2. Add, modify or remove one of the linkages listed there
  3. Put the URI ending of the external ontology class to map in the first column
  4. Put the UMBEL-ID of the reference concept to link to in the second column
  5. Save the file
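To make the two-column mapping layout concrete, the following Python sketch turns mapping rows into rdfs:subClassOf statements in N3. It is an illustration only: the direction of the statement (external class as subclass of the UMBEL reference concept), the base URIs and the sample rows are all assumptions, since this annex does not spell them out and the generator code is not public.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
UMBEL_RC = "http://umbel.org/umbel/rc/"      # assumed base URI for UMBEL-IDs
DBPEDIA = "http://dbpedia.org/ontology/"     # assumed base URI for DBpedia classes


def mapping_to_n3(rows, external_base, umbel_base=UMBEL_RC):
    """Render [external URI ending, UMBEL-ID] rows as N3 subClassOf statements."""
    lines = [f"@prefix rdfs: <{RDFS}> .", ""]
    for external_ending, umbel_id in rows:
        # Assumed direction: external class rdfs:subClassOf UMBEL reference concept.
        lines.append(
            f"<{external_base}{external_ending}> rdfs:subClassOf <{umbel_base}{umbel_id}> ."
        )
    return "\n".join(lines) + "\n"


# Invented mapping rows.
n3 = mapping_to_n3([["City", "City"], ["Lake", "Lake"]], DBPEDIA)
print(n3)
```

The Geonames linkage would use umbel:correspondsTo in place of rdfs:subClassOf, as noted in the /mappings description.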

Creating a new mapping

To create a new mapping to another external ontology, you have to:

  1. Create a new mapping file with the same structure and rules as the existing DBpedia, Schema.org and Geonames mappings
  2. Save the file
  3. Send the file to Fred so that he can handle the new mapping in the umbel-generator code

Updating existing module

To update an existing module, you have to:

  1. Open the module's file located in the /modules/ folder
  2. Add, remove or modify UMBEL-IDs in the file
  3. Save the file

Creating new module

To create a new UMBEL module, select a list of UMBEL reference concepts that exist in UMBEL Core and that should be moved into an ontology module.

All the concepts that are part of the upper structure of UMBEL (the ones that keep it fully connected) should remain in the UMBEL Core reference concepts structure.

None of the UMBEL reference concepts listed in a module appear in the Core structure.

Moving concepts into a module may break the integrity of the Core structure. In that case, remove them from the module so that the umbel-generator adds them back into the Core structure.
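The rule that module concepts are taken out of Core can be expressed as a simple partition of the index. The Python sketch below shows the idea under the assumption that a module file is a plain list of UMBEL-IDs; the function name and the sample IDs are invented.

```python
def split_core_and_module(index_ids, module_ids):
    """Partition the indexed UMBEL-IDs into Core and module lists.

    Also returns module IDs that are absent from the index, since module
    concepts get their OpenCyc ID from opencyc_umbel.csv.
    """
    module_set = set(module_ids)
    core = [i for i in index_ids if i not in module_set]
    module = [i for i in index_ids if i in module_set]
    unknown = sorted(module_set - set(index_ids))
    return core, module, unknown


# Invented IDs for illustration.
index_ids = ["Thing-Example", "Place", "Lake", "City"]
geo_ids = ["Lake", "City", "Volcano"]
core, geo, unknown = split_core_and_module(index_ids, geo_ids)
print(core, geo, unknown)
```

Removing an ID from the module list moves it back into the Core list on the next run, which matches the recovery step described above.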

Steps performed by the umbel-generator

Here are the steps performed by the umbel-generator to create the Core reference structure, all the modules and linkages:

  1. Create UMBEL Core Reference Concepts
    1. Create core concepts from /indexes/opencyc_umbel.csv
      1. Before processing, remove the ones that belong to the modules
      2. Make sure the OpenCyc IDs specified in the index exist in OpenCyc
        1. Missing OpenCyc IDs are written to: /logs/missing-opencyc-ids.csv
      3. Create the class, RefConcept and assign the super types from /indexes/super_types.csv
    2. Add new concepts from /new-concepts/*
      1. For each new concept, make sure the parents/super-classes exist in opencyc_umbel.csv
        1. Missing parent IDs are written to: /logs/missing-new-concept-parents.csv
  2. Create UMBEL Geo module
    1. Create geo concepts from /modules/geo.csv and by getting the OpenCyc ID from /indexes/opencyc_umbel.csv
    2. Add the Super Types assignments from /indexes/super_types.csv
    3. Save Geo Module
  3. Create Geonames.org linkage
    1. Add concepts required for the linkage
  4. Create DBpedia linkage
    1. Add concepts required for the linkage
  5. Create Schema.org linkage
    1. Add concepts required for the linkage
  6. Save Ontology
  7. Perform some quality checks:
    1. Check for missing pref-labels: written to /logs/missing-preferred-labels.csv
    2. Check for distinct pref-labels: written to /logs/non-distinct-preferred-labels.csv and /logs/new-preferred-labels.csv
    3. Check for missing definitions: written to /logs/missing-definitions.csv
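The three quality checks in the last step can be sketched in Python as follows. This is an illustration of what the checks look for, not the generator's implementation; the data structure and the sample concepts are invented.

```python
from collections import defaultdict


def quality_checks(concepts):
    """concepts maps UMBEL-ID -> {"pref-label": str, "definition": str}.

    Returns (IDs missing a pref-label, shared label -> IDs, IDs missing a definition).
    """
    missing_labels = sorted(i for i, c in concepts.items() if not c.get("pref-label"))
    missing_definitions = sorted(i for i, c in concepts.items() if not c.get("definition"))
    ids_by_label = defaultdict(list)
    for umbel_id, concept in sorted(concepts.items()):
        if concept.get("pref-label"):
            ids_by_label[concept["pref-label"]].append(umbel_id)
    non_distinct = {label: ids for label, ids in ids_by_label.items() if len(ids) > 1}
    return missing_labels, non_distinct, missing_definitions


# Invented sample concepts.
sample = {
    "Bank-Finance": {"pref-label": "bank", "definition": "A financial institution."},
    "Bank-River": {"pref-label": "bank", "definition": ""},
    "Untitled": {"pref-label": "", "definition": "A concept without a label."},
}
missing_labels, non_distinct, missing_definitions = quality_checks(sample)
print(missing_labels, non_distinct, missing_definitions)
```

Each of the three results corresponds to one of the log files named in the quality-check step.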
