UMBEL - Annex H

UMBEL Annex H: Version 1.00 Changes

UMBEL Annex Document - 10 May 2016

Latest version: http://techwiki.umbel.org/index.php/UMBEL_-_Annex_H
Last update: $Date: 2016/5/10 9:22:47 $
Version: Version No.: 1.50
Volume: TR 16-5-10-H
Authors: Michael Bergman - Structured Dynamics
         Frédérick Giasson - Structured Dynamics


UMBEL: Upper Mapping and Binding Exchange Layer by Structured Dynamics LLC is provided under the
Creative Commons Attribution 3.0 license. See the attribution section for how to cite the effort.


Copyright © 2009-2016 by Structured Dynamics LLC.

Beginning with UMBEL version 1.20, statistics regarding numbers of reference concepts (RCs) in the ontology and splits between SuperTypes (STs) and modules have been moved to the statistics Annex Z document. As a result, earlier statistics in this and other annexes are no longer being updated, which means any statistics cited below may be out of date. Please consult Annex Z for the current UMBEL statistics.

INTRODUCTION

This document describes: 1) the process for adding new reference concepts to UMBEL to create a more long-lasting core ontology; and 2) how the concepts are mapped to important external ontologies. These two steps filled the remaining gaps to increment UMBEL to version 1.00.

Please note that many links to supporting, detailed documentation are included in this internal version of the process. These are shown under the See Also headers below. These links to the detailed explanations have been removed in the public version of this Annex.

Key statistics from these mappings are also included in this document.

Organization of this Annex

This Annex first presents the objectives and summary benefits and statistics from this new version of UMBEL. Then, the document describes how the mappings (and resulting UMBEL concept expansions) occurred. Mappings were based on both class and instance relationships between UMBEL and its guiding Wikipedia complement throughout this process.

The effort of mapping UMBEL to Wikipedia (and other sources, described in the body) resulted in some practical lessons regarding UMBEL's vocabulary. This learning has resulted in some updates and changes to the UMBEL predicates and definitions.

All of the efforts were also tested for coherence[1] (consistency[2] and satisfiability[3]). Directions for future updates and options are also described.

OVERVIEW AND PURPOSE

The guiding purpose -- not yet completely fulfilled -- of this version of UMBEL was to find the "best" Wikipedia category and page matches to each UMBEL concept. This purpose was chosen to provide a comprehensive and proven benchmark for testing, refining and enhancing the UMBEL structure. One outcome, for example, was to grow the UMBEL graph with Wikipedia categories or content that was inadequately represented in earlier versions of UMBEL.

The intent of this effort has been to obtain quality, accurate mappings. A variety of techniques are used for these actual mappings. However, even when aided by some automatic techniques, the final assignments have been made manually. In this manner, and with future and near-term iterations on this process, the intent is for the combination of UMBEL and Wikipedia to become the "gold standard" of knowledge bases, useful for many reasoning, natural language processing, metadata tagging and semantic and ontology tasks.

The net result of this effort is that category structures and mappings have been inspected multiple times and from different viewpoints. In various steps and in various phases, the inspection of Wikipedia, its categories, and its match with UMBEL has perhaps incurred more than 5,000 hours (or nearly three person-year equivalents) of expert domain and semantic technology review. Both advanced analytic techniques and algorithms, plus painstaking manual review and inspection of hundreds of thousands of items, have been combined in this effort.

The results are promising and very much useful today, but by no means complete. Future versions will extend the current mappings and continue to refine their accuracy and completeness.[4] What we can say, however, is that a coherent organization and conceptual schema -- namely, UMBEL -- overlaid on the richness of the instance data and content of Wikipedia, can produce immediate and useful benefits. These benefits apply to semantic search, semantic annotation and tagging, reasoning, discovery, inferencing, organization and comparisons.

In this process we have expanded UMBEL about 33% while remaining consistent with its origins as a faithful subset of the venerable Cyc knowledge structure[5]. We have also unlocked other hidden structure within Wikipedia's contents not heretofore discovered or mined. Later publications will describe these discoveries.

Summary Statistics

Based on these efforts, the 1.00 version of UMBEL has[6]:

  • A core structure of 27,917 Reference Concepts (RCs) and 33 SuperTypes (STs)
  • Direct RC mappings to 444 PROTON[7] classes
  • Direct RC mapping to 257 DB Ontology[8] classes
  • Direct mapping of 16,884 RCs to Wikipedia (DBpedia) (categories and pages)
  • Linking of 2,130,021 unique DBpedia pages via 3,935,148 predicate relations; all are characterized by one or more STs
  • 876,125 of these linkages are assigned a specific rdf:type, and
  • Vocabulary changes related to these updates, including some new and some dropped predicates.

MAPPINGS TO EXTERNAL ONTOLOGIES

UMBEL's purpose is to provide a coherent reference structure to which other external structures and schema (up to and including full knowledge bases) can relate. Through these common interrelationships, the constituent sources can then interoperate.

The best way to hone this intended purpose is to actually use UMBEL in these ways. So, for nearly three years now, and especially after the cleanup of UMBEL to version 0.80,[9] a concerted effort has been undertaken to map UMBEL to a variety of external ontologies.

These mappings have been made at the class level (most powerful and useful) and instance level, with the primary mapping target being Wikipedia.

Initial Preparation

The basic approach is to use the DBpedia representation of Wikipedia, since its extractors have already done a great job in preparing structured data.

Note: as of DBpedia version 3.5, the entire extraction framework has been rewritten in Scala, plus many other changes. Our approach still relies on the earlier PHP extraction framework. In addition, with the release of DBpedia version 3.6, all of the scripts and frameworks need to be re-run for this update.

Much of the mapping takes place via manual assignments, assisted by various NLP systems and scripts. In order for this process to be manageable, many lists are created and categories need to be winnowed down to manageable numbers.

The first step is to remove extraneous internal Wikipedia categories, what we term the "admin list". Removing the categories that match specific patterns and formats reduces the Wikipedia category listing by about 20% (from a starting base of about 500 K categories). These admin categories are (mostly) for internal purposes, and have no bearing on the actual structure of the knowledge base.

In a second step we extract and set aside what we term functional categories.[10] This process reduces the overall category count by a further 60% or so. The resulting 80 K or so categories represent the true knowledge structure of Wikipedia.[11] This much reduced category listing is what we term the "clean" categories in various places within the mapping methodologies.

From this reduced baseline, we then find "patterns" of category names by deconstructing them,[12] which we then assemble (for the portion applicable) into lists aggregated by the various UMBEL SuperTypes. These deconstructed stems may or may not be used in further steps in the Wikipedia processing.
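
The winnowing steps above can be sketched as a simple name filter. This is only an illustration: the regular expressions below are hypothetical stand-ins, since the actual admin and functional pattern lists are not reproduced in this Annex.

```python
import re

# Hypothetical patterns; the real admin and functional lists are not given here.
ADMIN_PATTERNS = [r"^Wikipedia_", r"_stubs$"]
FUNCTIONAL_PATTERNS = [r"^\d{4}_births$", r"^People_from_"]


def matches_any(name, patterns):
    """True if the category name matches at least one regular expression."""
    return any(re.search(pattern, name) for pattern in patterns)


def clean_categories(names):
    """Drop admin categories, set functional ones aside, keep the rest as 'clean'."""
    clean, functional = [], []
    for name in names:
        if matches_any(name, ADMIN_PATTERNS):
            continue  # internal housekeeping category: discard
        if matches_any(name, FUNCTIONAL_PATTERNS):
            functional.append(name)  # set aside rather than discard
        else:
            clean.append(name)
    return clean, functional


sample = ["Green_algae", "Wikipedia_maintenance", "1950_births", "Abbeys", "Physics_stubs"]
clean, functional = clean_categories(sample)
print(clean, functional)
```

Keeping the functional categories in their own list, rather than deleting them, mirrors the "extract and set aside" wording above.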

See Also

Class-level Mappings

Class level mappings (namely, between categories or set classes), if done correctly, provide the most power and leverage when relating two knowledge structures. They are powerful because the instances or members of the mappings naturally accompany the association. They have leverage because one may also infer or extend relationships via parents, children or other relationships in the structure of the sources.

For this UMBEL version 1.00, concerted class mapping attention was given to these knowledge structures and knowledge bases:

  • UMBEL <-> PROTON
  • UMBEL <-> DBpedia
  • UMBEL <-> GeoNames[13]

Each of these is explained in turn.

All class mappings were done and verified by hand. Verification was done via testing with reasoners (Pellet, FaCT++ or other) for consistency[2] and satisfiability[3] against the imposed SuperTypes restrictions. These mappings were geared to allow any of the constituent ontologies or their predicates to be used in conjunction with the other ontologies.

The results of these class mappings manifested in a variety of ways:

  1. Direct mappings between UMBEL and Wikipedia
  2. Extend sibling instances in UMBEL (breadth)
  3. Extend linking concepts in UMBEL (gaps)
  4. Extend UMBEL downward (depth).

The resulting class-level mappings were also used to inherit each source's instance mappings (see next major section). The class mappings also informed the assignment of the rdf:type predicate.

PROTON Mapping

The mapping between the PROTON ontology and the UMBEL reference concept structure was done by hand. For each PROTON class, an equivalent or parent concept was identified in the UMBEL reference concept structure. Matches were expressed as an rdfs:subClassOf relationship between the PROTON class and the UMBEL reference concept. Each linkage was tested for consistency and satisfiability using Pellet. At the end of the mapping, all UMBEL reference concepts were made rdfs:subClassOf one and only one PROTON class (according to FactForge's world view[14]).
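
As a minimal sketch of how such a hand-made class mapping might be serialized, the following writes rdfs:subClassOf statements as N-Triples lines. The external class URIs and the reference concept names are hypothetical placeholders, not actual entries from the PROTON mapping file; only the rdfs:subClassOf URI and the http://umbel.org/umbel/rc/ base are taken as given.

```python
RDFS_SUBCLASSOF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
UMBEL_RC = "http://umbel.org/umbel/rc/"

# Hypothetical pairs: (external class URI, UMBEL reference concept local name).
mappings = [
    ("http://example.org/proton#Organization", "Organization"),
    ("http://example.org/proton#Person", "Person"),
]


def to_ntriple(subject_uri, predicate_uri, object_uri):
    """Format one statement with URI subject, predicate and object as an N-Triples line."""
    return f"<{subject_uri}> <{predicate_uri}> <{object_uri}> ."


triples = [to_ntriple(ext, RDFS_SUBCLASSOF, UMBEL_RC + rc) for ext, rc in mappings]
for line in triples:
    print(line)
```

A file of such lines could then be loaded alongside both ontologies into a reasoner for the consistency and satisfiability checks described above.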

See Also

DBpedia Mapping

The mapping between the DBpedia ontology and the UMBEL reference concepts followed the same procedure. However, since there was already a mapping between PROTON and the DBpedia ontology, only the unmapped DBpedia ontology classes needed to be added.

See Also

GeoNames Mapping

The mapping between the 671 GeoNames feature codes, treated as classes for these mapping purposes, and the UMBEL reference concepts followed the same procedure. The GeoNames mappings have not yet been tested for consistency and satisfiability. They are thus not included in the version 1.00 formal results, though the class mappings file between GeoNames and UMBEL reference concepts is included.[13]

See Also

Instance-level Mappings

These class mappings and other approaches were then the basis for manual or semi-automatic mappings to Wikipedia instances (pages) using one of:

  1. The DBpedia Ontology
  2. Semantic Vectors correspondences
  3. Analysis of the DBpedia category structure, or
  4. Existing OpenCyc-DBpedia mappings.

All instance mappings were also related to one of 33 UMBEL SuperTypes (SuperClasses of reference concepts). Three methods were employed to link Wikipedia pages (instances) via the DBpedia v. 3.51 extraction to the UMBEL reference concept structure.

Method # 1: DBpedia Ontology

In method one, the instances associated with the DBpedia ontology were inherited directly based on their class mappings to the UMBEL reference concepts. These mappings also received the rdf:type predicate. Some 659,527 unique pages were linked in this manner, resulting in a total of 876,125 rdf:type assignments.

The existing DBpedia ontology[8] already has 272 classes with about 1.5 million articles manually mapped to them. About 190 of these were already mapped due to the PROTON Mapping to UMBEL. The remaining ones were also mapped manually.
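
The inheritance step can be sketched as a lookup from each page's ontology classes to the mapped reference concepts. All class, concept and page names below are illustrative placeholders, not entries from the actual mapping. The sketch also shows why the number of rdf:type assignments can exceed the number of unique pages.

```python
from collections import defaultdict

# Hypothetical class-level mappings: DBpedia ontology class -> UMBEL reference concept.
class_to_rc = {"dbo:Abbey": "rc:Abbey", "dbo:Building": "rc:Building", "dbo:Plant": "rc:Plant"}

# Hypothetical instance data: page -> DBpedia ontology classes it belongs to.
page_classes = {
    "dbpedia:Page_A": ["dbo:Abbey", "dbo:Building"],
    "dbpedia:Page_B": ["dbo:Plant"],
    "dbpedia:Page_C": ["dbo:Unmapped"],
}


def inherit_types(page_classes, class_to_rc):
    """Give each page an rdf:type for every mapped class it belongs to."""
    typed = defaultdict(set)
    for page, classes in page_classes.items():
        for cls in classes:
            rc = class_to_rc.get(cls)
            if rc is not None:  # classes without a mapping contribute nothing
                typed[page].add(rc)
    return dict(typed)


rdf_types = inherit_types(page_classes, class_to_rc)
unique_pages = len(rdf_types)
assignments = sum(len(rcs) for rcs in rdf_types.values())
print(unique_pages, assignments)
```

Here a page that belongs to two mapped classes receives two assignments, so three assignments cover only two unique pages.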

See Also

Method # 2: Semantic Vectors

In method two, Semantic Vectors[15] was applied to the "clean" Wikipedia categories to create, for every clean category, an association file of candidate UMBEL reference concepts with SV scores. These candidates were then inspected by hand, with the assignment made manually. Wikipedia instances associated with these categories were then mapped to the UMBEL structure and given a umbel:relatesToXXX predicate (see the 'UMBEL CHANGES' section below) for the reference concept's associated, single ST (SuperType). Because a Wikipedia instance could be related to more than one reference concept, individual Wikipedia pages may have been assigned multiple umbel:relatesToXXX predicates. (If the Wikipedia page already had an rdf:type assignment, this would supersede the umbel:relatesToXXX predicates.)
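
One possible reading of the precedence rule in the parenthetical above is sketched here: if a page has any rdf:type assignment, its umbel:relatesToXXX links are dropped. The function name and sample data are hypothetical, and the actual scripts may resolve the overlap differently.

```python
def effective_predicates(rdf_types, relates_to):
    """Pick the predicates to keep for one Wikipedia page.

    rdf_types:  set of reference concepts asserted via rdf:type
    relates_to: set of (predicate, SuperType) pairs from category mappings
    """
    if rdf_types:
        # Assumed reading: any rdf:type assignment replaces all relatesToXXX links.
        return sorted(("rdf:type", rc) for rc in rdf_types)
    return sorted(relates_to)


typed_page = effective_predicates(
    {"rc:Abbey"},
    {("umbel:relatesToFacility", "Facilities")},
)
untyped_page = effective_predicates(
    set(),
    {("umbel:relatesToPersonType", "PersonTypes"),
     ("umbel:relatesToGeoEntity", "Geopolitical")},
)
print(typed_page)
print(untyped_page)
```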

UMBEL reference concepts and the "cleaned" Wikipedia categories resulting from the previous steps (admin and functional categories[16]) were processed and then matched with the semantic vectors software.[17]

The content basis for the SV runs was based on indexing these sources:

  • For the UMBEL concepts, the content includes the labels (pref and alt) and description of the concept
  • For the Wikipedia categories, the content includes the labels and abstract.

After indexing, the standard settings are applied for the semantic vectors package. Two parallel runs are created: one uses the Wikipedia categories as the corpus, matching the UMBEL concepts against them; the other uses the reverse.

Positioned versus unpositioned indexing was tested, and multiple tests were run until the final script and methodology were determined. Unpositioned indexing was eventually used, since it was faster with no loss in precision.

Runs take multiple hours on a standard server workstation with 2 GB of RAM. Each possible match is scored and ranked on a scale of 0 to 1. An example for Wikipedia categories mapped to the UMBEL corpus is:

category ID      label   concept                        weight
Category:AAA     AAA     ClassAAABaseball               0.6685217
                         PublicationStyleSpecification  0.42948616
                         DermatoImmunologist            0.2956852
                         HandballBall                   0.29052848
Category:AIDS    AIDS    AIDS                           0.70008224
                         AIDSClinic                     0.6572799
                         AIDSSpecialist                 0.6272948
                         MobilityAid                    0.5272866
Category:Abbeys  Abbeys  Abbey                          0.9134619
                         TiziOuzou_ProvinceAlgeria      0.31687668
                         Howitzer_SelfPropelled         0.30506617
                         DeadAnimal                     0.2689236
                 Abbey   Abbey                          0.9134619
                         TiziOuzou_ProvinceAlgeria      0.31687668
                         Howitzer_SelfPropelled         0.30506617
                         DeadAnimal                     0.2689236
Category:Abbots  Abbots  Archbishop                     0.904934
                         Duodenum                       0.29744086
                         Guppy                          0.29535353
                         Campfire                       0.29021814

An example for UMBEL concepts mapped to the Wikipedia corpus, duplicated for altLabels as well, is:

concept                                  label                               category ID          weight
umbel/rc/AllergyAndImmunologySpecialist  allergy and immunology specialist   Allergology          0.547624
                                                                             Immunologists        0.458715
                                                                             Rheumatologists      0.402268
                                                                             Rhinology            0.396518
                                         allergy and immunologist            Allergology          0.845904
                                                                             Rhinology            0.520327
                                                                             Gastrodia            0.454643
                                                                             Antihistamines       0.305657
                                         allergy and immunology specialists  Allergology          0.5644
                                                                             Immunologists        0.425192
                                                                             Gastrodia            0.357338
                                                                             Rheumatologists      0.328607
                                         allergy and immunologists           Immunologists        0.621635
                                                                             Allergology          0.611449
                                                                             Gastrodia            0.407595
                                                                             Rhinology            0.351132
umbel/rc/Immunologist                    immunologist                        -                    -
                                         immunologists                       Immunologists        0.849982
                                                                             Quebec_Aces          0.331707
                                                                             Gemstone_Publishing  0.311423
                                                                             Biopsychology        0.3101
                                         serologist                          -                    -

A variety of screening and matching rules were applied based on these scores, but ultimately all choices were inspected manually and confirmed before committing to the system.
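
The specific screening rules are not listed in this Annex. As a hedged sketch of what one such rule might look like, the following proposes the top-scoring concept only when it clears a minimum score and leads the runner-up by a margin. Both thresholds are hypothetical, and the scores are copied from the first example listing above.

```python
# Top two candidates per category, copied from the first example listing.
candidates = {
    "Category:AAA": [("ClassAAABaseball", 0.6685217), ("PublicationStyleSpecification", 0.42948616)],
    "Category:AIDS": [("AIDS", 0.70008224), ("AIDSClinic", 0.6572799)],
    "Category:Abbeys": [("Abbey", 0.9134619), ("TiziOuzou_ProvinceAlgeria", 0.31687668)],
    "Category:Abbots": [("Archbishop", 0.904934), ("Duodenum", 0.29744086)],
}

MIN_SCORE = 0.8   # hypothetical threshold, not from the source
MIN_MARGIN = 0.3  # hypothetical lead required over the runner-up


def shortlist(candidates, min_score=MIN_SCORE, min_margin=MIN_MARGIN):
    """Propose the best-scoring concept per category for later manual review."""
    picked = {}
    for category, scored in candidates.items():
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        best_concept, best_score = ranked[0]
        runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
        if best_score >= min_score and best_score - runner_up_score >= min_margin:
            picked[category] = best_concept
    return picked


proposed = shortlist(candidates)
print(proposed)
```

Under these assumed thresholds, Category:Abbots is proposed for Archbishop even though an abbot is not an archbishop. A high score alone does not guarantee the right sense, which is consistent with the manual confirmation step described above.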

With this method two, 2,484 unique reference concepts participated in the linkage to 102,956 unique Wikipedia pages. A total of 111,470 umbel:relatesToXXX predicates (which vary by the underlying SuperType) were created based on this method.

See Also

Method # 3: Deconstructed Wikipedia Categories

In method three, the Wikipedia categories were deconstructed to discern their structural compositions, largely based on suffix extensions. A script was used to relate these Wikipedia categories by list to candidate UMBEL reference concepts. These lists were then presented via script for assigning by hand to the associated UMBEL reference concept. Instances related to the assigned Wikipedia category were then given the same umbel:relatesToXXX predicate that was associated with the related UMBEL reference concept.

Lists of some 5000 or more patterns were extracted and mapped to the various UMBEL SuperTypes. A variety of scripts were developed to perform the candidate matches. If a direct candidate match was not available, the candidate assignment defaulted to a generic match for each SuperType.
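
A minimal sketch of the candidate step follows, assuming the stems are matched against the last word of the category name. The three stems and their concepts are taken from the example listing in this section; the matching rule itself and the generic fallback concept are assumptions made for illustration.

```python
# Stems and concepts from the example listing in this section (Plants SuperType only).
PATTERNS = {"Plants": {"_algae": "Algae", "_crop": "Crop", "_cultivars": "Cultivar"}}

# Hypothetical generic fallback concept per SuperType.
GENERIC = {"Plants": "Plant"}


def candidate_for(category, supertype):
    """Return a candidate reference concept for a category, or the generic default."""
    last_word = category.lower().rsplit("_", 1)[-1]
    for stem, concept in PATTERNS[supertype].items():
        # Compare the stem (without its leading underscore) to the final word.
        if last_word.startswith(stem.lstrip("_")):
            return concept
    return GENERIC[supertype]


for name in ["Green_algae", "Energy_crops", "Tree_ferns"]:
    print(name, "->", candidate_for(name, "Plants"))
```

The candidates produced this way would still go to the hand-assignment step; the script only narrows the choices.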

Candidate matches were then written to a CSV file and inspected to choose the eventual match. An example of such a listing is:

SuperType  list        concept                                      category                                                  match
Plants
           _algae      http://umbel.org/umbel/rc/Algae                                                                        X
                       http://umbel.org/umbel/rc/Algae_LikeProtist
                       http://umbel.org/umbel/rc/AlgaeSubkingdom
                       http://umbel.org/umbel/rc/BlueGreenAlgae
                                                                    http://dbpedia.org/resource/Category:Green_algae
                                                                    http://dbpedia.org/resource/Category:Red_algae
                                                                    http://dbpedia.org/resource/Category:Brown_algae
                                                                    http://dbpedia.org/resource/Category:Edible_algae
                                                                    http://dbpedia.org/resource/Category:WikiProject_Algae
           _broadleaf  http://umbel.org/umbel/rc/BroadleafForest                                                              ADD
           _crop       http://umbel.org/umbel/rc/Crop                                                                         X
                       http://umbel.org/umbel/rc/CropCircle
                       http://umbel.org/umbel/rc/CropCoveredRegion
                       http://umbel.org/umbel/rc/CropFarm
                       http://umbel.org/umbel/rc/CropPlant
                       http://umbel.org/umbel/rc/Poppy_crop
                       http://umbel.org/umbel/rc/Potato_crop
                                                                    http://dbpedia.org/resource/Category:Energy_crops
                                                                    http://dbpedia.org/resource/Category:Underutilized_crops
                                                                    http://dbpedia.org/resource/Category:Ethanol_fuel_crops
                                                                    http://dbpedia.org/resource/Category:Non-food_crops
           _cultivars  http://umbel.org/umbel/rc/Cultivar                                                                     X

You will note the entry of 'ADD' for the _broadleaf category. This, along with other notes, flags that some structural attention needs to be given to UMBEL to deal with broadleaf plants, separate from broadleaf forest regions. Other gaps were discovered and flagged as the inspections proceeded.

With method three, eventually 1,668 unique reference concepts participated in the linkage to 1,808,782 unique Wikipedia pages. A total of 2,947,553 umbel:relatesToXXX predicates were created based on this method.

See Also

Method # 4: Existing OpenCyc-DBpedia Mappings

Lastly, a fourth source, which was not really a method, added 16,031 Wikipedia instances by virtue of hand-inspected OpenCyc to DBpedia page mappings within the current OpenCyc knowledge base.

These mappings began as semi-automated matches conducted by Legg and Medelyan at the University of Waikato.[18][19] They made and then inspected about 42 K of these matches by hand.[20]

Not all, but most of these, were included as stated mappings in the latest version of OpenCyc.[21] In keeping with the relation of UMBEL to Cyc, we only kept the 41 K or so presently in OpenCyc as our starting basis.

An initial screening showed nearly half of the entries to be mappings to individuals, not suitable for UMBEL's reference concept purpose. An additional grouping of about 3,500 entries was also removed for later use in a geographic module to UMBEL (see last section). The result of these exclusions was a candidate pool of 18,863 mappings between UMBEL reference concepts and Wikipedia pages.

All of these candidates were evaluated manually for suitability and accuracy. From this, some 15% were dropped as having incorrect mappings (a frequent mismatch, for example, was between items such as Anthropology and Anthropologist; but other patterned errors also exist). Further, another nearly 7% (6.8%) of mappings were deprecated as useful but only partial. For a system that had already undergone semi-automatic screening with spot checks for common sense, these error rates appear quite high.

Nonetheless, about 16 K actual mappings were deemed appropriate and accurate for UMBEL.

See Also

Overall Statistics

The result of these mappings and changes produces these overall statistics:

  • In its role as a central mapping vocabulary, the number of UMBEL reference concepts was expanded from 20,512 to 27,917. These are all fully integrated into the UMBEL ontology with one of 33 SuperTypes (ST) assigned
  • 444 PROTON classes are directly mapped to corresponding UMBEL reference concepts
  • 257 DBpedia ontology classes are directly mapped to corresponding UMBEL reference concepts
  • Across all mappings, 3,527 UMBEL reference concepts are linked directly to Wikipedia (DBpedia). The result is that 2,130,021 unique Wikipedia pages are in total linked to this structure via nearly 4 million predicate relations (3,935,148). All of these pages are also characterized by one or more STs
  • Of the nearly 4 million predicate relations, 876,125 are specific rdf:type assignments (covering 659,527 unique pages); the remainder express a less certain relationship (umbel:relatesToXXX predicate)
  • 16,884 specific RC to Wikipedia page links are made using the umbel:correspondsTo predicate. Of these, 16,031 result from the OpenCyc mapping basis and 2,983 from the semantic vectors basis (2,130 of these were duplicate assignments made by both methods).
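
The figures reported in this Annex can be cross-checked with simple arithmetic. The short script below uses only numbers stated in this document and confirms that the per-method predicate counts add up to the overall total, that the umbel:correspondsTo count follows from the two sources less their overlap, and that the growth in reference concepts matches the 7,405 additions reported under UMBEL CHANGES.

```python
# Predicate relations by method, as reported in the method descriptions above.
rdf_type_assignments = 876_125        # Method 1: DBpedia ontology
sv_relates_to = 111_470               # Method 2: Semantic Vectors
deconstructed_relates_to = 2_947_553  # Method 3: deconstructed categories

total_predicates = rdf_type_assignments + sv_relates_to + deconstructed_relates_to
print(total_predicates)  # 3935148

# umbel:correspondsTo links: OpenCyc basis plus SV basis, minus duplicates.
corresponds_to = 16_031 + 2_983 - 2_130
print(corresponds_to)  # 16884

# Growth in reference concepts for version 1.00.
new_reference_concepts = 27_917 - 20_512
print(new_reference_concepts)  # 7405
```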

See Also

UMBEL CHANGES

Three major changes to the UMBEL vocabulary and reference concept structure (ontology) were made as the result of this version 1.00 effort.

New Reference Concepts

The first major change was to add 7,405 reference concepts to the core UMBEL structure. These additions came about as a way to complete the coverage of the general UMBEL structure in order to provide appropriate linkage points into the ontology. The analysis leading to these additions came about from analyzing existing OpenCyc to DBpedia linkages and missing linking concepts due to the DBpedia and GeoNames class mapping activities. This larger "core" UMBEL structure is now felt to be closer to adequate for ongoing reference mappings to other external ontologies into the future.

umbel:correspondsTo Predicate

For some time the semantic Web community has grappled with the issue of the sameAs predicate, often misusing it in application.

Among other options along a spectrum of relatedness is the desire to assign a predicate that is meant to represent the same kind of thing, yet without knowing if the relationship is an equivalence (identity, or sameAs), a subset, or merely a member-of relationship. (Contrast this sense to the umbel:relatesToXXX predicate; see next subsection.)

Thus, with respect to existing and commonly used predicates, we want an umbrella property that is generally equivalent or sameAs in nature but that, if the relationship were known precisely, might actually be an approximation of any of these relations:

rdfs:subClassOf
owl:equivalentClass
owl:sameAs
superClassOf
rdf:type

Approximate relationships[22] and the x:coref predicate from the UMBC Ebiquity group[23] try to capture these relationships as well. For example, in the words of Tim Finin of the Ebiquity group:[24]

The solution we are currently exploring is to define a new property to assert that two RDF instances are co-referential when they are believed to describe the same object in the world. The two RDF descriptions might be incompatible because they are true at different times, or the sources disagree about some of the facts, or any number of reasons, so merging them with owl:sameAs may lead to contradictions. However, virtually merging the descriptions in a co-reference engine is fine -- both provide information that is useful in disambiguating future references as well as for many other purposes. Our property (:coref) is a transitive, symmetric property that is a super-property of owl:sameAs and is paired with another, :notCoref that is symmetric and generalizes owl:differentFrom.

When we look at the analog properties noted above, we see that the property objects tend to share reflexivity, symmetry and transitivity.

The umbel:correspondsTo predicate is patterned in a similar way to describe these close, nearly equivalent relationships of uncertain degree.

Thus, the formal description of this property is:

Property name: umbel:correspondsTo

Description: The property umbel:correspondsTo is used to assert a close correspondence between an external class, named entity, individual or instance with a Reference Concept class. umbel:correspondsTo relates the external class, named entity, individual or instance to the class through the basis of both its subject matter and intended scope. This predicate should be used where the correspondence between the two entities is felt to be nearly equivalent to a sameAs assertion, and is reflexive, but without the full entailments of intensional class memberships. In these cases, both entities are understood to have the same type and intended scope, but without asserting a full class-level or sameAs individual relationship.

This predicate is designed for the circumstance of aligning two different ontologies or knowledge bases based on node-level correspondences, but without entailing the actual ontological relationships and structure of the object source. For example, the umbel:correspondsTo predicate is used to assert close correspondence between UMBEL Reference Concepts and Wikipedia categories or pages, yet without entailing the actual Wikipedia category structure.

This property asserts a different and stronger relationship than umbel:isAbout. One practical use is to guide specific instance member determinations when, say, the native structure of the external ontology or knowledge base is to be analyzed and replaced with an UMBEL-based structure.

This property is therefore used to create a nearly equivalent assertion (however, with the degree of that equivalence being unknown or unknowable) between an external instance or class and a Reference Concept class.

Domain: owl:Thing
Range: umbel:RefConcept
Reflexive: True
Status: Experimental - Unstable

With this predicate, a concept about, say, Anthropology in UMBEL can be related to a corresponding "node" in Wikipedia that best represents that concept. That corresponding node may not be an EXACT representation of what is in UMBEL, but it is the closest we can find and is definitely NOT of some unlike type (such as an anthropologist).
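
As an illustration of the Anthropology example, the sketch below formats one umbel:correspondsTo statement as an N-Triples line, with the external node as subject and the reference concept as object, in line with the stated domain and range. The vocabulary namespace http://umbel.org/umbel# is an assumption here, as is the existence of a reference concept and a DBpedia resource both named Anthropology.

```python
UMBEL_VOCAB = "http://umbel.org/umbel#"   # assumed vocabulary namespace
UMBEL_RC = "http://umbel.org/umbel/rc/"   # reference concept base used in the listings above
DBPEDIA = "http://dbpedia.org/resource/"


def corresponds_to(external_uri, rc_name):
    """Format an umbel:correspondsTo statement as a single N-Triples line."""
    return f"<{external_uri}> <{UMBEL_VOCAB}correspondsTo> <{UMBEL_RC}{rc_name}> ."


triple = corresponds_to(DBPEDIA + "Anthropology", "Anthropology")
print(triple)
```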

umbel:relatesToXXX Predicates

At a different point along this relatedness spectrum we have unlike things that we would like to relate to one another. It might be an attribute, a characteristic or a functional property about something that we care to describe. Further, by nature of the thing we are relating, we may also be able to describe the kind of thing we are relating. The UMBEL SuperTypes (among many other options) give us one such means to characterize the thing being related.

Version 1.00 thus adds 31 new predicates to the UMBEL vocabulary to represent a linkage relationship to a SuperType.

These predicates, listed below, all have the form relatesToXXX. The predicate indicates that the object instance has a relation to the SuperType, perhaps as a true class member or perhaps only as an attribute, but that the degree of this relationship cannot be resolved at this time. (More information or inspection, for example, might cause an rdf:type predicate to be more precisely assigned.)

The formal definition of the relatesToXXX predicate is:

Property name: umbel:relatesToXXX

Description: The property umbel:relatesToXXX is used to assert a relationship between an external instance (object) and a particular (XXX) umbel:SuperType. There may be as many umbel:relatesToXXX properties as there are numbers of SuperTypes (see next table for the listing of all specific umbel:relatesToXXX predicates).

The assertion of this property does not entail class membership with the asserted SuperType. Rather, the assertion may be based on particular attributes or characteristics of the object at hand. For example, a British person might have an umbel:relatesToXXX asserted relation to the SuperType of the geopolitical entity of Britain, though the actual thing at hand (person) is a member of the Person class SuperType.

This predicate is used for filtering or clustering, often within user interfaces. Multiple umbel:relatesToXXX assertions may be made for the same instance.

Domain: owl:Thing
Range: umbel:SuperType
Status: Experimental - Unstable

Each of the 32 UMBEL SuperTypes has a matching predicate for external topic assignments (relatesToOtherOrganism shares two SuperTypes, leading to 31 different predicates):

SuperType            Mapping Predicate          Comments
NaturalPhenomena     relatesToPhenomenon        This predicate relates an external entity to the SuperType (ST) shown. It indicates there is a relationship to the ST of a verifiable nature, but which is undetermined as to strength or a full rdf:type relationship
NaturalSubstances    relatesToSubstance         same as above
Earthscape           relatesToEarth             same as above
Extraterrestrial     relatesToHeavens           same as above
Prokaryotes          relatesToOtherOrganism     same as above
ProtistsFungus       relatesToOtherOrganism     same as above (predicate shared with Prokaryotes)
Plants               relatesToPlant             same as above
Animals              relatesToAnimal            same as above
Diseases             relatesToDisease           same as above
PersonTypes          relatesToPersonType        same as above
Organizations        relatesToOrganizationType  same as above
FinanceEconomy       relatesToFinanceEconomy    same as above
Society              relatesToSociety           same as above
Activities           relatesToActivity          same as above
Events               relatesToEvent             same as above
Time                 relatesToTime              same as above
Products             relatesToProductType       same as above
FoodorDrink          relatesToFoodDrink         same as above
Drugs                relatesToDrug              same as above
Facilities           relatesToFacility          same as above
Geopolitical         relatesToGeoEntity         same as above
Chemistry            relatesToChemistry         same as above
AudioInfo            relatesToAudioMusic        same as above
VisualInfo           relatesToVisualInfo        same as above
WrittenInfo          relatesToWrittenInfo       same as above
StructuredInfo       relatesToStructuredInfo    same as above
NotationsReferences  relatesToNotation          same as above
Numbers              relatesToNumbers           same as above
Attributes           relatesToAttribute         same as above
Abstract             relatesToAbstraction       same as above
TopicsCategories     relatesToTopic             same as above
MarketsIndustries    relatesToMarketIndustry    same as above
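
For scripted use, the table above can be held as a simple lookup. The sketch below transcribes the table as given and confirms the counts stated earlier: 32 SuperTypes map to 31 distinct predicates because two SuperTypes share relatesToOtherOrganism. The variable names are illustrative only.

```python
# SuperType -> mapping predicate, transcribed from the table above.
SUPERTYPE_PREDICATE = {
    "NaturalPhenomena": "relatesToPhenomenon",
    "NaturalSubstances": "relatesToSubstance",
    "Earthscape": "relatesToEarth",
    "Extraterrestrial": "relatesToHeavens",
    "Prokaryotes": "relatesToOtherOrganism",
    "ProtistsFungus": "relatesToOtherOrganism",
    "Plants": "relatesToPlant",
    "Animals": "relatesToAnimal",
    "Diseases": "relatesToDisease",
    "PersonTypes": "relatesToPersonType",
    "Organizations": "relatesToOrganizationType",
    "FinanceEconomy": "relatesToFinanceEconomy",
    "Society": "relatesToSociety",
    "Activities": "relatesToActivity",
    "Events": "relatesToEvent",
    "Time": "relatesToTime",
    "Products": "relatesToProductType",
    "FoodorDrink": "relatesToFoodDrink",
    "Drugs": "relatesToDrug",
    "Facilities": "relatesToFacility",
    "Geopolitical": "relatesToGeoEntity",
    "Chemistry": "relatesToChemistry",
    "AudioInfo": "relatesToAudioMusic",
    "VisualInfo": "relatesToVisualInfo",
    "WrittenInfo": "relatesToWrittenInfo",
    "StructuredInfo": "relatesToStructuredInfo",
    "NotationsReferences": "relatesToNotation",
    "Numbers": "relatesToNumbers",
    "Attributes": "relatesToAttribute",
    "Abstract": "relatesToAbstraction",
    "TopicsCategories": "relatesToTopic",
    "MarketsIndustries": "relatesToMarketIndustry",
}

supertype_count = len(SUPERTYPE_PREDICATE)
predicate_count = len(set(SUPERTYPE_PREDICATE.values()))
shared = sorted(st for st, p in SUPERTYPE_PREDICATE.items() if p == "relatesToOtherOrganism")
print(supertype_count, predicate_count, shared)
```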

These predicates and their associations with SuperTypes have the following counts in UMBEL version 1.00:

 relatesToAbstraction         5042
 relatesToActivity           51651
 relatesToAnimal             30949
 relatesToAttribute            966
 relatesToAudioInfo         149233
 relatesToChemistry           2162
 relatesToDisease             7060
 relatesToDrug                6061
 relatesToEvent             762966
 relatesToFacility           69528
 relatesToFoodDrink          16440
 relatesToGeoEntity         174223
 relatesToHeavens            19258
 relatesToMarketIndustry      1184
 relatesToNotation           37974
 relatesToNumbers             3023
 relatesToOrganizationType  187775
 relatesToOtherOrganism       3743
 relatesToPersonType        949914
 relatesToPhenomenon          3356
 relatesToPlant              20552
 relatesToProductType       121896
 relatesToSociety            16228
 relatesToStructuredInfo     17024
 relatesToSubstance           9168
 relatesToTime                2932
 relatesToTopic             174963
 relatesToVisualInfo         74564
 relatesToWorkplace           1064
 relatesToWrittenInfo       134130

Qualifications to the umbel:hasMapping Predicate

Another major change was to apply the UMBEL hasMapping predicate to all of the possible assignments, using a controlled vocabulary for characterizing the mapping assignment. This vocabulary was designed to capture the diversity of sources and methods used for the current mapping sets in UMBEL v. 1.00.

The Qualifier class is a set of descriptions that indicate the method used in order to establish an isAbout or similar relationship between an UMBEL reference concept (RC) and an external entity. This description should be complete enough to aid understanding of the nature and reliability of the "aboutness" assertion and to be usable for filtering or user interface information. The descriptions may be literal strings or may refer to literal numeric values resulting from an automated alignment technique.

Here is the current listing:

 Qualifier                       Description
 Manual - Nearly Equivalent      The two mapped concepts are deemed to be nearly an equivalentClass or sameAs relationship, but not 100% so
 Manual - Similar Sense          The two mapped concepts share much overlap, but are not the exact same sense, such as an action as related to the thing it acts upon
 Heuristic - ListOf Basis        Type assignment based on Wikipedia ListOf category; not currently used
 Heuristic - Not Specified       Heuristic mapping method applied; script or technique not otherwise specified
 External - OpenCyc Mapping      Mapping based on existing OpenCyc assertion
 External - DBOntology Mapping   Mapping based on existing DBOntology assertion
 External - GeoNames Mapping     Mapping based on existing GeoNames assertion
 Automatic - Inspected SV        Mapping based on automatic scoring of concepts using Semantic Vectors, with specific alignment choice based on hand selection
 Automatic - Inspected S-Match   Mapping based on automatic scoring of concepts using S-Match, with specific alignment choice based on hand selection; not currently used
 Automatic - Not Specified       Mapping based on automatic scoring of concepts using a script or technique not otherwise specified; not currently used
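
The following minimal sketch suggests how the qualifier vocabulary might be used for filtering, as described above. It assumes a simple in-memory record for each mapping; the `Mapping` class, the helper functions, and the sample concept and target names are all hypothetical and do not reflect how UMBEL actually stores or serves its mappings. Only the qualifier strings come from the listing.

```python
from dataclasses import dataclass

# Qualifier strings taken from the listing above.
QUALIFIERS = frozenset({
    "Manual - Nearly Equivalent",
    "Manual - Similar Sense",
    "Heuristic - ListOf Basis",
    "Heuristic - Not Specified",
    "External - OpenCyc Mapping",
    "External - DBOntology Mapping",
    "External - GeoNames Mapping",
    "Automatic - Inspected SV",
    "Automatic - Inspected S-Match",
    "Automatic - Not Specified",
})


@dataclass(frozen=True)
class Mapping:
    """Hypothetical record: one hasMapping assertion and its qualifier."""
    rc: str          # UMBEL reference concept (illustrative name)
    target: str      # external entity (illustrative name)
    qualifier: str   # one of QUALIFIERS

    def __post_init__(self):
        # Reject qualifiers outside the controlled vocabulary.
        if self.qualifier not in QUALIFIERS:
            raise ValueError(f"unknown qualifier: {self.qualifier!r}")


def method_family(qualifier):
    """Return the leading part of a qualifier, e.g. 'Manual' or 'External'."""
    return qualifier.split(" - ", 1)[0]


def filter_by_family(mappings, families):
    """Keep only the mappings whose qualifier family is in `families`."""
    return [m for m in mappings if method_family(m.qualifier) in families]


# Made-up sample data, for illustration only.
SAMPLE = [
    Mapping("ExampleConceptA", "ExampleTargetA", "Manual - Nearly Equivalent"),
    Mapping("ExampleConceptB", "ExampleTargetB", "External - DBOntology Mapping"),
    Mapping("ExampleConceptC", "ExampleTargetC", "Automatic - Inspected SV"),
]

if __name__ == "__main__":
    for m in filter_by_family(SAMPLE, {"Manual", "External"}):
        print(m.rc, "->", m.target, f"[{m.qualifier}]")
```

Grouping on the leading word of each qualifier (Manual, Heuristic, External, Automatic) is a design choice of this sketch; a user interface could equally filter on the full qualifier string.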

Some Other Vocabulary Changes

Lastly, as a result of all of the testing and mapping associated with creating UMBEL v. 1.00, some other minor vocabulary changes were necessary. These are described in the main UMBEL Specification.

ONTOLOGY TESTING

All mapping assignments have been tested for consistency and satisfiability.

All new reference concepts added to UMBEL have been tested for proper assignment and disjointness based on their claimed SuperTypes. When this test discovered errors, it led to close inspection of the reference concept and, most often, to a change in the SuperType to which it was assigned. In a few cases, however, the errors pointed to oversights in the disjointness assertions between SuperTypes, which were then corrected.
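
The logic of the disjointness test can be sketched in a few lines of Python. This is not the tooling used for UMBEL, and it does not replace the consistency and satisfiability checks made over the full ontology; it only shows the basic idea of flagging a concept that is assigned to two SuperTypes asserted to be disjoint. The function name, the concept names, and the single disjoint pair shown are assumptions made for illustration.

```python
from itertools import combinations


def disjointness_violations(assignments, disjoint_pairs):
    """Return (concept, supertype_a, supertype_b) for every concept that is
    assigned to two SuperTypes declared disjoint with each other."""
    disjoint = {frozenset(pair) for pair in disjoint_pairs}
    violations = []
    for concept, supertypes in sorted(assignments.items()):
        for a, b in combinations(sorted(supertypes), 2):
            if frozenset((a, b)) in disjoint:
                violations.append((concept, a, b))
    return violations


# Illustrative data only; not UMBEL's actual disjointness assertions.
DISJOINT_PAIRS = [("Animal", "Plant")]
ASSIGNMENTS = {
    "ExampleConceptA": {"Animal"},
    "ExampleConceptB": {"Animal", "Plant"},  # conflicting assignment
}

if __name__ == "__main__":
    for concept, a, b in disjointness_violations(ASSIGNMENTS, DISJOINT_PAIRS):
        print(f"{concept}: {a} and {b} are disjoint")
```

A flagged concept would then be inspected by hand, with the fix being either a changed SuperType assignment or a revised disjointness assertion, as described above.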

NEXT STEPS

With version 1.00, most of the methods and techniques have now been tested and refined for mapping UMBEL to external vocabularies and knowledge bases. The overall UMBEL vocabulary has also been refined, particularly for the purposes of acting as a gold standard for mappings and reference.

With regard to Wikipedia, some 40% (about 11 K) of UMBEL reference concepts still lack a direct correspondence. Determining the types of those entities also needs expansion. These efforts will likely result in changes to the size of the overall UMBEL Reference Concept Structure as well.

GeoNames mappings have been done at the level of the 670 or so feature codes, but testing for consistency and satisfiability has not yet been completed. Expect to see a much improved geographical component to UMBEL in upcoming releases.

Re-establishing earlier UMBEL links to other external vocabularies is also important. Updating the UMBEL Web services and means to better edit and manage the vocabulary are also in the works.

ENDNOTES

  1. Strictly speaking in knowledge bases, a coherent knowledge base is one that is both consistent and satisfiable (see next). However, as used in UMBEL, we also add to that the test that the knowledge base makes sense; that is, has logical order and understandable semantic relationships. See further http://en.wikipedia.org/wiki/Coherence_%28linguistics%29.
  2. In logic, a consistent theory is one that does not contain a contradiction; see http://en.wikipedia.org/wiki/Consistency_%28knowledge_bases%29.
  3. In mathematical logic, satisfiability and validity are elementary concepts concerning interpretation. A formula is satisfiable with respect to a class of interpretations if it is possible to find an interpretation that makes the formula true; see http://en.wikipedia.org/wiki/Satisfiability.
  4. Fortunately, returns on the time investment will accelerate since basic lessons and techniques have now been learned.
  5. See the 'Use of OpenCyc' section in the UMBEL Specifications.
  6. Note: Since new releases of UMBEL have occurred since this version, the current actual statistics differ. See the main Specification for the current figures.
  7. For a somewhat dated description of PROTON, see http://proton.semanticweb.org/.
  8. The DBpedia Ontology, found at http://wiki.dbpedia.org/Ontology.
  9. See http://www.mkbergman.com/930/announcing-a-major-new-umbel-release/
  10. Functional categories combine two or more facets in order to split or provide more structured characterization of a category. For example, Category:English cricketers of 1890 to 1918 has as its core concept the idea of a cricketer, a sports person. But, this is also further characterized by nationality and time period. Functional categories tend to have an A x B x C construct, with prepositions denoting the facets. From a proper characterization standpoint, the items in this category should be classified as a Person --> Sports Person --> Cricketer, with additional facets (metadata) of being English and having the period 1890 to 1918 assigned.
  11. The inclusion of functional categories by most prior semantic and analytical approaches to Wikipedia (see, for example, the Sweetpedia listing of Wikipedia analysis papers) has been a major source of unnecessary low precision and recall findings.
  12. See, for example, Vivi Nastase and Michael Strube, 2008. Decoding Wikipedia Categories for Knowledge Acquisition, in Proceedings of the AAAI08 Conference, Chicago, US, pp. 1219-1224. See http://www.eml-research.de/english/homes/nastase/Publications/nastase08b.pdf.
  13. For more information on GeoNames, see http://www.geonames.org/. The complete mapping to GeoNames has not been completed for UMBEL version 1.00. Look for this mapping to be published in subsequent releases.
  14. FactForge presents what its developer, Ontotext, calls a reason-able view to the web of data. FactForge aims to allow users to find resources and facts based on the semantics of the data, like web search engines index WWW pages and facilitate their usage. It provides efficient mechanisms to query data from multiple datasets and sources, taking into account the semantics of the data. FactForge includes the datasets of DBPedia, Freebase, Geonames, UMBEL, Wordnet, CIA World Factbook, Lingvoj, MusicBrainz (RDF from Zitgist). The system also uses the schema or ontologies of Dublin Core, SKOS, RSS and FOAF. FactForge has a size of about 1.2B explicit statements, plus 0.8B inferred statements and 10B different retrievable statements.
  15. See http://code.google.com/p/semanticvectors/
  16. Functional categories combine two or more facets in order to split or provide more structured characterization of a category. For example, Category:English cricketers of 1890 to 1918 has as its core concept the idea of a cricketer, a sports person. But, this is also further characterized by nationality and time period. Functional categories tend to have an A x B x C construct, with prepositions denoting the facets. From a proper characterization standpoint, the items in this category should be classified as a Person --> Sports Person --> Cricketer, with additional facets (metadata) of being English and having the period 1890 to 1918 assigned.
  17. Semantic vector indexes are created by applying a Random Projection algorithm to term-document matrices created using Apache Lucene. The package creates a WordSpace model, of the kind developed by Stanford University's Infomap Project and other researchers during the 1990s and early 2000s. Such models are designed to represent words and documents in terms of underlying concepts, and as such can be used for many semantic (concept-aware) matching tasks such as automatic thesaurus generation, knowledge representation, and concept matching. The Semantic Vectors package uses a Random Projection algorithm, a form of automatic semantic analysis. Other methods supported by the package include Latent Semantic Analysis (LSA) and Reflective Random Indexing.
  18. Olena Medelyan and Cathy Legg, 2008. Integrating Cyc and Wikipedia: Folksonomy Meets Rigorously Defined Common-Sense, in Proceedings of the WIKI-AI: Wikipedia and AI Workshop at the AAAI08 Conference, Chicago, US. See http://www.cs.waikato.ac.nz/~olena/publications/Medelyan_Legg_Wikiai08.pdf.
  19. A more recent update with claimed mapping improvements is Samuel Sarjant, Catherine Legg, Michael Robinson and Olena Medelyan, 2009. “All You Can Eat” Ontology-Building: Feeding Wikipedia to Cyc, in 2009 IEEE/WIC/ACM International Conference on Web Intelligence (WI-09), 15 – 18 September 2009 Università degli Studi di Milano Bicocca, Milano, Italy. See http://www.cs.waikato.ac.nz/%7Eolena/publications/feedingWikipedia2Cyc.pdf. Note, however, that these mappings are not available at present. For example, the source http://wdm.cs.waikato.ac.nz/cyc/portal/ is not currently active.
  20. See further, http://www.cs.waikato.ac.nz/~olena/cyc.html. Note that the candidate starting list was the 42,279 exact mappings shown on this page.
  21. OpenCyc v 3.0 is used for the current UMBEL; see the OpenCyc KB v3.0 release notes for additional information.
  22. M.K. Bergman, 2010. "The Nature of Connectedness on the Web," AI3:::Adaptive Information blog, November 22, 2010; see http://www.mkbergman.com/935/the-nature-of-connectedness-on-the-web/.
  23. Jennifer Sleeman and Tim Finin, 2010. "Learning Co-reference Relations for FOAF Instances," Proceedings of the Poster and Demonstration Session at the 9th International Semantic Web Conference, November 2010; see http://ebiquity.umbc.edu/_file_directory_/papers/522.pdf.
  24. See quote on http://www.semanticoverflow.com/questions/1095/alternatives-to-owlsameas-for-linked-data