UMBEL - Annex G 20101115
UMBEL Annex Document - 15 November 2010
- Latest version
- Last update
- $Date: 2010/11/15 12:32:43 $
- Version No.: 0.80
- TR 10-11-15-G
- Michael Bergman - Structured Dynamics
- Frédérick Giasson - Structured Dynamics
|UMBEL: Upper Mapping and Binding Exchange Layer by Structured Dynamics LLC and Ontotext AD is provided under the Creative Commons Attribution 3.0 license. See the attribution section for how to cite the effort.|
- 1 INTRODUCTION
- 2 SUPPORTING FILES
- 3 BASIS AND RATIONALE FOR THE SUPERTYPE CLASS
- 4 DESCRIPTION OF THE SUPERTYPES
- 5 ANALYSIS OF THE SUPERTYPES
- 6 POTENTIAL PROBLEM AREAS
INTRODUCTION
This report describes the rationale for the class of SuperTypes within UMBEL and how all 20,000+ reference concepts (RCs) are assigned to one of 33 current SuperTypes.
These SuperTypes are split into two main groupings. The four SuperTypes of Attributes, Abstract-level, Topics/Categories and Markets & Industries are designed to be fully non-disjoint, and do not participate in any disjoint assertions. About 10% of all RCs fall into this grouping.
The remaining 29 SuperTypes are designed to be as disjoint as possible. The degree of disjointness and other commentary are provided in the main body of this report. These 29 SuperTypes designed as mostly disjoint are:
Protists or Fungus
Finance & Economy
Society (culture, issues, beliefs)
Food or Drink
Notations & References
In addition, all of these SuperTypes are clustered into 9 "dimensions", which are useful for aggregation and organizational purposes, but which have no direct bearing on logic assertions or disjoint testing.
SUPPORTING FILES
Four files accompany this report and provide the actual assignments and details. They are, with brief explanations as to content and interpretation:
This is a two-column listing of all 20,512 reference concepts organized by the 33 SuperTypes. Each SuperType is listed in Col 1 followed by its assigned reference concepts in Col 2, in alphabetical order.
This file only lists the reference concepts that are non-disjoint (they share a parental reference concept). The file is organized into four columns:
SuperType-1 SuperType-2 Reference Concept super-Reference Concept
The first two columns list the SuperTypes shared by the given reference concept (Col 3). The last column shows the direct parent concept of the given reference concept.
This file is perhaps best viewed and manipulated by removing all of the inherently non-disjoint SuperTypes (Attributes, Abstract-level, Topics/Categories and Markets & Industries) that occur in Cols 1 and 2.
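As a rough illustration of that filtering step, the sketch below drops every row in which either of the first two columns names one of the four inherently non-disjoint SuperTypes. The tab-delimited layout, the identifier spellings, the function name and the sample rows are all assumptions made for illustration; adjust them to match the actual file.

```python
import csv
import io

# The four SuperTypes that are non-disjoint by definition.
# Identifier spellings are assumed; match them to the actual file.
NON_DISJOINT = {"Abstract_level", "Attributes", "MarketsIndustries", "TopicsCategories"}


def filter_overlap_rows(lines, delimiter="\t"):
    """Keep only rows whose first two columns are both mostly disjoint SuperTypes."""
    kept = []
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        st1, st2 = row[0].strip(), row[1].strip()
        if st1 in NON_DISJOINT or st2 in NON_DISJOINT:
            continue
        kept.append([cell.strip() for cell in row[:4]])
    return kept


# Hypothetical sample rows: SuperType-1, SuperType-2, RC, parent RC.
SAMPLE = io.StringIO(
    "Organizations\tPersonTypes\tMusicPerformingAgent\tExampleParentConcept\n"
    "Attributes\tProducts\tExampleConceptA\tExampleParentConcept\n"
    "Events\tTopicsCategories\tExampleConceptB\tExampleParentConcept\n"
)

rows = filter_overlap_rows(SAMPLE)
for row in rows:
    print(row)
```

Only the first sample row survives, because the other two each involve one of the four non-disjoint SuperTypes.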
This file presents the statistics for the SuperTypes and their intersection as measured by reference concepts that are non-disjoint (they share a parental reference concept). The file is organized into three columns:
SuperType-1 SuperType-2 Number of Reference Concepts
The number shown is irrespective of which of the two shared SuperTypes a given reference concept is assigned to. For the actual assignments by reference concept, see the next file.
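One plausible way to arrive at such pair counts from four-column rows is sketched below. Whether the statistics file was actually produced this way is an assumption, and the function name and sample rows are hypothetical. The pair is treated as unordered, in keeping with the note that the count is irrespective of assignment.

```python
from collections import Counter


def count_pair_overlaps(rows):
    """Count distinct reference concepts for each unordered SuperType pair."""
    seen = set()
    counts = Counter()
    for st1, st2, rc, _parent in rows:
        pair = tuple(sorted((st1, st2)))  # order of the two SuperTypes is ignored
        if (pair, rc) in seen:
            continue  # count each reference concept once per pair
        seen.add((pair, rc))
        counts[pair] += 1
    return counts


# Hypothetical rows: SuperType-1, SuperType-2, RC, parent RC.
SAMPLE_ROWS = [
    ("Organizations", "PersonTypes", "MusicPerformingAgent", "ExampleParent"),
    ("PersonTypes", "Organizations", "ExampleAgent", "ExampleParent"),
    ("Activities", "Events", "ExampleOccasion", "ExampleParent"),
]

pair_counts = count_pair_overlaps(SAMPLE_ROWS)
for (st1, st2), n in sorted(pair_counts.items()):
    print(st1, st2, n)
```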
This is the largest and master listing file. To get material out, you will likely need to save individual tabs as CSVs. If you do not have a newer version of Excel, open the file in OpenOffice. As for the spreadsheet itself:
- Read all notes under the latter column in the Overview tab
- The Matrix tabs show where some areas are not disjoint with other areas. The defined non-disjoint categories -- Attributes, Abstract-level, Topics/Categories and Markets & Industries -- have much interaction with the other categories and no statistics for these are shown. In other categories the interactions are sometimes minimal and sometimes not. These overlaps are generally explained in the SuperType Intersection and Potential Overlaps columns on the Overview tab
- The Overview tab presents a description of each SuperType, clusters them into 9 "dimensions", shows overlaps, and provides other commentary and notes
- After the intro tabs -- Matrix, Matrix-Weighted, Overview, Stats, and NOTES -- there follow 33 tabs, one for each SuperType, in alphabetical order
- Four of the 33 SuperTypes are by definition non-disjoint. These are Abstract_level, Attributes, MarketsIndustries and TopicsCategories
- The remaining 29 SuperTypes are mostly disjoint. Where there are overlaps, these are presented by RefConcept (RC) and SuperType (ST) in columns C to J with counts on each tab
- There still are likely problems with Chemicals and Natural Substances; additional analysis is needed
- Events and Activities are quite close; a concept should be "named" to be an event. Most Activities are human-related.
BASIS AND RATIONALE FOR THE SUPERTYPE CLASS
The assignment of UMBEL reference concepts to SuperTypes was an outgrowth of the observation that many of the concepts within UMBEL may be clustered into disjoint groupings. Most things and concepts about them are based on real, observable, physical things in the real world. Because most of these things cannot occupy both the same moment in time and the same location in physical space, a useful criterion for looking at these things and concepts is disjointedness.
In a broad sense, then, we can split our concepts of the world between those ideas that are disjoint because they pertain to separable objects or ideas and those that are cross-cutting or organizational or classificatory. Attributes, such as color (pink, for example), are often cross-cutting in that they can be used to describe quite disparate things. Inherent classification schemes such as academic fields of study or library catalog systems — while useful ways to organize the world — are not themselves in-and-of the world or discrete from other ideas. Thus, classificatory or organizational concepts are inherently not disjoint.
The potential advantage of clustering into logical, disjoint groups can include:
- A better basis for organizing a large concept space
- Possible amenability to the use of templates for displaying similar attributes and information for similar concepts
- Possible computational efficiency due to being able to segregate concepts into logically coherent groupings
- Improved disambiguation by assessing concept matches in addition to entity matches via triangulation between the two assessments
- Structure and integrity testing.
Any classificatory scheme has a degree of arbitrariness. To be useful, it must be perceived as logical and coherent and it should achieve most if not all of the potential advantages above.
Both "bottom up" (coherent clustering of related concepts) and "top down" (selecting top-level concepts and evaluating and clustering all child concepts using union, intersection or complement operators) were used to create the assignments herein. Each approach was iterated multiple times, with logic and coherence testing after every run. For example, analysis of shared parent concepts in the lineage and other structure-wide tests were employed.
Classification schemes always are subject to the tension between "lumping" and "splitting": are three groupings too few, 100 too many? This tension is also compounded by the possible sense of arbitrary boundaries, such as why "Drugs" and not "Toys"?
Classical taxonomists and other classifiers have always striven for "natural" classification systems. Based on the best information available, is the assignment of one item to Group A more defensible than it is to Group B? New knowledge or perceptions, such as the immense impact of genetics on classical systematics, can thoroughly change perceptions of what is logical and natural.
In the case of these UMBEL reference concepts, the tests employed were to find the highest degree of disjointedness while also maintaining a sense of logical coherence with the observable world. And, where non-disjointness was found, could that degree of overlap be seen as both natural and limited? For example, the SuperType of PersonTypes is non-disjoint with Animals because persons are humans; otherwise the groups are disjoint. Similarly, PersonTypes are non-disjoint with Organizations because some types of agents, such as MusicPerformingAgent, may be either an individual or a group.
These overlaps can be understood, and can also be kept as minimal as possible.
DESCRIPTION OF THE SUPERTYPES
Here is a description of the SuperTypes, their clustering into "dimensions" and the intersections with other SuperTypes. Note that SuperType intersections with strong overlap (more than 10 assigned reference concepts involved) are noted in Bold, with very strong overlap (more than 100 assigned reference concepts involved) noted in Bold Underline:
|Dimension||SuperType (label)||Description/Sub-types||SuperType Intersections|
|Natural World||Natural Phenomena||This SuperType includes natural phenomena and natural processes such as weather, weathering, erosion, fires, lightning, earthquakes, tectonics, etc. Clouds and weather processes are specifically included. Also includes climate cycles, general natural events (such as hurricanes) that are not specifically named, and biochemical processes and pathways.||Activities, Events|
| ||NaturalSubstances||Notable inclusions are minerals, compounds, chemicals, or physical objects that are not the outcome of purposeful human effort, but are found naturally occurring. Other natural objects (such as rock, fossil, etc.) are also found under this SuperType. Natural Substances include subatomic particles. The contrast is with Earthscape, which covers natural "features", and with living substances, which are covered under the appropriate SuperTypes. Chemicals can be Natural Substances, but only if they are naturally occurring, such as limestone or salt.||Animals, Chemistry, Drugs, FoodDrinks, Products|
| ||Earthscape||The Natural Feature SuperType is the collection of cartographic features that occur on the surface of the Earth. Positive examples include Mountain, Ocean, and Mesa. Artificial features such as canals are excluded. Most instances of these features have a fixed location in space. Underground and underwater features are also explicitly included. This SuperType is explicitly disjoint with Extraterrestrial (see below).||Geopolitical, NaturalSubstances, Organizations|
| ||Extraterrestrial||This SuperType includes all natural things not specifically terrestrial, including celestial bodies (planets, asteroids, stars, galaxies, etc., that can be located within a sky map)||Events, NaturalPhenomena, NaturalSubstances, VisualInfo|
| ||Prokaryotes||The Prokaryotes include all prokaryotic organisms, including the Monera, Archaebacteria, Bacteria, and blue-green algae. Also included in this SuperType are viruses and prions.|| |
| ||Protists & Fungus||This is the remaining cluster of eukaryotic organisms, specifically including the fungus and the protista (protozoans and slime molds).||FoodDrinks, Prokaryotes|
| ||Plants||This SuperType includes all plant types and flora, including flowering plants, algae, non-flowering plants, gymnosperms, cycads, and plant parts and body types. Note that all Plant Parts are also included.||Drugs, FoodDrinks, Products|
| ||Animals||This large SuperType includes all animal types, including specific animal types and vertebrates, invertebrates, insects, crustaceans, fish, reptiles, amphibia, birds, mammals, and animal body parts. Animal parts are specifically included. Also, groupings of such animals are included. Humans, as an animal, are included (versus as an individual Person). Diseases are specifically excluded. Animals have many overlaps similar to those of Plants. However, in addition, there are more terms for animal groups, animal parts, animal secretions, etc. Also Animals can include some human traits (posture, dead animal, etc.)||Chemistry, FoodDrinks, NaturalSubstances, PersonTypes, Products, Society|
| ||Diseases||Diseases are atypical or unusual or unhealthy conditions for (mostly human) living things, generally known as conditions, disorders, infections, diseases or syndromes. Diseases only affect living things and sometimes are caused by living things. This SuperType also includes impairments, disease vectors, wounds and injuries, and poisoning||Animals, Events, NaturalPhenomena|
| ||PersonTypes||The appropriate SuperType for all named, individual human beings. This SuperType also includes the assignment of formal, honorific or cultural titles given to specific human individuals. It further includes names given to humans who conduct specific jobs or activities (the latter case is known as an avocation). Examples include steelworker, waitress, lawyer, plumber, artisan. Ethnic groups are specifically included.||Animals, Society, Organizations|
| ||Organizations||Organization is a broad SuperType and includes formal collections of humans, sometimes by legal means, charter, agreement or some mode of formal understanding. Examples include geopolitical entities such as nations, municipalities or countries; or companies, institutes, governments, universities, militaries, political parties, game groups, international organizations, trade associations, etc. All institutions, for example, are organizations. Also included are informal collections of humans. Informal or less defined groupings of humans may result from ethnicity or tribes or nationality or from shared interests (such as social networks or mailing lists) or expertise ("communities of practice"). This dimension also includes the notion of identifiable human groups with set members at any given point in time. Examples include music groups, cast members of a play, directors on a corporate Board, TV show members, gangs, mobs, juries, generations, minorities, etc. Finally, Organizations contain the concepts of Industries and Programs and Communities.|| |
| ||Finance & Economy||This SuperType pertains to all things financial and with respect to the economy, including chartable company performance, stock index entities, money, local currencies, taxes, incomes, accounts and accounting, mortgages and property.||Activities, Earthscape, Events, Facilities, NaturalSubstances, Products, StructuredInfo, WrittenInfo|
| ||Society (culture, issues, beliefs)||This category includes concepts related to political systems, laws, rules or cultural mores governing societal or community behavior, or doctrinal, faith or religious bases or entities (such as gods, angels, totems) governing spiritual human matters. Culture, issues, beliefs and various activisms (most -isms) are included||PersonTypes, WrittenInfo|
| ||Activities||These are ongoing activities that result (mostly) from human effort, often conducted by organizations to assist other organizations or individuals (in which case they are known as services, such as medicine, law, printing, consulting or teaching) or individual or group efforts for leisure, fun, sports, games or personal interests (activities)||Events, FinanceEconomy, NaturalPhenomena, Products, StructuredInfo|
| ||Events||These are nameable occasions, games, sports events, conferences, natural phenomena, natural disasters, wars, incidents, anniversaries, holidays, or notable moments or periods in time||Activities, Chemistry, FinanceEconomy, NaturalPhenomena|
| || ||This SuperType is for specific time or date or period (such as eras, or days, weeks, months type intervals) references in various formats|| |
| ||Products||This is the largest SuperType and includes any instance offered for sale or performed as a commercial service. Often these are physical objects made by humans that are not a conceptual work or a facility, such as vehicles, cars, trains, aircraft, spaceships, ships, foods, beverages, clothes, drugs, weapons. Products also include the concept of 'state' (e.g., on/off)||Activities, Animals, AudioInfo, Chemistry, Drugs, Facilities, FinanceEconomy, FoodDrinks, NaturalSubstances, Notations, Plants, StructuredInfo, VisualInfo, WrittenInfo|
| ||Food or Drink||This SuperType is any edible substance grown, made or harvested by humans. The category also specifically includes the concept of cuisines||Activities, Animals, Chemistry, Drugs, Events, NaturalSubstances, Plants, Products, ProtistsFungus|
| ||Drugs||This SuperType is a drug, medication or addictive substance||Chemistry, FoodDrinks, NaturalSubstances, Products|
| ||Facilities||Facilities are physical places or buildings constructed by humans, such as schools, public institutions, markets, museums, amusement parks, worship places, stations, airports, ports, carstops, lines, railroads, roads, waterways, tunnels, bridges, parks, sport facilities, monuments. All can be geospatially located. Facilities also include animal pens and enclosures and general human "activity" areas (golf course, archeology sites, etc.). Importantly, Facilities include infrastructure systems such as roadways and physical networks. Facilities also include the component parts that go into making them (such as foundations, doors, windows, roofs, etc.)||Earthscape, FinanceEconomy, Products, Workplaces|
| ||Geopolitical||Named places that have some informal or formal political (authorized) component. Important subcollections include Country, IndependentCountry, State_Geopolitical, City, and Province.||FinanceEconomy, Organizations|
| ||Workplaces||These are various workplaces and areas of human activities, ranging from single person workstations to large aggregations of people (but which are not formal political entities)||Earthscape, Facilities, FinanceEconomy|
| ||Chemistry||This SuperType is a residual category (n.o.c., not otherwise categorized) for chemical bonds, chemical composition groupings, and the like. It is formed by what is not a natural substance or living thing (organic) substance.||Drugs, Events, FoodDrinks, NaturalSubstances, Products|
| ||AudioInfo||This SuperType is for any audio-only human work. Examples include live music performances, record albums, or radio shows or individual radio broadcasts||Events, Notations, Products|
| ||VisualInfo||Any still image or picture or streaming video human work, with or without audio. Examples include graphics, pictures, movies, TV shows, individual shows from a TV show, etc.||AudioInfo, Events, Facilities, NaturalPhenomena, Notations, Products, StructuredInfo, WrittenInfo|
| ||WrittenInfo||This SuperType includes any general material written by humans including books, blogs, articles, manuscripts, and any other written information conveyed via text.||FinanceEconomy, Notations, Products, StructuredInfo, VisualInfo|
| ||StructuredInfo||This information SuperType is for all kinds of structured information and datasets, including computer programs, databases, files, Web pages and structured data that can be presented in tabular form||Activities, Events, NaturalPhenomena, Notations, Products, VisualInfo, WrittenInfo|
| ||Notations & References||Akin to conceptual works, these are codified means of human expression. Examples range from human languages themselves, to more domain-specific cases such as chemical symbols, genetic code (A-G-C-T), protocols, and computer languages, mathematical and set notations, etc. Identifiers (numeric or alphanumeric identifiers for objects, often in a highly patterned way, such as phone numbers, URLs, zip and postal codes, SKUs, product codes, etc.), Units (any of the various ways in which measurement, space, volume, weight, speed, intensity, temperature, calories, seismic intensity or other quantitative descriptions of phenomena can be made) and key reference types are also included in this SuperType||AudioInfo, Numbers, StructuredInfo, VisualInfo, WrittenInfo|
| ||Numbers||This unique SuperType is for any abstract representation of numbers and numerics||Notations|
| ||Attributes||This general SuperType category is for descriptive attributes of all kinds. Think of the specific attributes in Wikipedia "infoboxes" to understand the purpose and coverage of this SuperType. It includes colors, shapes, sizes, emotions, states or other descriptive characteristics about an object, particularly those that can be listed or enumerated by attribute type|| |
| ||Abstract-level||This general SuperType category is largely composed of former AbstractConcepts, and represents some of the more abstract upper-level nodes for connecting the UMBEL structure together. This SuperType also includes theories or processes or methods for humans to do stuff or any human technology|| |
| ||Topics/Categories||This largely subject-oriented SuperType is a means for using controlled vocabularies and classification schemes for characterizing what content "is about". The key constituents of this category are Types, Classifications, Concepts, CCC, and controlled vocabularies|| |
| ||Markets & Industries||This SuperType is a specialized classificatory system for markets and industries. It could be combined with the SuperType above, but is kept separate in order to provide a separate, economy-oriented system.|| |
The actual assignments to these categories are shown in the SuperTypes_20100316.xlsx file.
ANALYSIS OF THE SUPERTYPES
This section provides an analysis of the reference concept assignments and their possible disjointedness or overlap with other SuperTypes.
Distribution of SuperTypes
The following diagram shows the distribution of these 20,000 UMBEL concepts across SuperTypes. By far the largest SuperType is Products (itself split into two columns to keep the other items in scale), even with further splits into Food & Drinks and Drugs (pharmaceuticals). The next largest categories are the Person, Places and Events SuperTypes, with Organizations and Animals not far behind:
Even in its generic state, UMBEL provides a very rich vocabulary for describing things or for tying in more detailed external ontologies. There are nearly 5,000 concepts across products of all types, for example. Note this figure with actual values is in the SuperTypes_20100316.xlsx file as well.
Possible Overlaps (non-disjoint) between SuperTypes
Twenty-nine of the SuperTypes are "mostly disjoint." This is because there are some concepts, say MusicPerformingAgent, that can apply to either a person or a group (band or orchestra, for example). Thus, for this concept alone, we have a bit of overlap between the normally disjoint Person and Organization SuperTypes.
The following shows the resulting interaction matrix where there may be some overlap between SuperTypes, including the RC count of the overlap and shading from light red to dark red showing an increasing degree of overlap:
This kind of interaction diagram is also useful for further analyzing the concept graph structure. Note this figure is in the SuperTypes_20100316.xlsx file as well.
Disjoint and Non-disjoint Analysis
First, the 29 SuperTypes in our mostly disjoint categories contain 90% of the UMBEL reference concepts. The remaining 10%, by definition classificatory or non-disjoint (overlapping), occurs in the other four SuperTypes. Here are the summary percentages of these high-level splits:
|Disjoint Concepts (29 SuperTypes)||90%|
|Attributes (1 SuperType)||1%|
|Classifications (3 SuperTypes)||9%|
The actual statistics by SuperType are shown in the table below.
Within the 90% of reference concepts that are putatively disjoint, nearly 80% are fully disjoint. That means, of the 20,000+ concepts within UMBEL, about 14,400 (or about 70% overall) are disjoint from all other SuperType categories. This means more than two-thirds of concepts when identified can be immediately segregated and dealt with on a computationally separate basis. This also means that many of the other advantages of a SuperType system design noted earlier can also be invoked.
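The arithmetic behind these figures can be checked with a short sketch. It assumes the 20,512 total given for the supporting files and takes the 90% share and the 14,400 count from the text as approximate inputs; the variable names are illustrative only.

```python
TOTAL_RCS = 20_512             # total reference concepts, per the supporting files
SHARE_IN_DISJOINT_STS = 0.90   # share assigned to the 29 mostly disjoint SuperTypes
FULLY_DISJOINT_RCS = 14_400    # approximate count quoted in the text

rcs_in_disjoint_sts = TOTAL_RCS * SHARE_IN_DISJOINT_STS
share_within_group = FULLY_DISJOINT_RCS / rcs_in_disjoint_sts
share_overall = FULLY_DISJOINT_RCS / TOTAL_RCS

print(f"RCs in the 29 SuperTypes: {rcs_in_disjoint_sts:,.0f}")
print(f"Fully disjoint within that group: {share_within_group:.0%}")
print(f"Fully disjoint overall: {share_overall:.0%}")
```

With these inputs the within-group share comes out at about 78% ("nearly 80%") and the overall share at about 70%, matching the statements above.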
But, as this table indicates, there is a wide diversity in the degree of overlap between SuperTypes:
|SuperType||# RCs||% of All RCs||# Overlapping RCs||% Overlapping|
|Markets & Industries||142||0.7%|| || |
|Topics & Types||1292||6.3%|| || |
|Protists & Fungus||37||0.2%||3||8.1%|
|Finance and Economy||476||2.3%||291||61.1%|
|Food or Drink||664||3.2%||127||19.1%|
|Notations & References||408||2.0%||37||9.1%|
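As a quick consistency check on the rows shown in the table above, the sketch below recomputes the final percentage column. It reads the fourth and fifth cells of each row as the number and share of overlapping RCs; that reading is an interpretation rather than something the table states, although the figures agree with it.

```python
# (label, total RCs, overlapping RCs) taken from the table rows above.
ROWS = [
    ("Protists & Fungus", 37, 3),
    ("Finance and Economy", 476, 291),
    ("Food or Drink", 664, 127),
    ("Notations & References", 408, 37),
]


def overlap_share(total, overlapping):
    """Fraction of a SuperType's RCs that overlap with another SuperType."""
    return overlapping / total


shares = {label: overlap_share(total, overlapping) for label, total, overlapping in ROWS}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```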
Also telling is that more than half of the overlaps (53%) between SuperTypes occur in only three areas: PersonTypes v. Animals (for humans), PersonTypes v. Organizations (for certain agents) and Events v. Activities (for genuine ambiguities between the categories). The remaining 75 interactions account for less than half (47%) of the observed overlaps.
Ontology Treatment of Non-Disjointness
Of the 18,000 RCs in the 29 mostly disjoint SuperTypes, about 80% are evaluated as disjoint. For the remaining 20%, where there is any non-disjointness between two SuperTypes (e.g., PersonType and Organization), a synthetic SuperType is created with the concatenated label of OrganizationPersonType. The concatenation order between the two labels is alphabetical for these "synthetic" SuperTypes.
ALL RCs are assigned to their appropriate SuperType; only non-disjoint RCs are assigned to the synthetic SuperType.
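The labeling convention can be sketched in a few lines. The function name is hypothetical, and the use of a plain case-sensitive sort for "alphabetical" is an assumption.

```python
def synthetic_label(supertype_a: str, supertype_b: str) -> str:
    """Concatenate two SuperType labels in alphabetical order."""
    first, second = sorted((supertype_a, supertype_b))
    return first + second


label = synthetic_label("PersonType", "Organization")
print(label)  # OrganizationPersonType
```

Because the two labels are sorted before concatenation, the result is the same whichever order the SuperTypes are passed in.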
Here is the current thinking about assertions:
- None of the non-disjoint SuperTypes gets a disjointedness assertion OR participates in any other disjointedness assertion.
- All RCs with a non-synthetic SuperType are disjoint with all other non-meta SuperTypes.
- All RCs with a synthetic SuperType are disjoint with all other non-meta SuperTypes EXCEPT where they share a synthetic SuperType.
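One possible reading of these three rules is sketched below. It treats the four non-disjoint SuperTypes as the "meta" ones and interprets "share a synthetic SuperType" to mean that an RC is not asserted disjoint with either member of its synthetic pair. The function, its parameters and the small SuperType universe are hypothetical, and the interpretation itself is an assumption rather than the definitive UMBEL treatment.

```python
# The four SuperTypes that are non-disjoint by definition (assumed to be the "meta" ones).
META = frozenset({"Abstract_level", "Attributes", "MarketsIndustries", "TopicsCategories"})


def disjoint_with(assigned, synthetic_members, all_supertypes):
    """Return the SuperTypes an RC would be asserted disjoint with.

    assigned: the RC's own SuperType
    synthetic_members: the two SuperTypes behind the RC's synthetic SuperType, or () if none
    all_supertypes: every SuperType under consideration
    """
    if assigned in META:
        return set()  # rule 1: meta SuperTypes take part in no disjointness assertions
    exempt = {assigned} | set(synthetic_members)  # rule 3: skip the shared pair
    return {st for st in all_supertypes if st not in META and st not in exempt}


# A deliberately small, hypothetical universe of SuperTypes.
ALL = {"Animals", "Attributes", "Organizations", "PersonTypes", "Products"}

plain = disjoint_with("Products", (), ALL)
shared = disjoint_with("PersonTypes", ("Organizations", "PersonTypes"), ALL)
meta = disjoint_with("Attributes", (), ALL)
print(sorted(plain), sorted(shared), sorted(meta))
```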
Even Where Overlaps Occur, They are Minor
Of the 29 mostly disjoint SuperTypes, only a relatively few show potential interactions, and then mostly in minor ways (excepting the three interactions noted earlier). We can illustrate this (drawn to scale) for the interaction between the Product, Food & Drink and Drug (Pharmaceuticals) SuperTypes, with the fully disjoint Organization SuperType thrown in for comparison:
Across all 20,000 concepts, then, 70% are disjoint from one another. These reference concepts can gain the advantages noted at the beginning of this report.
POTENTIAL PROBLEM AREAS
There are a number of potential problems and other areas deserving work:
- It would be useful to test these assignments against the full Cyc knowledge base
- Some SuperTypes, such as Attributes or Chemicals v Natural Substances, deserve more analysis
- Better descriptive labels may be warranted for many of the SuperTypes
- Perhaps some different boundary conditions might lead to even better natural groupings for some SuperTypes or a reduction in number of reference concept overlaps
- Some thought should be given to how a minor concept overlap (such as human) could be better treated to reduce the ensuing overlap implications between two SuperTypes. Or, stated another way, is there a better way to designate non-overlapping child concepts even where a parent may overlap with another SuperType?