To this point, the Wikipedia category structure has been much cleaned up, with administrative categories and functional categories[1] removed or segregated. Depending on the specifics of the various prior steps, this can result in a decrease in candidate categories of around 75%. We are now within the working pool of categories worth mapping to third-party vocabularies and ontologies.

The next set of steps is to find patterns within this Wikipedia category structure that allow cleaner (and unambiguous) assignment to these external systems.

Create SuperType Lists

Throughout Wikipedia, there are various lists and patterned category descriptions that correspond to specific things, such as various job occupations, product types, organizations or music albums.

A database of these patterns has been created that now exceeds 5000 items. Many of these items come from the various lists; others have been identified manually by extracting the categories that share common suffixes. (Based on the clean categories produced to this point, there are about 8000 of these suffixed categories, about 5000 of which have been assigned.) For example, here is a sampling of some of the patterns within the "animal" SuperType:

We have further aggregated these patterned categories and assigned them to their corresponding SuperType. The resulting master list is kept as a spreadsheet named Summary_Lists_Assignments_YYYYMMDD.xls, from which individual lists can be extracted for use in scripts and for other purposes.
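As a rough illustration of how suffix patterns might drive these assignments, the following Python sketch matches the trailing words of a category name against a small pattern table and groups the results by SuperType. The suffixes, their SuperType mappings, and the function names are hypothetical and chosen only for illustration; they are not taken from the actual spreadsheet.

```python
from collections import defaultdict

# Hypothetical suffix patterns mapped to SuperType list names.
# These entries are illustrative assumptions, not the real pattern database.
SUFFIX_PATTERNS = {
    "albums": "info-audio",
    "breeds": "animals",
    "companies": "organizations",
    "occupations": "people",
}


def assign_supertype(category, patterns=SUFFIX_PATTERNS):
    """Return the SuperType whose suffix pattern ends the category name, else None."""
    words = category.lower().replace("_", " ").split()
    # Try the longest trailing word sequence first so that multi-word
    # patterns take precedence over single-word ones.
    for start in range(len(words)):
        suffix = " ".join(words[start:])
        if suffix in patterns:
            return patterns[suffix]
    return None


def group_by_supertype(categories, patterns=SUFFIX_PATTERNS):
    """Group category names by assigned SuperType; unmatched ones go to 'unallocated'."""
    grouped = defaultdict(list)
    for category in categories:
        key = assign_supertype(category, patterns) or "unallocated"
        grouped[key].append(category)
    return dict(grouped)
```

Calling group_by_supertype on a list of clean category names yields one list per SuperType plus an "unallocated" bucket, which mirrors the split between assigned and unassigned suffixed categories described above.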

As of December 13, 2010, here is the count of patterned items within these SuperType lists:

activities 318
animals 232
chemistry 60
diseases 58
drugs 29
earthscape 62
events 148
extraterrestrial 24
facilities 170
financial 59
food 76
info-audio 40
info-structured 101
info-visual 96
info-written 160
nationalities 236
nature_phenomena 36
nature_substance 26
notations 60
numerics 9
organizations 169
people 913
geopolitical 66
plants 25
products 354
prokaryotes 6
protists_fungus 4
society 173
time 22
abstract 0
attributes 20
markets 8
topics_categories 1,390
Subtotal 5,150
unallocated 2,868

Test List Assignments

Each of these list categories is assigned a test individual, which makes it possible to check whether the category is properly assigned by tracing the individual's parents through the UMBEL structure. If disjointedness testing shows that an assignment places a category under an improper SuperType, that category is removed.

The testing of these assignments has been automated via scripts.

In addition, the scripts allow a separate file of assigned categories to be written out that can be inspected and tested for other possible QA/QC or error checks.


  1. Functional categories combine two or more facets in order to split a category or characterize it in a more structured way. For example, Category:English cricketers of 1890 to 1918 has as its core concept the idea of a cricketer, a sports person. But this is further characterized by nationality and time period. Functional categories tend to have an A x B x C construct, with prepositions denoting the facets. From a proper characterization standpoint, the items in this category should be classified as a Person --> Sports Person --> Cricketer, with additional facets (metadata) of being English and having the period 1890 to 1918 assigned.