Wikipedia Mapping - SuperType Lists
By this point, the Wikipedia category structure has been substantially cleaned up, with administrative and functional categories removed or segregated. Depending on the specifics of the prior steps, this cleanup can reduce the number of candidate categories by around 75%. What remains is the working pool of categories worth mapping to third-party vocabularies and ontologies.
The next set of steps is to find patterns within this Wikipedia category structure that allow cleaner (and unambiguous) assignments to these external systems.
Create SuperType Lists
Throughout Wikipedia, there are various lists and patterned category descriptions that correspond to specific things, such as various job occupations, product types, organizations or music albums.
A database of these patterns has been created that now exceeds 5000 items. Many of these items come from the various lists; others have been identified manually by extracting the categories that carry suffixes. (Based on the clean categories produced to this point, there are nearly 8000 of these suffixed categories, about 5000 of which have been assigned.) For example, here is a sampling of some of the patterns within the "animal" SuperType:
. . . _animal _ankylosaurs _antelopes _ants _apes _arachnids . . .
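As a rough illustration of how such suffix patterns might be applied, the sketch below matches a category name against a small suffix table. The table contents, the function name match_supertype, and the normalization (lowercasing, spaces to underscores) are assumptions made for this example; they are not the project's actual scripts or data.

```python
# Hypothetical excerpt of a suffix-to-SuperType table. The suffixes are taken
# from the sampling above; the real lists hold thousands of patterns.
SUFFIX_TO_SUPERTYPE = {
    "_animal": "animal",
    "_antelopes": "animal",
    "_ants": "animal",
    "_apes": "animal",
}


def match_supertype(category, table=SUFFIX_TO_SUPERTYPE):
    """Return (suffix, supertype) for the longest matching suffix, or None."""
    # Assumed normalization: lowercase, with spaces turned into underscores.
    name = category.replace(" ", "_").lower()
    # Try longer suffixes first so that the most specific pattern wins.
    for suffix in sorted(table, key=len, reverse=True):
        if name.endswith(suffix):
            return suffix, table[suffix]
    return None


print(match_supertype("Fictional apes"))
print(match_supertype("Elephants"))
```

Because each pattern begins with an underscore, a name such as "Elephants" does not match the "_ants" pattern by accident; only a separate trailing word does.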
We have further aggregated these patterned categories and assigned them to their corresponding SuperTypes. The resulting master list is named Summary_Lists_Assignments_YYYYMMDD.xls and is kept as a spreadsheet, from which individual lists can be extracted for various scripts and other purposes.
As of December 13, 2010, here is the count of patterned items within these SuperType lists:
Test List Assignments
By assigning an individual to each of these list categories, it is possible to test whether the category is properly assigned by tracing its parents through the UMBEL structure. If disjointness testing shows that a category has been improperly assigned to a SuperType, that category is removed.
The testing of these assignments has been automated via scripts.
In addition, the scripts allow a separate file of assigned categories to be written out that can be inspected and tested for other possible QA/QC or error checks.
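The test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual scripts: the parent graph, the SuperType names, the disjointness pairs, and the function names are all hypothetical toy data chosen for the example.

```python
# Hypothetical toy hierarchy: each node maps to its direct parents.
PARENTS = {
    "Apes": ["Primates"],
    "Primates": ["Animals"],
    "Ape films": ["Films"],
    "Films": ["Products"],
}
# Hypothetical SuperTypes and the pairs declared disjoint with one another.
SUPERTYPES = {"Animals", "Products"}
DISJOINT = {frozenset({"Animals", "Products"})}


def ancestors(node, parents=PARENTS):
    """Collect every ancestor of node by walking the parent links."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def conflicts(category, assigned):
    """List SuperTypes reached via parents that are disjoint with assigned."""
    reached = ancestors(category) & SUPERTYPES
    return sorted(st for st in reached if frozenset({st, assigned}) in DISJOINT)


def filter_assignments(assignments):
    """Split assignments into those kept and those removed for conflicts."""
    kept, removed = {}, {}
    for category, supertype in assignments.items():
        clash = conflicts(category, supertype)
        if clash:
            removed[category] = clash
        else:
            kept[category] = supertype
    return kept, removed


kept, removed = filter_assignments({"Apes": "Animals", "Ape films": "Animals"})
print(kept)
print(removed)
```

In this toy run, "Ape films" matches an animal suffix but its parents lead to a SuperType that is disjoint with the assigned one, so it ends up in the removed set, which could then be written out for inspection.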
- Functional categories combine two or more facets in order to split a category or characterize it in a more structured way. For example, Category:English cricketers of 1890 to 1918 has as its core concept the idea of a cricketer, a sports person, but it is further characterized by nationality and time period. Functional categories tend to have an A x B x C construct, with prepositions denoting the facets. From a proper characterization standpoint, the items in this category should be classified as a Person --> Sports Person --> Cricketer, with additional facets (metadata) of being English and having the period 1890 to 1918 assigned.
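To make the facet idea concrete, the sketch below splits one narrow shape of functional category title ("<nationality> <concept> of <year> to <year>") into its core concept and facets. The regular expression and the function name split_facets are assumptions for this example only; real functional categories use many other prepositions and constructs that this pattern does not cover.

```python
import re

# Assumed pattern for a single title shape:
#   "[Category:]<nationality> <concept> of <start year> to <end year>"
FACET_PATTERN = re.compile(
    r"^(?:Category:)?(?P<nationality>\w+) (?P<concept>\w+) "
    r"of (?P<start>\d{4}) to (?P<end>\d{4})$"
)


def split_facets(title):
    """Return the core concept and facets of a title, or None if no match."""
    match = FACET_PATTERN.match(title)
    if match is None:
        return None
    return {
        "concept": match["concept"],
        "nationality": match["nationality"],
        "period": (int(match["start"]), int(match["end"])),
    }


print(split_facets("Category:English cricketers of 1890 to 1918"))
```

The core concept (here "cricketers") is what would be mapped into the class hierarchy, while nationality and period are carried along as metadata rather than as separate classes.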