Wikipedia Mapping - Category Cleanup

From UMBEL Wiki

Initial Wikipedia category cleanup occurs over two major steps.

Admin Category Removal

Extraneous Wikipedia categories that will never participate in UMBEL mapping are removed. These are almost entirely internal, Wikipedia administrative-related categories. For example:

Wikimedia_projects
Wikimedia_text_licensing_permissions
Wikimood_templates
Wikinews_administrators

At present, there are 238 items on the Wikipedia admin.txt removal list.

However, because some of these specifications are a bit blunt-edged, a further 'white list' of specifically desirable Wikipedia categories is added back in. So, the category removal process is:

  1. Start by extracting all unique Wikipedia categories (there should be about 473 K; name the file unique_categories.txt)
  2. Remove all category matches based on admin.txt
  3. Note that "Wikipedia" is a removal match, but there is a white list of desirable Wikipedia categories to be added back in (wikipedia_white.txt)
  4. Create a separate file of all removals (cat_admin.csv) to be used for counts and QA/QC.
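The steps above can be sketched as a simple filter. This is a minimal illustration, not the production script: the substring-match semantics and the toy category names are assumptions, and file reading/writing for unique_categories.txt, admin.txt, wikipedia_white.txt, and cat_admin.csv is omitted.

```python
def remove_admin_categories(categories, admin_terms, white_list):
    """Split categories into (kept, removed) lists.

    A category is removed when it contains any admin term as a
    substring, unless it appears on the white list, which adds
    specifically desirable categories back in.
    """
    kept, removed = [], []
    for cat in categories:
        if any(term in cat for term in admin_terms) and cat not in white_list:
            removed.append(cat)   # goes to cat_admin.csv for counts and QA/QC
        else:
            kept.append(cat)
    return kept, removed
```

The removed list is written out separately so the counts can be checked against the roughly 473 K starting total.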

Segregation of Functional Categories

A number of category removals are made, each of which is set aside for possible later analysis and metadata characterization.

Note that, as with the prior step, the scripts used for these purposes can also produce direct listings of the removals, which is helpful for QA/QC and review.

Articles Segregation

Most "article" categories are simply topical or subject-oriented listings, and are not themselves a core part of the underlying Wikipedia structure. These are removed and set aside by:

  1. Use _articles_ as a category removal match, and create a file of the removals (cat_articles.csv) to be used for counts and QA/QC.
  2. The resulting file is the new working baseline (baseline_categories.csv). This is the file to be used for all efforts under the various steps below.
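A sketch of the articles segregation, assuming the literal token _articles_ is the removal match (as given above); the example category names are invented for illustration:

```python
def segregate_articles(categories):
    """Split the category list into the working baseline and the
    topical "articles" categories that are set aside.

    Assumes the removal match is the literal substring "_articles_".
    """
    baseline = [c for c in categories if "_articles_" not in c]   # baseline_categories.csv
    articles = [c for c in categories if "_articles_" in c]       # cat_articles.csv
    return baseline, articles
```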

Date-related Segregation

Most date-tagged functional categories are then removed and set aside by:

  1. Remove the date-prefaced categories as shown in year_prefaced.csv. This file may eventually need a white list.
  2. Also remove all categories with trailing labels for years, year ranges, and decades (such as _1970 or _1920-1929 or _1980s), using the following regexes:
 ^.*_[0-9]{4}$
 ^.*_[0-9]{4}-[0-9]{4}$
 ^.*_[0-9]{4}s$
  3. Also remove the categories captured by the date_issues.txt list.
  4. Based on all of these removals, create a separate file of all date removals (baseline_dates.csv).
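The trailing-label portion of this step can be sketched directly from the three regexes quoted above. Handling of year_prefaced.csv and date_issues.txt is omitted, and the example categories are invented:

```python
import re

# The three trailing-date patterns given in the text.
DATE_PATTERNS = [
    re.compile(r"^.*_[0-9]{4}$"),           # trailing year, e.g. _1970
    re.compile(r"^.*_[0-9]{4}-[0-9]{4}$"),  # trailing range, e.g. _1920-1929
    re.compile(r"^.*_[0-9]{4}s$"),          # trailing decade, e.g. _1980s
]

def segregate_dates(categories):
    """Return (baseline, dates), where dates collects every category
    matching one of the trailing-date patterns (-> baseline_dates.csv)."""
    dates = [c for c in categories if any(p.match(c) for p in DATE_PATTERNS)]
    removed = set(dates)
    baseline = [c for c in categories if c not in removed]
    return baseline, dates
```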

Preposition (Functional) Segregation

  1. Based on the prepositions.txt file, remove all preposition-related categories one by one, each working from the baseline_categories.csv file.
  2. For each preposition, write a file after each extraction containing the categories that contain that preposition; name it with its preposition label, such as baseline_about.csv. Note: these preposition extractions are not cumulative with regard to prior extractions; each extraction should work off of the full baseline_categories.csv.
  3. Repeat for all prepositions in the prepositions.txt file.

There are presently about 20 items in the preposition list.
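The non-cumulative nature of these extractions can be sketched as follows. The assumption that each preposition is matched as an underscore-delimited token (e.g. _about_) is mine, as are the example names; reading prepositions.txt and writing the per-preposition files is omitted:

```python
def extract_prepositions(baseline, prepositions):
    """Map each preposition to the categories containing it.

    Each extraction runs against the FULL baseline list, never the
    result of a prior extraction, so the per-preposition files
    (e.g. baseline_about.csv) may overlap.
    """
    extractions = {}
    for prep in prepositions:
        token = "_%s_" % prep   # assumed token form, e.g. "_about_"
        extractions[prep] = [c for c in baseline if token in c]
    return extractions
```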

Restriction to Substantive Categories

For other mapping steps, we also want to work with only the substantive, parent categories in Wikipedia. For this purpose, we first identify all categories that also have a matching page with a description (or a redirect to one), plus those that have child nodes below them (since the latter are putatively more important for organizational purposes within the Wikipedia category graph).

For one set of analyses, we get this kind of breakdown:

"Clean" categories without a matching page 34,490
"Clean" categories with a matching page 47,987
"Clean" categories with a matching page, is leaf 12,715
"Clean" categories with a matching page, is not leaf 35,271

Alternative/Supplementary Cleaning Options

  • get to some common encoding
  • split on case
  • remove most punctuation ( ( , . _ - ) ; : < > @ & ! % ^ + = { } [ ] \ / ` ~ | ' " ) --> replace with whitespace
  • remove all line/carriage breaks --> replace with whitespace
  • lowercase all tokens
  • remove stoplist words (perhaps use an expanded list?)
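These cleaning options could be chained into a single normalization pass. A sketch under stated assumptions: the stoplist here is a tiny stand-in for a real one, and the exact case-splitting and punctuation rules are my reading of the bullets above:

```python
import re

# Stand-in stoplist; the text suggests a real (perhaps expanded) list.
STOPWORDS = {"the", "of", "and", "in"}

def clean_category(name):
    """Normalize a category label into lowercase content tokens."""
    # split on case transitions (CamelCase -> Camel Case)
    s = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    # replace most punctuation with whitespace
    s = re.sub(r"[(),._\-;:<>@&!%^+={}\[\]\\/`~|'\"]", " ", s)
    # collapse line/carriage breaks and other whitespace, lowercase
    tokens = s.lower().split()
    # remove stoplist words
    return [t for t in tokens if t not in STOPWORDS]
```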