Downloading Plain-Text Wikipedia

Mon, Dec 12, 2016 2-minute read

If you’ve ever been interested in having all of Wikipedia in a plain-text format, you might have been disappointed to learn that Wikipedia doesn’t actually make this format available. Fortunately, they do offer an XML version of the entire database, so I’ve written a PowerShell script to convert that XML dump into individual plain-text articles. The script tries to remove as much of Wikipedia’s additional markup as possible, and skips inconsequential articles.

This script demonstrates a way of processing XML in PowerShell that you rarely see - because it is rarely needed. In XML form, the Wikipedia database is nearly 60GB. That is FAR too large for PowerShell’s [xml] cast, due to the memory overhead of the XmlDocument format on which the [xml] cast is built. It’s also far too large for most systems to even hold in memory at once. Instead, this script takes a streaming approach built on System.Xml.XmlReader. The XmlReader class lets you handle tags and elements as the reader encounters them, rather than forcing you to wait for that final ill-fated closing tag while everything buffers in memory.
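
To give a feel for the technique, here’s a minimal sketch of the streaming pattern (a simplified illustration, not the actual Split-Wikipedia source). It walks the dump one node at a time and reacts to each title element as the reader reaches it, so memory use stays flat no matter how large the file gets:

    # Minimal XmlReader streaming sketch - handles nodes as they arrive
    $path = Join-Path $PWD 'enwiki-latest-pages-articles.xml'
    $reader = [System.Xml.XmlReader]::Create($path)

    while (-not $reader.EOF)
    {
        if (($reader.NodeType -eq [System.Xml.XmlNodeType]::Element) -and
            ($reader.LocalName -eq 'title'))
        {
            # Reads this element's text and moves the reader past its
            # closing tag, so no extra Read() is needed on this branch
            $title = $reader.ReadElementContentAsString()
            Write-Host "Article: $title"
        }
        else
        {
            $null = $reader.Read()
        }
    }

    $reader.Dispose()

The real script does the same kind of single-pass walk, but captures each page’s text, strips the wiki markup, and writes the article out before moving on.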

  1. Install the ‘Split-Wikipedia’ helper script from a PowerShell prompt:
    1. Install-Script Split-Wikipedia -Scope CurrentUser
    2. The Install-Script command requires Windows 10, or a separate installation of PowerShell 5.0 on earlier versions of Windows.
    3. If this is the first time you’ve used Install-Script, exit PowerShell and launch it again so that the updated PATH (which now includes the script installation directory) takes effect.
  2. Use PowerShell to navigate to a directory that you want to contain your Wikipedia articles
    1. mkdir ~/Documents/Wikipedia
    2. Set-Location ~/Documents/Wikipedia
  3. Download the latest English Wikipedia database (~ 13GB)
    1. Invoke-WebRequest https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 -OutFile enwiki-latest-pages-articles.xml.bz2
  4. Decompress the XML, using bzip2 (or another tool like 7zip if you wish):
    1. bzip2 -d enwiki-latest-pages-articles.xml.bz2
  5. Process the XML (~ 58GB). This will take about 7 hours:
    1. Split-Wikipedia -Path ./enwiki-latest-pages-articles.xml
  6. (Optional) Delete the source XML
    1. Remove-Item ./enwiki-latest-pages-articles.xml

All 4 million articles are now in your ‘Wikipedia\Articles’ directory. Within this directory, they are further split into subdirectories of 5,000 articles each, as most software (e.g. File | Open dialogs, browsing in Explorer) doesn’t handle a single directory with 4 million items very well.
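
If you’d like a quick sanity check once everything finishes, you can count the extracted files (this assumes the ~/Documents/Wikipedia location from step 2; enumerating millions of files will itself take a few minutes):

    # Count the extracted articles across all of the subdirectories
    (Get-ChildItem ~/Documents/Wikipedia/Articles -Recurse -File | Measure-Object).Count

Enjoy!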