Writing a Mediawiki Interwiki Bot


Have you ever wondered how to write a Mediawiki bot that adds interwiki links to pages?

As the owner of InterwikiBot on the International Scratch Wikis, I can tell you this was my bot's primary task, and it was accomplished with the Mediawiki bot framework Pywikibot. However, Pywikibot quickly became unusable: it paused many times for manual confirmation, meaning the process was not automatic but a multi-hour slog. It was simply too much of a pain to run.

So I've written new code for this task based on JavaMediawikiBot. Because why not? It's more automated, faster, and the requested edits are peer-reviewable before being pushed.

In this blog post I'd like to cover the design of the bot, seeing as writing it is actually quite easy! Given the right data structures, that is.

Overview

For those of you new to Mediawiki, interwiki is a way to mark a page as having translations available. Sort of like "This page is offered in Deutsch, Bahasa Indonesia, and Magyar".
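
In wikitext, these are plain language-prefixed page links, usually sitting at the bottom of the page. The titles below are made up, but the syntax is the standard one:

    [[de:Beispielseite]]
    [[id:Contoh halaman]]
    [[hu:Példa oldal]]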

The goal of our bot will be to find pages to put interwiki links on.

Let's make the assumption that existing interwiki links are correct.

Here is a rough pre-sketch of the algorithm. For a given page, the bot needs to download all of its interwiki-linked pages, checking each downloaded page for new interwikis, until everything has been downloaded.

Next, the goal is to analyze the downloaded pages. If page EN has an interwiki to page JP, but not vice versa, a new interwiki needs to be added to JP.

The only caveat is if, somehow, two of the downloaded pages are from the same wiki. Why? A page cannot have multiple interwiki links to the same language. Example: we've downloaded pages EN1, EN2, and JP, which do not already connect to each other. JP can only link to one English page, and the bot cannot tell which one, so JP should not be touched. This is an interwiki conflict.

Network Considerations

Mediawiki allows bots to download one page at a time, or a batch of pages. Normally, the bottleneck with bots is the network, seeing as each request takes around a second. Hence, it is best to download pages in batches. However, this means juggling many pages and lots of data!
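
For example, MediaWiki's action API accepts several titles in one query, joined with "|" (typically up to 50 titles per request, more with a bot flag). A rough Java sketch of building such a request; the endpoint URL and titles are placeholders:

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class BatchQueryExample {
        public static void main(String[] args) {
            // Several page titles, fetched in a single request.
            List<String> titles = List.of("Example page", "Another page", "A third page");
            String joined = URLEncoder.encode(String.join("|", titles), StandardCharsets.UTF_8);
            String url = "https://example.org/w/api.php"   // placeholder wiki endpoint
                    + "?action=query&prop=revisions&rvprop=content&format=json"
                    + "&titles=" + joined;
            System.out.println(url);
        }
    }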

Data Type: BufferPool

Thinking back to our network constraints, the job can be simplified by creating a new data type. Ideally, it would let us collect pages by language and, once a certain language has enough pages, batch download them.

Let's nickname this data type "BufferPool".

More formally, here is how I define the data type. A BufferPool stores groupings, called buffers, of Objects. Each buffer has a unique key via which it is identified. Items are added to a buffer with a given key. The BufferPool can be queried to identify the size of the largest buffer. Additionally, the BufferPool can be queried for the key of the largest buffer. Buffers can be emptied and returned for processing.

For convenience only, the BufferPool may be queried to check if it is empty, and to see whether it already contains a given Object.
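
As a concrete illustration, here is a minimal BufferPool sketch in Java. The class and method names are my own for this post, not necessarily what the bot's actual code uses:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Groups items into keyed buffers so they can be flushed in batches. */
    public class BufferPool<K, V> {

        private final Map<K, List<V>> buffers = new HashMap<>();

        /** Adds an item to the buffer identified by the given key. */
        public void add(K key, V item) {
            buffers.computeIfAbsent(key, k -> new ArrayList<>()).add(item);
        }

        /** Returns true if no buffer holds any items. */
        public boolean isEmpty() {
            return buffers.values().stream().allMatch(List::isEmpty);
        }

        /** Returns true if any buffer contains the given item. */
        public boolean contains(V item) {
            return buffers.values().stream().anyMatch(buffer -> buffer.contains(item));
        }

        /** Returns the size of the largest buffer, or 0 if the pool is empty. */
        public int largestBufferSize() {
            return buffers.values().stream().mapToInt(List::size).max().orElse(0);
        }

        /** Returns the key of the largest buffer, or null if the pool is empty. */
        public K largestBufferKey() {
            K largestKey = null;
            int largestSize = -1;
            for (Map.Entry<K, List<V>> entry : buffers.entrySet()) {
                if (entry.getValue().size() > largestSize) {
                    largestSize = entry.getValue().size();
                    largestKey = entry.getKey();
                }
            }
            return largestKey;
        }

        /** Empties the buffer with the given key and returns its contents. */
        public List<V> drain(K key) {
            List<V> buffer = buffers.remove(key);
            return buffer != null ? buffer : new ArrayList<>();
        }
    }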

This data type solves all of the requirements of our network challenge. Huzzah! But we can improve further.

Data Type: ConnectionGraph

Another sub-challenge of our interwiki goal is storing pages and the directed connections they make to each other. Ideally, Java would have a built-in graph class, but it does not. (There is the JGraphT library, although I learned about it too late.)

Let's nickname this data type "ConnectionGraph".

More formally, here is how I define this data type. A ConnectionGraph stores Object nodes and the Objects each node links to. Links are different from nodes in that links do not need to be existing nodes in the ConnectionGraph.

A node and its links may be added. If the node already exists, its links are updated. A node's links may be queried. Finally, a ConnectionGraph may be queried for its nodes.

For convenience, a ConnectionGraph may be queried for the links that are not existing nodes. Also, a ConnectionGraph may be checked for completeness, i.e. whether all of its links are existing nodes.
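
Here is a minimal ConnectionGraph sketch along the same lines, again with names of my own choosing:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    /** Stores nodes and the directed links each node makes; links need not themselves be nodes. */
    public class ConnectionGraph<T> {

        private final Map<T, Set<T>> nodes = new HashMap<>();

        /** Adds a node with its links, replacing the links if the node already exists. */
        public void addNode(T node, Set<T> links) {
            nodes.put(node, new HashSet<>(links));
        }

        /** Returns the links of the given node, or an empty set if the node is unknown. */
        public Set<T> getLinks(T node) {
            return new HashSet<>(nodes.getOrDefault(node, new HashSet<>()));
        }

        /** Returns all nodes currently in the graph. */
        public Set<T> getNodes() {
            return new HashSet<>(nodes.keySet());
        }

        /** Returns every link that is not itself a node (e.g. pages not yet downloaded). */
        public Set<T> getMissingLinks() {
            Set<T> missing = new HashSet<>();
            for (Set<T> links : nodes.values()) {
                missing.addAll(links);
            }
            missing.removeAll(nodes.keySet());
            return missing;
        }

        /** Returns true if every link is also an existing node. */
        public boolean isComplete() {
            return getMissingLinks().isEmpty();
        }
    }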

In review, this data type is very useful because it lets us handle page data so cleanly. For example, it's easy to see which interwiki pages haven't been downloaded yet: assuming downloaded pages are nodes, the non-downloaded pages are exactly the links that are not contained as nodes.

One last note. To detect interwiki conflicts, it will be helpful to define a new method outside of ConnectionGraph. It should check if a ConnectionGraph has multiple pages from the same language wiki.
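
A sketch of that helper, assuming we have some way of asking a page node which language wiki it belongs to (the languageOf function below stands in for that):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.function.Function;

    public final class InterwikiChecks {

        private InterwikiChecks() {}

        /**
         * Returns true if two or more nodes in the graph map to the same language,
         * i.e. the graph contains an interwiki conflict.
         */
        public static <T> boolean hasInterwikiConflict(ConnectionGraph<T> graph,
                                                       Function<T, String> languageOf) {
            Set<String> seenLanguages = new HashSet<>();
            for (T node : graph.getNodes()) {
                if (!seenLanguages.add(languageOf.apply(node))) {
                    return true; // a second page from this language wiki
                }
            }
            return false;
        }
    }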

Bringing it Together

Now that we've considered our goals and defined some data types, let's flesh out our algorithm! To rehash, we will use a BufferPool to store which pages need to be downloaded, and ConnectionGraphs to store groups of pages and their existing links.

Start

  1. Get a batch of unvisited page locations. If there are no pages left, go to step 5.
  2. Repeat for each page location:
    1. Add the page to BufferPool for later download.
  3. Check BufferPool. Repeat while any language has a large number of buffered pages:
    1. Run "Process Largest Language Buffer".
  4. Go back to step 1.
  5. Time to wrap up. Repeat while BufferPool is not empty:
    1. Run "Process Largest Language Buffer".
  6. Celebrate!
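
As a sketch, here is what that outer loop could look like in Java, building on the BufferPool from earlier. PageLocation, getUnvisitedPageBatch, processLargestLanguageBuffer, and the batch-size threshold are placeholders of my own, not the bot's real code:

    import java.util.List;

    /** Skeleton of the bot's outer loop; the abstract pieces stand in for the real wiki code. */
    public abstract class InterwikiBotSkeleton {

        /** Placeholder for however the bot identifies a page on a particular language wiki. */
        public interface PageLocation {
            String getLanguage();
            String getTitle();
        }

        /** Assumed threshold for "a large number of buffered pages". */
        protected static final int BATCH_SIZE = 50;

        protected final BufferPool<String, PageLocation> pool = new BufferPool<>();

        /** Step 1: fetch the next batch of unvisited page locations (empty when done). */
        protected abstract List<PageLocation> getUnvisitedPageBatch();

        /** The "Process Largest Language Buffer" routine, sketched further below. */
        protected abstract void processLargestLanguageBuffer();

        public void run() {
            List<PageLocation> batch;
            while (!(batch = getUnvisitedPageBatch()).isEmpty()) {    // step 1
                for (PageLocation location : batch) {
                    pool.add(location.getLanguage(), location);       // step 2: queue for download
                }
                while (pool.largestBufferSize() >= BATCH_SIZE) {      // step 3
                    processLargestLanguageBuffer();
                }
            }                                                         // step 4: loop back to step 1
            while (!pool.isEmpty()) {                                 // step 5: wrap up
                processLargestLanguageBuffer();
            }
            // step 6: celebrate!
        }
    }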

Process Largest Language Buffer

  1. Check BufferPool for the language with the most buffered pages. Remove them from BufferPool.
  2. Batch download the pages.
  3. Repeat for each page:
    1. Run "Process Page".

Process Page

  1. Get the page's ConnectionGraph. If it doesn't have one, create one.
  2. If the page is a redirect, follow the redirect. Inside the page's ConnectionGraph, replace links to the redirect with the new location.
  3. Parse out the page's interwikis. Add the page and its interwikis to the ConnectionGraph.
  4. Query the ConnectionGraph, and see which pages have not been downloaded yet. These will be the links that aren't existing nodes.
  5. For each such page:
    1. If not already contained, add the page to BufferPool for later download. Use the page's language as the key.
  6. If the ConnectionGraph is complete, then it needs to be processed. Otherwise, finish "Process Page".
  7. If the ConnectionGraph contains an interwiki conflict, log it for intervention purposes. Finish "Process Page". (Another option is to remove the conflicting pages.)
  8. For each page node in the ConnectionGraph:
    1. If the page doesn't have interwikis for every other page node, append the missing interwikis.
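
And a sketch of "Process Page" itself, again continuing the skeleton. Everything not defined earlier (Page, graphFor, resolveRedirect, parseInterwikis, logConflict, appendMissingInterwikis) is a placeholder name for illustration:

    // Adds a downloaded page to its graph and, once the graph is complete, fixes interwikis.
    protected void processPage(Page page) {
        ConnectionGraph<PageLocation> graph = graphFor(page);         // step 1: find or create the graph
        if (page.isRedirect()) {                                      // step 2: follow redirects and
            page = resolveRedirect(page, graph);                      //         patch links in the graph
        }
        Set<PageLocation> interwikis = parseInterwikis(page);         // step 3: parse out interwikis
        graph.addNode(page.getLocation(), interwikis);

        for (PageLocation notDownloaded : graph.getMissingLinks()) {  // step 4: links that aren't nodes
            if (!pool.contains(notDownloaded)) {                      // step 5: queue them for download
                pool.add(notDownloaded.getLanguage(), notDownloaded);
            }
        }

        if (!graph.isComplete()) {                                    // step 6: wait for more downloads
            return;
        }
        if (InterwikiChecks.hasInterwikiConflict(graph, PageLocation::getLanguage)) {
            logConflict(graph);                                       // step 7: log for intervention
            return;
        }
        for (PageLocation node : graph.getNodes()) {                  // step 8: append missing interwikis
            appendMissingInterwikis(node, graph);
        }
    }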

That's all! For the best results, run the bot on all the wikis it has access to.