June 30, 2013 at 03:15 PM | categories: software

One year ago I went to a Python meetup where William Bert gave a presentation on gensim, a "topic modelling" library for Python. One of the most practical uses of topic modelling is finding similar documents in a large corpus. For example, where Google search takes a query as an argument and returns documents based on that query, Google News tries to organize stories from multiple sources into clusters based on the same topic or event.

In memory of Google Reader shutting down tomorrow, I present rsscluster, a small script which demonstrates the usage of the gensim library. One of the problems I ran into when I wanted to play with gensim was finding a large corpus of documents with which to populate the database. When Google Reader announced it was closing and I looked for a new home for my feeds, I realized that the feeds themselves were a great corpus to play with.

My typical set of feeds contains about 3000 stories, which isn't a huge corpus, but enough to play with. Crucially, there are a number of feeds I subscribe to which are likely to produce "similar" documents. For example, Ars Technica and The Verge will probably both have stories about the latest Apple keynote, while The Washington Post and NPR will both cover recent Supreme Court decisions.
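To make the idea of "similar" documents concrete, here is a minimal sketch in plain Python. It is not rsscluster's code and it does not use gensim; it swaps in a hand-rolled TF-IDF weighting with cosine similarity, which is only a rough stand-in for what a topic-modelling library does. The headlines are made up for illustration, and the helper names (`tokenize`, `build_tfidf`, `cosine`, `most_similar`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_tfidf(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Terms that appear in every document get a weight of zero.
        vectors.append({t: count * math.log(n / df[t]) for t, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(index, vectors, top_n=3):
    """Return (score, document index) pairs for the documents closest to `index`."""
    scores = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    return sorted(scores, reverse=True)[:top_n]


# Hypothetical headlines standing in for feed entries.
docs = [
    "Apple unveils new iPhone at keynote",
    "Apple keynote: the new iPhone announced",
    "Supreme Court rules on voting rights case",
    "Court ruling on voting rights draws reaction",
    "NO",
]

vectors = build_tfidf(docs)
for i, doc in enumerate(docs):
    score, j = most_similar(i, vectors, top_n=1)[0]
    print(f"{doc!r} is closest to {docs[j]!r} (score {score:.2f})")
```

On this toy input the two Apple headlines rank each other first, as do the two court headlines, while the "NO" entry shares no terms with anything else and scores zero against every other document.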

You can see how well rsscluster did in finding stories similar to those in my feeds published today (June 30th) here. Considering how little effort I put in and how small the corpus is, it did a pretty good job of detecting clusters around recent events (Edward Snowden and the NSA, a presidential tour of Africa, John Kerry speaking on the Middle East). It did a slightly worse job categorizing Supreme Court stories: the court has been pretty active lately and has ruled on a lot of disparate issues, and gensim clustered those stories together more than it should have. Finally, some of my feeds are just not very semantically rich, and it decided to cluster a bunch of Steam sales together. One feed is particularly degenerate in that it only contains the word "NO" in each entry. Gensim dutifully clustered all of those documents together, but those documents probably don't belong in the corpus at all.

Again, this is a really naive use of gensim (I spend more time parsing command line arguments than actually exercising the library), but it let me get gensim testing out of my system and it demonstrates how easy it is to set up a pretty powerful document similarity search engine. Hopefully someone else can also be inspired to play with it, and will find a better scenario with which to exercise it.

See it on GitHub.
