Tag Clouds, Election-Style
October 9th, 2008
I’m a big fan of Wordle. Everybody likes pretty tag clouds, but until recently, I’ve had no practical use for the tool.
What with the forthcoming election and all, and being in marketing, I thought it might be interesting to use Wordle to distill each of the four national parties’ websites into a tag cloud. Each cloud would reflect the terms that the party uses most frequently on its English-language website. With an assist from Ask Metafilter, I got them done. I’ll explain a little more about how after the clouds.
[Images: one Wordle tag cloud for each of the four parties]
What Conclusions Can We Draw?
That’s more a question for you than me, as I haven’t spent much time trying to grok what these clouds tell us (yes, I used ‘grok’). What jumps out at you?
How Did We Make Them?
First, I grabbed a complete copy of each party’s website. I just stuck with HTML files, so if a party hosts a lot of PDFs with unique content, then that’s not reflected. The sites, of course, ended up being different sizes, and I’m relying on my site-copying software, so I can’t be certain I got all the pages.
Then we concatenated each set of HTML files into one gigantic file. Using some scripty-magic, we generated a list of the top 100 or 250 words, with each word repeated as many times as it appears on the original site.
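For the curious, here is a rough sketch of what that kind of script could look like in Python. This is not the actual script we used. The function names, the crude regex for stripping tags, and the choice of `.html` as the file extension are all my assumptions for illustration.

```python
import re
from collections import Counter
from pathlib import Path


def concatenate_html(site_dir):
    """Join every .html file under site_dir into one big string."""
    # Assumption: the site copy is a folder of .html files on disk.
    return " ".join(
        path.read_text(encoding="utf-8", errors="ignore")
        for path in sorted(Path(site_dir).rglob("*.html"))
    )


def top_words(text, top_n=250):
    """Return the top_n (word, count) pairs, most frequent first."""
    # Crude tag removal (an assumption); some HTML debris will
    # survive and still need cleaning up by hand afterwards.
    without_tags = re.sub(r"<[^>]+>", " ", text)
    words = re.findall(r"[a-z']+", without_tags.lower())
    return Counter(words).most_common(top_n)


def expand_for_wordle(pairs):
    """Repeat each word once per occurrence, one word per line of output."""
    return "\n".join(" ".join([word] * count) for word, count in pairs)
```

You would call `expand_for_wordle(top_words(concatenate_html("some_folder")))`, where `some_folder` is a hypothetical path to the copied site, and paste the result into Wordle.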
I went through each of these lists to clean out most or all of the leftover HTML code, navigational terms like ‘email’ or ‘newsletter’, and French words. The French is why we used 250 words in some cases. For some sites, I downloaded both the French and English versions of the site, so I needed to remove the French. By working with a 250-word file, I was able to clean out the French and still have a sizable database of words.
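I did this cleanup by hand, but it could be automated along these lines. This is only a sketch: it assumes the word list is a Python list of (word, count) pairs sorted by frequency, and apart from ‘email’ and ‘newsletter’, the words in the exclusion sets are examples I made up, not the lists I actually removed.

```python
# Navigational terms mentioned above.
NAV_TERMS = {"email", "newsletter"}
# Illustrative guesses only: a real cleanup needs much longer lists.
FRENCH_WORDS = {"le", "la", "les", "des", "et", "pour", "nous"}
HTML_LEFTOVERS = {"nbsp", "href", "div", "span"}


def clean_word_counts(pairs, keep=100):
    """Drop excluded words, then keep the top `keep` survivors.

    `pairs` is a list of (word, count) tuples, most frequent first.
    """
    excluded = NAV_TERMS | FRENCH_WORDS | HTML_LEFTOVERS
    kept = [(word, count) for word, count in pairs if word not in excluded]
    return kept[:keep]
```

Starting from 250 words rather than 100 leaves enough headroom that a reasonable number of words remain once the exclusions are gone.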
In short, it’s somewhat unscientific, but I’m optimistic that the clouds represent a reasonably fair reflection of each site’s top content. If anyone wants to work with the content I copied, I’m happy to share it. I’m not going to publish the complete sites here, though, as I expect that would constitute a copyright violation.



