Last week, in collaboration with SEOmoz, we launched a brand new version of Open Site Explorer, an SEO search engine for link data (soon to be the only one?). This latest version of the site is built on Rails 3 and was in development for close to four months.
We’re proud of our work on the site, which goes far beyond the new design and includes a number of new features as well. With that in mind, I’d like to take a moment to talk about two of them: advanced reports and social data.
Big reports, made fast
Advanced reports are CSV exports delivered by e-mail, containing up to 100,000 rows. Users can create a custom report from a huge number of options—type of link, link origin, link destination, link properties, anchor text terms and phrases, Page Authority and Domain Authority, TLDs, and more.
There was just one problem: responses from the LinkScape API (SEOmoz’s API, which powers OSE) are limited to 1,000 results, and in most cases to only one filter per request. Well, that’s actually two problems, but the fact remained that we would have to write a layer between LinkScape and Open Site Explorer. Thankfully, LinkScape accepts an offset and a limit, allowing you to page through a result set.
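The offset/limit paging loop can be sketched like this. The `client` object here is a hypothetical stand-in for whatever wraps the real LinkScape API; only the offset-and-limit paging pattern comes from the post.

```ruby
# Page through a LinkScape-style result set using offset and limit.
# `client.links` is an assumed wrapper method, not LinkScape's real interface.
PAGE_SIZE = 1_000 # each API response is capped at 1,000 results

def fetch_all_rows(client, max_rows)
  rows = []
  offset = 0
  while rows.size < max_rows
    page = client.links(offset: offset, limit: PAGE_SIZE)
    break if page.empty? # no more results upstream
    rows.concat(page)
    offset += PAGE_SIZE
  end
  rows.first(max_rows)
end
```

In practice each page would be enqueued as its own job rather than looped over in one process, which is what the reporting cluster described below does.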
LinkScape is an impressive feat of technology—trillions of data points built in a very scalable way. Hitting it with a lot of little requests is how it performs best, so we took that approach and distributed the work amongst a cluster of reporting servers. We use these servers solely for processing LinkScape data (I’ll get into that in a minute), and when they’re ready, they queue up another job to append the results to the CSV that will eventually make its way to the user’s inbox. That’s handled by another server that specializes in CSV writing.
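The two job roles above can be sketched in Resque’s `perform` class-method convention. The class names, queue names, and arguments are illustrative assumptions; only the split between query-processing jobs and a CSV-appending job comes from the post.

```ruby
require "csv"

# Runs on the reporting servers: fetch a page of LinkScape data, filter it,
# then enqueue a CsvAppendJob with the rows that survive. (Body omitted here,
# since the real work is an API call.)
class LinkQueryJob
  @queue = :linkscape

  def self.perform(report_id, offset)
    # fetch page at `offset`, filter rows, then e.g.:
    # Resque.enqueue(CsvAppendJob, csv_path_for(report_id), surviving_rows)
  end
end

# Runs on the CSV-writing server: append rows to the report file.
class CsvAppendJob
  @queue = :csv

  def self.perform(path, rows)
    CSV.open(path, "a") do |csv|
      rows.each { |row| csv << row }
    end
  end
end
```

Funneling all writes through one queue also serializes access to the CSV file, so two jobs never append to the same report at once.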
Then, once we determine we’re ready to close out the report, we kill off its remaining querying jobs (we generate a bunch ahead of time), add as much social data as we can get away with, and send it on its way!
The beauty of this system is that any machine can act as any other machine just by changing its role. And because these guys all communicate through Resque (a queue built on Redis, a key-value store), scaling it horizontally is just a matter of adding another server and deploying to it.
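With Resque, a machine’s “role” is just which queues its workers listen to, so reassigning a box is a one-line change. A configuration sketch (queue names assumed from the example above):

```shell
# reporting server: process LinkScape query jobs
QUEUE=linkscape rake resque:work

# CSV-writing server: append results to report files
QUEUE=csv rake resque:work
```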
Getting back to the actual processing: the challenge of filtering on 20+ data points when you can only filter on one per request is that you have to marry up the data after the fact. To do this, we implemented every filter on our end in code, so we can check each row against the report’s filter set to determine whether it passes our tests. We also need to know whether we’ve seen each result before, so we store unique identifiers in a Redis set. We store other data in Redis, too, mostly around pagination: as with any large data set, it becomes infeasible to determine its exact size up front, so we track markers such as the final page of each result set.
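The per-row check boils down to “pass every filter, and be unseen so far.” A minimal sketch, where a Ruby `Set` stands in for the Redis set (in production, Redis’s `SADD` plays the same role—it reports whether the member was newly added); the row shape and filter lambdas are illustrative:

```ruby
require "set"

# Returns true only if the row passes every filter in the report's filter set
# AND its unique identifier hasn't been seen before. `seen` is an in-memory
# stand-in for the Redis set described above.
def accept_row?(row, filters, seen)
  return false unless filters.all? { |f| f.call(row) }
  !!seen.add?(row[:id]) # Set#add? returns nil for duplicates, like SADD → 0
end
```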
Incidentally, you might ask why we chose Redis over options like SimpleDB. We actually do use SimpleDB, but only for non-critical tasks; in benchmarks we found it too slow for most of our needs. Plus, having a Redis-based queue means we could avoid introducing yet another dependency like RabbitMQ.
Likes, tweets, and plus-ones
Social data was our next challenge. We needed to have a way to collect data from across multiple networks and aggregate it into a common response format, so we created a gem built on the lovely Wrest.
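The “common response format” idea is just per-network adapters mapping each raw payload onto one shared schema. A hedged sketch—the field names and networks here are assumptions, not the gem’s actual API:

```ruby
# Map a network's raw response onto a shared schema so callers never have to
# know which network the numbers came from. Field names are illustrative.
def normalize(network, raw)
  case network
  when :facebook then { shares: raw["share_count"], comments: raw["comment_count"] }
  when :twitter  then { shares: raw["count"],       comments: 0 }
  else raise ArgumentError, "unknown network: #{network}"
  end
end
```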
Two things of interest here:
- Facebook pushes developers to use their Graph API, but that returns URL share counts as the total number of likes, shares, and comments. We needed them to be distinct, so we went with the older (but more granular) FQL instead.
- Google +1’s API (if you can call it that) is undocumented, but a nicely timed blog article by Tom Anthony solved that problem.
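For the Facebook case, the granularity came from FQL’s `link_stat` table, which exposed the counts as separate columns. A query along these lines (against the now-historical FQL endpoint) returned them distinctly:

```sql
-- FQL: fetch like, share, and comment counts as separate fields
SELECT like_count, share_count, comment_count
FROM link_stat
WHERE url = "http://example.com/"
```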
Generating social data across multiple networks quickly can be a challenge, so we multi-thread those requests.
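A minimal sketch of that fan-out: issue every network’s lookup on its own thread so total latency is bounded by the slowest network rather than the sum of all of them. The fetchers are assumptions—any callable that takes a URL works:

```ruby
# Run each network's fetcher on its own thread and collect results into a hash.
# Thread#value joins the thread and returns its block's result.
def fetch_in_parallel(url, fetchers)
  fetchers.map { |network, fetcher|
    Thread.new { [network, fetcher.call(url)] }
  }.map(&:value).to_h
end
```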
Not done yet
There were other challenges, of course, and plenty more are in the pipeline. In particular, users have clamored for direct report downloads, and we expect to launch that feature very shortly.
We’re really excited about Open Site Explorer, and looking forward to whatever the smart people at SEOmoz can cook up next!