Amazon CloudSearch
Saturday, June 9, 2012
Facepunch forums are pretty huge. There are over 20 million posts in the database. This is cool, but the database regularly locks up when searching. There are a number of ways to try to get around this: a better database server, faster hard drives, pruning posts, slave databases. But whatever you do, sooner or later, as your post count increases, you’re going to have to do it again.
So last week I looked into Amazon CloudSearch. Put simply, you throw all your posts at this service and it provides an API for you to search them. This means that search isn’t using your database anymore, so your database doesn’t need to be half as powerful, and doesn’t keep locking up.
First of all let me start by saying that you can put anything in here. It doesn’t have to be plain text - you can throw pretty much any document at it and it’ll make it searchable.
The way I do it on our forum is simple. I added an ‘indexed’ field to the posts table. When a post is created or edited, its ‘indexed’ field is set to 0.
Then I set up a cron job that scans the posts table every minute and uploads any posts with indexed=0 to CloudSearch. You upload to CloudSearch by sending HTTP POST requests with a JSON body - it couldn’t be easier. Here’s how I build mine:
So obviously I’ve selected the appropriate posts from the database, then I loop through them all, adding them to an array. I don’t add them if they are shorter than 10 characters, and I strip non-ASCII characters from the posts to save space (you don’t have to do this, I just decided it’d be a waste of time preserving these characters since we’re an English forum).
Then it’s just a case of converting the array to JSON
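A rough sketch of that loop and the JSON conversion in Python might look like this. It isn't the code the forum actually runs: the `postid` and `text` names are made up for illustration, the `type`/`id`/`fields` layout is my understanding of the CloudSearch document batch format (check it against the API version your domain uses), and I'm assuming the length check happens after stripping.

```python
import json

MIN_LENGTH = 10  # posts shorter than this are skipped


def build_batch(posts):
    """Turn post rows (dicts) into a JSON document batch.

    The batch layout and the 'postid'/'text' keys are assumptions.
    """
    batch = []
    for post in posts:
        # Drop every non-ASCII character to save space.
        text = post["text"].encode("ascii", "ignore").decode("ascii")
        # Assumption: the length check is applied after stripping.
        if len(text) < MIN_LENGTH:
            continue
        batch.append({
            "type": "add",
            "id": str(post["postid"]),
            "fields": {
                "text": text,
                "date": post["date"],
                "forumid": post["forumid"],
                "threadid": post["threadid"],
                "userid": post["userid"],
            },
        })
    return json.dumps(batch)
```

The function returns one JSON string per cron run, so a whole minute's worth of posts goes up in a single request.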
And sending that baby to Amazon
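Sending the batch could be sketched like this, using only Python's standard library. The endpoint hostname is whatever your search domain gives you for document uploads; the `/2013-01-01/documents/batch` path is an assumption on my part (the API version available when this was written may have used a different version segment and batch format), so treat it as a placeholder.

```python
import urllib.error
import urllib.request


def make_upload_request(doc_endpoint, batch_json):
    """Build (but do not send) the POST request carrying one document batch."""
    # Assumption: the API version in the path; use the one your domain reports.
    url = "https://" + doc_endpoint + "/2013-01-01/documents/batch"
    return urllib.request.Request(
        url,
        data=batch_json.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def upload(doc_endpoint, batch_json):
    """Send the batch; return True only if the service answers with HTTP 200."""
    request = make_upload_request(doc_endpoint, batch_json)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # Any 4xx/5xx answer counts as a failed upload.
        return False
```

Splitting the request building from the sending makes it easy to inspect what would go over the wire before pointing it at a real domain.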
And if it succeeds, mark them as indexed.
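The bookkeeping around the ‘indexed’ field can be sketched as below. I'm using `sqlite3` purely so the example is self-contained; the forum's real database is a different one, and the `postid` and `text` column names are hypothetical. Only the `indexed` column and the 0/1 convention come from the description above.

```python
import sqlite3


def fetch_unindexed(conn, limit=1000):
    """Return (postid, text) rows that still need to be uploaded."""
    return conn.execute(
        "SELECT postid, text FROM posts WHERE indexed = 0 ORDER BY postid LIMIT ?",
        (limit,),
    ).fetchall()


def mark_indexed(conn, post_ids):
    """Flag the given posts as indexed once the upload has succeeded."""
    conn.executemany(
        "UPDATE posts SET indexed = 1 WHERE postid = ?",
        [(post_id,) for post_id in post_ids],
    )
    conn.commit()
```

Only the ids that were actually in the uploaded batch should be passed to `mark_indexed`, so that a failed upload leaves its posts at indexed=0 and they get picked up again on the next run.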
You might have noticed that I send date, forumid, threadid and userid with the posts. This allows us to also search via those fields, and filter by those fields. So if you only want to find threads containing the word ‘Butt’ in General Discussion, posted last week, by me - you can do that easily.
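As an illustration of filtering by those fields, here is a sketch that builds a search URL. The `q` and `fq` parameters, the `(term ...)` and `(range ...)` filter syntax, and the `/2013-01-01/search` path are how I understand a later version of the CloudSearch API to work, not necessarily what was available when this was written, so verify them against your domain's API version. The function names are made up.

```python
from urllib.parse import urlencode


def build_filter(forumid=None, userid=None, date_from=None, date_to=None):
    """Combine the optional field filters into one filter expression (or None)."""
    parts = []
    if forumid is not None:
        parts.append("(term field=forumid %d)" % forumid)
    if userid is not None:
        parts.append("(term field=userid %d)" % userid)
    if date_from is not None and date_to is not None:
        parts.append("(range field=date [%d,%d])" % (date_from, date_to))
    if not parts:
        return None
    if len(parts) == 1:
        return parts[0]
    return "(and " + " ".join(parts) + ")"


def build_search_url(search_endpoint, words, **filters):
    """Build the URL to open for a search, with optional field filters."""
    params = {"q": words}
    expression = build_filter(**filters)
    if expression is not None:
        params["fq"] = expression
    # Assumption: the API version segment in the path.
    return "https://" + search_endpoint + "/2013-01-01/search?" + urlencode(params)
```

So the ‘Butt’ in one forum, posted last week, by one user search would be a call like `build_search_url(endpoint, "butt", forumid=6, userid=1, date_from=start, date_to=end)`, where the ids and timestamps are whatever your site uses.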
But there’s more than that. When you search, it will also categorize your results. It’ll show you the top x forumids with that word in, and the top userids, and the top threadids.
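Asking for those categories and reading them back could look something like this. The `facet.<field>` request parameter and the `facets` / `buckets` response layout are assumptions based on my understanding of a later CloudSearch API version; an earlier version may name these differently, so adjust to whatever your domain actually returns.

```python
import json


def facet_params(fields, size=5):
    """Request the top `size` values of each field, ordered by hit count."""
    options = json.dumps({"sort": "count", "size": size})
    return {"facet." + field: options for field in fields}


def top_facets(response_body, field):
    """Return [(value, count), ...] for one facet field of a search response.

    Assumes the response is JSON with a 'facets' -> field -> 'buckets' layout.
    """
    facets = json.loads(response_body).get("facets", {})
    buckets = facets.get(field, {}).get("buckets", [])
    return [(bucket["value"], bucket["count"]) for bucket in buckets]
```

The dictionary from `facet_params(["forumid", "userid", "threadid"])` gets merged into the search query parameters, and `top_facets` gives you the list to render as clickable drill-down links.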
This allows you to show the results in a way that lets people drill down to find what they want. For example, this search:
You can see that it’s most mentioned in the Max Payne thread. And most mentioned by “A Big Fat Ass”. But it’s also mentioned in the Max Payne 3 thread a lot, and maybe that’s what we’re looking for. So clicking on the thread restricts the results to that thread.
Not a super example of why this is cool, but compare that process to vBulletin’s search:
Searching shouldn’t be THIS much work.
Searching is as easy as opening a URL. Because that’s all it is. When you create your search domain you’ll be given a unique URL to query. You can do this from inside your site (I wrap it in an API) - or you could query it directly via an AJAX request. Their dashboard lets you run test queries, so you can make sure it’s working.
This is where the usefulness will probably drop out for most people. The price starts at around $80 a month. So if you’re providing search for one small thing - it probably wouldn’t be worth it.
It scales with usage. So if you have a hundred million entries, and you’re querying 1000 times a second - it’s gonna cost a lot more than $80. But on the upside, performance will remain the same.
Right now we have nearly 3 million posts indexed and we’re still on a single small search instance (it scales automatically). I am expecting this to change to the large type soon (which is around $350 a month).
For us - even at $350 a month - I think it’s worth it, for these three reasons.
- Search Results are Instant
- Search Results are Better
- Takes pressure off the Database