|Geek Out with our CTO, Isaac Mosquera|
From time to time, we’re able to pull Isaac away from his job architecting Socialize to write a post. This is one such post. It’s more technical than most of our content, but if you want to geek out with Isaac, you’ll probably enjoy it. Want to continue the conversation? Leave a comment below or hit him up on Twitter.
At Socialize we store a lot of user interactions to provide our customers with intelligent reports and useful dashboard information about their mobile applications. A customer will often want to know “how many users were added in the last 3 months for my app?” or “how many notifications were sent in the last hour, and for which entity?” This is a common task not only for Socialize but for many data-driven companies. To answer these types of questions, we typically perform these 4 steps with the data:
The traditional solution is to use aggregate functions in the RDBMS, such as count(), to get the aggregated results, but this presents a few problems at large scale:
- Aggregating rows in a database creates unneeded load on the server
- Data could be spread across multiple sharded databases, so aggregating within any single database would give inaccurate results.
- Data could live in other datastores, such as a NoSQL store or even flat log files.
- Data is stored in inconsistent formats across many sources.
There is actually a lot of work associated with all of those steps, and doing it from scratch whenever a new request came in from customers didn’t make sense. To solve this problem we created the EventTracker. The purpose of the EventTracker system is to persist and analyze event-based data that requires scalability, reliability and performance. An event, in this context, is an action with a timestamp.
To collect the data we’ve set up a REST API that accepts data in a standard format but allows metadata to be attached to the event:
- bucket name – equivalent to a database table
- label name – a name to give this specific event
- metadata – a JSON formatted string of any additional properties you want to persist.
That’s it. Simple and lightweight. A typical request takes anywhere from 30-70ms, although events should generally be sent in a non-blocking manner, causing no delay to the client. We’ve also created an event-tracker client in Python which is easy to pull into any of our projects to get started immediately. Here is a sample of how tracking an event looks:
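The original sample isn’t reproduced here, but a minimal sketch of what such a non-blocking client might look like is below. The endpoint URL, class name, and method names are hypothetical illustrations, not the real Socialize client:

```python
import json
import threading
import urllib.request

class EventTracker:
    """Minimal event-tracker client sketch (hypothetical API)."""

    def __init__(self, endpoint="http://eventtracker.example.com/events"):
        self.endpoint = endpoint  # assumed endpoint, for illustration only

    def build_payload(self, bucket, label, metadata=None):
        # bucket ~ a database table, label ~ the event name,
        # metadata ~ arbitrary JSON-serializable extra properties
        return json.dumps({
            "bucket": bucket,
            "label": label,
            "metadata": metadata or {},
        }).encode("utf-8")

    def track(self, bucket, label, metadata=None):
        # Fire-and-forget: send on a background thread so the caller
        # never blocks on the 30-70ms request.
        payload = self.build_payload(bucket, label, metadata)

        def _send():
            req = urllib.request.Request(
                self.endpoint, data=payload,
                headers={"Content-Type": "application/json"})
            try:
                urllib.request.urlopen(req, timeout=2)
            except OSError:
                pass  # event tracking should never break the app

        threading.Thread(target=_send, daemon=True).start()

tracker = EventTracker()
tracker.track("users", "new_user", {"app_id": 2})
```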
When an event reaches the EventTracker server we do some validation, attach a timestamp and forward the event to Splunk. Splunk is enterprise software that indexes, analyzes and stores machine data. Splunk is designed to work with petabytes of information and scales horizontally by adding more machines, with RAID-configured disks for durability and speed. After Splunk indexes the event, it’s archived in gzip format. This reduces the total event log size by about 70-80%, because the log is mostly the same repetitive text. Then we do incremental backups to S3.
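As a rough illustration of why repetitive event logs compress so well (a toy example, not Splunk’s actual archival figures):

```python
import gzip

# An event log is mostly the same repetitive text; simulate one
# with 10,000 near-identical event lines.
line = b'{"bucket": "users", "label": "new_user", "app_id": 2}\n'
raw = line * 10_000

compressed = gzip.compress(raw)

# Fraction of space saved; highly repetitive text easily clears
# the 70-80% savings mentioned above.
savings = 1 - len(compressed) / len(raw)
```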
The most powerful aspect of Splunk is the ability to create queries that can analyze billions of events at a time using a MapReduce algorithm. A typical Splunk query looks like this:
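The original query isn’t reproduced here, but a sketch of what such a search might look like in Splunk’s search language follows; the index and field names are assumptions for illustration:

```
index=eventtracker bucket=users label=new_user app_id=2
  | timechart span=1d count
```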
This query returns all new users where app_id=2 and aggregates them by day. It kicks off a Splunk MapReduce job involving all the indexers and search heads available to Splunk in our infrastructure. This is equivalent to using Apache Pig or Google’s Sawzall for large MapReduce jobs. The EventTracker schedules these queries to run every hour to produce reports on customers’ data. Once a report completes, the results are stored back in an RDBMS for super-fast key-based retrieval.
After the results are persisted in the RDBMS, there needs to be a way to pull the data out so we can display it on the website, in the mobile SDK, or in a daily executive summary. The Event Analytics API connects directly to the RDBMS, pulls out the aggregated results, and returns them in JSON format so they can be consumed easily by the client. A sample query to find out “how many users were added in the last 3 months for my app?” would look like:
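The original sample isn’t shown here, but a sketch of what such a request and its JSON response might look like follows; the path, parameters, field names, and counts are all hypothetical:

```
GET /analytics/users?app_id=2&start=-3months&resolution=day

{
  "bucket": "users",
  "label": "new_user",
  "results": [
    {"date": "2012-01-01", "count": 117},
    {"date": "2012-01-02", "count": 132},
    ...
  ]
}
```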
In the graphic below you can see how we pull all of this together into one system that frees the developer from having to think about the entire process.
I’d love to know what you think. Please leave comments below or ask me on Twitter: @imosquera
UPDATE 3/16/12: Great follow-up post by Alex Popescu, diagramming the architecture from Isaac’s post above. Here is Alex’s diagram. You can find his full post here.