Geek Out with our CTO, Isaac Mosquera

From time to time, we’re able to pull Isaac away from his job architecting Socialize to write a post. This is one such post. It’s more technical than most of our content, but if you want to geek out with Isaac, you’ll probably enjoy it. Want to continue the conversation? Leave a comment below or hit him up on Twitter.

At Socialize we store a lot of user interactions so we can give our customers intelligent reports and useful dashboard information about their mobile applications. Many times a customer will want to know “how many users were added in the last 3 months for my app?” or “how many notifications were sent in the last hour, and for which entity?” This is a common task not only for Socialize but for many data-driven companies. To answer these types of questions we typically perform four steps with the data:

  1. Collection
  2. Storage
  3. Analysis/Aggregation
  4. Retrieval

The traditional solution is to use aggregate functions in the RDBMS, such as count(), to get the aggregate results, but this presents a few problems at large scale:

  1. Aggregating rows in the database creates unneeded load on the server.
  2. Data could be stored in multiple sharded databases, so an aggregate computed against any one of them would be inaccurate.
  3. Data could be stored in other datastores, like a NoSQL datastore or even flat log files.
  4. Data is not stored in a common format across the many sources.

There is actually a lot of work associated with all of those steps, and doing it repeatedly whenever a new customer request came in didn’t make sense. To solve this problem we created the EventTracker. The purpose of the EventTracker system is to persist and analyze event-based data in a way that is scalable, reliable and fast. An event, in this context, is an action with a timestamp.

Collection

To collect the data we’ve set up a REST API that accepts events in a standard format but allows metadata to be attached to each event:

  1. bucket name – the equivalent of a database table
  2. label name – a name to give this specific event
  3. metadata – a JSON formatted string of any additional properties you want to persist.

That’s it. Simple and lightweight. A typical request takes anywhere from 30-70ms, although events should typically be sent in a non-blocking manner, causing no delay to the client. We’ve also created an event-tracker client in Python which is easy to pull into any of our projects so we can get started immediately. Here is a sample of how tracking an event looks:
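
(A minimal sketch only: the module, class and method names below are assumptions, since the actual client interface is not reproduced here. Only the bucket/label/metadata fields come from the API described above.)

    # Hypothetical client interface; names are assumptions, not the real event-tracker client.
    from eventtracker import EventTracker

    tracker = EventTracker(host="eventtracker.example.com", non_blocking=True)

    tracker.track(
        bucket="notifications",                       # equivalent to a database table
        label="notification_sent",                    # name for this specific event
        metadata={"app_id": 2, "entity": "comment"},  # any extra JSON-serializable properties
    )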

When an event reaches the EventTracker server we do some validation, attach a timestamp and forward the event to Splunk. Splunk is enterprise software that indexes, analyzes and stores machine data. It is designed to work with petabytes of information and scales horizontally by adding more machines, with RAID-configured disks for durability and speed. After Splunk indexes an event, it’s archived in gzip format, which reduces the total event log size by about 70-80% because the logs are mostly repetitive text. We then do incremental backups to S3.
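
For illustration, a stripped-down version of that validate / timestamp / forward step could look like the sketch below; forwarding through the local syslog socket matches the syslog-ng setup mentioned in the comments, but the field names and transport details are assumptions:

    import json
    import logging
    import logging.handlers
    from datetime import datetime, timezone

    # Events are written to the local syslog socket; syslog-ng (or a Splunk
    # forwarder) ships them from there to the Splunk indexers.
    logger = logging.getLogger("eventtracker")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))

    def handle_event(bucket, label, metadata):
        # Basic validation, then attach a server-side timestamp.
        if not bucket or not label:
            raise ValueError("bucket and label are required")
        event = {
            "bucket": bucket,
            "label": label,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "metadata": metadata or {},
        }
        logger.info(json.dumps(event))  # one JSON event per log line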

Analysis/Aggregation

The most powerful aspect of Splunk is the ability to create queries that can analyze billions of events at a time using a MapReduce algorithm. A typical Splunk query looks like this:
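
(Sketch only: the index and field names below are assumptions based on the event format described above.)

    index=eventtracker bucket="users" label="new_user" app_id=2
        | timechart span=1d count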

This query returns all new users for app_id=2 and aggregates them by day. It kicks off a Splunk MapReduce job that involves all of the indexers and search heads available to Splunk in our infrastructure, the equivalent of using Apache Pig or Google’s Sawzall for large MapReduce jobs. The EventTracker schedules these queries to run every hour to build reports on customers’ data. Once a report is completed, the results are stored back in an RDBMS for super-fast key-based retrieval.
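
For illustration, persisting those hourly results for key-based lookup could look something like the sketch below; the table layout is an assumption, and sqlite3 stands in for the production RDBMS:

    import sqlite3

    conn = sqlite3.connect("reports.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS report_results (
            app_id INTEGER,
            metric TEXT,
            day    TEXT,
            value  INTEGER,
            PRIMARY KEY (app_id, metric, day)
        )
    """)

    def store_report(app_id, metric, rows):
        # rows: [(day, value), ...] as returned by the scheduled Splunk query.
        conn.executemany(
            "INSERT OR REPLACE INTO report_results VALUES (?, ?, ?, ?)",
            [(app_id, metric, day, value) for day, value in rows],
        )
        conn.commit()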

More on how Splunk uses MapReduce:
http://www.splunk.com/web_assets/pdfs/secure/Splunk_and_MapReduce.pdf

Retrieval

After the results are persisted in an RDBMS there needs to be a way to pull the data out so we can display the information on the website, in the mobile SDK or in a daily executive summary. The Events Analytics API connects directly to the RDBMS to pull out the aggregated results and returns them in JSON format so they can be consumed easily by the client. A sample query to find out “how many users were added in the last 3 months for my app?” would look like:
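
(Sketch only: the endpoint path, host and parameter names below are hypothetical, shown here with the Python requests library.)

    import requests

    # Hypothetical endpoint and parameters for "new users in the last 3 months".
    resp = requests.get(
        "https://api.example.com/analytics/users/new",
        params={"app_id": 2, "start": "2011-12-01", "end": "2012-03-01", "interval": "month"},
    )
    print(resp.json())  # e.g. {"app_id": 2, "results": [{"month": "2011-12", "count": 1234}, ...]}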

Architecture

In the graphic below you can see how we pull all of this together into one system that frees the developer from having to think about the entire process.

I’d love to know what you think. Please leave comments below or ask me on Twitter: @imosquera

UPDATE 3/16/12: Great follow-up post by Alex Popescu, diagramming the architecture from Isaac’s post above. Here is Alex’s diagram; you can find his full post here.

Socialize Architecture diagram by Alex Popescu

8 Responses to “Leverage Splunk for MapReduce & Big Data Analysis”

  1. Alex Popescu says:

    Great post! I had to link to it from myNoSQL blog. And because the final architecture diagram was not displayed I took a stab at recreating it: http://nosql.mypopescu.com/post/19393198800/polyglot-persistence-architecture-at-socialize-splunk

  2. Nice post! Is EventTracker something like Esper (http://esper.codehaus.org/), a CEP engine?

  3. imosquera says:

    Alex Popescu asked me: Even if Splunk feeds work in real-time aren't you concerned that a bug/failure of the pipes would make you lose data? Or is the EventTracker writing data to disk or a queue from where Splunk consumes it?

    • imosquera says:

      Currently it's being forwarded to Splunk using syslog-ng, but moving forward we plan on swapping out syslog-ng with Splunk Forwarders that do queueing on the local machine before sending out the packets, in case the indexers are down.

  4. Chris says:

    Why did you use Splunk? I mean, did you consider MongoDB instead?

  5. imosquera says:

    I think MongoDB solves a different problem than Splunk. For starters, MongoDB is a document store, whereas Splunk indexes machine data.

    Although MongoDB provides a MapReduce framework, just as Hadoop does, it's more difficult to write the MapReduce functions than to write queries in Splunk's query language.

    Lastly, Splunk comes with the ability to schedule, save and show reports with charting options, which allows us to quickly see the information in real time. This is not only great for developers but also a great way to provide the executive team with insights into what is happening in the system.

    That said, I'm a huge fan of MongoDB and we will probably be moving some of our datasets to mongo in the future.
