If you haven’t heard, Socialize was recently acquired by ShareThis. Part of joining a larger organization means leveraging resources that were previously inaccessible. One of those resources is ShareThis’ world-class technical operations team. Specifically, Senior Engineer Vigith Maurice is responsible for integrating Socialize’s systems into the existing ShareThis stack. He naturally had a few questions about our Splunk infrastructure, and instead of keeping the information locked up we decided it would be good to share it. His questions were as follows:

  • An overall overview of Splunk
  • How do you use it?
  • How was the deployment done?
  • What are the components?
  • If we create a new instance of Splunk using our bootstrap technologies, how do we move the existing data?
This post hopes to answer those questions for him, as well as for anybody else trying to accomplish the same thing or who simply wants to know more about how Socialize uses Splunk.

Overview of Splunk at Socialize

Splunk is used for all things logging at Socialize. It gives us the ability to ask questions about our data and quickly get answers without needing to write complex code or MapReduce jobs. More importantly, once we have answers about our data we can share our findings with others through easily digestible, visual dashboards rather than vague spreadsheets.

Our Use Cases

  • Monitoring
    Splunk monitors all of our servers as well as our applications. It ingests and indexes everything from nginx logs and application logs to custom events (JSON formatted) sent directly to a port. We can see how all of our systems are responding, all in one place.
  • Alerting
    We currently use it to alert through e-mail. We’re able to set thresholds on the results of queries, for instance “if nginx 500 response events are greater than 100” or “if user registration events are less than 350 for the last 10 minutes”. If any of those thresholds are crossed we can shoot off an e-mail. We could even execute a script that self-remedies the situation or escalates the problem to the right parties through an API. (A sketch of this kind of threshold search appears after this list.)
  • Log-based Analytics
    We log 3 types of events to Splunk:  service logs, application status and custom semi-structured events.

    Service Logs
    - These are logs from nginx, apache or gevent services: anything running at the system level that we want to monitor.

    Application Status
    - These events are all about what’s happening internally with the application. For instance, the application can write its queue size to its log on a regular interval so Splunk can interrogate the queue size, calculate analytics like average and maximum queue size, and alert the team by e-mail (or kick off a custom script) if there is ever a problem.

    Custom Semi-Structured Events
    - These are JSON-structured events which we can use later for analysis. For instance, we log an event when a user logs in or authenticates with Facebook/Twitter. An example of a semi-structured JSON event appears after this list.
  • Ad-hoc Analytics
    Splunk allows anybody to ask any question of the data at any time. For instance, last week one of our customers was experiencing what he described as “slow response times”. Using Splunk and ad-hoc querying I was able to get the response times just for his app. I then broke the query down by endpoint and got the average response time for each. I was able to quickly identify that the /v1/entity endpoint was the source of the issue. Seeing that his request/response sizes were fairly large for a few events, I then calculated an average packet size and noticed they were sending extremely large packets. All because I was able to slice and dice the data in different ways. I was even able to merge it with different datasets quickly to get an application name and e-mail address so I could notify that user with the results. (A sketch of this kind of query appears after this list.)
  • Dashboards
    It’s easy to create dashboards that let upper management digest numbers, charts and analytics at a glance. We provide logins to a variety of users so they can get to these dashboards, which show data in near real-time.
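
To make the alerting example concrete, a threshold alert is essentially a saved search with a condition on its results. A rough sketch of the two thresholds mentioned above might look like the searches below; the sourcetypes and field names (nginx, status, app_events, user_registration) are placeholders for however the data is actually indexed, and the 10-minute window is set on the alert’s schedule and time range rather than in the search itself.

    sourcetype=nginx status=500
    | stats count
    | where count > 100

    sourcetype=app_events event=user_registration
    | stats count
    | where count < 350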
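
The semi-structured events mentioned under Log-based Analytics are flat JSON objects that Splunk can pull fields out of. The example below is purely illustrative; the field names are made up for this post rather than taken from our actual schema.

    {
        "event": "user_auth",
        "provider": "facebook",
        "app_id": "12345",
        "user_id": "67890",
        "response_time_ms": 42
    }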
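
And the ad-hoc breakdown of response times by endpoint boils down to a search along these lines. The field names (app_id, endpoint, response_time, bytes) are again placeholders for whatever the real extractions are called:

    sourcetype=nginx app_id="<customer app id>"
    | stats avg(response_time) as avg_response, avg(bytes) as avg_size, count by endpoint
    | sort - avg_response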

Architecture

High-Level Architecture


Indexer Architecture

Components

  • Splunk Forwarders
    From Splunk: Forwarders are lightweight Splunk instances, whose main purpose is to consume data and forward it on to Splunk indexers for further processing. They require minimal resources and have little impact on performance, so they can usually reside on the machines where the data originates.

    For example, say you have a number of Apache servers generating data that you want to search centrally. You can install a Splunk indexer on its own Linux machine and then set up forwarders on the Apache machines. The forwarders can take the Apache data and send it on to the Splunk indexer, which then consolidates and indexes it and makes it available for searching. Because of their light footprint, the forwarders won’t affect the performance of the Apache servers. (A sketch of a minimal forwarder configuration appears after this list.)

  • Splunk Indexers
    Indexers are the core of the Splunk ecosystem. They’re responsible for ingesting log data, storing it safely and indexing it so it can be found later. The more of these boxes we have, the more the system can index and the further it can scale out. An indexer is the Splunk instance that indexes data, transforming raw data into events and placing the results into an index; it also searches the indexed data in response to search requests.

  • Search Head
    The search head is the point of entry for end users querying the data. It connects to all the indexers in the cluster so it can search all the data that is available. All queries are managed and set up through its UI. Currently this search head isn’t pooled/distributed, but it can and should be set up that way. (A sketch of the search head’s distributed search configuration appears after this list.)
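
As a rough illustration of the forwarder setup, a universal forwarder is driven by two small configuration files: inputs.conf says what to collect and outputs.conf says which indexers to send it to. The paths, ports, hostnames and sourcetypes below are placeholders rather than our actual values.

    # inputs.conf (on the forwarder): what to collect
    [monitor:///var/log/nginx/access.log]
    sourcetype = nginx

    # custom JSON events sent directly to a port
    [tcp://:5140]
    sourcetype = custom_json

    # outputs.conf (on the forwarder): where to send it
    [tcpout]
    defaultGroup = indexers

    [tcpout:indexers]
    server = indexer1.example.com:9997,indexer2.example.com:9997

    # each indexer listens for forwarded data via [splunktcp://9997] in its own inputs.conf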
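
Similarly, the search head learns about the indexers through distributed search peers, which are usually added through the UI or the splunk add search-server CLI command and end up in a configuration roughly like the sketch below. The hostnames are placeholders; 8089 is Splunk’s default management port.

    # distsearch.conf (on the search head): which indexers to fan searches out to
    [distributedSearch]
    servers = https://indexer1.example.com:8089,https://indexer2.example.com:8089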

Migration

  Splunk is a distributed system in which a component can fail and the system will continue to work, because the forwarders will redirect traffic to the available indexers. This makes the migration easy. The steps below show how the migration can happen without the need for an outage; a rough shell sketch of the same steps follows the list.

  1. Rsync the data directories from indexer 1 to the new machine while the Splunk processes are still running.
  2. Shut down the indexing process on machine 1. All traffic directed at machine 1 will now go to the other machines in the cluster.
  3. Rsync machine 1 to the new machine again to make sure the datastore is in a consistent state.
  4. Redirect traffic from indexer 1 to the new machine.
  5. Repeat with the other machines in the cluster.
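
For reference, the steps above translate into roughly the following shell session. The hostnames are placeholders and the paths assume a default install under /opt/splunk (index data lives under var/lib/splunk by default); treat this as a sketch, not a copy-paste runbook.

    # 1. Warm copy while splunkd is still running (old indexer -> new machine)
    rsync -a /opt/splunk/var/lib/splunk/ new-indexer:/opt/splunk/var/lib/splunk/

    # 2. Stop Splunk on the old indexer; forwarders fail over to the remaining indexers
    /opt/splunk/bin/splunk stop

    # 3. Final rsync so the datastore on the new machine is consistent
    rsync -a /opt/splunk/var/lib/splunk/ new-indexer:/opt/splunk/var/lib/splunk/

    # 4. Bring up the new indexer and redirect forwarder traffic to it (DNS or outputs.conf)
    ssh new-indexer /opt/splunk/bin/splunk start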
