Distributed Logging – Managing 7 Production Apps across 200+ Servers with Gearman

History

 

Be DLA - Log View

Behance Distributed Logging Application - Log View

What do you do when there is an error across 7 different web applications running in production and load balanced across roughly 200+ servers? Good luck logging into each and grepping some logs. No, crying won’t help; we tried that. Instead, why not build a tool to log all of those errors into a centralized location? After evaluating the many (fantastic) pre-built options like: Loggly, Splunk, even Syslog… we found none of them provided all the capabilities we wanted. The job to build our own solution was then tasked to Matt LeBrun and me, and here is our awesome journey.

First off, a little history. What are our 7 applications?

  1. Behance Network – Our online social networking platform to showcase and discover creative work.
  2. ProSite – Premium service of the network allowing users to transform their portfolio into a completely custom website.
  3. Billing service – Handles all of the monetary transactions that occur on any one of our applications.
  4. Image service – Handles all image manipulations across all of our sites including cropping, resizing, thumbnail generation, HTML to PDF conversion, etc.
  5. Creative’s Outfitter – Our online store full of products and services that organize and empower creative professionals.
  6. Action Method Online – A task management system for creative thinkers. Maximizes organization and accountability for individuals and teams.
  7. Linkedin Application – Our Behance Network module on LinkedIn which allows our users to showcase their work directly on their LinkedIn profile pages.

There were a couple of questions we needed to figure out in order to get started:

  1. How are messages going to be sent to this centralized system?
  2. How can this be done asynchronously so that the reporting app doesn’t stall when something goes wrong?
  3. How will we display, search and sort the information once we get it?

We went through several different iterations:

Attempt #1 – SNMP

 

The very first method we tried to use was SNMP — more specifically a SNMPv2-Trap notification. A trap is an asynchronous notification that is sent from an agent ( a node of some sort ) to a manager. Though this required additional setup on each node, this seemed like a straightforward way to immediately integrate with standard monitoring tools, such as Nagios, or Zenoss. While testing this approach, we were able to successfully send off a couple small test notification messages from one server to our temporary centralized system ( we set up Zenoss on an extra Virtual Machine ). Things looked fine until we tried to send a real notification full of the information we wanted to log, including a stack trace, $_POST/$_GET variables, $_SERVER variable, $_COOKIE variable etc. It wasn’t until then did we realized SNMP had a very basic limitation:

"SMIv2 specifies a limit of 65535 octets. A size may be specified which can be fixed, varying, or multiple ranges.  Examples: OCTET STRING (SIZE(0..255))"

Based on the type of trap we were using, it was impossible to send a trap message bigger than 255 characters. Some of our stack traces alone tend to be around 900+ characters. It was clear at this point that SNMP was not the protocol we wanted to use for our system.

Attempt #2 – cURL

 

On our second attempt, we tried to simulate an asynchronous notification by using curl – which required no additional server changes or configuration. We did this by using two cURL options, CURLOPT_CONNECTTIMEOUT_MS and CURLOPT_TIMEOUT_MS. By setting both of these options to 1 MS, we were able to simulate a “close-enough” fire & forget notification. This seemed to be a much better alternative to SNMP until we ran into two complications.

First, we realized that the version of cURL that was installed by default on our pre-production / production cloudservers was version  7.15.1 (released in 2005!). The two PHP curl_setopt values we needed were not added till version 7.16.2, and we were definitely far behind. Our CTO Chris Henry got around this issue by compiling a cURL RPM from source and installing it where needed, but then we ran into our second issue. On single notifications, there were no problems….. Notifications were handled correctly on both the agent and manager side of things. However, in order to test the performance of this system in production, we used ab ( Apache Bench, an Apache HTTP server benchmarking tool ) against a test script. We ran ab with 10,000 requests for the benchmarking session and a concurrency of about 100. The results were less than stellar. Mysql had a REALLY hard time dealing with all of the requests coming in, and the requests per second that apache was serving was embarrassing. Once again, we were forced to go back to the drawing board.

Attempt #3 [ 3rd time's the charm!! ] – Gearman

 

Finally, our Chief Software Architect, Bryan Latten, recommended using Gearman to handle the notifications, “workerizing” the whole process. Here at Behance we are big fans of Gearman and we’ve been using it to handle many other tasks across all of our applications. We already have the PHP gearman module installed on all servers by default, which makes for a happy infrastructure team! For those of you not familiar with it, Gearman is designed to distribute appropriate tasks (or jobs) to workers so large tasks can be done more quickly. All jobs are queued up nicely and performed when a worker is “free”. To increase parallel processing performance, simply increase the number of active workers in the pool. We implemented this change in our code first thing at work one morning and ran Apache Bench on it … #jackpot. Performance across the board was outstanding.

We finally found the correct mechanism to transfer our log messages across all of our applications. Background jobs sent to Gearman are nearly fire & forget, so our applications will not choke when centralized logging is not available.

Putting it together:

We have our transport, now time to add the appropriate application hooks. What did we want to log, anyway?

The Bad Stuff:

1. Uncaught Exceptions: These errors are picked up by our top-level exception handler which traditionally e-mails out logs/stack traces to a few developers on the team. Sometimes, these emails are archived automatically due to the sheer number of them we receive in our development environment. By using this method alone it’s possible for some bugs to be overlooked. Now that we have the Be DLA, by dropping a single line wrapper before the email – anyone with access to it can now have enough information to debug and fix the problem. Easy!

2. Fatal Errors: little known PHP fact – you can create a “close enough” fatal error handler with the following code:

    register_shutdown_function( function() {

      $last_error = error_get_last();

      if ( !empty( $last_error['message'] ) ) {
        // Create Gearman background job!
      }

    });

 

This won’t catch everything, but there is no other known way to trap a fatal otherwise – E_FATAL does NOT go through the default php error_handler :(

The Good Stuff:

  1. Signups
  2. Payments
  3. Upgrades
  4. Sampled performance data

So now that we’re able to send messages across all of our applications to our “manager” server, how do we view everything we’re throwing at it? Let’s explore how we put everything together.

Logs Page

( to debug errors across all of our servers )

This page contains all of the errors that our applications throw on a regular basis. The majority of these errors are expected ( Eg. When someone tries to upload a CMYK image, which we do not allow, an exception is thrown and it is logged in the Behance Distributed Logging Application) .

Performance Page


( Benchmark database queries / Memcache queries / Page render times  )

Be DLA - Performance View

The Performance tab gives us a high level snapshot of some of our worst performing pages. You can sort by application and a period of time to figure out where some of our bottlenecks are. Some of our categories for our Worst Offenders list is Slowest Render Time, Most Slave Reads, Slowest Slave Read Query, Slowest Slave Total Read, Most Master Reads, Slowest Master Read, Most Master Writes … the list goes on. Below the Worst Offenders List we have a Summary Average List which shows more detail on which unique URLs have the the slowest render times. When you expand the views on these guys, It displays a number of key averages including:

 

  • DB Master Read/Write Avg
  • DB Slave Read Avg
  • Memcache Read/Write Avg

 

Using this page we were further able to discover outliers within our applications that were candidates for optimization.

Realtime Page

( Keep track of all of our users logging in across the world! )

Be DLA - Now View

Finally we arrive at our Realtime page. This page was way too much fun to create. As a way of keeping track of signups/payments/subscription changes across the board AND inspiring our team members to continue working hard to create those awesome features you love, we put together a “Now” page for things that are currently happening. It currently features a global map with markers on it to display where the most recent actions have [generally] taken place, as well as lists for each app with the most recent actions that have taken place.

This is great because we get insights into things such as what services/features are being taken advantage of the most, geographic trends, along with typically being the first place for catching spammers. Mainly, however, we just like seeing the markers popup on the map! :) Stay tuned for a more detailed post on this page because it will detail some of the awesome, fun technologies used for this page alone (ProcessingJS, NodeJS, and RabbitMQ to wet your whistle!).

The Be DLA gives us insight into issues that we might have missed during our QA process because we were unable to reproduce/predict how some of our customers would use some of our features. This web application makes it easier for our development team to quickly identify outliers, reproduce the problems, and HOTFIX changes into production. We also have DLA for all 3 of our environments ( Development, Pre Production, Production ). Over the past year, DLA has enabled us to create better features and applications for all of our users, and we’re only getting started!

Thanks a lot for reading about our journey. Feel free to leave any questions / comments below!

– Malcolm

Find me on Twitter: @bossjones