As engineers, we have many open source tools at our disposal, and making sense of them all is a daunting task.  Luckily, at ShareThis we have two brilliant engineers, Senthil Rangaswamy and Juan Valencia, to work some of this stuff out for us.  This past week we discussed Oozie, a workflow scheduler for Hadoop.  Managing multiple jobs running across a cluster is not easy: timing, job priority, exception handling, and dependencies can all get very complex over time.  Here is an overview of the pros and cons of Oozie:

Pros:

  • It is built from the ground up to manage, coordinate, and run Hadoop jobs.
  • It is battle-tested in Yahoo’s mega Hadoop infrastructure.
  • Runs on your own infrastructure, with no dependence on any cloud service
  • Industry standard
  • UI for job states
  • Lots of instrumentation built-in
  • Can suspend and resume jobs
  • Supports workflows defined as a DAG (directed acyclic graph) of jobs
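
To make the DAG point concrete, here is a minimal sketch in Python of what scheduling by dependency means: each job runs only after the jobs it depends on. This is not Oozie's API. The job names and the `execution_order` helper are hypothetical, and the ordering comes from Python's standard `graphlib` module (Python 3.9+), not from anything in Oozie.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
JOB_DEPENDENCIES = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "sessionize": {"clean_logs"},
    "aggregate_clicks": {"clean_logs"},
    "publish_report": {"sessionize", "aggregate_clicks"},
}


def execution_order(dependencies):
    """Return one valid run order in which every job follows its dependencies.

    Raises graphlib.CycleError if the graph has a cycle, i.e. is not a DAG.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for job in execution_order(JOB_DEPENDENCIES):
        print(job)
```

A workflow scheduler does much more than this (retries, timing, suspend and resume), but the dependency ordering above is the core idea behind a DAG of jobs.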

Cons:

  • XML configuration makes debugging more difficult, and it makes your eyes bleed =)
  • Might not satisfy all use cases if your jobs extend beyond Hadoop.
  • May be convoluted to port more complex Hadoop jobs and pipelines.
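
For a feel of the XML involved, the sketch below uses Python's standard `xml.etree.ElementTree` to generate a very small workflow definition: a start node, one filesystem action, a kill node for failures, and an end node. Treat it as an illustration under assumptions rather than a reference: the schema version in `OOZIE_NS`, the node names, the `build_workflow` helper, and the `${nameNode}/tmp/demo` path are all chosen here for the example, so check them against the Oozie documentation for your version.

```python
import xml.etree.ElementTree as ET

# Assumed workflow schema namespace; the version may differ on your cluster.
OOZIE_NS = "uri:oozie:workflow:0.5"


def build_workflow(name, mkdir_path):
    """Build a one-action workflow definition and return it as an XML string."""
    app = ET.Element("workflow-app", {"name": name, "xmlns": OOZIE_NS})
    ET.SubElement(app, "start", {"to": "make-dir"})

    # A single filesystem action that creates a directory.
    action = ET.SubElement(app, "action", {"name": "make-dir"})
    fs = ET.SubElement(action, "fs")
    ET.SubElement(fs, "mkdir", {"path": mkdir_path})
    ET.SubElement(action, "ok", {"to": "end"})      # transition on success
    ET.SubElement(action, "error", {"to": "fail"})  # transition on failure

    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Action failed"
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")


if __name__ == "__main__":
    print(build_workflow("demo-wf", "${nameNode}/tmp/demo"))
```

Even this trivial workflow needs five kinds of elements and explicit success and failure transitions, which hints at how verbose a real multi-job pipeline can become.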

In the end, you should make sure that Oozie is right for your use case.  You can check out Oozie at http://oozie.apache.org/.
