Tuesday, June 10, 2014

Running Cascading Hadoop jobs via CLI, Oozie, or an IDE (part 1)

If you stumbled upon this post, you probably already know what Cascading is: a library that offers a higher-level language for composing Hadoop jobs. If you're just starting with it, you may well be looking for the best way to set up your project so these jobs can run in various environments - locally within the IDE of your choice, or on a real Hadoop cluster via CLI job-submission commands. Also, in real-world scenarios, Hadoop workflows easily become complex enough to warrant an additional engine for defining and coordinating all these jobs, and Oozie is currently the most popular choice for that in the Hadoop ecosystem.

Ok, enough talkin', let's get to business...

To define our project setup and job packaging tasks, we'll use the Gradle build tool, a more modern alternative to Maven. The version of Cascading used here is 2.5.

But first, let's examine our 3 target environments....

Environments

IDE

Within our IDE, we want to run Cascading jobs in Hadoop local mode. This requires that all dependency libraries (Cascading, Hadoop, and possibly a few others) are on the classpath when Cascading jobs are executed there. So our build script should take care to generate IDE-specific project files that include all of these dependencies for job runs.

Command Line Interface (CLI)

Most examples found out there show how to trigger Hadoop MapReduce jobs via the command line interface - using the "hadoop jar myjobapp.jar ..." command (or "yarn jar myjobapp.jar ..." in newer versions of Hadoop).

Since we're using Cascading here, we somehow need to package the Cascading jars together with our custom code for these commands to work. There are a couple of ways to do this, but bundling everything into one jar (a "fat jar") seems to be the most popular one.

To construct the "fat jar" we just create a regular jar from all our custom classes, and also add the 3rd-party libs to it. There are two ways we can add these libs:
  • by placing them under an internal "lib" folder of the fat jar
  • by extracting all classes from them and adding them in unpackaged form to the same fat jar
The latter approach is a bit more complicated and destroys the structure of the dependency libraries, but as we will see right away, it offers one advantage.
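To make the difference concrete, here is a sketch of the two layouts (the application and class names are made up for illustration):

```
# approach 1: dependency jars kept whole under an internal "lib" folder
myapp.jar
├── com/example/MyJob.class
└── lib/
    ├── cascading-core-2.5.x.jar
    └── ...

# approach 2: dependency classes extracted ("shaded") into the jar itself
myapp-all.jar
├── com/example/MyJob.class
├── cascading/flow/Flow.class
└── ...
```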

One more thing - we don't need to package the Hadoop libraries, since they are already present on the cluster where the jobs run.

Oozie

Unlike the hadoop/yarn CLI commands, Oozie doesn't recognize fat jars containing an internal lib folder with dependency libs, so a Cascading app jar packaged that way would fail to run when triggered by Oozie.

On the other hand, the second approach to building fat jars works fine, and fortunately there is a Gradle plugin to help us with that - Gradle Shadow.

As with the CLI approach, we don't want to package the Hadoop libraries, for the reason already mentioned.
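For reference, a shaded jar like this would typically be placed in the workflow's lib/ directory on HDFS and launched via a java action. A minimal sketch (the workflow name, main class, and properties here are hypothetical placeholders):

```xml
<workflow-app name="cascading-wordcount-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="wordcount"/>
  <action name="wordcount">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <main-class>com.example.Main</main-class>
      <arg>${inputDir}</arg>
      <arg>${outputDir}</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```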

Build script

As said, we'll use Gradle as our build tool.

First, let's define our project dependencies in build script (build.gradle).

As already mentioned, we need to mark the Hadoop libraries separately, because we have to exclude them when packaging our job applications for the CLI/Oozie environments. We'll do that by defining a custom Gradle configuration:
 configurations {  
   hadoopProvided  
 }  

And finally, define the dependencies for that configuration (we'll also add slf4j/logback dependencies for proper logging when running within an IDE):
 hadoopProvided(  
       "org.apache.hadoop:hadoop-client:${hadoopVersion}",  
       "commons-httpclient:commons-httpclient:3.1",  
       "org.slf4j:slf4j-api:${slf4jVersion}",  
       "org.slf4j:jcl-over-slf4j:${slf4jVersion}",  
       "org.slf4j:log4j-over-slf4j:${slf4jVersion}",  
       "ch.qos.logback:logback-classic:1.0.+"  
   )  

For everything to work correctly, we have to add that configuration to the main sourceSet, and also register it to be included in the generated IDEA/Eclipse projects. Take a look at the GitHub project to see how it is done inside the build script.
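For completeness, that wiring typically looks something like the following sketch (the exact form in the GitHub project may differ slightly):

```groovy
// make hadoopProvided dependencies visible at compile time,
// without them leaking into the runtime configuration (and thus the fat jar)
sourceSets.main.compileClasspath += configurations.hadoopProvided

// expose the same dependencies to the generated IDE projects
idea {
  module {
    scopes.PROVIDED.plus += configurations.hadoopProvided
  }
}
eclipse {
  classpath {
    plusConfigurations += configurations.hadoopProvided
  }
}
```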

Cascading libraries are added to standard "compile" configuration:
 compile (  
     "cascading:cascading-core:${cascadingVersion}",  
     "cascading:cascading-local:${cascadingVersion}",  
     "cascading:cascading-hadoop2-mr1:${cascadingVersion}"  
   )  

Using the dedicated Gradle plugins for IntelliJ IDEA and Eclipse support, we can generate IDE-specific project files. The project-file generation tasks are invoked with:
 gradle idea  
or
 gradle eclipse  

To package our job application as a "fat jar" with all dependency classes extracted from their original jars, we have to include the Gradle Shadow plugin (currently version 0.9.0-M1) in the project via:
 buildscript {  
   repositories {  
     jcenter()  
   }  
   dependencies {  
     classpath 'com.github.jengelman.gradle.plugins:shadow:0.9.0-M1'  
   }  
 }  
 apply plugin: 'shadow'  

The "fat jar" is built by calling:
 gradle shadowJar  
and the end result can be found at:
<project-dir>/build/libs/<appname>-<appversion>-all.jar
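Once built, the fat jar can be submitted like any other Hadoop application (the jar name, main class, and paths below are placeholders for illustration):

```shell
# submit the shaded jar to the cluster; use "yarn jar" on newer Hadoop versions
hadoop jar build/libs/cascading-wordcount-1.0-all.jar \
    com.example.Main /input/path /output/path
```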

Complete build script is available here, as part of my cascading-wordcount GitHub project.

In the next part of this post, we'll look at how to execute simple Cascading job within all 3 environments.
