This post is the second in a hopefully substantive and informative series about Apache Crunch, a framework that makes it easier for Java developers to write MapReduce programs for Hadoop.
In my previous and first tutorial on Apache Crunch, I talked about the benefits of Crunch, and some basic driver code to help you understand what Crunch can do at an entry level. In today’s entry, I’d like to walk you through getting Crunch installed on your local machine so you can start playing with it yourself. If you’ve done this already, you’ll love the next tutorial on Java objects and materialization (coming soon).
Let’s talk about what you need:
- Eclipse or IntelliJ (both are free, the latter is strongly recommended)
- Maven
- An Apache Crunch project archetype (the wordcount example)
- Terminal (or equivalent command line application)
Setup
Install Eclipse or IntelliJ (but seriously though, if you’re just getting set up, get IntelliJ, it’s amazing). You’ve got this step down, I’m sure.
Install Maven. You can do this in less than 5 minutes on any home operating system (Windows, Mac, Linux) by following the installation steps on the Maven site. If you’re on Mac or Linux, the process is even simpler:
Open Terminal (or equivalent command line). Enter these commands exactly (one at a time):
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install maven
The first command will ask you to confirm by pressing Enter; do so. It’ll also ask for your password; enter it. You should see an ‘Installation Successful’ statement upon completion. The second command installs Maven using Homebrew (Brew), which is what the first command installed. The download is about 10 MB, and the output should say as much when it completes.
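Before moving on, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch, not part of the original setup steps: `check_tool` is a hypothetical helper name, and it only relies on the standard `command -v` shell builtin.

```shell
# check_tool NAME
# Hypothetical helper: reports whether NAME can be found on the PATH.
# Prints "found: NAME" on success; prints "missing: NAME" to stderr
# and returns a non-zero status otherwise.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Example usage (uncomment to run once Java and Maven are installed):
# check_tool java && check_tool mvn && mvn -version
```

If `check_tool mvn` reports the tool as missing right after installing, opening a fresh Terminal window so the PATH is reloaded is a reasonable first thing to try.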
Create a Crunch Project (comprehensive guide here). You can do this with a few short commands. Note: any bolded value in the commands below can be customized. For example, you don’t have to call your package com.hadoopsters.bigdata; you can call it mycompany.banana.suitcase, though it’s best to follow Java package naming conventions. The same applies to crunchdemo: you can call it MyCrunchDemoSupreme if you like.
- Open Terminal (or equivalent command line). Navigate to your development work area, such as an Eclipse Workspace or code project folder on your Mac.
- Enter these commands exactly (one at a time):
mvn archetype:generate -Dfilter=org.apache.crunch:crunch-archetype
1
23
com.hadoopsters.bigdata
crunchdemo
- Prompt will say 1.0-SNAPSHOT, but just hit ENTER.
- Prompt will offer your group ID as the default package (com.hadoopsters.bigdata in this example), but just hit ENTER.
- Prompt will say “Y:”, but just hit ENTER.
- Your Crunch project should be installed in the current folder in a directory called crunchdemo (or whatever you named it).
Expected output:
[INFO] Generating project in Interactive mode
[INFO] No archetype defined. Using maven-archetype-quickstart (org.apache.maven.archetypes:maven-archetype-quickstart:1.0)
Choose archetype:
1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job for Apache Crunch.)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : 1
Choose org.apache.crunch:crunch-archetype version:
1: 0.4.0-incubating
2: 0.5.0-incubating
3: 0.6.0
4: 0.7.0
5: 0.7.0-hadoop2
6: 0.8.0
7: 0.8.0-hadoop2
8: 0.8.1
9: 0.8.1-hadoop2
10: 0.8.2
11: 0.8.2-hadoop2
12: 0.8.3
13: 0.8.3-hadoop2
14: 0.8.4
15: 0.8.4-hadoop2
16: 0.9.0
17: 0.9.0-hadoop2
18: 0.10.0
19: 0.10.0-hadoop2
20: 0.11.0
21: 0.11.0-hadoop2
22: 0.12.0
23: 0.12.0-hadoop2
24: 0.13.0
Choose a number: 24:
Downloading: https://repo.maven.apache.org/maven2/org/apache/crunch/crunch-archetype/0.13.0/crunch-archetype-0.13.0.jar
Downloaded: https://repo.maven.apache.org/maven2/org/apache/crunch/crunch-archetype/0.13.0/crunch-archetype-0.13.0.jar (15 KB at 19.1 KB/sec)
Downloading: https://repo.maven.apache.org/maven2/org/apache/crunch/crunch-archetype/0.13.0/crunch-archetype-0.13.0.pom
Downloaded: https://repo.maven.apache.org/maven2/org/apache/crunch/crunch-archetype/0.13.0/crunch-archetype-0.13.0.pom (4 KB at 13.3 KB/sec)
Define value for property 'groupId': : com.hadoopsters.bigdata
Define value for property 'artifactId': : crunchdemo
Define value for property 'version': 1.0-SNAPSHOT: :
Define value for property 'package': com.hadoopsters.bigdata: :
Confirm properties configuration:
groupId: com.hadoopsters.bigdata
artifactId: crunchdemo
version: 1.0-SNAPSHOT
package: com.hadoopsters.bigdata
Y: :
[INFO] ----------------------------------------------------------------------------
[INFO] Using following parameters for creating project from Archetype: crunch-archetype:0.13.0
[INFO] ----------------------------------------------------------------------------
[INFO] Parameter: groupId, Value: com.hadoopsters.bigdata
[INFO] Parameter: artifactId, Value: crunchdemo
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Parameter: package, Value: com.hadoopsters.bigdata
[INFO] Parameter: packageInPathFormat, Value: com/hadoopsters/bigdata
[INFO] Parameter: package, Value: com.hadoopsters.bigdata
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Parameter: groupId, Value: com.hadoopsters.bigdata
[INFO] Parameter: artifactId, Value: crunchdemo
[INFO] project created from Archetype in dir: /Users/landon/Desktop/DevWorkspace/crunchdemo
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:35 min
[INFO] Finished at: 2015-10-01T22:42:09-04:00
[INFO] Final Memory: 13M/120M
[INFO] ------------------------------------------------------------------------
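If you would rather not answer the prompts by hand, the same values can probably be supplied up front. The sketch below is an assumption-laden alternative, not the method used in this post: `generate_crunch_project` is a hypothetical wrapper name, it uses Maven's batch-mode flag (`-B`) together with the standard archetype plugin properties, and it defaults to archetype version 0.13.0 because that is the version shown in the output above.

```shell
# generate_crunch_project GROUP_ID ARTIFACT_ID [ARCHETYPE_VERSION]
# Hypothetical wrapper: generates a Crunch project without interactive
# prompts by passing every answer as a -D property.
# The package name defaults to the group ID, mirroring the prompt default.
generate_crunch_project() {
  group_id="$1"
  artifact_id="$2"
  archetype_version="${3:-0.13.0}"  # assumption: version seen in the log above

  mvn archetype:generate -B \
    -DarchetypeGroupId=org.apache.crunch \
    -DarchetypeArtifactId=crunch-archetype \
    -DarchetypeVersion="$archetype_version" \
    -DgroupId="$group_id" \
    -DartifactId="$artifact_id" \
    -Dversion=1.0-SNAPSHOT \
    -Dpackage="$group_id"
}

# Example usage (uncomment to run; needs Maven and network access):
# generate_crunch_project com.hadoopsters.bigdata crunchdemo
```

The interactive route is still the better way to see which archetype versions are available; the batch form is mainly handy once you know which version you want and need to repeat the setup.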
If everything went well, you should have a Crunch project ready to go! Let’s see what’s in it by importing the project to IntelliJ.
Import into IntelliJ (as Maven project)
Now you have a Crunch project, and can start playing with things in the MemPipeline on your local machine (or Map/Reduce and Spark if you’re so bold, though I’d recommend getting familiar with Crunch in local form first). Definitely walk through the Wordcount Example on the Apache Crunch website, and see how it works!
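A quick way to sanity-check the generated project outside the IDE is to build it from the command line. This is a minimal sketch under the assumption that the archetype produces an ordinary Maven project that builds with the standard `package` phase; `build_crunch_project` is a hypothetical helper name, and `crunchdemo` is the example project name used in this post.

```shell
# build_crunch_project [PROJECT_DIR]
# Hypothetical helper: runs the standard Maven "package" phase inside the
# project directory (default: crunchdemo). The subshell keeps your current
# directory unchanged, and the build is skipped if the directory is missing.
build_crunch_project() {
  project_dir="${1:-crunchdemo}"
  ( cd "$project_dir" && mvn package )
}

# Example usage (uncomment to run from the folder that contains crunchdemo):
# build_crunch_project crunchdemo
```

As with any Maven build, the resulting artifacts should land in the project's target directory.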
Next time, we’ll write our first Crunch program in a MemPipeline, and explore more advanced topics like Java objects and materialization.