This post is the eighth in a hopefully substantive and informative series of posts about Apache Crunch, a framework that enables Java developers to write Map-Reduce programs for Hadoop more easily.
In my previous and seventh tutorial on Apache Crunch, we covered leveraging scaleFactor in Crunch, and how it can help you when your data grows or shrinks significantly during processing. Let's talk today about writing to the Apache Orc file format on Hadoop in Crunch, using built-in functionality.
Crunch can write to a few common Hadoop formats: text, Avro, and Orc are the three big ones, and it can write them straight to HDFS pretty easily. In my experience, text is sufficient for most folks when it comes to writing out their data. However, some want to leverage a richer, columnar format like Orc, which brings bonuses like better compression and faster reads (while taking a minor hit on write speed).
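To make that concrete, here's a minimal sketch of what those three kinds of targets look like side by side. The paths and variable names here are placeholders of my own, and only the Orc target matters for the rest of this post:

import org.apache.crunch.Target;
import org.apache.crunch.io.To;
import org.apache.crunch.io.avro.AvroFileTarget;
import org.apache.crunch.io.orc.OrcFileTarget;
import org.apache.hadoop.fs.Path;

// Plain text files on HDFS (placeholder path)
Target textTarget = To.textFile("/data/out/text");

// Avro container files
Target avroTarget = new AvroFileTarget(new Path("/data/out/avro"));

// Orc files (what the rest of this post is about)
Target orcTarget = new OrcFileTarget(new Path("/data/out/orc"));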
Today, I’ll show you how I’m able to convert text data to Orc format and write it to HDFS with Crunch — no plugins needed.
Orc, Explained
You can learn more about Orc (and my opinions on it) in this earlier post on Hadoopsters. That page will also tell you how to leverage it in Hive (which is a useful aside to this post). Go check that out for more info — for now, we’ll operate on the assumption that you want to go ahead and learn how to write Orc using Crunch.
Let’s do it.
Writing Orc in Crunch
Let’s import some stuff:
import org.apache.crunch.io.orc.OrcFileTarget;
import org.apache.crunch.types.orc.Orcs;
Let's create a pipeline and an OrcFileTarget. Crunch uses a concept of "targets" to write files of a certain type to a designated path.
//Create a Map-Reduce pipeline
Pipeline pipeline = new MRPipeline(MyClass.class, "Crunch Pipeline", getConf());

//Create a target linked to an HDFS path 'outputPath'
OrcFileTarget target = new OrcFileTarget(new Path(outputPath));
Great! Now you have a target that can write out an Orc file. But how do we form some Orc data to write out? The easiest way is to write a DoFn that will convert your data into Orc records, known as Tuples.
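One small assumption before the DoFn: the raw text has to be read from somewhere into a PCollection of Strings. A minimal sketch, assuming a comma-delimited text file on HDFS at some placeholder inputPath (this becomes the my_data collection used below):

//Read comma-delimited text from HDFS into a PCollection of Strings
//'inputPath' is a placeholder for wherever your raw data lives
PCollection<String> my_data = pipeline.readTextFile(inputPath);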
See below an example of a dataset composed of 9 columns (strings and ints) that's in a PCollection called my_data, which we'll convert to an equivalent PCollection of Orc tuples called my_data_as_orc:
PCollection<TupleN> my_data_as_orc = my_data.parallelDo(DoFn_ProduceOrcRecords(), Orcs.tuples(
    Writables.ints(),    //column1
    Writables.ints(),    //column2
    Writables.strings(), //column3
    Writables.strings(), //column4
    Writables.strings(), //column5
    Writables.ints(),    //column6
    Writables.ints(),    //column7
    Writables.ints(),    //column8
    Writables.ints()     //column9
));
Since this calls a custom DoFn (called DoFn_ProduceOrcRecords), we need to go ahead and create that DoFn, right? Here it is:
//define a DoFn that takes in a String and returns a TupleN object
static DoFn<String, TupleN> DoFn_ProduceOrcRecords() {
  return new DoFn<String, TupleN>() {
    @Override
    public void process(String input, Emitter<TupleN> emitter) {
      String[] inputParts = input.split(",");

      String col_1 = inputParts[0];
      String col_2 = inputParts[1];
      String col_3 = inputParts[2];
      String col_4 = inputParts[3];
      String col_5 = inputParts[4];
      String col_6 = inputParts[5];
      String col_7 = inputParts[6];
      String col_8 = inputParts[7];
      String col_9 = inputParts[8];

      emitter.emit(new TupleN(
          Integer.parseInt(col_1),
          Integer.parseInt(col_2),
          col_3,
          col_4,
          col_5,
          Integer.parseInt(col_6),
          Integer.parseInt(col_7),
          Integer.parseInt(col_8),
          Integer.parseInt(col_9)));
    }
  };
}
Now that we have a way to convert our data into a PCollection of tuples (which is what Crunch builds Orc files from), let's write it to a path! We'll go ahead and tell it to overwrite what's there, if anything (there are other write modes, like append):
pipeline.write(my_data_as_orc, target, Target.WriteMode.OVERWRITE);
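Worth noting: Crunch plans its jobs lazily, so nothing actually lands on HDFS until the pipeline runs. A minimal sketch of finishing the job, with the other write modes noted in comments for reference (the failure handling is my own placeholder, not from the original post):

//Nothing runs until you ask the pipeline to execute; done() plans,
//runs, and cleans up the MapReduce job(s) backing our write
PipelineResult result = pipeline.done();

if (!result.succeeded()) {
  throw new RuntimeException("Crunch pipeline failed to write Orc output");
}

//Other WriteMode options besides OVERWRITE include:
//  Target.WriteMode.DEFAULT    - fail if the output path already exists
//  Target.WriteMode.APPEND     - add new output alongside any existing files
//  Target.WriteMode.CHECKPOINT - reuse existing output if it's newer than the inputs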
So what does that look like in a file viewer? Well, not much, of course. Orc is a binary, columnar format, so it isn't human-readable the way plain text is.
But in Hive… we have data! Check out the tutorial on Orc in Hive here on Hadoopsters to learn about why/how this works (hint: using an orc-based table).
3 102 SM0044 F 0 30 0 0 0 2014-08-01
3 102 SM0044 F 0 45 0 0 0 2014-08-01
3 102 SM0044 F 1 0 0 0 0 2014-08-01
3 102 SM0044 F 1 15 0 0 0 2014-08-01
3 102 SM0044 F 2 30 0 0 0 2014-08-01
3 102 SM0044 F 2 45 0 0 0 2014-08-01
3 102 SM0044 F 3 0 0 0 0 2014-08-01
3 102 SM0044 F 3 15 0 0 0 2014-08-01
3 102 SM0044 F 4 30 0 0 0 2014-08-01
3 102 SM0044 F 4 45 0 0 0 2014-08-01
Breakdown
Alright, so we did a lot there. Let’s break it down.
- we imported the necessary Orc packages to support OrcFileTarget and the Orc tuple type (Orcs)
- we set up an OrcFileTarget mapped to an HDFS path (where we'll write the file)
- we wrote (and called) a DoFn that converts our text records into Orc-ready TupleN records
- we wrote the resulting Orc PCollection to the target we set up earlier
And that's how you write to Orc in Crunch! You can follow similar methods for writing Avro files too (there's a quick sketch of that just below). Next time, we'll cover… something else! Comments, feedback, and suggestions are welcome below. Thanks for reading!
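Since I mentioned Avro: here's a minimal, hedged sketch of what the Avro equivalent might look like. MyRecord, DoFn_ProduceAvroRecords, and avroOutputPath are placeholders of mine, not something defined earlier in this post; the relevant Crunch classes are AvroFileTarget and Avros:

import org.apache.crunch.io.avro.AvroFileTarget;
import org.apache.crunch.types.avro.Avros;

//'MyRecord' is a placeholder POJO; Avros.reflects() derives an Avro schema from it
//'DoFn_ProduceAvroRecords()' is a hypothetical DoFn<String, MyRecord>, analogous to the Orc one above
PCollection<MyRecord> my_data_as_avro =
    my_data.parallelDo(DoFn_ProduceAvroRecords(), Avros.reflects(MyRecord.class));

//Point an Avro target at an HDFS path and write, just like the Orc case
AvroFileTarget avroTarget = new AvroFileTarget(new Path(avroOutputPath));
pipeline.write(my_data_as_avro, avroTarget, Target.WriteMode.OVERWRITE);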
<< Previous Tutorial | Next Tutorial >>
I've just released 1.0 of a Java object-to-Orc writer library. It is annotation-driven and uses runtime code generation to hide all the boilerplate code. Have a look at it: github.com/eclecticlogic/eclectic-orc