sporc

ORC, or Optimized Row Columnar, is a popular big data file storage format. Its rise in popularity is due to its high performance, strong compression, and growing support across top-level Apache projects like Hive, Crunch, Cascading, and Spark.

I recently wanted/needed to write ORC files from my Spark pipelines, and found the documentation on the specifics lacking. So, here’s a way to do it.

First, a few assumptions:

  • You have a working Spark application
  • You know what RDDs and DataFrames are (and the difference)
  • You have a DStream, RDD, or DataFrame with data in it

If you have a DataFrame, writing ORC to HDFS could not be simpler:

Writing ORC from a DataFrame

This will simply write some good old .orc files in an HDFS directory. You can put a Hive table on top of it (there’s a sketch of the DDL right after the snippet).

import org.apache.spark.sql.SaveMode

val mydataframe = ... // put some data in your dataframe, friend
mydataframe
  .write
  .option("orc.compress", "snappy")
  .mode(SaveMode.Append)
  .orc("/this/is/an/hdfs/directory/")

Writing ORC (Partitioned) from a DataFrame

Write some good old .orc files in the HDFS directory specified, but inside partition folders laid out just like a Hive table. Note that the columns you pass to partitionBy (year, month, day, and hour here) must actually exist in the DataFrame. You can put a partitioned Hive table on top of it (DDL sketch after the snippet).

val mydataframe = ... // put some data in your dataframe, friend
mydataframe
  .write
  .partitionBy("year", "month", "day", "hour")
  .option("orc.compress", "snappy")
  .mode(SaveMode.Append)
  .orc("/this/is/another/hdfs/directory")

If you have an RDD, it’s really not much more complex.

Writing ORC from an RDD

Write some good old .orc files in an HDFS directory. Same drill as with a DataFrame, really; you just convert the RDD to a DataFrame first.

// import these guys
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// this should look familiar
val conf = new SparkConf()
val sc = new SparkContext(conf)
// set up this fella
val hiveContext = new HiveContext(sc)
// this is where the data goessssss (an RDD of case class
// instances, so Spark can infer the schema; see the sketch below)
val rdd = ...
// boom shocka-locka, turn that RDD into a DF and write it out
hiveContext.createDataFrame(rdd)
  .write
  .option("orc.compress", "snappy")
  .mode(SaveMode.Append)
  .orc("/this/is/yet/another/hdfs/directory/")
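A quick hedge on that createDataFrame call: in Spark 1.x it can only infer a schema when the RDD holds case class instances (otherwise you have to supply an RDD of Rows plus an explicit schema). A minimal sketch, using a made-up Event case class:

// hypothetical schema, purely for illustration; use your own fields
case class Event(id: Long, name: String)

// an RDD of case class instances lets Spark infer the columns
val rdd = sc.parallelize(Seq(Event(1L, "open"), Event(2L, "close")))
val df = hiveContext.createDataFrame(rdd)

// or, equivalently, with the implicit conversions:
import hiveContext.implicits._
val df2 = rdd.toDF()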

Writing ORC from a DStream

Working on a Spark Streaming app? Want to write ORC files out and avoid that awful, nasty “.saveAsTextFiles()” option? Do this, friend.

val mydstream = ... // these usually come from Spark Streaming apps
// a DStream is basically a chain of RDDs, each of which you can convert to a DF
mydstream.foreachRDD { rdd =>
  hiveContext.createDataFrame(rdd) // again, an RDD of case class instances
    .write
    .option("orc.compress", "snappy")
    .mode(SaveMode.Append)
    .orc("/this/is/an/hdfs/directory/too")
}
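One caveat worth hedging on: in a streaming app you want the HiveContext created once per JVM and reused across batches, not rebuilt inside every foreachRDD call. A common pattern (a sketch, not the only way to do it) is a lazy singleton, plus an emptiness check so you don’t write empty files for idle micro-batches:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// one HiveContext per JVM, created on first use and reused across batches
object HiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

mydstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) { // skip empty micro-batches
    val hiveContext = HiveContextSingleton.getInstance(rdd.sparkContext)
    hiveContext.createDataFrame(rdd)
      .write
      .option("orc.compress", "snappy")
      .mode(SaveMode.Append)
      .orc("/this/is/an/hdfs/directory/too")
  }
}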

And that’s really it! It’s pretty easy to benefit from the ORC file format without needing to do much in the way of altering your current pipeline.

And it beats raw text. If you have any questions or issues, put ’em below!
