How to Override a Spark Dependency in Client or Cluster Mode

Jars!

In this post, we’ll cover a simple way to override a jar, library, or dependency in your Spark application when a different version of it already exists on the Spark classpath and would otherwise cause runtime issues.

Recently, I needed to use a specific library as a dependency: Google’s GSON.

I needed a certain version of GSON or newer; otherwise there would be a runtime conflict. Why? A method I was using was private in earlier versions of GSON, but public in later versions.

This doesn’t draw the line exactly where the method changed from private to public, but generally speaking:

  • gson-2.2.4.jar: the method is private, so this version is too old for use here.
  • gson-2.6.1.jar: the method is public, and works fine.
  • Somewhere between the two, the method’s status changed.

So, because I had some functionality that required the method be public and accessible, it was important I specify the right version in my dependency manager (SBT). “That’s easy,” I thought. “No problem.”

So I added the GSON version I needed, and made sure to exclude GSON from any other dependencies that might be peskily including their own GSON.

"com.google.code.gson" % "gson" % "2.6.1" % "compile"

But there was a snag.

Everything worked fine during compilation and execution on my local machine, but my app would fail at runtime in a production environment, where it was launched with spark-submit on a YARN/Hadoop cluster. Specifically, I ran into this dreaded error:

java.lang.IllegalAccessError: tried to access method com.google.gson.Gson.newJsonWriter(Ljava/io/Writer;)Lcom/google/gson/stream/JsonWriter; from class retrofit2.converter.gson.GsonRequestBodyConverter

I did what everyone does, and I started Googling.

I didn’t know it at the time, but this IllegalAccessError makes sense: somewhere at runtime (either in my jar or on my classpath) was a version of GSON older than the one I was bundling (in my uber/fat jar, no less!), and the older version was the one being loaded.

But where was it coming from?

I checked the dependency tree of my Scala project using the wonderful sbt-dependency-graph plugin by jrudolph, but to my disappointment GSON was only coming up in the one place I expected it to. I thought for sure I’d find another library that was including GSON that I missed.

But then I thought to check the Spark classpath on my cluster, which for me lived at:

../spark-2.3.2-bin-hadoop2.6/jars/

A quick ls and I found my jackpot:

[lrobinson@myhadoopcluster]$ ls spark-2.3.2-bin-hadoop2.6/jars/*gson*
spark-2.3.2-bin-hadoop2.6/jars/gson-2.2.4.jar

So, what are we looking at? Apache Spark ships GSON for its own use in its jars directory, and everything in that directory is on your application’s classpath by default. So Spark was behind it the whole time!
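If you want to confirm that the bundled jar really is the culprit, one option is to inspect the method's access modifier directly. Below is a minimal sketch, assuming a JDK is installed so that javap is on your PATH; the function name show_method is my own invention for illustration.

```shell
# Print the declaration of one method from a class inside a jar.
# javap's -p flag includes private members, so the access modifier
# (private vs. public) shows up in the output.
# show_method is a hypothetical helper name, not part of any tool.
show_method() {
  jar_path=$1      # e.g. spark-2.3.2-bin-hadoop2.6/jars/gson-2.2.4.jar
  class_name=$2    # e.g. com.google.gson.Gson
  method_name=$3   # e.g. newJsonWriter
  javap -p -cp "$jar_path" "$class_name" | grep -F -- "${method_name}("
}

# Example invocation (not run here), using the jar found above:
# show_method spark-2.3.2-bin-hadoop2.6/jars/gson-2.2.4.jar com.google.gson.Gson newJsonWriter
```

Running the same check against the newer jar from your build should make the difference between the two versions easy to see side by side.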

I now knew the problem: Spark’s baked-in GSON jar was causing a dependency collision with my application at runtime, which explains why I did not see the problem when executing locally on my laptop.

So what were my options? I again turned to Google.

Option 1: spark.driver.userClassPathFirst

This is an experimental setting that did not solve my issue, and I would not recommend it to you either. It attempts to give the user’s classpath precedence over Spark’s. Unfortunately, it caused a myriad of other confusing errors that I did not even know where to begin debugging, so I gave up on it. If it works for you, please give me a holler.

Option 2: spark.driver.extraClassPath & spark.executor.extraClassPath

This is exactly what I wanted, and exactly what worked. According to the docs, these settings let me prepend jars to the driver and executor classpaths. The keyword is prepend, as in, put in front of Spark’s built-in classpath and libs.

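Below is how we leverage this feature for our apps. We use the --packages option to pull the GSON jar from Maven Central, the --jars option to point to where --packages saved that jar, and the two --conf settings to prepend the jar to the driver and executor classpaths. Note that the --conf values are bare file names: on YARN, jars passed via --jars are copied into each container’s working directory, so the relative name resolves there.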

--packages com.google.code.gson:gson:2.6.1
--jars /home/lrobinson/.ivy2/jars/com.google.code.gson_gson-2.6.1.jar
--conf spark.driver.extraClassPath=com.google.code.gson_gson-2.6.1.jar 
--conf spark.executor.extraClassPath=com.google.code.gson_gson-2.6.1.jar
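To show how those four options fit into a full command, here is a rough sketch of a wrapper function. The function name, the SPARK_SUBMIT and IVY_JARS_DIR variables, and the class and jar names in the usage comment are all placeholders I made up for illustration; only the four options themselves come from the snippet above.

```shell
# Wrap spark-submit with the four GSON override options from this post.
# Anything the caller passes (master, deploy mode, main class, app jar)
# is forwarded untouched after them.
GSON_VERSION="2.6.1"
GSON_JAR="com.google.code.gson_gson-${GSON_VERSION}.jar"

submit_with_gson_override() {
  # Assumption: --packages saved the jar under ~/.ivy2/jars, as it did for me.
  # IVY_JARS_DIR and SPARK_SUBMIT are hypothetical overrides for other setups.
  ivy_jars_dir="${IVY_JARS_DIR:-$HOME/.ivy2/jars}"
  "${SPARK_SUBMIT:-spark-submit}" \
    --packages "com.google.code.gson:gson:${GSON_VERSION}" \
    --jars "${ivy_jars_dir}/${GSON_JAR}" \
    --conf "spark.driver.extraClassPath=${GSON_JAR}" \
    --conf "spark.executor.extraClassPath=${GSON_JAR}" \
    "$@"
}

# Hypothetical usage; the class and jar names are placeholders:
# submit_with_gson_override --master yarn --deploy-mode cluster \
#   --class com.example.MyApp my-app-assembly.jar
```

Setting SPARK_SUBMIT=echo gives you a dry run that prints the assembled arguments instead of submitting anything, which is handy for checking the jar path before a real run.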

And like that, voila, the application ran and the IllegalAccessError was gone.

There are more scenarios beyond this one that you can use this solution for, so let us know how it helps you! Cheers!
