For a while now, it’s been possible to give custom names to RDDs in Spark. One benefit is a human-readable name in the Storage tab of the Spark Web UI. It looks like this:

val my_rdd = sc.parallelize(List(1,2,3))
my_rdd.setName("Some Numbers")
my_rdd.cache()

// running an action like .count() will fully materialize the RDD
my_rdd.count() 

Note: You could use an action like .take() (or .show() on a DataFrame) instead of .count(). But be careful.

Using .take() on a cached RDD (or .show() on a cached DataFrame) means only the part of the data needed to produce that output will actually be cached. Remember, Spark evaluates lazily and won’t do any more work than it has to.

In other words, you might assume the entire dataset is cached after running one of those actions, but unless the action touches 100% of your data, it will not be 100% cached.

But DataFrames have never been given the same clear route to conveniently renaming cached data, even though the community has both attempted and requested it.

With the approach below, though, you can start naming your DataFrames all you want. It’s very handy.

How to Nickname a DataFrame and Cache It

This method requires a few steps:

  • Create a DataFrame
  • If it needs to be repartitioned (due to skew), do that immediately
  • Create a SQL View
  • Cache using SQL Context (not precisely the same as df.cache() or df.persist(), as we’ll see)

PySpark

from pyspark.sql import Row

# Some arguments
tbl_name = "some_numbers_df"  # view names follow SQL table-name rules: no spaces
do_cache = True
num_partitions = 50

# Create an RDD
rdd = sc.parallelize(
    [Row(1),
     Row(2),
     Row(3)])

# Create a DataFrame
df = spark.createDataFrame(rdd, ['number'])

# Optional: Repartition as desired for better performance
if num_partitions:
    df = df.repartition(num_partitions)
    
# Create a SQL View from that DataFrame
df.createOrReplaceTempView(tbl_name)

# Cache the table using sqlContext
if do_cache:
    sqlContext.cacheTable(tbl_name)

# Run a .count() to fully materialize the cached DataFrame
df.count()

Now your DataFrame appears under the name you specified in the Storage tab of the Spark Web UI. It’s worth noting that the naming rules for a DataFrame nickname are stricter than for RDDs (notably, no spaces are allowed). That’s for good reason: the name follows the SQL convention for table names – and those can’t have spaces, can they?

Plus, cached DF tables will be prefixed with an “In-memory table” designation.

So there you have it: a way to name your cached DataFrames as table views in Spark SQL. This has the additional benefit of making your Spark SQL queries easier to read and comprehend.

Always a good thing.
