For a while now, it’s been possible to give custom names to RDDs in Spark. One result is a convenient, human-readable name in the Storage tab of the Spark Web UI. It looks like this:
val my_rdd = sc.parallelize(List(1, 2, 3))
my_rdd.setName("Some Numbers")
my_rdd.cache()
// running an action like .count() will fully materialize the rdd
my_rdd.count()
Note: You could use an action like .take() (or .show(), for DataFrames) instead of .count(). But be careful.
With .take() on a cached RDD (or .show() on a DataFrame), only the portion of the data the action actually needs will be cached (remember, Spark is a lazy evaluator and won’t do work until it has to).
The implication: you might think your entire dataset is cached after one of those actions, but unless the action touches 100% of your data, it will not be 100% cached.
But DataFrames have not been given the same clear route to convenient renaming of cached data, though the feature has been attempted and requested by the community.
However, with the approach below, you can start naming your DataFrames all you want. It’s very handy.
How to Nickname a DataFrame and Cache It
This method requires a few steps:
- Create a DataFrame
- If it needs to be repartitioned (due to skew), do that immediately
- Create a SQL View
- Cache using SQL Context (not precisely the same as df.cache() or df.persist(), as we’ll see)
from pyspark.sql import Row

# Some arguments
# Note: view names follow SQL table-name rules, so no spaces allowed
tbl_name = "some_numbers_df"
do_cache = True
num_partitions = 50

# Create an RDD
rdd = sc.parallelize([Row(1), Row(2), Row(3)])

# Create a DataFrame
df = spark.createDataFrame(rdd, ['number'])

# Optional: repartition as desired for better performance
if num_partitions:
    df = df.repartition(num_partitions)

# Create a SQL view from that DataFrame
df.createOrReplaceTempView(tbl_name)

# Cache the table using sqlContext
if do_cache:
    sqlContext.cacheTable(tbl_name)

# Run a .count() to materialize the df in cache
df.count()
Now your DataFrame appears under the name you specified in the Storage tab of the Spark Web UI. It’s worth noting that the allowed characters for a DataFrame nickname are more restricted than for RDDs (namely, no spaces are allowed). This is for good reason: the name follows the SQL convention for table names, and those can’t have spaces, can they?
Plus, cached DF tables will be prefixed with an “In-memory table” designation.
So there you have it: a way to name your cached DataFrames as table views in Spark SQL. This has the additional benefit of making your Spark SQL queries easier to read and comprehend.
Always a good thing.