
If you’ve been running Spark applications for a few months, you might start to notice some odd behavior with the history server (default port 18080). Specifically, it can take forever to load the page, show links to applications that don’t exist, or even crash. Three parameters take care of this once and for all.
The cause is that Spark’s default configuration keeps its history logs indefinitely, which is obviously both a good thing and a bad thing:
Good:
- You have a record of all your Spark apps from day one
- You can track your progress as a super DUPER awesome Spark Developer
Bad:
- Depending on the verbosity level, these logs can get VERY large. I mean VERY large:
    [hdfs@cluster1 ~]$ hadoop fs -du -h /
    6.8 G   /app-logs
    1.3 G   /mr-history
    1.5 T   /spark-history
    25.0 M  /tmp
    20.2 G  /user
- Remember that with the default HDFS replication factor of 3, /spark-history actually occupies 4.5 TERABYTES of disk, just for INFO messages telling you that various executors are starting or stopping. Not very useful.
Set these parameters in the “Custom spark-defaults” config section in Ambari (or in your conf/spark-defaults.conf file if you aren’t using Ambari) to take care of these massive logs:
1. spark.history.fs.cleaner.enabled=true
2. spark.history.fs.cleaner.interval=1d
3. spark.history.fs.cleaner.maxAge=7d
#1 enables the history cleaner, #2 sets the check-interval (every day, in this case), and #3 sets the maximum age of any log (7 days, in this case). Anything older than 7 days will be automatically deleted.
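To make the maxAge rule concrete, here is a minimal Python sketch of the selection the cleaner performs on each run. It is a simplified model, not Spark's actual implementation: the helper names (`parse_duration`, `expired_logs`), the application names, and the dates are all made up for illustration, and the duration parser only handles the s/m/h/d suffixes.

```python
from datetime import datetime, timedelta

# Suffixes handled by this sketch (assumption: only s/m/h/d, which covers
# the "1d" and "7d" values used in the settings above).
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(value):
    """Turn a string such as '7d' into a timedelta."""
    number, unit = int(value[:-1]), value[-1]
    return timedelta(**{_UNITS[unit]: number})


def expired_logs(logs, max_age, now):
    """Return the names of logs last modified more than max_age before now."""
    cutoff = now - parse_duration(max_age)
    return sorted(name for name, modified in logs.items() if modified < cutoff)


# Hypothetical event logs mapped to their last-modified times.
now = datetime(2016, 8, 15)
logs = {
    "application_0001": datetime(2016, 8, 1),
    "application_0002": datetime(2016, 8, 10),
    "application_0003": datetime(2016, 8, 14),
}

print(expired_logs(logs, "7d", now))  # ['application_0001']
```

Only the log that is more than 7 days old is selected. The other two survive until a later run, which is why the cleanup happens gradually rather than all at once.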
See the documentation here for more details:
https://spark.apache.org/docs/1.6.2/monitoring.html#viewing-after-the-fact
In my experience, this can take anywhere from a few hours to a whole day to actually take effect, but it does work! 45.4 GB is much better than 1.5 TB.
    [hdfs@cluster1 ~]$ hadoop fs -du -h /
    6.8 G   /app-logs
    1.3 G   /mr-history
    45.4 G  /spark-history
    25.0 M  /tmp
    20.2 G  /user
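If you want to compare the before and after listings without doing the math by hand, a rough Python sketch like the one below works. It assumes every line has the "number unit path" shape seen in the listings in this post, and that the K/M/G/T units are binary (powers of 1024); `parse_du` is a made-up helper name, not part of any Hadoop tooling.

```python
# Assumption: the -h units are binary multiples (1 K = 1024 bytes).
UNIT_BYTES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_du(output):
    """Map each path in a 'number unit path' listing to its size in bytes."""
    sizes = {}
    for line in output.strip().splitlines():
        number, unit, path = line.split()
        sizes[path] = float(number) * UNIT_BYTES[unit]
    return sizes


before = parse_du("""
6.8 G /app-logs
1.3 G /mr-history
1.5 T /spark-history
25.0 M /tmp
20.2 G /user
""")

after = parse_du("""
6.8 G /app-logs
1.3 G /mr-history
45.4 G /spark-history
25.0 M /tmp
20.2 G /user
""")

saved = before["/spark-history"] - after["/spark-history"]
print(f"{saved / UNIT_BYTES['T']:.2f} T reclaimed")  # 1.46 T reclaimed
```

That is the logical size only. With a replication factor of 3, the space freed on disk is three times as much.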
After some Googling, according to this site, it takes ~675,000 pieces of paper to print 1 GB of text, so that means 1,012,500,000 pieces of paper would be required to print 1.5 TB of logs. At roughly 4.5 g per sheet of ordinary office paper, that’s around 4,500 metric tons of paper, or nearly a tenth of the Titanic’s displacement. Let’s stick with hard drives.