I wonder how much paper you’d need to print 1.5 TB of logs…

If you’ve been running Spark applications for a few months, you might start to notice some odd behavior with the history server (default port 18080). Specifically, the page can take forever to load, show links to applications that no longer exist, or even crash. Three parameters take care of this once and for all.

This happens because, by default, Spark keeps all of its history logs indefinitely. That can be both a good thing and a bad thing:

The good:

  • You have a record of all your Spark apps from day one
  • You can track your progress as a super DUPER awesome Spark Developer

The bad:

  • Depending on the verbosity level, these logs can get VERY large. I mean VERY large:
    [hdfs@cluster1 ~]$ hadoop fs -du -h /
    6.8 G /app-logs
    1.3 G /mr-history
    1.5 T /spark-history
    25.0 M /tmp
    20.2 G /user
  • Remember that with the default HDFS replication factor of 3, this actually occupies 4.5 TERABYTES of disk, all for INFO messages telling you various executors are starting or stopping. Not very useful.
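
The 4.5 TB figure is just the logical size multiplied by the replication factor. A tiny Python sketch of that arithmetic, assuming the stock HDFS replication factor of 3 (the function name `raw_usage_tb` is made up for illustration, and your cluster may be configured differently):

```python
def raw_usage_tb(logical_tb, replication=3):
    """Raw disk consumed on HDFS: every block is stored `replication` times."""
    return logical_tb * replication

# 1.5 TB of Spark history logs, as reported by `hadoop fs -du -h /`
print(raw_usage_tb(1.5))  # 4.5
```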

Set these parameters in the “Custom spark-defaults” config section in Ambari (or directly in your spark-defaults.conf file if you’re not using Ambari) to take care of these massive logs:

1. spark.history.fs.cleaner.enabled=true

2. spark.history.fs.cleaner.interval=1d

3. spark.history.fs.cleaner.maxAge=7d

#1 enables the history cleaner, #2 sets the check-interval (every day, in this case), and #3 sets the maximum age of any log (7 days, in this case). Anything older than 7 days will be automatically deleted.
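
To make the maxAge rule concrete, here is a minimal Python sketch of the selection logic described above: anything last modified longer ago than maxAge is a candidate for deletion. This is not Spark’s own implementation (the real cleaner runs inside the history server); the helper names `parse_duration` and `expired_logs` and the sample application names are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical helper: parse duration strings such as "1d" or "7d".
# Only these four suffixes are handled in this sketch.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

def parse_duration(text):
    value, unit = int(text[:-1]), text[-1]
    return timedelta(**{_UNITS[unit]: value})

def expired_logs(logs, now, max_age="7d"):
    """Return the names of logs last modified more than max_age before now."""
    cutoff = now - parse_duration(max_age)
    return [name for name, modified in logs if modified < cutoff]

now = datetime(2020, 1, 31)
logs = [
    ("app-0001", datetime(2020, 1, 1)),   # 30 days old
    ("app-0002", datetime(2020, 1, 25)),  # 6 days old
    ("app-0003", datetime(2020, 1, 30)),  # 1 day old
]
print(expired_logs(logs, now))  # ['app-0001']
```

With maxAge=7d only the 30-day-old log is selected; the other two survive until they age past the cutoff on a later check.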

See the Spark monitoring documentation (the history server configuration options) for more details.

In my experience, this can take anywhere from a few hours to a whole day to actually take effect, but it does work! 45.4 GB is much better than 1.5 TB.

    [hdfs@cluster1 ~]$ hadoop fs -du -h /
    6.8 G /app-logs
    1.3 G /mr-history
    45.4 G /spark-history
    25.0 M /tmp
    20.2 G /user

After some Googling: according to this site, it takes ~675,000 pieces of paper to print 1 GB of text, so 1,012,500,000 pieces of paper would be required to print 1.5 TB of logs. That’s around 31,000 metric tons of paper, or about half the weight of the Titanic. Let’s stick with hard drives.
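
For anyone who wants to check the sheet count, here is the arithmetic as a short Python sketch. It assumes decimal units (1 TB = 1,000 GB), which is what the total above implies; the 675,000 sheets-per-GB figure is the one quoted from the site, and the function name `sheets_needed` is made up for illustration.

```python
SHEETS_PER_GB = 675_000  # figure quoted above
GB_PER_TB = 1_000        # assumption: decimal units, matching the total above

def sheets_needed(terabytes):
    """Sheets of paper needed to print the given amount of text."""
    return round(terabytes * GB_PER_TB * SHEETS_PER_GB)

print(f"{sheets_needed(1.5):,}")  # 1,012,500,000
```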
