Apache Spark is a feature-rich, rapidly growing analytics engine for big data processing. It’s used by numerous companies and universities, big and small, across many industries, to ingest, process and analyze data.

And while there are countless resources available for learning Spark these days (books, Stack Overflow threads, Apache mailing list discussions, Spark Summit talks, blogs and more), we think we have crafted the ideal resource for the Spark newcomer.

We at Hadoopsters are launching the Apache Spark Starter Guide to teach you Apache Spark using an interactive, exercise-driven approach.

Exercise-Driven Learning

While there are many disparate blogs and forums you could piece together to learn to code Spark applications, our approach is a unified, comprehensive collection of exercises designed to teach Spark step by step. Exercises are written in both Python and Scala, to be as accommodating as possible.

Learn at Your Own Pace

With exercises driving the majority of the text in the guide, the format caters to those with a busy schedule. It makes it easy to pick up where you left off and drive your development at your own pace.

Less Theory, More Coding

Understanding the theory behind distributed computing engines like Spark is critical to becoming an expert. But unlike most Spark books, this guide is hyper-focused on helping you start coding in Spark right now, learning practical, reusable, and scalable applications of Spark.

Questions and Answers

Who wrote this guide?

All content in this guide was written by Landon Robinson and Craig Covey.

Craig Covey is a Senior Software Engineer at Sam’s Club, focusing on Data Science platforms using Azure Databricks and Google Cloud Platform. He is responsible for the following content in The Spark Starter Guide:

  • Chapter 1: DataFrames
  • Chapter 2: Advanced DataFrames
  • Chapter 3: Spark SQL

Landon Robinson, a co-founder of Hadoopsters, is currently a Principal Data Scientist at Walt Disney Television Research, using platforms like Snowflake and Qubole (Spark). He is responsible for the following content in The Spark Starter Guide:

  • Chapter 4: Data Pipelines
  • Chapter 5: Spark Streaming and Structured Streaming
  • Chapter 6: Monitoring and Spark Listeners
  • Chapter 7: Addressing Skew and Cardinality

How will guide content be released?

Our goal is to release content periodically over the next several months, leading into 2021. You can expect one or two new sections of the guide roughly every week or two. Because each author wrote their own content semi-individually for this collaboration, they will release their content sequentially, as it relates to their chapters.

How much content is there in the guide?

This guide is actually the remnants of a book deal that recently fell apart. It has been refactored and repurposed to better fit the blog format. So believe us when we say it contains a few hundred pages of content (again, almost entirely exercises explained step by step).

What programming language are the exercises in?

Python and Scala. Pretty handy, right? A few exercises are available in only one of the two languages, but by and large both are supported.

Why Free? Can I buy the guide from you?

We ultimately decided that releasing all our hard work for free was the best way to benefit the most people looking to learn Apache Spark and hone their craft with it.

We welcome tips in the tip jar if our work has proven beneficial to you, or made your work easier.

100% of donations are used to pay for the Hadoopsters website costs. That includes hosting The Spark Starter Guide, and keeping it and the entire website ad-free.

Permissions for Code Usage

This site is here to help you get your job done. In general, you may use the code on this site and in The Spark Starter Guide in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code that you wish to publish.

For example:

  • Writing a program that uses several lines of code from this site does not require permission.
  • Selling or distributing examples from this site does require permission.
  • Answering a question by citing this site and quoting example code does not require permission.
  • Incorporating a significant amount of example code from this site into your product’s documentation does require permission.

We appreciate, but do not require, formal attribution. An attribution usually includes the title, author, publisher, and website URL. For example:

“The Spark Starter Guide: The interactive and exercise-driven approach to learning Apache Spark (hadoopsters.com/spark).”

If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at hadoopsters@gmail.com. We’re pretty relaxed about the whole thing. Be cool to others, be cool to yourself.

Outline

If a section below is linked, it is available for access. If it is unlinked, it has not yet been posted. We’ll be releasing new content sequentially over the coming weeks. There are a total of 7 chapters in The Spark Starter Guide.