Editor’s Note: The previous headline of this article, Spark Starter Guide 1.1, was incorrect. It has been changed to 4.1 to accurately reflect the outline of The Spark Starter Guide.

Overview

Welcome to the opening section of the Spark Starter Guide!

This section introduces the concept of data pipelines – how data is processed from one form into another. “Data pipeline” is also the generic term for the way data moves from one location or form to another, being consumed, altered, transformed, and delivered along the way.

You’ll be introduced to Spark functions like join, filter, and aggregate to process data in a variety of forms. You’ll learn it all through interactive Spark exercises in Scala and Python.

It’s also worth noting that this section is primarily introductory: the short code sketches are illustrative previews only, and the hands-on exercises begin in the next post. If you’d like to skip the introductions, proceed to the next post.

Introduction

In this chapter, you’ll become familiar with terms like “data pipeline” and “data engineering”, and learn what they may mean to you and your organization. As part of that, you’ll learn the pieces that make up an effective and performant data pipeline in Spark, and see why Spark is a powerful tool for data engineers.

The challenges of data engineering, which is the ingestion, processing, transformation, and storage of data, are multi-faceted. There is a significant learning curve to the discipline, especially if you don’t come from a background in distributed computing, but anyone can learn it with time and effort. There are technical challenges in this space that are less common in other gigs, and they are often very satisfying to overcome.

Spark has gained a lot of traction in the data engineering world due to its impressive performance with large datasets, its abstraction of complex distributed computing concepts into simple and powerful APIs, and continued support from the open source community. In short, Spark is fast, scalable, and packed with features.

Whether you’re just setting foot in this industry or have a decade of experience already, this chapter provides the skills you need to start building data pipelines in Apache Spark.

Introduction to Data Pipelines

Much like a physical pipeline you might encounter in real life, a data pipeline carries data from one place to another. Along the way, your data might travel great distances, be filtered, fork into two or more routes, or join into other pipelines. It can be a very simple and straightforward path, or it can be a very complicated one.

To put it more technically, a data pipeline is a set of one or more actions to be taken on data, where data is the input. Often, the output of one action is the input into another, with the final output being an analysis of the data or even an enhanced, cleaned or modified version of it.

Data pipelines take many forms. For example, working as a data engineer at a bank, you might be tasked with building a pipeline that pulls customer transaction records from a database, sums up the transactions by date and customer, and delivers a spending report to your customers showing their spending by day in the last week.

Alternatively, you might work at a tech startup with a team of data scientists who need to build machine learning models on your company’s e-commerce web traffic data. You might be tasked with building a pipeline that builds and stores a random sampling of user behavior logs with which those data scientists can train their models to predict scenarios like cart abandonment (when you add an item to your cart on an e-commerce website, and then leave the website without purchasing).

In any case, data pipelines are the method by which engineers consume, process, analyze, transform or deliver data to build a great product. Data pipelines allow us to answer tough questions with a valuable resource: facts and information. After all, that’s all data really is; the Oxford English Dictionary defines data as “Facts and statistics collected together for reference or analysis.”

So, data by its very nature is information that is collected for a purpose. It’s our gateway to a successful understanding of our customers, our applications, and our world. Without it, our decisions would simply be best guesses based on hunches and opinions. Thankfully, performant and resilient pipelines are a great starting point for unlocking the hidden power of your data.

Who Needs a Data Pipeline?

Now that you have a solid idea of what data pipelines are at a fundamental level, it’s important to discern who among us has a need for them. To put it simply, a data pipeline could prove valuable if:

  • You generate, consume or store large quantities of data
  • You have use cases for data, such as analytics, reporting or research

Thankfully, building data pipelines has never been a more popular pursuit than it is right now. Companies the world over, big and small, startups and empires alike, are using thorough, performant pipelines to process their data.

And you can too, with the help of this guide and a little open source project called Apache Spark.

ETL

There are several different ways in which data is processed in a pipeline.

The most common data pipeline is called Extract, Transform, and Load (ETL), and oftentimes “data pipeline” and “ETL” are used interchangeably.

ETL is a data pipeline that extracts data from one system, transforms it in some way, and loads it into another system.

A concrete example of this would be a scheduled process that reads a compressed file from a Secure File Transfer Protocol (SFTP) server every morning (extract), decompresses it on a local server and deduplicates it (transform), and loads it into a table in a relational database, a traditional solution for storing and querying data (load).

ETL is often used to describe, in a general sense, how a company reads data from one system and loads it into another. Companies may have one or more ETL pipelines that are responsible for filling distributed storage with a wide variety and volume of datasets. It’s also used as a general-purpose term for how data flows through a system or is processed initially.

ETL, of course, is just the beginning. There are many ways in which you can interact with data as it passes from system to system. You’ll learn all about it in the coming sections.

Proceed to the next section to begin your first exercise in Apache Spark!
