Spark Starter Guide 4.4: How to Filter Data

Previous post: Spark Starter Guide 4.3: How to Deduplicate Data


Introduction

Filtering data is important when you need to remove entries that are irrelevant or troublesome in your pipeline. Beyond data cleaning, filtering can also be used to fork different subsets of data onto different routes based on their contents.

Spark, unsurprisingly, has a clean and simple way to filter data: the aptly named .filter(). Alternatively, you can use .where(), which is an alias for .filter(). Like the deduplication method covered in the previous post, filtering is commonly applied in the Transform stage of ETL.
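For example, the two calls below are interchangeable. Here is a minimal Scala sketch; the pets DataFrame and the SparkSession setup are for illustration only (in spark-shell, a session named spark already exists):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("filter-demo").getOrCreate()
import spark.implicits._

// Toy DataFrame for illustration only.
val pets = Seq(("dog", "pet"), ("bear", "wild")).toDF("name", "category")

// .where() is an alias for .filter(): both lines return the same rows.
pets.filter("category = 'pet'").show()
pets.where("category = 'pet'").show()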

In the following exercise, we will learn how to filter data in a Spark DataFrame step by step.


Follow these steps to complete the exercise in SCALA:

Import additional relevant Spark libraries using the following code:

import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

Create a Sequence of Rows, each containing an animal and its category, using the following code:

val categorized_animals = Seq(Row("dog", "pet"),
                              Row("cat", "pet"),
                              Row("bear", "wild"))

Create a schema that corresponds to the data using the following code:

val schema = List(
  StructField("name", StringType, nullable = true),
  StructField("category", StringType, nullable = true)
)

Use the parallelize() function of Spark to turn that Sequence into an RDD as shown in the following code:

val animalDataRDD = spark.sparkContext.parallelize(categorized_animals)

Create a DataFrame from the RDD and the schema created in the previous steps using the following code:

val animalData = spark.createDataFrame(animalDataRDD, StructType(schema))
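As an aside, when the schema is just simple named string columns like this one, the same DataFrame can be built more tersely from a Sequence of tuples. A sketch (the animalDataShort name is illustrative; it requires the implicits import):

import spark.implicits._

// Shorter alternative: skip the explicit RDD and schema and let Spark
// infer string columns from a Seq of tuples.
val animalDataShort = Seq(("dog", "pet"), ("cat", "pet"), ("bear", "wild"))
  .toDF("name", "category")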

Filter out any animal names that start with the letter c and print results to the console using the following code:

val nonCats = animalData.filter("name not like 'c%'")
nonCats.show()

The following is the output of the preceding code:

+----+--------+
|name|category|
+----+--------+
| dog|     pet|
|bear|    wild|
+----+--------+

In the above output, we can see that the data is filtered such that only animals whose names do not start with c (dog and bear, but not cat) remain.
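The same condition can also be written with Column expressions instead of a SQL string. A sketch of this alternative form (the nonCatsExpr name is illustrative):

import org.apache.spark.sql.functions.col

// Column-expression form of the same filter: keep rows whose name does
// not start with "c". The ! operator negates a boolean Column.
val nonCatsExpr = animalData.filter(!col("name").startsWith("c"))
nonCatsExpr.show()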

Apply another filter, this time removing any animals that are pets, and print the results to the console using the following code:

val nonPets = animalData.filter("category != 'pet'")
nonPets.show()

The following is the output of the preceding code:

+----+--------+
|name|category|
+----+--------+
|bear|    wild|
+----+--------+

The above output shows a table where only wild animals (i.e. non-pets) remain in our data set.
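For reference, the same result can also be reached with the Column not-equal operator =!= instead of a SQL string. A sketch (the nonPetsExpr name is illustrative):

import org.apache.spark.sql.functions.col

// Column-expression form of the inequality filter: =!= is the
// not-equal operator for Columns.
val nonPetsExpr = animalData.filter(col("category") =!= "pet")
nonPetsExpr.show()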


Follow these steps to complete the exercise in PYTHON:

Create a List of tuples containing an animal and its category using the following code:

categorized_animals = [("dog", "pet"), ("cat", "pet"), ("bear", "wild")]

Use the parallelize() function of Spark to turn that List into an RDD using the following code:

animalDataRDD = sc.parallelize(categorized_animals)

Create a DataFrame from the RDD using the following code:

animalsDF = spark.createDataFrame(animalDataRDD, ['name', 'category'])

Filter out any animal names that start with the letter c, and print results to the console using the following code:

nonCats = animalsDF.filter("name not like 'c%'")
nonCats.show()

The following is the output of the preceding code:

+----+--------+
|name|category|
+----+--------+
| dog|     pet|
|bear|    wild|
+----+--------+

Above, we can see that only dog and bear remain after the filtering step.
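As in the Scala exercise, the condition can also be written with Column expressions instead of a SQL string. A PySpark sketch of this alternative form (the nonCatsExpr name is illustrative):

from pyspark.sql.functions import col

# Column-expression form of the same filter: keep rows whose name does
# not start with "c". The ~ operator negates a boolean Column.
nonCatsExpr = animalsDF.filter(~col("name").startswith("c"))
nonCatsExpr.show()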

Apply another filter, this time removing any animals that are pets, and print the results to the console using the following code:

nonPets = animalsDF.filter("category != 'pet'")
nonPets.show()

The following is the output of the preceding code:

+----+--------+
|name|category|
+----+--------+
|bear|    wild|
+----+--------+

In this exercise, we learned how to filter a given set of data based on conditions using the .filter() method.

In the next section, we will learn the concept of joins and the different approaches to using them in Spark.
