Getting ready

There are several ways to create a dataframe in Spark. One common way is to import a .txt, .csv, or .json file. Another is to enter the fields and rows of data manually into a PySpark dataframe; although this can be a bit tedious, it is useful when dealing with a small dataset. To predict gender based on height and weight, this chapter will build a dataframe manually in PySpark. The dataset used is as follows:

While the dataset will be added to PySpark manually in this chapter, it can also be viewed and downloaded from the following link:

https://github.com/asherif844/ApacheSparkDeepLearningCookbook/blob/master/CH02/data/HeightAndWeight.txt
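As a rough sketch of the manual approach described above, a small dataframe can be built from an in-memory list of rows with spark.createDataFrame. The rows below are illustrative placeholders, not the values from HeightAndWeight.txt, and the column names (gender, height, weight), the 0/1 gender encoding, and the helper names build_dataframe and demo are all assumptions made for this example.

```python
# Assumed column names; the actual dataset may label its fields differently.
COLUMNS = ["gender", "height", "weight"]

# Placeholder rows for illustration only; these are NOT taken from HeightAndWeight.txt.
# Assumed encoding: gender as 0 or 1, height and weight as whole numbers.
ROWS = [
    (0, 65, 140),
    (1, 70, 180),
    (0, 62, 120),
    (1, 72, 200),
]


def build_dataframe(spark):
    """Create a dataframe from the in-memory rows using an existing SparkSession."""
    return spark.createDataFrame(ROWS, COLUMNS)


def demo():
    """Start a local SparkSession, build the dataframe, and print it."""
    # Imported here so the data definitions above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local")
        .appName("ManualDataFrame")  # hypothetical application name
        .getOrCreate()
    )
    df = build_dataframe(spark)
    df.show()
    spark.stop()
```

Calling demo() in a notebook cell should display the four placeholder rows under the three column headers. If a SparkSession already exists in the notebook, build_dataframe can be called with it directly.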

Finally, we will begin this chapter and future chapters by starting up the Spark environment, configured with a Jupyter notebook, that was created in Chapter 1, Setting up your Spark Environment for Deep Learning, using the following terminal command:

sparknotebook