Set up Apache Spark on Ubuntu 22+

An easy guide to installing Apache Spark 3.5.1 on Ubuntu 22+

Apache Spark is an open-source distributed computing engine that has become a cornerstone of big data processing. This guide walks step by step through installing and configuring Apache Spark 3.5.1 on Ubuntu, so you can use it for data processing, machine learning, and more.

Introduction

Apache Spark speeds up data processing by keeping intermediate data in memory where possible, and it offers a unified platform for a range of data processing tasks. It supports batch processing, streaming analytics, machine learning, and graph processing, with APIs for Scala, Java, Python, R, and SQL.

Let's set up Apache Spark 3.5.1 on your Ubuntu machine. Follow along, and soon you'll be harnessing the power of distributed computing for your data-intensive tasks.

1. Update the System

Before installing any software on Ubuntu, refresh the package index so that apt installs current package versions.

sudo apt-get update

2. Install Java

Spark runs on the JVM, so a JDK must be installed first. Spark 3.5 supports Java 8, 11, and 17; the following command installs OpenJDK 8:

sudo apt install openjdk-8-jdk -y
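After the install, it can be worth confirming which Java version is actually on the PATH, since a machine may already have another JDK selected. The sketch below is one way to do that; the helper name java_major is made up for this example, and it assumes the usual `version "..."` format printed by OpenJDK.

```shell
# Hypothetical helper: extract the major Java version from `java -version` output.
# Handles both the legacy "1.8.0_402" scheme and the newer "17.0.10" scheme.
java_major() {
  local ver
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    1.*) ver=${ver#1.}; printf '%s\n' "${ver%%.*}" ;;   # 1.8.0_402 -> 8
    *)   printf '%s\n' "${ver%%[.-]*}" ;;               # 17.0.10   -> 17
  esac
}

# `java -version` writes to stderr, so redirect it before capturing.
if command -v java >/dev/null 2>&1; then
  echo "Detected Java major version: $(java_major "$(java -version 2>&1)")"
else
  echo "java not found on PATH"
fi
```

If the reported version is not 8, 11, or 17, `sudo update-alternatives --config java` lets you pick a different installed JDK.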

3. Download Apache Spark

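Download the Spark 3.5.1 release, prebuilt for Hadoop 3, from the official Apache mirror using wget. The dlcdn.apache.org mirror only carries current releases, so if the link below returns a 404, look for the same file under https://archive.apache.org/dist/spark/spark-3.5.1/.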

wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
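Apache publishes a SHA-512 checksum alongside each release archive, and comparing it against the downloaded file is a cheap way to catch a corrupted or truncated download. The following is a minimal sketch: verify_sha512 is a helper written for this guide, not part of Spark, and EXPECTED stands for the hex digest you copy from the Apache download page.

```shell
# Hypothetical helper: compare a file's SHA-512 digest with an expected hex string.
# Returns 0 on a match and 1 otherwise; the comparison ignores case and whitespace.
verify_sha512() {
  local file expected actual
  file=$1
  expected=$(printf '%s' "$2" | tr -d ' \n' | tr 'A-F' 'a-f')
  actual=$(sha512sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Usage (EXPECTED is a placeholder for the published digest):
#   verify_sha512 spark-3.5.1-bin-hadoop3.tgz "$EXPECTED" && echo "checksum OK"
```

The expected digest is passed in as an argument rather than parsed from the .sha512 file, because the layout of those files has varied between releases.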

4. Extract the Downloaded File

Once the download finishes, extract the archive:

tar -xvzf spark-3.5.1-bin-hadoop3.tgz

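5. Move the Folder to /opt

Move the extracted folder to /opt/spark, the path that the environment variables in the configuration section point to: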

sudo mv spark-3.5.1-bin-hadoop3 /opt/spark

Configuring Apache Spark

With Spark downloaded and extracted, the next step is to configure it.

1. Set Environment Variables

To set the environment variables, open the .bashrc file using the following command:

nano ~/.bashrc

Paste the following three lines at the end of the file (after the export PATH=$PATH:$JAVA_HOME line, if your file has one):

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3

Save and close the file.
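As an alternative to editing the file by hand, the same three lines can be appended from the shell. The sketch below uses a small helper, append_once, which is an illustrative name and not a standard command; it skips any line that is already present, so re-running the setup does not duplicate the exports.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already there (grep -x matches whole lines, -F treats it as a literal).
append_once() {
  local line file
  line=$1
  file=$2
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (single quotes keep $PATH and $SPARK_HOME unexpanded until login):
#   append_once 'export SPARK_HOME=/opt/spark' ~/.bashrc
#   append_once 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' ~/.bashrc
#   append_once 'export PYSPARK_PYTHON=/usr/bin/python3' ~/.bashrc
```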

2. Source the .bashrc File

To apply the changes, run the following command:

source ~/.bashrc

3. Verify Installation

To verify that Spark is installed and on your PATH, run the following command, which prints the Spark version banner:

spark-shell --version
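If the command is not found, the usual cause is that SPARK_HOME points to the wrong folder or that .bashrc was not re-sourced. A quick layout check like the sketch below can narrow that down; check_spark_home is a helper invented for this guide, and it falls back to /opt/spark only because that is the path used in the steps above.

```shell
# Hypothetical helper: confirm the main Spark launchers exist and are
# executable under the given directory (default: $SPARK_HOME, else /opt/spark).
check_spark_home() {
  local dir tool
  dir=${1:-${SPARK_HOME:-/opt/spark}}
  for tool in spark-shell spark-submit pyspark; do
    if [ ! -x "$dir/bin/$tool" ]; then
      echo "missing or not executable: $dir/bin/$tool"
      return 1
    fi
  done
  echo "Spark layout looks OK under $dir"
}

# Usage:
#   check_spark_home            # checks $SPARK_HOME
#   check_spark_home /opt/spark # checks an explicit path
```

Once the layout check passes, running `run-example SparkPi 10` exercises a bundled example job end to end.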