Ready to Get Hands-On with Hadoop?
Hadoop is a robust and scalable big data framework that powers some of the world’s largest data-driven applications. While it can seem overwhelming for beginners—especially given its many components—this guide walks you through the first steps to get started with Hadoop in simple, easy-to-follow stages.
📦 Installing Hadoop
Before you can start working with Hadoop, you’ll need to install it on your system. The process can vary depending on whether you’re setting it up locally or in the cloud. For learning purposes, we’ll focus on installing it locally.
Hadoop works best on Linux or macOS systems; Windows requires additional setup.
You’ll also need:
Java (version 8 or 11)
SSH enabled for communication between Hadoop's daemons, even on a single machine (see the check below)
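Before going further, it's worth verifying both prerequisites. The commands below follow the standard single-node setup from the Hadoop docs and assume an OpenSSH-style environment:
java -version   # should report version 8 or 11
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa   # generate a key with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost   # should log in without prompting for a password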
🛠️ Local Installation: Step by Step
To try out Hadoop locally, we’ll set up a single-node cluster, where all components run on one machine.
First, check the Hadoop Releases page for the latest version—this guide uses Hadoop 3.4.1. Adjust the version numbers in the code below if you prefer a different version.
Open your terminal and run:
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
tar -xvzf hadoop-3.4.1.tar.gz
cd hadoop-3.4.1
If the link fails, it's likely that the version is outdated; simply find the latest version and update the command.
The installation directory will be about 1 GB in size.
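To confirm the archive unpacked correctly, you can ask Hadoop for its version from inside the extracted directory:
bin/hadoop version
This should print the release number (3.4.1 here) along with build details. If it complains that JAVA_HOME is not set, don't worry; the next section covers environment variables.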
🔧 Configure Your Environment
Next, set up your environment variables. These let your system know where Hadoop is installed and put its commands, including the start and stop scripts in sbin, on your PATH:
export HADOOP_HOME=~/hadoop-3.4.1
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
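These exports only last for the current shell session. To make them permanent, and to set JAVA_HOME so Hadoop can find your Java installation, you can append them to your shell profile. A minimal sketch assuming bash; the JAVA_HOME path is an example, so substitute your own JDK location:
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc   # example path; adjust for your JDK
echo 'export HADOOP_HOME=~/hadoop-3.4.1' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
You can also set JAVA_HOME in $HADOOP_HOME/etc/hadoop/hadoop-env.sh so Hadoop's own scripts always find it.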
📝 Edit Hadoop Configuration Files
Before running Hadoop, you’ll want to tweak its configuration:
core-site.xml: Core Hadoop settings (e.g., the default filesystem address; a sample entry follows this list)
hdfs-site.xml: HDFS-specific settings (e.g., data directories, replication factor)
yarn-site.xml: YARN resource manager settings
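For a single-node setup, core-site.xml typically points Hadoop at a local HDFS endpoint. The snippet below mirrors the single-node example in the Hadoop docs, where port 9000 is the conventional choice. Open the file:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Then add, inside the <configuration> element:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>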
For local testing, set the replication factor to 1—since we’re working on a single machine, we don’t need multiple replicas. Open the HDFS configuration file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following inside the <configuration> element:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
For more configuration options, check the Hadoop Docs.
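By default, HDFS stores its data under /tmp, which many systems clear on reboot. If you want test data to survive restarts, you can also set explicit storage directories in hdfs-site.xml. The property names below are standard, but the paths are only examples:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/youruser/hadoop-data/namenode</value>   <!-- example path; use a directory you own -->
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///home/youruser/hadoop-data/datanode</value>   <!-- example path -->
</property>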
🚀 Start Hadoop
With the configuration done, it’s time to fire up Hadoop!
First, format the NameNode:
hdfs namenode -format
Then start the services:
start-dfs.sh
start-yarn.sh
If errors occur, make sure Java is installed and JAVA_HOME is set (Hadoop also reads it from $HADOOP_HOME/etc/hadoop/hadoop-env.sh). On macOS, you may need to start the processes explicitly:
~/hadoop-3.4.1/bin/hdfs --daemon start namenode
~/hadoop-3.4.1/bin/hdfs --daemon start datanode
~/hadoop-3.4.1/bin/yarn --daemon start resourcemanager
~/hadoop-3.4.1/bin/yarn --daemon start nodemanager
Verify that everything is running:
jps
You should see processes like NameNode, DataNode, ResourceManager, and NodeManager.
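You can also check the web interfaces: in Hadoop 3.x, the NameNode UI is served at http://localhost:9870 and the ResourceManager UI at http://localhost:8088. For a quick smoke test of HDFS itself, try copying a file in and listing it back; the paths here are arbitrary examples:
hdfs dfs -mkdir -p /user/$USER        # create a home directory in HDFS
echo "hello hadoop" > hello.txt
hdfs dfs -put hello.txt /user/$USER/  # copy the local file into HDFS
hdfs dfs -ls /user/$USER              # it should appear in the listing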
✅ That’s it! You now have Hadoop installed and ready to go.
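When you're done experimenting, you can shut the services down cleanly with the matching stop scripts, which ship in $HADOOP_HOME/sbin alongside the start scripts:
stop-yarn.sh
stop-dfs.sh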
With Hadoop now installed and configured, you’ve laid the foundation for working with one of the most powerful big data frameworks available. Whether you’re exploring data analysis, machine learning, or large-scale processing, this setup gives you the flexibility and scalability to dive deep into the world of distributed computing. Take your time experimenting with the environment, and enjoy the journey into big data!

