Contents

Hadoop

Hadoop

Big Data computation & storage: distributed data processing (MapReduce, YARN, HDFS, etc.)

Hadoop Ecosystem

./images/HadoopEcoSystem.png

Hadoop Cluster Components

Master Node

  • higher quality hardware
  • NameNode, Secondary NameNode, and ResourceManager (JobTracker in MRv1), each running on a separate machine
  • responsible for coordinating data storage in HDFS and overseeing key operations, such as parallel computations on the data using MapReduce

Worker Node

  • virtual machines running on commodity hardware
  • a DataNode and a NodeManager (TaskTracker in MRv1) running on each worker node
  • responsible for the actual work of storing and processing the jobs as directed by the master nodes

Edge Node (a.k.a. Gateway Node or Client Node)

  • responsible for loading data into the cluster, submitting MapReduce jobs (describing how the data is to be processed), and fetching the results.

Summary

  • Master Nodes: NameNode, ResourceManager (control services).
  • Worker Nodes: DataNode, NodeManager (storage & compute).
  • Edge Nodes: Client tools, job submission, data ingestion.
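Assuming shell access to a running cluster, which daemons live on which node can be confirmed with `jps` (the hostnames below are illustrative, not standard):

```shell
# On a master node, expect control services such as NameNode and ResourceManager
# (hostnames master01/worker01 are made up for this sketch)
ssh master01 jps
# On a worker node, expect DataNode and NodeManager
ssh worker01 jps
```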

Hadoop Cluster Architecture

YARN (ResourceManager and NodeManager)

The ResourceManager is the “brain” of YARN, ensuring efficient and fair resource sharing across the Hadoop cluster. It runs on the Master Node and works with NodeManagers (on Workers) and ApplicationMasters (per-job managers) to execute distributed applications.

Analogy:

  • RM = Air Traffic Control: manages all flights (jobs) in the cluster (airspace).

  • NM = Airport Ground Control: handles takeoffs/landings (containers) on a single node (airport).

  • Containers = Runways: allocated to flights (tasks) for a limited time.

ResourceManager responsibilities:

  • Global Resource Allocation

    • Manages and allocates cluster resources (CPU, memory) to applications (e.g., MapReduce, Spark, Hive) running on Worker Nodes.
    • Ensures fair sharing of resources among different users or queues (using schedulers like Fair Scheduler or Capacity Scheduler).
  • Application Life-cycle Management

    • Accepts job submissions from clients (via the Edge Node)
    • Starts an ApplicationMaster (AM) for each application (e.g., a MapReduce or Spark job)
    • Monitors the AM and restarts it if it fails
  • Node Management

    • Communicates with Node Managers (running on Worker Nodes) to track resource availability.
    • Handles heartbeats from NMs to detect node failures.
  • Security & Scheduling

    • Enforces access controls (e.g., via Kerberos or ACLs).
    • Uses a pluggable scheduler (Fair / Capacity / DRF) to assign resources according to configured policies.

./images/yarn_architecture.gif
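The RM's view of the cluster described above can be inspected with the stock `yarn` CLI (the queue name "default" is only an example):

```shell
# Applications the ResourceManager is currently tracking
yarn application -list -appStates RUNNING
# NodeManagers and their states, as learned from heartbeats
yarn node -list -all
# Scheduler-level view of one queue (capacity, used resources)
yarn queue -status default
```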

Compared with Apache Mesos

Mesos: general-purpose resource allocation shared across many kinds of applications and frameworks, whereas YARN is purpose-built for scheduling Hadoop workloads.

HDFS (NameNode and DataNodes)

An analogy with the Unix/Linux file system: the NameNode plays the role of the inode table, storing and managing file metadata (file name, permissions, block locations) much as an inode holds permissions, size, and pointers to data blocks; the DataNodes are the data blocks, storing the actual file contents and tracked by the NameNode; and data replication is analogous to RAID.

./images/hdfsarchitecture.png

./images/hdfsdatanodes.png
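The metadata/data split in the analogy above can be seen with standard HDFS commands; the path below is hypothetical:

```shell
# Ask the NameNode for a file's block metadata and where each replica lives
hdfs fsck /user/alice/data.csv -files -blocks -locations
# Print selected metadata fields (name, block size, replication), similar to stat on an inode
hdfs dfs -stat "name=%n blocksize=%o replication=%r" /user/alice/data.csv
```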

Edge Nodes

Security & Access Control

  • Acts as a controlled entry point enforcing authentication (Kerberos, LDAP) and authorization policies.
  • Channels all user access through this entry point, limiting direct access to master/worker nodes and enforcing strict security policies.

Job Submission

  • submit Hadoop jobs (MapReduce, Spark, Hive, etc.)
  • tools: hadoop, yarn, spark-submit

Data Ingestion & Export

  • import/export data into HDFS or other storage systems (Sqoop, Flume, Kafka)
  • connect to external data sources (databases, APIs, etc.)

Development & Administration

  • hosts Ambari or Cloudera Manager agents
  • hosts BI tools (like Tableau, Power BI) that query Hadoop data
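A typical edge-node session covering ingestion, submission, and result fetching might look like this sketch (paths are made up; the examples jar ships with Hadoop):

```shell
# Data ingestion: copy a local file into HDFS
hdfs dfs -put /tmp/events.log /data/raw/
# Job submission: run the bundled wordcount example (output dir must not already exist)
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /data/raw/events.log /data/out/wordcount
# Fetch results back to the edge node
hdfs dfs -get /data/out/wordcount/part-r-00000 ./wordcount.txt
```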

Hadoop Cluster Size

A set of metrics that defines the storage and compute capacity available to run Hadoop workloads:

  • Number of nodes: Master nodes, Worker nodes, Edge nodes (Client / Gateway)
  • Configuration of each node type: number of cores per node, RAM, and disk volume
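As a back-of-the-envelope sizing sketch using these metrics (every number below is an illustrative assumption, not a rule): with HDFS replication factor 3 and ~25% headroom for temporary and intermediate data, 100 TB of payload data needs roughly 375 TB of raw disk, and at 48 TB of disk per worker that comes to 8 worker nodes:

```shell
# Illustrative capacity planning; all figures here are assumptions.
data_tb=100                                  # payload data to store
replication=3                                # HDFS default replication factor
raw_tb=$(( data_tb * replication ))          # 300 TB after replication
total_tb=$(( raw_tb + raw_tb / 4 ))          # +25% headroom -> 375 TB
per_node_tb=48                               # e.g., 12 x 4 TB disks per worker
nodes=$(( (total_tb + per_node_tb - 1) / per_node_tb ))  # ceiling division -> 8
echo "total_tb=$total_tb nodes=$nodes"
```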