Data Science

As a reminder, HDFS is a file system
HBase is a database
Specifically, it is a distributed NoSQL database
It is built on top of (but separate from) HDFS
HBase provides real-time read/write access to data stored in HDFS
Roughly, we can think of HBase as a database layer on top of the HDFS file system
In an HBase cluster, the HMaster usually runs on the same server as the NameNode, and this server is called the master server
Region servers usually run on the same servers as the DataNodes

[Figure: HBase regions]

Zookeeper nodes are responsible for:
- Coordinating between the HMaster nodes and region servers
- Coordinating data retrieval from region servers
- Monitoring any session timeouts
- Monitoring the statuses of nodes in the cluster by checking for heartbeats
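
The heartbeat and session-timeout idea above can be sketched in a few lines. This is only a toy model and not the Zookeeper API: the class `HeartbeatMonitor`, the node names, and the 10-second timeout are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Toy stand-in for heartbeat tracking; not the Zookeeper API."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout  # seconds (hypothetical value)
        self.last_seen = {}  # node name -> time of its last heartbeat

    def heartbeat(self, node, now):
        # Record that `node` reported in at time `now`.
        self.last_seen[node] = now

    def expired(self, now):
        # Nodes whose last heartbeat is older than the session timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.session_timeout
        )


monitor = HeartbeatMonitor(session_timeout=10)
monitor.heartbeat("regionserver-1", now=0)
monitor.heartbeat("regionserver-2", now=0)
monitor.heartbeat("regionserver-1", now=8)  # only server 1 keeps reporting
print(monitor.expired(now=12))  # ['regionserver-2']
```

A node that stops sending heartbeats is treated as failed once its session times out, which is roughly how the rest of the cluster learns that a region server is gone.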

Hadoop basically consists of $3$ things:
- A file system (i.e. HDFS)
- A computation framework (i.e. MapReduce)
- A resource manager and job scheduler (i.e. YARN)
HDFS is used for:
- Storing huge amounts of data
- Ensuring the data is distributed
- Ensuring the data is redundant
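
As a rough illustration of distribution and redundancy, the sketch below splits a file into blocks and assigns each block to several DataNodes. The 128 MB block size and replication factor of 3 are the HDFS defaults in Hadoop 2 and later. The function `place_blocks`, the node names, and the round-robin placement are assumptions made for illustration; real HDFS placement is rack-aware.

```python
BLOCK_SIZE_MB = 128  # HDFS default block size in Hadoop 2 and later
REPLICATION = 3      # HDFS default replication factor


def place_blocks(file_size_mb, datanodes):
    """Split a file into blocks and assign each block to REPLICATION nodes.

    Round-robin placement is a simplification of what HDFS actually does.
    """
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(REPLICATION)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
layout = place_blocks(file_size_mb=300, datanodes=nodes)
for block, replicas in layout.items():
    print(block, replicas)
```

A 300 MB file becomes 3 blocks, and each block lives on 3 different nodes, so losing any single node leaves every block still readable.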
HDFS is good for sequential data access (reads/writes)
However, it is not good for random data access (reads/writes)
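- This is because it is only a file system: files are written once and read in large sequential blocks, with no index for locating individual records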
HBase is good for real-time, random data access
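
The difference between sequential and random access can be seen in a toy comparison. The list `records` plays the role of a flat file that must be read in order, and the dictionary `index` stands in for a keyed store. This is only an analogy: HBase does not use a hash map internally, and all names and data here are made up.

```python
# Flat-file style data: records can only be read in order.
records = [(f"user{i:04d}", i * 10) for i in range(1000)]


def scan_lookup(key):
    """Read records in order until the key turns up; count the reads."""
    for reads, (k, value) in enumerate(records, start=1):
        if k == key:
            return value, reads
    return None, len(records)


# Keyed store: one lookup per request, regardless of position.
index = dict(records)


def keyed_lookup(key):
    return index.get(key)


value, reads = scan_lookup("user0750")
print(value, reads)              # 7500 751
print(keyed_lookup("user0750"))  # 7500
```

The scan touches 751 records to find one value, while the keyed lookup goes straight to it. That cost is acceptable when a job reads the whole file anyway, but not when serving many small requests.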

HBase stores both structured and unstructured data
HDFS also stores structured and unstructured data
Both provide multiple mechanisms to access data:
- Shell
- APIs
HBase stores data as key/value pairs in a columnar fashion
HDFS stores data as flat files
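
To make the key/value, columnar layout more concrete, here is a minimal in-memory imitation of an HBase table. Each value is addressed by a row key and a `family:qualifier` column name, and is stored with a timestamp. The class `ToyTable` and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """In-memory imitation of an HBase table; not the HBase client API."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        # Writes never overwrite; they add a new timestamped version.
        self.rows[row][column].append((ts, value))

    def get(self, row, column):
        # Return the value with the newest timestamp, or None if absent.
        versions = self.rows.get(row, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]


table = ToyTable()
table.put("user1", "info:name", "Alice", ts=1)
table.put("user1", "info:email", "alice@example.com", ts=1)
table.put("user1", "info:email", "alice@example.org", ts=2)
print(table.get("user1", "info:email"))  # alice@example.org
print(table.get("user2", "info:name"))   # None
```

Rows do not need to share the same columns, which is one reason this model is more flexible than a fixed relational schema.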
HDFS has the following characteristics:
- Is optimized for streaming access to large files
- Follows a write-once, read-many model
- Doesn't support random reads/writes
HBase has the following characteristics:
- Stores key/value pairs in columnar fashion
- Provides a flexible data model
- Supports random read/write
- Provides low latency access to small amounts of data from large datasets
To summarize:
- HDFS is used for offline batch processing
- HBase is used for real-time reads and writes
