Data Science

Huge files are split into small chunks known as data blocks
By default, files larger than $128$ MB are split into $128$ MB blocks, and the final block may be smaller
Those blocks are stored across DataNodes
A NameNode stores metadata including:
- Which DataNodes contain which blocks
- Where those blocks are located
- etc.
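
The splitting and bookkeeping above can be sketched in a few lines of Python. This is only a toy model, not HDFS code: the round-robin placement, the replication factor of 3, the block id format, and the names `split_into_blocks` and `place_blocks` are all assumptions made for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size, as in the notes above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(path, file_size, datanodes, replication=3):
    """Build a toy block -> DataNodes map (hypothetical round-robin placement)."""
    copies = min(replication, len(datanodes))
    locations = {}
    for i, size in enumerate(split_into_blocks(file_size)):
        block_id = f"{path}#blk_{i}"  # made-up id format
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(copies)]
        locations[block_id] = {"size": size, "datanodes": replicas}
    return locations


# A 300 MB file becomes two full blocks plus one smaller final block.
layout = place_blocks("/logs/app.log", 300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```

The `layout` dictionary plays the role of the NameNode metadata: it records which DataNodes hold which blocks, but none of the file's bytes.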

A NameNode consists of:
- A namespace
- A block management service
A namespace consists of:
- A file-directory tree
- Metadata for all files and directories within the tree
- Mappings of blocks to files within directories
A block management service is used for:
- Monitoring DataNodes by processing the heartbeats they send periodically
- Handling registration of DataNodes
- Maintaining location of blocks
- Processing block reports
- Managing replica placement
- Performing block-related operations:
  - Create
  - Delete
  - Modify
  - Get block location
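
A rough sketch of the block management duties listed above, assuming DataNodes report in periodically. The class `BlockManager`, its method names, and the 30-second timeout are hypothetical and chosen only for illustration. They are not the HDFS implementation.

```python
class BlockManager:
    """Toy block management service: tracks DataNode liveness and block locations."""

    def __init__(self, heartbeat_timeout=30.0):
        self.heartbeat_timeout = heartbeat_timeout  # seconds; hypothetical value
        self.last_heartbeat = {}   # DataNode id -> time of its last heartbeat
        self.block_locations = {}  # block id -> set of DataNode ids

    def register(self, datanode, now):
        """Handle registration of a DataNode."""
        self.last_heartbeat[datanode] = now

    def heartbeat(self, datanode, now):
        """Record a heartbeat from an already registered DataNode."""
        if datanode not in self.last_heartbeat:
            raise KeyError(f"unregistered DataNode: {datanode}")
        self.last_heartbeat[datanode] = now

    def process_block_report(self, datanode, block_ids):
        """Replace what is known about a DataNode's blocks with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def live_datanodes(self, now):
        return {dn for dn, t in self.last_heartbeat.items()
                if now - t <= self.heartbeat_timeout}

    def under_replicated(self, now, target=3):
        """List blocks with fewer than target replicas on live DataNodes."""
        live = self.live_datanodes(now)
        return sorted(b for b, holders in self.block_locations.items()
                      if len(holders & live) < target)


bm = BlockManager(heartbeat_timeout=30.0)
for dn in ("dn1", "dn2", "dn3"):
    bm.register(dn, now=0.0)
bm.process_block_report("dn1", ["blk_0", "blk_1"])
bm.process_block_report("dn2", ["blk_1"])
bm.process_block_report("dn3", ["blk_0"])
bm.heartbeat("dn1", now=40.0)
bm.heartbeat("dn3", now=40.0)  # dn2 stays silent and will be treated as dead
needs_copy = bm.under_replicated(now=50.0, target=2)
```

Timestamps are passed in explicitly so the example is deterministic. A real service would read a clock instead.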

Every HDFS cluster has a single NameNode
This NameNode runs on an individual machine
A NameNode is a master server
It achieves the following:
- Regulating any client-requested access to files
- Managing the namespace of the file system
- Storing metadata of data blocks within DataNodes across its cluster
- Keeping metadata in memory for fast retrieval
- Sending block instructions (creation, deletion, replication) to DataNodes to fulfill
- Executing operations performed on the namespace
Namespace operations include the following:
- Opening files
- Closing files
- Renaming files
- Renaming directories
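
Namespace operations touch metadata only. A minimal sketch, assuming the namespace is simplified to a flat map from file paths to block ids (the notes describe a file-directory tree; the flat map and the class `Namespace` are assumptions made to keep the example short):

```python
class Namespace:
    """Toy in-memory namespace: maps each file path to its ordered block ids."""

    def __init__(self):
        self.files = {}  # path -> list of block ids

    def create(self, path, block_ids):
        self.files[path] = list(block_ids)

    def rename(self, src, dst):
        """Rename a file or a directory; only metadata changes, blocks stay put."""
        src = src.rstrip("/")
        dst = dst.rstrip("/")
        if src in self.files:  # renaming a single file
            self.files[dst] = self.files.pop(src)
            return
        prefix = src + "/"
        moved = [p for p in self.files if p.startswith(prefix)]
        if not moved:
            raise FileNotFoundError(src)
        for path in moved:  # renaming a directory: rewrite every path under it
            self.files[dst + "/" + path[len(prefix):]] = self.files.pop(path)


ns = Namespace()
ns.create("/data/a.csv", ["blk_0", "blk_1"])
ns.create("/data/b.csv", ["blk_2"])
ns.rename("/data/a.csv", "/data/a_old.csv")  # rename a file
ns.rename("/data", "/archive")               # rename a directory
```

After both renames the block ids are unchanged. Only the paths that point at them differ.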

Every HDFS cluster has at least one DataNode
A DataNode manages any file storage on its machine
Specifically, a file is split into one or more blocks
Then, these blocks are stored in DataNodes
A DataNode serves read/write requests from the file system's clients
On instruction from the NameNode, the DataNode also performs block creation, deletion, and replication
Remember, DataNodes aren't capable of performing any transformations
Only something like MapReduce is capable of this
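
A toy DataNode, to make the point concrete: it stores, returns, copies, and deletes block bytes, and never interprets them. The class and method names are hypothetical, and real DataNodes talk over the network rather than through direct method calls.

```python
class DataNode:
    """Toy DataNode: stores block bytes and applies block-level instructions."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = bytes(data)

    def read_block(self, block_id):
        return self.blocks[block_id]

    def delete_block(self, block_id):
        self.blocks.pop(block_id, None)

    def replicate_block(self, block_id, target):
        """Copy a block to another DataNode without inspecting its contents."""
        target.write_block(block_id, self.blocks[block_id])


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.write_block("blk_0", b"raw bytes")
dn1.replicate_block("blk_0", dn2)  # replication instruction
dn1.delete_block("blk_0")          # deletion instruction
```

Nothing here parses or transforms the payload. That work belongs to a processing layer such as MapReduce.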

A client sends a request to a NameNode on a cluster
The NameNode resolves that request to the appropriate DataNodes
- It does this by analyzing the filesystem tree
- And it refers to the metadata
The NameNode returns those block locations to the client
The DataNodes then fulfill the request
- They do this by performing the appropriate read and write instructions directly with the client
Essentially, the NameNode manages the metadata side of the client's requests
Then, the DataNodes handle the data itself
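
A simplified read path as a sketch. It assumes the commonly documented HDFS behavior in which the client gets block locations from the NameNode and then fetches the bytes from DataNodes itself. The cluster state, file name, and function names below are invented for illustration.

```python
# Toy cluster state (all names and contents are made up for illustration).
datanodes = {
    "dn1": {"blk_0": b"hello ", "blk_1": b"world"},
    "dn2": {"blk_0": b"hello "},
    "dn3": {"blk_1": b"world"},
}
namenode = {
    "files": {"/greeting.txt": ["blk_0", "blk_1"]},  # file -> ordered block ids
    "locations": {"blk_0": ["dn2", "dn1"],           # block id -> DataNodes
                  "blk_1": ["dn3", "dn1"]},
}


def get_block_locations(path):
    """NameNode side: resolve a path to (block id, DataNodes) pairs from metadata."""
    return [(b, namenode["locations"][b]) for b in namenode["files"][path]]


def read_file(path):
    """Client side: fetch each block from the first DataNode that still has it."""
    chunks = []
    for block_id, holders in get_block_locations(path):
        for dn in holders:
            if block_id in datanodes.get(dn, {}):
                chunks.append(datanodes[dn][block_id])
                break
        else:
            raise IOError(f"no replica available for {block_id}")
    return b"".join(chunks)


content = read_file("/greeting.txt")
```

The NameNode functions never touch block contents. If one DataNode disappears, the client falls back to another replica from the same location list.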
