Data Science

A file system is the method an operating system uses to organize, store, and retrieve files on a storage device
A distributed file system is any file system that:
- Provides access to files from multiple hosts
- Provides access to files over a computer network
A distributed file system can be managed locally or remotely

A file system stores unstructured, unrelated data
Databases store structured, related data
Databases carry more overhead than file systems
File systems tend to be more lightweight because they impose no structure on the data they store
Databases, on the other hand, carry the following overhead:
- Schemas
- Built-in operations for indexing, searching, etc.
Data files in a database are stored in a format specific to that database system
This format enables querying and other operations specific to that system
Data files in a file system are kept in their original, raw format
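
The contrast above can be sketched in a few lines of Python. This is only an illustration under assumed names: the `people` table, the `people.csv` file, and both helper functions are hypothetical, and SQLite stands in for "a database" in general. The raw file is written exactly as given and must be scanned by hand, while the database needs a schema up front and answers the same question with a query.

```python
import os
import sqlite3
import tempfile

# Hypothetical sample data used by both approaches.
records = [("alice", 30), ("bob", 25), ("carol", 35)]


def find_in_file(path, name):
    """Scan a raw text file line by line; the file system knows nothing about its layout."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            person, age = line.rstrip("\n").split(",")
            if person == name:
                return int(age)
    return None


def find_in_db(conn, name):
    """Ask the database; the schema and primary-key index do the searching."""
    row = conn.execute(
        "SELECT age FROM people WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


# File system: the data is stored in its original, raw form.
raw_path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(raw_path, "w", encoding="utf-8") as f:
    for person, age in records:
        f.write(f"{person},{age}\n")

# Database: a schema must be declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT PRIMARY KEY, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", records)
conn.commit()

age_from_file = find_in_file(raw_path, "bob")
age_from_db = find_in_db(conn, "bob")
print(age_from_file, age_from_db)
```

Both lookups return the same value. The difference is where the work happens: the file version carries its own parsing and search logic, while the database version pays for the schema and index once and then reuses them.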

Handles hardware failure
Provides streaming access to datasets
Handles very large datasets
Portable from one platform to another
Executes code on the machine on which the data is stored
- This means we don't need to move the data to the machine where the code lives
- Moving a small program to the machine that stores the large dataset makes more sense than moving the dataset
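
The last point, running code where the data is stored, can be sketched as a small in-memory simulation. Nothing here is a real distributed file system API: `DataNode`, `run_local`, and `count_errors` are hypothetical names, and each "node" is just a Python object holding a few lines of a log. The idea shown is that a small function travels to each node and only a small result travels back.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    """Hypothetical stand-in for a machine that stores one part of a dataset."""
    name: str
    block: list

    def run_local(self, task):
        """Run `task` on this node's own data and return only its small result."""
        return task(self.block)


def count_errors(lines):
    """The 'small code' that is shipped to every node."""
    return sum(1 for line in lines if "ERROR" in line)


# Each node holds its own slice of a (pretend) large log.
nodes = [
    DataNode("node-1", ["INFO start", "ERROR disk", "INFO ok"]),
    DataNode("node-2", ["ERROR net", "ERROR disk"]),
    DataNode("node-3", ["INFO ok"]),
]

# Only one integer per node crosses the "network", never the log lines themselves.
partial_counts = [node.run_local(count_errors) for node in nodes]
total_errors = sum(partial_counts)
print(partial_counts, total_errors)
```

In a real cluster the blocks would be far larger than the function, which is why sending the function to the data is the cheaper direction.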

ETL and ELT

HDFS Architecture

Hadoop HDFS