Hadoop HDFS

Describing Distributed File Systems

  • A file system manages how data is stored on and retrieved from a storage device as files
  • A distributed file system is any file system that:

    • Provides access to files from multiple hosts
    • Provides access to files over a computer network
  • A distributed file system can be managed locally (self-hosted) or remotely (by a provider)
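
For instance, a DFS client can open a file that physically lives on another machine using nothing but a network address and a path. Below is a minimal sketch in Java using the Hadoop client API against HDFS (one of the distributed file systems listed later); the host name namenode.example.com, port 8020, and file path are hypothetical.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RemoteRead {
        public static void main(String[] args) throws Exception {
            // Any host on the network can reach the same file through the same URI.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode.example.com:8020"), new Configuration());
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/shared/notes.txt"))))) {
                System.out.println(reader.readLine());
            }
        }
    }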

Differentiating between File Systems and Databases

  • A file system stores unstructured, unrelated data
  • Databases store structured, related data
  • File systems tend to be more lightweight than databases
  • Because file systems don't impose structure on the data, they carry less overhead
  • On the other hand, databases have the following overhead:

    • Schemas
    • Built-in operations for indexing, searching, etc.
  • Data files in a database are stored in the database's own internal format
  • This format enables querying and other operations specific to that system
  • Data files in a file system are kept in their original, raw format
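
As a small illustration of the raw-format point, the sketch below (plain Java against the local file system) writes bytes to a file and reads back exactly the same bytes; no schema or storage engine sits between the program and the data. The file name and contents are made up.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;

    public class RawFormatDemo {
        public static void main(String[] args) throws Exception {
            Path file = Files.createTempFile("raw", ".csv");
            byte[] original = "id,name\n1,alice\n".getBytes();
            Files.write(file, original);                  // bytes go to disk as-is
            byte[] readBack = Files.readAllBytes(file);   // and come back unchanged
            System.out.println(Arrays.equals(original, readBack)); // prints true
        }
    }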

Examples of Locally Managed DFS

  • HDFS
  • Ceph
  • Lustre

Examples of Remotely Managed DFS

  • HDFS
  • AWS S3
  • GCS
  • Microsoft Azure Blob Storage

Defining HDFS

  • HDFS is a component of the Hadoop ecosystem
  • HDFS assumes that a file can't be changed once it's written
  • HDFS can be used as a standalone DFS
  • Meaning, we can plug different computing engines into HDFS
  • For example, we can use Spark or MapReduce as a computing engine
  • Then, we can use HDFS as our storage (see the sketch after this list)
  • HDFS has the following benefits:

    • Designed to run on low-cost commodity hardware
    • Is highly fault-tolerant
    • Provides high throughput access to application data
    • Suitable for applications with large datasets
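
The sketch below (Java, Hadoop client API) illustrates both points: files are created and written once, and any engine pointed at the same hdfs:// path sees the same bytes. The URI and path are hypothetical; appends are only available on clusters that enable them, and there is no API for editing already-written bytes in place.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnce {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode.example.com:8020"), new Configuration());
            Path path = new Path("/data/events.log");

            // Create and write the file exactly once.
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeBytes("event-1\n");
            }

            // New bytes can be appended at the end, but bytes that were
            // already written can never be modified in place.
            try (FSDataOutputStream out = fs.append(path)) {
                out.writeBytes("event-2\n");
            }
        }
    }

A compute engine would then read the same location, e.g. Spark's spark.read().text("hdfs://namenode.example.com:8020/data/events.log").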

Defining the Goals of HDFS

  • Handles hardware failure
  • Provides streaming access to datasets
  • Handles very large datasets
  • Portable from one platform to another
  • Executes code on the machine on which the data is stored

    • Meaning, we don't need to move the data to the machine where the code lives
    • Moving a small program to the data is cheaper than moving a large dataset to the program (see the sketch below)
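
As a sketch of this goal, the Java snippet below asks HDFS which hosts store each block of a file; this is the metadata a compute engine uses to schedule a task on a machine that already holds the data. The URI and path are hypothetical.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereIsMyData {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode.example.com:8020"), new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/events.log"));

            // Each block reports the hosts holding a replica of it; a compute
            // engine launches its task on (or near) one of those hosts.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset()
                        + " stored on " + String.join(", ", block.getHosts()));
            }
        }
    }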
