Hive

Describing Apache Hive

  • Hive is a data warehouse framework built on top of Hadoop
  • It provides a SQL-like interface, called HQL, to data held in various databases and file systems
  • It was built for HBase and HDFS, but can be used with other storage systems
  • Hive is sometimes referred to as Hive Hadoop
  • This is because it easily integrates with HBase and HDFS
  • Hive is used for:

    • Reducing the need to write complex MapReduce jobs
    • Tracking data that is critical
    • Indexing data to provide fast queries

Motivating the Hive Metastore

  • In Hive, data is stored in HDFS
  • Hive creates definitions for the following:

    • Tables
    • Databases
    • Schemas
    • HQL operations
  • These definitions are stored in a metastore
  • The metastore is separate from the data
  • It can be kept in any supported RDBMS (e.g. MySQL or PostgreSQL)
  • The HQL operations are:

    • SQL-like operations
    • Translated to pre-implemented MapReduce jobs
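
The translation of an HQL operation into a MapReduce job can be sketched in plain Python. This is only an illustration, not Hive's actual generated code: it assumes a hypothetical one-column table named words and mimics what a query such as SELECT word, COUNT(*) FROM words GROUP BY word would roughly turn into. The function names and the sample rows are made up for the example.

```python
from collections import defaultdict

# Hypothetical rows of a one-column table called `words` (an assumption).
rows = ["hive", "pig", "hive", "spark", "hive"]


def map_phase(rows):
    """Emit one (key, 1) pair per row, as a mapper would for COUNT(*) with GROUP BY."""
    return [(word, 1) for word in rows]


def shuffle_phase(pairs):
    """Group the emitted values by key, mimicking the shuffle between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, producing one output row per group."""
    return {key: sum(values) for key, values in grouped.items()}


def run_group_by_count(rows):
    """Chain the three phases, standing in for one pre-implemented job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


result = run_group_by_count(rows)
print(result)
```

The point of the sketch is that the user writes one SQL-like line, while the map, shuffle, and reduce steps are supplied by the framework.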

Describing the Hive Metastore

  • The metastore consists of relational tables
  • These tables contain metadata for objects created in Hive
  • The metadata could store the following:

    • Column names
    • Data types
    • Indexes
    • Comments
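
To make the idea of a metastore built from relational tables concrete, here is a toy version using Python's built-in sqlite3 module. The table names (toy_tables, toy_columns), the column layout, and the HDFS path are all assumptions for illustration; the real Hive metastore schema is larger and differs in its details.

```python
import sqlite3


def build_toy_metastore():
    """Create an in-memory database holding metadata only, never the data itself."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE toy_tables (
            table_id   INTEGER PRIMARY KEY,
            db_name    TEXT NOT NULL,
            table_name TEXT NOT NULL,
            location   TEXT NOT NULL   -- where the actual data lives (hypothetical path)
        );
        CREATE TABLE toy_columns (
            table_id     INTEGER NOT NULL REFERENCES toy_tables(table_id),
            col_position INTEGER NOT NULL,
            column_name  TEXT NOT NULL,
            data_type    TEXT NOT NULL,
            comment      TEXT
        );
        """
    )
    conn.execute(
        "INSERT INTO toy_tables VALUES (?, ?, ?, ?)",
        (1, "default", "page_views", "hdfs:///warehouse/page_views"),
    )
    conn.executemany(
        "INSERT INTO toy_columns VALUES (?, ?, ?, ?, ?)",
        [
            (1, 0, "user_id", "bigint", "visitor identifier"),
            (1, 1, "url", "string", "page that was viewed"),
        ],
    )
    conn.commit()
    return conn


def describe_table(conn, db_name, table_name):
    """Return (column name, data type, comment) rows for one table, in column order."""
    cursor = conn.execute(
        """
        SELECT c.column_name, c.data_type, c.comment
        FROM toy_columns AS c
        JOIN toy_tables AS t ON c.table_id = t.table_id
        WHERE t.db_name = ? AND t.table_name = ?
        ORDER BY c.col_position
        """,
        (db_name, table_name),
    )
    return cursor.fetchall()


conn = build_toy_metastore()
for row in describe_table(conn, "default", "page_views"):
    print(row)
```

Note that the toy database stores only names, types, comments, and a location; the rows of page_views would stay in HDFS.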

Motivating the Comparison between Hive and Spark

  • Consider two types of tools used in Hadoop:

    • Execution engines
    • Query optimizers
  • These are both individual software frameworks
  • An execution engine is required in Hadoop
  • A query optimizer is optional
  • An execution engine processes jobs related to the data
  • A query optimizer optimizes queries before they are processed

    • Sometimes, they optimize queries during processing too
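
The split between the two kinds of tools can be sketched with a toy example. This is a minimal illustration under stated assumptions, not how Hive or Spark are implemented: a query plan is modelled as a list of (operation, argument) steps, the optimizer applies a single made-up rule (run filters before sorts), and the execution engine simply runs whatever plan it is given.

```python
def optimize(plan):
    """Toy rule-based optimizer: move every filter ahead of the sort steps.

    This is only safe because the toy plan language has just "filter" and
    "sort" steps; filtering first leaves fewer rows to sort and does not
    change the final result.
    """
    filters = [step for step in plan if step[0] == "filter"]
    others = [step for step in plan if step[0] != "filter"]
    return filters + others


def execute(plan, rows):
    """Toy execution engine: run the steps in the order given, with no rewriting."""
    for operation, argument in plan:
        if operation == "filter":
            rows = [row for row in rows if argument(row)]
        elif operation == "sort":
            rows = sorted(rows, key=argument)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return rows


# Hypothetical data and plan, made up for the example.
rows = [
    {"url": "/a", "views": 5},
    {"url": "/b", "views": 40},
    {"url": "/c", "views": 12},
]
plan = [
    ("sort", lambda row: row["views"]),
    ("filter", lambda row: row["views"] > 10),
]

optimized_plan = optimize(plan)
print([step[0] for step in optimized_plan])
print(execute(optimized_plan, rows))
```

The engine works with or without the optimizer, which mirrors the point above: execution is required, optimization is optional.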

Comparing Apache Hive and Spark

  • Hive is mostly referred to as a query optimizer

    • This is because Hive is essentially a metastore
  • Spark is mostly referred to as an execution engine
  • Hive uses MapReduce as its execution engine by default
  • Spark uses its own execution engine

    • It is an alternative to MapReduce
  • Spark offers query optimization as well

    • It uses the Catalyst optimizer
    • Catalyst uses rule-based and cost-based optimization
  • Spark and Hive can be used together
  • Specifically, we can include:

    • HDFS as our storage layer
    • Hive's metastore for the table definitions used during query planning
    • Spark's query optimization
    • Either Spark or MapReduce as an execution engine

Comparing Apache Hive and Pig

  • Pig uses a procedural data-flow language (Pig Latin)
  • Hive uses a declarative, SQL-like language (HQL)
  • Pig operates on the client side of a cluster
  • Hive involves defining tables beforehand
  • Pig doesn't have a dedicated metastore
  • Hive has a dedicated metastore
