Describing Apache Hive
- Hive is a data warehouse framework built on top of Hadoop
- It provides a SQL-like interface (HiveQL, or HQL) for querying data in various databases and file systems
- It was built for HBase and HDFS, but can be used with other storage systems
- Hive is sometimes referred to as Hive Hadoop
- This is because it integrates easily with HBase and HDFS
- Hive is used for:
  - Avoiding the need to write complex MapReduce jobs by hand
  - Tracking data that is critical
  - Indexing data to speed up queries
Motivating the Hive Metastore
- In Hive, data is stored in HDFS
- Hive creates definitions for the following:
  - Tables
  - Databases
  - Schemas
  - HQL operations
- These definitions are stored in a metastore
- The metastore is separate from the data
- It can be any supported relational database (for example, MySQL or PostgreSQL)
- The HQL operations are:
  - SQL-like operations
  - Translated to pre-implemented MapReduce jobs
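
To make the translation step concrete, here is a minimal Python sketch of how a SQL-like aggregation such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` could be expressed as map, shuffle, and reduce phases. The table, its columns, and all function names are hypothetical, and this is not the plan Hive actually generates; it only illustrates the general idea.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept)
ROWS = [
    ("ana", "sales"),
    ("bo", "eng"),
    ("cy", "eng"),
    ("di", "sales"),
    ("ed", "eng"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _name, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    """Chain the three phases to answer the GROUP BY query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


print(count_by_dept(ROWS))
```

The user only writes the one-line query; the three phases are what they would otherwise have to implement by hand.
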
Describing the Hive Metastore
- The metastore consists of relational tables
- These tables contain metadata for objects created in Hive
- The metadata can include the following:
  - Column names
  - Data types
  - Indexes
  - Comments
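
As a rough illustration of metadata living in relational tables, the sketch below uses an in-memory SQLite database as a stand-in for the metastore's RDBMS. The table names (`tbls`, `cols`), their columns, and the HDFS path are assumptions made for this example; the real Hive metastore schema is considerably more elaborate.

```python
import sqlite3

# In-memory SQLite stands in for the metastore database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbls (
        tbl_id   INTEGER PRIMARY KEY,
        db_name  TEXT NOT NULL,
        tbl_name TEXT NOT NULL,
        location TEXT NOT NULL
    );
    CREATE TABLE cols (
        tbl_id    INTEGER NOT NULL REFERENCES tbls(tbl_id),
        col_name  TEXT NOT NULL,
        type_name TEXT NOT NULL,
        comment   TEXT,
        position  INTEGER NOT NULL
    );
""")


def register_table(db_name, tbl_name, location, columns):
    """Record a table definition; only metadata is stored, never the data itself."""
    cur = conn.execute(
        "INSERT INTO tbls (db_name, tbl_name, location) VALUES (?, ?, ?)",
        (db_name, tbl_name, location),
    )
    tbl_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO cols VALUES (?, ?, ?, ?, ?)",
        [
            (tbl_id, name, type_name, comment, pos)
            for pos, (name, type_name, comment) in enumerate(columns)
        ],
    )
    return tbl_id


def describe_table(db_name, tbl_name):
    """Look up column names, types, and comments for one table."""
    rows = conn.execute(
        """
        SELECT c.col_name, c.type_name, c.comment
        FROM cols AS c JOIN tbls AS t ON c.tbl_id = t.tbl_id
        WHERE t.db_name = ? AND t.tbl_name = ?
        ORDER BY c.position
        """,
        (db_name, tbl_name),
    )
    return rows.fetchall()


# Hypothetical table whose data files would live in HDFS.
register_table(
    "default",
    "employees",
    "hdfs:///warehouse/employees",
    [("name", "string", "employee name"), ("dept", "string", None)],
)
print(describe_table("default", "employees"))
```

Note that the `location` column only points at the data; dropping these rows would remove the definition without touching the files.
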
Motivating the Comparison between Hive and Spark
- Consider two types of tools used in Hadoop:
  - Execution engines
  - Query optimizers
- These are both individual software frameworks
- An execution engine is required in Hadoop
- A query optimizer is optional
- An execution engine processes jobs related to the data
- A query optimizer optimizes queries before they are processed
  - Sometimes, it optimizes queries during processing too
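
One way to picture optimizing a query before it is processed is a rewrite rule applied to a query plan. The sketch below is a toy example, not the behavior of any particular optimizer: the plan node classes (`Scan`, `Project`, `Filter`), the sample table, and the single rule are all hypothetical. The rule moves a filter below a projection so that fewer rows flow through later steps.

```python
from dataclasses import dataclass

# Hypothetical sample data for the toy executor.
TABLES = {
    "employees": [
        {"name": "ana", "dept": "sales", "age": 31},
        {"name": "bo", "dept": "eng", "age": 45},
        {"name": "cy", "dept": "eng", "age": 28},
    ]
}


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


@dataclass(frozen=True)
class Filter:
    column: str    # column the predicate reads
    value: object  # keep rows where column == value
    child: object


def push_filter_down(plan):
    """Rewrite Filter(Project(x)) into Project(Filter(x)) when the column is projected."""
    if isinstance(plan, Filter):
        child = push_filter_down(plan.child)
        if isinstance(child, Project) and plan.column in child.columns:
            pushed = Filter(plan.column, plan.value, child.child)
            return Project(child.columns, push_filter_down(pushed))
        return Filter(plan.column, plan.value, child)
    if isinstance(plan, Project):
        return Project(plan.columns, push_filter_down(plan.child))
    return plan


def execute(plan):
    """Toy executor, used only to check that the rewrite preserves results."""
    if isinstance(plan, Scan):
        return [dict(row) for row in TABLES[plan.table]]
    if isinstance(plan, Project):
        return [{c: row[c] for c in plan.columns} for row in execute(plan.child)]
    if isinstance(plan, Filter):
        return [row for row in execute(plan.child) if row[plan.column] == plan.value]
    raise TypeError(f"unknown plan node: {plan!r}")


plan = Filter("dept", "eng", Project(("name", "dept"), Scan("employees")))
optimized = push_filter_down(plan)
print(optimized)
```

The execution engine then runs `optimized` instead of `plan`; both return the same rows, but the rewritten plan filters first.
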
Comparing Apache Hive and Spark
- Hive is mostly referred to as a query optimizer
  - This is because Hive is essentially a metastore
- Spark is mostly referred to as an execution engine
- Hive uses MapReduce as its execution engine by default, though other engines such as Tez can be configured
- Spark uses its own execution engine
  - It is an alternative to MapReduce
- Spark offers query optimization as well
  - It uses the Catalyst optimizer
  - Catalyst uses rule-based and cost-based optimization
- Spark and Hive can be used together
- Specifically, we can include:
  - HDFS as our storage layer
  - Hive's metastore for query optimization
  - Spark's query optimization
  - Either Spark or MapReduce as an execution engine
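
To suggest how stored table statistics can feed cost-based optimization, the sketch below picks a join strategy from table sizes. The statistics, the size threshold, and the function name are all hypothetical; real optimizers weigh many more factors than a single size comparison.

```python
# Hypothetical table sizes in bytes, of the kind a metastore could hold.
TABLE_STATS = {
    "orders": 5_000_000_000,
    "customers": 40_000_000,
    "countries": 20_000,
}

# Assumed threshold for this sketch: tables at or below it are cheap to copy to every worker.
BROADCAST_LIMIT = 10_000_000


def choose_join_strategy(left, right, stats=TABLE_STATS, limit=BROADCAST_LIMIT):
    """Pick a join strategy by comparing the estimated size of the smaller table to a limit."""
    smaller = min((left, right), key=lambda table: stats[table])
    if stats[smaller] <= limit:
        # Small enough to ship to every worker, which avoids shuffling the large table.
        return ("broadcast", smaller)
    # Both sides are large, so both must be repartitioned on the join key.
    return ("shuffle", None)


print(choose_join_strategy("orders", "countries"))
print(choose_join_strategy("orders", "customers"))
```

The decision depends only on metadata, so it can be made before any data is read by the execution engine.
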
Comparing Apache Hive and Pig
- Pig uses a procedural data flow language (Pig Latin)
- Hive uses a declarative, SQL-like language (HQL)
- Pig operates on the client side of a cluster
- Hive involves defining tables beforehand
- Pig doesn't have a dedicated metastore
- Hive has a dedicated metastore
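
The difference between a procedural and a SQL-like style can be sketched in Python. Neither function below is Pig Latin or HQL: the first spells out each step of the data flow in order, while the second defines a table up front and states the desired result as a query (SQLite is used here purely as a stand-in). The sample rows and function names are hypothetical.

```python
import sqlite3

# Hypothetical rows: (name, dept, age)
ROWS = [
    ("ana", "sales", 31),
    ("bo", "eng", 45),
    ("cy", "eng", 28),
    ("di", "sales", 39),
]


def procedural_style(rows):
    """Describe how to compute the result, one step at a time."""
    older = [row for row in rows if row[2] >= 30]    # step 1: filter
    counts = {}
    for _name, dept, _age in older:                  # step 2: group
        counts[dept] = counts.get(dept, 0) + 1       # step 3: count
    return sorted(counts.items())                    # step 4: order


def declarative_style(rows):
    """Define the table first, then describe what result is wanted."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, age INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT dept, COUNT(*) FROM employees "
        "WHERE age >= 30 GROUP BY dept ORDER BY dept"
    ).fetchall()
    conn.close()
    return result


print(procedural_style(ROWS))
print(declarative_style(ROWS))
```

Both functions return the same counts; the declarative version leaves the order of operations to the query engine.
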