Skip to content
>GLB_
Go back

Understanding How Hive Converts SQL Queries into Hadoop Jobs

When you execute a SQL query in Apache Hive, the query is not directly run on a traditional database. Instead, Hive translates it into a Hadoop job, which is then executed across a distributed system. This blog post will break down the step-by-step process of how Hive converts a simple SQL query into a Hadoop MapReduce job.

1. Executing the Query in Hive

Consider the following HiveQL query:

SELECT * FROM productos;

At first glance, this looks like a standard SQL query, but under the hood, Hive transforms it into a Hadoop job.

2. Query Parsing and Logical Plan Generation

Hive first parses the query to verify its syntax and then generates a logical execution plan. This plan consists of a series of operations required to execute the query efficiently.

Running EXPLAIN SELECT * FROM productos; in Hive provides a detailed breakdown of how the query will be executed.

3. Translating into a MapReduce Job

If Hive is configured to use MapReduce (instead of Tez or Spark), it converts the logical plan into a Hadoop MapReduce job. The job consists of the following components:

Mapper Phase

Since SELECT * only reads data, Hive generates a Mapper-only job without a Reducer. The mapper reads the data from HDFS and passes it along without transformation. A simplified Java equivalent of the Mapper function could be:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class TableScanMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        context.write(null, value);  // Passing data as-is
    }
}

Job Configuration

Hive then prepares a MapReduce job configuration that specifies the input and output locations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HiveTableScanJob {
    public static void main(String&#91;] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Hive SELECT * FROM productos");
        
        job.setJarByClass(HiveTableScanJob.class);
        job.setMapperClass(TableScanMapper.class);
        job.setNumReduceTasks(0); // No Reducer needed

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        TextInputFormat.addInputPath(job, new Path("/user/hive/warehouse/productos"));
        TextOutputFormat.setOutputPath(job, new Path("/user/hive/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

4. Executing the Hadoop Job

Once the job is submitted, Hadoop’s YARN ResourceManager schedules and executes the tasks across multiple nodes.

To monitor the job, you can use:

yarn application -list

For logs:

yarn logs -applicationId <APPLICATION_ID>

5. Returning Results to Hive

After the job finishes:

Conclusion

Understanding how Hive transforms SQL into Hadoop jobs helps in optimizing queries for better performance. Whether using MapReduce, Tez, or Spark, the fundamental process remains the same: Hive parses, optimizes, translates, and executes the query in a distributed manner.

For improved performance, consider switching from MapReduce to Tez or Spark, which provide better query optimization and execution speed.


Share this post:

Previous Post
Handling Schema Changes in a Data Warehouse
Next Post
Delta Lake vs. Traditional Data Lakes: Key Differences and Vendor Options