Partitioning and Bucketing in HIVE

Hello guys, I have came back with new topic in Big data environment. that is HIVE.
We all know HIVE is query engine tool to access the data on hdfs. 
There are two optimization concepts in HIVE queries Partitioning and Bucketing .

We are going to see both of them and analyse the difference between the HIVE optimizations concepts

Partitioning : 

 Partitioning in hive is often used for distributing load horizontally in hive environment, this has performance benefit, and make the data in simple logical fashion. Example like if we are dealing with large student table and often run queries with WHERE clauses that restrict the results to a particular class or section. For making query to give response faster, Hive table can be PARTITIONED BY (class STRING, Section STRING), Partitioning tables changes how Hive structures the data storage and Hive will now create subdirectories under the main directory of student data reflecting the partitioning structure like . .../students/class=FirstYear/Section=Mechanical. If query limits for student from class FirstYear than it will only scan the contents of subdirectory ‘FirstYear’ under student directory. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning feature is very useful in Hive; however, a design that creates too many partitions may optimize some queries, but be detrimental for other important queries. Other drawback is having too many partitions is the large number of Hadoop files and directories that are created unnecessarily and overhead to NameNode since it must keep all metadata for the file system in memory.

 Bucketing :


Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table using the marks as the top-level partition and the student_id as the second-level partition leads to too many small partitions. Instead, if we bucket the student table and use student_id as the bucketing column, the value of this column will be hashed by a user-defined number into buckets. Records with the same student_id will always be stored in the same bucket. Assuming the number of student_id is much greater than the number of buckets, each bucket will have many student_id. While creating table you can specify like CLUSTERED BY (student_id) INTO XY BUCKETS ; where XY is the number of buckets . Bucketing has several advantages. The number of buckets is fixed so it does not fluctuate with variety of data. If two tables have buckets on student_id, Hive can create a logically correct sampling. Bucketing also aids in doing efficient map-side joins etc.
Example:
1.     marks=91
·         00000_0
·         00001_0
·         00002_0
·         ........
·         00010_0
Here marks=91 is the partition and 000 files are the buckets in each partition. Buckets are calculated based on some hash functions, so rows with name=Sandy will always go in same bucket.

Comparison: 

Features
Partition
Buckets
Size
The number of buckets is not fixed so it does fluctuate with data
The number of buckets is fixed so it does not fluctuate with data
Efficiency
Unnecessary may increase the load by creating many directories.
Enables more efficient queries
Distribution of data
Distributed according to condition we describe while creating partition
Hash(column) MOD(number of buckets) –evenly distributed
Query Optimization technique
Yes
Yes
Keyword
PARTITION
CLUSTERED
Execution
Queries for single itineraries by ID would be very fast but any other query would require to parse a huge amount of directories and files incurring serious overheads
We can optimize joins by bucketing ‘similar’ IDs so Hive can minimise the processing steps, and reduce the data needed to parse and compare for join operations

I suppose you like the post and please comments if you have any queries related to post or if you have any good ideas to share with me.


Enter your email address:  

Delivered by FeedBurner

No comments: