Hello guys, I have came back with new topic in Big data environment. that is HIVE.
We all know HIVE is query engine tool to access the data on hdfs.
There are two optimization concepts in HIVE queries Partitioning and Bucketing .
We are going to see both of them and analyse the difference between the HIVE optimizations concepts
Partitioning :
Partitioning
in hive is often used for distributing load horizontally in hive environment,
this has performance benefit, and make the data in simple logical fashion.
Example like if we are dealing with large student table and often run queries with WHERE
clauses that restrict the results to a particular class or section. For making
query to give response faster, Hive table can be PARTITIONED BY (class STRING, Section
STRING), Partitioning tables changes how Hive structures the data storage and
Hive will now create subdirectories under the main directory of student data reflecting
the partitioning structure like . .../students/class=FirstYear/Section=Mechanical. If query limits
for student from class FirstYear
than it will only scan the contents of subdirectory ‘FirstYear’ under student
directory. This can dramatically improve query performance, but
only if the partitioning scheme reflects common filtering. Partitioning feature
is very useful in Hive; however, a design that creates too many partitions may
optimize some queries, but be detrimental for other important queries. Other
drawback is having too many partitions is the large number of Hadoop files and
directories that are created unnecessarily and overhead to NameNode since it
must keep all metadata for the file system in memory.
Bucketing :
Bucketing is another technique for decomposing data sets into
more manageable parts. For example, suppose a table using the marks as the
top-level partition and the student_id as the second-level partition leads to
too many small partitions. Instead, if we bucket the student table and use student_id
as the bucketing column, the value of this column will be hashed by a
user-defined number into buckets. Records with the same student_id will always
be stored in the same bucket. Assuming the number of student_id is much greater
than the number of buckets, each bucket will have many student_id. While
creating table you can specify like CLUSTERED BY (student_id) INTO XY BUCKETS ;
where XY is the number of buckets . Bucketing has several advantages. The number
of buckets is fixed so it does not fluctuate with variety of data. If two
tables have buckets on student_id, Hive can create a logically correct
sampling. Bucketing also aids in doing efficient map-side joins etc.
Example:
1. marks=91
·
00000_0
·
00001_0
·
00002_0
·
........
·
00010_0
Here marks=91 is the partition and 000 files are the buckets in each partition.
Buckets are calculated based on some hash functions, so rows with name=Sandy will always go in same bucket.
Comparison:
Features
|
Partition
|
Buckets
|
Size
|
The number of buckets is not fixed so it does fluctuate with data
|
The number of buckets is fixed so it does not fluctuate with data
|
Efficiency
|
Unnecessary may increase the load by creating many directories.
|
Enables more efficient queries
|
Distribution of data
|
Distributed according to condition we describe while creating
partition
|
Hash(column) MOD(number of buckets) –evenly distributed
|
Query Optimization technique
|
Yes
|
Yes
|
Keyword
|
PARTITION
|
CLUSTERED
|
Execution
|
Queries for single itineraries by ID would be very fast but any other
query would require to parse a huge amount of directories and files incurring
serious overheads
|
We can optimize joins by bucketing ‘similar’ IDs so Hive can minimise
the processing steps, and reduce the data needed to parse and compare for
join operations
|
I suppose you like the post and please comments if you have any queries related to post or if you have any good ideas to share with me.
Enter your email address:
Delivered by FeedBurner
No comments:
Post a Comment