Partitioning and Bucketing in HIVE

Hello guys, I am back with a new topic in the Big Data environment: Hive.
We all know Hive is a query engine for accessing data on HDFS.
There are two optimization concepts for Hive queries: partitioning and bucketing.

We are going to look at both of them and analyse the differences between these two Hive optimization concepts.

Partitioning:

Partitioning in Hive is often used to distribute load horizontally. It improves performance and organizes the data in a simple, logical way. For example, suppose we are dealing with a large student table and often run queries with WHERE clauses that restrict the results to a particular class or section. To make such queries respond faster, the Hive table can be declared with PARTITIONED BY (class STRING, Section STRING). Partitioning changes how Hive structures the data storage: Hive now creates subdirectories under the main student directory that reflect the partitioning structure, such as .../students/class=FirstYear/Section=Mechanical. If a query is limited to students from class FirstYear, Hive scans only the contents of the class=FirstYear subdirectory under the student directory. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering.

Partitioning is a very useful feature in Hive; however, a design that creates too many partitions may optimize some queries but be detrimental to other important queries. Another drawback of having too many partitions is the large number of Hadoop files and directories that are created unnecessarily, which adds overhead on the NameNode, since it must keep all metadata for the file system in memory.
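To make the directory layout more concrete, here is a small Java sketch of how partition values map to subdirectories and how a filter on a partition column narrows the set of directories to scan. It is only an illustration of the idea, not Hive's actual code: the class PartitionPathSketch, its methods partitionPath and prune, and the relative table directory "students" are all made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hive: mimics the col=value directory layout.
final class PartitionPathSketch {

    // Builds "<tableDir>/<col1>=<val1>/<col2>=<val2>" in partition-column order.
    static String partitionPath(String tableDir, Map<String, String> partitionValues) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> entry : partitionValues.entrySet()) {
            path.append('/').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return path.toString();
    }

    // Keeps only the directories whose path contains the segment "column=value".
    static List<String> prune(List<String> partitionDirs, String column, String value) {
        String segment = "/" + column + "=" + value + "/";
        List<String> kept = new ArrayList<String>();
        for (String dir : partitionDirs) {
            // append "/" so the last path segment can match as well
            if ((dir + "/").contains(segment)) {
                kept.add(dir);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<String, String>();
        values.put("class", "FirstYear");
        values.put("Section", "Mechanical");
        // prints students/class=FirstYear/Section=Mechanical
        System.out.println(partitionPath("students", values));
    }
}
```

A query with WHERE class = 'FirstYear' corresponds to prune(dirs, "class", "FirstYear"): every directory for other classes is skipped without being read.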

Bucketing:

Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table that uses marks as the top-level partition and student_id as the second-level partition ends up with too many small partitions. If we instead bucket the student table and use student_id as the bucketing column, the value of this column is hashed into a user-defined number of buckets. Records with the same student_id are always stored in the same bucket. Assuming the number of distinct student_id values is much greater than the number of buckets, each bucket holds many student_id values. When creating the table you can specify CLUSTERED BY (student_id) INTO XY BUCKETS; where XY is the number of buckets. Bucketing has several advantages. The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed on student_id, Hive can create a logically correct sampling. Bucketing also helps with efficient map-side joins.
Example:
marks=91
    00000_0
    00001_0
    00002_0
    ........
    00010_0
Here marks=91 is the partition, and the files starting with 000 are the buckets inside that partition. Buckets are calculated with a hash function, so rows with the same value in the bucketing column (for example, name=Sandy when bucketing by name) always go into the same bucket.
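The "hash, then take the remainder" idea behind bucket assignment can be sketched in a few lines of Java. This is a simplified illustration under assumptions: the class BucketSketch and the method bucketFor are invented for this post, and Java's own Integer hash stands in for Hive's hash function, which depends on the column type and the Hive version.

```java
// Hypothetical helper, not part of Hive: shows hash(column) MOD (number of buckets).
final class BucketSketch {

    // Returns a bucket index in the range [0, numBuckets).
    static int bucketFor(int studentId, int numBuckets) {
        // assumption: the hash of an int is the int itself (true for java.lang.Integer)
        int hash = Integer.valueOf(studentId).hashCode();
        // clear the sign bit so that negative hashes still give a non-negative index
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        // the same student_id always maps to the same bucket
        System.out.println(bucketFor(101, numBuckets)); // prints 1
        System.out.println(bucketFor(101, numBuckets)); // prints 1
        // a different student_id can share that bucket
        System.out.println(bucketFor(105, numBuckets)); // prints 1
    }
}
```

Because the result depends only on the column value and the fixed bucket count, the number of files stays the same no matter how many distinct student_id values arrive.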

Comparison:

Features | Partition | Buckets
Size | The number of partitions is not fixed, so it fluctuates with the data | The number of buckets is fixed, so it does not fluctuate with the data
Efficiency | May unnecessarily increase the load by creating many directories | Enables more efficient queries
Distribution of data | Distributed according to the condition we describe while creating the partition | hash(column) MOD (number of buckets), so roughly evenly distributed
Query optimization technique | Yes | Yes
Keyword | PARTITIONED BY | CLUSTERED BY
Execution | If a table is partitioned by ID, queries for a single record by ID are very fast, but any other query has to parse a huge number of directories and files, incurring serious overhead | We can optimize joins by bucketing 'similar' IDs, so Hive can minimise the processing steps and reduce the data it needs to parse and compare for join operations

I hope you liked the post. Please comment if you have any questions about it or any good ideas to share with me.



Send Image as binary data and string data via Socket programming in Play framework

Hello friends, today we are going to see a simple example of sending binary data and string data via a WebSocket in the Play framework.

Application.java

It defines the controller for the application. It provides the WebSocket endpoint, which sends simple binary data over the socket.


 

package controllers;

import play.*;
import play.mvc.*;

import views.html.*;
import models.*;

public class Application extends Controller {
   
    // render index page
    public static Result index() {
        return ok(index.render());
    }
   
    // get the ws.js script
    public static Result wsJs() {
        return ok(views.js.ws.render());
    }
   
    // WebSocket interface; the byte[] type parameter means binary messages
    public static WebSocket<byte[]> wsInterface(){
        return new WebSocket<byte[]>(){

            // called when websocket handshake is done
            public void onReady(WebSocket.In<byte[]> in, WebSocket.Out<byte[]> out){
                SimpleChat.start(in, out);
            }
        };
    }
}



SimpleChat.java

It defines the socket listener that sends and receives messages.
To send string data, just replace byte[] with String in both files.

package models;

import play.mvc.*;
import play.libs.F.*;

import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.imageio.ImageIO;


public class SimpleChat{

    // collect all websockets here
    private static List<WebSocket.Out<byte[]>> connections = new ArrayList<WebSocket.Out<byte[]>>();

    public static void start(WebSocket.In<byte[]> in, WebSocket.Out<byte[]> out){

        // image that is sent to each client right after it connects
        File file = new File("C:\\Users\\saganlalp\\Pictures\\e.jpg");
        connections.add(out);

        // broadcast every incoming binary message to all connected clients
        in.onMessage(new Callback<byte[]>(){
            public void invoke(byte[] event){
                SimpleChat.notifyAll(event);
            }
        });

        in.onClose(new Callback0(){
            public void invoke(){
                // nothing to do when a connection closes
            }
        });

        try {
            // re-encode the image as JPEG into an in-memory byte array
            BufferedImage image = ImageIO.read(file);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            ImageIO.write(image, "jpg", baos);
            byte[] byteArray = baos.toByteArray();

            // send the image bytes to the client as one binary message
            out.write(byteArray);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Iterate connection list and write incoming message
    public static void notifyAll(byte[] message){
        for (WebSocket.Out<byte[]> out : connections) {
            out.write(message);
        }
    }
}
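If you want to try the image-to-bytes step on its own, without Play or a socket, a rough sketch like the following can help. It uses only the JDK classes that SimpleChat already relies on (ImageIO, BufferedImage, ByteArrayOutputStream). The class ImageBytesSketch and its methods toBytes and fromBytes are names invented for this example, and the tiny in-memory image stands in for e.jpg.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper: encodes an image to bytes and decodes it again.
final class ImageBytesSketch {

    // Encodes the image in the given format ("jpg", "png", ...) into a byte array.
    static byte[] toBytes(BufferedImage image, String format) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            // ImageIO.write returns false when no writer exists for the format
            if (!ImageIO.write(image, format, baos)) {
                throw new IOException("No image writer found for format " + format);
            }
            return baos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes bytes back into an image, as a receiver of the bytes would.
    static BufferedImage fromBytes(byte[] data) {
        try {
            return ImageIO.read(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // small in-memory image used instead of a file on disk
        BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        byte[] data = toBytes(image, "jpg");
        BufferedImage decoded = fromBytes(data);
        System.out.println(data.length + " bytes, " + decoded.getWidth() + "x" + decoded.getHeight());
    }
}
```

The byte array returned by toBytes is the kind of payload that out.write(byteArray) pushes through the socket in SimpleChat.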



index.scala.html

It is the index page of the application.
 

@main("Small things jump around") {
    <section>
        <h1>Simple chat</h1>
       
        <input type="text" id="socket-input" />
        <div id="socket-messages"></div>
        <script type="text/javascript" charset="utf-8" src="@routes.Application.wsJs()"></script>
    </section>
}




main.scala.html

It is the main layout template into which the page content is loaded.


@(title: String)(content: Html)

<!DOCTYPE html>

<html>
    <head>
        <title>@title</title>
        <link rel="stylesheet" media="screen" href="@routes.Assets.at("stylesheets/main.css")">
        <link rel="shortcut icon" type="image/png" href="@routes.Assets.at("images/favicon.png")">
        <script src="@routes.Assets.at("javascripts/jquery-1.9.0.min.js")" type="text/javascript"></script>
       
    </head>
    <body>
        @content
    </body>
</html>



ws.scala.js

It is the JavaScript file that contains the client-side socket programming.





$(function(){

    // get websocket class, firefox has a different way to get it
    var WS = window['MozWebSocket'] ? window['MozWebSocket'] : WebSocket;
   
    // open pewpew with websocket
    var socket = new WS('@routes.Application.wsInterface().webSocketURL(request)');
    socket.binaryType = "arraybuffer";
    var writeMessages = function(event){
        // binary messages arrive as ArrayBuffer because of the binaryType set above
        if(event.data instanceof ArrayBuffer){
            showBinaryMessage(event);
        }
    }
    function showBinaryMessage(evt){
        // convert the received bytes into a binary string
        var binary = '';
        var bytes = new Uint8Array(evt.data);
        var i;
        for(i = 0; i < bytes.byteLength; i++){
            binary += String.fromCharCode(bytes[i]);
        }

        // base64-encode the binary string and show it as an inline image
        $('#socket-messages').prepend('<img src="data:image/jpeg;base64,' + window.btoa(binary) + '" alt="Red dot" /><br />');
    }
   
    socket.onmessage = writeMessages;
   
    $('#socket-input').keyup(function(event){
        var charCode = (event.which) ? event.which : event.keyCode ;
      
        // if enter (charcode 13) is pushed, send message, then clear input field
        if(charCode === 13){
            socket.send($(this).val());
            $(this).val('');   
        }
    });
});




Place e.jpg within the project directory, or change the path of the image file in the code accordingly.

Then create a new Play project, copy these files into the appropriate directories, and run the project. Your socket programming example is ready.




