Installing Hadoop in Windows With Eclipse

Introduction

Hadoop is a powerful framework for automatic parallelization of computing tasks. Unfortunately programming for it poses certain challenges. It is really hard to understand and debug Hadoop programs. One way to make it a little easier is to have a simplified version of the Hadoop cluster that runs locally on the developer's machine. This tutorial describes how to set up such a cluster on a computer running Microsoft Windows. It also describes how to integrate this cluster with Eclipse, a prime Java development environment.

Prerequisites 

The required software that needs to be installed is:
  • Cygwin (to provide a Unix-like environment, including SSH, on Windows)
  • Java JDK (with JAVA_HOME set)
  • Hadoop
  • Eclipse (with the Hadoop Map/Reduce plugin)
The following steps create the Hadoop environment.

Installing Hadoop

This is a detailed step-by-step guide for installing Hadoop on Windows, Linux, or Mac OS X. It is based on Hadoop 1.0.0, the current and first official stable version, which derives from the 0.20 branch (note that there was also a 0.21.0 version).
Installing Hadoop on Linux / Mac OS X is pretty straightforward. However, getting it to run on Windows can be a bit tricky. You would probably not run Hadoop on Windows in a production environment, but it can be convenient as a development environment. If you are using Linux / Mac OS X, just skip the Windows-specific information.

Windows installation

Hadoop can be installed on Windows using Cygwin (not intended for production environments), but there are several Cygwin installation and configuration issues to be aware of.

Windows: Download and install Cygwin

Cygwin is an implementation of a set of Linux commands and applications for Windows. Download the web installer from: http://cygwin.com/setup.exe and run it.
The installer will request some information before installing:
  1. Installation method. Select “Install from Internet”.
  2. Root Directory. The default is c:\cygwin. Accept this directory.
  3. Local Package Directory (the directory where install files will be downloaded). The default is c:\cygwin-packages. Accept this directory.
  4. Connection and download site.
  5. A list of available packages will be displayed. The following packages are not selected by default, so make sure to include them:
    • openssh
    • openssl
    • tcp_wrappers
    • diffutils
    If several options are listed (e.g., for openssl), include them all.
  6. Upon installation completion, it will create a Cygwin icon in the Desktop and/or Start menu. Click it to open a Cygwin window.

Next, Cygwin needs some configuration before it can be used with Hadoop.

Configuring SSH on Windows

Hadoop requires SSH (Secure SHell) to be running. To configure it, open a Cygwin window and type:
ssh-host-config
Use the following installation options:
  • Should privilege separation be used? (yes/no) no
  • Do you want to install sshd as a service? yes
  • Enter the value of CYGWIN for the daemon: [] ntsec
  • If prompted for an account name, specify cyg_server, with a password you'll remember.
For example:
$ ssh-host-config

*** Info: Generating /etc/ssh_host_key
*** Info: Generating /etc/ssh_host_rsa_key
*** Info: Generating /etc/ssh_host_dsa_key
*** Info: Generating /etc/ssh_host_ecdsa_key
*** Info: Creating default /etc/ssh_config file
*** Info: Creating default /etc/sshd_config file
*** Info: Privilege separation is set to yes by default since OpenSSH 3.3.
*** Info: However, this requires a non-privileged account called 'sshd'.
*** Info: For more info on privilege separation read /usr/share/doc/openssh/README.privsep.
*** Query: Should privilege separation be used? (yes/no) no
*** Info: Updating /etc/sshd_config file

*** Query: Do you want to install sshd as a service?
*** Query: (Say "no" if it is already installed as a service) (yes/no) yes
*** Query: Enter the value of CYGWIN for the daemon: [] ntsec
*** Info: On Windows Server 2003, Windows Vista, and above, the
*** Info: SYSTEM account cannot setuid to other users -- a capability
*** Info: sshd requires.  You need to have or to create a privileged
*** Info: account.  This script will help you do so.

*** Info: You appear to be running Windows XP 64bit, Windows 2003 Server,
*** Info: or later.  On these systems, it's not possible to use the LocalSystem
*** Info: account for services that can change the user id without an
*** Info: explicit password (such as passwordless logins [e.g. public key
*** Info: authentication] via sshd).

*** Info: If you want to enable that functionality, it's required to create
*** Info: a new account with special privileges (unless a similar account
*** Info: already exists). This account is then used to run these special
*** Info: servers.

*** Info: Note that creating a new user requires that the current account
*** Info: have Administrator privileges itself.

*** Info: No privileged account could be found.

*** Info: This script plans to use 'cyg_server'.
*** Info: 'cyg_server' will only be used by registered services.
*** Query: Do you want to use a different name? (yes/no) no

*** Query: Create new privileged user account 'cyg_server'? (yes/no) yes
*** Info: Please enter a password for new user cyg_server.  Please be sure
*** Info: that this password matches the password rules given on your system.
*** Info: Entering no password will exit the configuration.
*** Query: Please enter the password:
*** Query: Reenter: Enter password

*** Info: User 'cyg_server' has been created with password '####'.
*** Info: If you change the password, please remember also to change the
*** Info: password for the installed services which use (or will soon use)
*** Info: the 'cyg_server' account.

*** Info: Also keep in mind that the user 'cyg_server' needs read permissions
*** Info: on all users' relevant files for the services running as 'cyg_server'.
*** Info: In particular, for the sshd server all users' .ssh/authorized_keys
*** Info: files must have appropriate permissions to allow public key
*** Info: authentication. (Re-)running ssh-user-config for each user will set
*** Info: these permissions correctly. [Similar restrictions apply, for
*** Info: instance, for .rhosts files if the rshd server is running, etc].

*** Info: The sshd service has been installed under the 'cyg_server'
*** Info: account.  To start the service now, call `net start sshd' or
*** Info: `cygrunsrv -S sshd'.  Otherwise, it will start automatically
*** Info: after the next reboot.

*** Info: Host configuration finished. Have fun!
The installation script creates:
  • configuration files:
    • /etc/ssh_config
    • /etc/ssh_host_dsa_key
    • /etc/ssh_host_ecdsa_key
    • /etc/ssh_host_key
    • /etc/ssh_host_rsa_key
    • /etc/sshd_config
  • the cyg_server privileged account.
  • sshd Windows service, using the specified account and password, and listed under the name CYGWIN sshd.
IMPORTANT: Do not run ssh-host-config again without first removing the existing files and account. The script changes access permissions on the configuration files so that they can only be accessed by the ssh services. If the sshd service, configuration files and account are not created together, the script fails to configure the file permissions and no error is reported.

Cleaning up SSH

If you run into any issue, delete the above 6 files, remove the created service using:
cygrunsrv -R sshd
and start over.
You should now be able to start the sshd service and log in using your password. However, in order to run Hadoop you need to create a server key, so that you can establish an ssh session without specifying a password. To do this, type
ssh-keygen
and accept all default options (no passphrase).
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/AccountName/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/AccountName/.ssh/id_rsa.
Your public key has been saved in /home/AccountName/.ssh/id_rsa.pub.
The key fingerprint is:
9b:51:11:ea:c4:a4:72:fe:70:e7:dd:f1:ea:34:ac:0f AccountName@ServerName
The key's randomart image is:
+--[ RSA 2048]----+
|        . o.     |
|       + . .     |
|    . o + .      |
|     + o .       |
|      o S .   .  |
|       + * . o o |
|        + . E = .|
|             + o |
|            .o+  |
+-----------------+
Copy the generated RSA public key into the authorized_keys file, to allow logging in without a password.
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
Try connecting locally:
ssh localhost
You should be able to connect without specifying a password.
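If the connection is refused, make sure the sshd service installed above (listed as CYGWIN sshd) is actually running; as noted in the ssh-host-config output, it can be started manually with either of:
net start sshd
cygrunsrv -S sshd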

Install the latest Java version (JDK) and set the JAVA_HOME path as a system environment variable.
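As a minimal sketch (the JDK path below is only an example; use your actual install location, ideally one without spaces to avoid Cygwin quoting problems), JAVA_HOME can also be exported in Hadoop's conf/hadoop-env.sh so that the Hadoop scripts find Java:
# conf/hadoop-env.sh  (example path, adjust to your JDK installation)
export JAVA_HOME=/cygdrive/c/Java/jdk1.6.0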

Now for the Hadoop part:

Download Hadoop

The current Hadoop version is 1.0.0. Hadoop is organized as 3 projects:
  • Common: Common functionality to all projects (logging, utilities, etc).
  • HDFS: Hadoop Distributed File System.
  • MapReduce: Map-Reduce implementation. It allows performing distributed queries on the distributed file system. Explained later.
They are downloaded together from http://hadoop.apache.org/ as a single .tar.gz / .rpm / .deb file.
Unpack Hadoop to any directory. The recommended install directory is /usr/local/hadoop-1.0.0, but you can use other directories.
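As an illustration (assuming the hadoop-1.0.0.tar.gz archive has already been downloaded to /usr/local; adjust paths to your own setup), unpacking looks like this:
cd /usr/local
tar xzf hadoop-1.0.0.tar.gz
cd hadoop-1.0.0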

Configure Hadoop

There are 3 basic configuration options for Hadoop:
  • Local (Standalone) Mode: Everything runs in a single Java process on the local file system; this is the default.
  • Pseudo-Distributed Mode: Services run in a single node, but as separate Java processes.
  • Fully-Distributed Mode: Real distributed environment.
Hadoop configuration is stored in XML files located in the conf/ directory. They all share the same key-value format, stored as a sequence of property entries:
  <configuration>
    <property>
      <name>Property name</name>
      <value>Property value</value>
    </property>
  </configuration>
Pseudo-Distributed Mode is the ideal development mode. Minimum configuration files for pseudo-distributed mode are shown below:
  • conf/core-site.xml:
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
  • conf/hdfs-site.xml:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
  • conf/mapred-site.xml:
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>
If no paths are specified, Hadoop temporary and data files are placed in the system tmp directory. So any installation should begin by defining the tmp and HDFS directories, as shown below:
  • conf/core-site.xml:
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/hadoop-${user.name}</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
  • conf/hdfs-site.xml:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>/home/${user.name}/hdfs/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/${user.name}/hdfs/data</value>
      </property>
    </configuration>
  • conf/mapred-site.xml:
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>
Under Windows, specify paths using the full URI format. E.g.:
  <property>
    <name>dfs.name.dir</name>
    <value>file:///c:/hdfs/name</value>
  </property>

Start hadoop

Format NameNode

Before starting Hadoop, you have to format the NameNode. This is the node that holds the file system structure (metadata). To format the NameNode run:
cd /usr/local/hadoop-1.0.0
./bin/hadoop namenode -format
Several files will be created under the directory defined for the configuration key dfs.name.dir.

Start HDFS

bin/start-dfs.sh
Check HDFS is running by browsing to: http://localhost:50070/.
A webpage should be displayed with DFS information, where you can view and browse the directory structure.
If you run into any issue, check log files under hadoop-1.0.0/logs/ for errors.
You can also browse the file system using bin/hadoop fs -ls. Type bin/hadoop fs for the complete set of commands.
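For example, a quick sanity check could look like this (the /test directory name below is only an illustration):
bin/hadoop fs -mkdir /test     # create a directory in HDFS
bin/hadoop fs -ls /            # list the HDFS root; /test should now appear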
Under Mac OS X you might get an “Unable to load realm info from SCDynamicStore” error. If you run into this issue, add the following line to conf/hadoop-env.sh:
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

Start MapReduce (JobTracker):

bin/start-mapred.sh
Check that the JobTracker has started by browsing to http://localhost:50030/.
A page with scheduled jobs should be displayed.
Check hadoop-1.0.0/logs/ for errors.
Check HDFS and the JobTracker by opening http://localhost:50070/ and http://localhost:50030/ in your browser.

Here are the steps to start all Hadoop components individually:

Start the local hadoop cluster


  1. Start the namenode in the first window by executing
    cd hadoop-0.19.1
    bin/hadoop namenode
  2. Start the secondary namenode in the second window by executing
    cd hadoop-0.19.1
    bin/hadoop secondarynamenode
  3. Start the job tracker in the third window by executing
    cd hadoop-0.19.1
    bin/hadoop jobtracker
  4. Start the data node in the fourth window by executing
    cd hadoop-0.19.1
    bin/hadoop datanode
  5. Start the task tracker in the fifth window by executing
    cd hadoop-0.19.1
    bin/hadoop tasktracker
  6. Now you should have an operational Hadoop cluster. If everything went fine, your screen should look like the image below.
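As an optional sanity check (a suggestion, assuming the JDK's jps tool is on your PATH), you can list the running Java processes in a new Cygwin window; the daemons started above should appear:
jps
# Typical output (process IDs will differ): NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker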

Now for the important step that this post is really about:

Setup Hadoop Location in Eclipse

  1. Launch the Eclipse environment.
  2. Open the Map/Reduce perspective by clicking on the open perspective icon, selecting "Other" from the menu, and then selecting "Map/Reduce" from the list of perspectives.
  3. After you have switched to the Map/Reduce perspective, select the Map/Reduce Locations tab located at the bottom of your Eclipse window. Then right-click on the blank space in that tab and select "New Hadoop location..." from the context menu.
  4. Fill in the following items, as shown on the figure above.
    • Location Name -- localhost
    • Map/Reduce Master
      • Host -- localhost
      • Port -- 9101
    • DFS Master
      • Check "Use M/R Master Host"
      • Port -- 9100
    • User name -- User
    Then press the Finish button.
  5. After you closed the Hadoop location settings dialog you should see a new location appearing in the "Map/Reduce Locations" tab.
  6. In the Project Explorer tab on the left-hand side of the Eclipse window, find the DFS Locations item. Open it up using the "+" icon on its left; inside it you should see the localhost location reference with the blue elephant icon. Keep opening up the items until you can browse the HDFS directory structure.

Upload data to HDFS

  1. Open a new CYGWIN command window.
  2. Execute the following commands in the new CYGWIN window as shown on the image above.
    cd hadoop-0.19.1
    bin/hadoop fs -mkdir In
    bin/hadoop fs -put *.txt In
    When the last of the above commands starts executing, you should see some activity happening in the rest of the Hadoop windows, as shown on the image below.
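    To verify that the upload worked, you can list the new directory (a quick check; In is the directory created by the commands above):
    bin/hadoop fs -ls In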

Creating and configuring a Hadoop Eclipse project

  1. Launch Eclipse.
  2. Right-click on the blank space in the Project Explorer window and select New -> Project... to create a new project.
  3. Select Map/Reduce Project from the list of project types.
For reference, you can follow the links below:
  1. Hadoop on Windows With Eclipse
  2. Hadoop on Windows With Eclipse
  3. Video for Hadoop on Windows
  4. Installation guide for Cygwin and java and SSH and Hadoop Configuration


