Introduction
Hadoop is a powerful framework for automatic parallelization of computing tasks. Unfortunately, programming for it poses certain challenges: Hadoop programs are hard to understand and debug. One way to make this a little easier is to have a simplified version of a Hadoop cluster that runs locally on the developer's machine. This tutorial describes how to set up such a cluster on a computer running Microsoft Windows. It also describes how to integrate this cluster with Eclipse, a popular Java development environment.
Prerequisites
The required software that needs to be installed: Cygwin, the Java JDK, Hadoop, and Eclipse.
The following steps will be used to create the Hadoop environment.
Installing Cygwin
Cygwin is an implementation of a set of Linux commands and applications for Windows. Download the web installer from http://cygwin.com/setup.exe and run it. The installer will request some information before installing:
- Installation method. Select "Install from Internet".
- Root Directory. The default is c:\cygwin. Accept this directory.
- Local Package Directory (the directory where install files will be downloaded). The default is c:\cygwin-packages. Accept this directory.
- Connection and download site. The defaults are usually fine; pick a nearby mirror if you prefer.
- A list of available packages will be displayed. The following packages are not selected by default, so make sure to include them:
- openssh
- openssl
- tcp_wrappers
- diffutils
- Upon installation completion, a Cygwin icon is created on the Desktop and/or Start menu. Click it to open a Cygwin window.
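To double-check that the extra packages made it in, you can query them from the Cygwin prompt. This is just a quick sketch using Cygwin's bundled cygcheck tool:
# Shows each named package with its version if it is installed.
cygcheck -c openssh openssl tcp_wrappers diffutils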
Installing Hadoop
This is a detailed step-by-step guide for installing Hadoop on Windows, Linux or Mac. It is based on Hadoop 1.0.0, the current and first official stable version (which derives from the 0.20 branch; note that there was also a 0.21.0 version). Installing Hadoop on Linux/Mac is pretty straightforward; getting it to run on Windows, however, can be a bit tricky. You would probably not run Hadoop on Windows in a production environment, but it can be convenient as a development environment. If you are using Linux/Mac, just skip the Windows-specific information.
Windows installation
Hadoop can be installed on Windows using Cygwin (not intended for production environments), but there are several Cygwin installation and configuration issues.
Windows: Download and install Cygwin
If you have not installed Cygwin yet, follow the steps in the "Installing Cygwin" section above before continuing.
Now for some configuration of Cygwin so it can be used with Hadoop.
Configuring SSH on Windows
Hadoop requires SSH (Secure SHell) to be running. To configure it, open a Cygwin window and type:
ssh-host-config
Use the following installation options:
- Should privilege separation be used? (yes/no) no
- Do you want to install sshd as a service? yes
- Enter the value of CYGWIN for the daemon: [] ntsec
- If requested for an account name, specify: cyg_server with a password you’ll remember.
$ ssh-host-config
*** Info: Generating /etc/ssh_host_key
*** Info: Generating /etc/ssh_host_rsa_key
*** Info: Generating /etc/ssh_host_dsa_key
*** Info: Generating /etc/ssh_host_ecdsa_key
*** Info: Creating default /etc/ssh_config file
*** Info: Creating default /etc/sshd_config file
*** Info: Privilege separation is set to yes by default since OpenSSH 3.3.
*** Info: However, this requires a non-privileged account called 'sshd'.
*** Info: For more info on privilege separation read /usr/share/doc/openssh/README.privsep.
*** Query: Should privilege separation be used? (yes/no) no
*** Info: Updating /etc/sshd_config file
*** Query: Do you want to install sshd as a service?
*** Query: (Say "no" if it is already installed as a service) (yes/no) yes
*** Query: Enter the value of CYGWIN for the daemon: [] ntsec
*** Info: On Windows Server 2003, Windows Vista, and above, the
*** Info: SYSTEM account cannot setuid to other users -- a capability
*** Info: sshd requires. You need to have or to create a privileged
*** Info: account. This script will help you do so.
*** Info: You appear to be running Windows XP 64bit, Windows 2003 Server,
*** Info: or later. On these systems, it's not possible to use the LocalSystem
*** Info: account for services that can change the user id without an
*** Info: explicit password (such as passwordless logins [e.g. public key
*** Info: authentication] via sshd).
*** Info: If you want to enable that functionality, it's required to create
*** Info: a new account with special privileges (unless a similar account
*** Info: already exists). This account is then used to run these special
*** Info: servers.
*** Info: Note that creating a new user requires that the current account
*** Info: have Administrator privileges itself.
*** Info: No privileged account could be found.
*** Info: This script plans to use 'cyg_server'.
*** Info: 'cyg_server' will only be used by registered services.
*** Query: Do you want to use a different name? (yes/no) no
*** Query: Create new privileged user account 'cyg_server'? (yes/no) yes
*** Info: Please enter a password for new user cyg_server. Please be sure
*** Info: that this password matches the password rules given on your system.
*** Info: Entering no password will exit the configuration.
*** Query: Please enter the password:
*** Query: Reenter: (enter the password again)
*** Info: User 'cyg_server' has been created with password '####'.
*** Info: If you change the password, please remember also to change the
*** Info: password for the installed services which use (or will soon use)
*** Info: the 'cyg_server' account.
*** Info: Also keep in mind that the user 'cyg_server' needs read permissions
*** Info: on all users' relevant files for the services running as 'cyg_server'.
*** Info: In particular, for the sshd server all users' .ssh/authorized_keys
*** Info: files must have appropriate permissions to allow public key
*** Info: authentication. (Re-)running ssh-user-config for each user will set
*** Info: these permissions correctly. [Similar restrictions apply, for
*** Info: instance, for .rhosts files if the rshd server is running, etc].
*** Info: The sshd service has been installed under the 'cyg_server'
*** Info: account. To start the service now, call `net start sshd' or
*** Info: `cygrunsrv -S sshd'. Otherwise, it will start automatically
*** Info: after the next reboot.
*** Info: Host configuration finished. Have fun!
The installation script creates:
- configuration files:
- /etc/ssh_config
- /etc/ssh_host_dsa_key
- /etc/ssh_host_ecdsa_key
- /etc/ssh_host_key
- /etc/ssh_host_rsa_key
- /etc/sshd_config
- a privileged account named cyg_server.
- an sshd Windows service, using the specified account and password, listed under the name CYGWIN sshd.
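As the ssh-host-config output notes, the new service is not started automatically. A quick way to start it and confirm it is running, from a Cygwin window with administrator rights (a sketch; either start command works):
# Start the sshd service...
net start sshd
# ...or, equivalently:
cygrunsrv -S sshd
# Query the service status to confirm it is up:
cygrunsrv -Q sshd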
IMPORTANT: Do not run ssh-host-config again without removing the existing files and account first. The script changes access permissions on configuration files so that they can only be accessed by ssh services. If the sshd service, configuration files and account are not created together, the script fails to configure the file permissions and no error is reported.
Cleaning up ssh
If you run into any issue, delete the 6 files listed above, remove the created service using:
cygrunsrv -R sshd
and start over.
You should be able to start the sshd service and log in using your password. However, in order to run Hadoop you need to create a server key, so that you can establish an ssh session without specifying a password. To do this, type:
ssh-keygen
and accept all default options (no passphrase).
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/AccountName/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/AccountName/.ssh/id_rsa.
Your public key has been saved in /home/AccountName/.ssh/id_rsa.pub.
The key fingerprint is:
9b:51:11:ea:c4:a4:72:fe:70:e7:dd:f1:ea:34:ac:0f AccountName@ServerName
The key's randomart image is:
+--[ RSA 2048]----+
|         .    o. |
|        + .  .   |
|       . o +  .  |
|        + o .    |
|       o S . .   |
|      + * . o o  |
|     + . E  = . |
|          +  o   |
|         .o+     |
+-----------------+
Copy the generated RSA public key into the authorized_keys file, to allow logging in without a password.
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
Try connecting locally:
ssh localhost
You should be able to connect without specifying a password.
Install the latest Java JDK and set the JAVA_HOME path as a system environment variable.
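From the Cygwin side, JAVA_HOME can also be exported in the shell. The sketch below assumes a hypothetical JDK install path; adjust it to wherever your JDK actually lives (paths without spaces tend to cause fewer problems with Hadoop's shell scripts):
# Example path only -- replace with your real JDK location.
export JAVA_HOME="/cygdrive/c/Java/jdk1.6.0_45"
export PATH="$JAVA_HOME/bin:$PATH"
# Should print the installed JDK version.
java -version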
Now for the Hadoop part:
Download Hadoop
The current Hadoop version is 1.0.0. Hadoop is organized as 3 projects:
- Common: functionality common to all projects (logging, utilities, etc.).
- HDFS: Hadoop Distributed File System.
- MapReduce: Map-Reduce implementation. It allows performing distributed queries on the distributed file system. Explained later.
Download the Hadoop .tar.gz / .rpm / .deb file and unpack it to any directory. The recommended install directory is /usr/local/hadoop-1.0.0, but you could use other directories.
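As a concrete sketch of the download-and-unpack step (the mirror URL below is just an example of where the 1.0.0 tarball can be found; any Apache mirror works):
# Download and unpack Hadoop 1.0.0 into /usr/local
cd /usr/local
wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.0/hadoop-1.0.0.tar.gz
tar xzf hadoop-1.0.0.tar.gz
cd hadoop-1.0.0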
Configure Hadoop
There are 3 basic configuration options for Hadoop:
- Local (Standalone) Mode: All services run in a single node, with no replication.
- Pseudo-Distributed Mode: Services run in a single node, but as separate Java processes.
- Fully-Distributed Mode: Real distributed environment.
Configuration files are located under the Hadoop install directory in /conf. They all share the same key-value format, stored as a sequence of property name / property value pairs (an XML example is shown at the end of this section). Pseudo-Distributed Mode is the ideal development mode. The minimum configuration for pseudo-distributed mode is shown below:
conf/core-site.xml:
  fs.default.name = hdfs://localhost:9000
conf/hdfs-site.xml:
  dfs.replication = 1
conf/mapred-site.xml:
  mapred.job.tracker = localhost:9001
With this minimum configuration Hadoop keeps all of its data under the system tmp directory, so any installation should begin by defining the tmp and hdfs directories, as shown below:
conf/core-site.xml:
  hadoop.tmp.dir = /tmp/hadoop-${user.name}
  fs.default.name = hdfs://localhost:9000
conf/hdfs-site.xml:
  dfs.replication = 1
  dfs.name.dir = /home/${user.name}/hdfs/name
  dfs.data.dir = /home/${user.name}/hdfs/data
conf/mapred-site.xml:
  mapred.job.tracker = localhost:9001
Under Windows, specify paths using the full URI format, e.g.:
dfs.name.dir = file:///c:/hdfs/name
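Concretely, each of these conf/*.xml files is a standard Hadoop XML configuration file made of <property> entries. As a minimal sketch, conf/core-site.xml with the two values from the table above would look like this (conf/hdfs-site.xml and conf/mapred-site.xml follow exactly the same structure):
<?xml version="1.0"?>
<configuration>
  <!-- Base directory under which Hadoop keeps its working data -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
  <!-- Default file system: the local HDFS NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>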
Start hadoop
Format NameNode
Before starting Hadoop, you have to format the NameNode. This is the node containing the file system structure. To format the NameNode, run:
cd /usr/local/hadoop-1.0.0
./bin/hadoop namenode -format
Several files will be created under the directory defined for the configuration key dfs.name.dir.
Start HDFS
bin/start-dfs.sh
Check HDFS is running by browsing to http://localhost:50070/.
A webpage should be displayed with DFS information, where you can view and browse the directory structure.
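You can also poke at the file system from the command line right away. A quick smoke test (the directory name below is just an example):
cd /usr/local/hadoop-1.0.0
# The root of the distributed file system is empty right after formatting
bin/hadoop fs -ls /
# Create a scratch directory and confirm it shows up
bin/hadoop fs -mkdir /smoke-test
bin/hadoop fs -ls /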
If you run into any issue, check the log files under hadoop-1.0.0/logs/ for errors. You can also browse the file system using bin/hadoop fs -ls. Type bin/hadoop fs for the complete set of commands.
Under Mac OS X you might get an "Unable to load realm info from SCDynamicStore" error. If you run into this issue, add the following line:
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
Start MapReduce (JobTracker):
bin/start-mapred.sh
Check the JobTracker has started by browsing to http://localhost:50030/.
A page with scheduled jobs should be displayed.
Check hadoop-1.0.0/logs/ for errors. Check HDFS and the JobTracker by opening:
- NameNode: http://localhost:50070/
- JobTracker: http://localhost:50030/
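When you are done working, the daemons can be shut down again with the matching stop scripts bundled with Hadoop (a quick sketch, run from the install directory):
cd /usr/local/hadoop-1.0.0
# Stop the JobTracker/TaskTrackers and then HDFS
bin/stop-mapred.sh
bin/stop-dfs.sh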
Now the steps to start all Hadoop components:
Start the local hadoop cluster
- Start the namenode in the first window by executing:
cd hadoop-0.19.1
bin/hadoop namenode
- Start the secondary namenode in the second window by executing:
cd hadoop-0.19.1
bin/hadoop secondarynamenode
- Start the job tracker in the third window by executing:
cd hadoop-0.19.1
bin/hadoop jobtracker
- Start the data node in the fourth window by executing:
cd hadoop-0.19.1
bin/hadoop datanode
- Start the task tracker in the fifth window by executing:
cd hadoop-0.19.1
bin/hadoop tasktracker
- Now you should have an operational hadoop cluster. If everything went fine your screen should look like the image below:
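If you would rather not juggle five console windows, the distribution also ships wrapper scripts that start and stop all of these daemons in the background. A sketch, assuming the same hadoop-0.19.1 directory (the per-window approach above is still handy when you want to watch each daemon's log output directly):
cd hadoop-0.19.1
# Launches namenode, secondary namenode, datanode, jobtracker and tasktracker
bin/start-all.sh
# And to shut everything down again:
bin/stop-all.sh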
Now for the important step this post is about:
Setup Hadoop Location in Eclipse
- Launch the Eclipse environment.
- Open the Map/Reduce perspective by clicking on the open perspective icon, selecting "Other" from the menu, and then selecting "Map/Reduce" from the list of perspectives.
- After you have switched to the Map/Reduce perspective, select the Map/Reduce Locations tab located at the bottom of your Eclipse environment. Then right-click on the blank space in that tab and select "New Hadoop location..." from the context menu.
- Fill in the following items, as shown in the figure above.
  - Location Name -- localhost
  - Map/Reduce Master
    - Host -- localhost
    - Port -- 9101
  - DFS Master
    - Check "Use M/R Master Host"
    - Port -- 9100
  - User name -- User
Note that these ports must match your Hadoop configuration: the Map/Reduce Master port corresponds to mapred.job.tracker and the DFS Master port to fs.default.name (9001 and 9000, respectively, if you used the conf files shown earlier).
- After you close the Hadoop location settings dialog you should see a new location appear in the "Map/Reduce Locations" tab.
- In the Project Explorer tab on the left-hand side of the Eclipse window, find the DFS Locations item. Open it up using the "+" icon on its left; inside it you should see the localhost location reference with the blue elephant icon. Keep opening up the items below it to browse the HDFS directory tree.
Upload data to HDFS
- Open a new CYGWIN command window.
- Execute the following commands in the new Cygwin window, as shown on the image above.
cd hadoop-0.19.1
bin/hadoop fs -mkdir In
bin/hadoop fs -put *.txt In
When the last of the above commands starts executing you should see some activity happening in the rest of the Hadoop windows, as shown on the image below.
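To confirm the upload worked, you can list the new directory from the same Cygwin window (a quick check; the In directory name comes from the commands above):
cd hadoop-0.19.1
# Should list the .txt files that were just copied into HDFS
bin/hadoop fs -ls In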
Creating and configuring Hadoop eclipse project.
- Launch Eclipse
- Right-click on the blank space in the Project Explorer window and select New -> Project... to create a new project.
- Select Map/Reduce Project from the list of project types.
- Hadoop on Windows With Eclipse
- Video for Hadoop on Windows
- Installation guide for Cygwin and java and SSH and Hadoop Configuration