Integrating Apache Sqoop And Apache Pig With Apache Hadoop
By Abdulbasit F Shaikh

Apache Sqoop

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

Prerequisites

The following prerequisite knowledge is required for this tutorial:

- Basic computer technology and terminology
- Familiarity with command-line interfaces such as bash
- Relational database management systems
- Basic familiarity with the purpose and operation of Hadoop

Integrating Sqoop with Apache Hadoop

P.S.: Make sure you have installed and configured Hadoop properly before following these steps, and that Hadoop is started. Check by running the jps command from the Hadoop bin directory (/../hadoop/bin); it should show output like the following:

   7346 Jps
   6423 NameNode
   6879 JobTracker
   6598 DataNode
   6789 SecondaryNameNode
   7057 TaskTracker

If not, run the command below, then run jps again and check the output:

   ./start-all.sh

To work with a MySQL database, you first need to put the MySQL connector JAR file in Sqoop's lib directory (/../sqoop-1.4.4.bin__hadoop-1.0.0/lib).

Importing Data from MySQL into HDFS

1. Download Sqoop from http://www.carfab.com/apachesoftware/sqoop/1.4.4/
   P.S.: I have tested Sqoop with Hadoop 1.1.2, so I downloaded sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz.
2. Extract it.
3. Go to the bin directory (/../sqoop-1.4.4.bin__hadoop-1.0.0/bin).
4. Run the command below:

   ./sqoop import --connect <Your JDBC URL>/<Your Database Name> --table <Your table name> --username <User Name> -P

   E.g. for MySQL:

   ./sqoop import --connect jdbc:mysql://localhost/test --table ttest --username root -P

   It will ask for the password; enter it.
5. If all succeeds, you will get output like the following:

   13/08/12 14:51:24 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
   ...
   13/08/12 14:51:44 INFO mapreduce.ImportJobBase: Retrieved 8 records.

6. Now you can go to your HDFS directory and check whether the data was imported successfully.
7. To check, go to the bin directory of Hadoop and run the command below:

   hadoop fs -ls <table name>

8. If all succeeds, you will get output like the following:

   Found 6 items
   -rw-r--r-- 2 hduser supergroup 0 2013-08-12 14:51 /user/hduser/ttest/_SUCCESS
   ...
   -rw-r--r-- 2 hduser supergroup 283 2013-08-12 14:51 /user/hduser/ttest/part-m-00003
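If you want to reproduce this walkthrough end to end, the sketch below creates a small source table and then prints the imported records. It is a minimal sketch, not part of the original steps: it assumes a local MySQL server plus the test database and ttest table from the example above, and the columns and rows are illustrative. The primary key matters because Sqoop splits the import across parallel mappers on it by default (for a table without one, pass --split-by <column> or -m 1).

   # Illustrative source table for the import example above.
   $ mysql -u root -p test
   mysql> CREATE TABLE ttest (id INT PRIMARY KEY, name VARCHAR(32));
   mysql> INSERT INTO ttest VALUES (1, 'alpha'), (2, 'beta');
   mysql> EXIT;

   # After the import finishes, print the imported records straight from
   # HDFS (run from the Hadoop bin directory):
   $ ./hadoop fs -cat /user/hduser/ttest/part-m-*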
Exporting Data from HDFS into MySQL

1. Go to the bin directory of Sqoop and run the command below:

   ./sqoop export --connect <JDBC URL>/<database name> --table <table name> --export-dir <HDFS directory which contains the data> --username <username> -P

   E.g. for MySQL:

   ./sqoop export --connect jdbc:mysql://localhost/test --table test2 --export-dir /user/hduser/ttest --username root -P --validate

   P.S.: The HDFS directory is the directory on Hadoop which stores the data. You can find it by going to the bin directory of Hadoop and running the command below:

   ./hadoop dfs -ls

   It will give you output something like this:

   Found 1 items
   drwxr-xr-x - hduser supergroup 0 2013-08-12 14:51 /user/hduser/ttest

2. It will ask for the password; enter it. If all succeeds, you will get output like the following:

   13/08/12 16:06:27 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
   ...
   13/08/12 16:06:42 INFO mapreduce.ExportJobBase: Exported 8 records.

3. Now you can check the data in your database by looking at the table you specified in the command.
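Note that Sqoop export writes into an existing table: test2 must already exist in MySQL with a schema matching the HDFS data before step 1 runs. The sketch below is a minimal illustration under that assumption, reusing the test, ttest, and test2 names from the examples above:

   # Create the export target with the same schema as the imported table,
   # then count the exported rows afterward. The count should match the
   # "Exported 8 records" log line.
   $ mysql -u root -p test
   mysql> CREATE TABLE test2 LIKE ttest;
   mysql> EXIT;

   # ... run the ./sqoop export command from step 1 ...

   $ mysql -u root -p test -e "SELECT COUNT(*) FROM test2;"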
Apache Pig

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

- Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.

Pig Setup

Requirements

Mandatory

Unix and Windows users need the following:

- Hadoop 0.20.2, 0.20.203, 0.20.204, 0.20.205, 1.0.0, 1.0.1, 0.23.0, or 0.23.1 - http://hadoop.apache.org/common/releases.html (You can run Pig with different versions of Hadoop by setting HADOOP_HOME to point to the directory where you have installed Hadoop. If you do not set HADOOP_HOME, by default Pig will run with the embedded version, currently Hadoop 1.0.0.)
- Java 1.6 - http://java.sun.com/javase/downloads/index.jsp (set JAVA_HOME to the root of your Java installation)

Windows users also need to install Cygwin and the Perl package: http://www.cygwin.com/

Optional

- Python 2.5 - http://jython.org/downloads.html (when using Python UDFs or embedding Pig in Python)
- JavaScript 1.7 - https://developer.mozilla.org/en/Rhino_downloads_archive and http://mirrors.ibiblio.org/pub/mirrors/maven2/rhino/js/ (when using JavaScript UDFs or embedding Pig in JavaScript)
- JRuby 1.6.7 - http://www.jruby.org/download (when using JRuby UDFs)
- Ant 1.7 - http://ant.apache.org/ (for builds)
- JUnit 4.5 - http://junit.sourceforge.net/ (for unit tests)

Download Pig

To get a Pig distribution, do the following:

1. Download a recent stable release from http://psg.mtu.edu/pub/apache/pig/stable/
2. Unpack the downloaded Pig distribution, and then note the following:
   - The Pig script file, pig, is located in the bin directory (/pig-n.n.n/bin/pig). The Pig environment variables are described in the Pig script file.
   - The Pig properties file, pig.properties, is located in the conf directory (/pig-n.n.n/conf/pig.properties). You can specify an alternate location using the PIG_CONF_DIR environment variable.
3. Add /pig-n.n.n/bin to your path. Use export (bash, sh, ksh) or setenv (tcsh, csh). For example:

   $ export PATH=/<my-path-to-pig>/pig-n.n.n/bin:$PATH

4. Test the Pig installation with this simple command:

   $ pig -help

Build Pig

To build Pig, do the following:

1. Check out the Pig code from SVN: svn co http://svn.apache.org/repos/asf/pig/trunk
2. Build the code from the top directory: ant
   If the build is successful, you should see the pig.jar file created in that directory.
3. Validate the pig.jar by running a unit test: ant test

Running Pig

You can run Pig (execute Pig Latin statements and Pig commands) using various modes:

                      Local Mode   Mapreduce Mode
   Interactive Mode   yes          yes
   Batch Mode         yes          yes

Execution Modes

Pig has two execution modes or exectypes:

- Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
- Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).

You can run Pig in either mode using the "pig" command (the bin/pig Perl script) or the "java" command (java -cp pig.jar ...).

Examples

This example shows how to run Pig in local and mapreduce mode using the pig command:

   /* local mode */
   $ pig -x local ...

   /* mapreduce mode */
   $ pig ...
   or
   $ pig -x mapreduce ...

This example shows how to run Pig in local and mapreduce mode using the java command:

   /* local mode */
   $ java -cp pig.jar org.apache.pig.Main -x local ...

   /* mapreduce mode */
   $ java -cp pig.jar org.apache.pig.Main ...
   or
   $ java -cp pig.jar org.apache.pig.Main -x mapreduce ...
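The mode table above also lists batch mode, which the examples do not show. As a minimal sketch, not from the original text: place the Pig Latin statements in a script file and pass the file name to pig. The file names id.pig and id.out are illustrative, and the passwd input is the local copy of /etc/passwd prepared in the Interactive Mode example below; in batch mode, store writes the result to an output directory instead of dumping it to the screen.

   $ cat > id.pig <<'EOF'
   -- load the local passwd copy, splitting fields on ':'
   A = load 'passwd' using PigStorage(':');
   -- keep only the first field (the user ID)
   B = foreach A generate $0 as id;
   -- write the result to the id.out directory
   store B into 'id.out';
   EOF
   $ pig -x local id.pig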
Interactive Mode

You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command (as shown below) and then enter your Pig Latin statements and Pig commands interactively at the command line.

Example

These Pig Latin statements extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, invoke the Grunt shell by typing the "pig" command (in local or hadoop mode). Then, enter the Pig Latin statements interactively at the grunt prompt (be sure to include the semicolon after each statement). The DUMP operator will display the results to your terminal screen.

   grunt> A = load 'passwd' using PigStorage(':');
   grunt> B = foreach A generate $0 as id;
   grunt> dump B;

Local Mode

   $ pig -x local
   ... - Connecting to ...
   grunt>

Mapreduce Mode

   $ pig -x mapreduce
   ... - Connecting to ...
   grunt>

   or

   $ pig
   ... - Connecting to ...
   grunt>
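Finally, to close the loop between the two tools covered here, below is a short sketch, not from the original text, of reading the Sqoop-imported ttest data with Pig in mapreduce mode. Sqoop's text-file import writes comma-separated fields by default, so PigStorage(',') can parse the part files; the rows alias is illustrative.

   $ pig -x mapreduce
   ... - Connecting to ...
   grunt> rows = load '/user/hduser/ttest' using PigStorage(',');
   grunt> dump rows;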
