This chapter explains how to download, install, and set up Apache Pig on your system.
Hadoop and Java must be installed on your system before you run Apache Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps given in the following link: //m.o2fo.com/hadoop/hadoop_enviornment_setup.htm
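Before proceeding, you can verify both prerequisites from a terminal. The two commands below simply print the installed versions; the output on your system will depend on the versions you have installed.

$ java -version
$ hadoop version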
First of all, download the latest version of Apache Pig from the following website: https://pig.apache.org/
Open the homepage of the Apache Pig website. Under the News section, click the link release page, as shown in the following snapshot.
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, click the link, and you will be redirected to a page with a set of mirrors.
Choose and click any one of these mirrors, as shown below.
These mirrors will take you to the Pig Releases page. This page contains the various versions of Apache Pig. Click the latest version among them.
Within these folders, you will find the source and binary files of the Apache Pig release. Download the tar files of the source and binary files of Apache Pig 0.16: pig-0.16.0-src.tar.gz and pig-0.16.0.tar.gz.
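If you prefer the command line, the same tar files can also be fetched with wget. The URLs below assume the Apache archive layout for the 0.16.0 release and are only an example; use the mirror link offered on the release page if it differs.

$ wget https://archive.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz
$ wget https://archive.apache.org/dist/pig/pig-0.16.0/pig-0.16.0-src.tar.gz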
After downloading the Apache Pig software, install it in your Linux environment by following the steps given below.
Create a directory named Pig in the same directory where the installation directories of Hadoop, Java, and other software are located. (In this tutorial, we created the Pig directory under the user named Hadoop.)
$ mkdir Pig
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.16.0-src.tar.gz
$ tar zxvf pig-0.16.0.tar.gz
Move the contents of the directory extracted from pig-0.16.0-src.tar.gz (typically named pig-0.16.0-src) to the Pig directory created earlier, as shown below.
$ mv pig-0.16.0-src/* /home/Hadoop/Pig/
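To confirm that the move succeeded, you can list the Pig directory used in this tutorial:

$ ls /home/Hadoop/Pig/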
After installing Apache Pig, we have to configure it. To configure it, we need to edit two files: .bashrc and pig.properties.
In the .bashrc file, set the following variables:
PIG_HOME to the installation folder of Apache Pig,
the PATH environment variable to include the Pig bin folder, and
the PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml, and mapred-site.xml files).
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
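After editing .bashrc, reload it so that the new variables take effect in the current shell (assuming a Bash shell; alternatively, open a new terminal):

$ source ~/.bashrc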
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below.
pig -h properties
The following properties are supported:
Logging:
    verbose = true|false; default is false. This property is the same as -v switch
    brief = true|false; default is false. This property is the same as -b switch
    debug = OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
    aggregate.warning = true|false; default is true. If true, prints count of warnings of each type rather than logging each warning.
Performance tuning:
    pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory). Note that this memory is shared across all large bags used by the application.
    pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory). Specifies the fraction of heap available for the reducer to perform the join.
    pig.exec.nocombiner = true|false; default is false. Only disable combiner as a temporary workaround for problems.
    opt.multiquery = true|false; multiquery is on by default. Only disable multiquery as a temporary workaround for problems.
    opt.fetch = true|false; fetch is on by default. Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
    pig.tmpfilecompression = true|false; compression is off by default. Determines whether output of intermediate jobs is compressed.
    pig.tmpfilecompression.codec = lzo|gzip; default is gzip. Used in conjunction with pig.tmpfilecompression. Defines compression type.
    pig.noSplitCombination = true|false. Split combination is on by default. Determines if multiple small files are combined into a single map.
    pig.exec.mapPartAgg = true|false. Default is false. Determines if partial aggregation is done within map phase, before records are sent to combiner.
    pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10. If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.
Miscellaneous:
    exectype = mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
    pig.additional.jars.uris=<comma separated list of jars>. Used in place of register command.
    udf.import.list=<comma separated list of imports>. Used to avoid package names in UDF.
    stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
    pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host. Determines the timezone used to handle datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.
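As an illustrative sketch, the entries below show how a few of the properties listed above could be set in conf/pig.properties. The values are examples only and should be adjusted to your environment.

# sample pig.properties entries (example values)
debug=INFO
aggregate.warning=true
pig.tmpfilecompression=true
pig.tmpfilecompression.codec=gzip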
Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will get the version of Apache Pig as shown below.
$ pig -version

Apache Pig version 0.16.0 (r1682971)
compiled Jun 01 2015, 11:44:35