Apache Pig Installation

Updated 2018-01-09 18:47

This chapter explains how to download, install, and set up Apache Pig on your system.

Prerequisites

Hadoop and Java must be installed on your system before you run Apache Pig. Before installing Apache Pig, install Hadoop and Java by following the steps at this link: ://m.o2fo.com/hadoop/hadoop_enviornment_setup.htm
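
Before continuing, it can help to confirm that both prerequisites are actually on the PATH. The following is a minimal sketch, not part of the original tutorial; the function name check_prereqs is hypothetical, and it only tests that the commands can be found, not that their versions are suitable for Pig.

```shell
# Report which of the given commands are missing from PATH.
# Prints one "missing: <name>" line per absent command and returns
# non-zero if at least one command is missing.
check_prereqs() {
  status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd"
      status=1
    fi
  done
  return "$status"
}

# Example call (commented out so that sourcing this file has no side effects):
# check_prereqs java hadoop && echo "Java and Hadoop found"
```

If the function prints nothing and returns zero, both commands were found.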

Downloading Apache Pig

First, download the latest version of Apache Pig from the following website: https://pig.apache.org/

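Step 1

Open the home page of the Apache Pig website. Under the News section, click the release page link, as shown in the screenshot below.

[Screenshot: Home Page]

Step 2

After clicking that link, you are redirected to the Apache Pig Releases page. Under the Download section of this page, click the link; you are then redirected to a page that lists a set of mirrors.

[Screenshot: Apache Pig Releases]

Step 3

Choose and click any one of these mirrors, as shown below.

[Screenshot: click mirrors]

Step 4

The mirror takes you to the Pig Releases page, which lists the available versions of Apache Pig. Click the latest version.

[Screenshot: Pig Release]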

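Step 5

These folders contain the source and binary files of the Apache Pig distribution. Download the tar files for the source and binary distributions of Apache Pig 0.16: pig-0.16.0-src.tar.gz and pig-0.16.0.tar.gz.

[Screenshot: Pig Index]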
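
The same two tarballs can also be fetched from the command line instead of through the browser. The sketch below only builds the download URLs. It assumes the directory layout of archive.apache.org (dist/pig/pig-<version>/), which this tutorial does not mention, and the function name pig_tarball_url is made up for illustration.

```shell
# Build the download URL for a Pig release tarball.
# ASSUMPTION: releases live under archive.apache.org/dist/pig/pig-<version>/;
# adjust "base" if you use one of the mirrors from the steps above.
pig_tarball_url() {
  version=$1
  kind=${2:-bin}   # "bin" for the binary tarball, "src" for the source tarball
  base="https://archive.apache.org/dist/pig/pig-${version}"
  if [ "$kind" = "src" ]; then
    echo "${base}/pig-${version}-src.tar.gz"
  else
    echo "${base}/pig-${version}.tar.gz"
  fi
}

# Example downloads (commented out because they need network access):
# curl -fLO "$(pig_tarball_url 0.16.0)"
# curl -fLO "$(pig_tarball_url 0.16.0 src)"
```

Run from the Downloads directory, the two curl calls would leave the same files there that the browser download produces.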

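Installing Apache Pig

After downloading the Apache Pig software, follow the steps below to install it in a Linux environment.

Step 1

Create a directory named Pig in the same directory that holds the installation directories of Hadoop, Java, and other software. (In this tutorial, we created the Pig directory in the home directory of a user named Hadoop.)

$ mkdir Pig

Step 2

Extract the downloaded tar files, as shown below.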

$ cd Downloads/
$ tar zxvf pig-0.16.0-src.tar.gz
$ tar zxvf pig-0.16.0.tar.gz

Step 3

Move the contents of the directory extracted from pig-0.16.0-src.tar.gz into the Pig directory created earlier, as shown below.

$ mv pig-0.16.0-src/* /home/Hadoop/Pig/
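
The three installation steps above can also be combined into one helper. This is a sketch under assumptions, not the tutorial's own method: the function name install_pig is hypothetical, and it relies on the --strip-components option, which GNU tar and bsdtar both provide.

```shell
# Extract a Pig tarball directly into a target directory.
# --strip-components=1 drops the top-level pig-<version>/ folder, so the
# contents land in the target itself and no separate "mv" is needed.
install_pig() {
  tarball=$1
  target=$2
  mkdir -p "$target" &&
    tar -xzf "$tarball" -C "$target" --strip-components=1
}

# Example with the paths used in this tutorial (commented out);
# substitute the tarball you want to install:
# install_pig "$HOME/Downloads/pig-0.16.0-src.tar.gz" /home/Hadoop/Pig
```

Extracting straight into the target avoids leaving a half-moved directory behind if the move is interrupted.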

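Configuring Apache Pig

After installing Apache Pig, we have to configure it. To do so, we edit two files: .bashrc and pig.properties.

.bashrc file

In the .bashrc file, set the following variables:

  • Set PIG_HOME to the Apache Pig installation folder.

  • Add Pig's bin folder to the PATH environment variable.

  • Set the PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml, and mapred-site.xml files).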

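# No spaces are allowed around "=" in shell assignments.
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
# Hadoop configuration folder; on Hadoop 2.x and later this is $HADOOP_HOME/etc/hadoop
export PIG_CLASSPATH=$HADOOP_HOME/conf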
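
When these lines are added to .bashrc by a script rather than by hand, it is easy to append them twice. The following sketch shows one way to keep the edit idempotent; the helper name append_once is hypothetical and not part of Pig or Hadoop.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
  file=$1
  line=$2
  touch "$file"
  # -x matches whole lines, -F treats the pattern as a fixed string
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (commented out so that nothing is modified when this is sourced):
# append_once "$HOME/.bashrc" 'export PIG_HOME=/home/Hadoop/Pig'
# append_once "$HOME/.bashrc" 'export PATH=$PATH:$PIG_HOME/bin'
# . "$HOME/.bashrc"   # reload the file in the current shell
```

The single quotes in the example keep $PATH and $PIG_HOME unexpanded, so they are evaluated each time .bashrc is read.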

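pig.properties file

The conf folder of Pig contains a file named pig.properties, in which you can set various parameters. The following command lists the parameters that can be set:

$ pig -h properties

The following properties are supported: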

    Logging:
        verbose=true|false; default is false. This property is the same as -v switch
        brief=true|false; default is false. This property is the same as -b switch
        debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
        aggregate.warning=true|false; default is true.
            If true, prints count of warnings of each type rather than logging each warning.

    Performance tuning:
        pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
            Note that this memory is shared across all large bags used by the application.
        pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory).
            Specifies the fraction of heap available for the reducer to perform the join.
        pig.exec.nocombiner=true|false; default is false.
            Only disable combiner as a temporary workaround for problems.
        opt.multiquery=true|false; multiquery is on by default.
            Only disable multiquery as a temporary workaround for problems.
        opt.fetch=true|false; fetch is on by default.
            Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
        pig.tmpfilecompression=true|false; compression is off by default.
            Determines whether output of intermediate jobs is compressed.
        pig.tmpfilecompression.codec=lzo|gzip; default is gzip.
            Used in conjunction with pig.tmpfilecompression. Defines compression type.
        pig.noSplitCombination=true|false. Split combination is on by default.
            Determines if multiple small files are combined into a single map.
        pig.exec.mapPartAgg=true|false. Default is false.
            Determines if partial aggregation is done within map phase, before records are sent to combiner.
        pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
            If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.

    Miscellaneous:
        exectype=mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
        pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command.
        udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF.
        stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
        pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
            Determines the timezone used to handle datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.
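
To change one of these properties, edit the corresponding key=value line in pig.properties. The sketch below automates that edit; the helper name set_prop is hypothetical, and it assumes plain key=value lines with no spaces around the equals sign and values without backslashes.

```shell
# Set key=value in a properties file: replace the line if the key already
# exists, otherwise append it. Comment lines and other keys are kept as-is.
set_prop() {
  file=$1
  key=$2
  value=$3
  tmp=$(mktemp)
  awk -v k="$key" -v v="$value" '
    BEGIN { FS = "="; done = 0 }
    $1 == k { print k "=" v; done = 1; next }   # replace existing entry
    { print }                                   # copy everything else
    END { if (!done) print k "=" v }            # append if key was absent
  ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example (commented out; the file must already exist):
# set_prop "$PIG_HOME/conf/pig.properties" stop.on.failure true
# set_prop "$PIG_HOME/conf/pig.properties" opt.fetch false
```

Writing to a temporary file and moving it into place means the original file is left untouched if awk fails.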

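Verifying the Installation

Verify the Apache Pig installation by typing the version command. If the installation succeeded, Pig prints its version in the form shown below.

$ pig -version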
 
Apache Pig version 0.16.0 (r1682971)  
compiled Jun 01 2015, 11:44:35
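
In a setup script, the version number can be pulled out of this output and compared with the release you meant to install. This is a minimal sketch that assumes the output keeps the "Apache Pig version X.Y.Z" form shown above; the function name parse_pig_version is hypothetical.

```shell
# Read "pig -version" output on stdin and print only the version number.
# Prints nothing if no "Apache Pig version ..." line is found.
parse_pig_version() {
  sed -n 's/^Apache Pig version \([0-9][0-9.]*\).*/\1/p'
}

# Example (commented out because it needs a working Pig installation):
# installed=$(pig -version 2>&1 | parse_pig_version)
# [ "$installed" = "0.16.0" ] && echo "Pig 0.16.0 is installed"
```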

