Hadoop Course

Hadoop Streaming


Hadoop Streaming is a utility that ships with the Hadoop distribution. It lets you create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

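Principle

Hadoop Streaming is a programming tool provided by Hadoop. It lets users use any executable or script file as the Mapper and Reducer. For example, ordinary shell commands can serve as the mapper and reducer (cat as the mapper, wc as the reducer):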

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper cat \
    -reducer wc

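The mapper and the reducer read user data from standard input, process it line by line, and write the results to standard output. The Streaming utility creates a MapReduce job, sends it to the tasktrackers, and monitors the job while it runs.

When a file (an executable or a script) is specified as the mapper, each mapper task launches that file as a separate process when the mapper is initialized. As the mapper task runs, it splits its input into lines and feeds each line to the standard input of the process. Meanwhile, the mapper collects the standard output of the process and converts each line it receives into a key/value pair, which becomes the mapper's output. By default, the part of a line before the first tab is the key and the rest (excluding the tab) is the value. If a line contains no tab, the whole line is the key and the value is null. This can be customized; how to define your own key/value split is described later.

The reducer works similarly, except that the reducer task converts its input key/value pairs back into lines before feeding them to the process.

This is the basic communication protocol between the Map/Reduce framework and a streaming mapper/reducer.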
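
The default splitting rule described above can be sketched in a few lines of Python. This only illustrates the rule and is not Hadoop's implementation; the function name split_key_value is made up for this example, and the "null" value is represented here as an empty string.

```python
def split_key_value(line, separator="\t"):
    """Split one output line into (key, value) at the first separator.

    Illustrative helper, not part of Hadoop: a line without the
    separator becomes the key, with an empty value.
    """
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    if not found:
        # No separator at all: the whole line is the key.
        return line, ""
    return key, value


if __name__ == "__main__":
    for sample in ["hadoop\t1", "a\tb\tc", "no-tab-here"]:
        print(split_key_value(sample))
```

Only the first tab matters: any further tabs stay inside the value.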

Syntax

Basic syntax

Usage: $HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar [options]

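Options:
(1) -input: input file path
(2) -output: output file path
(3) -mapper: the user's own mapper program; an executable or a script
(4) -reducer: the user's own reducer program; an executable or a script
(5) -file: packages a file into the submitted job; it can be any file the mapper or reducer needs as input, such as a configuration file or a dictionary
(6) -partitioner: a user-defined partitioner program
(7) -combiner: a user-defined combiner program (must be implemented in Java)
(8) -D: job properties (formerly set with -jobconf), including:
1) mapred.map.tasks: number of map tasks
2) mapred.reduce.tasks: number of reduce tasks
3) stream.map.input.field.separator / stream.map.output.field.separator: separator for the input/output data of map tasks; both default to \t
4) stream.num.map.output.key.fields: number of fields that make up the key in a map task's output records
5) stream.reduce.input.field.separator / stream.reduce.output.field.separator: separator for the input/output data of reduce tasks; both default to \t
6) stream.num.reduce.output.key.fields: number of fields that make up the key in a reduce task's output records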

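Sometimes the input only needs to be processed by a map function. In that case, set mapred.reduce.tasks to zero: the Map/Reduce framework then creates no reducer tasks, and the output of the mapper tasks is the final output of the job. For backward compatibility, Hadoop Streaming also supports the "-reducer NONE" option, which is equivalent to "-jobconf mapred.reduce.tasks=0".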
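
As a rough sketch of how the options above fit together, the following Python helper assembles the argument list for a streaming job and switches to a map-only job when no reducer is given. The helper name build_streaming_command and the jar name in the demo are placeholders chosen for this example, not part of Hadoop. It places the -D properties directly after the jar, ahead of the streaming options, which is where generic options are expected.

```python
def build_streaming_command(jar, input_path, output_path, mapper,
                            reducer=None, files=(), properties=None):
    """Return the argument list for a `hadoop jar` streaming invocation.

    Hypothetical helper for illustration only.
    """
    properties = dict(properties or {})
    if reducer is None:
        # Map-only job: ask the framework to create no reducer tasks.
        properties["mapred.reduce.tasks"] = "0"
    command = ["hadoop", "jar", jar]
    # Generic -D options go before the streaming options.
    for name, value in sorted(properties.items()):
        command += ["-D", "%s=%s" % (name, value)]
    command += ["-input", input_path, "-output", output_path,
                "-mapper", mapper]
    if reducer is not None:
        command += ["-reducer", reducer]
    for path in files:
        # Ship each script with the job so every node can run it.
        command += ["-file", path]
    return command


if __name__ == "__main__":
    # "hadoop-streaming.jar" is a placeholder path for this demo.
    print(" ".join(build_streaming_command(
        "hadoop-streaming.jar", "myInputDirs", "myOutputDir",
        "mapper.py", files=["mapper.py"])))
```

The returned list could be handed to subprocess.run on a machine where the hadoop command is installed.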

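Extended syntax

As mentioned earlier, when the Map/Reduce framework reads a line from the mapper's standard output, it splits the line into a key/value pair. By default, the part of the line before the first tab is the key and the part after it (excluding the tab) is the value.

Users can customize this: the separator can be a character other than the default tab, and the split can happen at the n-th (n >= 1) separator instead of the first. For example: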

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -jobconf stream.map.output.field.separator=. \
    -jobconf stream.num.map.output.key.fields=4

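In the example above, "-jobconf stream.map.output.field.separator=." makes "." the separator for map output, and "-jobconf stream.num.map.output.key.fields=4" makes the part of a line before the fourth "." the key and the part after it the value (excluding that fourth "."). If a line contains fewer than four "."s, the whole line is the key and the value is an empty Text object (like the one created by new Text("")).

Similarly, "-jobconf stream.reduce.output.field.separator=SEP" and "-jobconf stream.num.reduce.output.key.fields=NUM" specify at which separator a line of reduce output is split into key and value.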
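
The effect of a custom separator combined with a key-field count can be illustrated with a small Python function. This is a sketch of the rule as described in this section, not Hadoop's code; the name split_at_nth is invented for the example.

```python
def split_at_nth(line, separator="\t", num_key_fields=1):
    """Split a line into (key, value) at the n-th separator.

    Illustrative only: mirrors the behaviour described for
    stream.map.output.field.separator and
    stream.num.map.output.key.fields.
    """
    fields = line.split(separator)
    if len(fields) <= num_key_fields:
        # Fewer than n separators: the whole line is the key.
        return line, ""
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value


if __name__ == "__main__":
    print(split_at_nth("a.b.c.d.e.f", separator=".", num_key_fields=4))
    print(split_at_nth("a.b.c", separator=".", num_key_fields=4))
```

With "." as the separator and four key fields, the first call yields the key a.b.c.d and the value e.f, while the second line has too few separators and becomes the key as a whole.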

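Examples

To show how Hadoop Streaming programs are written in different languages, the examples below use WordCount, a job that counts the occurrences of every string in the user's input.

1. Shell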

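# vi mapper.sh

        #!/bin/bash
        # Emit "<word><TAB>1" for every whitespace-separated word on stdin.
        set -f    # disable globbing so a word such as * is not expanded
        while read -r LINE || [ -n "$LINE" ]; do
          for word in $LINE
          do
            printf '%s\t1\n' "$word"
          done
        done

# vi reducer.sh

        #!/bin/bash
        # Input arrives sorted by key, so equal words are adjacent.
        TAB=$(printf '\t')
        count=0
        started=0
        word=""
        while IFS="$TAB" read -r newword value; do
            if [ "$word" != "$newword" ]; then
                # Key changed: flush the total for the previous word.
                if [ "$started" -ne 0 ]; then
                    printf '%s\t%s\n' "$word" "$count"
                fi
                word=$newword
                count=$value
                started=1
            else
                count=$(( count + value ))
            fi
        done
        # Flush the last word, unless the input was empty.
        if [ "$started" -ne 0 ]; then
            printf '%s\t%s\n' "$word" "$count"
        fi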

Local test:

cat input.txt | bash mapper.sh | sort | bash reducer.sh

Cluster test:

        $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
            -input myInputDirs \
            -output myOutputDir \
            -mapper mapper.sh \
            -reducer reducer.sh

If running the command above fails with "Caused by: java.io.IOException: Cannot run program "/user/hadoop/Mapper": error=2, No such file or directory", the executable cannot be found on the node. Use the -file option when submitting the job to ship the files; for the example above that is "-file mapper.sh -file reducer.sh". Hadoop then distributes both files to every node automatically:

        $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
            -input myInputDirs \
            -output myOutputDir \
            -mapper mapper.sh \
            -reducer reducer.sh \
            -file mapper.sh \
            -file reducer.sh

2. Python

# vi mapper.py

        #!/usr/bin/env python
        import sys

        # input comes from STDIN (standard input)
        for line in sys.stdin:
            # split() with no arguments ignores leading and trailing
            # whitespace and never yields empty strings
            for word in line.split():
                # write the results to STDOUT (standard output);
                # what we output here will be the input for the
                # Reduce step, i.e. the input for reducer.py
                #
                # tab-delimited; the trivial word count is 1
                print('%s\t%s' % (word, 1))

# vi reducer.py

        #!/usr/bin/env python
        from operator import itemgetter
        import sys

        # maps words to their counts
        word2count = {}

        # input comes from STDIN
        for line in sys.stdin:
            # remove leading and trailing whitespace
            line = line.strip()
            try:
                # parse the input we got from mapper.py
                word, count = line.split('\t', 1)
                # convert count (currently a string) to int
                count = int(count)
            except ValueError:
                # the line had no tab or count was not a number,
                # so silently ignore/discard this line
                continue
            word2count[word] = word2count.get(word, 0) + count

        # sort the words lexicographically;
        #
        # this step is NOT required, we just do it so that our
        # final output will look more like the official Hadoop
        # word count examples
        sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

        # write the results to STDOUT (standard output)
        for word, count in sorted_word2count:
            print('%s\t%s' % (word, count))

Local test:

cat input.txt | python mapper.py | sort | python reducer.py
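
The local pipeline above (mapper, sort, reducer) can also be imitated inside a single Python process, which is handy for quick experiments without any files. The sketch below is an illustration only; the function names are made up, and it relies on itertools.groupby to sum adjacent pairs that share a key, the same assumption a streaming reducer makes about sorted input.

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: yield (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reducer step: sum the counts of adjacent pairs sharing a key.

    The pairs must already be sorted by key, as they are after `sort`.
    """
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort and reduce in one process."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    for word, count in word_count(["hello hadoop", "hello streaming"]):
        print("%s\t%s" % (word, count))
```

Unlike the dictionary-based reducer.py, this reducer keeps only one running total in memory at a time, which is why it depends on the sort step.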

Cluster test:

            $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
                -input myInputDirs \
                -output myOutputDir \
                -mapper mapper.py \
                -reducer reducer.py

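Example using Python

For Hadoop Streaming, we consider the word-count problem. A typical Hadoop job has two phases: mapper and reducer. The mapper and reducer below are written as Python scripts so that they can run under Hadoop. The same can be written in Perl or Ruby.

Mapper phase code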

#!/usr/bin/python
import sys

# Input comes from standard input
for myline in sys.stdin:
    # Remove whitespace on either side
    myline = myline.strip()
    # Break the line into words
    words = myline.split()
    # Iterate over the words list
    for myword in words:
        # Write the results to standard output
        print('%s\t%s' % (myword, 1))

Make sure this file has execute permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).

Reducer phase code

#!/usr/bin/python
import sys

current_word = ""
current_count = 0
word = ""

# Input comes from standard input
for myline in sys.stdin:
    # Remove whitespace on either side
    myline = myline.strip()
    try:
        # Split the input we got from mapper.py
        word, count = myline.split('\t', 1)
        # Convert count variable to integer
        count = int(count)
    except ValueError:
        # Line had no tab or count was not a number, so silently ignore it
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write result to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word:
    print('%s\t%s' % (current_word, current_count))

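Save the mapper and reducer code as mapper.py and reducer.py in the Hadoop home directory. Make sure both files have execute permission (chmod +x mapper.py and chmod +x reducer.py). Python is indentation-sensitive, so keep the indentation of the code intact when copying it.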
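
The permission step can also be checked from Python before a job is submitted. The following is a small sketch for POSIX systems; the helper names are invented for this example, and make_executable sets the execute bit for user, group and others, roughly what chmod a+x does.

```python
import os
import stat
import tempfile


def make_executable(path):
    """Add execute permission for user, group and others (like chmod a+x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def is_runnable_script(path):
    """Return True if the file starts with "#!" and is executable."""
    with open(path, "rb") as handle:
        has_shebang = handle.read(2) == b"#!"
    return has_shebang and os.access(path, os.X_OK)


if __name__ == "__main__":
    # Demonstrate on a throwaway file instead of a real mapper.py.
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "mapper.py")
        with open(script, "w") as handle:
            handle.write("#!/usr/bin/env python\nimport sys\n")
        print(is_runnable_script(script))
        make_executable(script)
        print(is_runnable_script(script))
```

The shebang check matters because a script passed to -mapper or -reducer is launched directly, so the first line has to name its interpreter.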

Running the WordCount program

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input input_dirs \
   -output output_dir \
   -mapper <path>/mapper.py \
   -reducer <path>/reducer.py

Here "\" is used for line continuation, to keep the command readable.

For example:

./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput -mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py

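How Streaming works

In the example above, both the mapper and the reducer are Python scripts that read input from standard input and write output to standard output. The utility creates a Map/Reduce job, submits it to an appropriate cluster, and monitors its progress until it completes.

When a script is specified for the mapper, each mapper task launches the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its input into lines and feeds them to the standard input (STDIN) of the process. Meanwhile, the mapper collects the line-oriented output from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If a line contains no tab character, the whole line is treated as the key and the value is null. This can be customized as needed.

When a script is specified for the reducer, each reducer task launches the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds them to the standard input (STDIN) of the process. Meanwhile, the reducer collects the line-oriented output from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. This can be customized to meet specific requirements.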
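
The exchange between a task and its script can be imitated with Python's subprocess module. The sketch below is a simplified stand-in for what a streaming task does, not Hadoop code: it writes key/value pairs as tab-separated lines to a child process and turns each output line back into a pair. The function name run_streaming_task and the upper-casing child script are invented for the demonstration.

```python
import subprocess
import sys


def run_streaming_task(command, records):
    """Feed (key, value) records to a child process and parse its output.

    Simplified stand-in for a streaming task; not Hadoop code.
    """
    # Key/value pairs become "key<TAB>value" lines on the child's STDIN.
    stdin_text = "".join("%s\t%s\n" % (key, value) for key, value in records)
    result = subprocess.run(command, input=stdin_text,
                            stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    output = []
    for line in result.stdout.splitlines():
        # Each STDOUT line is split at the first tab into key and value.
        key, _, value = line.partition("\t")
        output.append((key, value))
    return output


if __name__ == "__main__":
    # Hypothetical child script: upper-cases every input line.
    child = [sys.executable, "-c",
             "import sys\n"
             "for line in sys.stdin:\n"
             "    sys.stdout.write(line.upper())"]
    print(run_streaming_task(child, [("hadoop", "1"), ("streaming", "2")]))
```

A real task streams records to the process while it runs instead of buffering everything first, so this sketch only conveys the line-based protocol.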

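Important options

Parameter	Description
-input directory/file-name	Input location for the mapper. (Required)
-output directory-name	Output location for the reducer. (Required)
-mapper executable or script or JavaClassName	Mapper executable. (Required)
-reducer executable or script or JavaClassName	Reducer executable. (Required)
-file file-name	Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName	The class you supply should return key/value pairs of the Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName	The class you supply should take key/value pairs of the Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName	Class that determines which reducer a key is sent to.
-combiner streamingCommand or JavaClassName	Combiner executable for map output.
-cmdenv name=value	Passes an environment variable to the streaming commands.
-inputreader	For backward compatibility: specifies a record reader class (instead of an input format class).
-verbose	Verbose output.
-lazyOutput	Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks	Specifies the number of reducers.
-mapdebug	Script to call when a map task fails.
-reducedebug	Script to call when a reduce task fails.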
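
As an illustration of the -cmdenv option, a mapper can read a setting from its environment. In the sketch below, the variable name MIN_WORD_LENGTH is hypothetical and chosen only for this example; it would be supplied on the command line as -cmdenv MIN_WORD_LENGTH=4. The environment is passed in as a parameter so the function can be tried without a cluster.

```python
import os


def filter_words(lines, environ=os.environ):
    """Yield (word, 1) for words at least MIN_WORD_LENGTH characters long.

    MIN_WORD_LENGTH is a made-up variable name for this example;
    when it is not set, every word is kept.
    """
    min_length = int(environ.get("MIN_WORD_LENGTH", "1"))
    for line in lines:
        for word in line.split():
            if len(word) >= min_length:
                yield word, 1


if __name__ == "__main__":
    sample = ["big data on hadoop"]
    # Simulate what -cmdenv MIN_WORD_LENGTH=4 would provide.
    for word, count in filter_words(sample, {"MIN_WORD_LENGTH": "4"}):
        print("%s\t%s" % (word, count))
```

In a real mapper script the function would be called with sys.stdin and the default environment.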

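Reference for this section: http://bbs.csdn.net/topics/390909413

When reposting content from this site, be sure to credit W3xue as the source; violators will be held liable.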
