Kingsoft Cloud Documentation Center: Introduction to MapReduce

Last updated: 2024-01-16 16:40:55

Introduction to MapReduce

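MapReduce is a programming model for parallel computation over large datasets (larger than 1 TB). Its two central concepts, "Map" and "Reduce", are borrowed from functional programming languages, together with some features taken from vector programming languages. The model lets programmers with no experience in distributed parallel programming run their programs on a distributed system. In current implementations, you specify a Map function that turns a set of key-value pairs into a new set of intermediate key-value pairs, and a Reduce function, run concurrently, that merges all intermediate values sharing the same key.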
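The model described above can be sketched in a few lines of plain Python. This is only an in-memory illustration under simplifying assumptions: everything runs sequentially in one process, and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for this example and are not part of Hadoop.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all values that share one key into a single total."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce sequentially in memory."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    # Reduce each key once, in sorted key order
    return [reducer(key, groups[key]) for key in sorted(groups)]


lines = ["the quick brown fox", "the lazy dog", "the fox"]
records = list(enumerate(lines))  # (line number, line text) input pairs
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

In a real cluster the map calls run in parallel on different input splits and the grouping step moves data across the network, but the contract between the two functions is the same as in this sketch.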

Managing MR jobs with a Maven project

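As a project grows, it becomes complex and hard to manage. A software project management tool such as Maven helps keep it under control. The steps are as follows.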

1. Make sure Maven is installed locally.

2. Open the project in your IDE and edit pom.xml, adding the following inside the dependencies element.


<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
</dependency>

3. WordCount example code (the code below comes from the official Hadoop website).


import java.io.IOException;
import java.util.StringTokenizer;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 
public class WordCount2 {
 
    // Mapper: splits each input line into tokens and emits (word, 1) per token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        // Reused for every token to avoid allocating a new Text per record
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
 
    // Reducer: sums all counts received for one word and emits (word, total).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
 
    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount2.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as the combiner to pre-aggregate map output
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // The output directory must not exist yet, or the job fails at submit
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit code 0 on success, 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
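The driver above registers IntSumReducer as both the combiner and the reducer. That is safe here because addition is associative and commutative, so partial sums computed on the map side give the same final totals. The following Python sketch illustrates the effect on a toy input; it is not Hadoop code, the function names and the two input splits are invented for the example, and a real framework may run the combiner zero, one, or several times.

```python
from collections import Counter


def map_split(lines):
    """Mapper output for one input split: one (word, 1) pair per token."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: pre-sum counts locally, before anything is shuffled."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_all(pairs):
    """Reducer: sum the counts for each word across all splits."""
    return dict(combine(pairs))


# Two made-up input splits, each handled by its own map task
splits = [["a b a a", "b a"], ["a c", "c c a"]]
raw = [map_split(s) for s in splits]
combined = [combine(p) for p in raw]

# Number of records that would cross the network in each case
shuffled_without = sum(len(p) for p in raw)
shuffled_with = sum(len(p) for p in combined)
totals_without = reduce_all([kv for p in raw for kv in p])
totals_with = reduce_all([kv for p in combined for kv in p])
print(shuffled_without, shuffled_with, totals_with)
```

The final totals are identical with and without the combiner; only the number of intermediate records changes. A reducer whose operation is not associative and commutative (an average, for example) cannot be reused as a combiner this way.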

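4. Compile, package, and upload.
Run the following command in the project directory. A file named wordcountv2-1.0-SNAPSHOT.jar then appears in the project's target directory; upload this JAR to the server.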

mvn clean package -DskipTests

5. Job input and output.
The input and output files of a Hadoop job can be stored on HDFS or on KS3.

(1) Using HDFS

Put the input file on HDFS; this example assumes it is named TWILIGHT.txt. First switch to the hdfs user with su hdfs.

hadoop fs -mkdir -p /user/hadoop/examples/input
hadoop fs -put TWILIGHT.txt /user/hadoop/examples/input

(2) Using KS3

hadoop fs -mkdir -p ks3://kmrtest9/wordcount/lib
hadoop fs -mkdir -p ks3://kmrtest9/wordcount/input
hadoop fs -put wordcountv2-1.0-SNAPSHOT.jar ks3://kmrtest9/wordcount/lib
hadoop fs -put TWILIGHT.txt ks3://kmrtest9/wordcount/input

Submitting the job

(1) Input and output on HDFS:

hadoop jar wordcountv2-1.0-SNAPSHOT.jar WordCount2 /user/hadoop/examples/input/ /user/hadoop/examples/output

(2) Input and output on KS3:

hadoop jar wordcountv2-1.0-SNAPSHOT.jar WordCount2 ks3://kmrtest9/wordcount/input/ ks3://kmrtest9/wordcount/output
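Once a job finishes, the WordCount2 output directory holds text files with one "word, tab, count" record per line, since the driver leaves the default text output format in place. As a quick sanity check, a small script along these lines could list the most frequent words. The helper names and the sample lines are made up for illustration and are not real job output.

```python
def parse_counts(lines):
    """Parse 'word<TAB>count' lines into (word, count) pairs, skipping bad lines."""
    pairs = []
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        # Keep only lines that contain a tab and a non-negative integer count
        if sep and count.strip().isdigit():
            pairs.append((word, int(count)))
    return pairs


def top_words(pairs, n=3):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(pairs, key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical sample standing in for the job output; the last line is malformed
sample = ["and\t3\n", "bella\t7\n", "the\t12\n", "broken line\n"]
print(top_words(parse_counts(sample), 2))
```

To run it against real output, one option is to replace `sample` with `sys.stdin` (after `import sys`) and pipe in the result files, for example with `hadoop fs -cat /user/hadoop/examples/output/part-r-*`.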

Building MR jobs with Hadoop Streaming

1. Hadoop Streaming lets you use any executable command or script as the mapper and the reducer. The examples below show how to use it. For details, see the Hadoop Streaming page on the official Hadoop website.

2. Use Python scripts as the mapper and the reducer.

mapper.py

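#!/usr/bin/env python
# Streaming mapper: read text lines from stdin and emit "word<TAB>1" per word.
import sys

for line in sys.stdin:
    # split() with no argument splits on any whitespace and drops empty tokens
    for word in line.strip().split():
        # print() with one parenthesized argument runs on Python 2 and Python 3
        print("%s\t%s" % (word, 1))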

reducer.py

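#!/usr/bin/env python
# Streaming reducer: input arrives sorted by key, so equal words are adjacent.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    try:
        # Each input line is "word<TAB>count", as written by mapper.py
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip lines with no tab or with a non-numeric count
        continue
    if current_word == word:
        current_count += count
    else:
        # Key changed: emit the finished word before starting the next one
        if current_word is not None:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the last word; nothing is printed when the input was empty
if current_word is not None:
    print("%s\t%s" % (current_word, current_count))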

3. Run the streaming job from the command line.

chmod +x mapper.py
chmod +x reducer.py
hadoop jar /usr/hdp/2.6.1.0-129/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.1.0-129.jar \
    -input /user/hadoop/examples/input/ \
    -output /user/hadoop/examples/output2 \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
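Before submitting a streaming job to the cluster, it can help to check the mapper and reducer logic locally. Hadoop Streaming sorts the mapper output by key before handing it to the reducer, which a shell pipeline such as `cat TWILIGHT.txt | ./mapper.py | sort | ./reducer.py` approximates. The sketch below does the same in a single Python process; the functions are re-implementations written for this example (the reducer uses `itertools.groupby` instead of the manual loop in reducer.py), not the scripts themselves.

```python
from itertools import groupby


def mapper(lines):
    """Same logic as mapper.py: emit 'word<TAB>1' for every token."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word, 1)


def reducer(sorted_lines):
    """Same result as reducer.py: sum the counts of adjacent equal keys."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_local(lines):
    """Emulate the streaming pipeline: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


output = run_local(["to be or not", "to be"])
print("\n".join(output))
```

The sort step matters: both this reducer and reducer.py only merge counts for keys that arrive next to each other, so feeding them unsorted mapper output would produce duplicate entries for the same word.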
