This post covers installing and configuring Hadoop on Mac OS X, setting up the Eclipse development environment, and packaging a jar to run on the cluster.

Installing the Hadoop environment on Mac

Reference: Installing Hadoop on Mac OS X Yosemite

Setting up aliases

Add aliases for starting and stopping Hadoop to zsh's .zshrc:

alias hstart="/usr/local/Cellar/hadoop/2.7.1/sbin/start-dfs.sh;/usr/local/Cellar/hadoop/2.7.1/sbin/start-yarn.sh"
alias hstop="/usr/local/Cellar/hadoop/2.7.1/sbin/stop-yarn.sh;/usr/local/Cellar/hadoop/2.7.1/sbin/stop-dfs.sh"

And likewise for HBase:

alias hbstart="/usr/local/Cellar/hbase/1.0.1/libexec/bin/start-hbase.sh"
alias hbstop="/usr/local/Cellar/hbase/1.0.1/libexec/bin/stop-hbase.sh"
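
With these aliases in place, bringing the whole stack up or down is a one-liner (assuming the paths above match your install):

hstart && hbstart   # start HDFS and YARN, then HBase
hbstop && hstop     # stop in the reverse order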

Signs of a successful startup

Run the jps command; the following processes should appear:

2821 ResourceManager
2582 DataNode
2918 NodeManager
2968 Jps
2699 SecondaryNameNode
2493 NameNode

Use the following command to test whether Hadoop jobs run correctly:

hadoop jar /usr/local/Cellar/hadoop/2.7.1/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 5

Web interfaces for monitoring jobs

MapReduce jobs can be monitored through the following web interfaces:

http://localhost:8088   Cluster Status (very useful; this port is used since Hadoop 2.2, earlier versions used 50030)
http://localhost:50070/ HDFS status
http://localhost:50090  SecondaryNameNode



Installing and configuring the Eclipse plugin

Download link (2.7.0)

http://download.csdn.net/detail/yew1eb/8716117

Plugin settings

In Preferences -> Hadoop Map/Reduce, set Hadoop installation directory to:

/usr/local/Cellar/hadoop/2.7.1/libexec
# varies by machine; this is the path when Hadoop is installed via Homebrew
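
If you are unsure of the path, Homebrew can print it; a quick check (the exact prefix depends on your setup):

brew --prefix hadoop
# prints something like /usr/local/opt/hadoop; the plugin wants the libexec directory under it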

Opening the Map/Reduce perspective

Window -> Open Perspective -> Other -> Map/Reduce

Adding a Hadoop location

Right-click in the Map/Reduce Locations tab at the bottom and choose New Hadoop location.
Configure:

  • Location name: anything you like, e.g. hadoop
  • Map/Reduce Master: use the port configured in

    /usr/local/Cellar/hadoop/2.7.1/libexec/etc/hadoop/mapred-site.xml
  • DFS Master: use the port configured in (see the example below)

    /usr/local/Cellar/hadoop/2.7.1/libexec/etc/hadoop/core-site.xml
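
For reference, in a typical pseudo-distributed setup core-site.xml holds the NameNode address that the DFS Master fields should match; a minimal sketch (your host and port may differ):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

With this value, the DFS Master would be localhost, port 9000.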

Verifying the configuration

Expand DFS Locations -> hadoop (the location name configured in the previous step) in the left pane. If you can see the user directory, the setup succeeded (make sure Hadoop is already running).



A hands-on example: WordCount

Creating the WordCount project

File -> Project, select Map/Reduce Project, and enter the project name WordCount.

Create a new class named WordCount inside the WordCount project, with the following code:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
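// Mapper: tokenizes each line of input and emits (word, 1) for every token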
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
  }
}
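// Reducer (also set as the combiner): sums the counts emitted for each word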
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
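// Driver: parses input/output paths, configures the job, and submits it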
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
  }
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Creating the input directory

hadoop fs -mkdir /user
hadoop fs -mkdir /user/input
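
The two calls can also be combined with -p, which creates any missing parent directories:

hadoop fs -mkdir -p /user/input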

Copy the local README.txt into the input directory on HDFS:

hadoop fs -copyFromLocal /opt/hadoop/README.txt /user/input
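
To confirm the upload, list the directory (size and timestamp will differ on your machine):

hadoop fs -ls /user/input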

Configuring run arguments

Right-click WordCount.java, choose Run As -> Run Configurations, and fill in the program Arguments with the input and output folders (the output folder must not exist beforehand):

  hdfs://localhost:9000/user/input hdfs://localhost:9000/user/output

Then click Run.

Viewing the results

Run hadoop fs -ls output; you should see two outputs, _SUCCESS and part-r-00000.
Then run hadoop fs -cat output/* to print the word counts.

Alternatively, expand DFS Locations and double-click part-r-00000 to view the results.



Packaging the Eclipse program to run on the cluster

Running the program directly from Eclipse does not show the job at localhost:8088, because Eclipse is not configured with the cluster environment; the job runs locally instead.

Building the jar

Right-click the project -> Export… -> Java -> JAR file -> Next -> choose where to save the jar -> Finish

Running

hadoop jar crawler.jar MR_Crawler hdfs://localhost:9000/user/input/urls hdfs://localhost:9000/user/output
# MR_Crawler is used directly here because the class has no package; if it is in a package, prefix the class with the package name (see the example below).
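
For instance, if the class lived in a package such as com.example (a hypothetical name here), the fully qualified class name must be given:

hadoop jar crawler.jar com.example.MR_Crawler hdfs://localhost:9000/user/input/urls hdfs://localhost:9000/user/output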


Notes

  • Before starting the daemons, a freshly installed HDFS must be formatted, which creates the storage directories and initializes the metadata for a new, empty file system: run hdfs namenode -format (see the sketch after this list).
  • The output directory must not already exist.
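
Putting it together, the first-run sequence looks roughly like this (a sketch using the aliases defined earlier; format only once, since it wipes any existing HDFS data):

hdfs namenode -format
hstart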