Our company runs a big data cluster on Tencent Cloud, deployed and managed with HDP's Ambari. HDP bundles the common big data components such as Hadoop, Hive, and Spark, and all of the company's reports are generated from this cluster.
As the business grew, the volume of historical data shot up: over 100 TB in half a year, and still growing by several TB a day. Deleting it would be a waste, so we needed somewhere to back it up. Object storage came to mind. A look at the official docs (https://cloud.tencent.com/document/product/436/6884) showed that Hadoop jobs can run directly on Tencent Cloud Object Storage (COS). Impressive! Time to try it out.
Deployment
The docs describe the deployment and configuration process in good detail; here is a quick list of the steps:
1. Create a bucket in Tencent Cloud COS. Note that the bucket must be created in the same region as your big data CVM instances, so that uploads and downloads go over the internal network free of charge. If the regions differ, downloads will incur traffic fees.
2. With the bucket in place, download the latest pre-built versions of the two libraries from the GitHub address given in the docs (https://github.com/tencentyun/hadoop-cos/tree/master/dep). My Hadoop version is 2.7.3, so I downloaded cos_hadoop_api-5.2.6.jar and hadoop-cos-2.7.3.jar (if there is no build for your version, you can also compile from source). These two libraries must be copied to every machine that runs Hadoop jobs, all under the same path, which will be referenced later.
3. Modify your Hadoop cluster configuration as described in the official docs. For this step, many of you may not be free to change the production configuration; in that case you can generate a separate configuration just for object storage and point to it when running Hadoop jobs, for example: hadoop fs -conf ./core-site.xml -libjars ./cos_hadoop_api-5.2.6.jar,./hadoop-cos-2.7.3.jar -ls / . A sketch of the COSN-related properties follows this list.
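For reference, a minimal core-site.xml sketch with the COSN-related properties looks roughly like the following. Property names can differ between hadoop-cos versions, and the secretId, secretKey, and region values here are placeholders, so treat this as an outline and follow the linked docs for your exact plugin version:

<configuration>
  <!-- COS account credentials; the values below are placeholders -->
  <property>
    <name>fs.cosn.userinfo.secretId</name>
    <value>AKIDxxxxxxxxxxxxxxxx</value>
  </property>
  <property>
    <name>fs.cosn.userinfo.secretKey</name>
    <value>xxxxxxxxxxxxxxxx</value>
  </property>
  <!-- Region of the bucket; must match your CVM region to keep traffic on the internal network -->
  <property>
    <name>fs.cosn.bucket.region</name>
    <value>ap-shanghai</value>
  </property>
  <!-- Register the COSN filesystem implementation for cosn:// URIs -->
  <property>
    <name>fs.cosn.impl</name>
    <value>org.apache.hadoop.fs.CosFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.cosn.impl</name>
    <value>org.apache.hadoop.fs.CosN</value>
  </property>
</configuration>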
With these three steps, the deployment took less than half an hour in total. Run hadoop fs -ls / to try it out: it lists the files in my bucket. So easy!
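To smoke-test a bit more than a plain listing, a few basic filesystem operations can be run against the bucket directly. The bucket name below is a made-up placeholder; substitute your own:

hadoop fs -mkdir -p cosn://examplebucket-xxxxxxxxxx/mr/input
hadoop fs -put ./LICENSE.txt cosn://examplebucket-xxxxxxxxxx/mr/input/
hadoop fs -ls cosn://examplebucket-xxxxxxxxxx/mr/input
hadoop fs -cat cosn://examplebucket-xxxxxxxxxx/mr/input/LICENSE.txt | head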
Running MapReduce jobs
However, when running the example MapReduce job that ships with Hadoop, hadoop jar ./mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount cosn://example/mr/input cosn://example/mr/output, it failed with "Class org.apache.hadoop.fs.CosFileSystem not found".
A search online suggested that, to run MapReduce jobs on cloud object storage, you also need to change the contents of the mapreduce.tar.gz archive at /hdp/apps/2.6.xxxx/mapreduce/mapreduce.tar.gz under the HDP installation: put the two COSN libraries into that archive, replacing the original file, and jobs then run fine. The drawback is that every time Ambari applies a configuration change and restarts the cluster, mapreduce.tar.gz is replaced with the stock one again. This treats the symptom, not the root cause.
The workaround above did point me in the right direction, though: the MapReduce job configuration must be missing something, which is why the COS libraries cannot be found. After digging into mapred-site.xml, adding the COSN libraries to mapreduce.application.classpath fixed it.
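Concretely, the fix is just to append the COSN jars to mapreduce.application.classpath in mapred-site.xml. A sketch, assuming the two jars were copied to /usr/hdp/current/hadoop-client/lib/ on every node (adjust to wherever you actually placed them) and keeping whatever entries your cluster already has in front:

<property>
  <name>mapreduce.application.classpath</name>
  <!-- keep the existing classpath entries and append the two COSN jars -->
  <value>...existing entries...:/usr/hdp/current/hadoop-client/lib/hadoop-cos-2.7.3.jar:/usr/hdp/current/hadoop-client/lib/cos_hadoop_api-5.2.6.jar</value>
</property>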
Tested OK: even after changing the cluster configuration and restarting the cluster, MapReduce jobs still run normally:
[hdfs@hadoop01 ~]$ hadoop jar /usr/hdp/2.6.xxx/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount cosn://testshhadoop-xxxxxxx/input/ cosn://testshhadoop-xxxxxxxx/example/output/
19/02/22 10:44:12 INFO client.RMProxy: Connecting to ResourceManager at hadoop03.cos.qcloud.com/xxxxxxx:8050
19/02/22 10:44:12 INFO client.AHSProxy: Connecting to Application History server at hadoop04.cos.qcloud.com/xxxxxxx:10200
19/02/22 10:44:13 INFO fs.CosNativeFileSystemStore: store file input stream md5 hash content length
19/02/22 10:44:13 INFO fs.CosFsDataOutputStream: OutputStream for key '/user/hdfs/.staging/job_1550379850904_0007/job.jar' upload complete
19/02/22 10:44:14 INFO input.FileInputFormat: Total input paths to process : 1
19/02/22 10:44:14 INFO fs.CosNativeFileSystemStore: store file input stream md5 hash content length
19/02/22 10:44:14 INFO fs.CosFsDataOutputStream: OutputStream for key '/user/hdfs/.staging/job_1550379850904_0007/job.split' upload complete
19/02/22 10:44:14 INFO fs.CosNativeFileSystemStore: store file input stream md5 hash content length
19/02/22 10:44:14 INFO fs.CosFsDataOutputStream: OutputStream for key '/user/hdfs/.staging/job_1550379850904_0007/job.splitmetainfo' upload complete
19/02/22 10:44:14 INFO mapreduce.JobSubmitter: number of splits:1
19/02/22 10:44:14 INFO fs.CosNativeFileSystemStore: store file input stream md5 hash content length
19/02/22 10:44:14 INFO fs.CosFsDataOutputStream: OutputStream for key '/user/hdfs/.staging/job_1550379850904_0007/job.xml' upload complete
19/02/22 10:44:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1550379850904_0007
19/02/22 10:44:14 INFO impl.YarnClientImpl: Submitted application application_1550379850904_0007
19/02/22 10:44:14 INFO mapreduce.Job: The url to track the job: http://hadoop03.cos.qcloud.com:8088/proxy/application_1550379850904_0007/
19/02/22 10:44:14 INFO mapreduce.Job: Running job: job_1550379850904_0007
19/02/22 10:44:20 INFO mapreduce.Job: Job job_1550379850904_0007 running in uber mode : false
19/02/22 10:44:20 INFO mapreduce.Job: map 0% reduce 0%
19/02/22 10:44:26 INFO mapreduce.Job: map 100% reduce 0%
19/02/22 10:44:31 INFO mapreduce.Job: map 100% reduce 100%
19/02/22 10:44:34 INFO mapreduce.Job: Job job_1550379850904_0007 completed successfully
19/02/22 10:44:34 INFO mapreduce.Job: Counters: 49
    File System Counters
        COSN: Number of bytes read=14408
        COSN: Number of bytes written=55
        COSN: Number of read operations=0
        COSN: Number of large read operations=0
        COSN: Number of write operations=0
        FILE: Number of bytes read=89
        FILE: Number of bytes written=303554
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Rack-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=9808
        Total time spent by all reduces in occupied slots (ms)=5824
        Total time spent by all map tasks (ms)=2452
        Total time spent by all reduce tasks (ms)=2912
        Total vcore-milliseconds taken by all map tasks=2452
        Total vcore-milliseconds taken by all reduce tasks=2912
        Total megabyte-milliseconds taken by all map tasks=10043392
        Total megabyte-milliseconds taken by all reduce tasks=5963776
    Map-Reduce Framework
        Map input records=3
        Map output records=9
        Map output bytes=89
        Map output materialized bytes=89
        Input split bytes=117
        Combine input records=9
        Combine output records=7
        Reduce input groups=7
        Reduce shuffle bytes=89
        Reduce input records=7
        Reduce output records=7
        Spilled Records=14
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=74
        CPU time spent (ms)=1840
        Physical memory (bytes) snapshot=593371136
        Virtual memory (bytes) snapshot=9280081920
        Total committed heap usage (bytes)=462946304
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=4994
    File Output Format Counters
        Bytes Written=55
With that in place, we can use Hadoop's built-in distcp to happily back up our company's historical data in bulk to Tencent Cloud Object Storage. And when we want to analyze that data, there is no need to pull it back into local HDFS; we can run the analysis directly against the data in the cloud.
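For example, a backup run could look like the following, where the HDFS path, bucket name, and map count are placeholders for illustration; -m controls how many copy maps run in parallel, and -update can be added for incremental re-runs:

hadoop distcp -m 20 hdfs:///warehouse/history/2018 cosn://examplebucket-xxxxxxxxxx/backup/history/2018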