Summary: This article walks through deploying Hadoop on a single machine with Docker, covering environment preparation, Docker image building, container configuration, and Hadoop service startup end to end. It is aimed at developers and enterprise users who need to stand up a test environment quickly.

Hadoop is a benchmark framework for distributed computing and is widely used in big-data processing. Deploying it on physical machines, however, is resource-hungry and configuration-heavy. Docker's lightweight containerization offers an efficient alternative for single-machine Hadoop testing. This article systematically covers how to deploy a Hadoop cluster on a single machine with Docker, from environment preparation and image building to container orchestration and service verification.
Two commonly used base images:

- `sequenceiq/hadoop-docker` (based on CentOS 6, no longer maintained)
- `bde2020/hadoop-base` (Ubuntu 20.04 base image, continuously updated)
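If an off-the-shelf image suffices, it can be pulled and inspected before deciding to build a custom one. A quick sketch (tag usage follows the update commands later in this article):

```bash
# Pull the prebuilt base image and peek at its environment variables
docker pull bde2020/hadoop-base:latest
docker inspect --format '{{.Config.Env}}' bde2020/hadoop-base:latest
```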
```dockerfile
# Example: build a Hadoop image on Ubuntu
FROM ubuntu:20.04
ENV HADOOP_VERSION=3.3.4
ENV HADOOP_HOME=/usr/local/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
RUN apt-get update && \
    apt-get install -y openjdk-8-jdk wget ssh && \
    wget https://downloads.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz && \
    tar -xzvf hadoop-$HADOOP_VERSION.tar.gz -C /usr/local/ && \
    ln -s /usr/local/hadoop-$HADOOP_VERSION $HADOOP_HOME && \
    rm hadoop-$HADOOP_VERSION.tar.gz
COPY core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml $HADOOP_HOME/etc/hadoop/
```
Key configuration properties (the sketch below shows one way to generate the XML files the Dockerfile copies):

- `fs.defaultFS=hdfs://localhost:9000`
- `dfs.replication=1` (single-node mode)
- `mapreduce.framework.name=yarn`
- `yarn.nodemanager.aux-services=mapreduce_shuffle`

Build tips:

- Use a `.dockerignore` file to keep irrelevant files out of the build context
- Build the image: `docker build -t hadoop:3.3.4-ubuntu .`
- Check for vulnerabilities with `docker scan`
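For completeness, here is a minimal sketch that generates the four config files named in the `COPY` line, each carrying just the single-node property listed above:

```bash
# Sketch: generate the four config files the Dockerfile COPYs; each file
# holds only the key property this article sets for single-node use.
cat > core-site.xml <<'EOF'
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
</configuration>
EOF
cat > hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>
EOF
cat > mapred-site.xml <<'EOF'
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
</configuration>
EOF
cat > yarn-site.xml <<'EOF'
<configuration>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
</configuration>
EOF
```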
```yaml
# docker-compose.yml example
version: '3.8'
services:
  hadoop:
    image: hadoop:3.3.4-ubuntu
    container_name: hadoop-master
    hostname: hadoop-master
    ports:
      - "9870:9870"    # NameNode web UI (Hadoop 3.x; was 50070 in 2.x)
      - "9864:9864"    # DataNode web UI (Hadoop 3.x; was 50075 in 2.x)
      - "8088:8088"    # ResourceManager UI
      - "9000:9000"    # HDFS IPC port (fs.defaultFS)
    volumes:
      - ./hadoop_data:/tmp/hadoop
    environment:
      - HADOOP_HOME=/usr/local/hadoop
    command: bash -c "service ssh start && $$HADOOP_HOME/sbin/start-dfs.sh && $$HADOOP_HOME/sbin/start-yarn.sh && tail -f /dev/null"
```
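After `docker-compose up -d`, a quick smoke test is worth running; this assumes the port mappings above (9870 is the Hadoop 3.x NameNode UI port):

```bash
# Confirm the container is up and the web UIs respond
docker-compose ps
curl -s http://localhost:9870 >/dev/null && echo "NameNode UI up"
curl -s http://localhost:8088/cluster >/dev/null && echo "ResourceManager UI up"
docker logs --tail 20 hadoop-master   # recent startup output
```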
Initialize passwordless SSH login:
```dockerfile
# Add to the Dockerfile
RUN ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
    chmod 600 ~/.ssh/authorized_keys
```
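A quick way to verify the key setup; Hadoop's `start-dfs.sh`/`start-yarn.sh` rely on SSH to localhost even on a single node:

```bash
# Should print "ssh-ok" without prompting for a password
docker exec hadoop-master ssh -o StrictHostKeyChecking=no localhost echo ssh-ok
```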
Format the NameNode:
```bash
# Single quotes keep $HADOOP_HOME from being expanded by the host shell
docker exec -it hadoop-master bash -c '$HADOOP_HOME/bin/hdfs namenode -format'
```
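Formatting wipes existing metadata, so re-running it against a populated volume loses data. A guarded sketch, assuming `dfs.namenode.name.dir` resolves to `/tmp/hadoop/dfs/name` as in the volume mappings used later in this article:

```bash
# Only format if no NameNode metadata exists yet (the path is an
# assumption based on this article's volume layout)
docker exec hadoop-master bash -c \
  'test -d /tmp/hadoop/dfs/name/current || $HADOOP_HOME/bin/hdfs namenode -format -nonInteractive'
```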
Start the cluster services:
```bash
docker-compose up -d
# Or start step by step (bash -c with single quotes so $HADOOP_HOME
# expands inside the container, not on the host)
docker exec hadoop-master bash -c '$HADOOP_HOME/sbin/start-dfs.sh'
docker exec hadoop-master bash -c '$HADOOP_HOME/sbin/start-yarn.sh'
```
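To confirm the daemons came up, list the Java processes inside the container:

```bash
# Expect NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
docker exec hadoop-master jps
```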
HDFS file operations:
```bash
docker exec hadoop-master bash -c 'echo "Hello Hadoop" > test.txt && $HADOOP_HOME/bin/hdfs dfs -put test.txt /'
```
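Reading the file back confirms the round trip:

```bash
# List the HDFS root and print the uploaded file's contents
docker exec hadoop-master bash -c '$HADOOP_HOME/bin/hdfs dfs -ls / && $HADOOP_HOME/bin/hdfs dfs -cat /test.txt'
```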
Run a MapReduce example:
```bash
docker exec hadoop-master bash -c '$HADOOP_HOME/bin/hdfs dfs -mkdir /input && \
  $HADOOP_HOME/bin/hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /input && \
  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar grep /input /output "dfs[a-z.]+"'
```
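The job writes its results to `/output`, which must not exist beforehand. To inspect them:

```bash
# The grep example writes (count, match) pairs to part files under /output
docker exec hadoop-master bash -c '$HADOOP_HOME/bin/hdfs dfs -cat /output/part-r-00000'
```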
Performance tuning:

- Set `HADOOP_HEAPSIZE=2048` in `hadoop-env.sh`
- Set `mapreduce.job.maps=2` in `mapred-site.xml`
- Point `dfs.datanode.data.dir` at fast storage

Common issues:

- Port conflicts: check with `netstat -tulnp` and adjust the port mappings in `docker-compose.yml`
- Permission errors: loosen HDFS permissions (`$HADOOP_HOME/bin/hdfs dfs -chmod 777 /`; test environments only)
- Out-of-memory errors: raise `HADOOP_HEAPSIZE` to 4096 (adjust to your physical memory)

To scale out, add worker nodes by editing `docker-compose.yml`:
```yaml
services:
  hadoop-master:
    # master node configuration ...
  hadoop-worker1:
    image: hadoop:3.3.4-ubuntu
    command: bash -c "service ssh start && tail -f /dev/null"
    depends_on:
      - hadoop-master
```
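The master also has to know about the new node. On Hadoop 3.x the worker hostnames go into `$HADOOP_HOME/etc/hadoop/workers`, which the start scripts read when fanning out over SSH; a sketch:

```bash
# Register the worker with the start scripts (Hadoop 3.x uses the
# "workers" file; Hadoop 2.x called it "slaves")
docker exec hadoop-master bash -c 'echo hadoop-worker1 >> $HADOOP_HOME/etc/hadoop/workers'
```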
Add a Spark service:
```dockerfile
RUN wget https://archive.apache.org/dist/spark/3.3.0/spark-3.3.0-bin-hadoop3.tgz && \
    tar -xzvf spark-3.3.0-bin-hadoop3.tgz -C /usr/local/ && \
    ln -s /usr/local/spark-3.3.0-bin-hadoop3 /usr/local/spark
```
Configure spark-defaults.conf:
```properties
spark.master            yarn
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://localhost:9000/spark-logs
```
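To verify Spark-on-YARN end to end, a hedged sketch using the SparkPi example bundled with the Spark 3.3.0 distribution (it assumes the install path above and creates the event-log directory that `spark.eventLog.dir` points at):

```bash
# Create the event-log dir, then submit SparkPi to YARN;
# HADOOP_CONF_DIR tells spark-submit where the cluster config lives
docker exec hadoop-master bash -c 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop && \
  $HADOOP_HOME/bin/hdfs dfs -mkdir -p /spark-logs && \
  /usr/local/spark/bin/spark-submit --master yarn \
  --class org.apache.spark.examples.SparkPi \
  /usr/local/spark/examples/jars/spark-examples_2.12-3.3.0.jar 10'
```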
Persist NameNode metadata and DataNode blocks with named volumes:

```yaml
volumes:
  - hadoop_namenode:/tmp/hadoop/dfs/name
  - hadoop_datanode:/tmp/hadoop/dfs/data
```

(Named volumes must also be declared under a top-level `volumes:` key in `docker-compose.yml`.)
Periodic backups:
```bash
docker exec hadoop-master bash -c '$HADOOP_HOME/bin/hdfs dfsadmin -report' > hdfs_report.txt
```
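The report above records cluster status rather than data; for an actual metadata backup, one option is to pull the latest fsimage checkpoint from the NameNode. A sketch, assuming a healthy running NameNode:

```bash
# Download the most recent fsimage checkpoint, then copy it to the host
docker exec hadoop-master bash -c 'mkdir -p /tmp/fsimage_backup && \
  $HADOOP_HOME/bin/hdfs dfsadmin -fetchImage /tmp/fsimage_backup'
docker cp hadoop-master:/tmp/fsimage_backup ./fsimage_backup
```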
Log management:
```yaml
volumes:
  - ./logs:/usr/local/hadoop/logs
```
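With the bind mount in place, daemon logs can be followed from the host. The filename below is an assumption based on Hadoop's `hadoop-<user>-<daemon>-<hostname>.log` naming convention:

```bash
# Hypothetical filename: adjust user/daemon/hostname to your setup
tail -f ./logs/hadoop-root-namenode-hadoop-master.log
```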
Image updates:
```bash
docker pull bde2020/hadoop-base:latest
docker tag bde2020/hadoop-base:latest hadoop:3.3.4-ubuntu
```
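After refreshing the image, recreating the container picks up the update while the volumes preserve HDFS data; a sketch assuming the compose file from earlier:

```bash
# Recreate the container from the updated image; named volumes keep the data
docker-compose up -d --force-recreate
```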
With a Docker-containerized single-node Hadoop environment, a developer can go from setup to functional verification in about ten minutes. The approach is particularly well suited to quick test environments, learning, and lightweight development.

For real production workloads, orchestrating a multi-container cluster with Kubernetes is the recommended path, but the single-machine Docker approach remains an ideal choice for learning Hadoop internals and for lightweight development. As Hadoop moves toward 4.0, Docker-based hybrid deployment architectures are worth exploring.