Reading and Writing BOS with DataX
Last updated: 2024-08-15
DataX
DataX is an offline synchronization tool for heterogeneous data sources. It provides stable and efficient data synchronization between a wide range of heterogeneous sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
Configuration
- Download and extract DataX.
- Download BOS-HDFS, extract it, and copy the jar package into plugin/reader/hdfsreader/libs/ and plugin/writer/hdfswriter/libs/ under the DataX installation directory.
- Open the bin/datax.py script under the DataX installation directory and change the CLASS_PATH variable to the following:
CLASS_PATH = ("%s/lib/*:%s/plugin/reader/hdfsreader/libs/*:%s/plugin/writer/hdfswriter/libs/*:.") % (DATAX_HOME, DATAX_HOME, DATAX_HOME)
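For reference, this is how the modified `CLASS_PATH` expands at runtime, now including the BOS-HDFS jar directories from the previous step. This is a minimal sketch: `DATAX_HOME` is set to a hypothetical install path here, whereas the real `datax.py` derives it from the script's own location.

```python
# Sketch of the modified CLASS_PATH expansion.
# DATAX_HOME is a hypothetical install path; datax.py normally
# derives it from the location of the script itself.
DATAX_HOME = "/opt/datax"

CLASS_PATH = ("%s/lib/*:%s/plugin/reader/hdfsreader/libs/*:"
              "%s/plugin/writer/hdfswriter/libs/*:.") % (
    DATAX_HOME, DATAX_HOME, DATAX_HOME)

# The classpath now covers DataX's own libs plus the reader/writer
# libs directories where the BOS-HDFS jar was copied.
print(CLASS_PATH)
```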
Getting Started
Example
Read the testfile object from the {your bucket} bucket and write it to the {your other bucket} bucket.
testfile:
1 hello
2 bos
3 world
bos2bos.json:
{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [{
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "path": "/testfile",
                    "defaultFS": "bos://{your bucket}/",
                    "column": [
                        {
                            "index": 0,
                            "type": "long"
                        },
                        {
                            "index": 1,
                            "type": "string"
                        }
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.bos.endpoint": "bj.bcebos.com",
                        "fs.bos.impl": "org.apache.hadoop.fs.bos.BaiduBosFileSystem",
                        "fs.bos.access.key": "{your ak}",
                        "fs.bos.secret.access.key": "{your sk}"
                    },
                    "fieldDelimiter": " "
                }
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "path": "/testtmp",
                    "fileName": "testfile.new",
                    "defaultFS": "bos://{your other bucket}/",
                    "column": [
                        {
                            "name": "col1",
                            "type": "string"
                        },
                        {
                            "name": "col2",
                            "type": "string"
                        }
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.defaultFS": "",
                        "fs.bos.endpoint": "bj.bcebos.com",
                        "fs.bos.impl": "org.apache.hadoop.fs.bos.BaiduBosFileSystem",
                        "fs.bos.access.key": "{your ak}",
                        "fs.bos.secret.access.key": "{your sk}"
                    },
                    "fieldDelimiter": " ",
                    "writeMode": "append"
                }
            }
        }]
    }
}
Replace {your bucket}, endpoint, {your sk}, and the other placeholder options in the configuration as needed.
Configuring BOS for only the reader or only the writer is also supported.
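For instance, to use BOS on the reader side only, keep the hdfsreader section from the example above and point the writer at an ordinary HDFS cluster instead. A sketch of such a writer section (the namenode address is a placeholder you must fill in):

```json
"writer": {
    "name": "hdfswriter",
    "parameter": {
        "path": "/testtmp",
        "fileName": "testfile.new",
        "defaultFS": "hdfs://{your namenode}:8020",
        "column": [
            { "name": "col1", "type": "string" },
            { "name": "col2", "type": "string" }
        ],
        "fileType": "text",
        "encoding": "UTF-8",
        "fieldDelimiter": " ",
        "writeMode": "append"
    }
}
```

Since this writer targets plain HDFS, the BOS-specific `hadoopConfig` entries (`fs.bos.*`) are only needed on the reader side.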
Result
python bin/datax.py bos2bos.json
On successful execution, the command returns: