Reading and Writing BOS with DataX
Last updated: 2024-05-23
DataX
DataX is an offline data synchronization tool for heterogeneous data sources. It provides stable and efficient data synchronization between a wide range of sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
Configuration
- Download and extract DataX.
- Download and extract BOS-HDFS, then copy the jar into plugin/reader/hdfsreader/libs/ and plugin/writer/hdfswriter/libs/ under the DataX installation directory (see the sketch after these steps).
- Open the bin/datax.py script in the DataX installation directory and change its CLASS_PATH variable to the following:
CLASS_PATH = ("%s/lib/*:%s/plugin/reader/hdfsreader/libs/*:%s/plugin/writer/hdfswriter/libs/*:.") % (DATAX_HOME, DATAX_HOME, DATAX_HOME)
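The CLASS_PATH line above is the one to substitute into datax.py. For the jar-copy step, the following is a minimal Python sketch, assuming DataX was extracted to /opt/datax and BOS-HDFS to /opt/bos-hdfs (both paths and the *.jar wildcard are assumptions; adjust them to your environment):
import glob
import os
import shutil

# Minimal sketch of the jar-copy step above. DATAX_HOME, BOS_HDFS_DIR and the
# *.jar wildcard are assumptions -- replace them with your actual paths.
DATAX_HOME = "/opt/datax"        # DataX extraction directory (assumed)
BOS_HDFS_DIR = "/opt/bos-hdfs"   # BOS-HDFS extraction directory (assumed)

jars = glob.glob(os.path.join(BOS_HDFS_DIR, "*.jar"))
targets = [
    os.path.join(DATAX_HOME, "plugin/reader/hdfsreader/libs"),
    os.path.join(DATAX_HOME, "plugin/writer/hdfswriter/libs"),
]

for target in targets:
    os.makedirs(target, exist_ok=True)
    for jar in jars:
        # Put the BOS-HDFS jar next to the plugin's own libraries.
        shutil.copy(jar, target)
        print("copied %s -> %s" % (jar, target))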
Getting started
Example
Read the testfile object from {your bucket} and write it into the {your other bucket} bucket.
testfile:
1 hello
2 bos
3 world
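Before running the job, testfile must already exist in {your bucket}. Below is a minimal sketch for uploading it with the bce-python-sdk; the endpoint, bucket name, and AK/SK placeholders are the same values used in the DataX configuration, and the SDK is only one possible upload path (the BOS console or CLI work as well):
from baidubce.auth.bce_credentials import BceCredentials
from baidubce.bce_client_configuration import BceClientConfiguration
from baidubce.services.bos.bos_client import BosClient

# Placeholders -- substitute your own AK/SK, endpoint and bucket name.
config = BceClientConfiguration(
    credentials=BceCredentials("{your ak}", "{your sk}"),
    endpoint="http://bj.bcebos.com",
)
client = BosClient(config)

# Upload the three test lines as the object "testfile" in {your bucket}.
data = "1 hello\n2 bos\n3 world\n"
client.put_object_from_string("{your bucket}", "testfile", data)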
bos2bos.json:
{
"job": {
"setting": {
"speed": {
"channel": 1
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [{
"reader": {
"name": "hdfsreader",
"parameter": {
"path": "/testfile",
"defaultFS": "bos://{your bucket}/",
"column": [
{
"index": 0,
"type": "long"
},
{
"index": 1,
"type": "string"
}
],
"fileType": "text",
"encoding": "UTF-8",
"hadoopConfig": {
"fs.bos.endpoint": "bj.bcebos.com",
"fs.bos.impl": "org.apache.hadoop.fs.bos.BaiduBosFileSystem",
"fs.bos.access.key": "{your ak}",
"fs.bos.secret.access.key": "{your sk}"
},
"fieldDelimiter": " "
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"path": "/testtmp",
"fileName": "testfile.new",
"defaultFS": "bos://{your other bucket}/",
"column": [{
"name": "col1",
"type": "string"
},
{
"name": "col2",
"type": "string"
}
],
"fileType": "text",
"encoding": "UTF-8",
"hadoopConfig": {
"fs.defaultFS": ""
"fs.bos.endpoint": "bj.bcebos.com",
"fs.bos.impl": "org.apache.hadoop.fs.bos.BaiduBosFileSystem",
"fs.bos.access.key": "{your ak}",
"fs.bos.secret.access.key": "{your sk}"
},
"fieldDelimiter": " ",
"writeMode": "append"
}
}
}]
}
}
Replace {your bucket}, endpoint, {your sk}, and the other placeholders in the configuration as needed.
Configuring only the reader or only the writer for BOS is also supported.
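If you prefer not to edit the JSON by hand, the placeholder substitution can be scripted. The sketch below assumes the template is named bos2bos.json and that the credentials and bucket names come from environment variables (SRC_BUCKET, DST_BUCKET, BOS_AK, BOS_SK are illustrative names only):
import json
import os

# Minimal sketch of the placeholder substitution; the environment variable
# names (SRC_BUCKET, DST_BUCKET, BOS_AK, BOS_SK) are illustrative only.
with open("bos2bos.json") as f:
    config = f.read()

replacements = {
    "{your bucket}": os.environ["SRC_BUCKET"],
    "{your other bucket}": os.environ["DST_BUCKET"],
    "{your ak}": os.environ["BOS_AK"],
    "{your sk}": os.environ["BOS_SK"],
}
for placeholder, value in replacements.items():
    config = config.replace(placeholder, value)

json.loads(config)  # sanity check: the result must still be valid JSON
with open("bos2bos.run.json", "w") as f:
    f.write(config)
If you generate the job file this way, pass the generated bos2bos.run.json to datax.py instead of the template.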
Result
python bin/datax.py bos2bos.json
After a successful run, it returns: