Document Extraction

Last updated: 2026-06-17

API Description

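Document Extraction lets you define the fields to extract through custom configuration. Without any model training, it extracts field information from documents and pinpoints the position of each field value. It suits contracts, bills, orders, and other document types. To try it out visually, log in to the Intelligent Document Analysis Platform, upload a document, and test it online; the online tool and the API service share the same quota.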

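The Document Extraction API is asynchronous. First call the submit request API to obtain a taskId, then poll the get result API for the result. We recommend starting to poll 30 seconds after submitting the request. The submit request API is limited to 2 QPS and the get result API to 10 QPS.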
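
The submit-then-poll flow can be sketched as follows. This is a minimal sketch rather than an official client: `poll_task`, its parameters, the 5-second retry interval, and the attempt limit are illustrative assumptions; only the 30-second initial wait and the status values come from this page. The `fetch` callable is assumed to return the parsed JSON of the get result API.

```python
import time
from typing import Callable


def poll_task(fetch: Callable[[], dict],
              initial_delay: float = 30.0,
              interval: float = 5.0,
              max_attempts: int = 60,
              sleep: Callable[[float], None] = time.sleep) -> dict:
    """Wait, then call fetch() until the task leaves Pending/Running.

    fetch is a hypothetical callable returning the parsed JSON of the
    get result API; interval and max_attempts are assumed values.
    """
    sleep(initial_delay)  # the page suggests waiting about 30 s first
    for attempt in range(max_attempts):
        body = fetch()
        if body.get("error_code"):
            return body  # request-level error: stop and let the caller inspect it
        result = body.get("result") or {}
        if result.get("status") in ("Success", "Failed"):
            return body
        if attempt < max_attempts - 1:
            sleep(interval)  # 5 s between calls stays far below the 10 QPS limit
    raise TimeoutError("task did not finish within the polling budget")
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in production the default `time.sleep` applies.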

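Online Debugging

You can debug this API in the Sample Code Center, where you can verify signatures, inspect the request and response of online calls, and generate sample code automatically.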

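Submit Request API

Request Description

Request Example

HTTP method: POST

Request URL: https://aip.baidubce.com/rest/2.0/brain/online/v1/extract/task

URL parameters:

Parameter	Value
access_token	The access_token obtained with your API Key and Secret Key; see "Obtaining an Access Token"

Headers:

Parameter	Value
Content-Type	application/x-www-form-urlencoded

Place the request parameters in the body. Details are as follows: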

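Request Parameters

Parameter	Required	Type	Description
file	Either file or fileURLs	string	Base64-encoded file content with the encoding header removed (for an image, for example, remove data:image/jpeg;base64,). Supported file types: images (jpg/jpeg/png/bmp/tif/tiff) and documents (doc/docx/pdf/xlsx/xls/ofd). Only one file can be uploaded per request, and it must not exceed 50 MB. The shortest side of an image must be at least 15 px and the longest side at most 4096 px. Priority: file > fileURLs; when file is present, fileURLs is ignored
fileName	Required when file is provided	string	Document name, for example test.docx. Required whenever the file parameter is sent
fileURLs	Either file or fileURLs	string	URL of the file, at most 1024 bytes long; only a single URL is supported. All other file requirements are the same as for file. Priority: file > fileURLs; when file is present, fileURLs is ignored
sceneVersionId	No	string	Extraction scene to use; defaults to long document. Options: Extract_Scene_Long_Doc_LLM (long document); Extract_Scene_Long_Doc_VLM (general document)
manifestVersionId	One of manifestVersionId, manifest, or prompt	string	Version ID of a manifest configured on the Intelligent Document Analysis Platform. Priority: manifestVersionId > manifest; when manifestVersionId is present, manifest is ignored
manifest	One of manifestVersionId, manifest, or prompt	string	Extraction field configuration. Each field has three parameters: key, parentKey, and description. key is the field name and is required. parentKey is the field's parent field and is optional; pass root or leave it empty when there is none. description is a supplementary explanation that helps the large model extract more accurately and is optional; leave it empty when there is none. All three parameters accept Chinese and English characters, digits, underscores, hyphens, slashes, colons, and parentheses; hyphens, underscores, slashes, and colons must not appear at the start or end. key and parentKey are limited to 30 characters and description to 100 characters. At most 100 keys are allowed. Priority: manifestVersionId > manifest; when manifestVersionId is present, manifest is ignored
prompt	One of manifestVersionId, manifest, or prompt	string	Prompt content. When this field is used, extraction runs in Prompt mode. Priority: manifestVersionId > manifest > prompt
useParseResult	No	string	Whether to use the document parsing result as a reference. When disabled, the parsing step is skipped and extraction is performed directly by the VLM. (VLM extraction currently returns no position information.)
removeDuplicates	No	bool	Whether to deduplicate field values. When enabled, identical values extracted repeatedly for a single field are merged, and the position of the first occurrence is returned by default
pageRange	No	string	Pages to extract. When set, only the specified pages are extracted and quota is charged for those pages only. Format: page numbers start at 1; separate individual pages with commas and express ranges with a hyphen. For example, 1,5-10,15 extracts page 1, pages 5 through 10, and page 15, and deducts 8 pages of quota. Note: because flow-layout documents reflow, page numbering for doc/docx/wps/xls/xlsx files is not stable; use fixed-layout formats such as pdf, ofd, or images to ensure accurate page numbers
extractSeal	No	bool	Whether to extract seals. When enabled, seal information in the document is extracted, at the cost of a longer task duration
eraseWatermark	No	bool	Whether to remove watermarks. When enabled, visual noise such as watermarks, background patterns, and seals is removed to improve extraction quality, at the cost of a longer task duration
docCorrect	No	bool	Whether to correct image skew. When enabled, tilted images are straightened, at the cost of a longer task duration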
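
To make the pageRange format and its effect on quota concrete, the sketch below expands a range string into individual page numbers. `expand_page_range` is a hypothetical helper, not part of the API. Counting overlapping ranges only once is an assumption, since this page does not say how duplicates are billed.

```python
from typing import List


def expand_page_range(page_range: str) -> List[int]:
    """Expand a string such as '1,5-10,15' into a sorted list of pages."""
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Page numbers start at 1 and a range must not run backwards.
        if start < 1 or end < start:
            raise ValueError(f"invalid page range segment: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

For the example in the table, `len(expand_page_range("1,5-10,15"))` gives the 8 pages that are deducted.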

Example manifest value:

JSON

[{
    "parentKey": "",
    "key": "生效范围",
    "description": "指国家范围而非省份、城市或更细分的范围"
}, {
    "parentKey": "商品信息",
    "key": "商品编号",
    "description": ""
}, {
    "parentKey": "商品信息",
    "key": "商品价格",
    "description": ""
}]
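
The length and edge-character rules for manifest can be checked before a request is sent. The sketch below is one possible client-side check, assuming a hypothetical helper named `build_manifest`. It deliberately skips character-set validation, because this page does not give exact character ranges, and the server remains the authority on what is accepted.

```python
import json

EDGE_FORBIDDEN = "-_/:"  # per the table, these may not start or end a value


def build_manifest(fields: list) -> str:
    """Validate field dicts and serialize them for the manifest parameter."""
    if len(fields) > 100:
        raise ValueError("manifest allows at most 100 keys")
    items = []
    for field in fields:
        key = field.get("key", "")
        parent = field.get("parentKey", "")
        desc = field.get("description", "")
        if not key:
            raise ValueError("key is required")
        limits = (("key", key, 30), ("parentKey", parent, 30),
                  ("description", desc, 100))
        for name, value, limit in limits:
            if len(value) > limit:
                raise ValueError(f"{name} exceeds {limit} characters")
            if value and (value[0] in EDGE_FORBIDDEN or value[-1] in EDGE_FORBIDDEN):
                raise ValueError(f"{name} starts or ends with a forbidden character")
        items.append({"parentKey": parent, "key": key, "description": desc})
    # ensure_ascii=False keeps Chinese field names readable in the payload.
    return json.dumps(items, ensure_ascii=False)
```

The returned string can be passed as the value of the manifest form field.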
            

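Request Code Example

Tip 1: Before using the sample code, remember to replace the sample token, document path, and manifest.

Tip 2: Only a Python sample is currently provided; for other languages, see the Sample Code Center.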

Python

import base64
import os

import requests

# Document extraction: submit request
file_path = '[local file path]'
manifest_version_id = '[manifest version id]'
request_host = "https://aip.baidubce.com/rest/2.0/brain/online/v1/extract/task"

# Read the file and base64-encode it (no data-URI header is added)
with open(file_path, 'rb') as file:
    file_base64 = base64.b64encode(file.read()).decode('utf-8')

# Priority: manifestVersionId > manifest; when manifestVersionId is present, manifest is ignored
data = {
    'file': file_base64,
    'fileName': os.path.basename(file_path),
    # 'manifestVersionId': manifest_version_id,
    'manifest': """
        [{"key":"单位", "description":"工作单位一般出现在正文内容里前面带着单位2字", "parentKey":"root"},
         {"key":"工作票号", "description":"工作票号不是编号一般由纯数字组成", "parentKey":"root"}]
    """
}
access_token = '[token obtained from the authentication API]'
request_url = request_host + "?access_token=" + access_token
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
response = requests.post(request_url, headers=headers, data=data)
# Surface HTTP-level failures instead of silently printing nothing
response.raise_for_status()
print(response.json())
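
The file-related rules in the parameter table (remove the data-URI header, send fileName together with file, and file takes priority over fileURLs) can be gathered into small helpers. This is a sketch under stated assumptions: `strip_data_uri_header` and `build_file_fields` are hypothetical names, and "50M" is read here as 50 MiB.

```python
import base64
import os
from typing import Optional

MAX_FILE_BYTES = 50 * 1024 * 1024  # "50M" interpreted as 50 MiB (assumption)


def strip_data_uri_header(encoded: str) -> str:
    """Drop a leading 'data:...;base64,' header if one is present."""
    if encoded.startswith("data:") and "," in encoded:
        return encoded.split(",", 1)[1]
    return encoded


def build_file_fields(path: Optional[str] = None,
                      url: Optional[str] = None) -> dict:
    """Return the form fields for one document, preferring a local file."""
    if path:  # file wins over fileURLs, so the URL is not sent at all
        with open(path, "rb") as handle:
            raw = handle.read()
        if len(raw) > MAX_FILE_BYTES:
            raise ValueError("file exceeds the 50 MB limit")
        return {"file": base64.b64encode(raw).decode("ascii"),
                "fileName": os.path.basename(path)}
    if url:
        if len(url.encode("utf-8")) > 1024:
            raise ValueError("fileURLs must not exceed 1024 bytes")
        return {"fileURLs": url}
    raise ValueError("either path or url is required")
```

The returned dict can be merged with the manifest and option fields before posting the form.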
            

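Response Description

Response Parameters

Field	Type	Description
log_id	string	Unique log ID, used for troubleshooting
error_code	int	Error code
error_msg	string	Error message
result	dict	Returned result
+ taskId	string	taskId generated for this request; use it later to retrieve the extraction result

Response Example

Successful response example: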

JSON

{
    "error_code": 0,
    "error_msg": "",
    "log_id": "088d6639-bafd-4007-be27-dcddbb651322",
    "result": {
        "taskId": "task-6tb7mgduz9rqaxzi"
    }
}
            

Failed response example (for detailed error code descriptions, see API Documentation - Error Codes):

JSON

{
    "error_code": 283016,
    "error_msg": "parameters value error:清单字段名称格式错误",
    "log_id": "debb76a0-8015-4647-be19-f76c5a0f2892",
    "result": null
}
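
The two response shapes above suggest checking error_code before reading result. A minimal sketch, assuming a hypothetical helper named `extract_task_id` that receives the parsed JSON body:

```python
def extract_task_id(body: dict) -> str:
    """Return the taskId from a submit response or raise with details."""
    code = body.get("error_code", 0)
    if code != 0:
        raise RuntimeError(
            f"submit failed: {code} {body.get('error_msg', '')} "
            f"(log_id={body.get('log_id')})"
        )
    # result is null on failure, so guard against None before indexing.
    result = body.get("result") or {}
    task_id = result.get("taskId")
    if not task_id:
        raise RuntimeError("response contained no taskId")
    return task_id
```

Including log_id in the error message makes it easier to report a problem for troubleshooting.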
            

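Get Result API

Request Description

Request Example

HTTP method: POST

Request URL: https://aip.baidubce.com/rest/2.0/brain/online/v1/extract/query_task

URL parameters:

Parameter	Value
access_token	The access_token obtained with your API Key and Secret Key; see "Obtaining an Access Token"

Headers:

Parameter	Value
Content-Type	application/x-www-form-urlencoded

Place the request parameters in the body. Details are as follows:

Request Parameters

Parameter	Required	Type	Description
taskId	Yes	string	The taskId returned by the submit request API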

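Request Code Example

Tip 1: Before using the sample code, remember to replace the sample token and taskId.

Tip 2: Only a Python sample is currently provided; for other languages, see the Sample Code Center.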

Python

import requests

# Document extraction: get result
request_host = "https://aip.baidubce.com/rest/2.0/brain/online/v1/extract/query_task"
data = {
    "taskId": "[taskId returned by the submit request API]"
}
access_token = "[token obtained from the authentication API]"
request_url = request_host + "?access_token=" + access_token

headers = {'Content-Type': 'application/x-www-form-urlencoded'}
response = requests.post(request_url, data=data, headers=headers)
# Surface HTTP-level failures instead of silently printing nothing
response.raise_for_status()
print(response.json())
            

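Response Description

Response Parameters

Field	Type	Description
log_id	string	Unique log ID, used for troubleshooting
error_code	int	Error code
error_msg	string	Error message
result	dict	Returned result
+ taskId	string	Task ID
+ status	string	Task status. Pending: queued; Running: in progress; Success: succeeded; Failed: failed
+ reason	string	Description of why the task failed
+ createdAt	string	Task creation time
+ startedAt	string	Task start time
+ finishedAt	string	Task end time
+ duration	int	Task execution duration (in seconds, judging by the examples on this page)
+ extendInfos	dict	Advanced option settings
++ removeDuplicates	bool	Whether field value deduplication is enabled
++ extractSeal	bool	Whether seal extraction is enabled
++ eraseWatermark	bool	Whether watermark removal is enabled
++ docCorrect	bool	Whether image skew correction is enabled
++ pageRange	bool	Whether page-range extraction is enabled
+ extractLabelInfo	[]dict	Manifest field configuration; returned when status is Success
++ key	string	Name of the extraction field
++ parentKey	string	Parent field of the extraction field
++ description	string	Supplementary description of the extraction field
+ extractResult	[]dict	List of document extraction results
++ docId	string	Document ID
++ docName	string	Document name
++ pdfUrl	string	BOS download link for the source file converted to PDF; returned when status is Success and valid for 30 minutes
++ data	dict	Extracted fields and their results
+++ singleKey	dict	Extraction results for single fields
++++ {field name}	[]dict	Name of the single field
+++++ word	string	Extracted value of the field
+++++ valuePositions	[]dict	Position information of the extracted content
++++++ box	[4][2]int	Four-corner box of the extracted value, [[x1,y1],[x2,y2],[x3,y3],[x4,y4]], giving the coordinates of the box's four corner points
++++++ cbox	[4]int	Bounding box of the extracted value, [x, y, w, h], where (x, y) is the box's anchor point (the top-left corner in the examples on this page), w is the box width, and h is the box height, relative to the page origin
++++++ pageNo	int	Page number
+++ comboKey	dict	Extraction results for combined fields
++++ {group name}	[]dict	Name of the combined field group
+++++ {field name}	dict	Name of a field within the group
++++++ word	string	Extracted value of the field (present in the success example)
++++++ valuePositions	[]dict	Position information of the extracted content
+++++++ box	[4][2]int	Four-corner box, [[x1,y1],[x2,y2],[x3,y3],[x4,y4]], giving the coordinates of the box's four corner points
+++++++ cbox	[4]int	Bounding box, [x, y, w, h], where (x, y) is the box's anchor point, w is the box width, and h is the box height, relative to the page origin
+++++++ pageNo	int	Page number
+++ seal	[]dict	Seal extraction results
++++ major	string	Recognized content of the seal's main field
++++ minor	string	Recognized content of the seal's sub-field
++++ valuePositions	[]dict	Position information of the extracted content
+++++ box	[4][2]int	Four-corner box, [[x1,y1],[x2,y2],[x3,y3],[x4,y4]], giving the coordinates of the box's four corner points
+++++ cbox	[4]int	Bounding box, [x, y, w, h], where (x, y) is the box's anchor point, w is the box width, and h is the box height, relative to the page origin
+++++ pageNo	int	Page number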
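
The relationship between box and cbox is easiest to see in code. The sketch below derives cbox from box as an axis-aligned bounding rectangle. That reading is inferred from the success example on this page, where the two values agree, rather than stated by the API, and `box_to_cbox` is a hypothetical helper.

```python
def box_to_cbox(box: list) -> list:
    """Convert four corner points [[x, y], ...] into [x, y, w, h]."""
    xs = [point[0] for point in box]
    ys = [point[1] for point in box]
    left, top = min(xs), min(ys)
    # Width and height span the extreme corners of the quadrilateral.
    return [left, top, max(xs) - left, max(ys) - top]
```

For a rotated quadrilateral this yields the enclosing rectangle, which may differ from what the service returns.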

Response Example

Successful response example:

JSON

{
	"error_code": 0,
	"error_msg": "",
	"log_id": "8ca3c6cf-f9d3-485c-96ee-dedcd9fb9a54",
	"result": {
		"taskId": "task-vrs7269yugakeg3g",
		"status": "Success",
		"createdAt": "2024-11-13 02:49:58 +0000 UTC",
		"startedAt": "2024-11-13 02:50:01 +0000 UTC",
		"finishedAt": "2024-11-13 02:50:38 +0000 UTC",
		"duration": 37,
		"reason": "",
		"extractLabelInfo": [{
			"key": "案号",
			"parentKey": "root",
			"description": ""
		}, {
			"key": "保全金额",
			"parentKey": "root",
			"description": ""
		}, {
			"key": "申请人地址",
			"parentKey": "申请人",
			"description": ""
		}, {
			"key": "申请人联系方式",
			"parentKey": "申请人",
			"description": ""
		}, {
			"key": "被申请人",
			"parentKey": "root",
			"description": ""
		}, {
			"key": "被申请人联系方式",
			"parentKey": "被申请人",
			"description": ""
		}, {
			"key": "申请人",
			"parentKey": "root",
			"description": ""
		}],
		"extractResult": [{
			"docId": "doc-pzvd73huishsjp29",
			"docName": "跨页楼地址.pdf",
			"pdfUrl": "https://model-extract-dev-bj.bj.bcebos.com/paas_extract_doc/50042791/doc-pzvd73huishsjp29/doc-pzvd73huishsjp29.pdf?authorization=bce-auth-v1%2FALTAK7IDj758EUbA1igu04rHAh%2F2024-11-13T02%3A50%3A39Z%2F1800%2Fhost%2F3b892fe167f095a5cc63b82b3d69cc9c72d7ee0ae65fde511d1c2a719456e7d8",
			"data": {
				"singleKey": {
					"保全金额": [{
						"valuePositions": [{
							"box": [
								[128, 423],
								[186, 423],
								[186, 442],
								[128, 442]
							],
							"cbox": [128, 423, 58, 19],
							"pageNo": 4
						}, {
							"box": [
								[390, 238],
								[446, 238],
								[446, 256],
								[390, 256]
							],
							"cbox": [390, 238, 56, 18],
							"pageNo": 4
						}],
						"word": "45000元"
					}]
				},
				"comboKey": {
					"申请人": [{
						"申请人地址": {
							"valuePositions": [{
								"box": [
									[378, 107],
									[501, 107],
									[501, 126],
									[378, 126]
								],
								"cbox": [378, 107, 123, 19],
								"pageNo": 1
							}, {
								"box": [
									[73, 140],
									[189, 140],
									[189, 158],
									[73, 158]
								],
								"cbox": [73, 140, 116, 18],
								"pageNo": 1
							}],
							"word": "烟台市莱山区润华大厦1号楼14层"
						},
						"申请人联系方式": {
							"valuePositions": [{
								"box": [
									[149, 172],
									[236, 172],
									[236, 190],
									[149, 190]
								],
								"cbox": [149, 172, 87, 18],
								"pageNo": 1
							}],
							"word": "13589767069"
						}
					}],
					"被申请人": [{
						"被申请人联系方式": {
							"valuePositions": [{
								"box": [
									[335, 300],
									[421, 300],
									[421, 317],
									[335, 317]
								],
								"cbox": [335, 300, 86, 17],
								"pageNo": 1
							}, {
								"box": [
									[75, 526],
									[159, 526],
									[159, 542],
									[75, 542]
								],
								"cbox": [75, 526, 84, 16],
								"pageNo": 1
							}, {
								"box": [
									[335, 654],
									[424, 654],
									[424, 672],
									[335, 672]
								],
								"cbox": [335, 654, 89, 18],
								"pageNo": 1
							}, {
								"box": [
									[297, 80],
									[382, 80],
									[382, 96],
									[297, 96]
								],
								"cbox": [297, 80, 85, 16],
								"pageNo": 2
							}, {
								"box": [
									[65, 241],
									[153, 241],
									[153, 256],
									[65, 256]
								],
								"cbox": [65, 241, 88, 15],
								"pageNo": 2
							}, {
								"box": [
									[362, 368],
									[449, 368],
									[449, 386],
									[362, 386]
								],
								"cbox": [362, 368, 87, 18],
								"pageNo": 2
							}, {
								"box": [
									[301, 498],
									[388, 498],
									[388, 515],
									[301, 515]
								],
								"cbox": [301, 498, 87, 17],
								"pageNo": 2
							}, {
								"box": [
									[69, 664],
									[156, 664],
									[156, 680],
									[69, 680]
								],
								"cbox": [69, 664, 87, 16],
								"pageNo": 2
							}, {
								"box": [
									[78, 145],
									[161, 145],
									[161, 160],
									[78, 160]
								],
								"cbox": [78, 145, 83, 15],
								"pageNo": 3
							}, {
								"box": [
									[79, 299],
									[163, 299],
									[163, 314],
									[79, 314]
								],
								"cbox": [79, 299, 84, 15],
								"pageNo": 3
							}, {
								"box": [
									[362, 421],
									[448, 421],
									[448, 439],
									[362, 439]
								],
								"cbox": [362, 421, 86, 18],
								"pageNo": 3
							}, {
								"box": [
									[306, 547],
									[389, 547],
									[389, 564],
									[306, 564]
								],
								"cbox": [306, 547, 83, 17],
								"pageNo": 3
							}, {
								"box": [
									[306, 676],
									[392, 676],
									[392, 694],
									[306, 694]
								],
								"cbox": [306, 676, 86, 18],
								"pageNo": 3
							}, {
								"box": [
									[378, 114],
									[462, 114],
									[462, 132],
									[378, 132]
								],
								"cbox": [378, 114, 84, 18],
								"pageNo": 4
							}],
							"word": "18866655005"
						}
					}, {
						"被申请人联系方式": {
							"valuePositions": [{
								"box": [
									[304, 397],
									[390, 397],
									[390, 413],
									[304, 413]
								],
								"cbox": [304, 397, 86, 16],
								"pageNo": 1
							}],
							"word": "18615970607"
						}
					}]
				}
			}
		}],
		"quota": 1
	}
}
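
Because singleKey and comboKey nest differently, it can help to flatten a successful response into rows before further processing. The sketch below shows one way to do that; `flatten_fields` and the row layout (document name, group, field, word, pages) are illustrative choices, not part of the API.

```python
def _pages(hit: dict) -> list:
    """Collect the distinct page numbers on which a value was located."""
    return sorted({pos["pageNo"] for pos in hit.get("valuePositions") or []})


def flatten_fields(body: dict) -> list:
    """Return (docName, group, field, word, pages) tuples for every hit."""
    rows = []
    result = body.get("result") or {}
    for doc in result.get("extractResult") or []:
        name = doc.get("docName")
        data = doc.get("data") or {}
        # singleKey: field name -> list of hits
        for field, hits in (data.get("singleKey") or {}).items():
            for hit in hits:
                rows.append((name, None, field, hit.get("word"), _pages(hit)))
        # comboKey: group name -> list of records, each mapping field -> hit
        for group, records in (data.get("comboKey") or {}).items():
            for record in records:
                for field, hit in record.items():
                    rows.append((name, group, field, hit.get("word"), _pages(hit)))
    return rows
```

Each record in a comboKey group stays distinguishable by its position in the output, since rows are appended in response order.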
            

Failed response example (for detailed error code descriptions, see API Documentation - Error Codes):

JSON

{
    "error_code": 0,
    "error_msg": "",
    "log_id": "7ea4d878-ed81-49de-88cc-5f910c6b8d1b",
    "result": {
        "taskId": "task-utbgdcp6hqx96d3b",
        "status": "Failed",
        "createdAt": "2024-07-29 07:09:41 +0000 UTC",
        "startedAt": "",
        "finishedAt": "",
        "duration": 0,
        "reason": "no valid doc",
        "extractResult": [
            {
                "docId": "doc-g34h4tsyv7zpsz05",
                "docName": "4b252fcf49f9a7a7042e63e6b6fefbcc1000007.png",
                "data": {}
            }
        ]
    }
}
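
Note that in the failed example above error_code is 0 while status is Failed, so a client needs to inspect both. A minimal sketch, assuming a hypothetical helper named `task_outcome` with illustrative outcome labels:

```python
def task_outcome(body: dict) -> tuple:
    """Classify a get result response as error, failed, success, or pending."""
    if body.get("error_code", 0) != 0:
        # The request itself was rejected (bad parameters, auth, quota, ...).
        return ("error", body.get("error_msg", ""))
    result = body.get("result") or {}
    status = result.get("status")
    if status == "Failed":
        # The request was accepted but the task could not be completed.
        return ("failed", result.get("reason", ""))
    if status == "Success":
        return ("success", "")
    return ("pending", status or "")
```

With the failed example above, this returns `("failed", "no valid doc")`.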
            

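Front-End SDK Rendering

Helps users conveniently call the document extraction service from a web page, providing the same front-end rendering and interaction as the Document Extraction online tool on the Intelligent Document Analysis Platform.

Usage

Example URL: https://textmind-sdk.bce.baidu.com/textmind/sdk/textExtract/{taskId}?access_token={access_token}

URL parameters:

Parameter	Value
access_token	The access_token obtained with your API Key and Secret Key; see "Obtaining an Access Token"
taskId	The taskId returned by the submit request API

Embedding via iframe

Tip 1: Before using the sample code, remember to replace the sample token and taskId.

HTML

<iframe
    src="https://textmind-sdk.bce.baidu.com/textmind/sdk/textExtract/{taskId}?access_token={access_token}"
></iframe>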
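
If the page is rendered server-side, the SDK URL and iframe tag can be built from the taskId and token with proper escaping. The helper names below are hypothetical, and the sketch assumes only the URL pattern shown above.

```python
from html import escape
from urllib.parse import quote, urlencode

SDK_BASE = "https://textmind-sdk.bce.baidu.com/textmind/sdk/textExtract"


def build_sdk_url(task_id: str, access_token: str) -> str:
    """Fill the taskId path segment and the access_token query parameter."""
    query = urlencode({"access_token": access_token})
    return f"{SDK_BASE}/{quote(task_id, safe='')}?{query}"


def build_iframe(task_id: str, access_token: str) -> str:
    """Return an iframe tag whose src attribute is HTML-escaped."""
    src = escape(build_sdk_url(task_id, access_token), quote=True)
    return f'<iframe src="{src}"></iframe>'
```

Since the access_token ends up in the page source, it is worth considering who can view that page before embedding it.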
