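Native Multimodal
Last updated: 2025-04-11
Native multimodal foundation models jointly model multiple modalities so that the modalities are optimized together, which yields strong multimodal understanding. Their language abilities are also more refined: understanding, generation, logic, and memory improve across the board, with marked gains in hallucination reduction, logical reasoning, and coding.
Supported models
| Model name | Model version | model parameter value | max_completion_tokens range |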
|---|---|---|---|
| ERNIE 4.5 | ERNIE-4.5-8K-Preview | ernie-4.5-8k-preview | [2,8192] |
| Llama-4-Maverick | Llama-4-Maverick-17B-128E-Instruct | llama-4-maverick-17b-128e-instruct | [2,8192] |
| Llama-4-Scout | Llama-4-Scout-17B-16E-Instruct | llama-4-scout-17b-16e-instruct | [2,8192] |
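The ranges in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the official SDK: `check_max_completion_tokens` and the lookup table are hypothetical names, and the bounds are copied from the table above.

```python
# Ranges copied from the supported-models table; keys are `model` parameter values.
MAX_COMPLETION_TOKENS_RANGE = {
    "ernie-4.5-8k-preview": (2, 8192),
    "llama-4-maverick-17b-128e-instruct": (2, 8192),
    "llama-4-scout-17b-16e-instruct": (2, 8192),
}


def check_max_completion_tokens(model: str, value: int) -> int:
    """Hypothetical helper: return `value` if it lies inside the documented range."""
    if model not in MAX_COMPLETION_TOKENS_RANGE:
        raise ValueError(f"unknown model: {model}")
    low, high = MAX_COMPLETION_TOKENS_RANGE[model]
    if not low <= value <= high:
        raise ValueError(
            f"max_completion_tokens must be in [{low}, {high}], got {value}"
        )
    return value


print(check_max_completion_tokens("ernie-4.5-8k-preview", 512))
```

Validating locally only saves a round trip; the service remains the authority on which values it accepts.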
Usage
Input
Supported input modalities
The following input combinations are supported: text only, image only, and image plus text.
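How to pass multimodal data
Input images and files can be supplied as Base64-encoded data or as public URLs. The sample code below passes public URLs; to pass Base64-encoded data instead, see Base64-encoded image input.
Output
Multimodal models can currently be called in both streaming and non-streaming modes.
Supported output modalities
Important
Only text output is currently supported. Llama 4 supports text output only; audio output for ERNIE 4.5 is planned for a later release.
Quick start
You can pass images to the model in two ways: as an image URL or as Base64-encoded data. Like text, image content must be sent under the user role, that is, "role": "user". The following are simple examples of calling a visual understanding model.
Prerequisites
The APIs in this document require API Key authentication. For the full API Key calling procedure, see Authentication.
Text input
The model accepts plain text as input.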
```bash
curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "介绍几个上海著名景点"
                }
            ]
        }
    ],
    "stream": false
}'
```
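The same text-only request can be assembled in Python with only the standard library. This is a sketch under stated assumptions: `build_chat_request` and `send` are hypothetical helper names, and `QIANFAN_API_KEY` is an assumed environment variable name; the endpoint, headers, and body mirror the curl call above.

```python
import json
import os
import urllib.request

API_URL = "https://qianfan.bj.baidubce.com/v2/chat/completions"


def build_chat_request(api_key: str, model: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming, text-only chat request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": text}]}
        ],
        "stream": False,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body (needs network access and a valid key)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# QIANFAN_API_KEY is an assumed variable name, not one defined by this document.
request = build_chat_request(
    os.getenv("QIANFAN_API_KEY", "your-api-key"),
    "ernie-4.5-8k-preview",
    "介绍几个上海著名景点",
)
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the payload easy to inspect; call `send(request)` once a real key is configured.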
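Image input
The model accepts image-only input.
Multiple images can be passed in one request. Input images must meet the following requirements:
- Each image file must not exceed 10 MB.
- The number of images is limited by the model's combined text-and-image token limit (its maximum input): the total token count of all images must be less than the model's maximum input.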
```bash
curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                }
            ]
        }
    ],
    "stream": false
}'
```
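Image + text input
The model accepts a combination of images and text as input.
Multiple images can be passed in one request. Input images must meet the following requirements:
- Each image file must not exceed 10 MB.
- The number of images is limited by the model's combined text-and-image token limit (its maximum input): the total token count of all images must be less than the model's maximum input.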
```bash
curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image2.png"
                    }
                }
            ]
        }
    ],
    "stream": false
}'
```
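The content array in the requests above follows a regular pattern: an optional text part followed by one `image_url` part per image. A small sketch of a builder for it is shown below; `build_user_message` is a hypothetical helper, not an SDK function.

```python
from typing import List, Optional


def build_user_message(image_urls: List[str], text: Optional[str] = None) -> dict:
    """Hypothetical helper: build one user message from optional text and image URLs."""
    content = []
    if text is not None:
        content.append({"type": "text", "text": text})
    for url in image_urls:
        # Each image becomes its own image_url part, in the order given.
        content.append({"type": "image_url", "image_url": {"url": url}})
    if not content:
        raise ValueError("a message needs text, at least one image, or both")
    return {"role": "user", "content": content}


message = build_user_message(
    [
        "https://testimage.bj.bcebos.com/image1.jpg",
        "https://testimage.bj.bcebos.com/image2.png",
    ],
    text="What are in these images? Is there any difference between them?",
)
print(len(message["content"]))
```

Passing only `image_urls` produces the image-only form, and passing an empty list with `text` produces the text-only form.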
Search enhancement
Note: Only ERNIE 4.5 supports search enhancement.
```bash
curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "2024年奥运会乒乓球男单冠军是谁"
                }
            ]
        }
    ],
    "web_search": {
        "enable": true,
        "enable_trace": true
    },
    "stream": false,
    "max_tokens": 512
}'
```
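Because search enhancement is limited to ERNIE 4.5, a client can refuse to attach the `web_search` block to other models. The sketch below is illustrative only: `with_web_search` is a hypothetical helper, and matching on the `ernie-4.5` prefix of the model name is an assumption made here, not a rule stated by the service.

```python
def with_web_search(payload: dict, enable_trace: bool = True) -> dict:
    """Hypothetical helper: return a copy of `payload` with search enhancement enabled."""
    # Assumption: ERNIE 4.5 models are recognized by this model-name prefix.
    if not payload.get("model", "").startswith("ernie-4.5"):
        raise ValueError("search enhancement is only supported by ERNIE 4.5")
    updated = dict(payload)  # shallow copy so the caller's payload is untouched
    updated["web_search"] = {"enable": True, "enable_trace": enable_trace}
    return updated


payload = with_web_search(
    {
        "model": "ernie-4.5-8k-preview",
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": "2024年奥运会乒乓球男单冠军是谁"}]}
        ],
        "stream": False,
    }
)
print(payload["web_search"])
```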
Multi-turn conversation
Round 1
Request
```bash
curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image2.png"
                    }
                }
            ]
        }
    ],
    "stream": false
}'
```
Response
```json
{
    "id": "as-t4ntazzewk",
    "object": "chat.completion",
    "created": 1742198333,
    "model": "ernie-4.5-8k-preview",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The first image depicts a serene natural landscape with a wooden boardwalk leading through a lush green field under a blue sky with scattered clouds. The second image shows a close-up of a hand interacting with a small, intricately designed figurine that resembles a turtle with detailed patterns and textures, surrounded by green foliage. The primary difference between the two images is that the first is a wide-angle photograph of an outdoor scene, while the second is a close-up photograph of a hand and a decorative object."
            },
            "finish_reason": "stop",
            "flag": 0
        }
    ],
    "usage": {
        "prompt_tokens": 3019,
        "completion_tokens": 102,
        "total_tokens": 3121
    }
}
```
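A non-streaming response like the one above can be reduced to the fields most callers need. The following sketch assumes the response has already been decoded into a dict; `summarize_completion` is a hypothetical helper, and the sample dict is an abbreviated copy of the response shown above.

```python
def summarize_completion(response: dict) -> dict:
    """Hypothetical helper: pull the reply text, finish reason, and token usage."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Abbreviated copy of the round 1 response; the content string is shortened here.
sample = {
    "id": "as-t4ntazzewk",
    "object": "chat.completion",
    "model": "ernie-4.5-8k-preview",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The first image depicts ..."},
            "finish_reason": "stop",
            "flag": 0,
        }
    ],
    "usage": {"prompt_tokens": 3019, "completion_tokens": 102, "total_tokens": 3121},
}

summary = summarize_completion(sample)
print(summary["finish_reason"], summary["total_tokens"])
```

In the sample, `prompt_tokens` dominates the total, which is consistent with the two input images accounting for most of the input.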
Round 2
Request
```bash
curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image2.png"
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": "The first image depicts a serene natural landscape with a wooden boardwalk leading through a lush green field under a blue sky with scattered clouds. The second image shows a close-up of a hand interacting with a small, intricately designed figurine that resembles a turtle with detailed patterns and textures, surrounded by green foliage. The primary difference between the two images is that the first is a wide-angle photograph of an outdoor scene, while the second is a close-up photograph of a hand and a decorative object."
        },
        {
            "role": "user",
            "content": "不够详细,再补充解析细节一点哇"
        }
    ],
    "stream": false
}'
```
Response
```json
{
    "id": "as-14hvmej1ts",
    "object": "chat.completion",
    "created": 1742198435,
    "model": "ernie-4.5-8k-preview",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "当然,让我们更详细地解析这两张图片的差异:\n\n1. **主题和内容**:\n - **第一张图片**:这是一张风景照片,展示了一个自然环境。照片中有一条木制栈道穿过一片绿色的草地,背景是蓝天和一些树木。整体氛围宁静,色彩鲜艳,给人一种开阔和放松的感觉。\n - **第二张图片**:这是一张特写照片,展示了一只手正在与一个装饰性物品互动。这个物品看起来像是一只带有复杂图案和纹理的乌龟,周围还有一些绿色的植物装饰。整体氛围更加细腻和手工艺感,色彩丰富且细节精致。\n\n2. **视角和构图**:\n - **第一张图片**:采用了广角视角,捕捉了大面积的自然景观。栈道从前景延伸到背景,引导观众的视线深入画面。天空和地面的比例均衡,给人一种空间感和深度感。\n - **第二张图片**:采用了特写视角,聚焦于手和乌龟的细节。手指和乌龟的纹理清晰可见,背景模糊,突出主体。构图紧凑,细节丰富,吸引观众注意手工艺品的精细制作。\n\n3. **色彩和光线**:\n - **第一张图片**:色彩以绿色和蓝色为主,光线自然,可能是拍摄于白天。阳光照射在草地和栈道上,形成明暗对比,增强了画面的立体感。\n - **第二张图片**:色彩更加多样,乌龟的绿、蓝、黄等颜色鲜艳,光线柔和,可能是室内拍摄。光线集中在手和乌龟上,突出细节和质感。\n\n4. **情感和氛围**:\n - **第一张图片**:传达出一种宁静、自然和放松的情感,适合用于展示自然风光或户外活动的场景。\n - **第二张图片**:传达出一种细腻、手工艺和专注的情感,适合用于展示手工艺品或创意作品的场景。\n\n总的来说,这两张图片在主题、视角、色彩和情感上都有显著的不同,分别展示了自然景观和手工艺品的美感。"
            },
            "finish_reason": "stop",
            "flag": 0
        }
    ],
    "usage": {
        "prompt_tokens": 3130,
        "completion_tokens": 394,
        "total_tokens": 3524
    }
}
```
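The two rounds above show the pattern for multi-turn use: the round 2 request repeats the round 1 user message (images included), appends the assistant reply, and then adds the new user message. A minimal sketch of that bookkeeping follows; `next_turn` is a hypothetical helper name.

```python
from typing import List


def next_turn(messages: List[dict], assistant_reply: str, user_text: str) -> List[dict]:
    """Hypothetical helper: extend the history with the last reply and a new user message."""
    # Build a new list so the previous round's history is left unchanged.
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_text},
    ]


round_one = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are in these images? Is there any difference between them?"},
            {"type": "image_url", "image_url": {"url": "https://testimage.bj.bcebos.com/image1.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://testimage.bj.bcebos.com/image2.png"}},
        ],
    }
]

# The reply text would come from the round 1 response; it is abbreviated here.
round_two = next_turn(round_one, "The first image depicts ...", "不够详细,再补充解析细节一点哇")
print([m["role"] for m in round_two])
```

Resending the full history on each round means the images are submitted again with every request, which matches the slightly larger `prompt_tokens` value in the round 2 response.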
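Base64-encoded image input
To upload a local image, convert it to Base64 encoding and pass the encoded data as input. The following example is compatible with the OpenAI API specification.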
```python
import base64
import os

from openai import OpenAI

# The API key is read from the OPENAI_API_KEY environment variable.
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url="https://qianfan.baidubce.com/v2",
)


# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "image1.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="ernie-4.5-8k-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
    stream=True,
)

# Print the streamed reply piece by piece, skipping chunks that carry no text.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
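Image parameter usage notes
- Every model call is stateless, so you must manage the information passed to the model yourself. If the model needs to interpret the same image more than once, pass the image in with every request.
- Single and multiple images are supported. Each image must not exceed 10 MB, and the total tokens of all input images must not exceed the model's context length. For the ERNIE-4.5 model, for example, image input must not exceed 8K tokens.
- Image formats:
  a. Base64 image: JPG, JPEG, PNG, and BMP are supported. The value must be passed in the form data:image/<image format>;base64,<Base64 encoding>
  b. Public image URL: JPG, JPEG, PNG, BMP, and WEBP are supported.
- Detailed parameter format: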
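| Parameter | Type | Required | Description | Notes |
|---|---|---|---|---|
| type | string | Yes | Only one allowed value: image_url | |
| image_url | obj | Yes | Input image information | 1. Multiple images are supported with no limit on their number; exceeding 8K tokens returns a token-limit error. 2. A single image must not exceed 10 MB (the size of the image downloaded from the URL, or of the image saved from the Base64 data). |
| +url | string | Yes | Public URL or Base64 encoding of the image | 1. Supported formats: Base64: JPG, JPEG, PNG, BMP, and similar types; URL: JPG, JPEG, PNG, BMP, WEBP, and similar types. 2. For Base64, the value must be in the form data:image/<image format>;base64,<Base64 encoding> |
| +detail | string | No | Image resolution/quality | low means low resolution; high means high resolution |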
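The Base64 rules in the table can be enforced when an `image_url` part is built from a local image. The sketch below rests on a few assumptions: `image_part` and `image_part_from_file` are hypothetical helpers, 10 MB is taken to mean 10 × 1024 × 1024 bytes, and JPG files are labeled `image/jpeg` as in the Base64 example earlier on this page.

```python
import base64
from pathlib import Path
from typing import Optional

# File suffixes accepted for Base64 input, mapped to the data-URL image format.
IMAGE_SUBTYPES = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png", ".bmp": "bmp"}
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: 10 MB counted in binary units


def image_part(data: bytes, suffix: str, detail: Optional[str] = None) -> dict:
    """Hypothetical helper: build an image_url part from raw image bytes."""
    subtype = IMAGE_SUBTYPES.get(suffix.lower())
    if subtype is None:
        raise ValueError(f"unsupported format for Base64 input: {suffix}")
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 10 MB limit")
    if detail not in (None, "low", "high"):
        raise ValueError("detail must be 'low' or 'high'")
    encoded = base64.b64encode(data).decode("utf-8")
    image_url = {"url": f"data:image/{subtype};base64,{encoded}"}
    if detail is not None:
        image_url["detail"] = detail
    return {"type": "image_url", "image_url": image_url}


def image_part_from_file(path: str, detail: Optional[str] = None) -> dict:
    """Read a local file and delegate to image_part using the file's suffix."""
    file = Path(path)
    return image_part(file.read_bytes(), file.suffix, detail)


# Placeholder bytes stand in for real image data in this demonstration.
part = image_part(b"abc", ".png", detail="low")
print(part["image_url"]["url"])
```

The size check is applied to the raw bytes rather than the encoded string, since the limit is described in terms of the saved image.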
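Error codes
If a request fails, the JSON text returned by the server contains the following parameters.
| Parameter | Description |
|---|---|
| code | Error code |
| msg | Error message that helps you understand and resolve the error |
| type | Error type |
For more error codes, see the Inference Service V2 error code reference.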