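Native Multimodal

Last updated: 2025-04-11

Native multimodal foundation models are trained jointly across multiple modalities so that the modalities are optimized together, which gives them strong multimodal understanding. They also offer stronger language capabilities: understanding, generation, reasoning, and memory improve across the board, with notable gains in hallucination reduction, logical reasoning, and code.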

Supported models

Model name	Model version	model parameter value	max_completion_tokens range
ERNIE 4.5	ERNIE-4.5-8K-Preview	ernie-4.5-8k-preview	[2, 8192]
Llama-4-Maverick	Llama-4-Maverick-17B-128E-Instruct	llama-4-maverick-17b-128e-instruct	[2, 8192]
Llama-4-Scout	Llama-4-Scout-17B-16E-Instruct	llama-4-scout-17b-16e-instruct	[2, 8192]
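
The ranges in the table can be checked on the client before a request is sent. The following is a minimal sketch of such a check; `MAX_COMPLETION_TOKENS_RANGE` and `validate_max_completion_tokens` are hypothetical helper names, not part of any SDK.

```python
# Allowed max_completion_tokens range per model, copied from the table above.
MAX_COMPLETION_TOKENS_RANGE = {
    "ernie-4.5-8k-preview": (2, 8192),
    "llama-4-maverick-17b-128e-instruct": (2, 8192),
    "llama-4-scout-17b-16e-instruct": (2, 8192),
}


def validate_max_completion_tokens(model: str, value: int) -> int:
    """Return value unchanged if it lies inside the model's range, else raise."""
    if model not in MAX_COMPLETION_TOKENS_RANGE:
        raise ValueError(f"unknown model: {model}")
    low, high = MAX_COMPLETION_TOKENS_RANGE[model]
    if not low <= value <= high:
        raise ValueError(
            f"max_completion_tokens must be in [{low}, {high}], got {value}"
        )
    return value


print(validate_max_completion_tokens("ernie-4.5-8k-preview", 512))
```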

Usage

Input

Supported input modalities

The following input combinations are supported:

Text input
Image input
Image + text input
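
All three combinations can be expressed as one `content` list in a user message. The sketch below assembles that list; `build_content` is a hypothetical helper, and the part shapes (`type`, `text`, `image_url.url`) mirror the request examples on this page.

```python
import json


def build_content(text=None, image_urls=()):
    """Build the "content" list of a user message from optional text and images."""
    parts = []
    if text:
        parts.append({"type": "text", "text": text})
    for url in image_urls:
        parts.append({"type": "image_url", "image_url": {"url": url}})
    if not parts:
        raise ValueError("provide text, at least one image, or both")
    return parts


image = "https://testimage.bj.bcebos.com/image1.jpg"
text_only = build_content(text="Hello")
image_only = build_content(image_urls=[image])
text_and_image = build_content("What is in this image?", [image])
print(json.dumps(text_and_image, indent=2))
```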

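How to pass multimodal data

Input images and files can be passed as Base64-encoded data or as public URLs. The examples below pass a public URL; to pass Base64-encoded data instead, see: Image Base64-encoded input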

Output

Multimodal models can currently be called in either streaming or non-streaming mode.
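
In streaming mode the reply arrives as a sequence of chunks rather than one JSON body. The sketch below joins the text fragments from raw stream lines. It assumes the stream follows the OpenAI-compatible server-sent events convention (each chunk on a `data:` line, terminated by `data: [DONE]`), which this page does not spell out; `iter_stream_text` is a hypothetical helper and the sample lines are hand-written for illustration.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an iterable of raw SSE lines (str)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream marker
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines, not captured from the service.
sample = [
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```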

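Supported output modalities

Important

Only text output is currently supported. Llama 4 supports text output only; audio output for ERNIE 4.5 is planned for a later release.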

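Quick start

You can pass an image to the model in two ways: as an image URL or as Base64-encoded data. As with text, image content must be sent under the user role, that is, "role": "user". The following are simple examples of calling a visual understanding model.

Prerequisites

Calls to the API described here are authenticated with an API Key. For the full procedure, see Authentication.

Text input

The model accepts plain text as input.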

Plain Text

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Introduce a few famous attractions in Shanghai"
                }
            ]
        }
    ],
    "stream": false
}'
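
The same request can be sent from Python with only the standard library. This is a minimal sketch rather than an official client: `build_request` and `chat` are hypothetical helpers, while the URL, headers, and payload fields are taken from the curl command above.

```python
import json
import urllib.request

API_URL = "https://qianfan.bj.baidubce.com/v2/chat/completions"


def build_request(api_key, payload):
    """Create the POST request that the curl command above sends."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(api_key, payload, timeout=60):
    """Send the request and return the decoded JSON response (network call)."""
    request = build_request(api_key, payload)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


payload = {
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Introduce a few famous attractions in Shanghai"}
            ],
        }
    ],
    "stream": False,
}
request = build_request("your-api-key", payload)
print(request.get_method(), request.full_url)
```

Calling `chat(api_key, payload)` with a real key performs the request; the lines above only construct it.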

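Image input

The model accepts an image alone as input.

The model accepts multiple images. Input images must meet the following requirements:

Each image file must be no larger than 10 MB.
The number of images is limited by the model's combined image-and-text token limit (the maximum input); the total token count of all images must be less than the model's maximum input.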

Plain Text

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                }
            ]
        }
    ],
    "stream": false
}'
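
The per-file size limit can be checked locally before an image is uploaded or encoded. The sketch below does that for files on disk; `oversized_images` is a hypothetical helper, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption. The token limit cannot be checked this way, because image token counts are computed by the service.

```python
import os
import tempfile

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" is read as 10 MiB


def oversized_images(paths):
    """Return the paths whose file size exceeds MAX_IMAGE_BYTES."""
    return [p for p in paths if os.path.getsize(p) > MAX_IMAGE_BYTES]


# Demo with a small throwaway file standing in for a real image.
with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.jpg")
    with open(small, "wb") as f:
        f.write(b"\0" * 1024)
    print(oversized_images([small]))
```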

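Image + text input

The model accepts images together with text as input.

The model accepts multiple images. Input images must meet the following requirements:

Each image file must be no larger than 10 MB.
The number of images is limited by the model's combined image-and-text token limit (the maximum input); the total token count of all images must be less than the model's maximum input.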

Plain Text

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image2.png"
                    }
                }
            ]
        }
    ],
    "stream": false
}'

Search enhancement

Note: only ERNIE 4.5 supports search enhancement.

Plain Text

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Who won the men'\''s singles table tennis gold medal at the 2024 Olympics?"
                }
            ]
        }
    ],
    "web_search": {
        "enable": true,
        "enable_trace": true
    },
    "stream": false,
    "max_tokens": 512
}'
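
Because only ERNIE 4.5 accepts the `web_search` field, a client may want to guard against adding it to a Llama 4 request. The sketch below shows one way to do so; `with_web_search` is a hypothetical helper, and identifying ERNIE 4.5 by the `ernie-4.5` prefix of the model value is an assumption based on the model table on this page.

```python
def with_web_search(payload, enable_trace=True):
    """Return a copy of payload with the web_search block from the example added."""
    model = payload.get("model", "")
    # Assumption: ERNIE 4.5 model values start with "ernie-4.5".
    if not model.startswith("ernie-4.5"):
        raise ValueError(f"web_search is only supported by ERNIE 4.5, got {model!r}")
    updated = dict(payload)  # shallow copy; the caller's dict is left untouched
    updated["web_search"] = {"enable": True, "enable_trace": enable_trace}
    return updated


base = {
    "model": "ernie-4.5-8k-preview",
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}],
    "stream": False,
}
print(with_web_search(base)["web_search"])
```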

Multi-turn conversation

First round

Request

Plain Text

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image2.png"
                    }
                }
            ]
        }
    ],
    "stream": false
}'

Response

Plain Text

{
    "id": "as-t4ntazzewk",
    "object": "chat.completion",
    "created": 1742198333,
    "model": "ernie-4.5-8k-preview",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The first image depicts a serene natural landscape with a wooden boardwalk leading through a lush green field under a blue sky with scattered clouds. The second image shows a close-up of a hand interacting with a small, intricately designed figurine that resembles a turtle with detailed patterns and textures, surrounded by green foliage. The primary difference between the two images is that the first is a wide-angle photograph of an outdoor scene, while the second is a close-up photograph of a hand and a decorative object."
            },
            "finish_reason": "stop",
            "flag": 0
        }
    ],
    "usage": {
        "prompt_tokens": 3019,
        "completion_tokens": 102,
        "total_tokens": 3121
    }
}

Second round

Request (the follow-up user message asks, in Chinese, for a more detailed analysis)

Plain Text

curl --location 'https://qianfan.bj.baidubce.com/v2/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer your-api-key' \
--data '{
    "model": "ernie-4.5-8k-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What are in these images? Is there any difference between them?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image1.jpg"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://testimage.bj.bcebos.com/image2.png"
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": "The first image depicts a serene natural landscape with a wooden boardwalk leading through a lush green field under a blue sky with scattered clouds. The second image shows a close-up of a hand interacting with a small, intricately designed figurine that resembles a turtle with detailed patterns and textures, surrounded by green foliage. The primary difference between the two images is that the first is a wide-angle photograph of an outdoor scene, while the second is a close-up photograph of a hand and a decorative object."
        },
        {
            "role": "user",
            "content": "不够详细，再补充解析细节一点哇"
        }
    ],
    "stream": false
}'

Response (reproduced verbatim; the model answers in Chinese, matching the follow-up message)

Plain Text

{
    "id": "as-14hvmej1ts",
    "object": "chat.completion",
    "created": 1742198435,
    "model": "ernie-4.5-8k-preview",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "当然，让我们更详细地解析这两张图片的差异：\n\n1. **主题和内容**：\n   - **第一张图片**：这是一张风景照片，展示了一个自然环境。照片中有一条木制栈道穿过一片绿色的草地，背景是蓝天和一些树木。整体氛围宁静，色彩鲜艳，给人一种开阔和放松的感觉。\n   - **第二张图片**：这是一张特写照片，展示了一只手正在与一个装饰性物品互动。这个物品看起来像是一只带有复杂图案和纹理的乌龟，周围还有一些绿色的植物装饰。整体氛围更加细腻和手工艺感，色彩丰富且细节精致。\n\n2. **视角和构图**：\n   - **第一张图片**：采用了广角视角，捕捉了大面积的自然景观。栈道从前景延伸到背景，引导观众的视线深入画面。天空和地面的比例均衡，给人一种空间感和深度感。\n   - **第二张图片**：采用了特写视角，聚焦于手和乌龟的细节。手指和乌龟的纹理清晰可见，背景模糊，突出主体。构图紧凑，细节丰富，吸引观众注意手工艺品的精细制作。\n\n3. **色彩和光线**：\n   - **第一张图片**：色彩以绿色和蓝色为主，光线自然，可能是拍摄于白天。阳光照射在草地和栈道上，形成明暗对比，增强了画面的立体感。\n   - **第二张图片**：色彩更加多样，乌龟的绿、蓝、黄等颜色鲜艳，光线柔和，可能是室内拍摄。光线集中在手和乌龟上，突出细节和质感。\n\n4. **情感和氛围**：\n   - **第一张图片**：传达出一种宁静、自然和放松的情感，适合用于展示自然风光或户外活动的场景。\n   - **第二张图片**：传达出一种细腻、手工艺和专注的情感，适合用于展示手工艺品或创意作品的场景。\n\n总的来说，这两张图片在主题、视角、色彩和情感上都有显著的不同，分别展示了自然景观和手工艺品的美感。"
            },
            "finish_reason": "stop",
            "flag": 0
        }
    ],
    "usage": {
        "prompt_tokens": 3130,
        "completion_tokens": 394,
        "total_tokens": 3524
    }
}
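
The second-round request is the first-round message list with the assistant's reply and the new user message appended. The sketch below builds that list from a parsed response; `next_turn` is a hypothetical helper, and the response dictionary is abbreviated to the fields the helper reads.

```python
def next_turn(messages, response, user_text):
    """Return the message list for the following round of the conversation."""
    reply = response["choices"][0]["message"]
    return messages + [
        {"role": "assistant", "content": reply["content"]},
        {"role": "user", "content": user_text},
    ]


first_round = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are in these images?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://testimage.bj.bcebos.com/image1.jpg"},
            },
        ],
    }
]
# Abbreviated stand-in for the first-round response shown above.
first_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "The first image ..."}}
    ]
}
second_round = next_turn(first_round, first_response, "Please add more detail.")
print([m["role"] for m in second_round])
```

Because each call is stateless, the images from the first round stay in the list and are sent again with every later request.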

Image Base64-encoded input

To upload a local image, convert it to Base64 and pass the encoded string. The following example uses the OpenAI-compatible interface.

Plain Text

from openai import OpenAI
import os
import base64

# The environment variable name is this example's choice; it must hold your API key.
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url="https://qianfan.baidubce.com/v2",
)


def encode_image(image_path):
    """Read a local image and return its contents as a Base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "image1.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="ernie-4.5-8k-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
    stream=True,
)

# Print the streamed text fragments as they arrive
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

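Image parameter notes

Every model call is stateless, so you must manage the information passed to the model yourself. If the model needs to interpret the same image more than once, include the image in every request.
Single and multiple images are supported. Each image must be no larger than 10 MB, and the total tokens of all input images must not exceed the model's context length. For example, with the ERNIE-4.5 model, image input must not exceed 8K tokens.
Image formats: a. Base64 images: JPG, JPEG, PNG, and BMP, passed in the form data:image/<image format>;base64,<Base64 encoding> b. Public image URLs: JPG, JPEG, PNG, BMP, and WEBP
Detailed parameter format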

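Parameter	Type	Required	Description	Notes
type	string	Yes	Only one value is allowed: image_url
image_url	object	Yes	Input image information	1. Multiple images are supported with no limit on their number; exceeding 8K tokens returns a token-limit error 2. A single image must not exceed 10 MB (the size of the image downloaded from the URL, or of the image saved from the Base64 data)
+url	string	Yes	Public URL or Base64 data of the image	1. Supported formats: Base64: JPG, JPEG, PNG, BMP, and similar types; URL: JPG, JPEG, PNG, BMP, WEBP, and similar types 2. Base64 data must be passed in the form data:image/<image format>;base64,<Base64 encoding>
+detail	string	No	Image resolution/quality	low means low resolution, high means high resolution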
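
The table can be turned into a small builder for one `image_url` content part. This is a sketch under stated assumptions: `image_part_from_bytes` is a hypothetical helper, mapping JPG and JPEG to the `jpeg` subtype follows the SDK example on this page, and the demo bytes are a placeholder rather than a real image.

```python
import base64

# Formats this page lists for Base64 input, mapped to the data-URL subtype.
BASE64_FORMATS = {"jpg": "jpeg", "jpeg": "jpeg", "png": "png", "bmp": "bmp"}


def image_part_from_bytes(data, fmt, detail=None):
    """Build an image_url content part from raw image bytes."""
    fmt = fmt.lower().lstrip(".")
    if fmt not in BASE64_FORMATS:
        raise ValueError(f"unsupported Base64 image format: {fmt}")
    encoded = base64.b64encode(data).decode("ascii")
    image_url = {"url": f"data:image/{BASE64_FORMATS[fmt]};base64,{encoded}"}
    if detail is not None:
        if detail not in ("low", "high"):
            raise ValueError("detail must be 'low' or 'high'")
        image_url["detail"] = detail  # optional field, nested under image_url
    return {"type": "image_url", "image_url": image_url}


# Placeholder bytes for illustration only; pass the real file contents instead.
part = image_part_from_bytes(b"\x89PNG", "png", detail="low")
print(part["image_url"]["url"][:22])
```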

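Error codes

If a request fails, the JSON returned by the server contains the following parameters.

code	Error code
msg	Error description that helps you understand and resolve the error
type	Error type

For more error codes, see the Inference Service V2 error code reference.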
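
A client can surface these three fields when a call fails. The sketch below formats them from a response body; `describe_error` is a hypothetical helper, it assumes the fields appear at the top level of the JSON as the table suggests (the actual nesting is not shown on this page), and the sample values are made up for illustration.

```python
import json


def describe_error(body):
    """Format the code, type, and msg fields of an error response body."""
    data = json.loads(body) if isinstance(body, (str, bytes)) else body
    # Assumption: the three fields sit at the top level of the JSON object.
    code = data.get("code", "unknown")
    err_type = data.get("type", "unknown")
    msg = data.get("msg", "")
    return f"[{code}] ({err_type}) {msg}".rstrip()


# Made-up values for illustration; real codes are in the error code reference.
sample_error = '{"code": "example_code", "msg": "example message", "type": "example_type"}'
print(describe_error(sample_error))
```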
