bedrock-inference


Amazon Bedrock Inference


Overview


Amazon Bedrock Runtime provides APIs for invoking foundation models, including Anthropic's Claude (Opus, Sonnet, Haiku), Amazon's Nova and Titan, and third-party models from Cohere, AI21, and Meta. It supports synchronous and asynchronous inference, as well as response streaming.
Purpose: Production-grade model inference with unified API across all Bedrock models
Pattern: Task-based (independent operations for different inference modes)
Key Capabilities:
  1. Model Invocation - Direct model calls with native or Converse API
  2. Streaming - Real-time token streaming for low latency
  3. Async Invocation - Long-running tasks up to 24 hours
  4. Token Counting - Cost estimation before inference
  5. Guardrails - Runtime content filtering and safety
  6. Inference Profiles - Cross-region routing and cost optimization
Quality Targets:
  • Latency: < 1s first token for streaming
  • Throughput: Up to 4,000 tokens/sec
  • Availability: 99.9% SLA with cross-region profiles


When to Use


Use bedrock-inference when:
  • Invoking Claude, Nova, Titan, or other Bedrock models
  • Building conversational AI applications
  • Implementing streaming responses for better UX
  • Running long async inference tasks (up to 24 hours)
  • Applying runtime guardrails for content safety
  • Optimizing costs with inference profiles
  • Counting tokens before model invocation
  • Implementing multi-turn conversations
When NOT to Use:
  • Building complex agents (use bedrock-agentcore)
  • Knowledge base RAG (use bedrock-knowledge-bases)
  • Model customization (use bedrock-fine-tuning)


Prerequisites


Required


  • AWS account with Bedrock access
  • Model access enabled in AWS Console
  • IAM permissions for Bedrock Runtime
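The IAM permissions above can be granted with a minimal policy along these lines (a sketch; in production, scope `Resource` down to the specific foundation-model and inference-profile ARNs you use):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockInvoke",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "*"
    }
  ]
}
```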

Recommended


  • boto3 >= 1.34.0 (for the Converse API)
  • Understanding of model-specific input formats
  • CloudWatch for monitoring

Installation


```bash
pip install boto3 botocore
```

Enable Model Access



Check available models


```bash
aws bedrock list-foundation-models --region us-east-1
```

Request model access via Console:


AWS Console → Bedrock → Model access → Manage model access



---


Model IDs and Inference Profiles


Claude Models (Anthropic)


| Model | Model ID | Inference Profile ID | Region | Max Tokens |
|---|---|---|---|---|
| Claude Opus 4.5 | `anthropic.claude-opus-4-5-20251101-v1:0` | `global.anthropic.claude-opus-4-5-20251101-v1:0` | Global | 200K |
| Claude Sonnet 4.5 | `anthropic.claude-sonnet-4-5-20250929-v1:0` | `us.anthropic.claude-sonnet-4-5-20250929-v1:0` | US | 200K |
| Claude Haiku 4.5 | `anthropic.claude-haiku-4-5-20251001-v1:0` | `us.anthropic.claude-haiku-4-5-20251001-v1:0` | US | 200K |
| Claude Sonnet 3.5 v2 | `anthropic.claude-3-5-sonnet-20241022-v2:0` | `us.anthropic.claude-3-5-sonnet-20241022-v2:0` | US | 200K |
| Claude Haiku 3.5 | `anthropic.claude-3-5-haiku-20241022-v1:0` | `us.anthropic.claude-3-5-haiku-20241022-v1:0` | US | 200K |

Amazon Nova Models


| Model | Model ID | Inference Profile ID | Region | Max Tokens |
|---|---|---|---|---|
| Nova Pro | `amazon.nova-pro-v1:0` | `us.amazon.nova-pro-v1:0` | US | 300K |
| Nova Lite | `amazon.nova-lite-v1:0` | `us.amazon.nova-lite-v1:0` | US | 300K |
| Nova Micro | `amazon.nova-micro-v1:0` | `us.amazon.nova-micro-v1:0` | US | 128K |

Amazon Titan Models


| Model | Model ID | Region | Max Tokens |
|---|---|---|---|
| Titan Text Premier | `amazon.titan-text-premier-v1:0` | All | 32K |
| Titan Text Express | `amazon.titan-text-express-v1` | All | 8K |

Inference Profile Prefixes


  • `us.` - US-only routing (lower latency for US traffic)
  • `global.` - Global cross-region routing (highest availability)
  • `apac.` - Asia-Pacific routing (lower latency for APAC traffic)

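A profile ID is simply the base model ID with one of these prefixes prepended, so routing scope can be switched with a one-line helper (`profile_id` is hypothetical, not part of any SDK):

```python
def profile_id(model_id: str, scope: str = 'us') -> str:
    """Prepend an inference-profile prefix ('us', 'global', 'apac') to a base model ID."""
    return f"{scope}.{model_id}"

print(profile_id('anthropic.claude-sonnet-4-5-20250929-v1:0', 'global'))
# global.anthropic.claude-sonnet-4-5-20250929-v1:0
```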

Quick Reference


Client Initialization


```python
import boto3
from typing import Optional

def get_bedrock_client(region_name: str = 'us-east-1',
                       profile_name: Optional[str] = None):
    """Initialize a Bedrock Runtime client"""
    session = boto3.Session(
        region_name=region_name,
        profile_name=profile_name
    )
    return session.client('bedrock-runtime')
```

Usage


```python
bedrock = get_bedrock_client(region_name='us-west-2')
```

---

Operations


1. Invoke Model (Native API)


Direct model invocation using model-specific request format.
Basic Invocation:
```python
import json

def invoke_claude(prompt: str, model_id: str = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0'):
    """Invoke Claude with the native (InvokeModel) API"""
    bedrock = get_bedrock_client()

    # Claude-specific request format
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.7,
        "top_p": 0.9
    }

    response = bedrock.invoke_model(
        modelId=model_id,
        body=json.dumps(request_body)
    )

    # Parse the response
    response_body = json.loads(response['body'].read())
    return response_body['content'][0]['text']
```

Usage


```python
result = invoke_claude("Explain quantum computing in simple terms")
print(result)
```

**With System Prompts**:
```python
request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 2048,
    "system": "You are a helpful AI assistant specialized in technical documentation.",
    "messages": [
        {
            "role": "user",
            "content": "Write API documentation for a REST endpoint"
        }
    ]
}
```

**With Tool Use**:
```python
request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 4096,
    "messages": [
        {
            "role": "user",
            "content": "What's the weather in San Francisco?"
        }
    ],
    "tools": [
        {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "input_schema": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    ]
}
```


2. Converse API (Unified Interface)


A model-agnostic API that works across all Bedrock models with a consistent interface.
Basic Conversation:
```python
from typing import Optional

def converse_with_model(
    messages: list,
    model_id: str = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0',
    system_prompts: Optional[list] = None,
    max_tokens: int = 2048
):
    """Converse API for unified model interaction"""
    bedrock = get_bedrock_client()

    inference_config = {
        'maxTokens': max_tokens,
        'temperature': 0.7,
        'topP': 0.9
    }

    request_params = {
        'modelId': model_id,
        'messages': messages,
        'inferenceConfig': inference_config
    }

    if system_prompts:
        request_params['system'] = system_prompts

    return bedrock.converse(**request_params)
```

Usage


```python
messages = [
    {'role': 'user', 'content': [{'text': 'What are the benefits of microservices architecture?'}]}
]
system_prompts = [{'text': 'You are a software architecture expert.'}]

response = converse_with_model(messages, system_prompts=system_prompts)
assistant_message = response['output']['message']
print(assistant_message['content'][0]['text'])
```

**Multi-turn Conversation**:
```python
def multi_turn_conversation():
    """Multi-turn conversation with context"""
    bedrock = get_bedrock_client()

    messages = []
    model_id = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0'

    # Turn 1
    messages.append({
        'role': 'user',
        'content': [{'text': 'My name is Alice and I work in healthcare.'}]
    })

    response = bedrock.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={'maxTokens': 1024}
    )

    # Add assistant response to history
    messages.append(response['output']['message'])

    # Turn 2 (model remembers context)
    messages.append({
        'role': 'user',
        'content': [{'text': 'What are some AI applications in my field?'}]
    })

    response = bedrock.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={'maxTokens': 1024}
    )

    return response['output']['message']['content'][0]['text']
```

**With Tool Use (Converse API)**:
```python
def converse_with_tools():
    """Converse API with tool use"""
    bedrock = get_bedrock_client()

    tools = [
        {
            'toolSpec': {
                'name': 'get_stock_price',
                'description': 'Get current stock price for a symbol',
                'inputSchema': {
                    'json': {
                        'type': 'object',
                        'properties': {
                            'symbol': {
                                'type': 'string',
                                'description': 'Stock ticker symbol'
                            }
                        },
                        'required': ['symbol']
                    }
                }
            }
        }
    ]

    messages = [
        {
            'role': 'user',
            'content': [{'text': "What's the price of AAPL stock?"}]
        }
    ]

    response = bedrock.converse(
        modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        messages=messages,
        toolConfig={'tools': tools},
        inferenceConfig={'maxTokens': 2048}
    )

    # Check if the model wants to use a tool
    if response['stopReason'] == 'tool_use':
        tool_use = response['output']['message']['content'][0]['toolUse']
        print(f"Tool requested: {tool_use['name']}")
        print(f"Tool input: {tool_use['input']}")

        # Execute the tool, append the result to messages, and call converse again

    return response
```
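The final comment leaves the tool round trip implicit: after executing the tool, append the assistant's tool-use turn to `messages`, then a user turn carrying a `toolResult` block, and call `converse` again. A minimal sketch of building that result message (the `tool_result_message` helper is hypothetical, not part of the SDK):

```python
def tool_result_message(tool_use_id: str, result: dict) -> dict:
    """Build the user turn that returns a tool result to the Converse API."""
    return {
        'role': 'user',
        'content': [
            {
                'toolResult': {
                    'toolUseId': tool_use_id,
                    'content': [{'json': result}]
                }
            }
        ]
    }
```

Append the assistant message from the first response, then this message, before the follow-up `converse` call.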


3. Stream Response (Real-time Tokens)


Stream tokens as they're generated for lower perceived latency.
Streaming with Native API:
```python
def stream_claude_response(prompt: str):
    """Stream response tokens in real-time"""
    bedrock = get_bedrock_client()

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ]
    }

    response = bedrock.invoke_model_with_response_stream(
        modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        body=json.dumps(request_body)
    )

    # Process the event stream
    stream = response['body']
    full_text = ""

    for event in stream:
        chunk = event.get('chunk')
        if chunk:
            chunk_obj = json.loads(chunk['bytes'].decode())

            if chunk_obj['type'] == 'content_block_delta':
                delta = chunk_obj['delta']
                if delta['type'] == 'text_delta':
                    text = delta['text']
                    print(text, end='', flush=True)
                    full_text += text

            elif chunk_obj['type'] == 'message_stop':
                print()  # Newline at the end

    return full_text
```

Usage


```python
response = stream_claude_response("Write a short story about a robot")
```

**Streaming with Converse API**:
```python
def stream_converse(messages: list, model_id: str):
    """Stream response using the Converse API"""
    bedrock = get_bedrock_client()

    response = bedrock.converse_stream(
        modelId=model_id,
        messages=messages,
        inferenceConfig={'maxTokens': 2048}
    )

    stream = response['stream']
    full_text = ""

    for event in stream:
        if 'contentBlockDelta' in event:
            delta = event['contentBlockDelta']['delta']
            if 'text' in delta:
                text = delta['text']
                print(text, end='', flush=True)
                full_text += text

        elif 'messageStop' in event:
            print()
            break

    return full_text
```

Usage


```python
messages = [{'role': 'user', 'content': [{'text': 'Explain neural networks'}]}]
stream_converse(messages, 'us.anthropic.claude-sonnet-4-5-20250929-v1:0')
```

**Streaming with Error Handling**:
```python
def safe_streaming(prompt: str):
    """Streaming with comprehensive error handling"""
    bedrock = get_bedrock_client()

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": prompt}]
    }

    try:
        response = bedrock.invoke_model_with_response_stream(
            modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
            body=json.dumps(request_body)
        )

        full_text = ""
        for event in response['body']:
            chunk = event.get('chunk')
            if chunk:
                chunk_obj = json.loads(chunk['bytes'].decode())

                if chunk_obj['type'] == 'content_block_delta':
                    text = chunk_obj['delta'].get('text', '')
                    print(text, end='', flush=True)
                    full_text += text

                elif chunk_obj['type'] == 'error':
                    print(f"\nStreaming error: {chunk_obj['error']}")
                    break

        return full_text

    except Exception as e:
        print(f"Stream failed: {e}")
        raise
```
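Stream-level errors aside, throttling (`ThrottlingException`) is the most common failure under load. A generic retry wrapper with exponential backoff and jitter, independent of any particular Bedrock call (a sketch; tune the attempt count and delays to your workload):

```python
import random
import time

def with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0,
                 retryable=('ThrottlingException', 'ServiceUnavailableException')):
    """Retry a zero-argument callable on throttling errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as e:
            # botocore's ClientError carries the error code in e.response
            code = getattr(e, 'response', {}).get('Error', {}).get('Code', '')
            if code not in retryable or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))
```

Usage: `text = with_backoff(lambda: invoke_claude("Hello"))`.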


4. Count Tokens


Estimate token usage and costs before invoking models.
Converse Token Counting:
```python
def count_tokens(messages: list, model_id: str):
    """Count input tokens for cost estimation.

    Uses the CountTokens API (boto3 count_tokens); requires a recent
    boto3 and a model that supports token counting.
    """
    bedrock = get_bedrock_client()

    # Optional system prompts
    system_prompts = [
        {'text': 'You are a helpful assistant.'}
    ]

    # CountTokens takes the same input shape as a Converse request
    response = bedrock.count_tokens(
        modelId=model_id,
        input={
            'converse': {
                'messages': messages,
                'system': system_prompts
            }
        }
    )

    input_tokens = response['inputTokens']
    print(f"Input tokens: {input_tokens}")

    return input_tokens
```

Usage


```python
messages = [{'role': 'user', 'content': [{'text': 'This is a test message'}]}]
tokens = count_tokens(messages, 'us.anthropic.claude-sonnet-4-5-20250929-v1:0')
```

**Cost Estimation**:
```python
def estimate_cost(messages: list, model_id: str, estimated_output_tokens: int = 1000):
    """Estimate inference cost before invocation"""
    bedrock = get_bedrock_client()

    # Count input tokens via the CountTokens API
    token_response = bedrock.count_tokens(
        modelId=model_id,
        input={'converse': {'messages': messages}}
    )
    input_tokens = token_response['inputTokens']

    # Example per-token rates; prices vary by region, so verify current AWS pricing
    pricing = {
        'us.anthropic.claude-opus-4-5-20251101-v1:0': {
            'input': 15.00 / 1_000_000,   # $15 per 1M input tokens
            'output': 75.00 / 1_000_000   # $75 per 1M output tokens
        },
        'us.anthropic.claude-sonnet-4-5-20250929-v1:0': {
            'input': 3.00 / 1_000_000,
            'output': 15.00 / 1_000_000
        },
        'us.anthropic.claude-haiku-4-5-20251001-v1:0': {
            'input': 0.80 / 1_000_000,
            'output': 4.00 / 1_000_000
        }
    }

    if model_id not in pricing:
        print("Pricing not available for this model")
        return None

    input_cost = input_tokens * pricing[model_id]['input']
    output_cost = estimated_output_tokens * pricing[model_id]['output']
    total_cost = input_cost + output_cost

    print(f"Input tokens: {input_tokens:,} (${input_cost:.6f})")
    print(f"Estimated output: {estimated_output_tokens:,} (${output_cost:.6f})")
    print(f"Estimated total: ${total_cost:.6f}")

    return {
        'input_tokens': input_tokens,
        'estimated_output_tokens': estimated_output_tokens,
        'input_cost': input_cost,
        'output_cost': output_cost,
        'total_cost': total_cost
    }
```


5. Async Invoke (Long-Running Tasks)


For inference tasks that take longer than 60 seconds (up to 24 hours).
Start Async Invocation:
```python
def async_invoke_model(prompt: str, s3_output_uri: str):
    """Start an async model invocation for long-running tasks.

    Note: async invocation (StartAsyncInvoke) is only available for
    models that support it; check the model documentation first.
    """
    bedrock = get_bedrock_client()

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 10000,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ]
    }

    response = bedrock.start_async_invoke(
        modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        modelInput=request_body,  # a JSON document, not a serialized string
        outputDataConfig={
            's3OutputDataConfig': {
                's3Uri': s3_output_uri
            }
        }
    )

    invocation_arn = response['invocationArn']
    print(f"Async invocation started: {invocation_arn}")

    return invocation_arn
```

Usage


s3_output = 's3://my-bucket/bedrock-outputs/result.json' arn = async_invoke_model("Write a 10,000 word technical guide", s3_output)

**Check Async Status**:
```python
def check_async_status(invocation_arn: str):
    """Check status of async invocation"""
    bedrock = get_bedrock_client()

    response = bedrock.get_async_invoke(
        invocationArn=invocation_arn
    )

    status = response['status']
    print(f"Status: {status}")

    if status == 'Completed':
        output_uri = response['outputDataConfig']['s3OutputDataConfig']['s3Uri']
        print(f"Output available at: {output_uri}")

        # Download and parse result
        # (Use boto3 S3 client to retrieve)

    elif status == 'Failed':
        print(f"Failure reason: {response.get('failureMessage', 'Unknown')}")

    return response
```

Usage


```python
status = check_async_status(arn)
```
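The "download and parse" step left as a comment in `check_async_status` can be sketched with the boto3 S3 client. The helper names below are illustrative, not part of the Bedrock API:

```python
import json

def parse_s3_uri(s3_uri: str) -> tuple:
    """Split 's3://bucket/key' into (bucket, key)"""
    bucket, _, key = s3_uri.removeprefix('s3://').partition('/')
    return bucket, key

def fetch_async_result(s3_uri: str) -> dict:
    """Download and parse an async invocation result from S3"""
    import boto3  # imported lazily so the parsing helper is usable standalone
    bucket, key = parse_s3_uri(s3_uri)
    obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
    return json.loads(obj['Body'].read())
```

Call `fetch_async_result` on the `s3Uri` returned once the invocation status is `Completed`.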

**List Async Invocations**:
```python
def list_async_invocations(status_filter: Optional[str] = None):
    """List all async invocations"""
    bedrock = get_bedrock_client()

    params = {}
    if status_filter:
        params['statusEquals'] = status_filter  # 'InProgress', 'Completed', 'Failed'

    response = bedrock.list_async_invokes(**params)

    for invocation in response.get('asyncInvokeSummaries', []):
        print(f"ARN: {invocation['invocationArn']}")
        print(f"Status: {invocation['status']}")
        print(f"Submit time: {invocation['submitTime']}")
        print("---")

    return response
```


6. Apply Guardrail (Runtime Safety)


Apply content filtering and safety policies at runtime.
Invoke with Guardrail:
```python
def invoke_with_guardrail(
    prompt: str,
    guardrail_id: str,
    guardrail_version: str = 'DRAFT'
):
    """Invoke model with runtime guardrail"""
    bedrock = get_bedrock_client()

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ]
    }

    response = bedrock.invoke_model(
        modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        body=json.dumps(request_body),
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version
    )

    # Check whether the guardrail intervened (reported in the response body)
    response_body = json.loads(response['body'].read())

    if response_body.get('amazon-bedrock-guardrailAction') == 'INTERVENED':
        print("Content blocked by guardrail")
        return None

    return response_body['content'][0]['text']
```

Usage


```python
result = invoke_with_guardrail(
    "Tell me about quantum computing",
    guardrail_id='abc123xyz',
    guardrail_version='1'
)
```

**Converse with Guardrail**:
```python
def converse_with_guardrail(messages: list, guardrail_config: dict):
    """Converse API with guardrail configuration"""
    bedrock = get_bedrock_client()

    response = bedrock.converse(
        modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        messages=messages,
        inferenceConfig={'maxTokens': 2048},
        guardrailConfig=guardrail_config
    )

    # Guardrail interventions are surfaced through stopReason
    if response.get('stopReason') == 'guardrail_intervened':
        print("Guardrail blocked content")

    # With trace enabled, detailed policy assessments are in the trace
    if 'trace' in response:
        print(response['trace'].get('guardrail', {}))

    return response
```

Usage


```python
guardrail_config = {
    'guardrailIdentifier': 'abc123xyz',
    'guardrailVersion': '1',
    'trace': 'enabled'
}
messages = [{'role': 'user', 'content': [{'text': 'Test message'}]}]
converse_with_guardrail(messages, guardrail_config)
```

---

Error Handling Patterns


Comprehensive Error Handling


```python
from botocore.exceptions import ClientError, BotoCoreError
import time

def robust_invoke(prompt: str, max_retries: int = 3):
    """Invoke model with retry logic and error handling"""
    bedrock = get_bedrock_client()

    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": prompt}]
    }

    for attempt in range(max_retries):
        try:
            response = bedrock.invoke_model(
                modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
                body=json.dumps(request_body)
            )

            response_body = json.loads(response['body'].read())
            return response_body['content'][0]['text']

        except ClientError as e:
            error_code = e.response['Error']['Code']

            if error_code == 'ThrottlingException':
                wait_time = (2 ** attempt) + 1  # Exponential backoff
                print(f"Throttled. Waiting {wait_time}s before retry {attempt + 1}/{max_retries}")
                time.sleep(wait_time)
                continue

            elif error_code == 'ModelTimeoutException':
                print("Model timeout - request took too long")
                if attempt < max_retries - 1:
                    time.sleep(2)
                    continue
                raise

            elif error_code == 'ModelErrorException':
                print("Model error - check input format")
                raise

            elif error_code == 'ValidationException':
                print("Invalid parameters")
                raise

            elif error_code == 'AccessDeniedException':
                print("Access denied - check IAM permissions and model access")
                raise

            elif error_code == 'ResourceNotFoundException':
                print("Model not found - check model ID")
                raise

            else:
                print(f"Unexpected error: {error_code}")
                raise

        except BotoCoreError as e:
            print(f"Connection error: {e}")
            if attempt < max_retries - 1:
                time.sleep(2)
                continue
            raise

    raise Exception(f"Failed after {max_retries} attempts")
```
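The fixed `2 ** attempt` backoff above can synchronize retries across concurrent clients; AWS retry guidance generally favors adding jitter. A minimal full-jitter variant (a sketch, not part of the original helper):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: uniform over [0, min(cap, base * 2**attempt)]"""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Replacing the fixed wait with `time.sleep(backoff_delay(attempt))` spreads retries out instead of having every throttled client retry at the same instant.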

Specific Error Scenarios


```python
def handle_model_errors():
    """Common error scenarios and solutions"""
    bedrock = get_bedrock_client()

    try:
        # Attempt invocation
        response = bedrock.invoke_model(
            modelId='us.anthropic.claude-sonnet-4-5-20250929-v1:0',
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 2048,
                "messages": [{"role": "user", "content": "test"}]
            })
        )

    except ClientError as e:
        error_code = e.response['Error']['Code']

        if error_code == 'ModelNotReadyException':
            # Model is still loading
            print("Model not ready, wait 30 seconds and retry")

        elif error_code == 'ServiceQuotaExceededException':
            # Hit service quota
            print("Exceeded quota - request increase or use different region")

        elif error_code == 'ModelStreamErrorException':
            # Error during streaming
            print("Stream interrupted - restart stream")
```


Best Practices


1. Cost Optimization


```python
def cost_optimized_inference(prompt: str, complexity: str = 'moderate'):
    """Choose model based on task complexity and cost"""

    # Simple tasks → Haiku (cheapest)
    # Moderate tasks → Sonnet (balanced)
    # Complex tasks → Opus (most capable)

    if complexity == 'simple':
        model_id = 'us.anthropic.claude-haiku-4-5-20251001-v1:0'
        print("Using Haiku for cost efficiency")
    elif complexity == 'complex':
        model_id = 'global.anthropic.claude-opus-4-5-20251101-v1:0'
        print("Using Opus for maximum accuracy")
    else:
        model_id = 'us.anthropic.claude-sonnet-4-5-20250929-v1:0'
        print("Using Sonnet for balanced performance")

    return invoke_claude(prompt, model_id)
```

2. Use Inference Profiles


```python
def use_inference_profiles():
    """Leverage inference profiles for cost savings"""

    # Cross-region profiles offer 30-50% cost savings
    # with automatic region failover

    profiles = {
        'global_opus': 'global.anthropic.claude-opus-4-5-20251101-v1:0',
        'us_sonnet': 'us.anthropic.claude-sonnet-4-5-20250929-v1:0',
        'us_haiku': 'us.anthropic.claude-haiku-4-5-20251001-v1:0'
    }

    # Use global profile for high availability
    # Use regional profile for lower latency

    return profiles
```

3. Implement Caching


```python
from functools import lru_cache
import hashlib

@lru_cache(maxsize=100)
def cached_inference(prompt: str, model_id: str):
    """Cache responses for identical prompts"""
    return invoke_claude(prompt, model_id)

def cache_key(prompt: str) -> str:
    """Generate cache key for prompt"""
    return hashlib.sha256(prompt.encode()).hexdigest()
```
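`lru_cache` hashes the raw arguments and only lives in-process; the `cache_key` helper above hints at a keyed cache (e.g. backed by Redis or a dict). A minimal keyed wrapper, written as a sketch independent of `invoke_claude` by taking the callable as a parameter:

```python
import hashlib

def make_cached(fn, max_size: int = 100):
    """Wrap an inference callable with a bounded, hash-keyed FIFO cache"""
    cache = {}

    def wrapper(prompt: str, model_id: str):
        key = hashlib.sha256(f"{model_id}:{prompt}".encode()).hexdigest()
        if key not in cache:
            if len(cache) >= max_size:
                cache.pop(next(iter(cache)))  # evict oldest insertion
            cache[key] = fn(prompt, model_id)
        return cache[key]

    return wrapper
```

For example, `cached_invoke = make_cached(invoke_claude)` caches repeated prompts per model without re-invoking Bedrock.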

4. Monitor Token Usage


```python
def track_token_usage(messages: list, model_id: str):
    """Track and log token usage"""
    bedrock = get_bedrock_client()

    # Estimate input tokens before invocation (CountTokens API)
    token_count = bedrock.count_tokens(
        modelId=model_id,
        input={'converse': {'messages': messages}}
    )
    estimated_input = token_count['inputTokens']

    # Invoke
    response = bedrock.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={'maxTokens': 2048}
    )

    # Actual usage as reported by the service
    usage = response['usage']

    # Log to CloudWatch or database
    print(f"Estimated input: {estimated_input}, "
          f"Input: {usage['inputTokens']}, "
          f"Output: {usage['outputTokens']}, "
          f"Total: {usage['totalTokens']}")

    return response
```
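Token counts convert to dollars with straightforward arithmetic. The per-1K prices below are function parameters rather than real rates; check current Bedrock pricing for your model and region:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimated USD cost for a single invocation"""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k
```

Pairing this with the usage numbers logged above gives a per-request cost figure suitable for a CloudWatch custom metric.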

5. Use Streaming for Better UX


```python
def stream_for_user_experience(prompt: str):
    """Always use streaming for interactive applications"""

    # Streaming reduces perceived latency
    # Users see tokens immediately instead of waiting

    return stream_claude_response(prompt)
```

6. Async for Long Tasks


```python
def use_async_for_batch(prompts: list, s3_bucket: str):
    """Use async invocation for batch processing"""

    invocation_arns = []

    for idx, prompt in enumerate(prompts):
        s3_uri = f's3://{s3_bucket}/outputs/result-{idx}.json'
        arn = async_invoke_model(prompt, s3_uri)
        invocation_arns.append(arn)

    return invocation_arns
```

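A batch submission then needs a completion check. The sketch below polls `get_async_invoke` until every invocation reaches a terminal state; the client is passed in explicitly, which also makes the loop testable:

```python
import time

def wait_for_batch(bedrock, arns: list, poll_seconds: float = 30.0,
                   timeout_seconds: float = 3600.0) -> dict:
    """Poll async invocations until each is Completed or Failed"""
    deadline = time.time() + timeout_seconds
    pending = set(arns)
    results = {}

    while pending and time.time() < deadline:
        for arn in list(pending):
            status = bedrock.get_async_invoke(invocationArn=arn)['status']
            if status in ('Completed', 'Failed'):
                results[arn] = status
                pending.discard(arn)
        if pending:
            time.sleep(poll_seconds)

    return results
```

Invocations still pending when the timeout expires are simply absent from the returned dict, so the caller can decide whether to keep waiting or cancel.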

IAM Permissions


Minimum Runtime Permissions


```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/anthropic.claude-*",
        "arn:aws:bedrock:*::foundation-model/amazon.nova-*",
        "arn:aws:bedrock:*::foundation-model/amazon.titan-*"
      ]
    }
  ]
}
```

The Converse and ConverseStream APIs are authorized by `bedrock:InvokeModel` and `bedrock:InvokeModelWithResponseStream`; they do not have separate IAM actions.

With Async Invocation


```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:StartAsyncInvoke",
        "bedrock:GetAsyncInvoke",
        "bedrock:ListAsyncInvokes"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bedrock-bucket/*"
    }
  ]
}
```


Progressive Disclosure


Quick Start (This File)


  • Client initialization
  • Model IDs and inference profiles
  • Basic invocation (native and Converse API)
  • Streaming responses
  • Token counting
  • Async invocation
  • Guardrail application
  • Error handling patterns
  • Best practices

Detailed References


  • Advanced Invocation Patterns: Batch processing, parallel requests, custom retry logic, response parsing
  • Multimodal Support: Image inputs, document parsing, vision capabilities for Claude and Nova
  • Tool Use and Function Calling: Complete tool use patterns, multi-turn tool conversations, error handling
  • Performance Optimization: Latency optimization, throughput tuning, cost reduction strategies
  • Monitoring and Observability: CloudWatch integration, custom metrics, cost tracking, usage analytics


Related Skills


  • bedrock-agentcore: Build production AI agents with managed infrastructure
  • bedrock-guardrails: Configure content filters and safety policies
  • bedrock-knowledge-bases: RAG with vector stores and retrieval
  • bedrock-prompts: Manage and version prompts
  • anthropic-expert: Claude API patterns and best practices
  • claude-cost-optimization: Cost tracking and optimization for Claude
  • boto3-eks: For containerized Bedrock applications


Sources
