Commit bbbcdae

Realtime docs (openai#1153)
Documentation (both written and code)
1 parent 32997b2 commit bbbcdae

File tree

10 files changed: +557 -0 lines changed

docs/realtime/guide.md

Lines changed: 143 additions & 0 deletions
# Guide

This guide provides an in-depth look at building voice-enabled AI agents using the OpenAI Agents SDK's realtime capabilities.

!!! warning "Beta feature"
    Realtime agents are in beta. Expect some breaking changes as we improve the implementation.
## Overview

Realtime agents support conversational flows, processing audio and text input in real time and responding with realtime audio. They maintain persistent connections with OpenAI's Realtime API, enabling natural voice conversations with low latency and the ability to handle interruptions gracefully.
## Architecture

### Core components

The realtime system consists of several key components:

- **RealtimeAgent**: An agent configured with instructions, tools, and handoffs.
- **RealtimeRunner**: Manages configuration. You can call `runner.run()` to get a session.
- **RealtimeSession**: A single interaction session. You typically create one each time a user starts a conversation, and keep it alive until the conversation is done.
- **RealtimeModel**: The underlying model interface (typically OpenAI's WebSocket implementation).
### Session flow

A typical realtime session follows this flow:

1. **Create your RealtimeAgent(s)** with instructions, tools, and handoffs.
2. **Set up a RealtimeRunner** with the agent and configuration options.
3. **Start the session** using `await runner.run()`, which returns a RealtimeSession.
4. **Send audio or text messages** to the session using `send_audio()` or `send_message()`.
5. **Listen for events** by iterating over the session - events include audio output, transcripts, tool calls, handoffs, and errors.
6. **Handle interruptions** when users speak over the agent, which automatically stops the current audio generation.

The session maintains the conversation history and manages the persistent connection with the realtime model.
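A minimal sketch of this flow, reusing only APIs shown in this guide and the quickstart (the runner `config` is omitted here for brevity, and event payload fields are not accessed because they differ per event type):

```python
import asyncio

from agents.realtime import RealtimeAgent, RealtimeRunner


async def main() -> None:
    agent = RealtimeAgent(
        name="Assistant",
        instructions="You are a helpful voice assistant.",
    )
    runner = RealtimeRunner(starting_agent=agent)

    # runner.run() opens the persistent connection and returns a RealtimeSession
    session = await runner.run()
    async with session:
        # Start the conversation with a text message
        await session.send_message("Hello!")

        # Iterate over the session to receive events as they stream in
        async for event in session:
            if event.type == "audio":
                ...  # play the audio chunk through your audio output
            elif event.type == "audio_interrupted":
                ...  # the user spoke over the agent: stop playback, clear queued audio
            elif event.type == "error":
                break


asyncio.run(main())
```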
## Agent configuration

RealtimeAgent works similarly to the regular Agent class, with some key differences. For full API details, see the [`RealtimeAgent`][agents.realtime.agent.RealtimeAgent] API reference.

Key differences from regular agents:

- Model choice is configured at the session level, not the agent level.
- No structured output support (`output_type` is not supported).
- Voice can be configured per agent but cannot be changed after the first agent speaks.
- All other features like tools, handoffs, and instructions work the same way.
## Session configuration

### Model settings

The session configuration allows you to control the underlying realtime model behavior. You can configure the model name (such as `gpt-4o-realtime-preview`), voice selection (alloy, echo, fable, onyx, nova, shimmer), and supported modalities (text and/or audio). Audio formats can be set for both input and output, with PCM16 being the default.
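For example, these settings go under `model_settings` in the runner's `config` (this mirrors the quickstart's runner setup):

```python
runner = RealtimeRunner(
    starting_agent=agent,
    config={
        "model_settings": {
            "model_name": "gpt-4o-realtime-preview",
            "voice": "alloy",
            "modalities": ["text", "audio"],
        }
    },
)
```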
### Audio configuration

Audio settings control how the session handles voice input and output. You can configure input audio transcription using models like Whisper, set language preferences, and provide transcription prompts to improve accuracy for domain-specific terms. Turn detection settings control when the agent should start and stop responding, with options for voice activity detection thresholds, silence duration, and padding around detected speech.
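Transcription and turn detection live in the same `model_settings` mapping; the values below are taken from the quickstart's complete example:

```python
config = {
    "model_settings": {
        "model_name": "gpt-4o-realtime-preview",
        "input_audio_transcription": {"model": "whisper-1"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 200,
        },
    }
}

runner = RealtimeRunner(starting_agent=agent, config=config)
```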
## Tools and functions

### Adding tools

Just like regular agents, realtime agents support function tools that execute during conversations:

```python
from agents import function_tool
from agents.realtime import RealtimeAgent

@function_tool
def get_weather(city: str) -> str:
    """Get current weather for a city."""
    # Your weather API logic here
    return f"The weather in {city} is sunny, 72°F"

@function_tool
def book_appointment(date: str, time: str, service: str) -> str:
    """Book an appointment."""
    # Your booking logic here
    return f"Appointment booked for {service} on {date} at {time}"

agent = RealtimeAgent(
    name="Assistant",
    instructions="You can help with weather and appointments.",
    tools=[get_weather, book_appointment],
)
```
## Handoffs

### Creating handoffs

Handoffs allow transferring conversations between specialized agents.

```python
from agents.realtime import RealtimeAgent, realtime_handoff

# Specialized agents
billing_agent = RealtimeAgent(
    name="Billing Support",
    instructions="You specialize in billing and payment issues.",
)

technical_agent = RealtimeAgent(
    name="Technical Support",
    instructions="You handle technical troubleshooting.",
)

# Main agent with handoffs
main_agent = RealtimeAgent(
    name="Customer Service",
    instructions="You are the main customer service agent. Hand off to specialists when needed.",
    handoffs=[
        realtime_handoff(billing_agent, tool_description="Transfer to billing support"),
        realtime_handoff(technical_agent, tool_description="Transfer to technical support"),
    ],
)
```
## Event handling

The session streams events that you can listen to by iterating over the session object. Events include audio output chunks, transcription results, tool execution start and end, agent handoffs, and errors. Key events to handle include:

- **audio**: Raw audio data from the agent's response
- **audio_end**: Agent finished speaking
- **audio_interrupted**: User interrupted the agent
- **tool_start/tool_end**: Tool execution lifecycle
- **handoff**: Agent handoff occurred
- **error**: Error occurred during processing

For complete event details, see [`RealtimeSessionEvent`][agents.realtime.events.RealtimeSessionEvent].
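A minimal dispatch loop over these event types might look like the sketch below. Only `event.type` is accessed here; the per-event payload fields vary and are documented in the `RealtimeSessionEvent` reference.

```python
from agents.realtime.session import RealtimeSession


async def handle_events(session: RealtimeSession) -> None:
    async for event in session:
        if event.type == "audio":
            ...  # queue the audio chunk for playback
        elif event.type == "audio_interrupted":
            ...  # stop playback and drop any queued audio
        elif event.type in ("tool_start", "tool_end"):
            print(f"Tool lifecycle event: {event.type}")
        elif event.type == "handoff":
            print("Conversation handed off to another agent")
        elif event.type == "error":
            print("Error event received")
            break
```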
## Guardrails

Only output guardrails are supported for realtime agents. These guardrails are debounced and run periodically (not on every word) to avoid performance issues during real-time generation. The default debounce length is 100 characters, but this is configurable.

When a guardrail is triggered, it generates a `guardrail_tripped` event and can interrupt the agent's current response. The debounce behavior helps balance safety with real-time performance requirements. Unlike text agents, realtime agents do **not** raise an exception when guardrails are tripped.
## Audio processing

Send audio to the session using [`session.send_audio(audio_bytes)`][agents.realtime.session.RealtimeSession.send_audio] or send text using [`session.send_message()`][agents.realtime.session.RealtimeSession.send_message].

For audio output, listen for `audio` events and play the audio data through your preferred audio library. Make sure to listen for `audio_interrupted` events to stop playback immediately and clear any queued audio when the user interrupts the agent.
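For example, a sketch that streams a local PCM16 recording into the session in small chunks (the file path and chunk size are illustrative assumptions):

```python
from pathlib import Path

from agents.realtime.session import RealtimeSession


async def stream_audio_file(session: RealtimeSession, path: str) -> None:
    """Send a local PCM16 audio file to the session in small chunks."""
    pcm_bytes = Path(path).read_bytes()  # raw PCM16 audio, the default input format
    chunk_size = 4800  # arbitrary chunk size chosen for this sketch
    for start in range(0, len(pcm_bytes), chunk_size):
        await session.send_audio(pcm_bytes[start : start + chunk_size])
```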
## Examples

For complete working examples, check out the [examples/realtime directory](https://github.com/openai/openai-agents-python/tree/main/examples/realtime) which includes demos with and without UI components.

docs/realtime/quickstart.md

Lines changed: 175 additions & 0 deletions
# Quickstart

Realtime agents enable voice conversations with your AI agents using OpenAI's Realtime API. This guide walks you through creating your first realtime voice agent.

!!! warning "Beta feature"
    Realtime agents are in beta. Expect some breaking changes as we improve the implementation.
## Prerequisites

- Python 3.9 or higher
- OpenAI API key
- Basic familiarity with the OpenAI Agents SDK

## Installation

If you haven't already, install the OpenAI Agents SDK:

```bash
pip install openai-agents
```
## Creating your first realtime agent

### 1. Import required components

```python
import asyncio
from agents.realtime import RealtimeAgent, RealtimeRunner
```

### 2. Create a realtime agent

```python
agent = RealtimeAgent(
    name="Assistant",
    instructions="You are a helpful voice assistant. Keep your responses conversational and friendly.",
)
```
### 3. Set up the runner

```python
runner = RealtimeRunner(
    starting_agent=agent,
    config={
        "model_settings": {
            "model_name": "gpt-4o-realtime-preview",
            "voice": "alloy",
            "modalities": ["text", "audio"],
        }
    }
)
```
### 4. Start a session

```python
async def main():
    # Start the realtime session
    session = await runner.run()

    async with session:
        # Send a text message to start the conversation
        await session.send_message("Hello! How are you today?")

        # The agent will stream back audio in real-time (not shown in this example)
        # Listen for events from the session
        async for event in session:
            if event.type == "response.audio_transcript.done":
                print(f"Assistant: {event.transcript}")
            elif event.type == "conversation.item.input_audio_transcription.completed":
                print(f"User: {event.transcript}")

# Run the session
asyncio.run(main())
```
## Complete example

Here's a complete working example:

```python
import asyncio
from agents.realtime import RealtimeAgent, RealtimeRunner

async def main():
    # Create the agent
    agent = RealtimeAgent(
        name="Assistant",
        instructions="You are a helpful voice assistant. Keep responses brief and conversational.",
    )

    # Set up the runner with configuration
    runner = RealtimeRunner(
        starting_agent=agent,
        config={
            "model_settings": {
                "model_name": "gpt-4o-realtime-preview",
                "voice": "alloy",
                "modalities": ["text", "audio"],
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 200
                }
            }
        }
    )

    # Start the session
    session = await runner.run()

    async with session:
        print("Session started! The agent will stream audio responses in real-time.")

        # Process events
        async for event in session:
            if event.type == "response.audio_transcript.done":
                print(f"Assistant: {event.transcript}")
            elif event.type == "conversation.item.input_audio_transcription.completed":
                print(f"User: {event.transcript}")
            elif event.type == "error":
                print(f"Error: {event.error}")
                break

if __name__ == "__main__":
    asyncio.run(main())
```
## Configuration options

### Model settings

- `model_name`: Choose from available realtime models (e.g., `gpt-4o-realtime-preview`)
- `voice`: Select voice (`alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`)
- `modalities`: Enable text and/or audio (`["text", "audio"]`)

### Audio settings

- `input_audio_format`: Format for input audio (`pcm16`, `g711_ulaw`, `g711_alaw`)
- `output_audio_format`: Format for output audio
- `input_audio_transcription`: Transcription configuration

### Turn detection

- `type`: Detection method (`server_vad`, `semantic_vad`)
- `threshold`: Voice activity threshold (0.0-1.0)
- `silence_duration_ms`: Silence duration to detect turn end
- `prefix_padding_ms`: Audio padding before speech
## Next steps

- [Learn more about realtime agents](guide.md)
- Check out working examples in the [examples/realtime](https://github.com/openai/openai-agents-python/tree/main/examples/realtime) folder
- Add tools to your agent
- Implement handoffs between agents
- Set up guardrails for safety
## Authentication

Make sure your OpenAI API key is set in your environment:

```bash
export OPENAI_API_KEY="your-api-key-here"
```

Or pass it directly when creating the session:

```python
session = await runner.run(model_config={"api_key": "your-api-key"})
```

docs/ref/realtime/agent.md

Lines changed: 3 additions & 0 deletions
# `RealtimeAgent`

::: agents.realtime.agent.RealtimeAgent

docs/ref/realtime/config.md

Lines changed: 41 additions & 0 deletions
# Realtime Configuration

## Run Configuration

::: agents.realtime.config.RealtimeRunConfig

## Model Settings

::: agents.realtime.config.RealtimeSessionModelSettings

## Audio Configuration

::: agents.realtime.config.RealtimeInputAudioTranscriptionConfig
::: agents.realtime.config.RealtimeTurnDetectionConfig

## Guardrails Settings

::: agents.realtime.config.RealtimeGuardrailsSettings

## Model Configuration

::: agents.realtime.model.RealtimeModelConfig

## Tracing Configuration

::: agents.realtime.config.RealtimeModelTracingConfig

## User Input Types

::: agents.realtime.config.RealtimeUserInput
::: agents.realtime.config.RealtimeUserInputText
::: agents.realtime.config.RealtimeUserInputMessage

## Client Messages

::: agents.realtime.config.RealtimeClientMessage

## Type Aliases

::: agents.realtime.config.RealtimeModelName
::: agents.realtime.config.RealtimeAudioFormat
