GPT-4o: next-level AI interaction with text, voice & images

GPT-4o is a new AI that can understand and respond to text, audio, and images. It's faster and better at these tasks than previous models. Imagine talking to a computer that can see, hear, and respond naturally - that's the future GPT-4o promises.

GPT-4o: next-level AI interaction with text, voice & images

OpenAI's GPT-4o, a multimodal language model, processes and generates text, audio, and images. It achieves human-like response times for audio input and maintains performance parity with GPT-4 Turbo for English text and code, while excelling in non-English languages. Significantly faster and cheaper, GPT-4o offers superior visual and audio understanding.


Prior to GPT-4o, Voice Mode interactions with ChatGPT averaged latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). This multi-stage process involved separate models for audio transcription, text processing, and audio generation, leading to information loss for GPT-4 (the core intelligence). It lacked the ability to directly interpret tone, identify multiple speakers, or comprehend background noise. Additionally, output limitations prevented laughter, singing, or emotional expression.

GPT-4o represents a paradigm shift. This novel model leverages end-to-end training across text, vision, and audio modalities. All inputs and outputs are processed by a single neural network, eliminating the information loss inherent in previous pipelines. As GPT-4o marks initial foray into multimodal processing, ongoing exploration will further illuminate the model's capabilities and limitations.



GPT-4o delivers benchmark performance and enhanced capabilities

GPT-4o achieves performance on par with GPT-4 Turbo in areas of text, reasoning, and coding intelligence, as measured by established benchmarks. It surpasses previous models in multilingual capabilities, audio processing, and visual understanding.

Benchmark Performance:

  • Reasoning: GPT-4o sets new records on both 0-shot COT MMLU (general knowledge) and traditional 5-shot no-CoT MMLU benchmarks.
  • Audio: GPT-4o significantly outperforms Whisper-v3 in speech recognition across all languages, particularly those with fewer resources. It also surpasses Whisper-v3 on speech translation tasks.
  • Vision: GPT-4o achieves state-of-the-art performance on various visual perception benchmarks, all conducted in a 0-shot setting.
  • Language Tokenization: The new tokenizer demonstrates efficient compression capabilities across diverse language families.


Safety and Limitations:

GPT-4o incorporates safety measures by design across all modalities. These include filtering training data and refining model behavior through post-training techniques. Additionally, new safety systems provide guardrails for voice outputs.

Comprehensive evaluations based on OpenAI's Preparedness Framework and voluntary commitments were conducted. These assessments, encompassing cybersecurity, CBRN, persuasion, and model autonomy, reveal that GPT-4o presents a medium risk level in each category.

External red teaming with experts from various domains further identified potential risks associated with the new modalities. These insights informed the development of safety interventions to enhance user interaction with GPT-4o. OpenAI remains committed to ongoing risk mitigation as necessary.

The initial release of GPT-4o focuses on text and image inputs with corresponding text outputs. Over time, technical infrastructure, usability considerations, and safety measures will be addressed to enable the release of remaining modalities. For instance, audio outputs will be limited to pre-selected voices at launch and subject to existing safety policies. A forthcoming system card will detail the full range of GPT-4o's modalities and associated considerations.

Testing and development have revealed limitations across all modalities, which will be addressed in future updates.