OpenAI, a leading AI research organization, has announced the release of its latest AI model, GPT-4. The new model is a large multimodal model that can accept image and text inputs and emit text outputs. While it may be less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks. In fact, it has passed a simulated bar exam with a score around the top 10% of test takers.
OpenAI spent six months aligning GPT-4 using lessons from their adversarial testing program, resulting in the best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails. The organization rebuilt its entire deep learning stack over the past two years, and co-designed a supercomputer with Azure to train GPT-4.
GPT-4 is available via ChatGPT and the API (with a waitlist) for text input, while the image input capability is currently being prepared for wider availability through a collaboration with a single partner. OpenAI Evals, the organization’s framework for automated evaluation of AI model performance, is also being open-sourced to allow anyone to report shortcomings in their models to help guide further improvements.
GPT-4 considerably outperforms existing large language models and most state-of-the-art models in traditional machine learning benchmarks. It also exhibits similar capabilities on visual inputs as it does on text-only inputs. OpenAI has been using GPT-4 internally with great impact on functions like support, sales, content moderation, and programming, as well as to assist humans in evaluating AI outputs.
OpenAI has been working on each aspect of the plan outlined in their post about defining the behavior of AIs, including steerability. Developers (and soon ChatGPT users) can now prescribe their AI’s style and task by describing those directions in the “system” message. System messages allow API users to significantly customize their users’ experience within bounds.
OpenAI plans to release further analyses and evaluation numbers as well as thorough investigation of the effect of test-time techniques soon.