Breaking: OpenAI Releases o1-preview, Its First Large Model with Reasoning Capabilities (White Paper, Full Text) 2024

Introducing OpenAI o1-preview

A new series of reasoning models for solving hard problems.

Available starting 9.12

We've developed a new series of AI models designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math.

Today, we are releasing the first of this series in ChatGPT and our API. This is a preview and we expect regular updates and improvements. Alongside this release, we’re also including evaluations for the next update, currently in development.

1. How it works

We trained these models to spend more time thinking through problems before they respond, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes. 

In our tests, the next model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions. You can read more about this in our technical research post.

As an early model, it doesn't yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases GPT-4o will be more capable in the near term.

But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability. Given this, we are resetting the counter back to 1 and naming this series OpenAI o1.

2. Safety

As part of developing these new models, we have come up with a new safety training approach that harnesses their reasoning capabilities to make them adhere to safety and alignment guidelines. By being able to reason about our safety rules in context, it can apply them more effectively. 

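OpenAI has not published the training procedure itself. As a rough inference-time analogue of "reasoning about safety rules in context," one can place policy text directly in the prompt and ask the model to check the request against it before answering; everything in the sketch below (the policy text and the wrapper) is illustrative, not OpenAI's method.

```python
# Rough inference-time analogue (not OpenAI's training approach, which is
# not public): include policy text in the prompt and ask the model to
# reason about it before answering. The policy text here is illustrative.
SAFETY_POLICY = """\
1. Refuse requests for instructions that enable physical harm.
2. Decline to reveal private personal data.
3. Otherwise, answer helpfully and completely.
"""

def build_policy_prompt(user_request: str) -> str:
    """Wrap a user request with policy text the model can reason over."""
    return (
        "Safety policy:\n" + SAFETY_POLICY +
        "\nFirst decide which policy rules, if any, apply to the request "
        "below, then either refuse with a brief explanation or answer.\n\n"
        f"Request: {user_request}"
    )

print(build_policy_prompt("How do I pick a lock?"))
```
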
One way we measure safety is by testing how well our model continues to follow its safety rules if a user tries to bypass them (known as "jailbreaking"). On one of our hardest jailbreaking tests, GPT-4o scored 22 (on a scale of 0-100) while our o1-preview model scored 84. You can read more about this in the system card and our research post.

To match the new capabilities of these models, we’ve bolstered our safety work, internal governance, and federal government collaboration. This includes rigorous testing and evaluations using our Preparedness Framework, best-in-class red teaming, and board-level review processes, including by our Safety & Security Committee.

To advance our commitment to AI safety, we recently formalized agreements with the U.S. and U.K. AI Safety Institutes. We've begun operationalizing these agreements, including granting the institutes early access to a research version of this model. This was an important first step in our partnership, helping to establish a process for research, evaluation, and testing of future models prior to and following their public release.

3. Whom it’s for

These enhanced reasoning capabilities may be particularly useful if you’re tackling complex problems in science, coding, math, and similar fields. For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows.

4. OpenAI o1-mini

The o1 series excels at accurately generating and debugging complex code. To offer a more efficient solution for developers, we’re also releasing OpenAI o1-mini, a faster, cheaper reasoning model that is particularly effective at coding. As a smaller model, o1-mini is 80% cheaper than o1-preview, making it a powerful, cost-effective model for applications that require reasoning but not broad world knowledge. 

5. How to use OpenAI o1

ChatGPT Plus and Team users will be able to access o1 models in ChatGPT starting today. Both o1-preview and o1-mini can be selected manually in the model picker, and at launch, weekly rate limits will be 30 messages for o1-preview and 50 for o1-mini. We are working to increase those rates and enable ChatGPT to automatically choose the right model for a given prompt.

ChatGPT Enterprise and Edu users will get access to both models beginning next week. 

Developers who qualify for API usage tier 5 can start prototyping with both models in the API today with a rate limit of 20 RPM. We’re working to increase these limits after additional testing. The API for these models currently doesn't include function calling, streaming, support for system messages, and other features. To get started, check out the API documentation.

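For developers in the qualifying tier, a minimal call might look like the sketch below. It uses the official openai Python client (v1-style interface) and assumes the OPENAI_API_KEY environment variable is set; reflecting the launch limitations above, it sends a single user message with no system message, no streaming, and no function calling.

```python
# Minimal sketch: calling o1-preview via the official OpenAI Python client.
# Assumes `pip install openai` and the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Per the launch limitations noted above: a single user message
# (no system message), no streaming, and no function calling.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that checks whether "
                       "a 9x9 Sudoku grid is valid.",
        }
    ],
)

print(response.choices[0].message.content)
```

Requests beyond the 20 RPM limit will be rejected with rate-limit errors, so prototypes that batch work should expect to back off and retry.
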
We are also planning to bring o1-mini access to all ChatGPT Free users. 

6. What’s next

This is an early preview of these reasoning models in ChatGPT and the API. In addition to model updates, we expect to add browsing, file and image uploading, and other features to make them more useful to everyone. 

We also plan to continue developing and releasing models in our GPT series, in addition to the new OpenAI o1 series.

Learning to Reason with LLMs

We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long internal chain of thought before responding to the user.

OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.

Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.

o1 performance smoothly improves with both train-time and test-time compute.

7. Evals

To highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and ML benchmarks. We show that o1 significantly outperforms GPT-4o on the vast majority of these reasoning-heavy tasks. Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.

o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples.

o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories. Seven are shown for illustration.

In many reasoning-heavy benchmarks, o1 rivals the performance of human experts. Recent frontier models do so well on MATH and GSM8K that these benchmarks are no longer effective at differentiating models. We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America. On the 2024 AIME exams, GPT-4o only solved on average 12% (1.8/15) of problems. o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function. A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad.

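As a concrete illustration of the two test-time strategies mentioned above, here is a minimal sketch of consensus (majority vote) and best-of-n re-ranking over sampled answers. The `sample_answer` and `score` functions are hypothetical stand-ins for model sampling and the learned scoring function, not details from the paper.

```python
# Minimal sketch of two test-time strategies, assuming hypothetical
# sample_answer(problem) (draws one model answer as a string) and
# score(problem, answer) (a learned scoring function).
from collections import Counter

def consensus(problem, sample_answer, n=64):
    """Majority vote: sample n answers and return the most common one."""
    answers = [sample_answer(problem) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(problem, sample_answer, score, n=1000):
    """Re-ranking: sample n answers and keep the highest-scoring one."""
    answers = [sample_answer(problem) for _ in range(n)]
    return max(answers, key=lambda a: score(problem, a))
```
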
We also evaluated o1 on GPQA diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology. In order to compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions. We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark. These results do not imply that o1 is more capable than a PhD in all respects — only that the model is more proficient in solving some problems that a PhD would be expected to solve. On several other ML benchmarks, o1 improved over the state-of-the-art. With its vision perception capabilities enabled, o1 scored 78.2% on MMMU, making it the first model to be competitive with human experts. It also outperformed GPT-4o on 54 out of 57 MMLU subcategories.

8. Chain of Thought

Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason. To illustrate this leap forward, we showcase the chain of thought from o1-preview on several difficult problems below.

9. Coding

We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.

For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.

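The paragraph above names three signals used to pick submissions. A minimal sketch of such a selection step might look like the following, where the helper functions and the equal weighting are illustrative assumptions rather than details from the paper.

```python
# Hypothetical sketch of a test-time selection strategy: rank candidate
# submissions by a combination of the three signals named above and keep
# the top k. Helper functions and equal weights are illustrative assumptions.
def select_submissions(candidates, public_test_pass_rate,
                       generated_test_pass_rate, learned_score, k=50):
    """Return the k candidates with the highest combined score."""
    def combined(c):
        return (public_test_pass_rate(c)
                + generated_test_pass_rate(c)
                + learned_score(c))
    return sorted(candidates, key=combined, reverse=True)[:k]
```
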
With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy.

Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model’s coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1—it achieved an Elo rating of 1807, performing better than 93% of competitors.

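To put that gap in perspective, the standard Elo expected-score formula (a general property of rating systems, not something specific to this evaluation) shows what a roughly 1,000-point rating difference implies head-to-head:

```python
# Standard Elo expected-score formula: the expected score of player A
# (rating r_a) against player B (rating r_b).
def elo_expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# The ~1000-point gap reported above implies a near-certain expected win:
print(elo_expected_score(1807, 808))  # ~0.997
```
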
Further fine-tuning on programming competitions improves o1. The improved model ranked in the 49th percentile in the 2024 International Olympiad in Informatics.

10. Human preference evaluation

In addition to exams and academic benchmarks, we also evaluated human preference of o1-preview vs GPT-4o on challenging, open-ended prompts in a broad spectrum of domains. In this evaluation, human trainers were shown anonymized responses to a prompt from o1-preview and GPT-4o, and voted for which response they preferred. o1-preview is preferred to GPT-4o by a large margin in reasoning-heavy categories like data analysis, coding, and math. However, o1-preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases.

People prefer o1-preview in domains that benefit from better reasoning.

11. Safety

Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially improved performance on key jailbreak evaluations and our hardest internal benchmarks for evaluating our model's safety refusal boundaries. We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios.

To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework. We found that chain of thought reasoning contributed to capability improvements across our evaluations. Of particular note, we observed interesting instances of reward hacking. Detailed results from these evaluations can be found in the accompanying System Card.

12. Hiding the Chains of Thought

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.

13. Conclusion

o1 significantly advances the state-of-the-art in AI reasoning. We plan to release improved versions of this model as we continue iterating. We expect these new reasoning capabilities will improve our ability to align models to human values and principles. We believe o1 – and its successors – will unlock many new use cases for AI in science, coding, math, and related fields. We are excited for users and API developers to discover how it can improve their daily work.
