News
Releases, benchmarks, and analysis from across the LLM ecosystem.
Introducing computer use in Gemini 3.5 Flash
Google introduced computer-use capabilities in Gemini 3.5 Flash, enabling the model to interact with computer interfaces as part of its workflow. It matters because this moves Gemini from text and image generation toward agentic task execution, a step toward automating multi-step actions in software.
Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World
FFASR introduced a new leaderboard for benchmarking automatic speech recognition in real-world conditions. It matters because it gives researchers and practitioners a way to compare ASR systems on practical, noisy, and diverse audio rather than only on controlled test sets.
How GPT-5 helped immunologist Derya Unutmaz solve a 3-year-old mystery
GPT-5 Pro helped immunologist Derya Unutmaz solve a three-year-old mystery about T cell behavior. The result matters because it could advance cancer and autoimmune research by providing new insights into immune system regulation.
Build real agentic apps using CUGA: two dozen working examples on a lightweight harness
CUGA introduces a lightweight harness for building agentic applications, with two dozen working examples demonstrating the approach. It matters because the examples suggest a practical way to prototype and ship agent workflows without a heavy framework stack.
Experimenting with the proposed Cross-Origin Storage API in Transformers.js
Transformers.js is experimenting with the proposed Cross-Origin Storage API to enable cross-origin access patterns for browser-based ML workloads. This matters because it could simplify loading and sharing model assets across origins, but the source excerpt provides no implementation details or performance data.
Daybreak: Tools for securing every organization in the world
OpenAI introduced Daybreak tools, including Codex Security and GPT-5.5-Cyber, to help organizations find, validate, and patch vulnerabilities at scale. The new tools are meant to bring AI-assisted security workflows to every organization, with a focus on faster vulnerability discovery and remediation.
Codex-maxxing for long-running work
Jason Liu describes using Codex to preserve context and manage complex projects so work can continue beyond a single prompt. The key point is that Codex is being used for long-running tasks where retaining state and continuity matters more than one-shot answers.
We got local models to triage the OpenClaw repo for FREE!*
Local models were used to triage the OpenClaw repo at no cost, according to the source excerpt. This matters because it suggests offline or on-device models can handle repository triage without relying on paid hosted inference.
Samsung Electronics brings ChatGPT and Codex to employees
Samsung Electronics is deploying ChatGPT Enterprise and Codex to employees worldwide, marking one of OpenAI’s largest enterprise AI rollouts. This gives Samsung broad access to OpenAI’s tools across its global workforce and signals continued expansion of enterprise AI adoption at major hardware companies.
New usage analytics and updated spend controls for enterprises
OpenAI introduced new spend controls and usage analytics for ChatGPT Enterprise to help organizations manage costs as they scale AI usage. The update matters because it gives enterprise admins more visibility and control over spending, which is often a key blocker to broader deployment.
Improving health intelligence in ChatGPT
GPT-5.5 Instant improves ChatGPT’s health and wellness responses with stronger reasoning, better context handling, clearer communication, and physician-informed evaluations. The update matters because it is aimed at making health advice in ChatGPT more reliable and easier to understand, which is especially important for sensitive medical and wellness queries.
Using AI to help physicians diagnose rare genetic diseases affecting children
Researchers used an OpenAI reasoning model to help diagnose rare genetic diseases in children, producing 18 new diagnoses in previously unsolved cases. The result shows how reasoning models can support clinicians on difficult diagnostic workups, especially when standard testing has not found an answer.
Is it agentic enough? Benchmarking open models on your own tooling
A new piece discusses benchmarking open models on a user’s own tooling to judge whether they are “agentic enough.” It matters because agentic capability depends heavily on real workflows and tools, so custom evaluation can reveal gaps that standard benchmarks miss.
Beyond LoRA: Can you beat the most popular fine-tuning technique?
The piece examines whether newer or alternative methods can outperform LoRA, the standard parameter-efficient fine-tuning approach. It matters because LoRA is the baseline for many LLM adaptation workflows, so a competitive alternative could change how models are customized at lower cost and memory use.
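For context on why LoRA is such a cheap baseline, here is a minimal sketch of its trainable-parameter count. LoRA freezes a weight matrix W of shape d_out x d_in and learns a low-rank update B·A, with B of shape d_out x r and A of shape r x d_in. The layer size and rank below are hypothetical values chosen for illustration; they do not come from the piece.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole d_out x d_in weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters trained by LoRA: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    d_out, d_in, rank = 4096, 4096, 8
    full = full_params(d_out, d_in)
    lora = lora_params(d_out, d_in, rank)
    print(f"full fine-tuning: {full} parameters")
    print(f"LoRA (rank {rank}): {lora} parameters")
    print(f"fraction trained: {lora / full:.6f}")
```

Under these assumed sizes, LoRA trains 1/256 of the layer's parameters, which is the cost and memory bar any alternative has to clear.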
A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry
OpenAI and Molecule.one reported that a near-autonomous AI chemist using GPT-5.4 improved a challenging medicinal-chemistry reaction. The result suggests large models can do more than propose molecules, potentially accelerating optimization of real drug-making steps.
Agentic Resource Discovery: Let agents search
The piece introduces “Agentic Resource Discovery,” an approach that lets AI agents search for resources rather than relying on fixed retrieval pipelines. It matters because agent-driven search can make systems more flexible and adaptive, though the excerpt provides no technical specifics or performance numbers.
Introducing LifeSciBench
LifeSciBench is an expert-authored, expert-reviewed benchmark designed to evaluate how AI systems handle real-world life science research tasks and decisions. It matters because it targets practical scientific decision-making rather than toy benchmarks, giving a more realistic test of model utility in life sciences.
Securing the future of AI agents
The piece describes an AI Control Roadmap for securing internal systems by combining traditional safeguards with real-time monitoring. It matters because AI agents can act autonomously inside enterprise environments, so layered controls and continuous oversight are needed to reduce risk.
How an astrophysicist uses Codex to help simulate black holes
Astrophysicist Chi-kwan Chan uses Codex to help build black hole simulations for studying extreme physics and testing Einstein’s theory of general relativity. It shows how coding assistants can speed scientific simulation work in a domain where accurate models are computationally demanding and physically complex.
OpenAI to acquire Ona
OpenAI plans to acquire Ona to add secure, persistent cloud environments to Codex, enabling long-running AI agents across enterprise workflows. The deal matters because persistent environments are a key missing piece for agentic coding and enterprise automation, where models need stable state and execution over long periods.
BBVA puts AI at the core of banking with OpenAI
BBVA scaled ChatGPT Enterprise to 100,000 employees and partnered with OpenAI to put AI at the center of its global banking transformation. This matters because it shows one of the world’s largest banks moving AI from pilots to enterprise-wide deployment at massive employee scale.
Access OpenAI models and Codex through your Oracle cloud commitment
Oracle is offering access to OpenAI models and Codex through Oracle Cloud, allowing customers to use existing cloud commitments to build and deploy AI applications with enterprise security and governance. This matters because it lets enterprises consume OpenAI capabilities without changing procurement, while keeping workloads inside Oracle’s security and governance framework.
Investing in multi-agent AI safety research
Google DeepMind and partners announced a $10 million funding call for research on multi-agent AI safety. The funding targets safety problems that arise when multiple agents interact, a notable focus as agentic systems become more common and harder to control.
Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech
Researchers benchmarked frontier ASR systems on code-switched speech to test how well voice agents handle bilingual customers. The work matters because code-switching is common in real conversations, and ASR errors there can directly degrade support quality and automation reliability.
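The excerpt does not say which metric the benchmark reports. As an illustration only, the sketch below computes word error rate (WER), the standard ASR metric: the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The function name and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical code-switched utterance (Spanish and English in one sentence).
    reference = "quiero pagar my bill today"
    hypothesis = "quiero pagar mi bill today"
    print(word_error_rate(reference, hypothesis))
```

In this made-up example, transcribing the English word "my" as the Spanish "mi" counts as one substitution out of five reference words, which is the kind of language-boundary error a code-switching benchmark is designed to surface.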
Fluid, natural voice translation with Gemini 3.5 Live Translate
Gemini 3.5 Live Translate adds near real-time, natural speech translation to Google AI Studio, Google Translate, and Google Meet. It matters because it extends Gemini’s voice translation into products used for development, consumer translation, and video meetings, making live cross-language conversation more fluid.
How engineers at Nextdoor use Codex to build without limits
Nextdoor engineers are using Codex with GPT-5.5 to investigate hard-to-reproduce issues and build across platforms. The setup is meant to remove engineering bottlenecks so they can spend more time on product outcomes instead of debugging and platform-specific work.
What Codex unlocks for Notion
Notion says it uses Codex to one-shot specs and help build features like AI Voice Input for the web, while multiplying engineering output across small teams. The notable detail is that Codex is being positioned as a practical force multiplier for product and engineering work, not just a coding assistant.
NeuroBait: I fine-tuned a model to spark dopamine for ADHD brain
A model called NeuroBait was fine-tuned to generate content intended to trigger dopamine responses in people with ADHD. The notable detail is that it frames AI personalization around neurodivergent attention patterns rather than general engagement optimization.
Measuring the impact of learning with AI in Sierra Leone and beyond
A randomized controlled trial in Sierra Leone found that Gemini’s Guided Learning feature can increase student engagement and speed up learning. The result suggests that AI tutoring tools may have measurable educational benefits in low-resource settings, though the excerpt does not provide the study’s exact effect sizes or model details beyond Gemini.
The Open Source Community is backing OpenEnv for Agentic RL
OpenEnv is gaining open source community backing as an environment layer for agentic RL. It matters because it suggests growing support for a shared environment standard for training and evaluating agentic reinforcement learning systems.
Sponsors, especially OpenAI Codex: voucher usage for Codex (OpenAI challenge)
A post about an OpenAI Codex challenge covers sponsor support and voucher usage for Codex access. The emphasis on vouchers suggests a sponsored or limited-access setup for participants.
Her · हेर — a detective for your Claude Code sessions
Her (हेर) is a detective tool for Claude Code sessions. It is notable as a session-analysis utility, though the excerpt provides no additional details about its features, scale, or model-specific behavior.
Thousand Token Wood: shipping a multi-agent economy on a 3B model
Thousand Token Wood describes shipping a multi-agent economy built on a 3B model. It matters because it suggests that complex agent interactions and economic behaviors can be deployed on relatively small models, potentially lowering compute and cost barriers.
How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent
NVIDIA’s guide explains how to fine-tune Nemotron 3.5 ASR for a specific language, domain, or accent. This matters because adapting an ASR model to target speech conditions can improve recognition accuracy in settings the base model does not handle well.
How Endava is redesigning software delivery around AI agents
Endava is redesigning its software delivery process around AI agents, using ChatGPT Enterprise and Codex to accelerate development and automate workflows across the enterprise. The move matters because it signals a shift toward an AI-native engineering culture where agents handle more of the routine delivery work and speed up teams.
Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
NVIDIA introduced Task-Seeded Synthetic Q&A Generation as a data-generation method for Nemotron pretraining, using task-specific prompts to produce synthetic question-answer pairs. It matters because synthetic Q&A can expand pretraining data at scale while steering the model toward better task coverage and instruction-following behavior.
Dreaming: Better memory for a more helpful ChatGPT
ChatGPT is introducing a new memory system designed to remember user preferences and keep context fresh and relevant across conversations. This matters because better memory should make the assistant more helpful and less repetitive, but the excerpt does not include details on rollout timing, controls, or the underlying technical changes.
How Wasmer used Codex to build a Node.js runtime for the edge
Wasmer used Codex with GPT-5.5 to build a Node.js runtime for the edge, accelerating development by 10x to 20x and shipping in weeks instead of months. The notable detail is that the speedup came on infrastructure-level work, highlighting how Codex can accelerate low-level runtime development.
Holo3.1: Fast & Local Computer Use Agents
Holo3.1 introduces fast, local computer-use agents designed to operate on-device rather than through cloud-hosted workflows. This matters because local execution can reduce latency, improve privacy, and make interactive agentic automation more practical on consumer hardware.
Codex for every role, tool, and workflow
OpenAI is introducing new Codex plugins, sites, and annotations aimed at analysts, marketers, designers, investors, and other teams. These additions matter because they extend Codex beyond coding into role-specific workflows, making AI useful for more day-to-day knowledge work.
Codex is becoming a productivity tool for everyone
Codex is being positioned as a productivity tool for a broad range of knowledge workers, with the Next Era of Knowledge Work report highlighting its use for AI-powered research, data analysis, workflow automation, and content creation. It matters because the report frames Codex as moving beyond a niche coding assistant into a general-purpose work tool that can speed up multiple white-collar tasks.
How we used Gemini to build Google I/O 2026
Google says Googlers used Gemini to help build Google I/O 2026. The notable detail is that the company is showcasing its own AI tools being used in the production process, though the excerpt provides no specific numbers or model variants.
Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
JetBrains introduced Mellum2, a 12B mixture-of-experts model. It matters because the MoE design suggests JetBrains is targeting better performance per unit of compute for code-focused AI while keeping the parameter count relatively compact.
Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic
The piece argues that enterprise AI adoption needs agent logic, not just larger LLMs, to scale effectively across real workflows. It matters because agentic systems can coordinate tools, state, and multi-step decisions, which is often what production business use cases require beyond text generation.
OpenAI frontier models and Codex are now available on AWS
OpenAI frontier models and Codex are now generally available on AWS, letting enterprises access OpenAI through the AWS environments, controls, and procurement workflows they already use. This matters because it gives customers a simpler path from evaluation to production while keeping AI deployment inside their existing cloud and purchasing setup.
11 demos of Gemini Omni and Gemini 3.5 in action
Google I/O 2026 included 11 demo videos showing Gemini Omni and Gemini 3.5 in action. The demos highlight Google’s latest multimodal models in practical use, giving a concrete look at the capabilities behind the announcement.
How Braintrust turns customer requests into code with Codex
Braintrust engineers are using Codex with GPT-5.5 to turn customer requests into code and to run experiments faster. The notable detail is that the workflow is aimed at speeding up engineering iteration, showing how coding models can move requests from idea to implementation more efficiently.
Strengthening societal resilience with Rosalind Biodefense
OpenAI launched Rosalind Biodefense, expanding trusted access to GPT-Rosalind for vetted developers and U.S. government partners working on biodefense, public health, and pandemic preparedness. It matters because the rollout opens frontier AI to a narrower, controlled set of users for high-stakes biological defense and readiness work.
Catch up on 12 major I/O 2026 moments
Google I/O 2026 highlighted 12 major keynote moments, including updates on Gemini Omni and Gemini 3.5 Flash. The lineup matters because it signals Google's latest push across its Gemini model family, with specific new model names suggesting continued expansion in capability and product coverage.
How Endava builds an agentic organization with Codex
Endava says it uses Codex to build an agentic organization that accelerates software delivery and cuts requirements analysis from weeks to hours. The notable detail is the scale of the workflow change: Codex is being used not just for coding assistance but to compress an early delivery phase that typically bottlenecks projects.