OpenAI’s Longest Outage Blamed on a New Telemetry Tool Gone Wrong
On Wednesday, OpenAI experienced one of its longest service interruptions. The disruption hit multiple products, including ChatGPT, the video tool Sora, and the developer API. It began around 3 p.m. Pacific Time and lasted about three hours before service was fully restored.
The cause wasn’t a hack or a new product launch. Instead, OpenAI blamed a new telemetry tool it rolled out that day. The tool was meant to gather performance data from the company’s Kubernetes clusters. Kubernetes is open-source software for managing containers, which package applications so they run in isolated environments.
The new service overloaded Kubernetes. Because many of OpenAI’s services depend on Kubernetes for DNS resolution (translating domain names into IP addresses), the overload had a widespread impact.
DNS caching, which temporarily stores these translations, made things worse. Because cached entries kept working until they expired, failures surfaced gradually, which delayed the company’s ability to see the full scope of the issue.
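The masking effect of a DNS cache can be illustrated with a small simulation. This is a minimal sketch, not OpenAI's setup: the class `TtlDnsCache`, the function `first_visible_failure`, the hostname `api.internal.example`, the address `10.0.0.7`, and the 30-second TTL are all assumptions chosen for illustration.

```python
class TtlDnsCache:
    """Tiny DNS-style cache: an answer stays valid for ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # name -> (address, expires_at)

    def resolve(self, name, now, upstream):
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            # Served from cache: the upstream resolver is never consulted,
            # so an upstream outage stays invisible to this caller.
            return cached[0]
        address = upstream(name)  # raises LookupError if upstream is down
        self._entries[name] = (address, now + self.ttl_seconds)
        return address


def first_visible_failure(ttl_seconds, outage_at, end, step=1.0):
    """Return the simulated time of the first failed lookup, or None."""
    cache = TtlDnsCache(ttl_seconds)
    now = 0.0

    def upstream(name):
        # Hypothetical resolver that stops answering once the outage begins.
        if now >= outage_at:
            raise LookupError(f"resolver unavailable for {name}")
        return "10.0.0.7"

    while now <= end:
        try:
            cache.resolve("api.internal.example", now, upstream)
        except LookupError:
            return now
        now += step
    return None
```

In this simulation, `first_visible_failure(30, outage_at=10, end=120)` returns `30.0`: the resolver breaks at second 10, but the cached answer hides the failure for another 20 seconds. With a TTL of 0 the failure is visible at second 10, as soon as it happens.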
OpenAI detected the problem just minutes before users began to feel its effects, but a fix was slow to arrive.
OpenAI described the situation as “multiple systems and processes failing at the same time in unexpected ways.” The usual tests didn’t catch the issue, and fixing it took longer than expected because the overload left engineers unable to get into the affected servers.
To avoid a repeat, OpenAI plans to improve how it rolls out updates, add better monitoring, and ensure its engineers can always access the Kubernetes servers, even when those servers are overloaded.
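One common way to make rollouts safer is to deploy in stages and stop at the first unhealthy result. The sketch below shows that general pattern only; it is not OpenAI's process, and the function `staged_rollout` and its `deploy`, `is_healthy`, and `rollback` callbacks are hypothetical names standing in for real deployment tooling.

```python
def staged_rollout(clusters, deploy, is_healthy, rollback):
    """Deploy to clusters one at a time, halting on the first failure.

    clusters:   ordered list of cluster names, least critical first
    deploy:     callable that applies the change to one cluster
    is_healthy: callable that returns True if the cluster looks fine
    rollback:   callable that reverts the change on one cluster

    Returns (succeeded, touched), where touched lists every cluster
    that received the change, in deployment order.
    """
    touched = []
    for cluster in clusters:
        deploy(cluster)
        touched.append(cluster)
        if not is_healthy(cluster):
            # Revert in reverse order so the most recent change goes first,
            # and never touch the clusters that were still waiting.
            for name in reversed(touched):
                rollback(name)
            return False, touched
    return True, touched
```

The value of the pattern depends on the health check: a change that only misbehaves at full scale can still pass an early stage, which is why staged rollouts are usually paired with the kind of monitoring the company says it will add.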
OpenAI apologized, saying, “We’ve fallen short of our own expectations,” and acknowledged the frustration caused to ChatGPT users, developers, and businesses that rely on its services.