API latency spike
Resolved
Sep 29 at 11:00am HDT
Incident Report: Feature Flag Provider Outage
Summary
At ~17:00 UTC, our feature flag provider experienced an outage that caused elevated errors and slower response times in their API. Our caching layer initially absorbed the impact, but as the outage worsened, cached flag values began to expire and lookups fell back to the degraded provider API, resulting in elevated latency and degraded performance across our platform. Normal service was restored by ~18:40 UTC.
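For readers curious about the mechanism, here is a minimal sketch of how a TTL-based flag cache produces exactly this failure pattern. Everything here (FlagCache, fetch_fn, the 300-second TTL) is a hypothetical illustration, not our actual implementation:

    import time

    class FlagCache:
        """Illustrative TTL-based in-process cache in front of a flag provider API."""

        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self._entries = {}  # flag_key -> (value, fetched_at)

        def get(self, flag_key, fetch_fn):
            entry = self._entries.get(flag_key)
            now = time.monotonic()
            if entry is not None and now - entry[1] < self.ttl:
                # Fresh hit: no provider call, so a provider outage is
                # invisible while entries remain within their TTL.
                return entry[0]
            # Miss or expired entry: a provider call is unavoidable. During
            # the outage these calls were slow or failing, which is what
            # surfaced as platform latency once cached values expired.
            value = fetch_fn(flag_key)
            self._entries[flag_key] = (value, now)
            return value

While entries are fresh, the provider can be down with no visible impact; once TTLs lapse, every expired flag lookup inherits the provider's latency, which matches the timeline below.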
Timeline (UTC)
- 17:00 – Feature flag provider outage begins, causing elevated errors and slower API responses.
- 17:00–18:00 – Our caching systems mitigated the impact by serving cached flag values.
- 18:00–18:30 – Cached values expired; flag lookups fell back to the degraded provider API, slowing platform jobs and causing elevated latency and platform slowness.
- ~18:30 – Provider mitigated their outage and restored normal service.
- 18:40 – Platform throughput automatically recovered; normal service resumed.
Impact
- The US region was most heavily affected.
- EU and AU regions experienced only minor impact.
- Platform latency and job execution times were elevated from ~18:00 UTC until recovery at ~18:40 UTC (roughly 40 minutes).
Next Steps / Mitigation
- We are implementing additional safeguards to further reduce our dependency on the provider; one possible approach is sketched below.
- These mitigations are intended to limit, and ultimately eliminate, the impact of similar provider incidents.
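As an illustration only (not a commitment to a specific design), one common safeguard is a cache that keeps expired entries and serves the last known value when the provider is unreachable. All names here are hypothetical:

    import time

    class StaleTolerantFlagCache:
        """Illustrative cache that degrades to stale values on provider failure."""

        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self._entries = {}  # flag_key -> (value, fetched_at)

        def get(self, flag_key, fetch_fn, default=None):
            entry = self._entries.get(flag_key)
            now = time.monotonic()
            if entry is not None and now - entry[1] < self.ttl:
                return entry[0]  # fresh hit, no provider call
            try:
                # In practice a request timeout would convert provider
                # slowness into an exception handled below.
                value = fetch_fn(flag_key)
            except Exception:
                if entry is not None:
                    return entry[0]  # serve the stale last-known value
                return default  # no history yet: use a safe default
            self._entries[flag_key] = (value, now)
            return value

The trade-off is serving potentially outdated flag values during a provider outage in exchange for keeping platform latency flat; for most flags that is a far better failure mode than blocking on a degraded API.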
Updated
Sep 29 at 09:30am HDT
We’re back! API performance should now be stable. Thanks for bearing with us during the disruption.
Created
Sep 29 at 09:00am HDT
We’re investigating an issue with our API that is causing increased latency and degraded performance for some users. We’re working to fix the problem as quickly as we can. We’ll share another update shortly.