The issue
Today at around 4am GMT, our dashboard and some API services became unresponsive for a group of our customers. The issue initially affected only a subset of customers, but spread to more of them over the following four hours. We traced the issue to an external script related to cloud health status and fixed it by 8:14 GMT.
The issue prevented new deployments from being started from the UI or the command line (redeployment stayed functional).
What happened?
The root of the problem was a faulty script on our side which was trying to retrieve public cloud status information from http://status.cloud66.com/status.json. This endpoint was available on our old status page, but not on the new one. The script was responsible for showing the "Cloud Status" at the top of every dashboard page, as well as some information in the API payloads.
On the 20th of July (7 days ago) we switched our status page to status.io to provide more detailed information about our product incidents and maintenance windows.
The issue with the missing endpoint was not picked up during development or in tests on our test and staging environments because status.io was returning a 404 immediately, which our script handled normally. However, in the early morning hours of Monday, status.io started to block the repeated calls to the wrong endpoint from our production IP addresses, causing the script to time out.
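The difference between a fast 404 and a blocked connection can be illustrated with a short sketch. This is not our production script; the function name `fetch_cloud_status`, the 2-second limit and the injectable `opener` parameter are assumptions made for illustration. The idea is that every failure mode, including a hanging connection, is bounded in time and turned into the same "status unavailable" result.

```python
import json
import urllib.request


def fetch_cloud_status(url, timeout_seconds=2.0, opener=urllib.request.urlopen):
    """Return the parsed status document, or None if it cannot be fetched.

    Hypothetical helper: the name, the default timeout and the `opener`
    parameter (there so the function can be exercised without a network)
    are illustrative assumptions.
    """
    try:
        # The timeout bounds each blocking socket operation, so a blocked
        # or silently dropped connection fails fast instead of hanging.
        with opener(url, timeout=timeout_seconds) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError (for example a 404) and socket
        # timeouts; ValueError covers malformed JSON or undecodable bytes.
        return None
```

With this shape, a caller only has to handle `None` (for example by hiding the status banner), whether the remote side answered with an error or did not answer at all.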
Cloud status is cached aggressively at different stages for different purposes and for different customers, depending on which cloud providers they use. As a result, many customers kept receiving cached values and were not affected until their cache entries expired, at which point they started to experience the issue.
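One way a cache can soften this kind of failure is to keep serving the last known value when a refresh fails. The sketch below is a simplified, single-process illustration and does not describe our actual caching layers; the class name `StaleOnErrorCache`, the TTL value and the injected `clock` are assumptions.

```python
import time


class StaleOnErrorCache:
    """Cache one value; if a refresh fails, keep returning the old value.

    Hypothetical sketch: the names and defaults are illustrative.
    """

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # callable returning a fresh value or None
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None      # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is not None and now < self._expires_at:
            return self._value       # still fresh: no remote call at all
        try:
            fresh = self._fetch()
        except Exception:
            fresh = None             # treat any failure as "no new data"
        if fresh is not None:
            self._value = fresh
        # Success or failure, wait a full TTL before the next attempt so a
        # broken endpoint is not called on every request.
        self._expires_at = now + self._ttl
        return self._value
```

The trade-off is that users may see an outdated status for a while, which is usually preferable to a page that does not load.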
The fix
We have switched the cloud status to a different source and are monitoring its performance.
Prevention
We are now reviewing our dependencies on outside API endpoints and working on measures to protect us from any unpredictable behaviour from those endpoints.
We are sorry for this dashboard/API outage and are working hard to make sure it doesn't happen again. We have reached out to all affected customers, publicly and privately, to make sure everything is in order. If you are experiencing issues with the dashboard, please let us know.
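As an example of the kind of protective measure under consideration, a circuit breaker stops calling an endpoint after repeated failures and retries only after a cool-down. The following is a minimal sketch, not a description of what we have deployed; the class name `CircuitBreaker` and the thresholds are assumptions.

```python
import time


class CircuitBreaker:
    """Skip calls to a failing dependency for a cool-down period.

    Hypothetical sketch: the names and default thresholds are illustrative.
    """

    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, fallback=None):
        now = self._clock()
        if self._opened_at is not None and now - self._opened_at < self._cooldown:
            return fallback          # open: do not touch the dependency
        # Closed, or cool-down elapsed: allow a (trial) call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = now    # open (or re-open) the circuit
            return fallback
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers receive the fallback immediately, so a blocked or slow endpoint cannot tie up every request that depends on it.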