Critical Service Outage
Incident Report for Circle
Postmortem

Incident summary

From 04:05 to 09:07 PDT on Oct 20, 2020, we observed elevated error rates and latency, followed by a total system outage. The event was triggered by a bad Cloudflare software deploy. The incident was reported automatically by our monitoring systems at 04:06 and manually by the support team at 04:09. Our team began investigating at 04:06.

Impact

For ~5 hours, Circle was completely offline and the website was unavailable. This incident affected all users.

Detection

This incident was detected when the automated monitoring system triggered an alert and Aaron was notified.

Conor was then paged to assist in the investigation and to look for possible mitigation steps.

Response

After receiving a page at 04:06, Aaron began investigating almost immediately. He attempted to re-connect all clusters, but this was unsuccessful.
Aaron paged Conor so they could work together on a strategy to mitigate the issues.
Flatbird was paged repeatedly throughout the outage until 07:49, when he began working to migrate the database off of GalaxyGate.
GalaxyGate and Cloudflare were able to resolve the networking issues before the database had been fully migrated.

Recovery

GalaxyGate was unable to contact Cloudflare until 08:03, when their owner came online and began investigating. Shortly after, the issue was identified and Cloudflare began deploying a fix. At 09:03 our server regained its connection to Cloudflare and services began re-connecting.

* We are still awaiting GalaxyGate’s postmortem to give us more insight into the recovery of our network.

Timeline

All times are PDT.

04:05 - Circle monitors detected a possible outage and queued a retry in 30 seconds (this debounce helps us avoid false positives from one-off errors; see the sketch after this timeline)

04:06 - Our monitoring system sent a page to Aaron and me.

04:07 - Aaron began investigating.

04:10 - We updated our status page.

05:10 - GalaxyGate began investigating.

07:56 - Flatbird came online and began working to move Circle’s infrastructure off of GalaxyGate.

09:03 - Cloudflare resolved the networking issues and services began restarting.
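
For context on the 04:05 entry: the monitor does not page on the first failed probe; it queues one retry and only alerts if the re-check also fails. Below is a minimal Python sketch of that debounce, assuming hypothetical check_health() and page_on_call() helpers; this is an illustration, not Circle's actual monitoring code.

```python
import time

RETRY_DELAY_SECONDS = 30  # delay before re-checking a failed probe


def probe_with_retry(check_health, page_on_call):
    """Page on-call only if a failed health check also fails on retry."""
    if check_health():
        return  # service is healthy; nothing to do

    # First failure: queue a retry instead of paging immediately,
    # so a one-off error does not trigger a false positive.
    time.sleep(RETRY_DELAY_SECONDS)

    if not check_health():
        # Second consecutive failure: treat it as a real outage and page.
        page_on_call("Circle monitors detected a possible outage")
```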

Root cause

The immediate trigger was a bad Cloudflare deploy; the underlying root cause is unknown at this time. We are awaiting GalaxyGate’s postmortem for more detail.

Lessons learned

GalaxyGate has had too many issues for us to continue relying on them. Over the next week, we will be working to update our cluster manager to use some new technology. Once the modifications are complete, we will move Circle to a new provider.

Posted Oct 20, 2020 - 09:48 PDT

Resolved
This incident has been resolved. We will publish a postmortem here shortly.

Thank you for your patience!
Posted Oct 20, 2020 - 09:11 PDT
Monitoring
All services are currently restarting.
Posted Oct 20, 2020 - 09:06 PDT
Update
We're working to move Circle off of GalaxyGate's infrastructure. We've hit a couple of unexpected bumps along the way, but the move is still in progress. There is no ETA.
Posted Oct 20, 2020 - 07:53 PDT
Update
GalaxyGate is still working to contact Cloudflare.
Posted Oct 20, 2020 - 06:11 PDT
Identified
GalaxyGate has identified the problem as a "bad CloudFlare deploy". We are working with GalaxyGate to develop an ETA for service restoration.

Thank you for your patience.
Posted Oct 20, 2020 - 04:36 PDT
Update
A developer is beginning to investigate the situation.
Posted Oct 20, 2020 - 04:28 PDT
Update
I've pressed the SOS button to alert Flatbird; I'm continuing to attempt to resolve the issue. -Aaron
Posted Oct 20, 2020 - 04:18 PDT
Investigating
We are currently investigating this issue.
Posted Oct 20, 2020 - 04:10 PDT
This incident affected: Circle and Website.