Multiple Plivo services (SMS API, Inbound MMS API, Console, Number Purchase, and Lookup API) experienced full or partial interruptions of varying duration. Voice API and Zentrunk services remained unaffected.
From: 2023-02-02 08:40 AM UTC
To: 2023-02-02 05:15 PM UTC
Service | Severity | Customers affected | Duration | From | To |
---|---|---|---|---|---|
Console | Not accessible | Prepaid customers | 8 hours | 2023-02-02 08:40 AM UTC | 2023-02-02 04:35 PM UTC |
SMS API | Non-functional | Prepaid customers | 2 hours | 2023-02-02 08:40 AM UTC | 2023-02-02 10:45 AM UTC |
SMS API | Non-functional | Prepaid customers | 2.75 hours | 2023-02-02 01:45 PM UTC | 2023-02-02 04:30 PM UTC |
Inbound MMS | Non-functional | Prepaid customers | 8.5 hours | 2023-02-02 08:40 AM UTC | 2023-02-02 05:15 PM UTC |
Lookup API | Non-functional | All customers | 8.5 hours | 2023-02-02 08:40 AM UTC | 2023-02-02 05:15 PM UTC |
Number Purchase | Non-functional | Prepaid customers | 8.5 hours | 2023-02-02 08:40 AM UTC | 2023-02-02 05:15 PM UTC |
At 2023-02-02 08:40 AM UTC, Plivo deployed a release to its billing platform. Immediately after this deployment, the billing platform service responsible for deducting account balances from prepaid customers’ accounts became unresponsive.
As a result, several Plivo services (SMS API, Inbound MMS, Console, Phone Number APIs, and Lookup API) could no longer communicate with the billing platform. Without confirmation that the account balance had been successfully deducted from prepaid customers’ accounts, these services started experiencing failures.
Other Plivo services (Voice API and Zentrunk) were not affected because they had a circuit breaker in place to contain this type of failure. Unfortunately, the affected Plivo services did not have this failsafe.
The follow-up root cause analysis identified very high CPU utilization (99%) on the Redis cluster used by the billing platform, even though the overall volume of Plivo API traffic had not increased substantially.
Our team immediately rolled back the last deployment, to no effect: the Redis cluster still showed abnormally high CPU utilization. Next, the team increased the number of nodes in the Redis cluster, but the issue persisted. The team then created a new Redis cluster and switched the billing platform to it, again without success; CPU utilization remained abnormally high on the new cluster as well.
While the underlying issue was being investigated, temporary measures were taken in parallel to restore services wherever possible. The SMS API, Plivo Console, and Inbound MMS services were restored by skipping the account-balance deduction confirmation from the billing platform.
Further investigation revealed that the billing platform service was issuing an excessive number of Redis commands named “COMMAND”. Our team observed that, despite a Redis connection pool being configured in the billing platform, a new connection to the Redis cluster was being created for every Redis command.
Further investigation finally identified that the version of the Golang Redis client (go-redis) in use had several issues initializing new connections against Redis cluster engine v6. Upgrading the go-redis client from v6 to v7, after regression and performance tests, fixed the issue at around 2023-02-02 05:15 PM UTC.
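For reference, this kind of go-redis major-version upgrade is largely an import-path change. Under Go’s semantic import versioning, major versions 2 and above carry the version in the module path (the exact v7 patch release Plivo moved to is not stated in this report):

```go
// Before the fix (go-redis v6):
//   import "github.com/go-redis/redis"
// After the fix (go-redis v7): the module path carries the /v7 suffix,
// so go.mod and every import site change together.
import "github.com/go-redis/redis/v7"
```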
Why did it happen? | What caused it? |
---|---|
Why did the billing platform service become unresponsive? | High CPU utilization (99%) on the Redis cluster. |
Why was the CPU utilization so high? | High active connections count from the billing platform service to the Redis cluster. |
Why was there a high active connections count? | The billing platform service triggered too many Redis commands named “COMMAND”. |
Why were too many Redis commands (“COMMAND”) being fired? | Incompatibility between the Golang Redis Client and the Redis cluster (v6). Each new connection created from the Redis Client would execute “COMMAND” but fail. |
Why did this incompatibility exist? | The Redis cluster was upgraded on Jan 15, 2023, from v5 to v6. |
Why did we not observe this issue right after the Redis cluster upgrade? | The billing platform’s connections to the Redis cluster were not re-initialized during or after the Redis cluster upgrade on Jan 15. On Feb 2, when the engineering team deployed a change to the billing platform service, it triggered a complete restart and a hard reset of all connections between the Redis client (go-redis) and the upgraded Redis cluster v6. |