Payment service issue impacting the SMS, MMS, and Lookup APIs for prepaid accounts.
Incident Report for Plivo
Postmortem

ROOT CAUSE ANALYSIS

Issue

Multiple Plivo services (SMS API, Inbound MMS API, Console, Number Purchase, and Lookup API) experienced full or partial outages of varying durations. Voice API and Zentrunk services remained unaffected.

Duration

From: 2023-02-02 08:40 AM UTC

To: 2023-02-02 05:15 PM UTC

Services affected

| Service         | Severity       | Customers affected | Duration  | From                    | To                      |
|-----------------|----------------|--------------------|-----------|-------------------------|-------------------------|
| Console         | Not accessible | Prepaid customers  | 8 hours   | 2023-02-02 08:40 AM UTC | 2023-02-02 04:35 PM UTC |
| SMS API         | Non-functional | Prepaid customers  | 2 hours   | 2023-02-02 08:40 AM UTC | 2023-02-02 10:45 AM UTC |
| SMS API         | Non-functional | Prepaid customers  | 2.5 hours | 2023-02-02 01:45 PM UTC | 2023-02-02 04:30 PM UTC |
| Inbound MMS     | Non-functional | Prepaid customers  | 8.5 hours | 2023-02-02 08:40 AM UTC | 2023-02-02 05:15 PM UTC |
| Lookup API      | Non-functional | All customers      | 8.5 hours | 2023-02-02 08:40 AM UTC | 2023-02-02 05:15 PM UTC |
| Number Purchase | Non-functional | Prepaid customers  | 8.5 hours | 2023-02-02 08:40 AM UTC | 2023-02-02 05:15 PM UTC |

Description

At 2023-02-02 08:40 AM UTC, Plivo deployed a release to its billing platform. Immediately after this deployment, the billing platform service responsible for deducting balances from prepaid customers’ accounts became unresponsive.

This incident led to several Plivo services (namely: SMS API, Inbound MMS, Console, Phone Number APIs, and Lookup API) not being able to communicate with the billing platform. Without confirmation that the account balance was successfully deducted from prepaid customers’ accounts, the above services started experiencing failures.

Other Plivo services (Voice API and Zentrunk) were not affected because they had a circuit breaker in place that isolated them from the billing platform failure. Unfortunately, the affected Plivo services did not have this failsafe.

As part of the follow-up root cause analysis, the team identified very high CPU utilization (99%) on the Redis cluster used by the billing platform, even though the overall volume of Plivo API traffic had not increased substantially.

Our team immediately rolled back the deployment, with no effect: the Redis cluster still showed abnormally high CPU utilization. As a next step, the team increased the number of nodes in the Redis cluster, but the issue persisted. The team then created a new Redis cluster and switched the billing platform to it, again without success; CPU utilization remained abnormally high on the new cluster.

While the underlying issue was being investigated, temporary mitigations were applied in parallel to restore service wherever possible. The SMS API, Plivo Console, and Inbound MMS were restored by skipping the account balance deduction confirmation from the billing platform.

Further investigation revealed that the billing platform service was issuing an excessive number of Redis ‘COMMAND’ commands. Our team observed that, despite the billing platform having a Redis connection pool configured, a new connection to the Redis cluster was being created for every Redis command instead of being reused from the pool.

Finally, after further investigation, the team identified that the version of the Golang Redis client (go-redis) in use had several known issues initializing new connections against Redis cluster engine v6. After regression and performance tests, upgrading the Golang Redis client from v6 to v7 resolved the issue at around 2023-02-02 05:15 PM UTC.

Root Cause Analysis

Why did it happen? What caused it?

  • Why did the billing platform service become unresponsive? High CPU utilization (99%) on the Redis cluster.
  • Why was the CPU utilization so high? A high active connection count from the billing platform service to the Redis cluster.
  • Why was the active connection count so high? The billing platform service triggered too many Redis commands named “COMMAND”.
  • Why were too many “COMMAND” commands being fired? Incompatibility between the Golang Redis client and the Redis cluster (v6). Each new connection created by the Redis client would execute “COMMAND” but fail.
  • Why did this incompatibility exist? The Redis cluster was upgraded on Jan 15, 2023, from v5 to v6.
  • Why did we not observe this issue right after the Redis cluster upgrade? The billing platform’s connections to the Redis cluster were not re-initialized during or after the Redis cluster upgrade on Jan 15. On Feb 2, when the engineering team deployed a change to the billing platform service, it triggered a complete restart and a hard reset of all connections between the Redis client (go-redis) and the upgraded Redis cluster v6.

Chronology of events

  • 2023-02-02 08:40 AM UTC: The Plivo billing platform was updated, which led to failures in credit balance operations, causing multiple Plivo services to be affected.
  • 2023-02-02 09:00 AM UTC: The deployment was reverted. However, the billing platform service stayed unresponsive.
  • 2023-02-02 09:00 AM UTC: A temporary fix was deployed to restore SMS API & Inbound MMS services by processing messages without balance deductions.
  • 2023-02-02 03:40 PM UTC: The engineering team finally identified the root cause: the Go Redis client (go-redis) connection pooling implementation was incompatible with the Redis cluster v6.
  • 2023-02-02 04:35 PM UTC: A temporary fix was deployed to make the Plivo Console accessible.
  • 2023-02-02 05:15 PM UTC: Corrective upgrades were tested and deployed in production to resolve the incident.

Corrective and preventive action plan

  • Conduct compatibility checks across the production setup to ensure the system can handle similar instances without impacting services.
  • Conduct performance tests on a production scale, and validate the test metrics, such as latency, functionality, CPU and memory utilization, and throughput, using actual production data.
  • Audit all Plivo services and implement the circuit breaker pattern in every service, including the SMS API, Inbound MMS, Plivo Console, and Lookup API, to minimize downtime of dependent services.
  • Ensure that impact mitigation is integrated into the incident response process and that it is considered at every stage, from incident detection to resolution.

Technical references

Posted Feb 08, 2023 - 15:17 UTC

Resolved
This incident has been resolved.
Posted Feb 02, 2023 - 17:46 UTC
Monitoring
Our engineers have confirmed that the services have been restored and are working normally. We will continue to monitor the situation and provide another update as soon as possible.
Posted Feb 02, 2023 - 17:34 UTC
Update
Our engineers have confirmed the Lookup API is working normally while they continue work on the MMS and SMS APIs. We will provide an update as soon as we have additional information.
Posted Feb 02, 2023 - 17:07 UTC
Update
We continue to implement a solution to the identified issue. We will provide an update as soon as we have additional information.
Posted Feb 02, 2023 - 16:02 UTC
Identified
We have identified the issue and our team is working to implement a solution. We will provide an update as soon as we have additional information.
Posted Feb 02, 2023 - 13:32 UTC
Investigating
Dear Valued Customer,

We are observing an issue impacting the SMS, MMS, and Lookup APIs for prepaid accounts. All other services are operating normally.

We are actively working with our team to get this issue resolved as soon as possible.

Please contact Plivo Support for any questions.
Support portal: https://support.plivo.com/support/home
Posted Feb 02, 2023 - 12:49 UTC
This incident affected: Messaging API (MMS API) and Number Lookup API.