SIP Phone and WebSDK issue
Incident Report for Plivo
Postmortem

Description

Today we had an incident that prevented some WebRTC clients from connecting to our platform. One of our providers performed emergency maintenance in a datacenter that hosts some of our WebRTC registration servers. During the maintenance, some servers went down, but the service remained operational. When the affected servers were restored, they were redeployed with a previous firewall configuration. That configuration has very aggressive policies that block excessive registration attempts and rate limit packets. Because of those policies, the servers started blocking legitimate traffic, including WebRTC clients.

Timeline

  • May 13 18:33 UTC : We receive an alert about servers down in one datacenter.
  • May 13 18:46 UTC : Provider informs us of an emergency network maintenance.
  • May 13 19:16 UTC : Servers are back but still not in production
  • May 13 19:37 UTC : Servers are back in production
  • May 13 21:13 UTC : Support Team escalates to the DevOps Team tickets from some customers reporting connection timeouts with WebRTC clients.
  • May 13 21:27 UTC : DevOps Team is investigating with the provider.
  • May 13 21:53 UTC : Root cause is identified. Firewall is blacklisting and rate limiting source IPs on the servers.
  • May 13 22:15 UTC : Firewall is flushed and restored with correct policies. Service is restored for all blocked IPs.

What we are doing to prevent this problem in the future

We have a monitoring system in place that alerts us when there are too many connection and registration failures. Even with this monitoring, we did not catch the abnormal volume of blocked packets. We will now monitor incoming traffic and raise an alert if the number of packets reaching our registration ports falls below a certain limit. We will also raise an alert if we see a sudden spike in dropped packets. Finally, we will make sure the firewall configuration deployed is always the latest version, and that past versions cannot be restored and installed on our production servers.
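
As an illustration only, the two alerts described above (traffic below a floor, and a sudden spike in dropped packets) could be sketched as follows. This is a minimal sketch and not the production implementation: the class name, the threshold values, and the idea of feeding it one pair of counters per sampling interval are all assumptions. How the counters are collected from the firewall is left out.

```python
from collections import deque
from statistics import mean


class RegistrationTrafficMonitor:
    """Hypothetical per-interval check of packet counters on a registration port."""

    def __init__(self, min_accepted, drop_spike_factor,
                 min_drop_baseline=10.0, window=12):
        self.min_accepted = min_accepted            # assumed floor of packets per interval
        self.drop_spike_factor = drop_spike_factor  # assumed multiplier over recent drops
        self.min_drop_baseline = min_drop_baseline  # avoids alerting on tiny numbers
        self._recent_drops = deque(maxlen=window)   # baseline of "normal" intervals

    def observe(self, accepted, dropped):
        """Return the list of alert names raised by one sampling interval."""
        alerts = []

        # Alert 1: too little traffic is reaching the registration port.
        if accepted < self.min_accepted:
            alerts.append("low_traffic")

        # Alert 2: dropped packets jump well above the recent average.
        spike = False
        if self._recent_drops:
            baseline = max(mean(self._recent_drops), self.min_drop_baseline)
            spike = dropped > baseline * self.drop_spike_factor
            if spike:
                alerts.append("drop_spike")

        # Only normal intervals feed the baseline, so an ongoing spike
        # does not gradually become the new "normal".
        if not spike:
            self._recent_drops.append(dropped)
        return alerts


monitor = RegistrationTrafficMonitor(min_accepted=100, drop_spike_factor=3.0)
print(monitor.observe(accepted=500, dropped=5))
print(monitor.observe(accepted=480, dropped=8))
print(monitor.observe(accepted=40, dropped=400))
```

In this sketch the third interval raises both alerts, which is the pattern an overly aggressive firewall would be expected to produce: fewer packets getting through and many more being dropped. Suitable thresholds would have to be tuned against real traffic.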

Posted May 13, 2017 - 23:51 UTC

Resolved
The issue has been resolved. The WebSocket service used to register WebRTC clients was unresponsive for a few minutes.
Posted May 13, 2017 - 22:20 UTC
Investigating
We are experiencing an issue with our SIP Registrar service impacting SIP Phones and WebSDK registration and calls.
Posted May 13, 2017 - 22:12 UTC