Today we had an incident that prevented some WebRTC clients from connecting to our platform. One of our providers performed emergency maintenance at the datacenter that hosts some of our WebRTC registration servers. During the maintenance, some servers went down, but the service remained operational. When the affected servers were restored, they were redeployed with an older firewall configuration. That configuration has very aggressive policies that block excessive registration attempts and rate-limit packets. Because of those policies, the servers started blocking legitimate traffic, including traffic from WebRTC clients.
We have a monitoring system in place that alerts us when there are too many connection and registration failures. Even with this monitoring, we did not catch the abnormal volume of blocked packets. We will now monitor incoming traffic and raise an alert if the volume of packets on our registration ports falls below a certain limit. We will also raise an alert if we see a sudden spike in dropped packets. Finally, we will make sure the firewall configuration deployed is always the latest version, and that past versions cannot be restored and installed on our production servers.
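As an illustration of the two new alerts, here is a minimal sketch in Python. It is not our production monitoring code: the `TrafficSample` type, the `check_alerts` function, and all threshold values are hypothetical, and in practice the counters would come from the firewall or the host's packet statistics.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TrafficSample:
    """Packet counters for one monitoring interval (hypothetical shape)."""
    received: int  # packets seen on the registration ports
    dropped: int   # packets dropped by the firewall


def check_alerts(history, current, min_received=1000,
                 spike_factor=3.0, min_dropped=50):
    """Return the names of the alerts that should fire for `current`.

    All thresholds are placeholder values chosen for illustration only.
    """
    alerts = []

    # Alert 1: traffic on the registration ports fell below the floor.
    if current.received < min_received:
        alerts.append("low_traffic")

    # Alert 2: dropped packets jumped well above the recent average.
    # `min_dropped` avoids firing on tiny absolute numbers.
    baseline = mean(s.dropped for s in history) if history else 0.0
    if current.dropped >= min_dropped and current.dropped > spike_factor * baseline:
        alerts.append("drop_spike")

    return alerts


# Example: three normal intervals, then one where the firewall
# starts dropping most of the traffic.
history = [TrafficSample(5000, 10), TrafficSample(5100, 12), TrafficSample(4900, 8)]
print(check_alerts(history, TrafficSample(400, 900)))
```

Comparing dropped packets against a recent average rather than a fixed number is one possible design choice; it lets the alert adapt to normal day-to-day variation while still catching a sudden jump like the one in this incident.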