This is how you deal with route leaks
This, we must say, is a unique story so far.
Here's the beginning: for approximately an hour, starting at 19:28 UTC on April 1, 2020, the largest Russian ISP, Rostelecom (AS12389), was announcing prefixes belonging to prominent internet players: Akamai, Cloudflare, Hetzner, DigitalOcean, Amazon AWS, and other famous names.
Before the issue was resolved, paths between the largest cloud networks were somewhat disrupted: the Internet blinked. The route leak propagated widely through Rascom (AS20764), then Cogent (AS174), and within a couple of minutes through Level3 (AS3356) to the rest of the world. The issue quickly became bad enough that it saturated the route decision-making process for a few Tier-1 ISPs.
It looked like this:
With that:
This leak affected 8870 network prefixes belonging to almost 200 autonomous systems, with a lot of invalid announcements that weren't discarded by the networks accepting them. Filters would not have prevented the incident entirely, but the route leak would have spread less widely had they been in place. Take a look at RIPE BGPlay if you want to observe the dynamics of what happened: https://stat.ripe.net/widget/bgplay#w.resource=2.17.123.0/24
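To make the point about filters concrete, here is a minimal sketch of an ingress prefix filter. It is not how any of the networks named above actually filter; the allow-list, the `accept_announcement` helper, and the /24 length limit are all assumptions for illustration, and the AS numbers and prefixes come from the ranges reserved for documentation.

```python
import ipaddress

# Hypothetical allow-list: the prefixes each neighbor AS is expected to
# announce (in practice this could be built from IRR or RPKI data).
# ASNs and prefixes are documentation-only values (RFC 5398, RFC 5737).
ALLOWED = {
    64496: [ipaddress.ip_network("192.0.2.0/24")],
    64497: [ipaddress.ip_network("198.51.100.0/24")],
}


def accept_announcement(neighbor_as, prefix, max_len=24):
    """Return True if the neighbor is allowed to announce this prefix."""
    net = ipaddress.ip_network(prefix)
    # Reject overly specific routes (assumed IPv4 limit of /24).
    if net.prefixlen > max_len:
        return False
    # Accept only prefixes that fall inside an entry registered for
    # this neighbor; unknown neighbors have an empty list.
    return any(
        net.version == allowed.version and net.subnet_of(allowed)
        for allowed in ALLOWED.get(neighbor_as, [])
    )


print(accept_announcement(64496, "192.0.2.0/24"))     # registered prefix
print(accept_announcement(64496, "198.51.100.0/24"))  # someone else's prefix
```

A neighbor that suddenly announces prefixes outside its registered set, as in a leak, would have those announcements dropped at the first network applying such a check instead of being passed further upstream.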
As we wrote yesterday, all network engineers should be fully aware of what they are doing in order to reduce the chance of such a critical mistake. Rostelecom's mistake illustrates how fragile IETF-standardized BGP routing is, especially during such stressful times of traffic growth.
However, what makes this case very different is that Rostelecom got a warning from Qrator.Radar's real-time feed and reached out for help with troubleshooting the incident.
BGP mistakes are simple to make, and during the coronavirus crisis it's all too easy to let an error slip through. However, with the monitoring data provided, the incident came to an end rather quickly, and proper routing was restored.
We strongly encourage other ISPs to start monitoring their BGP announcements to prevent incidents of this scale. And, of course, RPKI Origin Validation is something everyone should not just think about, but implement.
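For readers unfamiliar with what origin validation checks, the following is a rough sketch of the decision described in RFC 6811: a route is valid if a covering ROA matches its origin AS and prefix length, invalid if it is covered but nothing matches, and not-found if no ROA covers it. The `Roa` type, the `validate_origin` function, and the sample ROA are hypothetical; real deployments get validated ROA data from an RPKI validator rather than a hard-coded list.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    """A simplified ROA: prefix, maximum allowed length, authorized origin."""
    prefix: ipaddress.IPv4Network
    max_length: int
    asn: int


# Hypothetical ROA set using documentation prefixes and ASNs.
ROAS = [Roa(ipaddress.ip_network("192.0.2.0/24"), 24, 64496)]


def validate_origin(prefix, origin_as, roas=ROAS):
    """Classify a route as 'valid', 'invalid', or 'not-found'."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        # A ROA covers the route if the route's prefix lies inside it.
        if net.version != roa.prefix.version or not net.subnet_of(roa.prefix):
            continue
        covered = True
        # A match needs the right origin AS and a length within maxLength.
        if roa.asn == origin_as and net.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


print(validate_origin("192.0.2.0/24", 64496))    # authorized origin
print(validate_origin("192.0.2.0/24", 64511))    # wrong origin AS
print(validate_origin("203.0.113.0/24", 64496))  # no covering ROA
```

Routers that drop routes classified as invalid stop passing along announcements whose origin or prefix length contradicts the published ROAs, which is why wider deployment limits how far such incidents spread.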