The self-described "intelligent global network" provided by Internet
security and caching service CloudFlare took a coffee break this
morning, knocking a number of the Web's top sites offline for the
better part of an hour. Or, as CloudFlare CEO Matthew Prince put
it in today's post-mortem blog post, "CloudFlare effectively dropped
off the Internet." And he means that literally: the outage also took
CloudFlare's own site offline, along with sites like 4chan, Wikileaks,
and the roughly 785,000 other websites that use CloudFlare's services.

So, what happened?

First up, it's important to understand what CloudFlare actually does.
It serves as an intermediary of sorts for visitors to sites that use
the service, caching static pages to speed up load times and using its
anycast DNS capabilities to filter out malicious traffic, such as
distributed denial-of-service (DDoS) attacks, to keep its members'
sites online and unbothered.
As CloudFlare describes it: "The nature of CloudFlare's Anycasted network
is that we inherently increase the surface area to absorb such an
attack. A distributed botnet will have a portion of its denial of
service traffic absorbed by each of our data centers."

According to CloudFlare, the company noticed a DDoS attack against one
of its member sites early this morning. A member of CloudFlare's
operations team pushed a rule to CloudFlare's routers via Flowspec, a
BGP extension for distributing traffic-filtering rules. The rule was
designed to make the routers drop any packets that appeared to be part
of the attack, identified as packets ranging from 99,971 to 99,985
bytes in length.

"Flowspec accepted the rule and relayed it to our edge network.
What should have happened is that no packet should have matched that
rule because no packet was actually that large. What happened instead
is that the routers encountered the rule and then proceeded to consume
all their RAM until they crashed," Prince wrote. "In all cases, we run a
monitoring process that reboots the routers automatically when they
crash. That worked in a few cases. Unfortunately, many of the routers
crashed in such a way that they did not reboot automatically and we
were not able to access the routers' management ports. Even though some
data centers came back online initially, they fell back over again
because all the traffic across our entire network hit them and
overloaded their resources."

CloudFlare's network and operations teams ultimately had to remove the
filter rule from the company's routers and have data center staff
manually reboot the affected routers. For CloudFlare customers covered
by service-level agreements, the company plans to issue credits for
today's hour or so of downtime.
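Why "no packet should have matched that rule" is easy to check on paper. The following is a minimal sketch in Python, assuming IPv4 traffic, whose 16-bit Total Length field caps a packet at 65,535 bytes. The names `PacketLengthRule`, `matches`, and `can_ever_match_ipv4` are hypothetical, chosen for illustration; they are not part of Flowspec or any router vendor's API.

```python
from dataclasses import dataclass

# Largest value the 16-bit IPv4 Total Length field can hold (assumes IPv4).
MAX_IPV4_PACKET_BYTES = 2**16 - 1  # 65,535


@dataclass(frozen=True)
class PacketLengthRule:
    """Hypothetical drop rule: matches packets whose length is in [min_len, max_len]."""
    min_len: int
    max_len: int

    def matches(self, packet_len: int) -> bool:
        # A packet is dropped only if its length falls inside the inclusive range.
        return self.min_len <= packet_len <= self.max_len

    def can_ever_match_ipv4(self) -> bool:
        # The rule can only fire if its range overlaps 0..65,535,
        # the set of lengths an IPv4 packet can actually have.
        return (
            self.min_len <= self.max_len
            and self.min_len <= MAX_IPV4_PACKET_BYTES
            and self.max_len >= 0
        )


# The length range from the attack-mitigation rule described in the article.
rule = PacketLengthRule(min_len=99_971, max_len=99_985)

print(rule.can_ever_match_ipv4())  # False
print(any(rule.matches(n) for n in range(MAX_IPV4_PACKET_BYTES + 1)))  # False
print(rule.matches(99_980))  # True, but no real IPv4 packet is that long
```

The sketch only models the matching logic, which says the rule should have been a harmless no-op. It does not model the router software behavior that, per Prince, made the routers consume all their RAM and crash when they encountered the rule.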