PHP UK Conference 2013

Designing high-availability systems

David Gardner

Transcript

Excerpt from the automatic transcript of the video generated by YouTube.

Okay, so my name is Dave. For the benefit of people next door, this is me, and I work at Hailo, which is the taxi app, for anyone who hasn't heard of it. My talk today is entitled "Planning to Fail". Being that this is the first talk I've ever done at a PHP conference, there seems to be a degree of tempting fate in the title, and I did um and ah about this title. I considered the kind of less aggressive "Planning for Failure", but then that to me seemed like more of a monitoring talk, you know, reacting to failure, whereas I'm really going after failure here. I'm actually planning to fail. I'm going to break stuff here, and the aim is that by breaking stuff we can end up with a reliable system, a system that works at scale and doesn't fail. So that's what my talk is about.

But before we get into the details, I do want to rewind to the beginning, to where I started out in the PHP world: the wonder of inserting a PHP open tag into an HTML table in some web page and then, you know, getting some stuff from the database.

That's the way it used to be done, and my website kind of still does that. It's a little bit more advanced than that, and it runs WordPress; you know, it's arguably over-engineered. It runs on a single VPS. It's a really simple PHP/MySQL web app, and my website's quite reliable. I don't actually have any reliability figures. I don't monitor it; if it went down I probably wouldn't notice. But the Google Analytics traffic seems reasonably stable, so I'll assume that my website is pretty reliable.

Part of the reason it's reliable is that it's a very simple website. It doesn't have to do a lot; it could arguably be even simpler, it could probably be static HTML. It doesn't have much growth: you can see the traffic figures are pretty stable. I haven't been Slashdotted recently, there are no enormous spikes, there's nothing out of the ordinary. And there's only one person who ever works on it, and that's me, and I don't work on it very often. So overall that leads to a nice stable system, and stable systems are normally reliable.

So, Hailo. Hailo, for anyone who doesn't know, is a taxi app: you press a button on your smartphone and we'll dispatch a black cab

driver in London to come and get you. The company has been around for a while now, just over a year, and it's a company that's experiencing quite large growth. We've just completed our second round of financing and we've raised 50 million dollars so far, so that gives you some idea as to the scale of the company's ambitions. And that does translate to the tech level as well. It's a reasonably high-volume website. It's not enormous, it's not, you know, like Facebook or anything, but the app gets a reasonable amount of traffic, and it's quite unreliable traffic. Special events like New Year's Eve can catch us off guard, and we haven't had that many New Year's Eves since we launched, so we're not really sure what to expect. So there's that kind of degree of the unknown.

The functionality seems quite simple on the surface: you press a button, a taxi comes and gets you. But there is quite a lot of complexity there. Whenever I try and explain what I do to my dad, he's like, "So it's just a taxi app? How many people work there? What do you all do?" And I'll try and explain what we do and how we build it. But there is quite a lot of functionality there, and there's a lot of engineers working on that functionality.

So this is just some stats about Hailo, where we're at and where we've been. Hailo launched in London (it's a London company) in November 2011, and we're now in five cities in three countries on two continents, in the US and Europe. A Hailo hail is accepted around the world every five seconds, so we've got quite a lot of traction already in quite a short amount of time, and the growth is roughly thirty percent month on month.

So we're getting bigger and bigger, and what does that mean? What does that mean for how Hailo operates and for Hailo's reliability?

This is a quote I found in one of the papers from the NoSQL Summer project. I don't know if anyone is familiar with that; it's quite an interesting website, and it was all about getting together and discussing distributed systems. They've got a list of a load of papers, academic-type papers, and this is from one about harvest and yield. It brings together two ideas. One of them is from Brooks, and it's saying that the complexity of a software project grows as the square of the number of engineers. Then it brings together a second idea, which is saying that most failures in complex systems are going to result from inter-component interactions, so the kind of, you know, how component A talks to component B, rather than bugs within one component. So the conclusion this paper makes is that less machinery is quadratically better. Basically, the more stuff you've got, that's not just going to have a linear impact; it's actually going to have a quadratic impact on how your system is likely to fail.

So, Hailo: well, we have a bunch of services, so we have, you know, ten components at the moment

and we're adding more. Our machines run on AWS in three geographic regions, in nine availability zones. Lots and lots of instances: I think, including all of our staging and test ones, we've got a couple of hundred instances. And we've got ten engineers building the server side of it, so that's ignoring all the app stuff; this is just the backend stuff. And in fact we're adding people, so if anyone's interested you can come and grab me later.

So what this means is that our overall reliability is in danger. This extra complexity, the fact that we've got all these engineers, all these moving parts, all this machinery, means that we're going to have to do something to protect the reliability of our overall system. And the coping strategy, one of the coping strategies that we've come up with, is to embrace failure and really take failure on. By doing something a lot you generally get better at it, so by failing a lot we can get better at dealing with failure.
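The idea of failing on purpose can be illustrated with a small fault-injection simulation. This is a minimal sketch in Python and is not taken from the talk: it is not Hailo's or Netflix's tooling. The function names `kill_random_instance` and `is_available`, the component names, and the model of a system as a mapping from component name to number of running instances are all assumptions made for the example.

```python
import random


def kill_random_instance(system, rng):
    """Return a copy of `system` with one randomly chosen running instance removed.

    `system` maps a component name to its number of running instances
    (a hypothetical model chosen for this sketch).
    """
    # One list entry per running instance, so every instance is equally likely to be picked.
    running = [name for name, count in system.items() for _ in range(count)]
    if not running:
        return dict(system)
    victim = rng.choice(running)
    survivor = dict(system)
    survivor[victim] -= 1
    return survivor


def is_available(system):
    """The system keeps working only if every component still has at least one instance."""
    return all(count > 0 for count in system.values())


# Hypothetical layouts: one instance of everything versus three of everything.
single_box = {"web": 1, "database": 1}
redundant = {"web": 3, "database": 3}

rng = random.Random(2013)
print(is_available(kill_random_instance(single_box, rng)))  # False: any kill removes a whole component
print(is_available(kill_random_instance(redundant, rng)))   # True: two instances of the victim remain
```

Whichever instance is picked, the single-instance layout loses a whole component, while the redundant layout keeps serving. Running this kind of check regularly is what turns failure from a surprise into a routine event.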

So this is the Chaos Monkey kind of approach, and I'm sure a lot of people have heard of that. Netflix popularized this idea: they've got a sort of daemon that runs during office hours and randomly kills boxes within their live infrastructure. That's quite a brave approach, and we're not quite at that level yet, but it's the same sort of concept.

So let's just go back and see how this might impact my website, my reliable website that I run from my single VPS that runs PHP and MySQL. It is a reliable system, but it's not resilient. My website is not resilient. If I applied the Chaos Monkey approach to my website, it's not going to do me many favors: any one component I choose to randomly kill is going to make everything stop working. So what we need to do is think about this ahead of time. For my website this doesn't matter. My website is not a bus timetable app; this is not life-and-death stuff, so it doesn't matter, no one's going to die. But we do need to choose a stack that's going to work. We need to pick things that are going to react well to this pattern. And so what I'd like to do now is talk a little bit about the Hailo stack and how we choose technologies

that we use at Hailo, and the sorts of technologies that we're using for different things. So this is how Hailo started out, not entirely, but it's kind of close. On the PHP side certainly (there is a lot of Java as well) we had a single box, a bit like my website. So obviously we're already getting the hint that this probably isn't all that resilient, which means that it may be reliable, but it's not going to go well with failure. So we're splitting it up: we're taking the functionality that there is

[ ... ]

Note: the remaining 4,140 words of the full transcript have been omitted to comply with YouTube's "fair use" rules.