However, when services are combined in an architecture, any one component could suffer an outage, resulting in an overall availability that is not equal to that of the component services.
In this example there are three possible failure modes:
Therefore the overall availability of this "system" must be lower than 99.95%. My rationale for thinking this is that if the SLA for both services was:
The service will be available 23 hours out of 24
Both component parts are within their SLA but the total system was unavailable for 2 hours out of 24.
In this architecture there are a large number of failure modes, but principally:
Because Traffic Manager is a circuit breaker, it is capable of detecting an outage in either region and routing traffic to the working region. However, Traffic Manager itself is still a single point of failure, so the total availability of the "system" cannot be higher than 99.99%.
How can the compound availability of the two systems above be calculated and documented for the business, potentially requiring rearchitecting if the business desires a higher service level than the architecture is capable of providing?
If you want to annotate the diagrams, I have built them in Lucid Chart and created a multi-use link, bear in mind that anyone can edit this so you might want to create a copy of the pages to annotate.
Richard Slater asked Mar 29, 2017 at 10:57
Lowest SLA from SPOF, assuming your app is able to cope with the session breaking? Commented Mar 29, 2017 at 11:11
@Tensibai - I don't think it can be. Based upon my first example, if the SLA for both services was "it will be available 23 hours out of 24", then the App Service could be out between 0100 and 0200 and the Database out between 0500 and 0600; both component parts are within their SLA but the total system was unavailable for 2 hours out of 24. Make sense? Commented Mar 29, 2017 at 11:15
Yep, makes sense, but in this case the result should be the product of all, no? Commented Mar 29, 2017 at 11:21
I mean app 99.95 x sql 99.95 should be the overall availability of the group. Commented Mar 29, 2017 at 11:23
Keep in mind also that you can build a system that's more reliable than its components, through retries or failovers or degradation instead of full failure. Commented Dec 18, 2018 at 7:22
After reading Tensibai's excellent answer, I realised I used to be able to calculate this for network analysis purposes. I dug out my copy of High Availability Network Fundamentals by Chris Oggerino and had a crack at working this out from, not quite, first principles.
Taking my serial example directly out of Tensibai's answer is simply a case of multiplying the probability of each component being available by the other:
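As a minimal sketch of the serial case, assuming the two failures are independent and using the 99.95% figures from the question (the function and variable names are illustrative, not from the original post):

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes component failures are independent of each other.
    """
    return prod(availabilities)


app_service = 0.9995   # 99.95%, the figure used in the question
sql_database = 0.9995  # 99.95%, the figure used in the question

combined = serial_availability(app_service, sql_database)
print(f"App Service + SQL Database in series: {combined:.6%}")
```

Adding more components to the chain is just a matter of passing more arguments; each one can only lower the result.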
Calculating it in parallel is a little more complicated as we do need to consider what the percentage unavailability will be:
The calculation is done as follows:
0.1% * 0.1% = 0.0001%
100% - 0.0001% = 99.9999%
99.99% * 99.9999% = 99.9899%
99.9899% is close to 99.99%
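The same steps can be sketched in Python. This assumes the two regions fail independently and uses the rounded 0.1% per-region unavailability from the steps above; the helper name is hypothetical:

```python
def parallel_availability(*availabilities: float) -> float:
    """Availability of redundant components where any one being up is enough.

    Assumes independent failures: the group is down only when every
    member is down at the same time.
    """
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down


region = 0.999            # 99.9% per region, i.e. 0.1% unavailability (rounded)
traffic_manager = 0.9999  # 99.99%

both_regions = parallel_availability(region, region)
system = traffic_manager * both_regions  # Traffic Manager is in series with the pair

print(f"At least one region up: {both_regions:.4%}")
print(f"Whole system:           {system:.4%}")
```

The parallel pair is far more available than either region alone, so the result is dominated by the component in series with it.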
I ended up using Excel to perform the calculations; here are the values and the formulas.
answered Mar 30, 2017 at 20:56 Richard Slater
That's it, in a more straightforward way than mine (I felt the need to demonstrate the maths behind it :)) Commented Mar 30, 2017 at 21:14
Agreed, your answer is really good for the maths. Commented Mar 30, 2017 at 21:19
SQL Azure is 99.99% not 99.95% Commented Jun 11, 2019 at 15:17
@JefferyTang it (probably) was at the time the question and answer were written (I don't exactly remember), and the actual value doesn't change the methodology for answering "How to calculate the compound SLA from individual parts SLA", which is the real question. Commented Jun 11, 2019 at 15:29
I'd take that as a math problem with the SLA being the probability of being OK.
In this case we can rely on probability rules to get an overall figure.
For your first case, the probability that App Service (A) and SQL Service (B) are down at the same time is the product of their failure probabilities:
P(A)*P(B) = 0.0005 * 0.0005 = 0.00000025
A first estimate of the probability that at least one of them is down is the sum of their failure probabilities, which counts the case where both are down twice:
P(A)+P(B) = 0.001
When the two events are independent, the formula that takes into account the probability of both being down is:
P(A,B) = P(A) + P(B) - P(A)*P(B) = 0.001 - 0.00000025 = 0.00099975
So the overall SLA would be 1 - 0.00099975 = 0.99900025, which in percent is 99.900025 %
A simpler route is the product of the two availabilities: 0.9995 * 0.9995 = 0.99900025.
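A short check that the two routes agree, as a sketch assuming independent failures (variable names are illustrative):

```python
import math

p_a_down = 0.0005  # App Service failure probability (1 - 0.9995)
p_b_down = 0.0005  # SQL service failure probability (1 - 0.9995)

# Route 1: probability that at least one is down, by inclusion-exclusion.
p_either_down = p_a_down + p_b_down - p_a_down * p_b_down
availability_by_union = 1 - p_either_down

# Route 2: probability that both are up, as a product of availabilities.
availability_by_product = (1 - p_a_down) * (1 - p_b_down)

print(f"inclusion-exclusion: {availability_by_union:.8f}")
print(f"product:             {availability_by_product:.8f}")
print("agree:", math.isclose(availability_by_union, availability_by_product))
```

The two are algebraically identical, since (1 - a)(1 - b) = 1 - a - b + ab, which is why the product of availabilities is the usual shortcut.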
Applied to your 1h/24h outage (4.1666% of a day) this gives (decimals are abbreviated):
0.0416 + 0.0416 - (0.0416 * 0.0416) = 0.081597222
So the probability of being OK is 1 - 0.0816 = 0.9184, in percent: 91.84%
In expected hours of downtime per day: 24 * 0.0816 ≈ 1.96 h
This is less than the worst case of 2 hours because there's a chance both are down at the same time.
Keeping that in mind, you may notice the availability for each is 95.83% and 0.958333333 * 0.958333333 = 0.918402778, which is our 91.84% from above (sorry for the full decimals here, but they are needed for the demonstration)
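The 23-hours-out-of-24 example can be reproduced without abbreviating the decimals. This is a sketch assuming each service's one-hour outage falls independently and uniformly within the day:

```python
HOURS_PER_DAY = 24

p_down = 1 / HOURS_PER_DAY  # each service may be down 1 hour out of 24

# Probability that at least one of the two services is down at a given moment.
p_system_down = p_down + p_down - p_down * p_down
availability = 1 - p_system_down

expected_downtime_hours = HOURS_PER_DAY * p_system_down
worst_case_hours = 2  # the two outages do not overlap at all

print(f"availability:      {availability:.4%}")
print(f"expected downtime: {expected_downtime_hours:.2f} h per day")
print(f"worst case:        {worst_case_hours} h per day")
```

The expected figure sits just under the worst case because overlapping outages only cost the system once.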
Now for your second case, we'll start again from our compound probability for each region (sorry, I dismissed the change for SQL to keep it reasonable), assuming there's no independent failure probability for the region itself and that each region is isolated, so a DB failure takes only its own region down.
We have the Traffic Manager OK probability P(T) = 0.9999 and each app+DB couple with an OK probability P(G) = 0.99900025 from the first case above.
The number of regions matters, because we apply the product of the failure probabilities to get the probability that both regions are down at the same time:
0.00099975 * 0.00099975 = 0.0000009995000625, which means the availability of at least one region is 99.99990005 %
Now that we have the overall availability of the regions, its product with the Traffic Manager availability gives us the overall availability of the system:
0.9999 * 0.9999990004999375 = 0.99989900059988750625
The overall availability is 99.989900 %
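To document a figure like this for the business, it can help to express it as downtime per year. The following is a sketch only: it assumes independent failures, the SLA values quoted in this thread, and a 365-day year; none of the names come from Azure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year

app_service = 0.9995      # SLA figures as quoted in this thread
sql_database = 0.9995
traffic_manager = 0.9999

# Each region needs both its App Service and its database (series).
region = app_service * sql_database

# Two regions behind Traffic Manager: down only if both regions are down.
any_region_up = 1 - (1 - region) ** 2

# Traffic Manager is a single point of failure in series with the pair.
system = traffic_manager * any_region_up

downtime_minutes = (1 - system) * MINUTES_PER_YEAR

print(f"single region:      {region:.6%}")
print(f"at least one up:    {any_region_up:.6%}")
print(f"whole system:       {system:.6%}")
print(f"allowed downtime:   {downtime_minutes:.1f} minutes per year")
```

If the business wants a service level above what this yields, the Traffic Manager term is the one to look at, since the redundant regions contribute very little of the remaining unavailability.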
Another explanation is available in Azure's docs (link courtesy of Raj Rao)