Medicine for Black Friday fever

23 November 2016

Black Friday again! Lots of discounts, many tempted customers, many online purchases and, as every year, many online shops that fail. It happens year after year: servers catch Black Friday fever, and some potential customers are left dissatisfied because the e-commerce site failed at the exact moment they wanted their precious smartphone at a discounted price. And they may have waited a whole year for this day of discounts… But why do the sites fail? The reason is obvious: too many people trying to do the same thing simultaneously. In other words, performance issues.

But what is the cure for this Black Friday fever? As in many successful treatments, prevention is better than reaction. I am talking about knowing what to expect and having a strategy prepared. How is that achievable? Through good collaboration between a testing team and the operations team. The operations team is supposed to keep the e-commerce site working in these difficult conditions, but it cannot allocate a huge amount of resources for a single day just to cover every possible combination of user access. It would be too expensive. This is where an independent testing team comes in, more precisely a performance testing team, which can simulate the real behavior of users under different load combinations to find exactly where the weakest points of the system are. By real behavior I mean including simultaneous clicks on a button (which are often a cause of failures).
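To make "simultaneous clicks" concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory `Shop` class standing in for the real e-commerce site, and uses a barrier so that all simulated users press "buy" at the same instant. A real performance test would fire the clicks at the actual site with a load-testing tool; this only illustrates the idea.

```python
import threading


class Shop:
    """Hypothetical stand-in for the shop: a stock counter guarded by a lock."""

    def __init__(self, stock):
        self.stock = stock
        self._lock = threading.Lock()

    def buy(self):
        # Check and decrement under the lock so two buyers never get the same item.
        with self._lock:
            if self.stock > 0:
                self.stock -= 1
                return True
            return False


def simultaneous_clicks(shop, users):
    """Release `users` threads at the same moment and count successful purchases."""
    barrier = threading.Barrier(users)
    results = [False] * users

    def click(index):
        barrier.wait()  # every simulated user waits here, then all click together
        results[index] = shop.buy()

    threads = [threading.Thread(target=click, args=(i,)) for i in range(users)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(results)


shop = Shop(stock=10)
sold = simultaneous_clicks(shop, users=50)
print(f"{sold} items sold, {shop.stock} left in stock")
```

If the lock is removed from `buy`, the check and the decrement can interleave between threads, which is the kind of failure that only shows up when clicks really are simultaneous.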

Your system will work at the maximum speed of its weakest component, whether that is hardware or software. A performance test can identify the weakest components and the root cause of possible errors, so you can improve at lower cost. Adding hardware is not always the best solution; often, detailed software configuration (once you know exactly where the problem is) yields significant performance improvements. So solution number one for preventing Black Friday production failures is to fine-tune the software after a performance test that includes deep diagnostics of the root cause of issues.
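The weakest-component idea can be sketched in a few lines of Python. The component names and capacities below are invented for illustration; in practice they would come from the diagnostics of a performance test.

```python
def bottleneck(capacities):
    """Return (name, capacity) of the component with the lowest capacity."""
    name = min(capacities, key=capacities.get)
    return name, capacities[name]


# Hypothetical capacities in requests per second, for illustration only.
capacities = {
    "web_server": 1200,
    "app_server": 800,
    "database": 350,
    "payment_gateway": 500,
}

name, limit = bottleneck(capacities)
print(f"Throughput is capped at {limit} req/s by the {name}")
```

In this made-up example, doubling the web servers changes nothing: the whole system still moves at the speed of the database, so that is where tuning pays off first.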

But the magic ingredient of cost-effective improvements is still knowledge: performance testing can deliver information about certain limits of the system, so you know what to expect. The first limit is the number of concurrent users for which the system responds as expected (Limit 0). By increasing the number of users (this is called stress testing) we determine the number of users for which the system still responds within acceptable parameters: it is obviously slow but still usable. That is Limit 1, the acceptable limit. We keep increasing the number of users, respecting their real behavior, until we discover Limit 2, the point where the system becomes so slow that it is no longer usable although it still responds. And we keep increasing the number of users until the system fails (as some shops did last Friday). At this failure limit, Limit 3, we stop the test.
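As a rough sketch of how these limits could be read from stress-test results, the Python function below walks through measurements taken at increasing load. The sample numbers and the two response-time thresholds (`expected_s`, `acceptable_s`) are assumptions made up for the example, not values from a real test.

```python
def find_limits(samples, expected_s, acceptable_s):
    """Derive Limit 0..3 from stress-test samples.

    samples: (concurrent_users, response_time_in_seconds) pairs sorted by
    increasing users; a response time of None means the system failed.
    """
    limits = {"limit_0": None, "limit_1": None, "limit_2": None, "limit_3": None}
    for users, response_s in samples:
        if response_s is None:
            limits["limit_3"] = users  # failure limit: stop the test here
            break
        if response_s <= expected_s:
            limits["limit_0"] = users  # highest load that responds as expected
        if response_s <= acceptable_s:
            limits["limit_1"] = users  # highest load that is slow but usable
        elif limits["limit_2"] is None:
            limits["limit_2"] = users  # first load that responds but is unusable
    return limits


# Hypothetical measurements, for illustration only.
samples = [(100, 0.8), (200, 1.1), (400, 2.5), (800, 4.0), (1600, 9.5), (3200, None)]
limits = find_limits(samples, expected_s=1.5, acceptable_s=5.0)
print(limits)
```

With these invented numbers the system behaves as expected up to 200 users, stays usable up to 800, is unusable at 1600 and fails at 3200.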

How can we use this information in practice, and what is the connection with productivity? Remember that in solution number one I said to determine the root cause of issues. We do the same in a stress test, and on top of that we now know when failure will occur. When we reach Limit 1, we know that we have a certain amount of time (calculated from the behavior of users and the analysis of past experience) until Limit 2 is reached. So at this point we can increase resources at the weak point we discovered during the performance test. When we reach Limit 2, we know how much time we have until Limit 3, and at least we can send warning messages or store the requests in a queue: anything that might create a better experience for our customers. You do not want customers to discover errors, failures or bugs. These are very, very bad for your image. Or, at the very least, if you decide to let the system fail, you make an informed decision.
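A small Python sketch of this early-warning logic follows. It assumes, purely for illustration, that the number of users grows linearly at a known rate; in reality the estimate would come from user behavior and past experience, as described above. The limit values and the action names are hypothetical.

```python
def minutes_until(current_users, next_limit, growth_per_minute):
    """Estimate minutes until next_limit is reached, assuming linear growth."""
    if growth_per_minute <= 0:
        return None  # load is flat or falling: the limit is not being approached
    return max(0.0, (next_limit - current_users) / growth_per_minute)


def recommended_action(current_users, limits):
    """Map the current load to the reaction prepared in advance."""
    if current_users >= limits["limit_2"]:
        return "queue requests and warn customers"
    if current_users >= limits["limit_1"]:
        return "add resources at the known weak point"
    return "keep monitoring"


# Hypothetical limits, as a stress test might have produced them.
known_limits = {"limit_1": 800, "limit_2": 1600, "limit_3": 3200}

print(recommended_action(900, known_limits))
print(minutes_until(900, known_limits["limit_2"], growth_per_minute=35))
```

The point is not the arithmetic but the fact that both the thresholds and the reactions are decided before Black Friday, not during the outage.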

If the failure takes you by surprise, the engineers in charge of the system need time to discover the error, then time to find a fix and apply it… all of it live, and all that time your e-commerce site is down!

With performance testing you will be like the emergency room of the best hospital: you get a call from the ambulance telling you what patient to expect, you have some time to prepare, and when the patient arrives you act with maximum speed and efficiency! This is the value of testing, you know!

