How SRE Can Make E-Commerce 20-30% More Sustainable During Peak Seasons

December is one of the busiest months of the year for retailers, and the holidays have people rushing to shop: according to McKinsey, shoppers in 2021 intended to buy 7% more during the holidays than in 2020 and 9% more than in 2019. So how can businesses make sure they are ready to provide their customers with a high-quality shopping experience?
12/23/21
By Denis Baranov
Principal Solutions Consultant
By Frankie Mitton
Business Development Manager

The first thing a business needs to do is stabilize its systems so they can handle an influx of traffic in which the number of orders may grow from, say, 5K to 7-10K within a month.

This is where site reliability engineering (SRE) specialists come in. They play a big part in stabilizing systems and improving their performance parameters, responsiveness, and, importantly, endurance.

SRE practices are highly effective when you need to avoid a system crash and ensure uninterrupted operation. SRE specialists are well-versed in various areas of information technology; they see the entire system and understand how it functions.

They work with every layer, from the user interface to the computers and servers behind it, and can easily navigate the processes involved. SRE methods can be useful in various business areas where monolithic systems are being built.

Site Reliability Engineering Case Study

For many years, DataArt has been helping clients in the Retail and Distribution sector to create custom supply chain management solutions using SRE practices. For instance, we helped a large online retailer with the optimization and maintenance of their online marketplace.

We focused on the internal part responsible for the entire process from order placement to delivery at the customer’s door, that is, the optimization of delivery, routing, and vehicle planning. We worked on the project for over a year, studying the system and working to maintain and optimize it. During this process, we recognised that it was not ready for significant fluctuations in traffic: the system had been developed decades earlier and had grown so much that it had become overly complicated. When customer demand increased significantly, we had to completely redesign the system to meet new requirements and cope with high seasonal demand.

It was necessary to locate bottlenecks and fix them immediately. To do so, we suggested involving our SRE engineers, who enhanced the internal part of the system to ensure its sustainability while the web shop’s performance was optimized in parallel.

The situation was urgent, and we managed to optimize the system and increase its stability at no additional cost to the client and without expanding the hardware. All optimization was done on the software and process side: we identified potential areas of improvement, then refactored and optimized the old code.

Debugging & Fixing: Things to Be Aware of

One of the biggest concerns in these situations is testing and debugging. The test environment is used only for regular tests, and we cannot conduct stress tests on it. We cannot artificially generate the load that occurs in real life during a peak season such as Christmas, or during the first lockdown. In practice, preparing for moments like this is possible only while the system is under heavy real load, e.g. during the Christmas period.

Debugging therefore has to start early and take place in production, which makes it harder: you need to identify and analyze the bug by taking logs from the running program while you perform an action, to understand what is happening and at what point it stops. You have to be really careful, as this is an operating business and one wrong move can affect a large number of people.

During the holidays last year, one of our clients relied on a service that ensured interaction between all of the client’s systems. When the load on the service increased, it started to fail and could not process requests. The introduction of the service allowed us to identify the system’s vulnerabilities. We took a closer look at the vulnerable area and modified just one line. After that single-line modification, the system could easily sustain three times the load: where it previously failed at 3K people, it could now handle 9K, and more if needed.

To fix an issue, you have to track it down, analyze it, come up with a solution, test it, and only after all that make changes.

For better efficiency, we developed a monitoring system for the whole platform. It allowed us to track real-time system activity, collect all the necessary statistics and metrics, and understand how the system functions. This solution gave us the opportunity to see exactly where a failure might occur and fix it before it does.
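
To make the idea of tracking real-time activity more concrete, here is a minimal sketch of a sliding-window monitor for a single service. It is an illustration only, not the monitoring system described above: the class name `SlidingWindowMonitor`, the 60-second window, and the chosen metrics (request count, error rate, 95th-percentile latency) are all assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional, Tuple


@dataclass
class WindowStats:
    requests: int                     # samples currently inside the window
    error_rate: float                 # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: Optional[float]   # None when the window is empty


class SlidingWindowMonitor:
    """Hypothetical helper: keeps the last `window_s` seconds of request samples."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        # Each sample is (timestamp in seconds, latency in ms, succeeded?).
        self._samples: Deque[Tuple[float, float, bool]] = deque()

    def record(self, timestamp: float, latency_ms: float, ok: bool) -> None:
        """Store one finished request and drop samples that fell out of the window."""
        self._samples.append((timestamp, latency_ms, ok))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Samples arrive in time order, so old ones are always at the left end.
        while self._samples and self._samples[0][0] <= now - self.window_s:
            self._samples.popleft()

    def stats(self, now: float) -> WindowStats:
        """Summarize the requests seen during the last `window_s` seconds."""
        self._evict(now)
        n = len(self._samples)
        if n == 0:
            return WindowStats(0, 0.0, None)
        errors = sum(1 for _, _, ok in self._samples if not ok)
        latencies = sorted(latency for _, latency, _ in self._samples)
        # Nearest-rank 95th percentile, computed with integer arithmetic.
        rank = (95 * n + 99) // 100
        return WindowStats(n, errors / n, latencies[rank - 1])
```

A real platform would keep one such window per component and ship the numbers to a dashboard. The point of the sketch is only that a rising error rate or 95th-percentile latency is visible before the service stops entirely.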

Thanks to our knowledge of SRE practices, we managed to optimize the system within a tight timeframe in a critical situation. The number of orders that the system can handle has tripled, while the number of system crashes has dropped to zero. The system is now scalable enough to support future leaps in online demand.

Key Takeaways

1. If you have a complex system, to ensure its stability during peak seasons (and in general), engage a team of excellent site reliability engineering specialists. SRE can be applied in various industries and can be useful in those sectors where monolithic applications prevail.

2. To prevent crashes, create a monitoring system that observes the platform and collects statistics and metrics, so you can quickly detect vulnerabilities and fix them preventively. Monitoring is necessary both for the system as a whole and for each of its parts. An increase in traffic load first slows the system down; if the system still has vulnerabilities, the slowdown snowballs until the system stops (it does not crash, but stops).

3. If you have a large, outdated system and do not want to invest in its development but still want to ensure uninterrupted operation, this can be done by optimizing the processes and refactoring the code.

4. When it comes to testing and debugging, first you have to track down the problem, then examine it, offer a solution, test it and only make changes after all the previous steps are completed. Sometimes, a small change in a single line of code can lead to huge system improvement.

5. The alert system should be configured so that it is triggered only when it matters, yet never lets you miss a serious malfunction. What may seem insignificant at first glance can lead to serious disruptions and consequences.

6. Write an escalation algorithm so that it is clear who is responsible for which section, how the team is alerted, and how people cooperate to solve the problem.

7. Your IT partner must be very trustworthy and have a strong customer focus and sense of responsibility; you would want them to take care of your system as they would of their own.
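
Takeaways 5 and 6 can be sketched in a few lines. The example below is an assumption-laden illustration rather than a description of any real configuration: the alert fires only when a metric stays above its threshold for a sustained period (so short spikes do not trigger it), and a simple escalation list decides who should have been notified by a given time. The names `SustainedThresholdAlert`, `EscalationStep`, and `contacts_to_notify`, the thresholds, and the role names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SustainedThresholdAlert:
    """Fires only after the metric stays above `threshold` for `hold_s` seconds."""
    threshold: float
    hold_s: float
    _breach_started: Optional[float] = field(default=None, init=False)

    def observe(self, now: float, value: float) -> bool:
        """Feed one measurement; return True while the alert should be firing."""
        if value <= self.threshold:
            self._breach_started = None   # recovered: reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now    # first measurement above the threshold
        return now - self._breach_started >= self.hold_s


@dataclass
class EscalationStep:
    contact: str      # hypothetical role name, e.g. "on-call engineer"
    after_s: float    # seconds the alert may stay unacknowledged before this step


def contacts_to_notify(steps: List[EscalationStep], unacked_for_s: float) -> List[str]:
    """Return everyone who should have been notified by now, in list order."""
    return [step.contact for step in steps if unacked_for_s >= step.after_s]


# Example values, chosen only for illustration.
error_rate_alert = SustainedThresholdAlert(threshold=0.05, hold_s=120)
escalation = [
    EscalationStep("on-call engineer", 0),
    EscalationStep("team lead", 900),
    EscalationStep("service owner", 1800),
]
```

With these example values, an error rate above 5% has to persist for two minutes before anyone is paged, and an alert left unacknowledged for 15 minutes reaches the team lead as well as the on-call engineer.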

Conclusion

In order to avoid system failures, retailers need to implement comprehensive performance monitoring and alert systems. There must be a team of excellent specialists, SRE engineers, who can see the whole system, understand the process, and react accordingly. The way your vendor approaches cooperation is extremely important and so having a reliable IT partner is key.

DataArt can help retailers to stabilize their systems! Our company has handled many similar projects and can be useful from mapping through to the actual implementation and testing of new software. Contact us for more details!
