When describing the benefits of a new web application and observability platform to manage its online sales, Silvia Thom, Chief Technology Officer at Zalora, draws the analogy of an engineer boarding a flight and then having to quickly abandon their seat when alerted to a potentially critical technical issue.
“In the past we had systems to alert us before any critical technical issue occurred but it wasn't doing so in a timely manner,” she said in an interview with iTNewsAsia for this case study.
Zalora in 2018 turned to New Relic, which provided the retailer with the ability to uncover blind-spot vulnerabilities and critical errors that can adversely affect the end-user experience.
“After deploying the new platform, we had the systems in place to be alerted before any critical technical issue occurred. That really saved the day,” Thom said.
Previously Zalora’s data was distributed across different systems, making it difficult to get a transparent view.
“With the New Relic dashboards we can now easily and quickly understand what’s happening in our IT environment. The visualisation tools on the dashboards provide instant access and real-time insights into our tech stack which were not available,” said Thom.
Since the pandemic began, online retail has risen dramatically, she added, with many brand partners having to close physical retail stores and collaborate with Zalora, creating a new dynamic for how the retailer goes to market.
"Online sales events are always very demanding from a technical point of view. Many online retailers suffer outages, and we’ve also experienced issues in past years. With the continued rise of online shopping – driven significantly by the COVID-19 crisis – we knew we needed to increase our usage to get observability and optimise our systems."
Going full hilt on observability
With its system architecture growing increasingly complex, and more applications needing to be monitored every year, Zalora has moved towards full observability. It massively scaled up its use of New Relic in terms of host count. In September 2020 the retailer had 270 hosts, rising to 300 in a matter of weeks.
New Relic has also helped the team stop issues from escalating by fixing them before they affect the end-customer experience.
"One transaction, in particular, made a number of requests to the database, causing the entire system to become sluggish due to a system overload. New Relic was quick and efficient in identifying the rogue query, which allowed us to optimise the code and saved us a considerable amount of downtime," Thom said. “This now takes a few seconds. In the past we didn’t have the systems in place to get to the root cause of the system overload.”
Load testing ensured seamless 11/11 sales
Zalora now has a ‘single pane of glass’ overview of all its systems and can carry out load testing.
With mega sale days falling on dates such as 9/9, 10/10 and 11/11, the tech team is constantly on high alert. The surge in traffic on these days is immense, Thom explained, putting a strain on systems at exactly the time when uptime is critical.
"When there is a mega day, the revenue that we're making is really crucial for us as a business. Any potential downtime, even if it's just a couple of minutes, would have a significant financial impact."
“During any of our mega days – what we call the big shopping days like 11.11 Singles Day, Black Friday etc. – we have war rooms where we monitor New Relic dashboards. We look at real-time data such as, for example, the number of API requests per minute to monitor and control the load on our systems,” she added.
"If we see load going up, then this helps us a lot in making real-time and fast decisions about scaling up our infrastructure. We also look at things like latency and loading time to see how good our customer experience is in real-time in terms of speeding and loading screen after screen."
- Silvia Thom, Chief Technology Officer at Zalora
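The war-room check Thom describes, watching API requests per minute and deciding when to scale, can be illustrated with a minimal Python sketch. It is not Zalora's tooling and does not use any New Relic API; the class name, the 60-second window and the fixed threshold rule are all assumptions made for illustration.

```python
from collections import deque


class RequestRateMonitor:
    """Counts requests seen inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()  # request times in seconds, oldest first

    def record(self, timestamp):
        """Register one request observed at `timestamp` (seconds)."""
        self._timestamps.append(timestamp)

    def requests_in_window(self, now):
        """Discard requests older than the window, then count what is left."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


def should_scale_up(requests_per_minute, threshold):
    """Assumed rule: scale up once the observed rate exceeds a fixed threshold."""
    return requests_per_minute > threshold
```

In practice the threshold would come from load-test results rather than a hard-coded number, and the decision would also weigh latency and page-load time, as Thom notes.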
Thom said New Relic is central to Zalora’s pre-load testing regime, as it is critical to ensure that network capacity can scale to manage traffic spikes on mega sale days. To prepare for the last 11/11 sale, the retailer used the platform to record and collect data on transactions from past sales events. This data was then used to estimate the load for the 11/11 campaign.
She added that Zalora saw more customers and higher conversions than in previous years. As examples, she said group net merchandise value grew more than 30% year-on-year, the number of new customers grew about 18%, reactivated customers rose 68.2% and the conversion rate jumped about 22.2%.
"Running a load test with the estimation gave us a good number of servers and specifications for our database which we employed to pre-scale the systems before sales events," Thom said.
"We now use New Relic a lot in load testing for peak sale events - that’s where we see the most value. In the weeks leading up to a sale period, we load-test the microservices, all the different apps, and all the different underlying infrastructure across all the different country sites, to make sure that they perform as expected on those days."
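The estimation step Thom mentions, turning data from past sales into a server count for pre-scaling, might look something like the Python sketch below. The function name, the growth and headroom factors and the per-server capacity are hypothetical; the article does not disclose Zalora's actual model or figures.

```python
import math


def estimate_servers(past_peak_rpm, expected_growth, per_server_rpm, headroom=0.25):
    """Estimate how many servers to provision before a sale event.

    past_peak_rpm   -- peak requests per minute recorded at a past event
    expected_growth -- assumed traffic growth as a fraction (0.5 means +50%)
    per_server_rpm  -- requests per minute one server sustained in load tests
    headroom        -- assumed safety margin on top of the projection
    """
    if per_server_rpm <= 0:
        raise ValueError("per_server_rpm must be positive")
    projected_peak = past_peak_rpm * (1 + expected_growth)
    required = projected_peak * (1 + headroom) / per_server_rpm
    # Round up: a fractional server still means one more machine.
    return math.ceil(required)


# Illustrative numbers only, not Zalora's.
servers_needed = estimate_servers(
    past_peak_rpm=100_000,
    expected_growth=0.5,
    per_server_rpm=5_000,
)
```

A load test run at the projected peak would then confirm or correct the estimate before the systems are pre-scaled.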
"It’s very important for my team that they can move fast and have the right tools at hand to solve any particular part of their engineering puzzle."
According to Thom, the next step for Zalora is to move to New Relic One, which will eliminate host-based pricing and thus be more cost-effective, as well as delivering “the ability to instrument everything”.
The retailer is also modernising its existing architecture to help accelerate time to market, and growing its SRE team by 33% to around 40 experts in 2021, with an increasing focus on observability.
Zalora is also looking at implementing SLOs, SLAs and SLIs within its engineering teams and setting specific KPIs, Thom added, a move that followed a recent New Relic workshop.
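For readers unfamiliar with the terms, an SLI is a measured indicator such as the share of successful requests, and an SLO is the target set for it. The Python sketch below shows the general relationship using an availability indicator and an error budget. The function names and figures are hypothetical and are not drawn from Zalora or New Relic.

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests that succeeded (1.0 when there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(sli, slo_target):
    """True when the measured indicator reaches the agreed objective."""
    return sli >= slo_target


def error_budget_remaining(sli, slo_target):
    """Fraction of the allowed failure budget still unspent (negative if overspent)."""
    allowed_failure = 1 - slo_target
    actual_failure = 1 - sli
    if allowed_failure == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_failure == 0 else 0.0
    return 1 - actual_failure / allowed_failure


# Illustrative figures: 999 of 1,000 requests succeeded against a 99.8% target.
sli = availability_sli(999, 1000)
budget_left = error_budget_remaining(sli, 0.998)
```

With these example figures, roughly half of the error budget remains, which is the kind of signal a team could turn into a KPI.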