GoPay's reliability boosted by chaos engineering

GoToFinancial is applying chaos engineering principles to reduce "novel incidents" and improve the reliability of the infrastructure that powers its GoPay service.

GoPay, a payment and personal finance platform in Indonesia with Southeast Asia (SEA) ambitions, is one of several services under GoToFinancial, which is part of GoTo, the merged entity of Gojek and Tokopedia.

GoPay can be used for QR code payments as well as to pay for e-commerce orders, food deliveries or similar.

Speaking at KubeCon Europe 2022, GoToFinancial engineering manager Iqbal Farabi said that operating in the payments space naturally invites system complexity, because the aim is to make GoPay an accepted payment method in as many places as possible.

“It means that GoPay will always come with a lot of integrations with third parties, and this means that we have a lot of microservices and components in our system,” Farabi said.

“It also means that more and more people are getting involved in developing and maintaining the system.”

The result is that an already complex system becomes even more complex over time - and Farabi said a milestone of sorts was reached when it became clear that no single person at GoToFinancial could understand the system end-to-end.

“Everyone understands only a part of the system,” he said.

“That also means that interaction between all the subsystems that make [up] the whole system is not very easy to understand, even for people that have been there for a very long time.

“This leads to a lot of increasingly novel incidents happening - incidents we’ve never faced before.”

Farabi said that GoPay’s 300-strong engineering team set about trying to improve the reliability of the underlying infrastructure about a year ago.

His colleague, senior engineering manager Giovanni Sakti, said that underpinning GoPay there are “more or less 500-plus services, around 30-plus Kubernetes clusters in multiple environments”, with workloads split across multiple clouds and some running in on-premises data centres.

Improve reliability

Farabi said the company wanted to reduce novel incidents and improve reliability, in part so it would not fall foul of the reliability thresholds that regulators set for financial entities.

“If you cross below those thresholds, there are consequences you are going to face with regards to your license to operate as a financial tech company,” Farabi said.

“Therefore, reliability has been a big topic for us. It’s been the main topic for us in our infrastructure engineering team.”

In early work on improving reliability, it quickly became apparent that every team had its own definition of reliability, and that there was no commonly agreed view.

In addition, the number of moving parts made incident response challenging and, due to the rapid growth of GoPay and the teams and systems behind it, ownership and accountability for different parts were often difficult to establish.

Much has changed over the past year, according to Farabi.

The teams agreed on organisation-wide service level indicators (SLIs) and service level objectives (SLOs). They also came up with a “better incident handling mechanism” and better ways “to track which services and components belong to which team” and who is ultimately accountable.
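To illustrate how an SLI relates to an SLO, here is a minimal Python sketch. It is not GoToFinancial's definition: the service name, the 99.9 percent target and the request counts are all made-up assumptions, and a request-success ratio is only one common way to define an availability SLI.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLO:
    """A hypothetical availability objective for one service."""

    service: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def sli(self, good_requests: int, total_requests: int) -> float:
        """Service level indicator: the fraction of requests that succeeded."""
        if total_requests == 0:
            return 1.0  # no traffic, so nothing failed
        return good_requests / total_requests

    def error_budget_remaining(self, good_requests: int, total_requests: int) -> float:
        """Share of the allowed failures not yet used (negative if overspent)."""
        if total_requests == 0:
            return 1.0
        allowed_failures = (1.0 - self.target) * total_requests
        actual_failures = total_requests - good_requests
        return 1.0 - actual_failures / allowed_failures


# Assumed numbers: 1,000,000 requests in the window, 500 of which failed.
slo = AvailabilitySLO(service="demo-payments-api", target=0.999)
measured_sli = slo.sli(good_requests=999_500, total_requests=1_000_000)
budget_left = slo.error_budget_remaining(999_500, 1_000_000)  # roughly half
```

With a shared definition of this kind, teams compare the same number against the same target, rather than each applying its own notion of reliability.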

“However, despite all of that effort, some novel incidents kept happening,” Farabi recalled.

“At some point, there were a few very rough weeks in which some big outages happened - to the point that our stakeholders - the leadership and above - came to sit with us [to understand] what happened.”

Out of this came the resolve to try chaos engineering - a set of practices designed to improve system reliability by making it more resistant to failure when encountering unknown or unanticipated operating conditions.

Farabi said that initially, internal impressions of chaos engineering were that it was about “burning the place down” and “wrecking everything in production”.

However, this kind of testing was not possible in a regulated system, and with some additional research, the nuances of chaos engineering became clearer.

“What we found out is chaos engineering is not only about introducing chaos but especially in organisations that really care about reliability metrics, chaos engineering is about continuous verification,” Farabi said.

“Continuous verification is about proactive experimentation in software to verify the behaviour of our system.”
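A rough sketch of what one such experiment might look like follows. It is an illustration only, not GoPay's tooling: the `ToyService` class and every function name are hypothetical, and the "fault" is simulated in memory. The shape is the commonly described one: check that the system is in its steady state, inject a fault, check again, and always roll back.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    name: str
    steady_before: bool
    steady_after: bool

    @property
    def verified(self) -> bool:
        # The hypothesis holds only if the system was healthy before the
        # fault and was still healthy while the fault was in effect.
        return self.steady_before and self.steady_after


def run_experiment(
    name: str,
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check steady state, inject a fault, check again, always roll back."""
    if not steady_state():
        # Never inject faults into a system that is already unhealthy.
        return ExperimentResult(name, steady_before=False, steady_after=False)
    try:
        inject_fault()
        after = steady_state()
    finally:
        rollback()
    return ExperimentResult(name, steady_before=True, steady_after=after)


class ToyService:
    """Hypothetical stand-in: healthy while at least one replica is running."""

    def __init__(self, replicas: int) -> None:
        self.configured = replicas
        self.running = replicas

    def healthy(self) -> bool:
        return self.running >= 1

    def kill_one_replica(self) -> None:
        self.running = max(0, self.running - 1)

    def restore(self) -> None:
        self.running = self.configured


service = ToyService(replicas=2)
result = run_experiment(
    "kill one replica",
    steady_state=service.healthy,
    inject_fault=service.kill_one_replica,
    rollback=service.restore,
)
```

Here `result.verified` is true because a second replica keeps the toy service healthy. The same experiment against a single-replica service would fail, which is the kind of gap such a check is meant to surface before a real incident does.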

Setting chaos goals

According to Sakti, GoToFinancial started on the chaos engineering path with two clear goals in mind: “build institutional knowledge and improve our systems.”

The need to build institutional knowledge comes from wanting more staff to understand how GoPay’s production environment works.

Sakti said the company was surprised to find that “people’s understanding of the system is completely different from one another.”

“Even for people who are directly working with the system,” Sakti said.

One of the first exercises that GoPay’s teams conducted was to revisit and review all past incident post-mortems “to centralise knowledge on our known limits or capabilities”.

The review also let the teams check whether the recommendations from those post-mortems had actually been carried out.

“The problem with those kinds of items is that it might not be recorded in the organisation’s scope or manner because it might only be tribal knowledge of some certain teams,” he said.

“By revisiting all of the post-mortems again, this time as an organisation, we’re at the same time building our knowledge on our current limits.”

From this effort, GoPay’s team classified the scenarios it had previously encountered according to how much intervention was needed to bring the systems back online.

The categorisations run from those where “we need manual intervention to recover”, to scenarios where the team is unsure of the appropriate action, right up to scenarios where the system is capable of recovering on its own.
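The three categories described above could be recorded in a structure like the following Python sketch. The category names paraphrase the article; the example scenarios are invented for illustration and are not taken from GoPay's post-mortems.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class RecoveryMode(Enum):
    MANUAL = "needs manual intervention to recover"
    UNKNOWN = "appropriate action not yet known"
    SELF_HEALING = "system recovers on its own"


@dataclass(frozen=True)
class Scenario:
    description: str
    mode: RecoveryMode


def summarise(scenarios: Iterable[Scenario]) -> Counter:
    """Count scenarios per recovery mode."""
    return Counter(s.mode for s in scenarios)


def needs_follow_up(scenarios: Iterable[Scenario]) -> List[Scenario]:
    """Scenarios that are not yet self-healing, in their original order."""
    return [s for s in scenarios if s.mode is not RecoveryMode.SELF_HEALING]


# Invented examples, for illustration only.
catalogue = [
    Scenario("one application pod crashes", RecoveryMode.SELF_HEALING),
    Scenario("message queue fills up", RecoveryMode.MANUAL),
    Scenario("third-party gateway returns malformed responses", RecoveryMode.UNKNOWN),
]

counts = summarise(catalogue)
backlog = needs_follow_up(catalogue)
```

A catalogue like this gives the whole organisation one place to see which failure modes still depend on a person, or on knowledge held by a single team.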

The goal is to have most if not all systems able to recover on their own. However, Sakti noted that this ability would need to be verified continuously.

GoToFinancial is currently doing the majority of these verifications manually, Farabi said.

“Going forward, we’d like to do it in a more automated and continuous manner.”

It has experimented with two tools to enable this - Chaos Mesh and LitmusChaos - and found both capable of helping to run and manage experimentation.

“Tooling will help us to conduct our experiments more systematically,” Farabi said.
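For a sense of what an experiment definition looks like in one of these tools, the Python sketch below builds a Chaos Mesh `PodChaos` manifest that kills a single matching pod. The resource kind and field names follow Chaos Mesh's published custom resource format, but the experiment name, namespace and label are hypothetical, and nothing here reflects the experiments GoToFinancial actually ran.

```python
import json


def pod_kill_experiment(name: str, namespace: str, app_label: str) -> dict:
    """Build a Chaos Mesh PodChaos manifest that kills one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",  # terminate the selected pod
            "mode": "one",         # pick a single pod from the matches
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


# Hypothetical names; a real experiment would target a team's own workloads.
manifest = pod_kill_experiment(
    name="kill-one-demo-pod",
    namespace="staging",
    app_label="demo-payments",
)
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON could be saved to a file and applied with `kubectl apply -f` on a cluster where Chaos Mesh is installed. Keeping the definition in code or in version control is one way to make experiments repeatable rather than ad hoc.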

The experiments are still mostly being run as GameDay exercises, in what are effectively simulated environments.

However, the long-term goal is to be able to run experiments on the production system, without causing issues.

“This is the grail of chaos engineering,” Sakti said. “It’ll take a lot of preparation before we even get here.”

Farabi added, “Ultimately as an organisation, what we want is to continuously improve institutional knowledge about our complex system and take actions to improve its reliability.”