
How One Small Mistake Can Disrupt Production: Insights from Mykyta Savin, DevOps Infrastructure Architect at P2H


In March 2023, DevOps fwdays’23, an influential online conference, was dedicated to sharing best practices and tools in the DevOps sector. The lineup featured speakers from renowned tech companies such as SoftServe, Spotify, Luxoft, Snyk, Solidify, Zencore, Mondoo, and P2H, to name a few.

Mykyta Savin, DevOps Infrastructure Architect at P2H, presented “How We Block Production: Triangulating Issues, Solutions, and Postmortems,” highlighting the impact of minor errors on production and providing strategies for resolution. 

While we’re eager to share this valuable insight, we hope you won’t need it.

Project Overview

P2H leads the development of an E-Government platform that handles the intricacies of integrating with government systems. The platform streamlines interactions with the country’s labor market for both citizens and businesses. Built on an asynchronous architecture, it continually evolves with new services, backed by a robust tech stack and well-defined processes.

Tech Stack and Processes

  • Microservice architecture
  • Front end: Vue.js, React.js
  • Back end: Ruby, Ruby on Rails, Java, PHP
  • Message broker: RabbitMQ
  • Global cache: Elasticsearch
  • Infrastructure: Docker
  • Monitoring, observation, and tracing: Grafana, Grafana Loki, Grafana Tempo, Prometheus, OpenTelemetry, Vector
  • Integrations: IBM APP Connect, IBM API Connect, Absher, Unifonic, Mada, SADAD, and more

The project employs a microservice architecture, with over a hundred microservices currently in production. Most are written in Ruby, while Java powers the latest additions. The Enterprise Service Bus (ESB) pattern, with RabbitMQ as the message broker, handles the project’s asynchronous operations.
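
To give a feel for this asynchronous flow, here is a minimal sketch of publishing an event to RabbitMQ from Ruby with the Bunny gem. The host, exchange, routing key, and payload are illustrative placeholders, not the project’s actual names.

```ruby
require "bunny"
require "json"

# Connect to RabbitMQ (placeholder host) and open a channel.
connection = Bunny.new(host: "prod-rabbit-new-lb")
connection.start
channel = connection.create_channel

# Publish a durable event to a topic exchange; consumers pick it up asynchronously.
exchange = channel.topic("esb.events", durable: true)
exchange.publish(
  { event: "application.submitted", id: 123 }.to_json,
  routing_key: "labor.applications",
  persistent: true
)

connection.close
```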

Elasticsearch and PostgreSQL form the storage layer, with Docker and Docker Compose used for infrastructure management. To comply with government regulations on data locality, the platform is hosted with an in-country provider in Saudi Arabia.

Monitoring is handled through the Grafana stack. The platform integrates with various ministries and private institutions. RabbitMQ runs as a resilient cluster of four nodes, accessible via the prod-rabbit-new-lb load balancer.

Problem Identification and Team Actions

Automated Prometheus alerts highlighted concerning trends in:

  • Average page processing time.
  • Number of gateway timeouts.

Upon manual confirmation by the operations team, it became evident that the website was experiencing significant sluggishness. Consequently, an incident was swiftly initiated, prompting the formation of a dedicated war room. Comprising the operations team, L3 support, and service owner representatives, the war room convened to address the escalating problem.

The situation rapidly deteriorated, with the service becoming unresponsive within 15 minutes of detection. To mitigate the impact on clients, the service owner promptly transitioned it to maintenance mode and restricted client access.

Triangulation

Amidst mounting pressure from the client’s management team, we embarked on a thorough investigation guided by a concise yet effective anamnesis checklist designed to pinpoint and address the issue. We asked the following questions:

  • Have there been any recent system changes?
  • Have there been any recent deployments?
  • Is RabbitMQ (RMQ) functioning properly without any overloads?
  • Are there any anomalous entries in the logs or monitoring system?
  • Have we systematically examined all components?

Given the project’s reliance on an ESB architecture, the specifics of RMQ’s operation are critical, so they are among the first items on our checklist to verify. Once basic RMQ functionality is confirmed, we check for overloads and examine resource utilization.

A concerning issue surfaced when we noticed a significant increase in traffic through the RMQ load balancer. While it typically handles around 30 Mbit/s of traffic, the load had spiked to 300 Mbit/s.

Initially, we didn’t fully grasp the seriousness of this deviation, prompting a thorough investigation into the sources of system traffic and why this anomaly wasn’t immediately apparent.

After about 20 minutes of investigation, we revisited the load balancer and saw that the high traffic had been sustained for that entire time: still 300 Mbit/s in both directions. Upon closer examination, we identified a single port handling the 300 Mbit/s and saturating the bandwidth of the load balancer’s network port.

We suspected that this excessive bandwidth usage was related to RMQ and was one of the factors causing the system to fail.

We delved deeper into RMQ’s queue dynamics. Although the message counts looked normal, the message flow suggested a problem: the rate at which messages were being taken from the queues and re-queued was close to zero. In other words, the RMQ cluster was occupied by something consuming the entire bandwidth, while new messages were not being enqueued and old ones were not being removed. The scenario resembled a potential cache-poisoning event, prompting further investigation.

Despite finding no abnormalities in the log records, RMQ’s control panel displayed a slight increase in unacknowledged messages, catching our attention.
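
The counters shown in the control panel can also be read programmatically through RabbitMQ’s management HTTP API, which is handy during an incident. Below is a rough sketch of that check in Ruby; the host, port, and credentials are placeholders, and it assumes the management plugin is enabled.

```ruby
require "json"
require "net/http"
require "uri"

# List every queue with its ready and unacknowledged message counts.
uri = URI("http://prod-rabbit-new-lb:15672/api/queues")
request = Net::HTTP::Get.new(uri)
request.basic_auth("monitoring", "secret") # placeholder credentials

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
queues   = JSON.parse(response.body)

queues.each do |q|
  # A growing "unacked" count with near-zero throughput was the symptom here.
  puts format("%-40s ready=%d unacked=%d",
              q["name"], q["messages_ready"], q["messages_unacknowledged"])
end
```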

As you may already know, RMQ supports several message acknowledgement modes. In our case, we used the mode that requires an acknowledgement from the client: the client processes the message and then notifies RMQ. If the client fails to acknowledge the message, it is not deleted and remains in the queue; after a certain period, RMQ returns it to the queue and assigns it to another client. This mechanism safeguards against message loss and ensures eventual processing. Our situation appeared to mirror this behavior, so to validate the hypothesis, we set out to identify the responsible service.
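
For readers less familiar with this mode, here is a simplified Sneakers worker using manual acknowledgements. The queue name and processing logic are illustrative; the point is that a message is removed only after ack!, and an unacknowledged message is eventually redelivered to another consumer.

```ruby
require "json"
require "sneakers"

class ApplicationWorker
  include Sneakers::Worker
  # ack: true means RMQ keeps the message until we explicitly acknowledge it.
  from_queue "labor.applications", ack: true

  def work(raw_message)
    payload = JSON.parse(raw_message)
    process(payload)
    ack!        # the message is now removed from the queue
  rescue StandardError
    requeue!    # return the message so another consumer can retry it
  end

  private

  def process(payload)
    # domain logic would live here
  end
end
```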

Problem Resolution

Following a thorough review of the node exporter metrics, we pinpointed an instance generating an unusual volume of traffic that correlated with the load on the RMQ load balancer. On that instance we found a group of containers restarting frequently, which in turn caused the service to restart repeatedly. Further investigation revealed that the service was not, in fact, restarting on its own: it was being killed by the out-of-memory (OOM) Killer and then starting again automatically.

The service was written in Ruby and relied on the Sneakers framework, whose default behavior is to prefetch multiple messages from RMQ. It turned out that the service was fetching excessively large messages whose combined size exceeded the container’s Docker memory limit. The container was killed, severing its connection to RMQ; RMQ then re-enqueued the prefetched messages and delivered them to another instance reading from the same queue, which hit the same limit. As more and more of these oversized messages accumulated, cache poisoning ensued within RMQ, and its bandwidth was monopolized by repeatedly delivering and re-enqueuing the prefetched messages.
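
Prefetch is configurable in Sneakers, both globally and per queue. The sketch below shows how the batch size could be capped so a consumer never holds more than one unacknowledged message at a time; the connection string, queue name, and values are illustrative, not the project’s actual settings.

```ruby
require "sneakers"

# Global worker defaults: hold at most one unacknowledged message per
# consumer instead of the framework's larger default batch.
Sneakers.configure(
  amqp: "amqp://prod-rabbit-new-lb:5672", # placeholder connection string
  prefetch: 1,
  threads: 1,
  workers: 2
)

# The limit can also be overridden for a single queue when declaring a worker:
#   from_queue "labor.documents", ack: true, prefetch: 1
```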

The Outcome

In the short term, manually raising Docker’s memory limits quickly restored normal system operation, with recovery achieved within 10 minutes. The incident was fully resolved approximately an hour and a half after it began.
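
For context, with Docker Compose such a limit is typically set per service. The fragment below is purely illustrative (the service name, image, and values are placeholders), not the project’s actual configuration.

```yaml
services:
  applications-worker:
    image: registry.example.com/applications-worker:latest
    restart: unless-stopped
    mem_limit: 2g          # raised during the incident to stop the OOM kills
    mem_reservation: 1g
```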

For the long term, we investigated the service further to pinpoint where the large messages were being generated and fixed the root cause. That fix was implemented the same day.

Conclusions

To mitigate similar issues in the future, we enhanced the monitoring system and began collecting metrics to track:

  • Services experiencing frequent restarts, with corresponding alarms.
  • Docker containers killed by the OOM Killer.
  • System processes killed by the OOM Killer.

Additionally, we began collecting metrics on RMQ message sizes and added alarms that fire when a message exceeds a predefined size threshold.
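
As a rough illustration of what OOM-related alerting can look like, here is a sketch of Prometheus alerting rules. It assumes node_exporter and cAdvisor-style container metrics are being scraped; the metric names, windows, and labels are assumptions for the sketch, not the project’s actual rules.

```yaml
groups:
  - name: oom-alerts
    rules:
      - alert: ContainerOOMKilled
        # cAdvisor exposes per-container OOM events where available
        expr: increase(container_oom_events_total[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "A container was killed by the OOM Killer"
      - alert: HostProcessOOMKilled
        # node_exporter exposes the kernel's OOM-kill counter on recent kernels
        expr: increase(node_vmstat_oom_kill[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "A process on {{ $labels.instance }} was killed by the OOM Killer"
```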

Since implementing these measures, we have not encountered similar problems. Container restarts caused by the OOM Killer have ceased, allowing us to redirect our focus to other system-related matters.

Overall, our system has been operating smoothly and without any notable incidents.
