
How one small mistake can block production: the experience of Mykyta Savin, P2H DevOps Architect


In March of this year, DevOps fwdays’23, an influential online conference, was devoted to DevOps practices and tools. The speakers were developers and engineers from SoftServe, Spotify, Luxoft, Snyk, Xebia | Xpirit, Solidify, Zencore, Mondoo, and others.

Mykyta Savin, DevOps Infrastructure Architect at P2H, delivered a presentation on how a small mistake can block production and what to do in such cases (“How we block production. Triangulate issue, fix and postmortem”). 

We are sharing this case because it will come in handy for anyone dealing with similar issues.

Brief product description

P2H has developed an e-government platform for a client from Saudi Arabia. The platform facilitates interaction with the labor market, and the product’s target audience is the country’s citizens and businesses.

Development has been ongoing for several years, and the platform is constantly changing and expanding with new services. It is based on an asynchronous architecture that takes into account the idiosyncrasies of working with integration points in the Saudi Arabian government.

Tech Stack and processes

  • Microservice architecture
  • Front end: Vue.js, React.js
  • Back end: Ruby, Ruby on Rails, Java, PHP
  • Message broker: RabbitMQ
  • Global cache: Elasticsearch
  • Infrastructure: Docker
  • Monitoring, observation, and tracing: Grafana, Grafana Loki, Grafana Tempo, Prometheus, OpenTelemetry, Vector
  • Integrations: IBM APP Connect, IBM API Connect, Absher, Unifonic, Mada, SADAD, and more

The project is based on a microservice architecture. Over a hundred microservices are currently in production, most of which are written in Ruby; new microservices are being launched in Java. The Enterprise Service Bus (ESB) pattern and the RabbitMQ message broker were chosen to implement the project’s asynchronous communication.

The storage layer is built on Elasticsearch and PostgreSQL, and the infrastructure uses Docker, Docker Compose, and an internal provider from Saudi Arabia to meet the data locality requirements of the government regulator. Grafana Stack is used for monitoring, along with numerous integration points with various ministries and private institutions. RabbitMQ functions as a cluster of four nodes accessible through the prod-rabbit-new-lb load balancer.
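
For illustration, here is a minimal sketch (in Ruby, using the Bunny client) of how a service on this kind of setup might talk to the RabbitMQ cluster through the prod-rabbit-new-lb load balancer. The exchange, queue, routing keys, and credentials below are hypothetical placeholders, not the project’s actual names.

    require "bunny"
    require "json"

    # Connect through the load balancer that sits in front of the four-node cluster.
    # Credentials and vhost are placeholders.
    connection = Bunny.new(
      host:     "prod-rabbit-new-lb",
      port:     5672,
      username: ENV.fetch("RMQ_USER", "guest"),
      password: ENV.fetch("RMQ_PASS", "guest"),
      vhost:    "/"
    )
    connection.start

    channel  = connection.create_channel
    exchange = channel.topic("esb.events", durable: true) # hypothetical ESB exchange

    # One service publishes an event onto the bus...
    exchange.publish({ event: "application.submitted" }.to_json,
                     routing_key: "applications.submitted",
                     persistent: true)

    # ...while another binds its queue and consumes asynchronously.
    queue = channel.queue("applications.worker", durable: true)
    queue.bind(exchange, routing_key: "applications.*")

    connection.close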

Problem identification and team actions

The problem was identified through automated Prometheus alerts, specifically:

  • The average page processing time alert fired.
  • The gateway timeout count alert fired.

The operations team immediately verified the alerts manually and found that the website was performing very slowly. As a result, an incident was opened and a war room was formed, including the operations team, L3 support, and representatives of the service owner.

The problem escalated rapidly: 15 minutes after detection, the service had practically stopped responding. The service owner had to switch to maintenance mode and restrict client access.

Triangulation

Understandably, we were working under pressure from the customer’s management to identify the issue. We have a prepared checklist that we bring into play when such issues arise. The items are fairly obvious but serve as effective checkpoints.

Checklist:

  • Have there been any recent system changes?
  • Have there been any recent deployments?
  • Is RabbitMQ (RMQ) functioning properly without any overloads?
  • Have there been any unusual entries in the logs or monitoring system lately?
  • If needed, systematically check all system components.

Since the project’s architecture is based on ESB, it is very sensitive to how RMQ behaves, so verifying RMQ is one of the first items on the checklist. Even when RMQ itself appears healthy, it is worth checking whether it is overloaded and what resources it is using.

We immediately noticed that the load balancer for RMQ was handling a relatively large amount of traffic, which is abnormal. In normal operation, the load balancer carries about 30 Mbit of traffic per second, but 300 Mbit was being used.

At first, we did not pay attention to how suspiciously “flat” that number was. We spent time on a (quite feverish) search through the system: what was generating traffic in RMQ? Where were the messages coming from? And why was this not visible in the monitoring?

Having spent 20 minutes searching for the source of the traffic, we returned to the load balancer and noticed that for those 20 minutes the traffic had stayed at exactly 300 Mbit in both directions. We checked the specifications of the port: bingo, a 300-megabit port. Something was eating up all the bandwidth on the load balancer’s network port!

Accordingly, we determined that something was consuming the entire bandwidth available to RMQ, which was one of the problems causing the system to fail.

We began our investigation from this point: the number of messages in the queues appeared normal, but the rate at which messages were being taken from the queues and re-queued was close to zero. In other words, the RMQ cluster was occupied by something utilizing the entire bandwidth, yet new messages were not being queued and old ones were not being removed. This resembled a cache-poisoning scenario, so we started digging deeper.

There was nothing unusual in the logs, but the RMQ control panel showed a slightly higher than usual number of unacknowledged messages, which caught our attention.
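
To make that check concrete: the same numbers the control panel shows can be pulled from the RabbitMQ management HTTP API. Below is a rough Ruby sketch, assuming the default management port 15672 and placeholder credentials.

    require "net/http"
    require "json"
    require "uri"

    uri = URI("http://prod-rabbit-new-lb:15672/api/queues")
    request = Net::HTTP::Get.new(uri)
    request.basic_auth(ENV.fetch("RMQ_USER", "guest"), ENV.fetch("RMQ_PASS", "guest"))

    response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
    queues   = JSON.parse(response.body)

    # List the queues with the most unacknowledged messages.
    queues
      .sort_by { |q| -q["messages_unacknowledged"].to_i }
      .first(10)
      .each do |q|
        puts format("%-40s ready=%-6d unacked=%-6d",
                    q["name"], q["messages_ready"], q["messages_unacknowledged"])
      end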

As you may know, RMQ supports several acknowledgment modes for messages. We use the mode that requires an explicit acknowledgment from the client: the client processes the message and then informs RMQ that it has done so. If, for some reason, the client fails to acknowledge the processing, the message is not deleted from the queue. After a certain period, RMQ returns it to the queue and hands it over to another client. This mechanism ensures that messages are not lost and will eventually be processed by someone. Our situation seemed to align with this behavior, so to test the hypothesis, we decided to find the service responsible for it.
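
As a rough illustration of this acknowledgment flow, a Sneakers worker might look like the sketch below. The worker and queue names are hypothetical; ack: true is what keeps a message unacknowledged until the client explicitly confirms it.

    require "sneakers"
    require "json"

    class LaborMarketEventsWorker            # hypothetical worker
      include Sneakers::Worker
      from_queue "labor_market.events", ack: true

      def work(raw_message)
        payload = JSON.parse(raw_message)
        handle(payload)
        ack!      # confirm processing; RMQ removes the message from the queue
      rescue StandardError => e
        Sneakers.logger.error("Failed to process message: #{e.message}")
        reject!   # explicit negative acknowledgment; if the worker dies without
                  # answering at all, RMQ re-queues the message for another consumer
      end

      private

      def handle(payload)
        # domain-specific processing would go here
      end
    end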

Problem resolution

After going through the Node exporter metrics, we identified an instance generating an unusual amount of traffic that matched the traffic received by the RMQ load balancer. Within this instance, we discovered a group of containers belonging to one service that were constantly restarting. Upon further investigation, we found that the service was not restarting on its own but was being killed by the OOM Killer (out-of-memory killer) and then automatically restarted.

The service was built with Ruby on the Sneakers framework, which by default prefetches multiple messages from RMQ, and it turned out to be prefetching very large messages. These messages resided in memory, the memory usage exceeded Docker’s limits, and the container was killed.
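
The relevant knob here is Sneakers’ prefetch setting. A minimal configuration sketch, with hypothetical values for the connection URL and worker counts, could look like this; lowering prefetch bounds how many messages a worker holds in memory at any one time.

    require "sneakers"

    Sneakers.configure(
      amqp:     ENV.fetch("RABBITMQ_URL", "amqp://prod-rabbit-new-lb:5672"),
      ack:      true,
      prefetch: 1,   # pull one message at a time instead of the default batch,
                     # so a single oversized message cannot blow the memory limit
      threads:  1,
      workers:  1
    )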

When the container was killed, the connection to RMQ was lost, and RMQ re-queued the prefetched messages, delivering them either to the restarted service instance or to another consumer reading from the same queue.

As a significant number of such messages accumulated, we got a kind of cache poisoning within RMQ: the entire bandwidth was occupied by repeatedly delivering and re-queuing the same prefetched messages.

The result

  • Short-term solution

    Manually increasing the memory limits in Docker allowed the system to resume normal operation and recover within 10 minutes. The incident was fully resolved approximately one and a half hours after it started.
  • Long-term solution

    Further investigation and analysis of the service identified the source that was generating the large messages, and the issue was addressed. The solution was implemented on the same day.

Conclusions

To prevent similar issues in the future, we made improvements to the monitoring system and started collecting metrics to track:

  • Services experiencing frequent restarts with corresponding alarms.
  • Docker containers being killed by the OOM Killer.
  • Processes in the system being killed by the OOM Killer.
  • Message sizes in RMQ, with alarms that fire when a message exceeds certain thresholds (see the sketch after this list).
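
As a sketch of the message-size metric, assuming the prometheus-client gem: a histogram can be registered once and then observed from inside a worker. The metric name, labels, and buckets below are hypothetical.

    require "prometheus/client"

    registry = Prometheus::Client.registry

    message_size = Prometheus::Client::Histogram.new(
      :rmq_message_size_bytes,
      docstring: "Size of messages consumed from RabbitMQ, in bytes",
      labels:    [:queue],
      buckets:   [1_024, 10_240, 102_400, 1_048_576, 10_485_760]
    )
    registry.register(message_size)

    # Inside a worker's work(raw_message) method:
    # message_size.observe(raw_message.bytesize, labels: { queue: "labor_market.events" })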

Since then, we have not encountered similar problems. Alerts about container restarts caused by the OOM Killer have ceased, allowing us to focus on other system issues, and overall everything has been running smoothly and without incident.
