Legacy Knowledge Base
Published Jun. 30, 2025

Mitigation Plan for Node Scale-Down Outages Affecting Non-HA Customers

Written By

Vincent Liu

How To articles are not official guidelines or officially supported documentation. They are community-contributed content and may not always reflect the latest updates to Liferay DXP. We welcome your feedback to improve How To articles!

While we make every effort to ensure this Knowledge Base is accurate, it may not always reflect the most recent updates or official guidelines. We appreciate your understanding and encourage you to reach out with any feedback or concerns.

Legacy Article

You are viewing an article from our legacy "FastTrack" publication program, made available for informational purposes. Articles in this program were published without a requirement for independent editing or verification and are provided "as is" without guarantee.

Before using any information from this article, independently verify its suitability for your situation and project.
Note: Liferay has renamed its Liferay Experience Cloud offerings to Liferay SaaS (formerly LXC) and Liferay PaaS (formerly LXC-SM).

Issue

  • In a non-High Availability (HA) environment, an unplanned site outage occurs due to a Kubernetes node scale-down event, logged as:
    Month Day 00:00:00.00 liferay xxx-xxx deleting pod for node scale down
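
To confirm that an outage matches this issue, the service logs can be searched for the message shown above. The following is a minimal sketch, assuming the logs have been exported as plain text lines; the function name and the second sample line are hypothetical, and only the "deleting pod for node scale down" text comes from the log entry in this article.

```python
# Message text taken from the log entry quoted in this article.
SCALE_DOWN_MARKER = "deleting pod for node scale down"


def find_scale_down_events(log_lines):
    """Return the log lines that report a pod deletion caused by a node scale-down.

    log_lines is any iterable of strings, for example an open text file.
    """
    return [line.rstrip("\n") for line in log_lines if SCALE_DOWN_MARKER in line]


if __name__ == "__main__":
    # Placeholder lines; real entries carry an actual timestamp and pod name.
    sample = [
        "Month Day 00:00:00.00 liferay xxx-xxx deleting pod for node scale down\n",
        "Month Day 00:00:01.00 liferay xxx-xxx unrelated message (hypothetical)\n",
    ]
    for event in find_scale_down_events(sample):
        print(event)
```

If the function returns entries whose timestamps line up with the outage window, the outage was likely triggered by a node scale-down.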

Environment

  • Liferay PaaS

Understanding Kubernetes Node Scaling Events

Liferay PaaS operates on shared infrastructure, which uses resources efficiently but comes with trade-offs. Google Kubernetes Engine (GKE) dynamically adjusts the number of nodes based on workload demands.

When Kubernetes detects a need for more resources, it triggers a ScaleUp event, adding a new node and scheduling a new pod in the background. Conversely, when the cluster detects underutilized nodes, it scales down to improve resource allocation and cost efficiency. Pods running on a node that is being removed are deleted and must be rescheduled on another node.

Why Node Scale-Down Causes Outages in Non-HA Environments

Without a High Availability (HA) setup, each service is a single point of failure, which reduces reliability, performance, and business continuity.

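Kubernetes node scaling occurs automatically and can cause involuntary disruptions. In a non-HA environment, services operate with a single replica. When the node hosting that replica is scaled down, its pod is deleted and the service is unavailable until a replacement pod starts on another node. As a result, unplanned outages, such as those caused by Kubernetes scale-downs, are more likely to occur.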
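
The difference between one replica and several can be illustrated with a small model. This is a sketch for illustration only, not how Kubernetes schedules pods: the replica and node names are hypothetical, and it assumes that in an HA setup the replicas run on different nodes.

```python
def serving_replicas_after_scale_down(placement, removed_node):
    """Return the replicas that are still running after one node is removed.

    placement maps a replica name to the node it runs on.
    """
    return [replica for replica, node in placement.items() if node != removed_node]


# Hypothetical placements: one replica (non-HA) versus two replicas on separate nodes (HA).
non_ha = {"liferay-0": "node-a"}
ha = {"liferay-0": "node-a", "liferay-1": "node-b"}

print(serving_replicas_after_scale_down(non_ha, "node-a"))  # [] -> outage until rescheduled
print(serving_replicas_after_scale_down(ha, "node-a"))      # ['liferay-1'] -> still serving
```

With a single replica, removing its node leaves nothing to serve requests until the pod is rescheduled. With a second replica on another node, traffic can continue while the deleted pod restarts.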

Why are we experiencing more outages recently?

While our infrastructure has always been dynamic, the frequency of these events can vary. It's possible that the conditions triggering more frequent scale-downs were not present previously, or their impact was less noticeable. 

The frequency of these scale-down events depends on the overall resource usage on the cloud infrastructure by all users on the cluster. When resource demands fluctuate, the system may frequently switch between scaling up and down.

Node scale-down events in Kubernetes may also become more frequent due to additional factors, including but not limited to:

  • Increased load and resource utilization: Higher demand on resources can trigger more frequent scaling adjustments
  • Workload changes: Changes in workloads can affect how resources are used, leading to scaling actions
  • Cloud provider updates: Google Kubernetes Engine (GKE) may update autoscaling mechanisms or infrastructure, impacting scaling behavior
  • Hardware or OS maintenance: Maintenance activities can affect node availability
  • Network load balancing adjustments: Changes by Google in network load balancing can influence node scaling

Please note that Kubernetes operational events are automated in the background, and the specific reason for any given event is not publicly disclosed.

Mitigation Strategy

To enhance reliability and reduce the risk of downtime, we highly recommend upgrading to a High Availability (HA) subscription. This ensures multiple replicas of each relevant service are in place, so if one instance fails during a node scaling event, the others can continue to operate.