Mitigation Plan for Node Scale-Down Outages Affecting Non-HA Customers

Note: Liferay has renamed its Liferay Experience Cloud offerings to Liferay SaaS (formerly LXC) and Liferay PaaS (formerly LXC-SM).

Issue

In a non-High Availability (HA) environment, an unplanned site outage occurs because of a Kubernetes node scale-down event, and the logs show a message like:
```
Month Day 00:00:00.00 liferay xxx-xxx deleting pod for node scale down
```
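To confirm that an outage matches this pattern, the service logs can be searched for the message above. The following is a minimal Python sketch, assuming the logs are already available as plain text lines. The function name and the second sample line are hypothetical and only serve as illustration.

```python
# Message text taken from the log excerpt above.
SCALE_DOWN_MARKER = "deleting pod for node scale down"


def find_scale_down_events(log_lines):
    """Return the log lines that report a pod deletion caused by a node scale-down."""
    return [line.rstrip("\n") for line in log_lines if SCALE_DOWN_MARKER in line]


# Sample input: the first line mirrors the excerpt above,
# the second is a made-up unrelated line (assumption, for illustration only).
sample_log = [
    "Month Day 00:00:00.00 liferay xxx-xxx deleting pod for node scale down",
    "Month Day 00:00:05.00 liferay xxx-xxx unrelated message",
]

events = find_scale_down_events(sample_log)
for event in events:
    print(event)
```

Matching the timestamps of these lines against the time of the outage helps tell a scale-down disruption apart from other causes.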

Environment

Liferay PaaS

Understanding Kubernetes Node Scaling Events

Liferay PaaS operates on shared infrastructure, which uses resources efficiently but comes with trade-offs. Google Kubernetes Engine (GKE) dynamically adjusts the number of nodes based on workload demands.

When Kubernetes detects a need for more resources, it triggers a ScaleUp event, adding a new node and scheduling a new pod in the background. Conversely, when the cluster detects underutilized nodes, it scales down by removing them to improve resource allocation and cost efficiency. Pods running on a removed node are deleted, which is the event recorded in the log message above.

Why Node Scale-Down Causes Outages in Non-HA Environments

Without a High Availability (HA) setup, the environment has a single point of failure, which is a significant disadvantage for reliability, performance, and business continuity.

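Kubernetes node scaling occurs automatically and can cause involuntary disruptions. In a non-HA environment, each service operates with a single replica, so when its pod is deleted during a scale-down, the service is unavailable until a replacement pod has started on another node. As a result, unplanned outages, such as those caused by Kubernetes scale-downs, are more likely to occur.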
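The difference between a single replica and multiple replicas can be illustrated with a small Python model. This is a simplified sketch, not how Kubernetes itself decides availability. The node names are hypothetical, and it assumes that the replicas of an HA service run on different nodes.

```python
def service_available(replica_nodes, removed_node):
    """Return True if at least one replica runs on a node that is not being removed."""
    return any(node != removed_node for node in replica_nodes)


# Non-HA: the only replica runs on the node that is scaled down.
non_ha = service_available(["node-a"], removed_node="node-a")

# HA (assumption: replicas are placed on different nodes): one replica survives.
ha = service_available(["node-a", "node-b"], removed_node="node-a")

print(non_ha, ha)  # False True
```

In the non-HA case nothing is left to serve requests while the replacement pod starts. In the HA case the remaining replica keeps the service reachable.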

Why are we experiencing more outages recently?

While our infrastructure has always been dynamic, the frequency of these events can vary. It's possible that the conditions triggering more frequent scale-downs were not present previously, or their impact was less noticeable.

The frequency of these scale-down events depends on the overall resource usage of all users on the cluster. When resource demands fluctuate, the system may switch frequently between scaling up and scaling down.

Node scale-down events in Kubernetes may also become more frequent due to additional factors, including but not limited to:

Increased load and resource utilization: Higher demand on resources can trigger more frequent scaling adjustments.
Workload changes: Changes in workloads can affect how resources are used, leading to scaling actions.
Cloud provider updates: Google Kubernetes Engine (GKE) may update autoscaling mechanisms or infrastructure, impacting scaling behavior.
Hardware or OS maintenance: Maintenance activities can affect node availability.
Network load balancing adjustments: Changes by Google in network load balancing can influence node scaling.

Please note that Kubernetes operational events run automatically in the background, and the specific reason for a given event is not publicly disclosed.

Mitigation Strategy

To enhance reliability and reduce the risk of downtime, we highly recommend upgrading to a High Availability (HA) subscription. This ensures that multiple replicas of each relevant service are in place, so if one instance fails during a node scaling event, the others can continue to operate.