Issue
- In a non-High Availability (HA) environment, an unplanned site outage occurs due to a Kubernetes node scale-down event, which appears in the logs as:
Month Day 00:00:00.00 liferay xxx-xxx deleting pod for node scale down
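To see how often this event has occurred, the message above can be searched for in exported service logs. The following is a minimal Python sketch, assuming the logs are available as plain text lines. The function name and the second sample line are hypothetical; the first sample line keeps the placeholder format shown above.

```python
# Message text taken from the log entry shown in this article.
SCALE_DOWN_MARKER = "deleting pod for node scale down"


def find_scale_down_events(log_lines):
    """Return the log lines that report a pod deletion caused by a node scale-down."""
    return [line.rstrip("\n") for line in log_lines if SCALE_DOWN_MARKER in line]


# Sample input: the first line mirrors the placeholder entry above,
# the second is a made-up unrelated line for illustration only.
sample_lines = [
    "Month Day 00:00:00.00 liferay xxx-xxx deleting pod for node scale down",
    "Month Day 00:00:01.00 liferay xxx-xxx unrelated message",
]

matches = find_scale_down_events(sample_lines)
print(len(matches))
```

Comparing the timestamps of the matching lines with the reported outage times helps confirm whether an outage coincided with a scale-down.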
Environment
- Liferay PaaS
Understanding Kubernetes Node Scaling Events
Liferay PaaS operates on shared infrastructure, which makes resource utilization more efficient but comes with trade-offs. Google Kubernetes Engine (GKE) dynamically adjusts the number of nodes based on workload demands.
When Kubernetes detects a need for more resources, it triggers a scale-up event, adding a new node and scheduling new pods on it in the background. Conversely, when the cluster detects underutilized nodes, it scales down by removing them to improve resource allocation and cost efficiency. Pods running on a removed node are deleted and rescheduled onto the remaining nodes.
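The scale-up and scale-down behavior can be pictured with a simplified model. This is only an illustrative sketch, not the actual GKE autoscaler logic: the `Node` class, the `scaling_decision` function, and the 0.5 utilization threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity: float   # total CPU cores on the node
    cpu_requested: float  # CPU cores requested by pods on the node

    @property
    def utilization(self) -> float:
        return self.cpu_requested / self.cpu_capacity


def scaling_decision(nodes, pending_pods, scale_down_threshold=0.5):
    """Return (action, node_name) for a simplified autoscaler.

    Assumption: pods that cannot be scheduled trigger a scale-up, and a node
    whose utilization is below the threshold is a scale-down candidate.
    """
    if pending_pods > 0:
        return ("scale_up", None)
    candidates = [n for n in nodes if n.utilization < scale_down_threshold]
    if candidates and len(nodes) > 1:
        least_used = min(candidates, key=lambda n: n.utilization)
        return ("scale_down", least_used.name)
    return ("no_change", None)


cluster = [Node("node-a", 4.0, 3.5), Node("node-b", 4.0, 1.0)]
print(scaling_decision(cluster, pending_pods=0))  # ('scale_down', 'node-b')
print(scaling_decision(cluster, pending_pods=2))  # ('scale_up', None)
```

In this model, any pod running on `node-b` would be deleted and rescheduled when the node is removed, which is the event reported in the log message.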
Why Node Scale-Down Causes Outages in Non-HA Environments
Without a High Availability (HA) setup, the environment has a single point of failure, which reduces reliability, performance, and business continuity.
Kubernetes node scaling occurs automatically and can cause involuntary disruptions. In a non-HA environment, each service runs as a single replica. When the node hosting that replica is scaled down, the pod is deleted and the service remains unavailable until a new pod has been scheduled and started on another node. As a result, Kubernetes scale-downs are more likely to cause unplanned outages in non-HA environments.
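The difference between a single replica and multiple replicas can be shown with a small calculation. This is a minimal sketch under stated assumptions: the pod and node names are hypothetical, and the HA example assumes its two replicas run on different nodes.

```python
def replicas_left_after_node_removal(pod_placement, removed_node):
    """Count the replicas still running after one node is removed.

    pod_placement maps each pod name to the node it runs on.
    """
    return sum(1 for node in pod_placement.values() if node != removed_node)


# Non-HA: a single replica, so removing its node leaves nothing serving traffic.
non_ha = {"liferay-0": "node-a"}

# HA (assumed layout): two replicas placed on different nodes.
ha = {"liferay-0": "node-a", "liferay-1": "node-b"}

print(replicas_left_after_node_removal(non_ha, "node-a"))  # 0 -> outage until rescheduled
print(replicas_left_after_node_removal(ha, "node-a"))      # 1 -> service stays available
```

With zero replicas left, the site is down for as long as the replacement pod takes to start. With one replica left, requests continue to be served while the deleted pod is rescheduled.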
Why are we experiencing more outages recently?
While our infrastructure has always been dynamic, the frequency of these events can vary. It's possible that the conditions triggering more frequent scale-downs were not present previously, or their impact was less noticeable.
The frequency of these scale-down events depends on the combined resource usage of all users on the cluster. When resource demand fluctuates, the system may switch frequently between scaling up and scaling down.
Node scale-down events in Kubernetes may also become more frequent due to additional factors, including but not limited to:
- Increased load and resource utilization: Higher demand on resources can trigger more frequent scaling adjustments
- Workload changes: Changes in workloads can affect how resources are used, leading to scaling actions
- Cloud provider updates: Google Kubernetes Engine (GKE) may update autoscaling mechanisms or infrastructure, impacting scaling behavior
- Hardware or OS maintenance: Maintenance activities can affect node availability
- Network load balancing adjustments: Changes by Google in network load balancing can influence node scaling
Please note that Kubernetes operational events are automated and run in the background, and the specific reason for an individual event is not publicly disclosed.