CXone Knowledge Management - Monitoring complete. Status = All Services Running Normally

Incident Report for CXone Expert EU

Postmortem

Impact Start Time (UTC): 2026-05-07 11:55:00

Impact End Time (UTC): 2026-05-07 13:51:00
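
As a quick cross-check of the impact window, the two timestamps above can be subtracted directly. The short sketch below is illustrative only; the variable names are hypothetical and not part of any CXone tooling.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"

# Impact start and end times as stated in this report (UTC).
start = datetime.strptime("2026-05-07 11:55:00", FMT).replace(tzinfo=timezone.utc)
end = datetime.strptime("2026-05-07 13:51:00", FMT).replace(tzinfo=timezone.utc)

# Split the elapsed time into whole hours and remaining minutes.
duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hr {minutes} mins")  # 1 hr 56 mins
```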

Incident Summary:

On 2026-05-07, some NiCE CXone Mpower customers in the EU region experienced slowness when accessing sites within the CXone Mpower Expert knowledge portal, while others were unable to access the platform at all and received a "504 Gateway Timeout" error. The service degradation was caused by increased traffic volumes combined with performance limitations in certain backend processes. The impact was resolved after scaling up pod resources and restarting the proxy pods, which restored platform stability.

Root Cause:

The service degradation was caused by increased traffic volumes combined with performance limitations in certain backend processes, impacting the EU regional platform.

Under elevated traffic conditions, including automated crawler activity, some requests followed less optimized processing paths, increasing system load. This was further amplified by legacy or complex page content requiring more intensive processing.

While scaling actions helped restore capacity, they also introduced temporary overhead that contributed to intermittent performance degradation. Additionally, although autoscaling functioned as designed, it reached its limits and was insufficient to address constraints related to per-pod Central Processing Unit (CPU) capacity.
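
The interaction between autoscaling limits and per-pod CPU capacity can be illustrated with a minimal sketch. It assumes a proportional scaling rule of the kind used by Kubernetes-style horizontal autoscalers and a deliberately simplified latency model. The function names, replica counts, utilization figures, and CPU limits are all hypothetical and are not taken from the actual platform configuration.

```python
import math


def desired_replicas(current_replicas, current_cpu_util, target_cpu_util,
                     min_replicas, max_replicas):
    """Proportional scaling rule, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_cpu_util / target_cpu_util)
    return max(min_replicas, min(max_replicas, raw))


def cpu_bound_latency(cpu_seconds_needed, pod_cpu_limit_cores):
    """Simplified model: one CPU-bound request served by a single pod.

    The result depends only on the per-pod CPU limit, not on how many
    replicas exist, which is why adding pods alone cannot speed it up.
    """
    return cpu_seconds_needed / pod_cpu_limit_cores


# Hypothetical numbers: load calls for 16 replicas, but the ceiling is 12.
print(desired_replicas(10, 0.95, 0.60, min_replicas=4, max_replicas=12))  # 12

# A request needing 2 CPU-seconds on a pod limited to 0.5 cores vs 1.0 core.
print(cpu_bound_latency(2.0, 0.5))  # 4.0
print(cpu_bound_latency(2.0, 1.0))  # 2.0
```

Under these assumptions, once the replica ceiling is reached the autoscaler has no further room to act, and heavy individual requests remain slow until the per-pod CPU allocation itself is raised.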

Overall, evolving traffic patterns exposed underlying performance limitations, highlighting the need for targeted code optimizations and increased per-service capacity.

Corrective Actions:

Detection

Although built-in alerting mechanisms were in place to detect this type of condition, alerts did not consistently reach the responsible teams. In some cases, alerts were grouped or suppressed, which delayed visibility of the issue. Internal teams became aware of the impact primarily through customer reports of slowness when accessing sites within the CXone Mpower Expert knowledge portal.
Enhance alert notification delivery to ensure alarms are reliably triggered and routed to the appropriate response teams. Update by EOD MT on 2026-05-22.

Remediation

The impact was resolved after scaling up pod resources and restarting the proxy pods, which restored platform stability. Completed on 2026-05-07.

Prevention

Enhanced traffic filtering rules at the WAF layer to identify and block a significant portion of automated bot traffic contributing to elevated system load. Completed on 2026-05-07.
Implemented interim mitigation measures to maintain system stability and ensure consistent performance while permanent improvements are finalized and deployed. Completed on 2026-05-12.
Baseline system capacity was increased by raising the minimum number of pods and allocating higher CPU resources to ensure sufficient resources are consistently available, reduce reliance on dynamic scaling, and improve overall system stability during periods of increased demand. Completed on 2026-05-12.
The Engineering team will implement targeted software optimizations, including improvements to a specific endpoint that previously introduced cascading effects during scaling events. These enhancements are designed to reduce resource contention and improve system efficiency under high load conditions. Update by EOD MT on 2026-05-22.

Incident Timeline (UTC):

2026-05-07 11:55 - The first customer case opened, and Tech Support (TS) engineers began the troubleshooting investigation

2026-05-07 11:56 - TS engineers notified the Network Operations Center (NOC) engineers about the reported customer impact; a major incident was proposed and confirmed

2026-05-07 12:09 - Engineers identified a suspected cause and increased the resources of the web pods to improve system performance

2026-05-07 12:18 - Engineers also scaled up resources for the Application Programming Interface (API) pods to further stabilize performance

2026-05-07 12:28 - Peak of 504 Gateway Timeout errors observed across EU sites

2026-05-07 13:18 - The platform continued to catch up, and engineers were already seeing improvements in system performance

2026-05-07 13:42 - Engineers restarted the proxy pods, which further improved performance, and continued to monitor system stability

2026-05-07 13:48 - Platform performance returned to normal levels, with continued validation and monitoring underway

2026-05-07 13:51 - The platform stabilized fully. The impact was resolved following resource scaling, and after successful validation, the major incident was marked as resolved

Posted May 15, 2026 - 21:45 UTC

Resolved

CXone Knowledge Management - Service Disruption Resolved - All Services Running Normally. The CXone Mpower Expert Engineering team has deployed a fix and monitored the deployment to make sure sites are stable. The issue is now resolved. Event duration: 1 hr 56 mins.

Posted May 08, 2026 - 13:27 UTC