# Prometheus Istio Server restarting or in a crash loop
> **NOTE:** Prometheus and Grafana are deprecated and are planned to be removed. If you want to install a custom stack, take a look at Install a custom kube-prometheus-stack in Kyma.
## Condition
Prometheus Istio Server is restarting or in a crash loop (the Pod enters the `CrashLoopBackOff` state).
## Cause
Prometheus Istio Server scrapes metrics from all Envoy sidecars, which may lead to out-of-memory (OOM) issues. For example, this can happen when a high number of workloads communicate heavily with other workloads, or when workloads are created and deleted dynamically. In such cases, the cardinality of the Istio metrics may grow so much that the container is killed because of OOM (Istio telemetry V2 currently doesn't support the concept of metric expiry).
There can be other causes for Prometheus Istio Server to restart or crash, but the following instructions focus on fixing the OOM issue.
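To confirm that OOM is really the cause, you can check the last termination state of the container. This is a minimal sketch, assuming the default Kyma deployment name `monitoring-prometheus-istio-server` in the `kyma-system` namespace:

```bash
# List the Prometheus Istio Server Pods and their restart counts
kubectl get pods -n kyma-system | grep prometheus-istio-server

# Print the last termination reason of the Pod's containers;
# "OOMKilled" confirms that the memory limit was exceeded
kubectl get pod -n kyma-system {POD_NAME} \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```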
## Remedy
To prevent the OOM issue, you can increase the memory limit. Additionally, you can reduce the volume of data by dropping additional labels.
> **CAUTION:** Dropping additional labels with `prometheus-istio.envoyStats.labeldropRegex` has the side effect that graphs in Kiali don't work.
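Before deciding whether to drop labels, it can help to check how many time series Prometheus Istio Server currently holds. A minimal sketch, assuming the default service name `monitoring-prometheus-istio-server` exposing the standard Prometheus port 9090:

```bash
# Forward the Prometheus Istio Server API to localhost
kubectl port-forward -n kyma-system svc/monitoring-prometheus-istio-server 9090:9090 &

# Count the Istio time series; a very large number points to a cardinality problem
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count({__name__=~"istio_.*"})'
```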
For both solutions, you can either change your Kyma cluster settings or update the Istio Prometheus resources directly.
### Change the Kyma settings
To increase the memory limit, create a values YAML file with the following content:
```yaml
monitoring:
  prometheus-istio:
    server:
      resources:
        limits:
          memory: "6Gi"
```

> **TIP:** You should be fine with increasing the limit to 6Gi. However, if your resources are scarce, try increasing the value gradually in steps of 1Gi.
Deploy the values YAML file with the following command:
```bash
kyma deploy --values-file {VALUES_FILE_PATH}
```

If the problem persists, drop additional labels for the Istio metrics with the following values YAML file:
```yaml
monitoring:
  prometheus-istio:
    envoyStats:
      labeldropRegex: "^(grpc_response_status|source_version|source_principal|source_app|response_flags|request_protocol|destination_version|destination_principal|destination_app|destination_canonical_service|destination_canonical_revision|source_canonical_revision|source_canonical_service)$"
```

Change the settings with the following command:
```bash
kyma deploy --values-file {VALUES_FILE_PATH}
```
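To verify that the new limit is active, you can inspect the rendered Deployment. A minimal sketch, assuming the default deployment name `monitoring-prometheus-istio-server`:

```bash
# Print the effective memory limit of the Prometheus Istio Server container
kubectl get deployment -n kyma-system monitoring-prometheus-istio-server \
  -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}'
```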
### Change the Istio Prometheus configuration
To increase the memory for `prometheus-istio-server`, run the following command:

```bash
kubectl edit deployment -n kyma-system monitoring-prometheus-istio-server
```

In your Deployment resource, set the following limits for memory:
```yaml
resources:
  limits:
    cpu: 600m
    memory: 6000Mi
  requests:
    cpu: 40m
    memory: 200Mi
```

> **TIP:** You should be fine with increasing the limit to 6Gi. However, if your resources are scarce, try increasing the value gradually in steps of 1Gi.
If the problem persists, drop additional labels for the Istio metrics by editing the `prometheus-istio-server` ConfigMap:

```bash
kubectl edit configmap -n kyma-system monitoring-prometheus-istio-server
```

Modify the following values:
```yaml
metric_relabel_configs:
  - separator: ;
    regex: ^(grpc_response_status|source_version|destination_version|source_app|destination_app)$
    replacement: $1
    action: labeldrop
```

Change the regex in the following way:
```yaml
regex: ^(grpc_response_status|source_version|source_principal|source_app|response_flags|request_protocol|destination_version|destination_principal|destination_app|destination_canonical_service|destination_canonical_revision|source_canonical_revision|source_canonical_service)$
```

Save the ConfigMap and restart `prometheus-istio-server` for the changes to take effect:

```bash
kubectl rollout restart deployment -n kyma-system monitoring-prometheus-istio-server
```
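After the restart, you can check that the rollout completes and the Pod stops restarting. A minimal sketch, using the same default names as above:

```bash
# Wait until the restarted Deployment is fully rolled out
kubectl rollout status deployment -n kyma-system monitoring-prometheus-istio-server

# Verify that the restart count no longer increases and the Pod stays Running
kubectl get pods -n kyma-system | grep prometheus-istio-server
```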