Monitoring AKS Clusters with Azure Monitor

Azure Monitor Container Insights is the built-in monitoring solution for AKS clusters. It collects node and pod metrics, container logs, and Kubernetes events, storing everything in Log Analytics where you can query it with KQL and build alerts. This page covers enabling Container Insights, the key metrics it provides, and the KQL queries most useful for day-to-day operations.

Enabling Container Insights

Container Insights runs as an agent DaemonSet on your nodes. You can enable it when creating a cluster or add it to an existing cluster. It requires a Log Analytics workspace to send data to.

# Create a Log Analytics workspace
az monitor log-analytics workspace create \
  --resource-group myResourceGroup \
  --workspace-name myAKSLogs \
  --location eastus

# Get the workspace resource ID
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group myResourceGroup \
  --workspace-name myAKSLogs \
  --query id -o tsv)

# Enable Container Insights on an existing cluster
az aks enable-addons \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --addons monitoring \
  --workspace-resource-id "$WORKSPACE_ID"

# Verify the agent DaemonSet is running
# (named ama-logs on current agent versions; older versions use omsagent)
kubectl get daemonset ama-logs -n kube-system

If you enable monitoring at cluster creation time, pass the workspace ID in the az aks create command:

az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-count 3 \
  --enable-addons monitoring \
  --workspace-resource-id "$WORKSPACE_ID" \
  --generate-ssh-keys

After enabling, give the agent 5–10 minutes to begin shipping data before expecting results in Log Analytics queries.
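
While waiting for data to arrive, it can help to confirm that the add-on is actually switched on for the cluster. The sketch below wraps `az aks show` in a small helper. The function name `monitoring_enabled` is hypothetical, and the `addonProfiles.omsagent.enabled` query path is an assumption about how the monitoring add-on is reported; check it against your own `az aks show` output.

```shell
# Hypothetical helper: prints the enabled flag of the monitoring add-on.
# $1 = resource group, $2 = cluster name
# Assumes the add-on is reported under addonProfiles.omsagent in `az aks show`.
monitoring_enabled() {
  az aks show \
    --resource-group "$1" \
    --name "$2" \
    --query "addonProfiles.omsagent.enabled" \
    --output tsv
}

# Usage (requires a logged-in Azure CLI):
#   monitoring_enabled myResourceGroup myAKSCluster
```

An empty result would suggest the add-on profile is missing, in which case re-running the enable step is a reasonable next check.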

What Container Insights captures

The monitoring agent collects data at multiple levels. Understanding what is captured helps you know where to look when investigating issues.

Data type | Log Analytics table | Frequency
Node CPU and memory | Perf | 60 seconds
Pod CPU and memory | Perf | 60 seconds
Container inventory (name, image, state) | ContainerInventory | 60 seconds
Container stdout/stderr logs | ContainerLog / ContainerLogV2 | Real-time
Kubernetes events | KubeEvents | On event
Pod state (running, pending, failed) | KubePodInventory | 60 seconds
Node state and conditions | KubeNodeInventory | 60 seconds

Node-level metrics include total CPU/memory capacity, allocatable CPU/memory (after system overhead), and actual utilization. Pod-level metrics show requested vs actual CPU and memory usage per container.

Useful KQL queries

Log Analytics queries are written in Kusto Query Language (KQL). The queries below cover the most common operational scenarios.
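
These queries are normally pasted into the Logs blade, but they can also be run from a terminal. The following is a minimal sketch, assuming the `az monitor log-analytics query` command is available in your Azure CLI (older versions may need an extension). The helper name `run_kql` is hypothetical. Note that this command expects the workspace GUID (the `customerId` property), not the ARM resource ID used when enabling the add-on.

```shell
# Hypothetical helper: runs a KQL query against a Log Analytics workspace.
# $1 = workspace GUID (the workspace's customerId, not its ARM resource ID)
# $2 = KQL query text
run_kql() {
  az monitor log-analytics query \
    --workspace "$1" \
    --analytics-query "$2" \
    --output table
}

# Usage (requires a logged-in Azure CLI):
#   WORKSPACE_GUID=$(az monitor log-analytics workspace show \
#     --resource-group myResourceGroup \
#     --workspace-name myAKSLogs \
#     --query customerId -o tsv)
#   run_kql "$WORKSPACE_GUID" 'KubePodInventory | where TimeGenerated > ago(30m) | summarize count() by PodStatus'
```

Single-quoting the query keeps the shell from interpreting pipes and double-quoted string literals inside the KQL.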

Node CPU usage over the last hour, in nanocores:

Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SNode"
| where CounterName == "cpuUsageNanoCores"
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m), Computer
| render timechart
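
The cpuUsageNanoCores counter reports raw usage rather than a percentage, so the chart is easier to interpret once a value is divided by the node's capacity. As a worked example of that arithmetic, the sketch below converts a nanocore reading to percent of capacity. The helper name `nanocores_to_pct` is hypothetical, and the core count is something you supply yourself (1 core = 1,000,000,000 nanocores).

```shell
# Hypothetical helper: converts a nanocore usage value to percent of capacity.
# $1 = usage in nanocores (for example an AvgCPU value from the query)
# $2 = node capacity in cores (1 core = 1,000,000,000 nanocores)
nanocores_to_pct() {
  awk -v usage="$1" -v cores="$2" \
    'BEGIN { printf "%.1f\n", usage / (cores * 1000000000) * 100 }'
}

# Example: 1.5 cores of usage on a 4-core node
nanocores_to_pct 1500000000 4   # prints 37.5
```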

Pods in a failed or pending state:

KubePodInventory
| where TimeGenerated > ago(30m)
// Take each pod's latest record first, so pods that have since recovered are excluded
| summarize arg_max(TimeGenerated, *) by PodUid
| where PodStatus in ("Failed", "Pending")
| project TimeGenerated, Namespace, Name, PodStatus, ContainerStatusReason
| order by TimeGenerated desc

Node memory pressure (nodes using more than 80% of memory capacity):

Perf
| where TimeGenerated > ago(15m)
| where ObjectName == "K8SNode"
| where CounterName in ("memoryRssBytes", "memoryCapacityBytes")
// Conditional aggregation yields fixed column names that later steps can reference
| summarize
    RssBytes = avgif(CounterValue, CounterName == "memoryRssBytes"),
    CapacityBytes = avgif(CounterValue, CounterName == "memoryCapacityBytes")
    by Computer
| extend MemoryPct = RssBytes / CapacityBytes * 100
| where MemoryPct > 80
| project Computer, MemoryPct

Container restarts in the last 24 hours — ordered by restart count:

KubePodInventory
| where TimeGenerated > ago(24h)
| where ContainerRestartCount > 0
| summarize arg_max(TimeGenerated, *) by ContainerID
| project Namespace, Name, ContainerName, ContainerRestartCount
| order by ContainerRestartCount desc

OOMKilled events:

KubeEvents
| where TimeGenerated > ago(24h)
| where Reason == "OOMKilling"
| project TimeGenerated, Namespace, Name, Message
| order by TimeGenerated desc

Image pull failures:

KubeEvents
| where TimeGenerated > ago(1h)
| where Reason in ("Failed", "BackOff")
| where Message contains "pull"
| project TimeGenerated, Namespace, Name, Message

Viewing metrics in the Azure Portal

The Azure Portal provides pre-built views for Container Insights. Navigate to your AKS cluster in the portal, then select Insights under Monitoring in the left menu. The Insights view shows:

  • Cluster tab — node count, CPU and memory utilization, active pod count over time.
  • Nodes tab — per-node CPU, memory, disk, and network metrics with drill-down to pods running on that node.
  • Controllers tab — per-Deployment and ReplicaSet metrics.
  • Containers tab — per-container CPU and memory with live log access.

The Live Data feature streams real-time logs and events without needing to run kubectl. It is useful for quick debugging without opening a terminal.

Tip

Pin individual charts from the Container Insights dashboards to a shared Azure Dashboard for your team. This gives operations teams a consistent view without having to navigate to the cluster resource itself.

Setting alerts on node CPU and pod failures

Azure Monitor alerts can trigger on metric thresholds or on KQL query results. For AKS, metric alerts can be evaluated every minute with little delay, while log alerts run on a schedule (commonly every 5 minutes) and are also subject to log ingestion latency.

# Create a metric alert for node CPU > 80% for 5 minutes
# node_cpu_usage_percentage is the AKS platform metric ("Percentage CPU" is a VM metric)
az monitor metrics alert create \
  --name "AKS Node CPU High" \
  --resource-group myResourceGroup \
  --scopes /subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.ContainerService/managedClusters/myAKSCluster \
  --condition "avg node_cpu_usage_percentage > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --description "Node CPU above 80% for 5 minutes"

For pod failure alerts, create a scheduled log alert that queries KubePodInventory:

# Create a log alert rule for failed pods
# The "count" condition counts rows returned by the named query,
# so the query returns one row per failed pod instead of a single summary row
az monitor scheduled-query create \
  --resource-group myResourceGroup \
  --name "AKS Failed Pods" \
  --scopes /subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/microsoft.operationalinsights/workspaces/myAKSLogs \
  --condition "count 'FailedPods' > 0" \
  --condition-query FailedPods="KubePodInventory | where TimeGenerated > ago(5m) | where PodStatus == 'Failed' | distinct PodUid" \
  --evaluation-frequency 5m \
  --window-size 5m \
  --severity 2 \
  --description "One or more pods in Failed state"

Alerts can send notifications via Action Groups — email, SMS, webhook, Azure Function, or Logic App. Create an Action Group in the portal or with az monitor action-group create and attach it to alert rules.
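
As a rough illustration of that last step, the sketch below creates an email action group and shows one way to attach it to the CPU alert. The helper name `create_email_action_group`, the receiver label `oncall`, the short name `aksops`, and the example address are all placeholders, and the `--add-action` flag on `az monitor metrics alert update` should be confirmed against your CLI version.

```shell
# Hypothetical helper: creates an email action group and prints its resource ID.
# $1 = resource group, $2 = action group name, $3 = recipient email address
# The receiver label "oncall" and short name "aksops" are placeholders.
create_email_action_group() {
  az monitor action-group create \
    --resource-group "$1" \
    --name "$2" \
    --short-name aksops \
    --action email oncall "$3" \
    --query id \
    --output tsv
}

# Usage (requires a logged-in Azure CLI):
#   AG_ID=$(create_email_action_group myResourceGroup aks-ops-alerts ops@example.com)
#   az monitor metrics alert update \
#     --resource-group myResourceGroup \
#     --name "AKS Node CPU High" \
#     --add-action "$AG_ID"
```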

Common mistakes

  1. Collecting logs from all namespaces including kube-system. The kube-system namespace generates a large volume of logs from system components. Including it in Container Insights collection significantly increases Log Analytics ingestion costs with limited operational value. Configure the agent’s ConfigMap to exclude kube-system and other system namespaces unless you specifically need to debug those system components.
  2. Setting alerts without testing them. Creating a CPU alert without verifying it fires under real load means you discover the alert is misconfigured during an actual incident. Test alert rules by temporarily lowering the threshold, confirming the alert triggers and the notification arrives, then restoring the correct threshold.
  3. Querying ContainerLog instead of ContainerLogV2. ContainerLogV2 is the newer schema that includes the pod name and namespace directly in the log record, making queries far simpler. ContainerLog requires joining with KubePodInventory to get context. New clusters use ContainerLogV2 by default; older clusters may need the agent ConfigMap updated to enable it.
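
Mistakes 1 and 3 are both addressed through the agent ConfigMap. The following is a sketch of what that could look like, assuming the ConfigMap name `container-azm-ms-agentconfig` and the `log_collection_settings` keys from the published agent template; verify both against the current official template before applying. The helper name `write_agent_configmap` is hypothetical.

```shell
# Hypothetical helper: writes an agent ConfigMap manifest to the given path.
# Assumes the ConfigMap name and setting keys from the published agent template.
write_agent_configmap() {
  cat > "$1" <<'EOF'
kind: ConfigMap
apiVersion: v1
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        exclude_namespaces = ["kube-system"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system"]
      [log_collection_settings.schema]
        containerlog_schema_version = "v2"
EOF
}

# Usage:
#   write_agent_configmap agent-config.yaml
#   kubectl apply -f agent-config.yaml
```

Writing the manifest to a file first makes it easy to review or commit the settings before they reach the cluster.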

Frequently asked questions

What does Container Insights collect?

Container Insights collects node-level CPU and memory utilization, pod and container CPU and memory metrics, container stdout/stderr logs, Kubernetes events, and live data for pods and nodes. It ships this data to a Log Analytics workspace where you can query it with KQL.

Is Container Insights free?

No. Container Insights ingests data into a Log Analytics workspace, which charges based on data volume ingested and retention duration. The first 5 GB per billing account per month is free. Costs vary by cluster size and log verbosity. You can reduce costs by configuring collection settings to exclude high-volume namespaces like kube-system.
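
To see which tables drive ingestion cost, one option is to query the workspace's Usage table. The sketch below only builds the KQL text; the helper name `usage_by_table_query` is hypothetical, and it assumes that Usage reports Quantity in MB, so dividing by 1000 gives an approximate GB figure.

```shell
# Hypothetical helper: prints a KQL query summarizing billable ingestion per table.
# $1 = lookback window in days. Assumes the Usage table reports Quantity in MB.
usage_by_table_query() {
  printf 'Usage | where TimeGenerated > ago(%sd) | where IsBillable == true | summarize IngestedGB = sum(Quantity) / 1000 by DataType | order by IngestedGB desc\n' "$1"
}

# Usage (requires a logged-in Azure CLI; WORKSPACE_GUID is the workspace customerId):
#   az monitor log-analytics query \
#     --workspace "$WORKSPACE_GUID" \
#     --analytics-query "$(usage_by_table_query 30)"
```

If container log tables dominate the result, the namespace exclusion settings are the first place to look.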

Can I use Prometheus and Grafana instead of Container Insights?

Yes. AKS supports a managed Prometheus offering (Azure Monitor managed service for Prometheus) that scrapes metrics from your cluster and stores them in an Azure Monitor workspace. You can then connect Azure Managed Grafana to visualize them. This is the recommended path for teams already using Prometheus-style metrics and PromQL queries.
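
As a starting point, managed Prometheus collection can be switched on from the CLI. This is a minimal sketch, assuming the `--enable-azure-monitor-metrics` flag of `az aks update` is available in your CLI version; the helper name `enable_managed_prometheus` is hypothetical. Linking a specific Azure Monitor workspace or Grafana instance takes additional flags (`--azure-monitor-workspace-resource-id`, `--grafana-resource-id`) that are not shown here.

```shell
# Hypothetical helper: enables managed Prometheus metrics collection on a cluster.
# $1 = resource group, $2 = cluster name
# Assumes the --enable-azure-monitor-metrics flag of `az aks update`.
enable_managed_prometheus() {
  az aks update \
    --resource-group "$1" \
    --name "$2" \
    --enable-azure-monitor-metrics
}

# Usage (requires a logged-in Azure CLI):
#   enable_managed_prometheus myResourceGroup myAKSCluster
```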

Last verified: 19 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.