Monitoring AKS Clusters with Azure Monitor
Azure Monitor Container Insights is the built-in monitoring solution for AKS clusters. It collects node and pod metrics, container logs, and Kubernetes events, storing everything in Log Analytics where you can query it with KQL and build alerts. This page covers enabling Container Insights, the key metrics it provides, and the KQL queries most useful for day-to-day operations.
Enabling Container Insights
Container Insights runs as an agent DaemonSet on your nodes. You can enable it when creating a cluster or add it to an existing cluster. It requires a Log Analytics workspace to send data to.
# Create a Log Analytics workspace
az monitor log-analytics workspace create \
--resource-group myResourceGroup \
--workspace-name myAKSLogs \
--location eastus
# Get the workspace resource ID
WORKSPACE_ID=$(az monitor log-analytics workspace show \
--resource-group myResourceGroup \
--workspace-name myAKSLogs \
--query id -o tsv)
# Enable Container Insights on an existing cluster
az aks enable-addons \
--resource-group myResourceGroup \
--name myAKSCluster \
--addons monitoring \
--workspace-resource-id "$WORKSPACE_ID"
# Verify the agent DaemonSet is running (named ama-logs on current agent versions, omsagent on older ones)
kubectl get daemonset ama-logs -n kube-system
If you enable monitoring at cluster creation time, pass the workspace ID in the az aks create command:
az aks create \
--resource-group myResourceGroup \
--name myAKSCluster \
--node-count 3 \
--enable-addons monitoring \
--workspace-resource-id "$WORKSPACE_ID" \
--generate-ssh-keys
After enabling, give the agent 5–10 minutes to begin shipping data before expecting results in Log Analytics queries.
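One way to check that data has started to arrive is to query the workspace from the CLI. The sketch below assumes the `log-analytics` CLI extension is installed (it provides `az monitor log-analytics query`) and that the command returns a list of row objects keyed by column name. The helper names `workspace_guid` and `recent_node_rows` are hypothetical. Note that the query command takes the workspace's customer ID (a GUID), not the resource ID used above.

```shell
# Hypothetical helpers, not part of the Azure CLI.

# Print the workspace (customer) ID GUID that query commands expect.
# Arguments: resource group, workspace name.
workspace_guid() {
  az monitor log-analytics workspace show \
    --resource-group "$1" \
    --workspace-name "$2" \
    --query customerId -o tsv
}

# Print how many KubeNodeInventory rows arrived in the last 15 minutes.
# Argument: workspace GUID.
recent_node_rows() {
  az monitor log-analytics query \
    --workspace "$1" \
    --analytics-query "KubeNodeInventory | where TimeGenerated > ago(15m) | summarize Rows = count()" \
    --query "[0].Rows" -o tsv
}
```

For example, `recent_node_rows "$(workspace_guid myResourceGroup myAKSLogs)"` should print a non-zero count once the agent is shipping data.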
What Container Insights captures
The monitoring agent collects data at multiple levels. Understanding what is captured helps you know where to look when investigating issues.
| Data type | Log Analytics table | Frequency |
|---|---|---|
| Node CPU and memory | Perf | 60 seconds |
| Pod CPU and memory | Perf | 60 seconds |
| Container inventory (name, image, state) | ContainerInventory | 60 seconds |
| Container stdout/stderr logs | ContainerLog / ContainerLogV2 | Real-time |
| Kubernetes events | KubeEvents | On event |
| Pod state (running, pending, failed) | KubePodInventory | 60 seconds |
| Node state and conditions | KubeNodeInventory | 60 seconds |
Node-level metrics include total CPU/memory capacity, allocatable CPU/memory (after system overhead), and actual utilization. Pod-level metrics show requested vs actual CPU and memory usage per container.
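Because the node counters are stored as absolute values (nanocores and bytes), a utilization percentage has to be derived by dividing usage by capacity. The sketch below prints a KQL query that does this for CPU. The helper name `node_cpu_pct_query` is hypothetical, and the counter names `cpuUsageNanoCores` and `cpuCapacityNanoCores` are assumed to be present in your workspace's Perf table.

```shell
# Hypothetical helper: print a KQL query that reports average node CPU
# usage as a percentage of capacity over the given lookback (default 1h).
node_cpu_pct_query() {
  lookback="${1:-1h}"
  cat <<EOF
Perf
| where TimeGenerated > ago(${lookback})
| where ObjectName == "K8SNode"
| where CounterName in ("cpuUsageNanoCores", "cpuCapacityNanoCores")
| summarize Value = avg(CounterValue) by CounterName, Computer
| evaluate pivot(CounterName, any(Value))
| extend CpuPct = cpuUsageNanoCores / cpuCapacityNanoCores * 100
| project Computer, CpuPct
EOF
}
```

Running `node_cpu_pct_query 30m` prints a query you can paste into the Log Analytics query editor. To compare against allocatable rather than total capacity, the capacity counter could be swapped for `cpuAllocatableNanoCores`, assuming that counter is collected.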
Useful KQL queries
Log Analytics queries are written in KQL (Kusto Query Language). The queries below cover the most common operational scenarios.
Node CPU utilization over the last hour:
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SNode"
| where CounterName == "cpuUsageNanoCores"
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m), Computer
| render timechart
Pods in a failed or pending state:
KubePodInventory
| where TimeGenerated > ago(30m)
| summarize arg_max(TimeGenerated, *) by PodUid
| where PodStatus in ("Failed", "Pending")
| project TimeGenerated, Namespace, Name, PodStatus, ContainerStatusReason
| order by TimeGenerated desc
Node memory pressure (nodes using more than 80% of total memory capacity):
Perf
| where TimeGenerated > ago(15m)
| where ObjectName == "K8SNode"
| where CounterName in ("memoryRssBytes", "memoryCapacityBytes")
| summarize Value = avg(CounterValue) by CounterName, Computer
| evaluate pivot(CounterName, any(Value))
| extend MemoryPct = memoryRssBytes / memoryCapacityBytes * 100
| where MemoryPct > 80
| project Computer, MemoryPct
Container restarts in the last 24 hours, ordered by restart count:
KubePodInventory
| where TimeGenerated > ago(24h)
| where ContainerRestartCount > 0
| summarize arg_max(TimeGenerated, *) by ContainerID
| project Namespace, Name, ContainerName, ContainerRestartCount
| order by ContainerRestartCount desc
OOMKilled events:
KubeEvents
| where TimeGenerated > ago(24h)
| where Reason == "OOMKilling"
| project TimeGenerated, Namespace, Name, Message
| order by TimeGenerated desc
Image pull failures:
KubeEvents
| where TimeGenerated > ago(1h)
| where Reason in ("Failed", "BackOff")
| where Message contains "pull"
| project TimeGenerated, Namespace, Name, Message
Viewing metrics in the Azure Portal
The Azure Portal provides pre-built dashboards for Container Insights. Navigate to your AKS cluster in the portal, then select Insights under Monitoring in the left menu. The Insights view shows:
- Cluster tab — node count, CPU and memory utilization, active pod count over time.
- Nodes tab — per-node CPU, memory, disk, and network metrics with drill-down to pods running on that node.
- Controllers tab — per-Deployment and ReplicaSet metrics.
- Containers tab — per-container CPU and memory with live log access.
The Live Data feature streams real-time logs and events without needing to run kubectl. It is useful for quick debugging without opening a terminal.
Pin individual charts from the Container Insights dashboards to a shared Azure Dashboard for your team. This gives operations teams a consistent view without each person having to navigate to the cluster resource in the portal.
Setting alerts on node CPU and pod failures
Azure Monitor alerts can trigger on metric thresholds or on KQL query results. For AKS, metric alerts react faster (they can be evaluated every minute), while log alerts run a query on a schedule (every 5 minutes in the example below).
# Create a metric alert for node CPU > 80% for 5 minutes
az monitor metrics alert create \
--name "AKS Node CPU High" \
--resource-group myResourceGroup \
--scopes "/subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.ContainerService/managedClusters/myAKSCluster" \
--condition "avg node_cpu_usage_percentage > 80" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 2 \
--description "Node CPU above 80% for 5 minutes"
For pod failure alerts, create a scheduled log alert that queries KubePodInventory:
# Create a log alert rule for failed pods
az monitor scheduled-query create \
--resource-group myResourceGroup \
--name "AKS Failed Pods" \
--scopes "/subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/microsoft.operationalinsights/workspaces/myAKSLogs" \
--condition "count 'FailedPods' > 0" \
--condition-query FailedPods="KubePodInventory | where PodStatus == 'Failed'" \
--evaluation-frequency 5m \
--window-size 5m \
--severity 2 \
--description "One or more pods in Failed state"
Alerts can send notifications via Action Groups (email, SMS, webhook, Azure Function, or Logic App). Create an Action Group in the portal or with az monitor action-group create and attach it to alert rules.
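As a minimal sketch of that last step, the helpers below create an action group with a single email receiver and attach it to an existing metric alert rule. The function names and the `oncall` receiver name are hypothetical. The flags shown (`--action email NAME ADDRESS` and `--add-action`) are assumed to match your installed Azure CLI version, so check `az monitor action-group create --help` before relying on them.

```shell
# Hypothetical helpers wrapping az commands.

# Create an action group with one email receiver and print its resource ID.
# Arguments: resource group, group name, short name (12 characters or fewer), email.
create_email_action_group() {
  az monitor action-group create \
    --resource-group "$1" \
    --name "$2" \
    --short-name "$3" \
    --action email oncall "$4" \
    --query id -o tsv
}

# Attach an action group (by resource ID) to an existing metric alert rule.
# Arguments: resource group, alert rule name, action group resource ID.
attach_to_metric_alert() {
  az monitor metrics alert update \
    --resource-group "$1" \
    --name "$2" \
    --add-action "$3"
}
```

A typical sequence would be `AG_ID=$(create_email_action_group myResourceGroup aksOps aksops ops@example.com)` followed by `attach_to_metric_alert myResourceGroup "AKS Node CPU High" "$AG_ID"`, where the email address is a placeholder.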
Common mistakes
- Collecting logs from all namespaces including kube-system. The kube-system namespace generates a large volume of logs from system components. Including it in Container Insights collection significantly increases Log Analytics ingestion costs with limited operational value. Configure the agent’s ConfigMap to exclude kube-system and other system namespaces unless you specifically need to debug system components such as CoreDNS or kube-proxy.
- Setting alerts without testing them. Creating a CPU alert without verifying it fires under real load means you discover the alert is misconfigured during an actual incident. Test alert rules by temporarily lowering the threshold, confirming the alert triggers and the notification arrives, then restoring the correct threshold.
- Querying ContainerLog instead of ContainerLogV2. ContainerLogV2 is the newer schema that includes the pod name and namespace directly in the log record, making queries far simpler. ContainerLog requires joining with KubePodInventory to get context. New clusters use ContainerLogV2 by default; older clusters may need the agent ConfigMap updated to enable it.
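The first and third points both come down to the agent ConfigMap. The sketch below prints a ConfigMap that excludes kube-system from stdout/stderr collection and selects the ContainerLogV2 schema. The helper name `render_agent_config` is hypothetical. The ConfigMap name `container-azm-ms-agentconfig` and the setting keys are assumed to match Microsoft's published template, so compare against the current template before applying it.

```shell
# Hypothetical helper: print an agent ConfigMap that skips stdout/stderr
# logs from kube-system and selects the ContainerLogV2 schema.
# Key names are assumed to match the published container-azm-ms-agentconfig
# template; verify them against the current version of that template.
render_agent_config() {
  cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        exclude_namespaces = ["kube-system"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system"]
      [log_collection_settings.schema]
        containerlog_schema_version = "v2"
EOF
}
```

To apply it, pipe the output to kubectl: `render_agent_config | kubectl apply -f -`. The agent pods typically restart to pick up the new settings.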
Summary
- Container Insights is enabled as an AKS add-on and ships node metrics, pod metrics, container logs, and Kubernetes events to a Log Analytics workspace.
- Key tables for querying are Perf (CPU/memory), KubePodInventory (pod state), KubeEvents (cluster events), and ContainerLogV2 (container output).
- Metric alerts evaluate every 1 minute and are best for CPU/memory thresholds; scheduled log alerts evaluate on a query schedule and are best for state-based conditions like failed pods.
- Use ContainerLogV2 for log queries and configure collection exclusions for high-volume system namespaces to manage Log Analytics costs.
Frequently asked questions
What does Container Insights collect?
Container Insights collects node-level CPU and memory utilization, pod and container CPU and memory metrics, container stdout/stderr logs, Kubernetes events, and live data for pods and nodes. It ships this data to a Log Analytics workspace where you can query it with KQL.
Is Container Insights free?
No. Container Insights ingests data into a Log Analytics workspace, which charges based on data volume ingested and retention duration. The first 5 GB per billing account per month is free. Costs vary by cluster size and log verbosity. You can reduce costs by configuring collection settings to exclude high-volume namespaces like kube-system.
Can I use Prometheus and Grafana instead of Container Insights?
Yes. AKS supports a managed Prometheus offering (Azure Monitor managed service for Prometheus) that scrapes metrics from your cluster and stores them in an Azure Monitor workspace. You can then connect Azure Managed Grafana to visualize them. This is the recommended path for teams already using Prometheus-style metrics and PromQL queries.
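As a rough sketch, managed Prometheus can be enabled on an existing cluster with `az aks update`. The helper name below is hypothetical, and the flag names are assumed to match a recent Azure CLI version, so check `az aks update --help` first. The Azure Monitor workspace and Azure Managed Grafana instance are assumed to exist already.

```shell
# Hypothetical helper wrapping `az aks update`.
# Arguments: resource group, cluster name,
#            Azure Monitor workspace resource ID,
#            Azure Managed Grafana resource ID.
enable_managed_prometheus() {
  az aks update \
    --resource-group "$1" \
    --name "$2" \
    --enable-azure-monitor-metrics \
    --azure-monitor-workspace-resource-id "$3" \
    --grafana-resource-id "$4"
}
```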