Networking for Cloud Engineers: What You Actually Need to Know
Networking is the skill that catches beginners off guard most often. You can deploy infrastructure confidently and still lose hours to a connectivity problem you do not have the vocabulary to diagnose. This page covers the networking concepts that come up most frequently in cloud work — focused on practical understanding, not exam theory.
Why networking problems are so common in cloud
Cloud infrastructure is networked by default. Every service communicates over a network. Every permission to communicate has to be explicitly granted through firewall rules, security groups, or network policies. When something does not work, the cause is very often networking.
The pattern repeats constantly: service A cannot reach service B. Without networking knowledge, the debugging process is guesswork. With it, you can work through a systematic checklist and find the problem in minutes.
This page gives you that checklist — but it is useful only if you understand the concepts behind it. Start there.
VPCs and subnets in practice
A Virtual Private Cloud (VPC) is your isolated network in the cloud. It has an IP address range (a CIDR block), and everything you deploy into it gets an IP address from that range. You control what can communicate with what.
A subnet is a subdivision of the VPC IP range, tied to a specific availability zone. Resources are deployed into subnets, not directly into VPCs.
The key design decision: public versus private subnets.
- A public subnet has a route to an internet gateway. Resources in it that have a public IP address can send traffic to and receive traffic from the internet (if security groups allow).
- A private subnet has no route to an internet gateway. Resources in it cannot be reached from the internet directly, and cannot initiate outbound internet connections without a NAT gateway.
The standard architecture for a production application: load balancer in a public subnet, application servers and databases in private subnets. The load balancer handles internet traffic and forwards it to the private resources.
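What makes a subnet public or private is its route table, not a label on the subnet. The sketch below illustrates that idea with a simplified route table written as a list of (destination, target) pairs. The `igw-` and `nat-` target prefixes follow AWS naming, and the IDs and subnet names are hypothetical.

```python
def classify_subnet(routes):
    """Classify a subnet from its route table.

    routes: list of (destination_cidr, target) tuples, a simplified
    stand-in for a real route table. Target IDs are hypothetical.
    """
    for destination, target in routes:
        if destination == "0.0.0.0/0":
            if target.startswith("igw-"):
                return "public"
            if target.startswith("nat-"):
                return "private (outbound via NAT)"
    # No default route at all: nothing in the subnet can reach the internet.
    return "private (isolated)"


public_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "igw-example")]
private_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-example")]
isolated_routes = [("10.0.0.0/16", "local")]

for name, routes in [("web", public_routes), ("app", private_routes), ("db", isolated_routes)]:
    print(name, "->", classify_subnet(routes))
```

In the standard architecture described above, the load balancer's subnet would classify as public and the application and database subnets as one of the two private variants.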
CIDR notation: reading and planning IP ranges
CIDR (Classless Inter-Domain Routing) notation is how IP address ranges are written. You see it everywhere in cloud work: 10.0.0.0/16, 172.16.0.0/12, 192.168.1.0/24.
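The number after the slash (the “prefix length”) tells you how many bits of the address are fixed. The remaining bits are available for hosts, so an IPv4 block with prefix length n contains 2^(32 - n) addresses. The usable-host column below subtracts the network and broadcast addresses; cloud providers reserve a few more per subnet (AWS reserves five, which leaves 11 usable addresses in a /28).
| CIDR | Fixed bits | Available addresses | Usable hosts (approx) |
|---|---|---|---|
| /8 | 8 | 16,777,216 | ~16.7 million |
| /16 | 16 | 65,536 | ~65,500 |
| /24 | 24 | 256 | 254 |
| /28 | 28 | 16 | 14 |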
For a VPC, a /16 is common — it gives you 65,536 addresses to allocate across subnets. For subnets, /24 is a reasonable size (256 addresses per subnet). Smaller subnets like /28 are used for specific things like VPN gateway subnets that only need a few addresses.
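The arithmetic can be checked with Python's standard `ipaddress` module. This is a small sketch: the 10.0.0.0 ranges are example values, and the counts are raw block sizes before any provider-reserved addresses are subtracted.

```python
import ipaddress

# Total addresses in an IPv4 block = 2 ** (32 - prefix_length).
for cidr in ["10.0.0.0/8", "10.0.0.0/16", "10.0.0.0/24", "10.0.0.0/28"]:
    network = ipaddress.ip_network(cidr)
    assert network.num_addresses == 2 ** (32 - network.prefixlen)
    print(f"{cidr:<14} {network.num_addresses:>10} addresses")

# Carve a /16 VPC range into /24 subnets and look at the first three.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets), "possible /24 subnets")
print([str(subnet) for subnet in subnets[:3]])
```

A /16 yields 256 non-overlapping /24 subnets, which is why it leaves comfortable room to add subnets per availability zone and per tier later.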
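Mistake to avoid: Making VPC CIDR blocks too small. If your VPC is a /24 (256 addresses), you will run out of IP space when the team grows and more services are deployed. Start with a /16 for a VPC (on AWS this is the largest block a VPC can have). A larger CIDR block costs nothing by itself; just make sure it does not overlap with other networks you may later need to peer with.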
Security groups and firewall rules
Security groups (AWS), firewall rules (GCP), and network security groups (Azure) control which network traffic is allowed to reach your resources. They are stateful: if you allow inbound traffic on a port, the return traffic is automatically allowed.
Every rule specifies: direction (inbound/outbound), protocol (TCP, UDP, ICMP), port range, and source or destination (an IP range or another security group).
Common patterns:
# Allow HTTPS from anywhere (public-facing load balancer)
Inbound: TCP 443 from 0.0.0.0/0
# Allow application traffic only from the load balancer security group
Inbound: TCP 3000 from sg-loadbalancer
# Allow database access only from the application servers
Inbound: TCP 5432 from sg-app-servers
# Deny everything else (default behaviour: no rule = deny)
Using security group references (pointing a rule at another security group rather than an IP range) is better practice than using IP ranges for internal traffic. It is more maintainable: if an application server’s IP changes, you do not need to update the database security group.
Mistake to avoid: Opening security group rules with 0.0.0.0/0 for all internal traffic because you “just want it to work”. This is effectively turning off the security group. Always try to narrow rules to the minimum required access.
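To make the evaluation logic concrete, here is a minimal Python model of inbound rule matching with default deny. It is a sketch, not how any provider implements it: the `Rule` class and `is_allowed` function are hypothetical, rules match a single port rather than a range, and group names reuse the `sg-` examples from the patterns above.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    protocol: str
    port: int
    source: str  # either a CIDR block or a security group name


def is_allowed(rules, protocol, port, source_ip, source_groups):
    """Return True if any inbound rule matches; no match means deny."""
    for rule in rules:
        if rule.protocol != protocol or rule.port != port:
            continue
        if rule.source.startswith("sg-"):
            # Group reference: match on group membership, not on IP address.
            if rule.source in source_groups:
                return True
        elif ipaddress.ip_address(source_ip) in ipaddress.ip_network(rule.source):
            return True
    return False


database_rules = [Rule("tcp", 5432, "sg-app-servers")]

# An application server is allowed whatever its current IP happens to be.
print(is_allowed(database_rules, "tcp", 5432, "10.0.2.17", {"sg-app-servers"}))
# The load balancer is not a member of sg-app-servers, so it is denied.
print(is_allowed(database_rules, "tcp", 5432, "10.0.1.5", {"sg-loadbalancer"}))
```

Because the database rule names a group, replacing an application server with one on a different IP requires no rule change, which is the maintainability benefit described above.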
DNS: how name resolution works and how to debug it
DNS translates human-readable domain names (api.myapp.com) into IP addresses that services can connect to. When DNS is wrong, connections fail with confusing errors — “connection refused”, “host not found”, or timeouts.
The record types you encounter most in cloud work:
| Record type | What it does | Example |
|---|---|---|
| A | Maps a name to an IPv4 address | api.myapp.com → 10.0.1.50 |
| CNAME | Alias from one name to another | www.myapp.com → myapp.com |
| AAAA | Maps a name to an IPv6 address | api.myapp.com → 2001:db8::1 |
| MX | Mail server for a domain | myapp.com → mail.myapp.com (rarely needed for cloud services) |
| TXT | Arbitrary text (used for verification) | myapp.com → "v=spf1..." |
Debugging DNS problems from the command line:
# Look up an A record
dig api.myapp.com
# Look up a specific record type
dig CNAME www.myapp.com
# Use a specific DNS server
dig api.myapp.com @8.8.8.8
# Check what DNS server your system is using
cat /etc/resolv.conf
# Simple lookup (less detail than dig)
nslookup api.myapp.com
Common DNS scenarios in cloud work:
- A new service is deployed but requests still go to the old endpoint: the old record is still cached because its TTL has not expired.
- A service inside a VPC cannot reach an internal endpoint: the VPC’s private DNS resolver needs the correct configuration.
- A certificate validation fails: the DNS record for the ACME challenge has not propagated yet.
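The caching scenario is easier to reason about with a toy model. The sketch below is a simplified resolver cache, not a real DNS client: `TtlCache`, the record data, and the fake clock are all hypothetical and exist only to show why an updated record stays invisible until the TTL runs out.

```python
import time


class TtlCache:
    """Tiny resolver cache: answers are reused until their TTL expires."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup      # function: name -> (address, ttl_seconds)
        self._clock = clock
        self._entries = {}         # name -> (address, expires_at)

    def resolve(self, name):
        entry = self._entries.get(name)
        now = self._clock()
        if entry and now < entry[1]:
            return entry[0]        # still cached, the authority is not asked
        address, ttl = self._lookup(name)
        self._entries[name] = (address, now + ttl)
        return address


# Hypothetical authoritative data and a fake clock keep the example deterministic.
records = {"api.myapp.com": ("10.0.1.50", 300)}
current_time = [0.0]
cache = TtlCache(lambda name: records[name], clock=lambda: current_time[0])

first = cache.resolve("api.myapp.com")
records["api.myapp.com"] = ("10.0.1.99", 300)   # the record is updated
current_time[0] = 120.0
stale = cache.resolve("api.myapp.com")          # TTL not expired: old answer
current_time[0] = 301.0
fresh = cache.resolve("api.myapp.com")          # TTL expired: new answer
print(first, stale, fresh)
```

This is also why lowering a record's TTL some time before a planned cutover shortens the window in which clients keep using the old address.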
Load balancers: L4 versus L7
Load balancers distribute traffic across multiple instances of your application. Understanding the difference between Layer 4 and Layer 7 load balancers tells you when to use which.
Layer 4 (transport layer) load balancers work at the TCP/UDP level. They forward packets based on IP address and port without inspecting the content. Fast, low overhead, but cannot make routing decisions based on URLs, headers, or hostnames.
Layer 7 (application layer) load balancers work at the HTTP level. They can read the request — the URL path, host header, cookies — and route based on that. You can send /api/* to one backend and /static/* to another. They terminate TLS, handle retries, and provide more useful access logs.
In cloud environments: use an L7 load balancer (Application Load Balancer in AWS and GCP, Application Gateway in Azure) for web applications and APIs. Use an L4 load balancer (NLB in AWS) for protocols that are not HTTP: databases, game servers, anything that needs to preserve the source IP address.
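The difference in what each layer can see is illustrated below. This is a conceptual sketch only: the pool names are hypothetical, real L4 balancers use their own flow-hashing schemes, and real L7 balancers support far richer rules than a path prefix.

```python
import hashlib

# Hypothetical backend pools.
API_POOL = ["api-1", "api-2"]
STATIC_POOL = ["static-1"]
ALL_BACKENDS = API_POOL + STATIC_POOL


def l4_pick(source_ip, source_port):
    """L4: choose a backend from connection metadata only."""
    key = f"{source_ip}:{source_port}".encode()
    index = int(hashlib.sha256(key).hexdigest(), 16) % len(ALL_BACKENDS)
    return ALL_BACKENDS[index]


def l7_pick(path):
    """L7: choose a pool by inspecting the HTTP request path."""
    if path.startswith("/api/"):
        return API_POOL
    if path.startswith("/static/"):
        return STATIC_POOL
    return API_POOL  # assumed default pool for unmatched paths


print(l4_pick("203.0.113.7", 50000))   # same connection tuple, same backend
print(l7_pick("/api/users"))
print(l7_pick("/static/logo.png"))
```

The L4 function never receives the path, so it has no way to send `/static/*` anywhere special; that information only exists once the HTTP request has been parsed.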
NAT gateways and private endpoint access
Resources in private subnets need a way to initiate outbound connections (to download packages, call external APIs, reach cloud services) without being publicly accessible themselves. A NAT gateway handles this: it translates the private IP to a public IP for outbound traffic, but does not allow inbound connections.
Private endpoints (VPC endpoints in AWS, Private Service Connect in GCP) are for accessing cloud service APIs without traffic crossing the public internet. Without a private endpoint, a Lambda function in a private subnet calling the S3 API sends its traffic out through the NAT gateway and then the internet gateway. With a private endpoint, the traffic stays on the provider’s private network.
Trade-off: NAT gateways cost money (per hour and per GB of traffic). For high-throughput workloads calling cloud APIs heavily, private endpoints can be cheaper. For low-throughput workloads, NAT gateways are simpler to set up.
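A rough way to reason about this trade-off is a break-even calculation. The sketch below assumes you keep the NAT gateway for other outbound traffic and are deciding whether to add an endpoint for one service. The function name and every price in it are placeholders for illustration, not real provider pricing.

```python
def break_even_gb(endpoint_hourly, endpoint_per_gb, nat_per_gb, hours=730):
    """Monthly GB above which sending a service's traffic through an
    endpoint costs less than sending it through an existing NAT gateway."""
    saving_per_gb = nat_per_gb - endpoint_per_gb
    if saving_per_gb <= 0:
        return float("inf")  # the endpoint never pays for itself
    return endpoint_hourly * hours / saving_per_gb


# Placeholder prices: look up current pricing for your provider and region.
threshold = break_even_gb(endpoint_hourly=0.01, endpoint_per_gb=0.01, nat_per_gb=0.05)
print(f"Endpoint is cheaper above roughly {threshold:.0f} GB per month")
```

Below the threshold the endpoint's fixed hourly charge outweighs the per-GB saving, which matches the point above that low-throughput workloads are often fine on the NAT gateway alone.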
The debugging checklist: “why can’t service A reach service B?”
When connectivity between two services fails, work through this checklist:
1. Is the target service actually running and listening on the expected port? Check with `ss -tlnp` on the target host, or check the service health status in the cloud console.
2. Are the services in the same VPC, or peered VPCs? Resources in different VPCs cannot communicate without explicit VPC peering, Transit Gateway, or Private Link.
3. Are the security groups / firewall rules correct? Check that an inbound rule on the target allows traffic from the source’s IP or security group on the right port.
4. Are the subnet route tables correct? For traffic between subnets or VPCs, the route table must have an entry for the destination CIDR.
5. Is DNS resolving correctly? Run `dig hostname` from the source. If the name does not resolve, the problem is DNS, not connectivity.
6. Is there a network ACL blocking traffic? Network ACLs (AWS) are stateless and evaluated before security groups, so they can block traffic even when the security group allows it.
Most problems are in steps 3 or 4. Security groups and route tables account for the majority of cloud connectivity failures.
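The way a connection attempt fails often points to the right step in the checklist. The following sketch uses only Python's standard `socket` module; the `probe` function and its result labels are hypothetical, and the interpretations in the comments are typical causes rather than certainties.

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        return "dns_failure"        # name did not resolve: check DNS
    except ConnectionRefusedError:
        return "refused"            # usually: host reached, nothing listening on the port
    except socket.timeout:
        return "timeout"            # usually: packets dropped by firewall rules, routes, or ACLs
    except OSError as error:
        return f"error: {error}"    # e.g. network unreachable


# Loopback is used here so the example runs anywhere; substitute the
# real target host and port when debugging from the source machine.
print(probe("127.0.0.1", 5432))
```

As a rule of thumb, a refusal suggests checking whether the service is listening, while a timeout suggests security groups, route tables, or network ACLs, because dropped packets produce no reply at all.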
Summary
- Public subnets route to the internet; private subnets do not — most production application tiers belong in private subnets
- CIDR blocks define IP ranges; start VPCs at /16 rather than a small block to avoid running out of addresses
- Security groups should reference other security groups for internal traffic, not raw IP ranges
- L7 load balancers understand HTTP and can route by URL path; L4 load balancers route by IP and port
- When connectivity fails, check security groups, route tables, and DNS in that order