GCP Security Best Practices for Production: Practical Checklist
This page is a hands-on checklist for securing production workloads on Google Cloud. It covers Cloud Run, GKE, Compute Engine, and Cloud SQL. Work through each section to verify your IAM, networking, secrets, logging, encryption, containers, org-level guardrails, and recovery readiness. Start with IAM and networking. They reduce blast radius more than anything else.
Simple explanation
Think of a production GCP project like a building. You would not give every tenant a master key, leave the fire exits propped open, and skip the smoke detectors. Production security works the same way: limit who can access what, close the doors that should be closed, and make sure you will know quickly when something goes wrong.
Every control on this page does one of three things:
- Reduces blast radius. Limiting what each identity can do means a compromised service account cannot touch resources it does not need.
- Reduces exposure. Keeping resources off the public internet and encrypting data at rest and in transit removes entire categories of attack.
- Reduces recovery time. Logging, alerting, and tested backups mean you detect problems faster and recover without guessing.
This checklist is not about perfection. It is about covering the controls that matter most and doing them well.
When to use this checklist
- Before go-live. Run through every section before your first production deployment.
- During a production hardening review. Use it as a structured audit of an existing system.
- During compliance or audit prep. Map each section to your compliance framework’s controls.
- After rapid growth or architecture changes. New services, new team members, and new integrations create drift.
- After inheriting an existing GCP environment. Baseline what is already in place before making changes.
How production security works in GCP
GCP production security is layered. No single control is enough on its own. The layers reinforce each other:
- Identity and access (IAM). Who and what can do things in your project.
- Network exposure. What is reachable from the internet and what is private.
- Secrets and credentials. How passwords, keys, and tokens are stored and delivered.
- Logging and detection. What gets recorded and what triggers an alert.
- Data protection and encryption. How data is protected at rest and in transit.
- Workload and image hardening. How container images and runtimes are secured.
- Organization-level guardrails. Preventive policies that apply across projects.
- Backup and incident readiness. How you recover and investigate when something goes wrong.
The checklist below follows this order. Each section includes what to verify, why it matters, and practical commands where relevant.
IAM and human access
Most production security incidents are made worse by excessive permissions. An attacker who compromises a service account with Project Editor access can modify almost anything. One who compromises an account scoped to a single Cloud Storage bucket is largely contained.
IAM is like a keycard system. The default Compute Engine service account is a master keycard that opens every door in the building. A dedicated service account with a scoped role is a keycard that only opens the one room that workload actually needs. If someone steals it, they get into one room, not the whole building.
Checklist:
- Every workload runs as a dedicated service account, not the default Compute Engine service account.
- No service account or human user has Owner, Editor, or Viewer (basic roles) at the project level. Use predefined roles scoped to specific services.
- Human and workload identities are separate. Humans use their Google Workspace or Cloud Identity accounts. Workloads use service accounts.
- Humans with production access use phishing-resistant MFA (security keys or passkeys).
- Temporary elevated access for humans uses IAM Conditions with expiring bindings or Privileged Access Manager instead of permanent elevated roles.
- IAM policy is reviewed monthly. Bindings for former team members or decommissioned services are removed.
For a detailed guide on scoping permissions correctly, see Principle of Least Privilege.
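Two items in the checklist above, finding basic roles and granting access that expires, can be sketched with gcloud. This is a minimal sketch, not a definitive audit: the function names, the example member, and the binding title are hypothetical, and the list filter expression is worth verifying against `gcloud topic filters` for your gcloud version.

```shell
# List every project-level binding that grants a basic role.
# Usage: list_basic_role_bindings my-app-prod
list_basic_role_bindings() {
  project="${1:?usage: list_basic_role_bindings PROJECT_ID}"
  gcloud projects get-iam-policy "$project" \
    --flatten="bindings[].members" \
    --filter="bindings.role=(roles/owner roles/editor roles/viewer)" \
    --format="table(bindings.role, bindings.members)"
}

# Grant a predefined role that stops working after a timestamp by attaching
# an IAM Condition. Conditions cannot be attached to basic roles.
# Usage: grant_temporary_role my-app-prod user:alice@example.com \
#          roles/cloudsql.admin 2030-01-01T00:00:00Z
grant_temporary_role() {
  project="${1:?project id required}"
  member="${2:?member required}"
  role="${3:?role required}"
  expiry="${4:?RFC 3339 expiry timestamp required}"
  gcloud projects add-iam-policy-binding "$project" \
    --member="$member" \
    --role="$role" \
    --condition="expression=request.time < timestamp(\"$expiry\"),title=temporary-access,description=Expires automatically"
}
```

An expired conditional binding stops granting access but stays in the policy, so the monthly review should still remove it.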
# Create a dedicated service account for a specific workload
gcloud iam service-accounts create api-server-sa \
--display-name="API Server Service Account" \
--project=my-app-prod
# Grant a specific predefined role, not Editor
gcloud projects add-iam-policy-binding my-app-prod \
--member="serviceAccount:api-server-sa@my-app-prod.iam.gserviceaccount.com" \
--role="roles/datastore.user"
# Deploy Cloud Run with this dedicated service account
gcloud run deploy api-server \
--service-account=api-server-sa@my-app-prod.iam.gserviceaccount.com \
--region=us-central1 \
--project=my-app-prod
Service accounts and workload identity
Service accounts are the identity layer for your workloads. Misconfigured service accounts are behind a large share of GCP security incidents.
Checklist:
- Every Cloud Run service, GKE pod, and Compute Engine VM uses a dedicated service account. Never the default.
- Service-to-service calls use IAM identity tokens (OIDC), not API keys or shared secrets.
- No service account has user-managed keys in production. Use Workload Identity Federation for external systems (GitHub Actions, AWS, on-prem) and Workload Identity for GKE for Kubernetes pods.
- If a service account key exists, there is a documented reason and a rotation schedule. Understand why service account keys are dangerous before creating one.
- Service account permissions are scoped to specific resources, not entire projects.
A service account key is a long-lived credential. If it leaks through a commit, a log, or a compromised CI runner, an attacker can impersonate that service account from anywhere in the world. Workload Identity Federation and service account impersonation eliminate keys entirely by letting workloads prove their identity without a stored secret.
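As an illustration of the keyless approach described above, here is a sketch of Workload Identity Federation for GitHub Actions. The pool name, provider name, repository, and service account are hypothetical placeholders. The attribute mapping and condition shown are one common choice, not the only valid one.

```shell
# Let workflows from one GitHub repository impersonate a service account
# without any stored key. All resource names are hypothetical.
# Usage: setup_github_wif my-app-prod 123456789012 my-org/my-repo \
#          deployer-sa@my-app-prod.iam.gserviceaccount.com
setup_github_wif() {
  project="${1:?project id required}"
  project_number="${2:?project number required}"
  repo="${3:?GitHub owner/repo required}"
  sa_email="${4:?service account email required}"
  pool="github-pool"
  provider="github-provider"

  # 1. A pool groups the external identities.
  gcloud iam workload-identity-pools create "$pool" \
    --location=global \
    --display-name="GitHub Actions" \
    --project="$project" || return 1

  # 2. An OIDC provider trusts GitHub's token issuer, limited to one repo.
  gcloud iam workload-identity-pools providers create-oidc "$provider" \
    --location=global \
    --workload-identity-pool="$pool" \
    --issuer-uri="https://token.actions.githubusercontent.com" \
    --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
    --attribute-condition="assertion.repository=='$repo'" \
    --project="$project" || return 1

  # 3. Only identities from that repository may impersonate the account.
  gcloud iam service-accounts add-iam-policy-binding "$sa_email" \
    --role="roles/iam.workloadIdentityUser" \
    --member="principalSet://iam.googleapis.com/projects/$project_number/locations/global/workloadIdentityPools/$pool/attribute.repository/$repo" \
    --project="$project"
}
```

The attribute condition matters: without it, any repository on GitHub could present a token the provider accepts.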
# List user-managed keys for one service account
# (repeat per service account; any result is a risk worth reviewing)
gcloud iam service-accounts keys list \
--iam-account=api-server-sa@my-app-prod.iam.gserviceaccount.com \
--managed-by=user \
--project=my-app-prod
For a full explanation of key types and risks, see Service Account Keys Explained.
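The checklist item on service-to-service calls can be illustrated with the metadata server, which issues OIDC identity tokens to workloads on Cloud Run, GKE, and Compute Engine. This is a sketch under assumptions: the function names and the receiving service URL are hypothetical, and the calling service account is assumed to hold roles/run.invoker on the receiving service.

```shell
# Fetch an OIDC identity token for the workload's own service account.
# The audience is the URL of the service being called.
# Usage: fetch_identity_token https://receiver.example.com
fetch_identity_token() {
  audience="${1:?audience (receiving service URL) required}"
  curl -sS -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=${audience}"
}

# Call another service, presenting the token as a bearer credential.
# Usage: call_service https://receiver.example.com
call_service() {
  url="${1:?service URL required}"
  token="$(fetch_identity_token "$url")" || return 1
  curl -sS -H "Authorization: Bearer ${token}" "$url"
}
```

No secret is stored anywhere in this flow. The token is short-lived and tied to the identity of the running workload.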
Network exposure and ingress
Every resource with a public IP is an attack surface. The default network in GCP comes with permissive firewall rules that are not suitable for production.
GCP creates a default VPC in every new project with firewall rules that allow SSH and RDP from the entire internet. This is useful for learning but dangerous in production. Delete the default network and create a custom-mode VPC with explicit, restrictive rules. Better yet, use an Organization Policy to skip default network creation entirely.
Checklist:
- Production uses a custom-mode VPC, not the default network. Delete the default network in production projects.
- VMs and databases use private IP addresses only. Cloud NAT provides outbound internet access where needed.
- Cloud SQL instances have private IP only. No public IP.
- Cloud Run services use --ingress=internal-and-cloud-load-balancing, so the *.run.app URL is not directly accessible from the internet.
- No firewall rule has 0.0.0.0/0 as a source range. The only exceptions are rules for public-facing load balancer frontends. Health check rules use Google's documented health check ranges, not 0.0.0.0/0.
- Cloud Run and Cloud Functions use Serverless VPC Access or Direct VPC Egress to reach private resources.
- Public-facing entry points are protected by Cloud Armor for WAF and DDoS mitigation.
- For highly sensitive workloads, VPC Service Controls are configured to prevent data exfiltration from GCP APIs.
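To check the firewall item in the checklist above, one option is to list every rule and filter locally. This is a sketch, not an official audit tool: the function name is hypothetical, and it assumes gcloud's `value` format emits tab-separated columns with list values joined by commas or semicolons.

```shell
# Print the names of ingress firewall rules whose source ranges include
# exactly 0.0.0.0/0. Filtering happens locally in awk.
# Usage: list_open_ingress_rules my-app-prod
list_open_ingress_rules() {
  project="${1:?project id required}"
  gcloud compute firewall-rules list \
    --project="$project" \
    --format="value(name,direction,sourceRanges.list())" |
    awk -F '\t' '
      $2 == "INGRESS" {
        n = split($3, ranges, /[,;]/)
        for (i = 1; i <= n; i++) {
          if (ranges[i] == "0.0.0.0/0") { print $1; break }
        }
      }'
}
```

Any rule this prints should either front a deliberately public entry point or be tightened.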
# Restrict Cloud Run ingress to load balancer traffic only
gcloud run services update api-server \
--ingress=internal-and-cloud-load-balancing \
--region=us-central1 \
--project=my-app-prod
# Create a restrictive firewall rule: allow only LB health checks
gcloud compute firewall-rules create allow-lb-health-check \
--network=my-app-vpc \
--direction=INGRESS \
--action=ALLOW \
--rules=tcp:8080 \
--source-ranges=130.211.0.0/22,35.191.0.0/16 \
--target-tags=api-server \
--project=my-app-prod
For a full guide on building a production VPC, see VPC Networks Explained and Network Security Best Practices.
Secrets and credentials
Hardcoded credentials in source code are among the most common causes of production breaches. When a secret is committed to a repository, even a private one, it becomes part of the commit history and is accessible to anyone with repo access, forever.
Deleting a secret from the latest commit does not remove it. It still exists in the Git history. If a password or API key has ever been committed, rotate it immediately. Then use Secret Manager going forward.
Checklist:
- No database passwords, API keys, OAuth secrets, or private keys appear in source code, Dockerfiles, Kubernetes YAML, or Terraform files.
- All credentials are stored in Secret Manager.
- Cloud Run services mount secrets from Secret Manager using --update-secrets, not plain-text environment variables.
- Service accounts have secretmanager.secretAccessor bound to specific secrets, not the entire Secret Manager service.
- Secrets are rotated on a defined schedule. Old versions are disabled after rotation.
- CI/CD pipelines retrieve secrets at build or deploy time. Secrets are never baked into images.
# Store a database password in Secret Manager
echo -n "my-secure-database-password" | \
gcloud secrets create db-password \
--data-file=- \
--project=my-app-prod
# Grant access to a specific secret only
gcloud secrets add-iam-policy-binding db-password \
--member="serviceAccount:api-server-sa@my-app-prod.iam.gserviceaccount.com" \
--role="roles/secretmanager.secretAccessor" \
--project=my-app-prod
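# Expose the secret to Cloud Run as DB_PASSWORD at deploy time
# (resolved from Secret Manager, not stored as plain text in the service config)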
gcloud run deploy api-server \
--update-secrets=DB_PASSWORD=db-password:latest \
--region=us-central1 \
--project=my-app-prod
For CI/CD-specific patterns, see Secrets in CI/CD Pipelines.
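The rotation item in the checklist above can be sketched as one small function. The function name is hypothetical, and the sketch assumes consumers reference the secret as `latest` and have picked up the new version before the old one is disabled.

```shell
# Rotate a secret: add a new version, then disable the previous one.
# The new value is read from stdin so it never appears in shell history.
# Usage: printf '%s' "$NEW_PASSWORD" | rotate_secret db-password my-app-prod
rotate_secret() {
  secret="${1:?secret name required}"
  project="${2:?project id required}"

  # Remember the newest enabled version before adding the replacement.
  old_version="$(gcloud secrets versions list "$secret" \
    --project="$project" \
    --filter="state=ENABLED" \
    --sort-by="~createTime" \
    --limit=1 \
    --format="value(name.basename())" </dev/null)"

  gcloud secrets versions add "$secret" \
    --data-file=- \
    --project="$project" || return 1

  # Disable rather than destroy, so a rollback stays possible.
  if [ -n "$old_version" ]; then
    gcloud secrets versions disable "$old_version" \
      --secret="$secret" \
      --project="$project"
  fi
}
```

In practice, leave a gap between adding the new version and disabling the old one so running instances have time to restart.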
Logging, audit, and detection
Cloud Audit Logs record API activity in your GCP projects. Admin Activity logs are always on and record configuration changes. Data Access logs record reads and writes to data and, for most services, must be explicitly enabled.
Audit logs are security cameras for your GCP project. Admin Activity logs are always recording. But Data Access logs, the cameras that show who opened the filing cabinet and read the documents, are turned off by default. If you only enable them after an incident, you are reviewing footage that does not exist. Turn them on before you need them.
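A quick way to see whether any Data Access audit configuration exists on a project is to inspect the `auditConfigs` section of its IAM policy. This is a rough sketch: the function name is hypothetical, it assumes `value(auditConfigs)` prints nothing when the section is absent, and it does not see configurations inherited from a folder or organization.

```shell
# Report whether the project's IAM policy contains any auditConfigs.
# Returns 0 if present, 1 if absent, 2 if the policy could not be read.
# Usage: check_data_access_logging my-app-prod
check_data_access_logging() {
  project="${1:?project id required}"
  configs="$(gcloud projects get-iam-policy "$project" \
    --format="value(auditConfigs)")" || return 2
  if [ -n "$configs" ]; then
    echo "auditConfigs present on $project"
  else
    echo "no auditConfigs on $project: Data Access logs are likely off" >&2
    return 1
  fi
}
```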
Checklist:
- Data Access audit logs are enabled for all services handling sensitive or regulated data: Cloud SQL, Cloud Storage, Secret Manager, Firestore, BigQuery.
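- Audit logs are exported to Cloud Storage or BigQuery when compliance requires longer retention than the defaults: 400 days for Admin Activity logs (the _Required bucket) and 30 days for Data Access logs (the _Default bucket, unless you change its retention).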
- VPC Flow Logs are enabled on production subnets to record network traffic metadata for forensics and anomaly detection.
- Alerts are configured for dangerous changes: IAM bindings that add allUsers or allAuthenticatedUsers, deletion of production resources, changes to firewall rules, and disabling of logging.
- Log access is controlled by IAM. Not everyone on the team should be able to view audit logs.
What to watch for:
- New IAM bindings granting basic roles (Owner/Editor/Viewer)
- Service account key creation
- Firewall rule changes that open 0.0.0.0/0
- Deletion of Cloud Storage buckets, Cloud SQL instances, or secrets
- API calls from unexpected IP ranges or at unusual times
Understand the difference between log types in GCP and how to use them for detecting suspicious activity.
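Several items on the watch list above can be turned into log-based metrics, which an alerting policy can then monitor. This is a sketch: the metric names are hypothetical, and the third filter's field path is an assumption that fits project-level IAM changes but may not match every service, so test each filter in Logs Explorer first.

```shell
# Create log-based metrics for three of the events worth watching.
# Usage: create_watch_metrics my-app-prod
create_watch_metrics() {
  project="${1:?project id required}"

  # A user-managed service account key was created.
  gcloud logging metrics create sa-key-created \
    --project="$project" \
    --description="User-managed service account key created" \
    --log-filter='protoPayload.methodName="google.iam.admin.v1.CreateServiceAccountKey"' || return 1

  # A VPC firewall rule was inserted, patched, or deleted.
  gcloud logging metrics create firewall-rule-changed \
    --project="$project" \
    --description="VPC firewall rule inserted, patched, or deleted" \
    --log-filter='resource.type="gce_firewall_rule" AND (protoPayload.methodName:"compute.firewalls.insert" OR protoPayload.methodName:"compute.firewalls.patch" OR protoPayload.methodName:"compute.firewalls.delete")' || return 1

  # An IAM binding was added for a public principal (assumed field path).
  gcloud logging metrics create public-iam-binding \
    --project="$project" \
    --description="IAM binding added for allUsers or allAuthenticatedUsers" \
    --log-filter='protoPayload.methodName:"SetIamPolicy" AND protoPayload.serviceData.policyDelta.bindingDeltas.action="ADD" AND protoPayload.serviceData.policyDelta.bindingDeltas.member:("allUsers" OR "allAuthenticatedUsers")'
}
```

A metric alone notifies nobody. Each one still needs a Cloud Monitoring alerting policy with a notification channel.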
# Export audit logs to Cloud Storage for long-term retention
gcloud logging sinks create audit-log-archive \
storage.googleapis.com/my-app-audit-logs \
--log-filter='logName:"cloudaudit.googleapis.com"' \
--project=my-app-prod
# Enable VPC Flow Logs on a production subnet
gcloud compute networks subnets update my-app-subnet \
--enable-flow-logs \
--region=us-central1 \
--project=my-app-prod
Data protection and encryption
Google Cloud encrypts all data at rest and in transit between its services by default. You do not need to configure anything for baseline encryption. The question is whether you need more control.
Checklist:
- Understand that Google-managed encryption (the default) covers most workloads. Do not overcomplicate this unless you have a specific requirement.
- Use Customer-Managed Encryption Keys (CMEK) when your compliance framework requires you to control the key lifecycle, such as the ability to revoke access by destroying the key.
- CMEK is relevant for Cloud SQL, Cloud Storage, BigQuery, Compute Engine disks, and Artifact Registry. Enable it per service, not globally.
- Know the difference: Secret Manager stores secret values you retrieve at runtime. Cloud KMS manages encryption keys used to encrypt and decrypt data. They solve different problems.
- TLS is enforced between clients and Google Cloud services. Ensure your own application endpoints also enforce HTTPS. Never serve production traffic over plain HTTP.
If you are not subject to specific compliance requirements (PCI-DSS, HIPAA, FedRAMP), Google-managed encryption keys are sufficient. Adding CMEK increases operational complexity: you must manage key rotation, prevent accidental key destruction, and handle key access IAM. Start with the defaults and add CMEK only when a compliance requirement demands it.
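If a compliance requirement does call for CMEK, the key setup might look like the sketch below. The key ring name, key name, and 90-day rotation period are hypothetical choices, not recommendations from this checklist. The caller supplies the first rotation time explicitly.

```shell
# Create a key ring and a symmetric key with automatic rotation, then print
# the key's full resource name for use in per-service CMEK settings.
# Usage: create_cmek_key my-app-prod us-central1 2030-01-01T00:00:00Z
create_cmek_key() {
  project="${1:?project id required}"
  location="${2:?KMS location required}"
  next_rotation="${3:?first rotation time (RFC 3339) required}"
  keyring="my-app-keyring"
  key="my-app-data-key"

  gcloud kms keyrings create "$keyring" \
    --location="$location" \
    --project="$project" || return 1

  gcloud kms keys create "$key" \
    --keyring="$keyring" \
    --location="$location" \
    --purpose=encryption \
    --rotation-period=90d \
    --next-rotation-time="$next_rotation" \
    --project="$project" || return 1

  echo "projects/$project/locations/$location/keyRings/$keyring/cryptoKeys/$key"
}
```

Before a service can use the key, its service agent needs roles/cloudkms.cryptoKeyEncrypterDecrypter on it. This is part of the operational overhead the paragraph above warns about.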
Container, workload, and supply-chain security
Every container image you deploy carries its dependencies, and their vulnerabilities, into production. Supply-chain security means verifying what goes into your images and what is allowed to run.
A container image is like a shipping container. You packed your application inside, but it also contains the operating system libraries, language runtimes, and every dependency you installed. If any of those have a known vulnerability, you are shipping that vulnerability straight into production. Scanning is the customs inspection before the container crosses the border.
Checklist:
- All container images are stored in Artifact Registry, not Docker Hub or another external registry. See Artifact Registry Best Practices.
- Artifact Analysis (formerly Container Analysis) vulnerability scanning is enabled on Artifact Registry repositories. Images are scanned automatically on push.
- Critical and High severity CVEs are reviewed before deployment. Use Binary Authorization to enforce a policy gate when your team or compliance requirements justify it.
- Base images are updated at least monthly. Automated dependency update PRs (Dependabot, Renovate) are reviewed promptly.
- Images are rebuilt on a schedule (weekly or monthly) to pick up OS-level patches even when application code has not changed.
- CI/CD pipelines use a hardened build process: pinned dependencies, minimal base images, no secrets baked into layers.
- Deployment guardrails enforce policy as code, blocking deployments that violate security constraints before they reach production.
Workload-specific notes:
- Cloud Run: Google manages the runtime sandbox. Focus on image hygiene, Secret Manager integration, and ingress restrictions. See Cloud Run Security Model.
- GKE: You manage node pools, network policies, and pod security standards in addition to image hygiene. Use private GKE clusters and Workload Identity for GKE to limit exposure.
- Compute Engine: You manage everything: OS patching, SSH key management, firewall rules, and runtime configuration. Use OS Login instead of project-wide SSH keys, and keep instances patched with OS Patch Management.
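Two items from this section, reviewing scan results and switching Compute Engine to OS Login, map to single gcloud commands. This is a sketch with hypothetical function names. The image path in the usage line is a placeholder.

```shell
# Show the vulnerability summary recorded for one image in Artifact Registry.
# Usage: show_image_vulnerabilities \
#          us-central1-docker.pkg.dev/my-app-prod/my-repo/api-server:latest
show_image_vulnerabilities() {
  image="${1:?full image path required}"
  gcloud artifacts docker images describe "$image" \
    --show-package-vulnerability
}

# Turn on OS Login for every VM in the project through project metadata,
# replacing project-wide SSH keys with IAM-controlled access.
# Usage: enable_os_login my-app-prod
enable_os_login() {
  project="${1:?project id required}"
  gcloud compute project-info add-metadata \
    --metadata=enable-oslogin=TRUE \
    --project="$project"
}
```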
Org guardrails and policy controls
Organization Policy constraints are preventive controls that apply across all projects in your GCP organization. They stop misconfigurations before they happen. No amount of IAM review catches what a policy constraint prevents automatically.
IAM controls are like locks on individual doors. Organization policies are like building codes. A building code says “no door in this building may be left without a lock.” Even if someone forgets to lock one door, the code ensures the lock is there in the first place. Org policies work the same way: they enforce rules across every project so individual teams cannot accidentally create unsafe configurations.
Checklist:
- Restrict external IP addresses on VMs (compute.vmExternalIpAccess) to prevent accidental public exposure.
- Restrict resource locations (gcp.resourceLocations) to keep data in approved regions. See Restricting Resource Locations.
- Disable service account key creation (iam.disableServiceAccountKeyCreation) unless explicitly exempted for specific projects.
- Restrict public access to Cloud Storage buckets (storage.publicAccessPrevention).
- Disable default network creation in new projects (compute.skipDefaultNetworkCreation).
Why this matters: Organization policies are the highest-leverage security controls in GCP. A single policy constraint can eliminate an entire class of misconfiguration across every project in your organization. If you have an org node, start here.
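The boolean constraints from the checklist above can be enforced at the organization level in a short loop. This is a sketch: the function name is hypothetical, and it covers only the three boolean constraints. List constraints such as compute.vmExternalIpAccess and gcp.resourceLocations need a policy with explicit allowed or denied values and are not handled here.

```shell
# Enforce three boolean Organization Policy constraints across an organization.
# Usage: enforce_baseline_org_policies 123456789012
enforce_baseline_org_policies() {
  org_id="${1:?organization id required}"
  for constraint in \
    compute.skipDefaultNetworkCreation \
    iam.disableServiceAccountKeyCreation \
    storage.publicAccessPrevention
  do
    gcloud resource-manager org-policies enable-enforce "$constraint" \
      --organization="$org_id" || return 1
  done
}
```

Enforcing these on an existing organization can break workflows that depend on the old behavior, so a test folder is a safer place to start.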
Security Command Center
Security Command Center (SCC) Standard tier is free and automatically scans your GCP organization for common misconfigurations. Review its findings weekly. Depending on the tier and the detectors available in it, SCC detects:
- Service accounts with Owner or Editor roles
- Cloud Storage buckets that are publicly accessible
- Cloud SQL instances with a public IP address
- Firewall rules allowing unrestricted inbound access on sensitive ports
- Logging disabled on key services
- Service account keys that have not been rotated
Treat Critical and High SCC findings as incidents: resolve within 24–48 hours in production. Medium findings should be scheduled and resolved within the next sprint cycle.
The Premium tier adds continuous compliance reporting against CIS GCP Benchmark, NIST 800-53, and ISO 27001. The Standard tier is sufficient for most production hardening work.
Backup, disaster recovery, and incident readiness
Backups are a security control, not just an operations task. Ransomware, accidental deletion, and compromised credentials all require the ability to restore data and investigate what happened.
Checklist:
- Cloud SQL automated backups are enabled with point-in-time recovery (PITR). Test a restore at least quarterly.
- Critical Cloud Storage buckets use Object Versioning or retention policies to protect against accidental or malicious deletion.
- Audit logs are exported with sufficient retention for incident investigation. 90 days minimum, longer if compliance requires it.
- The team knows who to contact, what to check first, and where logs are stored when a security incident occurs. A one-page runbook is enough to start.
- Recovery has been tested. A backup that has never been restored is a hope, not a control.
Schedule a quarterly “restore drill.” Pick a Cloud SQL backup, restore it to a temporary instance, and verify the data is intact. Delete the temporary instance when done. This takes less than an hour and is the only way to know your backups actually work.
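The restore step of the drill can be sketched as follows. The function and instance names are hypothetical. The sketch assumes the temporary instance already exists, because restoring overwrites the target instance's data.

```shell
# Restore the most recent successful backup of one Cloud SQL instance into
# an existing temporary instance.
# Usage: restore_drill my-app-prod my-app-db drill-temp
restore_drill() {
  project="${1:?project id required}"
  source_instance="${2:?source instance required}"
  temp_instance="${3:?temporary target instance required}"

  backup_id="$(gcloud sql backups list \
    --instance="$source_instance" \
    --project="$project" \
    --filter="status=SUCCESSFUL" \
    --sort-by="~windowStartTime" \
    --limit=1 \
    --format="value(id)")"

  if [ -z "$backup_id" ]; then
    echo "no successful backup found for $source_instance" >&2
    return 1
  fi

  gcloud sql backups restore "$backup_id" \
    --restore-instance="$temp_instance" \
    --backup-instance="$source_instance" \
    --project="$project" \
    --quiet
}
```

After verifying the data, remove the temporary instance with `gcloud sql instances delete drill-temp --project=my-app-prod`.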
For detailed recovery patterns, see Disaster Recovery Strategies and Incident Response with Monitoring.
Cloud Run vs GKE vs Compute Engine: what changes
Some security controls are universal. Others depend on how much infrastructure you manage.
| Control | Cloud Run | GKE | Compute Engine |
|---|---|---|---|
| OS patching | Google manages | You patch nodes (or use auto-upgrade) | You patch everything |
| Runtime isolation | gVisor sandbox (automatic) | You configure pod security standards | You harden the OS |
| Network policy | Ingress setting only | Kubernetes NetworkPolicy | VPC firewall rules |
| Identity for workloads | Service account per service | Workload Identity per pod | Service account per VM |
| Secret injection | --update-secrets | Secret Manager CSI driver or env | App reads from Secret Manager API |
| SSH / shell access | Not applicable | kubectl exec (RBAC-controlled) | OS Login + IAP tunneling |
| Image scanning | Artifact Registry scanning | Artifact Registry + Binary Auth | Not container-based (use OS inventory) |
The pattern: Cloud Run gives you the smallest security surface. GKE adds cluster and node security. Compute Engine adds full OS responsibility. Choose the runtime that matches your team’s ability to manage the security surface.
Common beginner mistakes
Using the default Compute Engine service account. It has Editor permissions on the project. Any code running with this account can modify almost anything. Create a dedicated service account with only the roles that specific workload needs.
Leaving Cloud Run publicly accessible behind a load balancer. If you add Cloud Armor to a load balancer but leave the Cloud Run *.run.app URL open, attackers can bypass the WAF entirely. Set --ingress=internal-and-cloud-load-balancing.
Storing secrets in plain text in Terraform, YAML, or Dockerfiles. These end up in version control and state files. Use Secret Manager and reference secrets by name.
Not enabling Data Access audit logs. Without them, you have no record of who read data from Cloud SQL, downloaded files from Cloud Storage, or accessed secrets. Investigation after a breach is severely limited.
Assuming a private repository means secrets are safe. Anyone with repo access can see committed secrets. Secrets in commit history survive even after deletion from the current branch. Use Secret Manager, not your repo.
Treating recovery as separate from security. Untested backups, missing log retention, and no incident runbook mean a security event becomes a prolonged outage. Test your restores. Know where your logs are.
Frequently asked questions
What is the first GCP security control to fix in production?
Replace the default Compute Engine service account. The default service account has Editor-level permissions on the project, which means any compromised workload can read, modify, or delete almost everything. Create a dedicated service account per workload with only the predefined roles it actually needs. This single change limits the blast radius of any incident more than almost anything else you can do.
Should Cloud SQL have a public IP in production?
No. Deploy Cloud SQL with private IP only. Connect applications over the VPC, using the Cloud SQL Auth Proxy and, for serverless workloads, a Serverless VPC Access connector or Direct VPC Egress. A Cloud SQL instance with a public IP is reachable from the internet. Even with strong passwords and authorized networks configured, the attack surface is larger than necessary. Private IP removes that entire category of risk.
What is the difference between Secret Manager and Cloud KMS?
Secret Manager stores and retrieves secret values like database passwords, API keys, and certificates. Cloud KMS manages encryption keys used to encrypt and decrypt data. If you need to store a password, use Secret Manager. If you need to encrypt a file or a database column with a key you control, use Cloud KMS. Some teams use both: Secret Manager for application secrets, Cloud KMS for customer-managed encryption keys (CMEK) on services like Cloud SQL and Cloud Storage.
Do I need Binary Authorization for every workload?
Not necessarily. Binary Authorization is most valuable when you need a hard policy gate that prevents unverified container images from running in production. If you are a small team with a single CI/CD pipeline and Artifact Registry vulnerability scanning, Binary Authorization adds a layer of assurance but is not strictly required on day one. Start with vulnerability scanning in Artifact Registry, and add Binary Authorization when your compliance requirements or team size justify it.
How is securing Cloud Run different from securing GKE or VMs?
Cloud Run handles OS patching, runtime isolation, and scaling for you. Your security focus is on IAM, ingress restrictions, secret injection, and container image hygiene. GKE requires you to also manage node security, network policies, pod security standards, and cluster-level RBAC. Compute Engine VMs require all of that plus OS hardening, SSH key management, and patch management. The more infrastructure you manage, the more security controls you own.