Cloud SQL Backups, PITR, and High Availability: What to Enable for Production
Cloud SQL gives you four tools for building a resilient database: automated backups, on-demand backups, point-in-time recovery (PITR), and high availability (HA). Each solves a different problem. None of them replaces the others. The hardest part is not the configuration. It is knowing which one you actually need.
Backups let you recover from catastrophic data loss. PITR lets you recover from
a bad DELETE or a failed migration by rewinding to a precise
moment. High availability keeps your instance online when a zone goes down,
automatically promoting a warm standby in roughly 60 seconds. Read replicas
handle read traffic, but they are not the same as HA, and assuming they are
is one of the most common Cloud SQL configuration mistakes.
This page walks through all four mechanisms, explains what each one protects against, and helps you decide what to enable for your workload. If you are new to Cloud SQL itself, start with the Cloud SQL overview first.
What each mechanism actually does
Before getting into configuration, here is the plain-English version of each tool:
Automated backup: a daily snapshot of your entire database. If something goes badly wrong, you can restore to the state it was in at the last backup.
On-demand backup: a manual snapshot you trigger yourself. Useful before risky operations like schema changes or bulk updates. Persists indefinitely until you delete it.
Point-in-time recovery (PITR): recovery to any specific second within the log retention window. Not just “yesterday’s backup” but “yesterday at 14:27:43, before that migration ran.”
High availability (HA): a second copy of your instance in a different availability zone, kept in sync with every write to the primary. If the primary zone fails, Cloud SQL promotes the standby automatically. The standby is never directly accessible; it exists only to take over.
Read replica: a separate Cloud SQL instance that receives changes asynchronously from the primary. You can query it for reads. It does not fail over automatically if the primary fails.
A backup is a safety net. PITR is a time machine. High availability is a co-pilot who takes the controls the moment something goes wrong. A read replica is a colleague who can handle some of your calls, but cannot take the wheel in an emergency.
What this page helps you decide
- Whether automated backups alone are enough for your workload
- When PITR is worth enabling, and how the requirements differ between MySQL and PostgreSQL
- When high availability justifies the extra cost
- Why read replicas are not a substitute for HA, and what happens if you treat them as one
- What a sensible production Cloud SQL setup looks like before go-live
How Cloud SQL recovery works end to end
These mechanisms form a layered recovery model, not four separate unrelated settings. Here is how they fit together.
Every day, Cloud SQL takes a full backup of your instance and stores it in Cloud Storage, replicated across regions for durability. That is your baseline recovery point. By default you keep seven of these; you can increase the count up to 365.
Between full backups, Cloud SQL can continuously archive logs: binary logs for MySQL, WAL (write-ahead log) for PostgreSQL. These logs fill the gap between backup snapshots and give you PITR. When you trigger a point-in-time restore, Cloud SQL takes the nearest full backup and replays the captured logs up to your chosen timestamp. The result is a new Cloud SQL instance at that exact state.
High availability is separate from both. It does not change how backups or PITR work. It provisions a standby instance in a different zone within the same region and keeps it synchronised in real time. When the primary zone fails, Cloud SQL detects this automatically and promotes the standby. The instance keeps the same IP address, so your application reconnects without any configuration change.
Read replicas sit outside this recovery stack entirely. They are useful for distributing read traffic across multiple instances, but they use asynchronous replication and do not provide automatic failover. For a broader view of availability patterns across GCP services, see the guide on designing highly available systems on GCP.
Automated backups
Cloud SQL performs one automated full backup per day during a configurable backup window. Backups are stored in Cloud Storage and replicated across regions for durability. Schedule the window during a low-traffic period to minimise any performance impact.
# Enable backups and set the backup window at instance creation
gcloud sql instances create my-db-instance \
--database-version=POSTGRES_15 \
--region=europe-west2 \
--tier=db-custom-2-7680 \
--backup-start-time=02:00 \
--retained-backups-count=14
# Update backup settings on an existing instance
gcloud sql instances patch my-db-instance \
--backup-start-time=02:00 \
--retained-backups-count=14
# List available backups
gcloud sql backups list --instance=my-db-instance
Setting --retained-backups-count=14 keeps the last 14 daily
backups, roughly two weeks. But if the instance is paused or a backup job
fails on a given day, that day does not count. Plan your retention number
with this in mind.
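Because retention is a count rather than a number of days, it can be worth checking how far back the retained backups actually reach. The following is a minimal sketch, not a definitive recipe: the helper name oldest_backup is hypothetical, and the status and windowStartTime field names are assumed from the Cloud SQL backup run resource.

```shell
# oldest_backup INSTANCE
# Prints the start time of the oldest successful backup still retained.
# Helper name is illustrative; the field names (status, windowStartTime)
# are assumed from the Cloud SQL backup run resource.
oldest_backup() {
  gcloud sql backups list \
    --instance="$1" \
    --filter="status=SUCCESSFUL" \
    --sort-by=windowStartTime \
    --limit=1 \
    --format="value(windowStartTime)"
}

# Usage: oldest_backup my-db-instance
```

If the timestamp printed is more recent than you expected, some daily backups were probably skipped or failed, and the retention count may need to be higher.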
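For PostgreSQL instances, PITR relies on WAL archiving, which requires automated backups to be enabled; if you create the instance with gcloud, set --enable-point-in-time-recovery explicitly rather than assuming it is on. For MySQL, PITR requires binary logging, covered in the PITR section below.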
On-demand backups
You can take a backup at any time. On-demand backups are not affected by the retained-backups-count limit; they persist until you delete them manually, which makes them useful as long-lived checkpoints.
# Create an on-demand backup immediately
gcloud sql backups create --instance=my-db-instance
# Describe a specific backup to confirm it succeeded
gcloud sql backups describe BACKUP_ID --instance=my-db-instance
# Delete a backup when it is no longer needed
gcloud sql backups delete BACKUP_ID --instance=my-db-instance
Take one immediately before a schema migration, a bulk data load or delete, or promoting a read replica. This gives you a clean restore point that will not age out of the rolling window before you are confident the operation succeeded.
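One way to make the "backup first" habit hard to skip is to wrap the risky command. This is a sketch under assumptions: backup_then_run is a hypothetical helper, run_migration.sh is a placeholder for your own script, and it relies on gcloud sql backups create waiting for the backup to finish before returning.

```shell
# backup_then_run INSTANCE DESCRIPTION COMMAND [ARGS...]
# Takes an on-demand backup, then runs COMMAND only if the backup succeeded.
# Helper name is illustrative, not part of gcloud.
backup_then_run() {
  local instance="$1" description="$2"
  shift 2
  if ! gcloud sql backups create --instance="$instance" --description="$description"; then
    echo "backup failed; not running: $*" >&2
    return 1
  fi
  # The backup succeeded, so run the risky command with its arguments intact.
  "$@"
}

# Usage (run_migration.sh is a hypothetical script):
# backup_then_run my-db-instance "pre-migration checkpoint" ./run_migration.sh
```

The wrapper returns non-zero without running the command when the backup fails, so a migration never starts without a restore point behind it.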
Point-in-time recovery (PITR)
PITR lets you recover your database to any second within the log retention
window. This is the right tool when the damage was not an infrastructure
failure but a logical one: someone ran
DELETE FROM orders WHERE status = 'pending' at the
wrong time, or a migration introduced corrupted data and you need to rewind
to before it ran.
The behaviour differs between database engines:
MySQL: PITR requires binary logging, which must be explicitly enabled with --enable-bin-log, ideally at instance creation. The logs only exist from the moment it is turned on; enabling it later does not let you recover earlier data. See the MySQL on Cloud SQL guide for more on binary logging and replica configuration.
PostgreSQL: PITR relies on WAL archiving and requires automated backups to be enabled. No binary logging flag is needed; with gcloud, PITR itself is switched on with --enable-point-in-time-recovery.
If you create a MySQL instance without --enable-bin-log, PITR
is not available. Turning it on later only covers changes made after
that point. This is easy to miss because automated backups still work
without it.
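Since a missing binary log setting is easy to overlook, a quick check against the instance configuration can help. The sketch below assumes the settings.backupConfiguration.binaryLogEnabled field of the instance resource; binlog_enabled is a hypothetical helper name.

```shell
# binlog_enabled INSTANCE
# Exit status 0 if binary logging is on, 1 if it is off,
# 2 if the instance could not be described.
# Assumes the settings.backupConfiguration.binaryLogEnabled field.
binlog_enabled() {
  local value
  value=$(gcloud sql instances describe "$1" \
    --format="value(settings.backupConfiguration.binaryLogEnabled)") || return 2
  # gcloud prints booleans as text; compare case-insensitively.
  [ "$(printf '%s' "$value" | tr '[:upper:]' '[:lower:]')" = "true" ]
}

# Usage:
# binlog_enabled my-mysql-instance || echo "PITR is NOT available on this instance"
```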
# MySQL: create an instance with binary logging enabled for PITR
gcloud sql instances create my-mysql-instance \
--database-version=MYSQL_8_0 \
--region=europe-west2 \
--tier=db-n1-standard-2 \
--enable-bin-log \
--backup-start-time=02:00
# Restore to a specific point in time (creates a new instance, not an overwrite)
gcloud sql instances clone my-db-instance my-db-restored \
--point-in-time=2026-03-07T14:30:00.000Z
PITR always creates a new Cloud SQL instance rather than overwriting the existing one. This is intentional. During incident response, you can restore and verify the recovered data while the production instance stays accessible. Once you have confirmed the state looks correct, update your application's connection string to point at the restored instance, or export and re-import specific tables back into the primary.
PITR timestamps must be in UTC. During an incident it is easy to enter a local time by mistake: restoring to 14:30 UTC when you meant 14:30 BST (13:30 UTC) lands an hour after the intended recovery point. That is a frustrating and avoidable error.
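To avoid doing timezone arithmetic by hand under pressure, the conversion can be scripted. This is a minimal sketch that assumes GNU date (the -d option; BSD and macOS date use different flags) and a system with the tz database installed; to_utc_timestamp is a hypothetical helper name.

```shell
# to_utc_timestamp "LOCAL_DATETIME" TIMEZONE
# Converts a wall-clock time in TIMEZONE to a UTC timestamp in the
# format used with --point-in-time. Requires GNU date.
to_utc_timestamp() {
  local epoch
  # Interpret the wall-clock time in the given zone and get epoch seconds.
  epoch=$(TZ="$2" date -d "$1" +%s) || return 1
  # Render the same instant in UTC.
  date -u -d "@$epoch" +%Y-%m-%dT%H:%M:%S.000Z
}

# Usage: 14:30 British Summer Time on 7 July 2026
# to_utc_timestamp "2026-07-07 14:30:00" "Europe/London"
```

The printed value can be passed straight to --point-in-time. It is still worth reading the result back before running the clone, since a mistyped zone name may be silently treated as UTC.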
High availability in Cloud SQL
A Cloud SQL high-availability instance consists of a primary in one zone and a standby in a different zone within the same region. Every write to the primary is synchronously replicated to the standby before being acknowledged. The standby is always current, but this adds a small amount of write latency compared to a single-zone instance.
If the primary zone becomes unavailable, Cloud SQL detects this automatically and promotes the standby. Failover typically completes within 60 seconds. The instance keeps the same IP address, so no connection string change is needed, but existing connections are dropped and your application must reconnect. The former standby becomes the new primary, and Cloud SQL provisions a replacement standby in another zone.
The standby is not accessible for reads. It does not serve queries. It exists solely to become the new primary during a zone failure. If you need to distribute read load across instances, that is what read replicas are for, and they are independent of HA.
If someone runs DROP TABLE or a bad migration on the primary,
that change replicates to the standby immediately. HA will not save you
from it. For protection against logical errors, you need backups and PITR.
HA and backups solve completely different problems.
Cost is roughly double that of a single-zone instance because you are running two instances (primary plus standby) with duplicated compute and storage. For workloads where zone downtime is unacceptable, that cost is easy to justify. For internal tools or non-critical databases, ZONAL is often fine.
# Create a high-availability (regional) instance
gcloud sql instances create my-ha-instance \
--database-version=POSTGRES_15 \
--region=europe-west2 \
--tier=db-custom-2-7680 \
--availability-type=REGIONAL \
--backup-start-time=02:00
# Enable HA on an existing instance (requires a brief restart)
gcloud sql instances patch my-db-instance \
--availability-type=REGIONAL
# Trigger a manual failover to test the behaviour
gcloud sql instances failover my-ha-instance
The default availability type is ZONAL, a single instance
with no standby. For production, set --availability-type=REGIONAL.
Then run gcloud sql instances failover in a staging environment
to see how your application behaves during the switchover. A real
outage is not the time to discover a connection-handling issue.
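When testing failover in staging, it helps to measure how long the database is actually unreachable from the client side. The sketch below is one rough way to do that: longest_outage is a hypothetical helper, and the probe command is whatever connectivity check suits your setup (pg_isready against a placeholder address is shown only as an example).

```shell
# longest_outage "PROBE_COMMAND" ATTEMPTS [INTERVAL_SECONDS]
# Runs PROBE_COMMAND repeatedly and prints the longest run of
# consecutive failures. Multiply by the interval for a rough outage time.
longest_outage() {
  local probe="$1" attempts="$2" interval="${3:-1}"
  local longest=0 current=0 i=0
  while [ "$i" -lt "$attempts" ]; do
    if sh -c "$probe" >/dev/null 2>&1; then
      current=0                      # probe succeeded: reset the failure streak
    else
      current=$((current + 1))       # probe failed: extend the streak
      if [ "$current" -gt "$longest" ]; then longest=$current; fi
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$longest"
}

# Usage (10.0.0.5 is a placeholder address): start this in one terminal,
# then trigger the failover from another.
# longest_outage "pg_isready -h 10.0.0.5" 180 1
```

This only measures reachability. Whether the application itself recovers depends on its connection pooling and retry behaviour, which is worth observing separately during the same test.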
Restoring from a backup
# Restore a backup to a different instance (safer for investigation)
gcloud sql backups restore BACKUP_ID \
--restore-instance=my-db-restored \
--backup-instance=my-db-instance
# Restore a backup to the same instance (overwrites current data immediately)
gcloud sql backups restore BACKUP_ID \
--restore-instance=my-db-instance
Restoring to the same instance overwrites all current data immediately and irreversibly. During incident response, always restore to a new instance first. Verify it contains what you expect before touching the production instance. Overwriting production while still diagnosing can destroy evidence and make the situation significantly harder to recover from.
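A small guard can make the destructive variant harder to run by accident. This is a sketch only: safe_restore and the ALLOW_OVERWRITE variable are hypothetical conventions, not gcloud features.

```shell
# safe_restore BACKUP_ID SOURCE_INSTANCE TARGET_INSTANCE
# Refuses to restore onto the source instance unless ALLOW_OVERWRITE=yes.
# Both the helper and the ALLOW_OVERWRITE variable are illustrative.
safe_restore() {
  local backup_id="$1" src="$2" target="$3"
  if [ "$src" = "$target" ] && [ "${ALLOW_OVERWRITE:-no}" != "yes" ]; then
    echo "refusing to overwrite $src; restore to a separate instance or set ALLOW_OVERWRITE=yes" >&2
    return 1
  fi
  gcloud sql backups restore "$backup_id" \
    --restore-instance="$target" \
    --backup-instance="$src"
}

# Usage:
# safe_restore BACKUP_ID my-db-instance my-db-restored
```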
Backups, PITR, HA, and read replicas compared
These four features are frequently confused because they all relate to resilience. Here is a direct comparison by what each one actually does.
Backups. Purpose: data recovery from catastrophic loss.
Protects against instance deletion, data corruption, and accidental drops. Runs daily and is automatic once configured. Does not reduce downtime during a zone failure. Cannot recover changes made between backups; that needs PITR.
PITR. Purpose: granular recovery to a specific moment.
Protects against accidental deletions, bad migrations, and data corruption within the log window. Does not reduce infrastructure downtime. Does not help with read scaling. Requires enabling binary logs on MySQL. Restore creates a new instance, never an overwrite.
High availability. Purpose: reducing downtime during zone failure.
Provides automatic failover within roughly 60 seconds. Does not protect
against accidental deletion or bad queries; those replicate to the standby
immediately. Does not help with read scaling. Costs roughly double. Must be
explicitly enabled with --availability-type=REGIONAL.
Read replicas. Purpose: distributing read traffic.
Provides scalable read throughput across multiple instances. Does not provide automatic failover; promotion is manual and takes time. Does not protect against data loss. Works well alongside HA, not as a replacement for it.
The key distinction: HA and read replicas address availability; backups and PITR address recoverability. A resilient production database uses all of them. For a wider view of recovery architecture in GCP, see the disaster recovery strategies guide.
When to use each option
The right configuration depends on what level of downtime and data loss your workload can tolerate. Here are four common scenarios:
Small internal tool or development environment: automated backups with the default retention of 7 backups are usually sufficient. ZONAL availability is fine. For PostgreSQL, PITR is straightforward to add once backups are on, though the archived logs do consume storage. Skip HA unless the team depends on this database heavily during the working day.
Production app with moderate uptime requirements: enable backups with a retention of 14 to 30 backups, enable PITR (add --enable-bin-log at creation for MySQL), and consider HA. If a few minutes of downtime during a zone failure is covered by your SLA, ZONAL may still be acceptable, but be explicit about that decision.
Business-critical production database: enable HA, enable PITR, set backup retention to 30 or more, take on-demand backups before major operations, and test your failover and restore processes before go-live. Set up alerts in Cloud Monitoring for backup failures and replication lag.
Read-heavy reporting workload: add one or more read replicas to offload analytics queries from the primary. Keep HA enabled on the primary if it is production-facing. Read replicas do not provide failover; they are for scaling, not resilience.
If you are still deciding whether Cloud SQL is the right choice for your use case, the choosing the right storage service guide covers that decision across Cloud SQL, Firestore, Bigtable, and others.
Common mistakes
Assuming a read replica provides automatic failover. A read replica does not take over if the primary fails. Promotion is manual and takes time. If your application depends on automatic failover, you need --availability-type=REGIONAL, not a read replica. This is the most common Cloud SQL HA misconfiguration.
Not enabling binary logging for MySQL PITR. Automated backups alone do not enable PITR on MySQL. Without --enable-bin-log, the closest you can recover to is the previous full backup. Set the flag at instance creation; turning it on later only covers changes made from that point onward.
Thinking HA replaces backups. HA protects against zone failure. It does not protect against accidental data deletion or a bad migration. Those changes replicate to the standby immediately. You still need backups and PITR for data recovery.
Restoring directly to the production instance while diagnosing an incident. Restoring to the same instance overwrites all current data immediately. Always restore to a new instance first when investigating. Overwriting production before you understand what happened can destroy evidence and make recovery harder.
Using local time instead of UTC for PITR timestamps. Cloud SQL PITR timestamps must be in UTC. During a stressful incident it is easy to enter a local time by mistake. Restoring to the wrong point because of a timezone error is common and frustrating.
Leaving backup retention at the default without reviewing it. Seven backups is fine for development, but limiting for production. If you discover a data problem more than a week after it occurred, a 7-backup window will not cover you. Review and increase retention when you first configure the instance, not after an incident.
Production checklist for Cloud SQL resilience
Before taking a Cloud SQL instance to production, work through this list. Most of these settings cannot be changed retroactively without downtime or data risk:
- Automated backups are enabled with a window during off-peak hours
- Backup retention count is set to 14 or more (review what your recovery window requires)
- PITR is confirmed: for MySQL, verify --enable-bin-log was set at creation; for PostgreSQL, confirm backups and PITR are both enabled
- High availability is set to REGIONAL if zone downtime is unacceptable for this workload
- A manual failover test has been run with gcloud sql instances failover to confirm the behaviour
- A backup restore has been tested on a non-production instance to confirm the process works end to end
- The team knows to take an on-demand backup before running migrations or bulk operations
- Alerts are configured in Cloud Monitoring for backup failures and replication lag
- Connections to the instance are secured via the Auth Proxy or private IP (see connecting to Cloud SQL securely)
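Parts of this checklist can be checked mechanically. The sketch below covers only two items (availability type and retention count) and is an illustration rather than a complete audit: check_resilience is a hypothetical helper, and the field paths are assumed from the Cloud SQL instance resource.

```shell
# check_resilience INSTANCE [MIN_RETAINED]
# Warns if the instance is not REGIONAL or retains fewer backups than
# MIN_RETAINED (default 14). Returns 0 only when both checks pass.
# Field paths are assumed from the Cloud SQL instance resource.
check_resilience() {
  local instance="$1" min_retained="${2:-14}" failures=0
  local availability retained

  availability=$(gcloud sql instances describe "$instance" \
    --format="value(settings.availabilityType)") || return 2
  retained=$(gcloud sql instances describe "$instance" \
    --format="value(settings.backupConfiguration.backupRetentionSettings.retainedBackups)") || return 2

  if [ "$availability" != "REGIONAL" ]; then
    echo "WARN: availability type is '${availability}', not REGIONAL"
    failures=$((failures + 1))
  fi
  # Treat a missing value as zero retained backups.
  if [ "${retained:-0}" -lt "$min_retained" ]; then
    echo "WARN: only ${retained:-0} backups retained (want at least $min_retained)"
    failures=$((failures + 1))
  fi
  [ "$failures" -eq 0 ]
}

# Usage: check_resilience my-db-instance 14
```

The remaining items, such as failover and restore tests, still need to be exercised by hand; a configuration check cannot confirm that a restore actually works.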
Summary
- Backups, PITR, HA, and read replicas solve different problems and are complementary, not interchangeable
- Automated backups run daily; configure the window and retention count before going to production
- PITR on MySQL requires --enable-bin-log, ideally set at instance creation; on PostgreSQL it relies on WAL archiving and requires backups to be on
- PITR restores create a new instance; they do not overwrite the existing primary
- HA uses --availability-type=REGIONAL and provisions a standby in a different zone with automatic failover in ~60 seconds
- HA standbys are not accessible for reads; they exist only for failover
- HA does not protect against logical errors like bad queries or accidental deletes; those replicate immediately to the standby
- Read replicas handle read scaling; they do not replace HA for zone-level availability
- On-demand backups persist indefinitely until deleted; take one before any risky operation
Frequently asked questions
How does point-in-time recovery work in Cloud SQL?
PITR lets you restore a database to any second within the log retention window. For MySQL, binary logging must be explicitly enabled with --enable-bin-log, ideally at instance creation; without it, the closest you can recover to is the last full backup. For PostgreSQL, PITR relies on WAL archiving and requires automated backups to be enabled. You specify a UTC timestamp and Cloud SQL replays the captured logs on top of the nearest full backup to reach that exact state. PITR always creates a new instance rather than overwriting the existing one.
What is the difference between a Cloud SQL HA standby and a read replica?
An HA standby sits in a different zone in the same region, uses synchronous replication, is not user-accessible, and fails over automatically within roughly 60 seconds if the primary zone goes down. A read replica is a separate Cloud SQL instance using asynchronous replication that you can query for reads — but it does not fail over automatically and must be manually promoted. Both can run simultaneously: the standby handles zone failure, the replica handles read scaling. They are complementary, not interchangeable.
How many automated backups should I retain?
The default is 7 (roughly one week). For most production workloads, 14 to 30 is a more practical choice. Retention is counted by number of backups, not calendar days: retaining 14 means you keep the last 14 daily backups. If the instance pauses or a backup job fails on a given day, that day does not count. On-demand backups are not included in this count and persist until you delete them manually.
Does high availability replace backups in Cloud SQL?
No. HA protects against zone failure by failing over to a standby — it does not help you recover from accidental data deletion, a bad migration, or corruption. If someone runs DROP TABLE on the primary, that change replicates immediately to the standby. You need backups and PITR for data recovery. HA and backups solve different problems and should both be enabled for production.
When should I enable high availability in Cloud SQL?
Enable HA for any database where unplanned downtime has a real business impact — production apps, customer-facing services, or anything with an SLA. It roughly doubles the instance cost because you are paying for both the primary and the standby. For internal tools, staging environments, or workloads where a short outage is acceptable, ZONAL (single-zone) is often fine.
Does HA protect against accidental deletion or a bad SQL query?
No. HA is for infrastructure failure, not logical errors. If you accidentally delete rows or run a destructive migration, the HA standby replicates that change immediately — it cannot save you from it. To recover from logical errors, you need PITR or a backup restore. HA and backups are complementary and both belong in a production setup.