BigQuery’s Cross-Region Replication: Elevating Geo-Redundancy for Your Data

Virinchi T · Published in Fournine Cloud
Nov 23, 2023 · 5 min read


Geographical redundancy is crucial in designing a robust cloud-based data lake architecture. It serves various purposes for customers, such as facilitating low-latency reads for end users by placing data closer to them, ensuring compliance with regulatory standards, co-locating data with related services, and maintaining redundancy for critical applications.

BigQuery currently ensures data resilience by storing duplicates in two separate Google Cloud zones within a dataset region. This replication mechanism involves synchronous dual writes, mitigating potential data loss in case of zonal failures, be it due to soft causes like power failures or network partitions, or hard causes like natural disasters such as floods, earthquakes, or hurricanes.

The introduction of cross-region dataset replication in BigQuery marks a significant advancement. This feature allows effortless replication of any dataset, including ongoing changes, across different cloud regions. Apart from continuous replication, users can leverage this functionality to seamlessly migrate BigQuery datasets from a source region to a chosen destination region, enhancing flexibility in data management strategies.

How it works

BigQuery provides a primary and secondary configuration for replication across regions:

The primary region: when you create a dataset, BigQuery designates the selected region as the location of the primary replica.

The secondary region: when you add a dataset replica, BigQuery designates the replica in the region you choose as a secondary replica. A dataset can have multiple secondary replicas, one per region.

Functionally, the primary replica allows write access, while the secondary replica operates in a read-only mode. Any writes made to the primary replica are asynchronously copied to the secondary one. Within each region, data redundancy is ensured with storage in two zones, and all data replication occurs exclusively within the Google Cloud network.

Despite residing in different regions, replicas share the same names. Consequently, there’s no need for query adjustments when referencing a replica situated in a different region.


How to create a replica for a given dataset

Before starting, keep in mind that if you are using BigQuery's capacity reservations, you need a reservation in the location of the secondary replica; otherwise, your queries there will use BigQuery's on-demand processing model.
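
If you need reserved capacity in the secondary region, you can provision it with the reservation DDL. Here is a minimal sketch, assuming Enterprise edition and placeholder project, reservation, and assignment names:

-- Reserve slots in the secondary region (us-east1) so that queries
-- against the replica do not fall back to on-demand processing.
CREATE RESERVATION `my-admin-project.region-us-east1.replica_reservation`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 100);

-- Assign the reservation to the project that will run queries there.
CREATE ASSIGNMENT `my-admin-project.region-us-east1.replica_reservation.replica_assignment`
OPTIONS (
  assignee = 'projects/my-project',
  job_type = 'QUERY');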

To initiate dataset replication, use the ALTER SCHEMA ADD REPLICA DDL statement.

Within each region or multi-region setup, it’s possible to add a single replica to any dataset. Upon adding a replica, there’s a timeframe required for the initial copy process to finalize. During this replication period, querying the primary replica remains unaffected, maintaining full query processing capacity without any reductions.

-- Create the primary replica in the primary region.
CREATE SCHEMA my_dataset OPTIONS(location='us-west1');

-- Create a replica in the secondary region.
ALTER SCHEMA my_dataset
ADD REPLICA `us-east1`
OPTIONS(location='us-east1');
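
A dataset can have secondary replicas in more than one region, with at most one replica per region. Adding another follows the same pattern (us-central1 here is just an illustrative choice):

-- Optionally, add another secondary replica in a different region.
ALTER SCHEMA my_dataset
ADD REPLICA `us-central1`
OPTIONS(location='us-central1');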

To confirm that the secondary replica has been created successfully, query the creation_complete column in the INFORMATION_SCHEMA.SCHEMATA_REPLICAS view.

-- Check the status of the replica in the secondary region.
SELECT creation_time, schema_name, replica_name, creation_complete
FROM `region-us-west1`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE schema_name = 'my_dataset';

How to query the secondary replica

Once initial creation is complete, you can run read-only queries against a secondary replica. To do so, set the job location to the secondary region in query settings or the BigQuery API. If you do not specify a location, BigQuery automatically routes your queries to the location of the primary replica.

-- Query the data in the secondary region.
SELECT COUNT(*)
FROM my_dataset.my_table;
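
With the bq command-line tool, for example, you can pin the job location with the --location flag (a sketch using the same dataset and table names as above):

# Route the query to the secondary region explicitly.
bq query --location=us-east1 --nouse_legacy_sql \
  'SELECT COUNT(*) FROM my_dataset.my_table'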

How to promote the secondary replica as primary

To designate a secondary replica as the primary one, use the ALTER SCHEMA SET OPTIONS DDL statement and set the primary_replica option. For this operation, you must explicitly set the job location to the secondary region in the query settings.

ALTER SCHEMA my_dataset SET OPTIONS(primary_replica = 'us-east1');

After a few seconds, the secondary replica becomes primary, and you can run both read and write operations in the new location. Similarly, the primary replica becomes secondary and only supports read operations.
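
To verify the switchover, you can check the SCHEMATA_REPLICAS view again. A sketch, assuming its replica_primary_assigned column:

-- Confirm which replica is now the primary.
SELECT schema_name, replica_name, replica_primary_assigned
FROM `region-us-east1`.INFORMATION_SCHEMA.SCHEMATA_REPLICAS
WHERE schema_name = 'my_dataset';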

How to remove a dataset replica

To discontinue dataset replication and remove a replica, use the ALTER SCHEMA DROP REPLICA DDL statement. If you are using replication to migrate a dataset from one region to another, drop the original replica after promoting the secondary replica to primary. This step isn't mandatory, but it's useful when you no longer need the replica once the migration is complete.

ALTER SCHEMA my_dataset
DROP REPLICA IF EXISTS `us-west1`;
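
Putting the steps together, a region-to-region migration looks like the following sketch (region names are placeholders, and step 3 must run with the job location set to us-east1):

-- 1. Add a replica in the destination region.
ALTER SCHEMA my_dataset
ADD REPLICA `us-east1`
OPTIONS(location='us-east1');

-- 2. Wait until creation_complete is true in SCHEMATA_REPLICAS (see above).

-- 3. Promote the replica in the destination region.
ALTER SCHEMA my_dataset SET OPTIONS(primary_replica = 'us-east1');

-- 4. Drop the replica in the original region.
ALTER SCHEMA my_dataset
DROP REPLICA IF EXISTS `us-west1`;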

Pricing

All of this is valuable depending on your organization's needs; however, there is a cost involved: you are billed for the data replicated from one region to another, according to BigQuery's data replication pricing.

Limitations

BigQuery dataset replication is subject to the following limitations:

  • Data in streams that haven't been committed is not replicated to the secondary replica if the table is written to with the BigQuery Storage Write API. Replication of streaming data from the BigQuery Storage Write API or tabledata.insertAll is best-effort and may see high replication delay.
  • Tables ingested into BigQuery through Datastream using change data capture (CDC) are not supported.
  • Replication and switchover are managed through SQL data definition language (DDL) statements.
  • You are limited to one replica of each dataset for each region or multi-region. You cannot create two secondary replicas of the same dataset in the same destination region.
  • Resources within replicas are subject to the limitations as described in Resource behavior.
  • Policy tags and associated data policies are not replicated to the secondary replica. Any queries that reference columns with policy tags in regions other than the original region fail, even if that replica is promoted.
  • Time travel is only available in the secondary replica after the creation of the secondary replica is completed.
  • You can only replicate a dataset with fewer than 100,000 tables.
  • You are limited to a maximum of 4 replica promotions per day per dataset.
  • Replication bandwidth for the initial copy operation is limited to 1 GB/second per project per continent-pair. Replication at this speed is not guaranteed.
  • Tables with customer-managed encryption keys (CMEK) applied are not queryable in the secondary region if the replica_kms_key value is not configured (see the sketch after this list).
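
For the CMEK limitation above, here is a sketch of configuring replica_kms_key when adding the replica; the key resource name is a placeholder, and I'm assuming replica_kms_key is set through the ADD REPLICA options:

-- Add a replica with a CMEK key for the secondary region.
ALTER SCHEMA my_dataset
ADD REPLICA `us-east1`
OPTIONS(
  location='us-east1',
  replica_kms_key='projects/my-project/locations/us-east1/keyRings/my-keyring/cryptoKeys/my-key');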

Let me know if this is helpful!!
