Looking for Confluent Platform Cluster Linking docs? This page describes Cluster Linking on Confluent Cloud. If you are looking for Confluent Platform documentation, check out Cluster Linking on Confluent Platform.
Disaster Recovery and Failover¶
Introduction¶
Deploying a disaster recovery strategy with Cluster Linking can increase availability and reliability of your mission-critical applications by minimizing data loss and downtime during unexpected disasters, like public cloud provider outages. This document will explain how to design your disaster recovery strategy and execute a failover.
To see what clusters can use Cluster Linking, see the supported cluster types.
Goal of Disaster Recovery with Cluster Linking¶
You start with a “primary” cluster that contains data in its topics and metadata used for operations, like consumer offsets and ACLs. Your applications are powered by producers that send data into those topics, and consumers that read data out of those topics.
You can use Cluster Linking to create a disaster recovery (“DR”) cluster that is in a different region or cloud than the primary cluster. When an outage hits the primary cluster, the DR cluster will have an up-to-date copy of your data and metadata. The producers and consumers can switch over to the DR cluster, allowing them to continue running with low downtime and minimal data loss. Thus, your applications can continue to serve your business and customers during a disaster.
Setting Up a Disaster Recovery Cluster¶
For the Disaster Recovery (“DR”) cluster to be ready to use when disaster strikes, it will need to have an up-to-date copy of the primary cluster’s topic data, consumer group offsets, and ACLs:
- The DR cluster needs up-to-date topic data so that consumers can process messages that they haven’t yet consumed. Consumers that are lagging can continue to process topic data while missing as few messages as possible. Any future consumers you create can process historical data without missing any data that was produced before the disaster. This helps you achieve a low Recovery Point Objective (“RPO”) when a disaster happens.
- The DR cluster needs up-to-date consumer group offsets so that when the consumers switch over to the DR cluster, they can continue processing messages from the point where they left off. This minimizes the number of duplicate messages the consumers read, which helps you minimize application downtime. This helps you achieve a low Recovery Time Objective (“RTO”).
- The DR cluster needs up-to-date ACLs so that the producers and consumers can already be authorized to connect to it when they switch over. Having these ACLs already set and up-to-date also helps you achieve a low RTO.
To set up a Disaster Recovery cluster to use with Cluster Linking:
- If needed, create a new Dedicated Confluent Cloud cluster with public internet in a different region or cloud provider to use as the DR cluster.
- Create a cluster link from the primary cluster to the DR cluster.
The cluster link should have these configurations:
- Enable consumer offset sync. If you plan to failover only a subset of the consumer groups to the DR cluster, then use a filter to only select those consumer group names. Otherwise, sync all consumer group names.
- Enable ACL sync. If you plan to failover only a subset of the Kafka clients to the DR cluster, then use a filter to select only those clients. Otherwise, sync all ACLs.
- Using the cluster link, create a mirror topic on the DR cluster for each of the primary cluster’s topics. If you only want DR for a subset of the topics, then only create mirror topics for that subset.
With those steps, you create a Disaster Recovery cluster that stays up to date as the primary cluster’s data and metadata change.
Whenever you create a new topic on the primary cluster that you want to have DR, create a mirror topic for it on the DR cluster.
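If you have many topics, you can script that step. The following is a minimal sketch, not an official tool: it assumes your ccloud version supports JSON output (-o json), that jq is installed, and that you substitute your own cluster IDs and link name.

# Sketch: create a mirror topic on the DR cluster for every topic that
# currently exists on the primary cluster. Field names in the JSON output
# may vary by CLI version.
for topic in $(ccloud kafka topic list --cluster <original-cluster-id> -o json | jq -r '.[].name'); do
  ccloud kafka mirror create "$topic" --link <link-name> --cluster <DR-cluster-id>
done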
Tip
Each Kafka client needs an API key and secret for each cluster that it connects to. To achieve a low RTO, create API keys on the DR cluster ahead of time, and store them in a vault where your Kafka clients can retrieve them when they connect to the DR cluster.
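For example, assuming each application has its own service account, a pre-provisioning step might look like the following sketch; <app-service-account-id> is a placeholder, and the vault command stands in for whatever secret store you use.

# Sketch: pre-provision a DR API key for an application's service account,
# then store it where the application can fetch it during a failover.
ccloud api-key create --resource <DR-cluster-id> --service-account <app-service-account-id>
# For example, with HashiCorp Vault (placeholder path and key names):
# vault kv put secret/kafka/dr api-key=<DR-api-key> api-secret=<DR-api-secret>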
Monitoring a Disaster Recovery Cluster¶
The Disaster Recovery (“DR”) cluster needs to stay up-to-date with the primary cluster so you can minimize data loss when a disaster hits the primary cluster. Because Cluster Linking is an “asynchronous” process, there may be “lag”: messages that exist on the primary cluster but haven’t yet been mirrored to the DR cluster. Lagged data is at risk of being lost when a disaster strikes.
Monitoring Lag with the Metrics API¶
You can monitor the DR cluster’s lag using built-in metrics to see how much data any mirror topic risks losing during a disaster. The Metrics API’s mirror lag metric reports an estimate of the maximum number of lagged messages on a mirror topic’s partitions.
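For example, a lag query might look like the following sketch. The metric name is a placeholder (look up the exact mirror lag metric in the Metrics API reference), and the request assumes the v2 query endpoint and a Cloud API key for basic authentication.

# Hypothetical Metrics API query for mirror lag on the DR cluster.
curl -s -u <cloud-api-key>:<cloud-api-secret> \
  -H 'Content-Type: application/json' \
  -d '{
        "aggregations": [{"metric": "<mirror-lag-metric>"}],
        "filter": {"field": "resource.kafka.id", "op": "EQ", "value": "<DR-cluster-id>"},
        "granularity": "PT1M",
        "intervals": ["<start-time>/<end-time>"]
      }' \
  https://api.telemetry.confluent.cloud/v2/metrics/cloud/query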
Viewing Lag in the CLI¶
In the Confluent Cloud CLI, there are two ways to see a mirror topic’s lag at a given point in time:
- ccloud kafka mirror list lists all mirror topics on the destination cluster, and includes a column called Max Per Partition Mirror Lag, which shows the maximum lag among each mirror topic’s partitions. You can filter on a specific cluster link or mirror topic status with the --link and --mirror-status flags.
- ccloud kafka mirror describe <mirror-topic> --link <link-name> shows detailed information about each of that mirror topic’s partitions, including a column called Partition Mirror Lag, which shows each partition’s estimated lag.
Querying for Lag in the REST API¶
The Confluent Community REST API returns a list of a mirror topic’s partitions and lag at these endpoints:
- /kafka/v3/clusters/<destination-cluster-id>/links/<link-name>/mirrors returns all mirror topics for the cluster link.
- /kafka/v3/clusters/<destination-cluster-id>/links/<link-name>/mirrors/<mirror-topic> returns only the specified mirror topic.
The list of partitions and lags takes this format:
"mirror_lags": [
{
"partition": 0,
"lag": 24
},
{
"partition": 1,
"lag": 42
},
{
"partition": 2,
"lag": 15
},
...
],
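For example, you can query the first endpoint with curl. This is a sketch: <DR-rest-endpoint> stands for the RestEndpoint value reported by ccloud kafka cluster describe for the DR cluster, and the API key is one created for that cluster.

# List all mirror topics (and their per-partition lag) over the cluster link.
curl -s -u <DR-api-key>:<DR-api-secret> \
  "<DR-rest-endpoint>/kafka/v3/clusters/<destination-cluster-id>/links/<link-name>/mirrors"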
Tutorial¶
To explore this use case, you will use Cluster Linking to fail over a topic and a simple command-line consumer and producer from an original cluster to a Disaster Recovery (“DR”) cluster.
Prerequisites¶
Got Confluent Cloud? Make sure it’s up-to-date. If you already have Confluent Cloud installed, just use ccloud update to get the latest version of the Confluent Cloud CLI with new Cluster Linking commands and tools.
- Make sure you have followed the steps under Commands and Prerequisites in the overview. These steps tell you the easiest way to get an up-to-date version of Confluent Cloud if you don’t already have it, and provide a quick overview of Cluster Linking commands.
- The DR cluster must be a Dedicated cluster with public internet endpoints.
- The Original cluster can be a Basic, Standard, or Dedicated cluster with public internet endpoints. If you do not have these clusters already, you can create them in the Confluent Cloud UI or in the Confluent Cloud CLI.
Note
You can use failover between an eligible Confluent Platform cluster and an eligible Confluent Cloud cluster. You will need to use the Confluent Cloud CLI for the Confluent Cloud cluster, and the Confluent CLI for the Confluent Platform cluster.
What the Tutorial Covers¶
This tutorial demonstrates use of the Confluent Cloud CLI Cluster Linking commands to create a DR cluster and fail over to it.
A REST API for Cluster Linking commands may be available in future releases.
You will start by building a cluster link to the DR cluster (destination) and mirroring all pertinent topics, ACLs, and consumer group offsets. This is the “steady state” setup.
Then you will simulate an outage on the original (source) cluster. When this happens, the producers, consumers, and cluster link will not be able to interact with the original cluster.
You will then call a failover command that converts the mirror topics on the DR cluster into regular topics.
Finally, you will move producers and consumers over to the DR cluster, and continue operations. The DR cluster has become your new source of truth.
Set up Steady State¶
Create or choose the clusters you want to use¶
Log on to the Confluent Cloud web UI.
If you do not already have clusters created, create two clusters in the same environment, as described in Create a Cluster in Confluent Cloud.
At least one of these must be a Dedicated cluster, which serves as the DR cluster (the “destination”, which will host the mirror topics).
The original or source cluster can be any type of cluster. To make these quickly distinguishable, you might want to make the source cluster a Basic or Standard cluster, and/or name these ORIGINAL and DR.
Keep notes of required information¶
Tip
To keep track of this information, you may find it easiest to simply save the output of the Confluent Cloud commands to a text file and/or use shell environment variables. If you do so, be sure to safeguard API keys and secrets afterwards by deleting files, or moving only the security codes to safer storage. For details on how to do this, and other time-savers, see Pro Tips for the CLI in the overview.
For this walkthrough, you will need the following details accessible through Confluent Cloud commands.
The cluster ID of your original cluster (<original-cluster-id> for the purposes of this tutorial). To get cluster IDs on Confluent Cloud, type the following at the Confluent Cloud CLI. In the examples below, the original cluster ID is lkc-xkd1g.

ccloud kafka cluster list
The bootstrap server for the original cluster. To get this, type the following command, replacing <original-cluster-id> with the cluster ID for your source cluster.

ccloud kafka cluster describe <original-cluster-id>
Your output should resemble the following. In the example output below, the bootstrap server is the value SASL_SSL://pkc-9kyp5.us-east-1.aws.confluent.cloud:9092. This value is referred to as <original-bootstrap-server> in this tutorial.

+--------------+---------------------------------------------------------+
| Id           | lkc-xkd1g                                               |
| Name         | AWS US                                                  |
| Type         | DEDICATED                                               |
| Ingress      | 50                                                      |
| Egress       | 150                                                     |
| Storage      | Infinite                                                |
| Provider     | aws                                                     |
| Availability | single-zone                                             |
| Region       | us-east-1                                               |
| Status       | UP                                                      |
| Endpoint     | SASL_SSL://pkc-9kyp5.us-east-1.aws.confluent.cloud:9092 |
| ApiEndpoint  | https://pkac-rrwjk.us-east-1.aws.confluent.cloud        |
| RestEndpoint | https://pkc-9kyp5.us-east-1.aws.confluent.cloud:443     |
| ClusterSize  | 1                                                       |
+--------------+---------------------------------------------------------+
The cluster ID of your DR cluster (<DR-cluster-id>). You can get this in the same way as your original cluster ID, with this command:

ccloud kafka cluster list
The bootstrap server of your DR cluster (<DR-bootstrap-server>). You can get this in the same way as your original cluster bootstrap server, by using the describe command, replacing <DR-cluster-id> with your destination cluster ID.

ccloud kafka cluster describe <DR-cluster-id>
Create a cluster link between the original and the DR cluster¶
Start by creating a cluster link that mirrors topics, ACLs, and consumer group offsets from the source cluster to the destination cluster.
Set up privileges for the cluster link to access topics on the source cluster¶
Your cluster link needs privileges to read the appropriate topics on your source cluster. To grant these privileges, you create two resources:
- A service account for the cluster link. Service accounts are used in Confluent Cloud to group together applications and entities that need access to your Confluent Cloud resources.
- An API key and secret that is associated with the cluster link’s service account and the source cluster. The link will use this API key to authenticate with the source cluster when it is fetching topic information and messages. A service account can have many API keys, but for this tutorial, you need only one.
To create these resources, do the following:
Create a service account for this cluster link.
ccloud service-account create Cluster-Linking-Demo --description "For the cluster link created for the DR failover tutorial"
Your output should resemble the following.
+-------------+-----------+
| Id          | 254122    |
| Resource ID | sa-lqxn16 |
| Name        | ...       |
| Description | ...       |
+-------------+-----------+
Save the ID field (<service-account-id> for the purposes of this tutorial).

Create the API key and secret.
ccloud api-key create --resource <original-cluster-id> --service-account <service-account-id>
Note
Store this key and secret somewhere safe. When you create the cluster link, you must supply it with this API key and secret, which will be stored on the cluster link itself.
Allow the cluster link to read topics on the source cluster. Give the cluster link’s service account the ACLs to be able to READ and DESCRIBE_CONFIGS for all topics.
ccloud kafka acl create --allow --service-account <service-account-id> --operation READ --operation DESCRIBE_CONFIGS --topic "*" --cluster <original-cluster-id>
Tip
The example above allows read access to all topics by using the asterisk (--topic "*") instead of specifying particular topics. If you wish, you can narrow this down to a specific set of topics to mirror. For example, to allow the cluster link to read all topics that begin with a “clicks” prefix, you can do this:

ccloud kafka acl create --allow --service-account <service-account-id> --operation READ --operation DESCRIBE_CONFIGS --topic "clicks" --prefix --cluster <original-cluster-id>
Provide the capability to sync ACLs from the source to the destination cluster.
This allows consumers, producers, and other services to continue running, even in the event of a failure.
To do this, you must give the cluster link’s service account an ACL to DESCRIBE the source cluster:
ccloud kafka acl create --allow --service-account <service-account-id> --operation DESCRIBE --cluster-scope --cluster <original-cluster-id>
Provide the capability to sync consumer group offsets for mirror topics over this cluster link, so that consumers can pick up at the offset at which they left off.
There are two sets of ACLs your clusters need for this.
Give the cluster link’s service account the appropriate ACLs to DESCRIBE topics, and to READ and DESCRIBE consumer groups on the source (original) cluster.
ccloud kafka acl create --allow --service-account <service-account-id> --operation DESCRIBE --topic "*" --cluster <original-cluster-id>
ccloud kafka acl create --allow --service-account <service-account-id> --operation READ --operation DESCRIBE --consumer-group "*" --cluster <original-cluster-id>
Give the cluster link’s service account ACLs to READ and ALTER topics on the destination (DR) cluster, and ACLs to READ its consumer groups.
ccloud kafka acl create --allow --service-account <service-account-id> --operation READ --operation ALTER --topic "*" --cluster <DR-cluster-id>
ccloud kafka acl create --allow --service-account <service-account-id> --operation READ --consumer-group "*" --cluster <DR-cluster-id>
Create the cluster link¶
Create a configuration file that turns on ACL sync, consumer group offset sync, and includes the security credentials for the link.
To do this, copy the following lines into a new file called dr-link.config, and then replace <api-key> and <api-secret> with the key and secret you just created.

consumer.offset.sync.enable=true
consumer.offset.group.filters={"groupFilters": [{"name": "*","patternType": "LITERAL","filterType": "INCLUDE"}]}
consumer.offset.sync.ms=1000
acl.sync.enable=true
acl.sync.ms=1000
acl.filters={ "aclFilters": [ { "resourceFilter": { "resourceType": "any", "patternType": "any" }, "accessFilter": { "operation": "any", "permissionType": "any" } } ] }
topic.config.sync.ms=1000
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="<api-key>" password="<api-secret>";
A few notes on this configuration, which does the following:
- Syncs offsets for all consumer groups (*) for these mirror topics. You can filter these down by passing in either exact group names ("patternType": "LITERAL") or prefixes ("patternType": "PREFIX") to either include ("filterType": "INCLUDE") or exclude ("filterType": "EXCLUDE").
- Syncs all ACLs that are on the Original cluster (*) so that consumers, producers, and other services can access the topics on the DR cluster. Unlike the consumer group offset sync, this is not limited to mirror topics, and may sync ACLs for other topics. You can filter these down in the same way, using LITERAL or PREFIX pattern types with INCLUDE or EXCLUDE filter types.
- Syncs consumer offsets, ACLs, and topic configurations every 1000 milliseconds. This gives the link more up-to-date values for these, at the expense of higher data throughput over the link and potentially lower maximum throughput for the topic data. We recommend trying different values to see what works best for your specific cluster link.
- The last line in the file starts with sasl.jaas.config and ends with a semicolon (;). This must be all on one line, as shown.
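For example, if you only wanted to sync offsets for the consumer groups used in this tutorial, whose names start with cli-, the group filter line could use a prefix filter instead. This is a sketch based on the filter syntax shown above:

consumer.offset.group.filters={"groupFilters": [{"name": "cli-","patternType": "PREFIX","filterType": "INCLUDE"}]}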
Create the cluster link as shown below, with the command ccloud kafka link create <flags>. In this example, the cluster link is called dr-link.

ccloud kafka link create dr-link \
  --cluster <DR-cluster-id> \
  --source-cluster-id <original-cluster-id> \
  --source-bootstrap-server <original-bootstrap-server> \
  --config-file dr-link.config
If this is successful, you should get this message:
Created cluster link "dr-link".
Also, you can verify that the link was created by listing existing links:
ccloud kafka link list --cluster <DR-cluster-id>
Configure the source topic on the original cluster and mirror topic on the DR cluster¶
Create a topic called dr-topic on the original cluster.

For the sake of this demo, create this topic with only one partition (--partitions 1). Having only one partition makes it easier to notice how the consumer offsets are synced from the original cluster to the DR cluster.

ccloud kafka topic create dr-topic --partitions 1 --cluster <original-cluster-id>
You should see the message Created topic "dr-topic", as shown in the example below.

> ccloud kafka topic create dr-topic --partitions 1 --cluster lkc-xkd1g
Created topic "dr-topic".
You can verify this by listing topics.
ccloud kafka topic list --cluster <original-cluster-id>
Create a mirror topic of dr-topic on the DR cluster.

ccloud kafka mirror create dr-topic --link <link-name> --cluster <DR-cluster-id>
You should see the message Created mirror topic "dr-topic", as shown in the example below.

> ccloud kafka mirror create dr-topic --link dr-link --cluster lkc-r68yp
Created mirror topic "dr-topic".
Tip
In the current preview release, you must create each mirror topic one-by-one with a CLI or API command. Future releases may add an option for Cluster Linking to automatically create mirror topics when new topics are created on your source cluster, optionally filtered by a topic prefix. That feature would be useful for automatically creating DR topics on a DR cluster.
At this point, you have a topic on your original cluster that is mirroring all of its data, ACLs, and consumer group offsets to a mirror topic on your DR cluster.
Produce and consume some data on the original cluster¶
In this section, you will simulate an application that is producing data to and consuming data from your original cluster. You will use the Confluent Cloud CLI’s produce and consume functions to do so.
Create a service account to represent your CLI based demo clients.
ccloud service-account create CLI --description "From CLI"
Your output should resemble:
+-------------+-----------+
| Id          | 254262    |
| Resource ID | sa-ldr3w1 |
| Name        | CLI       |
| Description | From CLI  |
+-------------+-----------+
In the above example, the <cli-service-account-id> is 254262.

Create an API key and secret for this service account on your original cluster, and save these as <original-CLI-api-key> and <original-CLI-api-secret>.

ccloud api-key create --resource <original-cluster-id> --service-account <cli-service-account-id>
Create an API key and secret for this service account on your DR cluster, and save these as <DR-CLI-api-key> and <DR-CLI-api-secret>.

ccloud api-key create --resource <DR-cluster-id> --service-account <cli-service-account-id>
Give your CLI service account enough ACLs to produce and consume messages on your Original cluster.
ccloud kafka acl create --service-account <cli-service-account-id> --allow --operation READ --operation DESCRIBE --operation WRITE --topic "*" --cluster <original-cluster-id>
ccloud kafka acl create --service-account <cli-service-account-id> --allow --operation DESCRIBE --operation READ --consumer-group "*" --cluster <original-cluster-id>
Now you can produce and consume some data on the original cluster.
Tell your CLI to use your original API key on your original cluster.
ccloud api-key use <original-CLI-api-key> --resource <original-cluster-id>
Produce the numbers 1-5 to your topic.

seq 1 5 | ccloud kafka topic produce dr-topic --cluster <original-cluster-id>
You should see this output, but the command should complete without needing you to press ^C or ^D.
Starting Kafka Producer. ^C or ^D to exit
Tip
If you get an error message indicating unable to connect to Kafka cluster, wait for a minute or two, then try again. For recently created Kafka clusters and API keys, it may take a few minutes before the resources are ready.

Start a CLI consumer to read from the dr-topic topic, and give it the name cli-consumer. As a part of this command, pass in the flag --from-beginning to tell the consumer to start from offset 0.

ccloud kafka topic consume dr-topic --group cli-consumer --from-beginning
After the consumer reads all 5 messages, press Ctrl + C to quit the consumer.
To observe how consumers pick up from the correct offset after a failover, create some artificial consumer lag: leave the consumer stopped while you produce more messages.
Produce numbers 6-10 to your topic.
seq 6 10 | ccloud kafka topic produce dr-topic
You should see the following output, but the command should complete without needing you to press ^C or ^D.
Starting Kafka Producer. ^C or ^D to exit
Now, you have produced 10 messages to your topic on your original cluster, but your cli-consumer has only consumed 5.
Monitoring mirroring lag¶
Because Cluster Linking is an asynchronous process, there may be mirroring lag between the source cluster and the destination cluster.
You can see what your mirroring lag is on a per-partition basis for your DR topic with this command:
ccloud kafka mirror describe dr-topic --link dr-link --cluster <DR-cluster-id>
LinkName | MirrorTopicName | Partition | PartitionMirrorLag | SourceTopicName | MirrorStatus | StatusTimeMs
+-----------+-----------------+-----------+--------------------+-----------------+--------------+---------------+
dr-link | dr-topic | 0 | 0 | dr-topic | ACTIVE | 1624030963587
You can also monitor your lag and your mirroring metrics through the Confluent Cloud Metrics API. These two metrics are exposed:
- MaxLag shows the maximum lag (in number of messages) among the partitions that are being mirrored. It is available on a per-topic and a per-link basis. This gives you a sense of how much data would exist only on the Original cluster at the point of failover.
- Mirroring Throughput shows on a per-link or per-topic basis how much data is being mirrored.
List cluster links and mirror topics¶
At various points in a workflow, it may be useful to list Cluster Linking resources such as links or mirror topics. You might want to do this for monitoring purposes, or before starting a failover or migration.
To list the cluster links on the active cluster:
ccloud kafka link list
You can get a list of mirror topics on a link, or on a cluster.
To list the mirror topics on a specified cluster link:
ccloud kafka mirror list --link <link-name> --cluster <cluster-id>
To list all mirror topics on a particular cluster:
ccloud kafka mirror list --cluster <cluster-id>
Simulate a failover to the DR cluster¶
In a disaster event, your original cluster is usually unreachable.
In this section, you will go through the steps you follow on the DR cluster in order to resume operations.
Perform a dry run of a failover to preview the results without actually executing the command. To do this, simply add the --dry-run flag to the end of the command.

ccloud kafka mirror failover <mirror-topic-name> --link <link-name> --cluster <DR-cluster-id> --dry-run
For example:
ccloud kafka mirror failover dr-topic --link dr-link --cluster <DR-cluster-id> --dry-run
Stop the mirror topic to convert it to a normal, writeable topic.
ccloud kafka mirror failover <mirror-topic-name> --link <link-name> --cluster <DR-cluster-id>
For this example, the mirror topic name and link name will be as follows.
ccloud kafka mirror failover dr-topic --link dr-link --cluster <DR-cluster-id>
Expected output:
MirrorTopicName | Partition | PartitionMirrorLag | ErrorMessage | ErrorCode
-------------------------------------------------------------------------
dr-topic        | 0         | 0                  |              |
The failover command is irreversible. Once you change your mirror topic to a regular topic, you cannot change it back to a mirror topic. If you want it to be a mirror topic once again, you will need to delete it and recreate it as a mirror topic, as sketched below.
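The following sketch shows what that recreation could look like, assuming a cluster link from the new primary cluster back to this one. Deleting the topic removes its data on this cluster, so use it with care.

# Delete the regular topic, then recreate it as a mirror topic over a link.
ccloud kafka topic delete dr-topic --cluster <DR-cluster-id>
ccloud kafka mirror create dr-topic --link <link-name> --cluster <DR-cluster-id>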
Now you can produce and consume data on the DR cluster.
Set your CLI to use the DR cluster’s API key:
ccloud api-key use <DR-CLI-api-key> --resource <DR-cluster-id>
Produce numbers 11-15 on the topic, to show that it is a writeable topic.
seq 11 15 | ccloud kafka topic produce dr-topic --cluster <DR-cluster-id>
You should see this output, but the command should complete without needing you to press ^C or ^D.
Starting Kafka Producer. ^C or ^D to exit
“Move” your consumer group to the DR cluster, and consume from dr-topic on the DR cluster.

ccloud kafka topic consume dr-topic --group cli-consumer --cluster <DR-cluster-id>
You should expect the consumer to start consuming at number 6, since that’s where it left off on the original cluster. If it does, that will show that its consumer offset was correctly synced. It should consume through number 15, which is the last message you produced on the DR cluster.
After you see number 15, hit Ctrl + C to quit your consumer.
Expected output:
Starting Kafka Consumer. ^C or ^D to exit
6
7
8
9
10
11
12
13
14
15
^CStopping Consumer.
You have now failed over your CLI producer and consumer to the DR cluster, where they continued operations smoothly.
Suggested Resources¶
- This tutorial covered a specific use case for disaster recovery. Share Data Across Clusters, Regions, and Clouds provides a tutorial on data sharing across topics which may be in the same or different clusters, regions, and clouds. This is another basic use case for Cluster Linking.
- Mirror Topics provides a concept overview of this feature of Cluster Linking.