Skip to main content

How it works

Health checks

Temporal Cloud automates failovers with internal health checks. This process monitors request error rates, latencies, and infrastructure issues that could cause service disruptions, such as request timeouts. Temporal Cloud triggers failovers when these indicators exceed our allowed thresholds.

High availability Namespaces use asynchronous replication. Temporal Cloud transmits Workflow updates from the active Namespace, along with associated History events, to the standby replica with a short delay. This delay is called the replication lag. Temporal Cloud aims to maintain a P95 replication delay of less than 1 minute. In this context, P95 means 95% of requests are processed faster than the specified limit.

Replication lags are a concern because:

  • Replication lags may lead to a forced failover, potentially causing Workflows to rollback in progress.
  • Lags may cause recently started Workflows to be temporarily unavailable until the Namespace recovers. Temporal event versioning and conflict resolution mechanisms ensure the Workflow Event History can be replayed. Critical operations like Signals won’t be lost.

Types of Failover scenarios

The Temporal Cloud failover mechanism supports several modes for executing Namespace failovers. These modes include graceful failover ("handover"), forced failover, and a hybrid mode. The hybrid mode is Temporal Cloud’s default Namespace behavior. The following sections describe each style.

Graceful failover (handover)

In this mode, Temporal Cloud fully processes and drains replication tasks. Temporal Cloud pauses traffic to the Namespace before the failover. Graceful failover prevents the loss of progress and avoids data conflicts.

The Namespace experiences a short period of unavailability, defaulting to 10 seconds. During this period:

  • Existing Workflows stop progress.
  • Temporal Cloud returns a "Service unavailable error". This error is retryable by the Temporal SDKs.
  • State transitions will not happen and tasks are not dispatched.
  • User requests like start/signal Workflow are rejected.
  • Operations are paused during handover.

This mode favors consistency over availability.

Forced failover

In this mode, Temporal Cloud immediately activates the replica in the standby Namespace. Events that weren't replicated due to replication lag undergo conflict resolution when they reach the new active Namespace.

This mode prioritizes availability over consistency.

Hybrid failover mode

While graceful failovers are preferred for consistency, they aren’t always practical. Temporal Cloud’s hybrid failover mode (the default mode) limits the initial graceful failover attempt to 10 seconds or less.

During this period:

  • Existing Workflows stop progress.
  • Temporal Cloud returns a "Service unavailable error", which is retried by SDKs.

If the graceful approach doesn’t resolve the issue, Temporal Cloud automatically switches to a forced failover.

This strategy balances consistency and availability requirements.

Metadata replication

Updates to high availability Namespace records automatically duplicate to the replica. This metadata includes configurations such as retention periods, Search Attributes, and other settings. Temporal Cloud ensures that all isolation domains and regions will eventually share a consistent and unified view of the Namespace metadata.

info

A Namespace failover that changes the identifier for the active element field of a Namespace record is an update. This update is replicated via the Namespace metadata mechanism.

Workflow Execution replication

Temporal Cloud restricts certain Workflow operations to the active region:

  • You may only update Workflows in the active Namespace.
  • You may only dispatch Workflow Tasks and Activity Tasks from the active Namespace. Forward progress in a Workflow Execution can therefore only be made in the active Namespace.

These limits mean that certain requests, such as Start Workflow and Signal Workflow, are processed by and limited to the active Namespace. Standby replicas may receive API requests from Clients and Workers. They automatically forward these requests to the active Namespace for execution. As Workflow Executions progress and are operated on, replication tasks created in the active Namespace are dispatched to the standby replica. Processing these replication tasks ensures that the standby replica undergoes the same state transitions as the active Namespace. This enables replicated tasks to synchronize and achieve the same state as the original tasks.

Standby replicas do not distribute Workflow or Activity Tasks. Instead, they perform verification tasks to confirm that intended operations are executed so Workflows reach the desired state. This mechanism ensures consistency and reliability in the replication process across Temporal regions.

Failover downtime and DNS updates

High availability Namespaces provide an “all-active” experience for Temporal users. This helps limit or eliminate downtime during Namespace failover. There's a short time window from when a standby replica becomes the active Namespace to when Clients and Workers receive a DNS update. During this time requests forward from the now passive (formerly active) replica Namespace to the newly active (formerly standby replica) Namespace.

Conflict Resolution

High availability Namespaces rely on asynchronous event replication across Temporal isolation domains and regions. In the event of a non-graceful failover across regions, replication lag may result in a temporary setback in Workflow progress.

Namespaces that do not participate in high availability can be configured to provide at-most-once semantics for Activities execution (when Maximum Attempts is set to 0). High availability Namespaces provide at-least-once semantics for execution of Activities. Completed Activities may be re-dispatched in a newly active region, leading to repeated executions.

When a Workflow Execution is updated in a new Namespace following a failover, events from the previously active Namespace that arrive after the failover can't be directly applied. At this point, Temporal Cloud has forked the Workflow History.

After failover, Temporal Cloud creates a new branch history for execution, and begins its conflict resolution process. The Temporal Service ensures that Workflow Histories remain valid and are replayable by SDKs post-failover or after conflict resolution. This capability is crucial for Workflow Executions to continue their forward progress.

danger

Design your activities to succeed once and only once. This "idempotent" approach avoids process duplication that could withdraw money twice or ship extra orders by mistake. Run-once actions maintain data integrity and prevent costly errors. Idempotency keeps operations from producing additional effects. Protect your processes from accidental or repeated actions for more reliable execution.

Needs new images
Before failoverAfter failover
Before failoverAfter failover

The failover process

A failover shifts Workflow Execution processing from an active Temporal Namespace to a standby replica during outages or other incidents. Standby replicas duplicate data and prevent data loss during failover.

What happens during the failover process?

Temporal Cloud initiates a Namespace failover when it detects an incident or outage that raises error rates or latency in the active region of a high availability Namespace. The failover shifts Workflow processing to a replica that isn’t affected by the incident. This lets existing Workflows continue and new Workflows start while the incident is fixed. Once the incident is resolved, Temporal Cloud performs a "failback" by shifting Workflow Execution processing back to the original Namespace.

info

You can test the failover of your high availability Namespace by manually triggering a failover using the UI page or the 'tcld' CLI utility. In most scenarios, we recommend you let Temporal handle failovers for you.