Failed envelope preprocessing and postprocessing

Incident Report for SignatureAPI

Postmortem

Incident report: envelope processing failures during a queue infrastructure upgrade

Date: 2026-06-17

Status: Resolved

Customer impact window: ~16:03–16:10 UTC (about 7 minutes)

Summary

During a planned upgrade of the infrastructure that powers our processing queues, a brief failover opened a short window in which some envelopes could not be moved between processing stages. Hundreds of live envelopes failed during that window.

We have recovered every envelope that had already been signed. A smaller set of envelopes that failed at creation time was not recovered automatically and needs to be submitted again. No signed documents or signature data were lost.

What happened

We upgraded the cache and queue infrastructure behind the API. The upgrade included a failover from one node to another. The failover itself took only a few seconds, but client connections took a little longer to reconnect.

During that short window, the API could not hand envelopes to the processing queue, so those envelopes were marked as failed instead of continuing.

All other API operations worked as usual.

Impact

The failures fall into two groups:

  • Already-signed envelopes (recovered): envelopes where every recipient had finished signing but the final document had not yet been generated. We re-processed all of these. The signed documents were generated and delivered to recipients. No action is needed from you. These were delayed by roughly 30 minutes; there was no data loss.
  • Envelopes that failed at creation (need to be re-submitted): envelopes that failed before processing started were never created. There is nothing to recover, so these need to be created again.

The API itself stayed up throughout. Only envelopes created or finalized in the ~7-minute window were affected.

Root cause

The upgrade triggered a failover. While clients reconnected to the new node, the API briefly could not enqueue work. The API treated these short-lived enqueue errors as permanent and failed the affected envelopes, instead of retrying once the connection recovered.
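
To illustrate the distinction between short-lived and permanent enqueue errors, here is a minimal Python sketch. It is an illustration under stated assumptions, not the actual implementation: the names `TransientEnqueueError` and `enqueue_with_retry`, the number of attempts, and the delays are all hypothetical.

```python
import time


class TransientEnqueueError(Exception):
    """Hypothetical error for a short-lived failure, e.g. a dropped connection."""


def enqueue_with_retry(enqueue, envelope_id, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call enqueue(envelope_id), retrying transient errors with exponential backoff.

    Any other exception propagates immediately and is treated as permanent.
    """
    for attempt in range(attempts):
        try:
            return enqueue(envelope_id)
        except TransientEnqueueError:
            if attempt == attempts - 1:
                raise  # out of retries: only now give up
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
```

With a policy along these lines, an enqueue error during a reconnect of a few seconds would be absorbed by the retries, and the envelope would be marked failed only if the queue stayed unreachable for the whole retry budget.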

How this was tested first

We did not apply this change directly to production. The underlying change was rolled out to our staging environment first. There, we deliberately rehearsed the same node failover the upgrade performs and confirmed two things: the service stayed available, and the system reconnected and recovered on its own.

What the staging rehearsal did not reproduce was the specific behavior that hit production: an envelope being marked failed instead of retrying. That only happens when live envelopes are moving through the queue at the exact moment of the failover, and our staging rehearsal was not carrying live traffic. Closing that gap — both in how we enqueue work and in how we rehearse — is the main fix below.
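
The gap can be shown with a small simulation. The Python sketch below is hypothetical and is not our staging tooling: `FakeQueue` stands in for the real queue, one tick stands in for a unit of time, and `rehearse` submits one synthetic envelope per tick so that traffic is flowing when the failover happens.

```python
class FakeQueue:
    """Hypothetical in-memory queue that rejects work during a failover window."""

    def __init__(self, down_from, down_until):
        self.down_from, self.down_until = down_from, down_until
        self.items = []

    def enqueue(self, item, now):
        if self.down_from <= now < self.down_until:
            raise ConnectionError("node failing over")
        self.items.append(item)


def rehearse(queue, ticks, retry):
    """Submit one synthetic envelope per tick; return the ids marked failed."""
    failed, pending = [], []
    for now in range(ticks):
        pending.append(f"env_{now}")
        still_pending = []
        for env in pending:
            try:
                queue.enqueue(env, now)
            except ConnectionError:
                # With retry, keep the envelope for the next tick; without, fail it.
                (still_pending if retry else failed).append(env)
        pending = still_pending
    return failed + pending  # anything still unsent at the end counts as failed
```

Running `rehearse(FakeQueue(3, 6), 10, retry=False)` returns the three envelopes submitted during the outage, while the same run with `retry=True` returns an empty list and all ten envelopes reach the queue. A rehearsal with no envelopes in flight during ticks 3 to 5 would pass in both cases, which is why it cannot tell the two behaviors apart.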

What we are doing

  • Making envelope enqueueing resilient to brief connection blips, so a short failover causes a small delay rather than a failure.
  • Improving how our clients reconnect after a failover so the disruption window is shorter.
  • Reviewing the timing of future maintenance to favor low-traffic windows.
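
For the second item, one common approach is to look up the queue endpoint again on every reconnect attempt instead of reusing an address cached at startup, since the status updates for this incident note that some workers and brokers kept connecting to the old cluster's addresses. The Python sketch below shows the idea only. `connect_with_refresh`, `resolve`, and `connect` are hypothetical names, and the attempt count and delay are placeholders.

```python
import time


def connect_with_refresh(resolve, connect, attempts=5, delay=1.0, sleep=time.sleep):
    """Reconnect by looking the endpoint up again before every attempt.

    resolve() returns the current address; connect(address) returns a
    connection or raises ConnectionError. Both are hypothetical callables.
    """
    last_error = None
    for _ in range(attempts):
        address = resolve()  # fresh lookup, so a failover to a new node is picked up
        try:
            return connect(address)
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)
    raise ConnectionError(f"could not reconnect after {attempts} attempts") from last_error
```

The design choice here is that the lookup sits inside the loop: a client that resolves once and retries the same address can keep failing for as long as it runs, whereas a client that resolves per attempt recovers as soon as the lookup starts returning the new node.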

We are sorry for the disruption. If you have questions about a specific envelope, contact support and we will help.

Posted Jun 17, 2026 - 16:59 UTC

Resolved

The fix has been applied, and we are updating the affected envelopes.
Posted Jun 17, 2026 - 16:18 UTC

Update

We are continuing to monitor for any further issues.
Posted Jun 17, 2026 - 16:18 UTC

Monitoring

We've restarted the affected workers and identified every affected envelope. Envelopes that failed in postprocessing will be recovered automatically — no action needed. Envelopes that failed in preprocessing were not created; please submit them again. We're monitoring to confirm full recovery.
Posted Jun 17, 2026 - 16:15 UTC

Identified

We've identified the cause. During the upgrade, some queue workers and brokers kept connecting to the old cluster's addresses instead of the new ones. We're restarting them to restore normal processing.
Posted Jun 17, 2026 - 16:14 UTC

Investigating

We're upgrading our queue system. During the upgrade, some envelopes are failing in preprocessing and postprocessing. We're investigating. Envelopes affected in postprocessing will be recovered.
Posted Jun 17, 2026 - 16:09 UTC
This incident affected: API.