
Failure Behaviors

During cloud provider outages, network outages, or even just routine maintenance, there can be periods of downtime. Intentional or unintentional, this downtime presents the question: "What happens to my flows?"

This page is intended to describe behavior in the following phases:

  • Prior to an outage
  • During an outage
  • After the outage has cleared

Prior to Outage

Before an outage or maintenance window occurs, the client is communicating with the API, either through flow runs or through the worker / agent. State transitions are occurring, health checks are polling, and so on.
During normal operational behavior, the client will automatically retry operations based on the PREFECT_CLIENT_MAX_RETRIES and PREFECT_CLIENT_RETRY_EXTRA_CODES environment variables (sketched below). Essentially, this is "business as usual", and there is no consideration or expectation of a failure yet. To discuss what happens during and after an outage, however, let's identify a few different scenarios you might encounter.
Because of the nature of the client-server interactions with the API, we will target three common flow-run scenarios on the client side, and three scenarios on the server side.
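
As a minimal sketch, assuming a recent Prefect 2.x / 3.x environment, these settings can be tuned ahead of a known-risky window by setting them in the environment where the flow run or worker executes. The values below are illustrative only, not recommendations:

    import os

    # Illustrative values only: raise the retry ceiling ahead of a risky window,
    # and additionally retry on 502 responses.
    os.environ["PREFECT_CLIENT_MAX_RETRIES"] = "10"
    os.environ["PREFECT_CLIENT_RETRY_EXTRA_CODES"] = "502"

    # These must be present in the process environment (for example, set via the
    # worker's job template or deployment infrastructure) before the Prefect
    # client is instantiated.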

Client Side

Scenario 1 - A flow run that does model training has begun executing at 11:00am GMT. This model typically takes 2 hours to execute, and you anticipate it completing by 1:00pm GMT.
Scenario 2 - A flow run that does business analytics is scheduled to begin executing at 12:15pm GMT. This flow-run only takes 5 minutes to complete.
Scenario 3 - A flow run that is event-driven, triggered as reports are dropped into an S3 bucket. An event triggers this flow at 11:59am GMT, and it takes minimal time to complete.

Server Side

Scenario A - An accidental networking change breaks all traffic routing to the API. The outage begins at exactly 12:00pm GMT for some odd reason, and is resolved by 2:00pm GMT.
Scenario B - Regular scheduled maintenance occurs to vacuum the database, perform a database migration, and update to the latest version of Prefect Self-Hosted. The maintenance begins at exactly 12:00pm GMT, and only takes 15 minutes.
Scenario C - A node was evicted from the cluster, and some of your pods are in an inconsistent state. This again occurs at exactly 12:00pm GMT, but is quickly remediated by scaling in a new node. The outage also lasts only 15 minutes.

What happens?

1A.

The model training tasks begin to fail, as the API is no longer reachable. While the core business logic executing the training can succeed, the state transitions to mark a task or flow as completed will fail. These failed requests can be retried N times based on PREFECT_CLIENT_MAX_RETRIES. Once those retries have been exhausted, the flow will crash as a consequence of the API being unreachable.

At this time, there are a few considerations.
The flow was in a RUNNING state at the time of the outage, and is unable to update its state from there. As a consequence of the outage, the executing environment will likely have wrapped up the flow, however the UI and API will still show it in a RUNNING state. With no heartbeat mechanism within the flow-run, the API cannot safely determine whether the flow-run is genuinely still RUNNING or not. It becomes necessary to manually cancel these "stuck" flow-runs.
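
As a rough sketch of that cleanup, assuming the Prefect 2.x Python client (the filter and state classes below come from that API; further scoping is left to your environment):

    import asyncio

    from prefect.client.orchestration import get_client
    from prefect.client.schemas.filters import (
        FlowRunFilter,
        FlowRunFilterState,
        FlowRunFilterStateType,
    )
    from prefect.client.schemas.objects import StateType
    from prefect.states import Cancelled


    async def cancel_stuck_runs():
        # Find flow runs the API still believes are RUNNING, then force them into
        # a terminal state so they are no longer "stuck" in the UI. In practice you
        # would scope this further (for example, to runs started before the outage).
        async with get_client() as client:
            stuck = await client.read_flow_runs(
                flow_run_filter=FlowRunFilter(
                    state=FlowRunFilterState(
                        type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                    )
                )
            )
            for run in stuck:
                await client.set_flow_run_state(
                    flow_run_id=run.id,
                    state=Cancelled(message="Cancelled after API outage"),
                    force=True,
                )


    asyncio.run(cancel_stuck_runs())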

From here, how much data is lost, and how much can be recovered?
If results are not persisted, then the full execution is lost and will need to be re-run from the beginning once the outage is resolved. If results ARE persisted, then re-running the flow will pick up the persisted and cached results, resuming from where it left off prior to the outage.
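
A minimal sketch of opting in to result persistence and caching, assuming Prefect 2.x decorator options (persist_result, cache_key_fn) and a hypothetical train_model task standing in for the real training logic:

    from datetime import timedelta

    from prefect import flow, task
    from prefect.tasks import task_input_hash


    # Persist and cache this (hypothetical) expensive task so that a re-run after
    # an outage can reuse its stored result instead of recomputing it.
    @task(
        persist_result=True,
        cache_key_fn=task_input_hash,
        cache_expiration=timedelta(days=1),
    )
    def train_model(dataset_uri: str) -> str:
        ...  # expensive training logic
        return "s3://example-bucket/models/model.pkl"  # illustrative artifact location


    @flow(persist_result=True)
    def training_pipeline(dataset_uri: str = "s3://example-bucket/data/train.csv"):
        return train_model(dataset_uri)

With this in place, re-running the flow after the outage allows completed tasks to be served from their cached, persisted results rather than executed again.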

2A.

The flow run would be placed in a SCHEDULED state prior to the outage.
Once the outage begins, state transitions will not occur; however, since the flow run already exists in a SCHEDULED state, it will simply remain there until the outage has been remediated.

Once the outage has cleared, the client in the execution environment will be able to successfully retrieve the flow-run (albeit in a LATE state, as the current time will be later than the scheduled time) and begin execution. There is no data loss in this scenario, only a delay in execution.

3A.

11:59am GMT was chosen for this scenario because it can present a race condition. If the event is triggered, the flow-run begins executing at 11:59am, and transitions to a COMPLETED state prior to the outage, then it simply ran to completion before the outage occurred.

In a different outcome, however, perhaps there is latency in accessing the S3 bucket, or the upload of the completed report takes a bit longer. In this situation, the core business logic can complete successfully (the report is successfully analyzed, processed, and uploaded), but the outage occurs on the final task of the upload. Consequently, the flow run has "completed" from a business perspective, but is unable to report that state to the API.
The outcome here is a flow-run that is stuck in RUNNING on the API / UI, has completed from an artifact / business perspective, and has crashed from the client's perspective due to the inability to transition state.

Once the outage has been remediated, the flow-run can be updated to a COMPLETED state and the artifact retrieved. Alternatively, the flow-run can be set to a CANCELLING state and re-run. While a re-run might be a trivial operation in this case, for a more expensive compute operation it may be desirable to persist results so that previously computed work is not repeated.
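
As a sketch of the first option, assuming the Prefect 2.x Python client and a hypothetical flow run ID copied from the UI:

    import asyncio
    from uuid import UUID

    from prefect.client.orchestration import get_client
    from prefect.states import Completed

    # Hypothetical ID of the flow run stuck in RUNNING; use the real ID from the UI.
    STUCK_FLOW_RUN_ID = UUID("00000000-0000-0000-0000-000000000000")


    async def mark_completed():
        # Force the stuck flow run into COMPLETED now that the business logic
        # (the report upload) is known to have finished.
        async with get_client() as client:
            await client.set_flow_run_state(
                flow_run_id=STUCK_FLOW_RUN_ID,
                state=Completed(message="Report uploaded before the outage"),
                force=True,
            )


    asyncio.run(mark_completed())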

1B.

During regularly scheduled maintenance, Istio gateway rules can (and should) be updated to add Retry-After headers, beginning at 12:00pm GMT. This will present as a 503 HTTP error, with the Retry-After field set to a duration in seconds. The client executing the flow-run, as well as the worker, have core logic to handle this header and will effectively wait before trying again.

In the context of your training model, the executing environment will remain provisioned during this period. If the Retry-After duration is 60 seconds, you can expect to receive roughly 15 503 errors over the maintenance window before the rule is removed, at which point execution simply resumes. There is no data loss experienced, only a maintenance delay.
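
To illustrate the Retry-After handling described above, a simplified stand-in (not Prefect's actual client code) that backs off for the number of seconds the server advertises might look like this; the endpoint URL is hypothetical:

    import time

    import httpx

    API_URL = "https://prefect.example.internal/api/health"  # hypothetical endpoint


    def request_with_retry_after(url: str, max_attempts: int = 20) -> httpx.Response:
        # Retry on 503s, sleeping for the number of seconds advertised in the
        # Retry-After header (falling back to 60 seconds if it is missing).
        for attempt in range(max_attempts):
            response = httpx.get(url)
            if response.status_code != 503:
                return response
            wait_seconds = int(response.headers.get("Retry-After", "60"))
            time.sleep(wait_seconds)
        raise RuntimeError(f"API still unavailable after {max_attempts} attempts")


    print(request_with_retry_after(API_URL).status_code)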

2B (or not 2B).

Timing is of the essence here. The flow-run will be in a SCHEDULED state prior to the maintenance period. The worker will likely attempt to submit the flow-run for execution, and will either succeed once the rules are lifted, or receive at least one 503 before then.

Once the rules have been lifted, execution fully resumes. In this scenario it is possible that no errors are experienced at all, only a brief delay from the scheduled start time.

3B.

This will be very similar to 3A in nature up to the point of completion. In 3A, we have an outage with an undetermined time of resolution, and no induced client-side behavior; as a result, 3A can crash client-side even after completing the core logic. For 3B, however, we assume the core logic has been able to complete, and only the final state transition to COMPLETED cannot be made. The request is returned a 503 with the Retry-After header, and the flow-run then waits the appropriate period of time before trying again.
Once the maintenance window has completed and the headers are removed, the flow-run successfully transitions to the COMPLETED state.

1C.

In this scenario, the execution environment will begin receiving failures to reach the API, similar to 1A, up to the number of retries allowed by PREFECT_CLIENT_MAX_RETRIES. Because the outage here is only 15 minutes in duration, the outcome is fairly binary: either the flow crashes (similar to 1A) because it exhausts its retry attempts, or the API comes back up within the remaining retry window and execution continues.
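
As a back-of-the-envelope illustration of that window, assuming (purely for illustration, these are not Prefect's documented backoff values) exponential backoff starting at 2 seconds and capped at 60 seconds per attempt:

    # Illustrative only: estimate the worst-case window covered by client retries,
    # assuming exponential backoff of 2, 4, 8, ... seconds capped at 60s per attempt.
    def retry_window_seconds(max_retries: int, base: float = 2.0, cap: float = 60.0) -> float:
        return sum(min(base * (2 ** attempt), cap) for attempt in range(max_retries))


    for retries in (5, 10, 20):
        minutes = retry_window_seconds(retries) / 60
        print(f"{retries} retries cover roughly {minutes:.1f} minutes of outage")

Under these assumed parameters, a handful of retries covers only a minute or two, while a substantially higher PREFECT_CLIENT_MAX_RETRIES could span the full 15-minute outage; the actual backoff schedule in your Prefect version determines the real numbers.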

2C.

As the flow-run is in a SCHEDULED state, it simply does not get submitted for execution until the outage is over and connectivity to the API is re-established.

3C.

Very similar in nature to 1C. While the total execution time is very low, the key factor in a successful flow-run is whether the API is restored to full operation before the client-side retries are exhausted. Similar to 3A, if the API is not restored in time, the output and core logic might have executed successfully, but the flow run remains stuck in a RUNNING state within the UI.

Takeaways

The most important factors in determining whether a flow run fails are:

  • The state the flow run was in when the outage began: SCHEDULED runs simply wait, while RUNNING runs risk crashing.
  • The duration of the outage relative to the client's retry window, governed by PREFECT_CLIENT_MAX_RETRIES.
  • Whether the outage is planned, so that 503 responses with Retry-After headers can be served instead of hard connection failures.

The most important factors in determining data loss, and resuming execution, are:

  • Whether results are persisted and cached, so that a re-run can resume from previously completed work rather than starting from the beginning.
  • Whether flow runs left stuck in a RUNNING state are identified and resolved (cancelled, re-run, or manually transitioned to COMPLETED) once the outage clears.