TLDR: Limited outage of ‘State Stays’ checks between 11:00 AM and 3:00 PM CST today, 2023-08-29. The issue is now resolved.
Summary:
At 2:30 PM CST today, we became aware of an issue affecting a subset of ‘state-stays’ checks. The issue was isolated to one server, which has since been removed from our cluster. Normal operation has been restored.
Impact:
If you had rules set to trigger between 11:00 AM and 3:00 PM CST today, any ‘State Stays’ checks or delayed tasks that ran on the affected server may have failed to execute. Rules triggered after 3:00 PM CST should be working as usual.
Rules that self-reschedule (e.g., ‘loops’) may need to be manually restarted.
Note that self-scheduling rules are not officially supported, but I'm calling this out because some community members use this approach. -Josh
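If you keep your own record of when your rules fired, a quick way to shortlist the ones worth checking is to compare each trigger time against the window above. This is only a rough sketch: the function name and the sample times are hypothetical, the times are assumed to be naive local (CST) wall-clock values, and a match only means the rule may have been affected, since only one server had the problem.

```python
from datetime import datetime

# Outage window from this post, as naive local (CST) wall-clock times (assumption).
OUTAGE_START = datetime(2023, 8, 29, 11, 0)
OUTAGE_END = datetime(2023, 8, 29, 15, 0)


def may_be_affected(trigger_time: datetime) -> bool:
    """Return True if a rule triggered inside the outage window.

    A True result is only a hint: the rule was affected only if it
    happened to run on the problematic server.
    """
    return OUTAGE_START <= trigger_time <= OUTAGE_END


# Hypothetical trigger times for illustration.
print(may_be_affected(datetime(2023, 8, 29, 12, 15)))  # inside the window
print(may_be_affected(datetime(2023, 8, 29, 16, 0)))   # after the window
```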
Next Steps:
We apologize for any inconvenience this may have caused. We are investigating why the problematic server wasn’t automatically flagged and are working on automated fixes should this issue recur.
TLDR: Recurrence of the Timer Scheduler issue on Sunday, 2023-10-22. We’ve added new checks to catch future occurrences. Some rules may need to be manually reset.
Issue Recurrence:
The ‘State Stays’ issue reappeared on Sunday, 2023-10-22, affecting certain timer and state-stays rules.
Previous Fixes:
The health check added in September didn’t catch this issue. This is partly because we rely on a core Google Service that’s difficult to simulate for testing.
New Fixes:
We’ve set up a two-tiered health check system:
Readiness Check: Runs every 5 seconds. If it fails twice, the affected server is removed.
Liveness Check: Runs every 30 seconds. If it fails four times, a new server is spawned.
Both checks have backup systems for added reliability.
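To make the thresholds above concrete, here is a minimal sketch of how a failure-count check like this could behave. It is not our actual implementation: the `HealthCheck` class and its method names are hypothetical, and it assumes failures are counted consecutively (a successful probe resets the count), which the description above does not specify.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Hypothetical model of one tier of the health check system."""
    name: str
    interval_seconds: int
    failure_threshold: int
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True once the threshold is reached."""
        if healthy:
            # Assumption: a passing probe resets the failure count.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold

    def seconds_to_trigger(self) -> int:
        """Approximate upper bound on the time from first fault to action,
        ignoring probe timeouts and scheduling jitter."""
        return self.interval_seconds * self.failure_threshold


# Intervals and thresholds taken from the description above.
readiness = HealthCheck("readiness", interval_seconds=5, failure_threshold=2)
liveness = HealthCheck("liveness", interval_seconds=30, failure_threshold=4)

for check in (readiness, liveness):
    print(f"{check.name}: acts within about {check.seconds_to_trigger()} seconds")
```

Under these assumptions, a failing server would be pulled from rotation within roughly 10 seconds, and a replacement would be spawned within roughly 2 minutes if the server stays unhealthy.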
Next Steps:
We are identifying the affected rules to automatically reschedule them. For now, you can manually reset any impacted timers by simply opening and saving the associated rule.
We’re also monitoring the new checks closely and will tweak them as needed.