Account for state only changes in LastWriteVersion (#5872)
## What changed?
- Account for state only changes in LastWriteVersion
- As a consequence, LastWriteVersion can change after the workflow is closed.
- NOTE: this PR is stacked on top of https://github.com/temporalio/temporal/pull/5860
## Why?
- Account for state only changes
## How did you test it?
- Existing tests & new unit tests
## What changed?
- Refresh sub state machine tasks
## Why?
- Refresh sub state machine tasks
## How did you test it?
- Added unit test
Allow updating workflow current version after close (#5860)
## What changed?
- Allow updating workflow current version after close
## Why?
- The workflow state machine can still be updated after the workflow has
closed from the user's point of view.
## How did you test it?
- Existing tests & added new unit tests
Move update proto to persistence proto package (#5940)
## What changed?
- Move update proto to persistence proto package
## Why?
- Avoid a cyclic dependency when later adding proto messages defined in
the persistence package (specifically, `VersionedTransition`) to
UpdateInfo.
## How did you test it?
- Existing test
Fix schedule_action_delay metric for buffered actions (#5884)
## What changed?
- Calculate the "delay" for buffered actions from the close of the previous action.
- Update a few logs.
## Why?
If an action is held up by waiting for a previous action to finish (e.g.
BufferOne, BufferAll, CancelOther overlap policies), it's not fair to
count the waiting time as the "delay" for that action.
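The adjusted calculation could be sketched roughly as below. This is an illustration under stated assumptions, not the code from the PR: `actionDelay`, `startedAt`, `actualTime`, `desiredTime`, and the `at` helper are hypothetical names, "actual time" is taken to mean the time the action was due to fire, and the desired time is assumed to be set to the previous action's close time only when the action had to wait behind it.

```go
package main

import (
	"fmt"
	"time"
)

// actionDelay returns the delay to report for an action.
// desiredTime is nil unless the action was blocked behind a previous one,
// in which case it is assumed to hold that previous action's close time.
// When it is nil, the calculation falls back to actualTime.
func actionDelay(startedAt, actualTime time.Time, desiredTime *time.Time) time.Duration {
	base := actualTime
	if desiredTime != nil {
		base = *desiredTime
	}
	return startedAt.Sub(base)
}

// at is a small helper for the example; it parses a wall-clock time.
func at(clock string) time.Time {
	t, err := time.Parse("15:04:05", clock)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	actual := at("10:00:00")
	started := at("10:00:07")
	prevClose := at("10:00:05")

	// Not buffered: the delay is measured from the actual time.
	fmt.Println(actionDelay(started, actual, nil)) // 7s

	// Buffered behind a previous action that closed at 10:00:05:
	// only the time after that close counts as delay.
	fmt.Println(actionDelay(started, actual, &prevClose)) // 2s
}
```

Using a nil pointer for the missing value mirrors the fallback described under "Potential risks": a missing close time or desired time simply yields the same result as before the change.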
## How did you test it?
Tested locally by building up a backlog and checking that the metric
didn't increase (and did before this change). Also tested upgrade and
downgrade.
## Potential risks
This is touching workflow code, but is compatible because:
- If it sees nil for the close time or desired time, that's fine, it
just falls back to actual time.
- All the behavior changes are in logs and metrics only.
## What changed?
Fixing some flaky tests.
797bbdf...
by
Tim Deeb-Swihart <email address hidden>
Reconnect to SQL databases when connections fail (#5926)
## What changed?
Both our PostgreSQL and MySQL database backends now automatically
reconnect to the database when certain errors occur. Every error on the
list was observed while testing this behavior through an AWS Aurora RDS
failover of either MySQL or PostgreSQL.
For both backends we will reconnect when we see:
- `ECONNRESET`
- `ECONNABORTED`
- `ECONNREFUSED`
- `io.EOF`
- `io.ErrUnexpectedEOF`
- `database/sql/driver.ErrBadConn`
For PostgreSQL we will also reconnect on the following SQLSTATEs:
- `25006` read-only transaction
- `57P03` cannot connect now
- `0A000` feature not supported, but ONLY when the message is `cannot
set transaction read-write mode during recovery`
For MySQL we will also reconnect when we see the following error codes:
- `1040` too many connections
- `1792` read-only transaction (SQLstate `25006`)
- `1836` running in read-only mode
This logic is easily extensible should we discover more failure modes
over time.
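The classification listed above could be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names are hypothetical, the caller is assumed to have already extracted the SQLSTATE and message (PostgreSQL) or the error number (MySQL) from the driver-specific error type, and the `0A000` message check is assumed to be an exact match.

```go
package main

import (
	"database/sql/driver"
	"errors"
	"fmt"
	"io"
	"syscall"
)

// commonReconnectErrors holds the backend-independent errors from the list.
var commonReconnectErrors = []error{
	syscall.ECONNRESET,
	syscall.ECONNABORTED,
	syscall.ECONNREFUSED,
	io.EOF,
	io.ErrUnexpectedEOF,
	driver.ErrBadConn,
}

// isCommonReconnectError reports whether err is, or wraps, one of the
// shared errors. (Hypothetical helper name.)
func isCommonReconnectError(err error) bool {
	for _, target := range commonReconnectErrors {
		if errors.Is(err, target) {
			return true
		}
	}
	return false
}

// isPostgresReconnectState checks the PostgreSQL-specific SQLSTATEs.
// 0A000 only counts when accompanied by the recovery message.
func isPostgresReconnectState(sqlState, message string) bool {
	switch sqlState {
	case "25006", "57P03":
		return true
	case "0A000":
		return message == "cannot set transaction read-write mode during recovery"
	}
	return false
}

// isMySQLReconnectCode checks the MySQL-specific error numbers.
func isMySQLReconnectCode(number uint16) bool {
	switch number {
	case 1040, 1792, 1836:
		return true
	}
	return false
}

func main() {
	// A wrapped connection reset is still recognized through errors.Is.
	wrapped := fmt.Errorf("query failed: %w", syscall.ECONNRESET)
	fmt.Println(isCommonReconnectError(wrapped))

	// 0A000 with an unrelated message does not trigger a reconnect.
	fmt.Println(isPostgresReconnectState("0A000", "some other unsupported feature"))

	fmt.Println(isMySQLReconnectCode(1836))
}
```

Keeping each group in its own small function is one way to keep the list easy to extend: a newly discovered failure mode becomes one more entry in a slice or one more `case`.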
## Why?
We've had multiple community reports of Temporal problems during RDS
failover. One part of this is the fact that we wouldn't necessarily
reconnect; we were at the whims of our chosen SQL abstraction's
connection pooling logic.
## How did you test it?
I manually tested this functionality in the presence of repeated RDS
failovers:
- [x] postgres12 plugin with pq driver
- [x] postgres12 plugin with pgx driver
- [x] mysql plugin
Automated testing will be added to our regular testing pipelines once
our infrastructure friends have added the support I need (it's in
progress).
## Potential risks
We're concerned there's a correctness issue in our PostgreSQL backend
that's related to our behavior during an RDS failover. If we merge this
before I figure out what's going on, we could hide the issue and make it
harder to reproduce.
## What changed?
Run tests in cloud branch
## Why?
Verify that the cherry-picked commits in the cloud branch are good.