While deploying a change for T375821 I encountered a strange situation where the pipeline failed several times while trying to overwrite an existing checkpoint.
It appears that, for some reason, Flink decided to resume from an old checkpoint, and its attempts to write subsequent checkpoints then failed because those checkpoints already existed.
According to the documentation, upgradeMode=savepoint is the safer option, and I don't see a good reason not to use it.
AC:
- agree to use upgradeMode=savepoint or document the reason why we prefer last-state
- update the helm values to start using it
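
For reference, a minimal sketch of what the change could look like at the FlinkDeployment level, assuming the job is managed by the Flink Kubernetes Operator. The exact key path in our helm values may differ, and the savepoint directory shown here is a placeholder, not our real location:

```yaml
# Sketch only: fragment of a FlinkDeployment spec, not the actual chart values.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
spec:
  flinkConfiguration:
    # Savepoint upgrades need a savepoint directory to be configured.
    # Placeholder path, replace with the real storage location.
    state.savepoints.dir: s3://example-bucket/savepoints
  job:
    # Take a savepoint on upgrade and restore from it, instead of last-state.
    upgradeMode: savepoint
```

With this mode the operator takes a savepoint when the job is suspended for an upgrade and restores from it on restart, so the job has to be running and healthy at upgrade time.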