Unable to restart uncompleted downsampling tasks in ES 8.13 and above #106880
Labels
>bug
:StorageEngine/Downsampling
Downsampling (replacement for rollups) - Turn fine-grained time-based data into coarser-grained data
Team:StorageEngine
Comments
Pinging @elastic/es-storage-engine (Team:StorageEngine)
dnhatn pushed a commit that referenced this issue on Mar 29, 2024
Missing a check on the transport version results in unreadable cluster state if it includes a serialized instance of DownsampleShardTaskParams. #98023 introduced an optional string array including dimensions used by time series indices. Reading an optional array requires first reading a boolean that indicates whether an array of values exists in serialized form. From 8.13 on we try to read such a boolean, which is not there because older versions write neither a boolean nor a string array. Here we include the version check for backward compatibility, skipping the read of the boolean and the array whenever they cannot be present. Customers using downsampling might have cluster states including such serialized objects and would be unable to upgrade to version 8.13. They will be able to upgrade to any version including this fix. This fix has a side effect #106880
elasticsearchmachine pushed a commit that referenced this issue on Mar 29, 2024
…06896) Missing a check on the transport version results in unreadable cluster state if it includes a serialized instance of DownsampleShardTaskParams. #98023 introduced an optional string array including dimensions used by time series indices. Reading an optional array requires first reading a boolean that indicates whether an array of values exists in serialized form. From 8.13 on we try to read such a boolean, which is not there because older versions write neither a boolean nor a string array. Here we include the version check for backward compatibility, skipping the read of the boolean and the array whenever they cannot be present. Customers using downsampling might have cluster states including such serialized objects and would be unable to upgrade to version 8.13. They will be able to upgrade to any version including this fix. This fix has a side effect #106880 Co-authored-by: Salvatore Campagna
This was fixed by #106878.
Elasticsearch Version
8.13 and above
Installed Plugins
No response
Java Version
bundled
OS Version
all
Problem Description
PR #97557 introduced DownsampleShardTaskParams, a data structure used by our persistent task framework to store task-specific data; in this case, data for the tasks started when a downsampling operation is carried out.
PR #98023 introduced an array of strings, dimensions, which stores the set of dimensions defined for the original index the downsampling task is operating on. This is required because with TSID hashing we lose the ability to recover dimensions just by decoding the _tsid field, so we need to store them unencoded somewhere else to support resuming interrupted persistent tasks.
Adding the new dimensions string array changes the format of the wire protocol that we use when serialising and deserialising instances of objects like DownsampleShardTaskParams. This kind of change requires code to handle backward compatibility with nodes running older versions of Elasticsearch, which "speak" a different version of the wire protocol. That check is missing (this is the bug!). As a result, newer versions of Elasticsearch unconditionally try to read a boolean and then, if the boolean is true, an array of strings (dimensions), ignoring the fact that the boolean and the string array might not be there. Older versions of Elasticsearch do not serialise the boolean or the string array, since neither existed when those versions were released. This is why newer versions of Elasticsearch need the check on the wire protocol version and need to implement backward-compatible behaviour.
Moreover, instances of DownsampleShardTaskParams are serialised as part of the cluster state, which is written and read by nodes in the cluster and which must be readable by nodes running a newer version of Elasticsearch after an upgrade. This is why the upgrade process is affected.
The issue happens because a node running Elasticsearch older than 8.13 (8.10.x-8.12.x) writes a cluster state in which DownsampleShardTaskParams does not include the dimensions string array. Then, once nodes start moving to the new version as part of an upgrade to 8.13, deserialising the cluster state fails on the node running version 8.13 because the dimensions array is missing.
(NOTE: hopefully, a failure to deserialise the cluster state means the node running version 8.13 will never be able to join the cluster.)
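The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not use Elasticsearch's actual stream classes: it relies on plain java.io streams, and the class name, the two version constants, the checkVersion flag, and the choice of an empty array as the fallback for old payloads are all assumptions made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class VersionGatedReadSketch {
    // Hypothetical wire versions, standing in for real transport version constants.
    static final int V_BEFORE_DIMENSIONS = 1;
    static final int V_WITH_DIMENSIONS = 2;

    // Serialises a trimmed-down params object the way a node at `version` would.
    static byte[] write(String index, String[] dimensions, int version) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(index);
            if (version >= V_WITH_DIMENSIONS) {
                // Optional array: presence flag first, then length and values.
                out.writeBoolean(dimensions != null);
                if (dimensions != null) {
                    out.writeInt(dimensions.length);
                    for (String d : dimensions) {
                        out.writeUTF(d);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the dimensions back. With checkVersion == false this mimics the bug:
    // the presence flag is read even if the writer never wrote one.
    static String[] readDimensions(byte[] data, int writerVersion, boolean checkVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readUTF(); // index name, present in every version
            if (checkVersion && writerVersion < V_WITH_DIMENSIONS) {
                // Assumed fallback: older writers carry no dimensions at all.
                return new String[0];
            }
            if (!in.readBoolean()) {
                return new String[0];
            }
            String[] dims = new String[in.readInt()];
            for (int i = 0; i < dims.length; i++) {
                dims[i] = in.readUTF();
            }
            return dims;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] oldState = write("my-index", new String[] {"host"}, V_BEFORE_DIMENSIONS);
        try {
            readDimensions(oldState, V_BEFORE_DIMENSIONS, false);
            System.out.println("unchecked read succeeded");
        } catch (UncheckedIOException e) {
            System.out.println("unchecked read failed: " + e.getCause());
        }
        String[] dims = readDimensions(oldState, V_BEFORE_DIMENSIONS, true);
        System.out.println("checked read returned " + dims.length + " dimensions");
    }
}
```

In this sketch the version-gated read succeeds only by returning no dimensions for payloads written by older nodes. If the real fix behaves similarly, a task serialised before the upgrade would come back without its dimensions, which is consistent with the restart problem reported in this issue.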
Steps to Reproduce
Ideally, this can be reproduced by starting at least one downsampling task and then upgrading to version 8.13 while the task is still running. Note also that the executor will not restart such tasks, because the failure is unrecoverable.
Logs (if relevant)
No response