[QUESTION] create dataset to fit using a stride between consecutive samples equal to output_chunk_length #2621
Comments
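Hi @asmaletale,
The dataset is implemented this way because users generally want to have as many training samples as possible. By mimicking the inference stride during training, there is a considerable risk of introducing undesired bias in the model. Even if you train the model once a day, at a given hour, you ideally want to use all the historical data.
First of all, for Starting over from
# imports needed by the snippets in this comment (module paths assume the Darts release current when this issue was opened)
from typing import Optional, Sequence, Tuple, Union
import numpy as np
from darts import TimeSeries
from darts.models import TiDEModel
from darts.utils.data import MixedCovariatesSequentialDataset
from darts.utils.data.shifted_dataset import GenericShiftedDataset
from darts.utils.data.utils import CovariateType
from darts.utils.timeseries_generation import linear_timeseries
class CustomGSD_stride(GenericShiftedDataset):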
    def __init__(
        self,
        target_series: Union[TimeSeries, Sequence[TimeSeries]],
        covariates: Optional[Union[TimeSeries, Sequence[TimeSeries]]] = None,
        input_chunk_length: int = 12,
        output_chunk_length: int = 1,
        shift: int = 1,
        shift_covariates: bool = False,
        max_samples_per_ts: Optional[int] = None,
        covariate_type: CovariateType = CovariateType.NONE,
        use_static_covariates: bool = True,
        sample_weight: Optional[Union[TimeSeries, Sequence[TimeSeries], str]] = None,
    ):
        super().__init__(
            target_series=target_series,
            covariates=covariates,
            input_chunk_length=input_chunk_length,
            output_chunk_length=output_chunk_length,
            shift=shift,
            shift_covariates=shift_covariates,
            max_samples_per_ts=max_samples_per_ts,
            covariate_type=covariate_type,
            use_static_covariates=use_static_covariates,
            sample_weight=sample_weight,
        )
        self.stride = output_chunk_length
        # recompute max_samples_per_ts to take the stride into account
        self.max_samples_per_ts = max(
            (len(ts) - self.size_of_both_chunks) // self.stride + 1
            for ts in self.target_series
        )
        # update the attribute that depends on max_samples_per_ts
        self.ideal_nr_samples = len(self.target_series) * self.max_samples_per_ts

    def __getitem__(
        self, idx
    ) -> Tuple[
        np.ndarray,
        Optional[np.ndarray],
        Optional[np.ndarray],
        Optional[np.ndarray],
        np.ndarray,
    ]:
        target_idx = idx // self.max_samples_per_ts
        target_series = self.target_series[target_idx]
        target_vals = target_series.random_component_values(copy=False)
        # compute the number of samples in a given ts, taking the stride into account
        n_samples_in_ts = (len(target_vals) - self.size_of_both_chunks) // self.stride + 1
        # apply the stride to the idx conversion/mapping
        end_of_output_idx = (
            len(target_series)
            - ((idx - (target_idx * self.max_samples_per_ts)) % n_samples_in_ts) * self.stride
        )
        # [...] not changed
        return past_target, covariate, static_covariate, sample_weight, future_target
class CustomMCSD_stride(MixedCovariatesSequentialDataset):
    def __init__(
        self,
        target_series: Union[TimeSeries, Sequence[TimeSeries]],
        past_covariates: Optional[Union[TimeSeries, Sequence[TimeSeries]]] = None,
        future_covariates: Optional[Union[TimeSeries, Sequence[TimeSeries]]] = None,
        input_chunk_length: int = 12,
        output_chunk_length: int = 1,
        max_samples_per_ts: Optional[int] = None,
        use_static_covariates: bool = True,
        sample_weight: Optional[Union[TimeSeries, Sequence[TimeSeries], str]] = None,
    ):
        # Past dataset
        self.ds_past = CustomGSD_stride(
            target_series=target_series,
            covariates=past_covariates,
            input_chunk_length=input_chunk_length,
            output_chunk_length=output_chunk_length,
            # shift must be >= input_chunk_length or the features and targets will overlap
            shift=input_chunk_length,
            shift_covariates=False,
            max_samples_per_ts=max_samples_per_ts,
            covariate_type=CovariateType.PAST,
            use_static_covariates=use_static_covariates,
            sample_weight=sample_weight,
        )
        # [...] not changed
# input and output chunk lengths
icl, ocl = 5, 2
# the length of the target series will result in 2 samples
series = linear_timeseries(end_value=icl + ocl * 2 - 1, length=icl + ocl * 2).astype(np.float32)
fc1 = series + 10
fc2 = series + 100
# covariates need to be combined in a single series
fc = fc1.stack(fc2)
ds = CustomMCSD_stride(
    target_series=[series],
    past_covariates=None,
    future_covariates=None,  # [fc],
    input_chunk_length=icl,
    output_chunk_length=ocl,
    max_samples_per_ts=None,
    use_static_covariates=False,
    sample_weight=None,
)
print(len(ds))
>>> 2
for idx in range(len(ds)):
    print(ds[idx])
>>> (array([[2.],
[3.],
[4.],
[5.],
[6.]], dtype=float32), None, None, None, None, None, array([[7.],
[8.]], dtype=float32))
(array([[0.],
[1.],
[2.],
[3.],
[4.]], dtype=float32), None, None, None, None, None, array([[5.],
[6.]], dtype=float32))
model = TiDEModel(input_chunk_length=icl, output_chunk_length=ocl, n_epochs=3)
# works as expected
model.fit_from_dataset(ds)
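For readers who want to check the index arithmetic without installing Darts, the mapping used in `__getitem__` above can be reproduced with plain NumPy. This is only a sketch under assumptions: the function name `strided_windows` is hypothetical, it handles a single series with no covariates, and it assumes `shift == input_chunk_length`, so that `size_of_both_chunks` equals `icl + ocl`.

```python
import numpy as np


def strided_windows(values, icl, ocl, stride):
    """Hypothetical helper: list (past, future) windows, counted from the series end."""
    size_of_both_chunks = icl + ocl  # assumes shift == input_chunk_length
    n_samples = (len(values) - size_of_both_chunks) // stride + 1
    windows = []
    for idx in range(n_samples):
        # idx 0 is the most recent window; each further idx moves back by `stride`
        end_of_output_idx = len(values) - idx * stride
        start_of_output_idx = end_of_output_idx - ocl
        start_of_input_idx = start_of_output_idx - icl
        past = values[start_of_input_idx:start_of_output_idx]
        future = values[start_of_output_idx:end_of_output_idx]
        windows.append((past, future))
    return windows


icl, ocl = 5, 2
values = np.arange(icl + ocl * 2, dtype=np.float32)  # 0, 1, ..., 8
for past, future in strided_windows(values, icl, ocl, stride=ocl):
    print(past.tolist(), "->", future.tolist())
```

With these values the sketch yields two windows, `[2, 3, 4, 5, 6] -> [7, 8]` and `[0, 1, 2, 3, 4] -> [5, 6]`, which are the same two samples printed by the dataset above.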
Thank you @madtoinou for your quick and effective reply! Most of all, thank you for the provided example. Finally, can you please help me better understand what you mean by "undesired bias"? If I have, let's say, as a future covariate a forecast provided every day for the next 24 hours, wouldn't a 1-timestep stride during training create a target sample that has, for example, 23 hours related to the forecast provided on one day and 1 hour related to the forecast provided the next day? I am aware that there is no overlap in this specific case, and I believe that stride=1 helps to achieve generality here, but I thought that this was due to a sort of "controlled noise" provided to the model. Is this what you are referring to? Sorry for the inappropriate wording, I'm not really a data scientist :)
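To make the scenario in this question concrete, here is a small sketch. It assumes an hourly series, one forecast issued per day covering the next 24 hourly steps, and `output_chunk_length = 24`. The helper `issues_covered` is hypothetical and only counts how many hours of an output window fall under each daily forecast issue.

```python
from collections import Counter

HOURS_PER_ISSUE = 24  # assumption: one forecast per day, covering 24 hourly steps


def issues_covered(output_start, output_chunk_length=24):
    """Hypothetical helper: hours of the output window that belong to each forecast issue."""
    hours = range(output_start, output_start + output_chunk_length)
    return dict(Counter(h // HOURS_PER_ISSUE for h in hours))


# stride == output_chunk_length: every window lines up with exactly one issue
print(issues_covered(0))   # {0: 24}
print(issues_covered(24))  # {1: 24}

# stride == 1: the window starting one hour later straddles two issues
print(issues_covered(1))   # {0: 23, 1: 1}
```

The last line reproduces the "23 hours from one day, 1 hour from the next" split described in the question.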
No, you should be able to use Indeed, but then, you should have been able to use the code detailed in the issue you linked. Even so, the approach implemented above remains valid if you concatenate all the 24h forecasts used as future covariates.
Hello everyone, and thank you for this awesome library.
I'm currently working with weather data, for which I have a forecasted series every day at the same hour.
In the current state of my implementation, I'm using a mix of covariates (past and future) to train the model.
From my understanding, Darts by default shifts each new sample by one timestep from the previous one. A different behaviour can be achieved by creating a custom class (inheriting from GenericShiftedDataset), which is then wrapped in another custom class inheriting from MixedCovariatesSequentialDataset.
My goal is to create a dataset in which each sample (understood as a sequence of input_chunk_length+output_chunk_length time steps) is shifted from the previous one by a stride of output_chunk_length time steps. I want to evaluate how the model would perform if trained in the same way as I expect to call it at the inference stage, i.e. once a day, instead of the current training approach with stride=1.
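As a rough illustration of the two sampling schemes described above, the sketch below lists where each sample would start for a single series. It is plain Python and does not use Darts. `sample_starts` is a hypothetical helper, the numbers (three days of hourly data, 24-step chunks) are assumptions chosen for readability, and offsets are counted from the start of the series for simplicity.

```python
def sample_starts(series_length, input_chunk_length, output_chunk_length, stride):
    """Hypothetical helper: start index of every (input + output) sample."""
    sample_length = input_chunk_length + output_chunk_length
    last_start = series_length - sample_length
    return list(range(0, last_start + 1, stride))


length, icl, ocl = 72, 24, 24  # assumed: 3 days of hourly data

default = sample_starts(length, icl, ocl, stride=1)
daily = sample_starts(length, icl, ocl, stride=ocl)

print(len(default))  # 25
print(daily)         # [0, 24]
```

With stride=1 the same three days produce 25 overlapping samples, while a stride of output_chunk_length leaves only 2.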
I attempted a very basic implementation, following the example in #2421, which unfortunately does not work at all, and I'm not confident enough to fly solo:
Then I create the dataset and fit from the dataset.
The error I'm getting at this stage is "ValueError: The dataset contains past covariates that don't extend far enough. (index 13128-th sample)" at the fit stage. Please note that the time series creation and the train/validation split work perfectly with the default training approach (i.e. without the custom dataset, using the model's fit() method instead).
Any help or insight would be much appreciated, I'm a bit lost right now.
Thanks again for your work!