Feat/slice intersect multi series #2592

Merged

Conversation

ymatzkevich
Contributor

@ymatzkevich ymatzkevich commented Nov 12, 2024

Checklist before merging this PR:

  • Mentioned all issues that this PR fixes or addresses.
  • Summarized the updates of this PR under Summary.
  • Added an entry under Unreleased in the Changelog.

Fixes #2042.

Summary

The method TimeSeries.slice_intersect() (see documentation) intersects a TimeSeries with another one so that both end up with the same time indices. However, to intersect multiple series, that method has to be called several times, or the intersection has to be done by hand using e.g. xarray. The new function slice_intersect() introduced in this PR solves this for an arbitrary number of TimeSeries.

Essentially, given a list of TimeSeries with the same time index type, slice_intersect() returns the aligned list, in which all TimeSeries share the same start and end time (provided the intersection exists).
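To make the idea concrete, here is a minimal sketch on toy data. It is not the Darts implementation: each "series" is a plain dict mapping an integer time step to a value, and `intersect_toy_series` is a hypothetical helper name used only for illustration. It keeps, in every series, only the time steps that all series share.

```python
def intersect_toy_series(series_list):
    """Keep, in every toy series, only the time steps shared by all of them.

    Each toy series is a dict {time_step: value}. This is an illustrative
    stand-in for a TimeSeries, not the Darts data structure.
    """
    if not series_list:
        return []
    # iterating over a dict yields its keys, so this intersects the time steps
    common = set(series_list[0]).intersection(*series_list[1:])
    return [{t: s[t] for t in sorted(common)} for s in series_list]


a = {t: t * 10 for t in range(0, 11)}      # steps 0..10, step size 1
b = {t: t * 100 for t in range(1, 14, 2)}  # steps 1, 3, ..., 13, step size 2
c = {t: -t for t in range(3, 20)}          # steps 3..19, step size 1

aligned = intersect_toy_series([a, b, c])
print([list(s) for s in aligned])
```

After the call, all three toy series cover the same time steps, so they start and end at the same point. If the inputs share no time step, every returned dict is empty.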

Other Information

If the given TimeSeries do not all have the same time index type (e.g. some have a RangeIndex and some a DatetimeIndex), the function will raise an error.
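The kind of check described above could look roughly like the following sketch. It uses toy indexes (a `range` for integer steps, a list of `datetime` objects for timestamps) instead of pandas indexes, and the helper names, the `ValueError` type, and the error message are all assumptions made for illustration; the PR text does not state which exception Darts raises.

```python
from datetime import datetime, timedelta


def index_kind(index):
    """Classify a non-empty toy time index by the type of its first entry."""
    first = index[0]
    if isinstance(first, datetime):
        return "datetime"
    if isinstance(first, int):
        return "integer"
    raise TypeError(f"unsupported index entry type: {type(first).__name__}")


def check_same_index_kind(indexes):
    """Raise ValueError (an assumed error type) if the index kinds are mixed."""
    kinds = {index_kind(idx) for idx in indexes}
    if len(kinds) > 1:
        raise ValueError(f"mixed time index types: {sorted(kinds)}")
    return kinds.pop() if kinds else None


int_index = range(0, 5)
date_index = [datetime(2013, 1, 1) + timedelta(days=i) for i in range(5)]

print(check_same_index_kind([int_index, range(2, 8)]))  # all integer-based
try:
    check_same_index_kind([int_index, date_index])
except ValueError as err:
    print(f"rejected: {err}")
```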

@ymatzkevich ymatzkevich force-pushed the feat/slice_intersect_multi_series branch from 0ae1729 to b6f6812 Compare November 12, 2024 15:48

codecov bot commented Nov 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.16%. Comparing base (0a52490) to head (2ad4cf9).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2592      +/-   ##
==========================================
- Coverage   94.20%   94.16%   -0.05%     
==========================================
  Files         141      141              
  Lines       15491    15501      +10     
==========================================
+ Hits        14594    14596       +2     
- Misses        897      905       +8     


Collaborator

@dennisbader dennisbader left a comment

Thanks for this PR @ymatzkevich, it looks really good already 🚀
Just had some minor suggestions here and there. After that we can merge

@ymatzkevich
Contributor Author

ymatzkevich commented Jan 23, 2025

To compare the different options for implementing slice_intersect, I wrote the following benchmark:

import time
import itertools
import numpy as np
import pandas as pd

from darts import TimeSeries
from darts.timeseries import slice_intersect
from darts.utils.utils import generate_index

def helper_test_intersect(freq, is_mixed_freq: bool, N):
    start = pd.Timestamp("20130101") if isinstance(freq, str) else 0
    freq = pd.tseries.frequencies.to_offset(freq) if isinstance(freq, str) else freq

    # handle identical and mixed frequency setup
    if not is_mixed_freq:
        freq_other = freq
        n_steps = 11
    elif "2" not in str(freq):  # 1 or "1D"
        freq_other = freq * 2
        n_steps = 21
    else:  # 2 or "2D"
        freq_other = freq / 2
        n_steps = 11
    freq_other = int(freq_other) if isinstance(freq_other, float) else freq_other

    idx = generate_index(start=start, freq=freq, length=n_steps)
    end = idx[-1]

    # we construct 2 different series that will be used for the intersection
    startA = start
    endA = end
    idxA = generate_index(startA, endA, freq=freq_other)
    seriesA = TimeSeries.from_series(pd.Series(range(len(idxA)), index=idxA))

    startB = start + freq
    endB = startB + 6 * freq_other
    idxB = generate_index(startB, endB, freq=freq_other)
    seriesB = TimeSeries.from_series(pd.Series(range(len(idxB)), index=idxB))

    iterations = 100  # number of repetitions used to compute the mean time

    # perf_counter is a monotonic, high-resolution clock suited to benchmarking
    start_time = time.perf_counter()

    for _ in range(iterations):
        sequence = [seriesA, seriesB] * N
        # the result is discarded; only the time taken to compute it matters
        slice_intersect(sequence)

    end_time = time.perf_counter()
    time_taken = end_time - start_time
    mean_time = time_taken / iterations

    return mean_time

freq_list = ["D", "2D", 1, 2] # different types of frequencies 
is_mixed_freq_list = [False, True] # mixed frequencies
N = 5 # determines size of sequence that we are intersecting

combinations = list(itertools.product(freq_list, is_mixed_freq_list)) # we test all combinations
length = len(combinations) # number of combinations

total_time = 0
for i, (freq, is_mixed_freq) in enumerate(combinations):
    print(f"combination {i+1}/{length}: (freq,is_mixed_freq)=({freq},{is_mixed_freq})")
    total_time += helper_test_intersect(freq, is_mixed_freq, N)

mean_time = total_time / length

print(f"N={N}, mean_time={mean_time}")

The results of the benchmark are shown here (mean time, in seconds):

Method                                N=5       N=50      N=500
using TimeSeries.slice_intersect()    0.0021    0.0229    0.2374
using pandas intersection function    0.0031    0.0312    0.3275
using xarray.align                    0.0022    0.0225    0.2345

Using xarray.align for this task is comparable in speed to the current logic, but it does not preserve mixed frequencies properly and therefore does not pass all the unit tests. Intersecting the time indexes first with the pandas intersection function might have been expected to be faster, but it is still slower than the Darts implementation, which takes advantage of the TimeSeries structure.
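For anyone who wants to reproduce this kind of comparison, a slightly more robust timing pattern is sketched below using the standard library's `timeit` module. It is only a sketch: `mean_runtime` and `toy_workload` are hypothetical names, and the workload is a stand-in that would be swapped for a call to the function under test. Taking the best of several repeats reduces the influence of background noise on the reported mean.

```python
import timeit


def mean_runtime(func, iterations=100, repeats=5):
    """Return the best mean time per call (in seconds) over several repeats.

    `func` is called `iterations` times per repeat; the fastest repeat is
    used because slower repeats usually reflect interference from other
    processes rather than the cost of `func` itself.
    """
    totals = timeit.repeat(func, number=iterations, repeat=repeats)
    return min(totals) / iterations


def toy_workload():
    """Stand-in workload; replace with the call being benchmarked."""
    return sorted(range(1000, 0, -1))


print(f"mean_time={mean_runtime(toy_workload):.6f} s")
```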

Collaborator

@dennisbader dennisbader left a comment

Thanks a lot @ymatzkevich for this very nice PR and performance report 💯

Everything looked fine, and I took the opportunity to make some minor adaptations.
Now it's ready to be merged 🚀

@dennisbader dennisbader merged commit 1d7f0d1 into unit8co:master Jan 24, 2025
9 checks passed
Development

Successfully merging this pull request may close these issues.

Union function to find the intersection of time series
3 participants