CQT, iCQT, and VQT implementations and testing #3804

Open
wants to merge 47 commits into
base: main

Conversation


@d-dawg78 d-dawg78 commented Jun 27, 2024

Hey everyone,

I am happy to propose the addition of the CQT, iCQT, and VQT. The first two have been requested by issue 588. Since the CQT is a VQT with parameter gamma=0, I figured the VQT should be added to the package too. It also figures quite prominently in the research community, even as a time-frequency representation for neural networks. Here are a few important details.
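To illustrate why the CQT is the gamma=0 special case of the VQT, here is a minimal sketch. It assumes the common textbook model in which bin k has bandwidth alpha * f_k + gamma. The helper name `q_factors` and the choice alpha = 2^(1/B) - 1 are illustrative assumptions and are not taken from the code in this PR.

```python
def q_factors(f_min, n_bins, bins_per_octave, gamma):
    """Return Q = f_k / bandwidth_k for each bin (illustrative model only)."""
    # Assumed proportional-bandwidth constant; not taken from the PR.
    alpha = 2.0 ** (1.0 / bins_per_octave) - 1.0
    qs = []
    for k in range(n_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)  # geometric bin spacing
        bandwidth = alpha * f_k + gamma  # gamma widens the low-frequency filters
        qs.append(f_k / bandwidth)
    return qs


cqt_q = q_factors(32.703, 108, 12, gamma=0.0)
vqt_q = q_factors(32.703, 108, 12, gamma=20.0)
print(min(cqt_q), max(cqt_q))  # equal up to rounding: constant Q
print(vqt_q[0], vqt_q[-1])     # Q grows with frequency when gamma > 0
```

With gamma=0 every bin has the same Q, which is the defining property of the CQT; a positive gamma lowers Q in the low bins, shortening those filters.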

General

The proposed transforms follow, and are tested against, the librosa implementations. Because the algorithms rely on recursive sub-sampling and the two libraries use different resampling algorithms, the outputs gradually diverge from librosa's as the number of resampling iterations increases. The thresholds in the librosa comparison tests are relaxed accordingly. The librosa call being matched is the following:

librosa_vqt = vqt(
    y=<Y>,
    sr=<SAMPLE_RATE>,
    hop_length=<HOP_LENGTH>,
    fmin=<F_MIN>,
    n_bins=<N_BINS>,
    intervals="equal",
    gamma=<GAMMA>,
    bins_per_octave=<BINS_PER_OCTAVE>,
    tuning=0.,
    filter_scale=1,
    norm=1,
    sparsity=0.,
    window=<WINDOW>,
    scale=False,
    pad_mode="constant",
    res_type=<RES_TYPE>,
    dtype=<DTYPE>,
)

The <ARGUMENTS> (similar across all three transforms) are the ones that can be controlled in the proposed code. The others are hard-coded. In my opinion, they should stay that way to avoid unnecessary complexity. Future iterations of the transforms could expose some of these arguments, however, if requested by the community!

Tests

I was unable to make the transforms TorchScript-compatible; maybe this should be the focus of a future PR. Otherwise, I was able to run the tests on CPU but not on GPU, for installation reasons. Feel free to let me know if any tests are lacking.

Speed

On the audio snippet from here, over 100 iterations, with dtype=torch.float64:

VQT - torchaudio: 15.208; librosa 50.121 (seconds)
CQT - torchaudio: 15.188; librosa 47.686 (seconds)
iCQT - torchaudio: 7.029; librosa 200.069 (seconds)
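The benchmarking script itself is not shown in this PR, so the following is only a sketch of how such a 100-iteration measurement could be taken, followed by the speed-up ratios implied by the figures above. `time_calls` and `dummy_transform` are hypothetical names; the dummy workload stands in for the transform call being measured.

```python
import time


def time_calls(fn, n_iter=100):
    """Total wall-clock seconds spent calling fn() n_iter times."""
    start = time.perf_counter()
    for _ in range(n_iter):
        fn()
    return time.perf_counter() - start


def dummy_transform():
    # Stand-in workload; replace with the transform call being measured.
    return sum(i * i for i in range(1000))


elapsed = time_calls(dummy_transform, n_iter=100)
print(f"{elapsed:.3f} s over 100 iterations")

# Speed-ups implied by the reported totals: (torchaudio, librosa) in seconds.
reported = {
    "VQT": (15.208, 50.121),
    "CQT": (15.188, 47.686),
    "iCQT": (7.029, 200.069),
}
speedups = {name: lib / ta for name, (ta, lib) in reported.items()}
print(speedups)
```

By these numbers the forward transforms are a little over 3x faster than librosa, and the iCQT roughly 28x faster.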

Sanity Check

Here's an image of the CQT-gram generated using the following parameters:

SAMPLE_RATE = 44100
HOP_LENGTH = 512
F_MIN = 32.703
N_BINS = 108
BINS_PER_OCTAVE = 12

[Image: cqts]

The results are pretty much identical! Feel free to request changes or ask me any questions on this PR. I'll be happy to answer, and am excited to get these transforms to the package 🫡
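As a quick check on the parameters above, the sketch below computes how many octaves the bins span and where the last bin's center frequency falls relative to Nyquist. It assumes geometrically spaced bins, f_k = f_min * 2^(k / bins_per_octave); the helper name `cqt_layout` is hypothetical.

```python
def cqt_layout(sample_rate, f_min, n_bins, bins_per_octave):
    """Summarise the frequency range covered by geometrically spaced bins."""
    n_octaves = n_bins / bins_per_octave
    # Center frequency of the last (highest) bin.
    f_top = f_min * 2.0 ** ((n_bins - 1) / bins_per_octave)
    nyquist = sample_rate / 2.0
    return {
        "n_octaves": n_octaves,
        "f_top": f_top,
        "below_nyquist": f_top < nyquist,
    }


layout = cqt_layout(44100, 32.703, 108, 12)
print(layout)
```

For these settings the bins span 9 octaves and the top bin sits at roughly 15.8 kHz, below the 22.05 kHz Nyquist frequency.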


pytorch-bot bot commented Jun 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/audio/3804

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@d-dawg78 d-dawg78 marked this pull request as ready for review July 1, 2024 02:32
@d-dawg78 d-dawg78 requested a review from a team as a code owner July 1, 2024 02:32

zaptrem commented Jul 1, 2024

Awesome contribution! A bunch of torch.ones tensors are created on the CPU regardless of the input tensor's device. It would also be nice to have an inverse VQT function. Finally, do you know a set of parameters that results in perfect or nearly perfect reconstruction? I had to fiddle with the filter-length code to get something that was even close, but there is still a buzzing sound in the upper frequencies and increased loudness at the start and end. I also noticed that my 262,144-sample input to a CQT with hop_size 512 had an output size of 513 instead of 512 unless I set hop_size to 513, but that may be a result of the aforementioned fiddling.
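The device issue can be sketched as follows. This is a generic illustration of the pattern, not the code from this PR; the helper names `ones_cpu_only` and `ones_like_input` are hypothetical.

```python
import torch


def ones_cpu_only(length):
    # Problematic pattern: always allocated on the CPU with the default dtype.
    return torch.ones(length)


def ones_like_input(x, length):
    # Allocate on the same device and with the same dtype as the input tensor.
    return torch.ones(length, device=x.device, dtype=x.dtype)


x = torch.zeros(8, dtype=torch.float64)  # could equally live on a GPU
w = ones_like_input(x, 4)
print(w.device == x.device, w.dtype == x.dtype)
```

Passing `device` and `dtype` explicitly keeps helper tensors compatible with the input, so no implicit CPU/GPU mismatch arises when the transform runs on a GPU.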

Author

d-dawg78 commented Jul 1, 2024

Hey, here's to addressing the feedback ☝️

  1. Good catch on the torch.ones front - the most recent commit should address this issue.
  2. We are following the librosa VQT, CQT, and iCQT algorithms, and they opted not to implement the inverse VQT for good reason. I think we should do the same, at least for now.
  3. Here are parameters that led to decent waveform reconstruction on my end:
SAMPLE_RATE = 16000
HOP_LENGTH = 256
F_MIN = 32.703
N_BINS = 672
BINS_PER_OCTAVE = 96

Increasing the N_BINS and BINS_PER_OCTAVE accordingly increases CQT resolution, and by extension the reconstruction is much better 🙂

  4. I don't really have a good answer for this. It is probably a result of the set of parameters you're using?
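One possible explanation for the 513 versus 512 output size, offered only as an assumption: if the transform pads the signal so that frames are centered (as librosa's STFT-based routines do by default), the frame count is 1 + n_samples // hop_length. This is not confirmed anywhere in this thread, and `n_frames_centered` is a hypothetical helper.

```python
def n_frames_centered(n_samples, hop_length):
    """Frame count when the signal is padded so that frames are centered."""
    return 1 + n_samples // hop_length


print(n_frames_centered(262_144, 512))  # 513
print(n_frames_centered(262_144, 513))  # 512
```

Under that assumption both reported sizes are reproduced, which would make the extra frame a property of centered framing rather than of the modified filter lengths.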


zaptrem commented Jul 7, 2024

Here's an example of the high frequency artifacts/aliasing(?) in the reconstruction I can't get rid of (using your implementation without my adjustments):

sample_rate = 44100
hop_length = 512
f_min = 20
n_bins = 1280
bins_per_octave = 128

Original:
https://github.com/pytorch/audio/assets/1612230/3888b6b4-0695-4475-a89f-8db0bd22c552

Recon:
https://github.com/pytorch/audio/assets/1612230/e4d8934e-abcf-4419-a1ba-a09e0c562fc1

Even if I apply a low-pass filter to chop out freqs above 8000 before passing it into the above, I still get distortion when the bass beats:

recon.mp4

Author

d-dawg78 commented Jul 12, 2024

Hey,

Thank you for your patience - busy week! I managed to get decent reconstruction, without too many audible artefacts, using the following parameters, and without any transformations to the original signal:

SAMPLE_RATE = 44100
HOP_LENGTH = 256
F_MIN = 32.703
N_BINS = 1728
BINS_PER_OCTAVE = 192

There are two issues with the parameters you are using:

  1. Your f_min should be mapped to a note frequency, so that your CQT bins are properly aligned with the equal temperament tuning system.
  2. Your bins_per_octave parameter should be a multiple of 12, so that your frequency bins align with tones, semitones, etc.

Of course, feel free to play around with lower resolutions!
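The two recommendations above can be checked with a small sketch. It assumes 12-tone equal temperament with A4 = 440 Hz; the helper names and the 0.01-semitone tolerance are illustrative choices, and the checks reflect the advice in this comment rather than hard requirements of the transform.

```python
import math

A4_HZ = 440.0  # assumed reference tuning


def nearest_note_offset(freq_hz):
    """Distance in semitones from freq_hz to the nearest 12-TET note."""
    midi = 69.0 + 12.0 * math.log2(freq_hz / A4_HZ)
    return midi - round(midi)


def check_params(f_min, bins_per_octave, tol=0.01):
    return {
        "f_min_on_note": abs(nearest_note_offset(f_min)) < tol,
        "bpo_multiple_of_12": bins_per_octave % 12 == 0,
    }


print(check_params(32.703, 192))  # both True: 32.703 Hz is C1
print(check_params(20.0, 128))    # both False
```

Here 32.703 Hz lands on C1 (MIDI note 24), whereas 20 Hz falls almost halfway between two semitones and 128 is not a multiple of 12.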

icqt.mp4


zaptrem commented Aug 9, 2024

> Hey,
>
> Thank you for your patience - busy week! I managed to get decent reconstruction, without too many audible artefacts, using the following parameters, and without any transformations to the original signal:
>
> SAMPLE_RATE = 44100
> HOP_LENGTH = 256
> F_MIN = 32.703
> N_BINS = 1728
> BINS_PER_OCTAVE = 192
>
> There are two issues with the parameters you are using:
>
>   1. Your f_min should be mapped to a note frequency, so that your CQT bins are properly aligned with the equal temperament tuning system.
>   2. Your bins_per_octave parameter should be a multiple of 12, so that your frequency bins align with tones, semitones, etc.
>
> Of course, feel free to play around with lower resolutions!
>
> icqt.mp4

Thanks! However, this has 1728 * 256 size, whereas a normal spectrogram can store enough info for a perfect strong COLA reconstruction in only 512 * 256 numbers (or 1024 * 512, etc). Shouldn't a CQT be able to do the same without significant artifacting?
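The size comparison can be made concrete with a rough sketch. It reads the figures above as (number of bins, hop length), which is an interpretation rather than something stated in the thread, and counts complex coefficients produced per input sample. The STFT settings (n_fft = 512, hop = 256) are a hypothetical example and `coeffs_per_sample` is an illustrative helper.

```python
def coeffs_per_sample(n_bins, hop_length):
    """Complex coefficients produced per input sample (one frame per hop)."""
    return n_bins / hop_length


# CQT settings quoted above: 1728 bins, hop length 256.
cqt_redundancy = coeffs_per_sample(1728, 256)

# Hypothetical one-sided STFT with n_fft = 512 (257 bins), hop length 256.
stft_redundancy = coeffs_per_sample(512 // 2 + 1, 256)

print(cqt_redundancy, stft_redundancy)
```

Under this reading the quoted CQT settings produce 6.75 complex coefficients per sample, against about one per sample for the example STFT.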

3 participants