Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yet another vector rewrite #53

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

JoeyBF
Copy link
Collaborator

@JoeyBF JoeyBF commented Nov 26, 2021

Here's a reworking of how we handle vectors, to set the stage for matrices later. I abstracted away the vector-like behavior that FpVector, Slice and SliceMut shared and I put it in traits, BaseVector for immutable methods and BaseVectorMut for mutable ones. That lets us introduce some more vector-like structs down the line, like Row for a row of a matrix or ArchivedFpVector if we start using rkyv. I'm opening this as a draft because:

  1. The ext library works but the rest of the repo has to be fixed still (mostly just importing those two traits).
  2. It's still missing almost all the documentation.
  3. There's a non-trivial performance hit when using this new module, and I would like to find why and fix it if possible.

@dalcde
Copy link
Contributor

dalcde commented Nov 26, 2021 via email

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Nov 26, 2021

  1. The ext library works

Actually I take that back. The ext library compiles, and all fp and algebra tests pass, but the resolution starts getting messed up at stem 19. So much so that the computation up to t = 40 already basically takes forever. The fp bench results are sane though, I can still share those if you want.

@dalcde
Copy link
Contributor

dalcde commented Nov 26, 2021 via email

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Nov 26, 2021

Yeah, the rank of each bidegree explodes. Here's a sample output

Module (default: S_2):
Load from save? (optional):
Max s (default: 15):
Max t (default: 30):
Save file (optional):
·
·
·
·
·                                     :
·                                   · : ·
·                                 ·   : · ⠿
·                             ·   ·   · · ⠿ *
·                     ·       · · ·   · · ⠿ * *
·                   · ·     · · · ·   · : ⡿ * * *
·                 ·   ·     · :   · · · ∴ ⣿ * * * *
·             ·   ·         · ·   · : · ∴ ⡿ * * * * *
·     ·       · · ·         · ·   · · · · ⁙ ⁘ ⣿ * * * *
·   · ·     · · ·           · · ·   ·         : : : : * *
· ·   ·       ·               ·
·

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Nov 27, 2021

Update: I tracked the error down to the debug_assert at matrix_inner.rs:747. The first time the inside of the for loop is executed is at t = 24, s = 3, and that's where the assert fails and our problems begin. Now I just need to check why it fails, since my reimplementation should be semantically equivalent to the current one.

Edit: Nevermind that, my debugger was acting weird. (t,s) = (24,3) is not the first bidegree where the loop is executed, but it is the bidegree where the assert is violated. The column 28 is a pivot but shouldn't be.

Edit 2: I tracked down the bug to bidegree t = 21, s = 3. The new implementation has a few bits flipped in rows 46 and 48 of the matrix in step_resolution, and the error propagates to the later bidegrees. Now I just need to figure out why those bits aren't set properly. This is starting to sound like something goes wrong in the algebra subcrate, which I'm the least familiar with.

@JoeyBF JoeyBF force-pushed the new_vector branch 3 times, most recently from c1930a9 to 373e919 Compare November 28, 2021 08:53
ext/crates/fp/src/limb.rs Outdated Show resolved Hide resolved
ext/crates/fp/src/vector/generic.rs Outdated Show resolved Hide resolved
@dalcde
Copy link
Contributor

dalcde commented Nov 29, 2021

Here is a thought --- how different is BaseVector from Into<Slice> in the old world?

@dalcde
Copy link
Contributor

dalcde commented Dec 1, 2021

The correct trait bounds are really for<'a> &'a T: Into<Slice<'a>> and for<'a> &'a mut T: Into<SliceMut<'a>>. I used them here:

pub fn stream_quasi_inverse<T, S>(
p: ValidPrime,
data: &mut impl Read,
results: &mut [T],
inputs: &[S],
) -> std::io::Result<()>
where
for<'a> &'a mut T: Into<SliceMut<'a>>,
for<'a> &'a S: Into<Slice<'a>>,

with implementations
impl<'a, 'b> From<&'a mut SliceMut<'b>> for SliceMut<'a> {
fn from(slice: &'a mut SliceMut<'b>) -> SliceMut<'a> {
slice.copy()
}
}
impl<'a, 'b> From<&'a Slice<'b>> for Slice<'a> {
fn from(slice: &'a Slice<'b>) -> Slice<'a> {
*slice
}
}
impl<'a, 'b> From<&'a SliceMut<'b>> for Slice<'a> {
fn from(slice: &'a SliceMut<'b>) -> Slice<'a> {
slice.as_slice()
}
}
impl<'a> From<&'a FpVector> for Slice<'a> {
fn from(v: &'a FpVector) -> Slice<'a> {
v.as_slice()
}
}
impl<'a> From<&'a mut FpVector> for SliceMut<'a> {
fn from(v: &'a mut FpVector) -> SliceMut<'a> {
v.as_slice_mut()
}
}

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Dec 5, 2021

Here is a thought --- how different is BaseVector from Into<Slice> in the old world?

Ideally we would want something like impl<T: BaseVector> From<T> for Slice but that would imply a conflicting implementation of From<Slice> on Slice. I still think it's important to distinguish the two, at least semantically. For example, this allows FpVector to have much more optimized implementations of several methods.

@dalcde
Copy link
Contributor

dalcde commented Dec 5, 2021 via email

@dalcde
Copy link
Contributor

dalcde commented Dec 5, 2021 via email

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Dec 6, 2021

Storing [Limb] is what makes it a DST. A &[Limb] is fine because it is a pointer to a DST, and has a known size (namely 2 * 64 bits).

Of course. I was thinking that std::slice::from_raw_parts returned a raw slice instead of a reference; since it's just a reference, there should be no issue storing it.

Why would there be a conflict? I would be surprised if the FpVector improvement would be significant, especially if the matrix slices are going to be slices anyway. (otoh, an AlignedSlice struct would be nice and good for semantics). For maintenance puprposes, I think there is definitely a benifit to having a smaller code footprint if possible.

Any T implements From<T>, but since Slice also implements BaseVector that would give us a second implementation of From<Slice> on Slice. I haven't tried yanking out the special FpVector methods but my intuition would tell me that they give a decent speedup.

What did you have in mind for an AlignedSlice struct? Something like a struct that is guaranteed to begin/end on a limb boundary? Logically that would be what is returned by some split_borrow method on FpVector.

@dalcde
Copy link
Contributor

dalcde commented Dec 6, 2021 via email

@dalcde
Copy link
Contributor

dalcde commented Mar 16, 2022

Just as a note, after #80 we cannot make e.g. Module::act take in impl BaseVector since it would no longer be object safe.

@JoeyBF JoeyBF marked this pull request as ready for review May 31, 2022 05:45
@JoeyBF
Copy link
Collaborator Author

JoeyBF commented May 31, 2022

All tests pass except the selenium ones, and they don't look like the same flakey ones we used to have. I'm not sure how to fix them, since at first glance they don't seem related to fp

@dalcde
Copy link
Contributor

dalcde commented May 31, 2022 via email

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented May 31, 2022

I'm not sure how I should check that locally. How do I start the webserver?

@dalcde
Copy link
Contributor

dalcde commented May 31, 2022 via email

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Jun 1, 2022

It seems adding a differential called a code path in fp that wasn't tested, but it's fixed now

@JoeyBF JoeyBF requested a review from dalcde June 17, 2022 04:58
@dalcde
Copy link
Contributor

dalcde commented Jun 19, 2022

Do you mind rebasing and resolving merge conflicts first?

ext/crates/fp/src/limb.rs Outdated Show resolved Hide resolved
@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Jun 26, 2022

How does this rewrite affect performance?

I'm still running benchmarks but I'll comment on that shortly

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Jun 26, 2022

Here are the benchmarks, compared using critcmp

group                             master                                 new_vector
-----                             ------                                 ----------
row_reduce_2/row_reduce_2_10      1.00   565.5±27.64ns        ? ?/sec    1.11   627.9±33.43ns        ? ?/sec
row_reduce_2/row_reduce_2_100     1.00     15.9±0.67µs        ? ?/sec    1.03     16.3±1.35µs        ? ?/sec
row_reduce_2/row_reduce_2_1000    1.00  1163.2±31.80µs        ? ?/sec    1.05  1224.1±76.33µs        ? ?/sec
row_reduce_2/row_reduce_2_20      1.00  1334.2±45.95ns        ? ?/sec    1.26  1674.6±272.55ns        ? ?/sec
row_reduce_2/row_reduce_2_2000    1.00      5.0±0.13ms        ? ?/sec    1.05      5.2±0.13ms        ? ?/sec
row_reduce_2/row_reduce_2_4000    1.00     29.5±0.93ms        ? ?/sec    1.02     29.9±1.57ms        ? ?/sec
row_reduce_2/row_reduce_2_420     1.00    205.2±5.12µs        ? ?/sec    1.17   241.1±47.42µs        ? ?/sec
row_reduce_2/row_reduce_2_69      1.00      8.6±0.23µs        ? ?/sec    1.09      9.3±1.15µs        ? ?/sec
row_reduce_3/row_reduce_3_10      1.00   971.7±41.19ns        ? ?/sec    1.35  1313.6±52.65ns        ? ?/sec
row_reduce_3/row_reduce_3_100     1.00    110.5±3.71µs        ? ?/sec    1.38    152.5±1.55µs        ? ?/sec
row_reduce_3/row_reduce_3_1000    1.13     35.2±0.93ms        ? ?/sec    1.00     31.1±1.02ms        ? ?/sec
row_reduce_3/row_reduce_3_20      1.00      3.5±0.10µs        ? ?/sec    1.46      5.1±0.18µs        ? ?/sec
row_reduce_3/row_reduce_3_420     1.00      3.4±0.07ms        ? ?/sec    1.12      3.8±0.43ms        ? ?/sec
row_reduce_3/row_reduce_3_69      1.00     48.1±1.12µs        ? ?/sec    1.45     69.8±1.76µs        ? ?/sec
row_reduce_5/row_reduce_5_10      1.00   990.6±26.66ns        ? ?/sec    1.22  1213.4±61.51ns        ? ?/sec
row_reduce_5/row_reduce_5_100     1.00    153.9±3.17µs        ? ?/sec    1.15   177.2±24.86µs        ? ?/sec
row_reduce_5/row_reduce_5_1000    1.30     65.6±2.12ms        ? ?/sec    1.00     50.6±1.57ms        ? ?/sec
row_reduce_5/row_reduce_5_20      1.00      3.8±0.04µs        ? ?/sec    1.32      5.0±0.20µs        ? ?/sec
row_reduce_5/row_reduce_5_420     1.18      6.2±0.05ms        ? ?/sec    1.00      5.3±0.38ms        ? ?/sec
row_reduce_5/row_reduce_5_69      1.00     59.9±1.93µs        ? ?/sec    1.22     72.8±2.89µs        ? ?/sec
row_reduce_7/row_reduce_7_10      1.00  1810.6±76.17ns        ? ?/sec    1.19      2.2±0.05µs        ? ?/sec
row_reduce_7/row_reduce_7_100     1.00   486.1±49.60µs        ? ?/sec    1.12   542.8±70.90µs        ? ?/sec
row_reduce_7/row_reduce_7_1000    1.07   318.7±10.00ms        ? ?/sec    1.00   296.8±14.54ms        ? ?/sec
row_reduce_7/row_reduce_7_20      1.00      9.1±0.15µs        ? ?/sec    1.23     11.2±0.16µs        ? ?/sec
row_reduce_7/row_reduce_7_420     1.06     25.4±1.29ms        ? ?/sec    1.00     23.9±0.58ms        ? ?/sec
row_reduce_7/row_reduce_7_69      1.00    177.7±4.62µs        ? ?/sec    1.13    201.0±6.65µs        ? ?/sec

It looks like it tends to be better for bigger matrices, but worse otherwise. Also not great for p = 2. I was expecting something roughly like this, once new matrices will be in place the performance should go back up with interest. I didn't bench all of ext however.

@dalcde
Copy link
Contributor

dalcde commented Jun 26, 2022 via email

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Jun 26, 2022

I'm trying to run iai to figure it out but cachegrind keeps crashing. Could you confirm that you also get Failed to run benchmark in cachegrind. Exit code: signal: 4 when running cargo bench --bench iai in the ext folder on the master branch?

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Jun 26, 2022

Actually, even compiling cargo build --release --example resolve and trying valgrind --tool=cachegrind target/release/examples/resolve gives me an illegal instruction error

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Jun 28, 2022

I figured it was a recent instruction set that was tripping up valgrind so I ran iai with target CPU x86-64. I'm having trouble reading the results, but it seems that cache locality is a bit worse and there are more instructions in general when the dimensions are small, but fewer than master when the dimensions get bigger. Here's the comparison; master on the left, this PR on the right. That's probably all I can say without diving into the asm

@JoeyBF JoeyBF mentioned this pull request Jun 29, 2022
4 tasks
@dalcde
Copy link
Contributor

dalcde commented Jul 3, 2022 via email

@JoeyBF
Copy link
Collaborator Author

JoeyBF commented Jul 3, 2022

What would you mean by very fast? Looking at top I would guess something like 30s-1m maybe. That's also a big part of why I think #101 would be useful

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants