Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New iterator released: shuffled #28

Open
wants to merge 16 commits into
base: master
Choose a base branch
from
Open

Conversation

hoxnox
Copy link

@hoxnox hoxnox commented May 11, 2016

Allows iteration over a sequence in shuffled order. Randomization released through Linear Feedback Shift Register.
Additional convinient feature - ability to restore iterator state with zero cost (not present in README - see tests).

@hoxnox
Copy link
Author

hoxnox commented May 11, 2016

Typo in README code example - sorted not replaced with shuffled.

@ryanhaining
Copy link
Owner

I like the idea but the implementation seems pretty intense and not the most obvious so I'm having a bit of trouble figuring out why. What are the advantages of all this over creating a sequence of iterators into the container in the ShuffledView constructor, and then std::shuffle()ing it?

@hoxnox
Copy link
Author

hoxnox commented May 15, 2016

It's zero-cost by memory. Every step is just some simple asm operations (mask, shift). If you use std::shuffle - you must have enough memory to store all shuffled set. In the presented case we can shuffle, for example, very big files line-by-line, or rows in big SQL tables (...if we don't care about speed of random seeking through it. If we do - we can use Mixed Product - it's sequential). Off course if all shuffled data already are in the memory - we have random-access iterators (through dumb_advance), so shuffling will be as fast as std::shuffle.
I use this method, for example, to shuffle through whole IPv4 address space (with ability to resume) on the machine with 512MB of memory.

@ryanhaining
Copy link
Owner

Well that does sound pretty cool. Can I see the IPv4 shuffling code?

@hoxnox
Copy link
Author

hoxnox commented May 16, 2016

In this test I use IPv4 pseudo container and shuffle through it:
https://github.com/hoxnox/rators/blob/devel/tests/test_shufipv4.hpp

rators project will be overwritten to use cppitertools::suffled and cppitertools::mixed_product soon

hoxnox added 2 commits May 18, 2016 08:56
1. There was a bug. When we approximate, for example 10 with power of
two - we got 16 (2^4). So register size must be 4. But instead register
if size 5 was used. It's not efficient.

2. Even we have efficient std::distance and operator+(int n) functions
dumb_advanced and dumb_size doesn't uses them, so they was replaced with
std::advance and std::distance accordingly
@hoxnox
Copy link
Author

hoxnox commented May 18, 2016

IPv4 shuffling with help of cppitertools and https://github.com/hoxnox/iptools with zero memory cost:

    for (auto i : iter::shuffled(cidr_v4("0.0.0.0/0")))
        void(0);

@ryanhaining
Copy link
Owner

I'm sorry this is taking me so long to get through. I'm working on just language stuff for the most part at this time. You have a point in your test file to make sure "shuffled not store container inside" but it needs to in the case of an rvalue

auto s = iter::shuffle(std::vector<int>{1, 2, 3});
// okay now s has an iterator into that vector
// but the vector temporary has already been destroyed
auto it = std::begin(s);
*it; // use-after-free

Why did you make a point of this? I've made the change on my checkout on the branch but I want to make sure I'm not missing something.

Why do we need the check at the top of operator++ to see if *this == owner->end()?

What is being gained by having Distance be a template when we're aiming to deduce in the function calls?

@hoxnox
Copy link
Author

hoxnox commented May 22, 2016

  1. I've copied tests from sorted. The test moves rvalues and binds to lvalues hadn't passed "as is" because shuffled doesn't store container inside. So I made a note about it. We can just delete this test.
  2. We need to stay at end() when operator++ applies to end(). Yes, technically we can shift LSFR with state 0 and it will stay at 0, so we doesn't need to check if *this == owner->end().
  3. We need to store the size of the container. So we need to know the type. We can use std::iterator_traits::difference_type, but it will impose additional requirements to iterators. I haven't enough experience to make decision how to do it better.

@ryanhaining
Copy link
Owner

  1. I'm saying it needs to store the container because it is currently callable with a temporary. Try running the example under valgrind to see why. This was an issue with itertools a couple of years ago and has since been fixed.
  2. The check isn't necessary in that case because it's UB to increment an iterator passed the end. Nothing else has this check.
  3. I get that, but why template it if it's always size_t anyway?

1. Store the shuffled container internally
2. Removed unnecessary checking in operator++
3. Distance is always uint64_t
@hoxnox
Copy link
Author

hoxnox commented May 23, 2016

You convinced me. =)

@hoxnox
Copy link
Author

hoxnox commented Jun 1, 2016

I tested the code in 2 projects. It successfully runs in production under Gentoo-amd64, Debian-jessie-x64 and Debian-jessie-i386.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants