Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add get_index and it will return the index of the element #470

Merged
merged 8 commits into from
May 12, 2023

Conversation

bryceustc
Copy link
Contributor

When using the CRoaring library, the normal way to get the 1 index of a value in a RoaringBitmap is:

  1. Call contains() to check if the value exists
  2. If it exists, call rank() to get the 1 index
  3. If it not exists, do something else
    However, this requires traversing the RoaringBitmap twice, which is inefficient. To improve this, a get_idx() interface was created. It works as follows:
  4. Traverse the RoaringBitmap once
  5. If the value is found, return its 1 index in the set
  6. If the value is not found, return -1

@lemire
Copy link
Member

lemire commented May 5, 2023

This is reasonable and potentially useful but while contains and rank are well known terms (to me at least), I am unfamiliar with the terminology 1 index. Do you have a reference (maybe in other software) regarding this terminology?

If it is a terminology you created, we will have to discuss it further.

@bryceustc
Copy link
Contributor Author

bryceustc commented May 6, 2023

Apologies for any confusion around the term '1 index'. My intent is to refer simply to the index of a given value in the RoaringBitmap.

As an example, if a Roaring Bitmap is {2, 4, 5, 7, 8, 9, 10}, I want get the index of value 7, it will return 4, which means that 7 is the 4th 1 in the set. If i want get the index of value 6, it will return -1, which means that 6 is not a element of the set. In a word, The proposed get_idx() interface aims to find the index (or -1 if not exsist) in a single traversal.

@lemire
Copy link
Member

lemire commented May 6, 2023

@bryceustc I do understand the concept.

I am concerned about the terminology.

I don't have much to propose, but "get_idx" and "1 index" won't work. You know what it means, but the average programmer won't be able to tell.

What about "get_index" and just say that we are return the index of the element?

@bryceustc
Copy link
Contributor Author

It's a good suggestion, I will modify the relevant code and descriptions. Thanks :)

@bryceustc bryceustc changed the title add get_idx function to return the 1 index of x add get_index and it will return the index of the element May 6, 2023
@bryceustc
Copy link
Contributor Author

I have modified the relevant content according to your suggestions. Thank you in advance for reviewing again.

cpp/roaring.hh Outdated
* when x isn't in the set, but the rank funciton will return a
* non-negative number.
*/
int64_t get_index(uint32_t x) const noexcept {
Copy link
Contributor

@longqimin longqimin May 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryceustc nice work. but the name/comments may be a little confusing. index start from 0 in general, while the rank value start from 1 (if contains)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks for your comments, it should start from 0. I have modified the relevant codes and descriptions, please review again when convenient :)

size_t i = 0;
for (; i < length; ++i) {
if (arr[i] == x) {
is_present = true;
Copy link
Contributor

@longqimin longqimin May 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can return i; here for simplicity.

and the rests LGTM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your meticulous work!

@Dr-Emann
Copy link
Member

Dr-Emann commented May 7, 2023

Maybe it would be worthwhile to take a page from binarySearch, and return -(i+1) rather than always returning -1, so you can know what the index of the number would be if it were in the bitmap.

roaring_bitmap_select is the opposite of this function, we should mention it in this function, and vice versa.

Copy link
Member

@Dr-Emann Dr-Emann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About naming, not many of the roaring functions have get in them, even when they're logically getters. I wonder if a better name might be roaring_bitmap_index_of, or something referencing select since that's the dual of this function, like roaring_bitmap_selection_index?

@@ -799,6 +799,15 @@ bool roaring_bitmap_select(const roaring_bitmap_t *r, uint32_t rank,
*/
uint64_t roaring_bitmap_rank(const roaring_bitmap_t *r, uint32_t x);

/**
* Returns the index of x in the given roaring bitmap.
* If the roaring bitmap dosen't contain x , this function will return -1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dosen't -> doesn't

* Returns the index of x in the given roaring bitmap.
* If the roaring bitmap dosen't contain x , this function will return -1.
* The difference with rank function is that this function will return -1 when x
* is not the element of roaring bitmap, but the rank funciton will return a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

funciton -> function

src/roaring.c Show resolved Hide resolved
cpp/roaring.hh Outdated
* when x isn't in the set, but the rank funciton will return a
* non-negative number.
*/
int64_t get_index(uint32_t x) const noexcept {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other C++ methods are camelCase, I think this should be getIndex

Do we want to add the same to roaring64map.hh?

@bryceustc
Copy link
Contributor Author

bryceustc commented May 8, 2023

@Dr-Emann Thanks for your corrections. I have revised the relevant codes and comment. Please review again when it's convenient. :)

In addition, there are a few points I would like to discuss further with you.

  1. return -(i+1)
    Actually, in our practical project, we want to quickly get the index of the number by calling get_index, and we only care about the numbers that belong to the bitmap. If it isn't in the bitmap, returing -1 is acceptable, just for a better performance.
  2. naming
    As for the naming, I wonder if roaring_bitmap_get_index will be more straightforward than roaring_bitmap_index_of and roaring_bitmap_selection_index. Might roaring_bitmap_get_index also be feasible?
  3. add the same to roaring64map
    I implemented the getIndex method for roaring64map in the latest commit. But I am concerned about that whether __int128_t is not incompatible in some situations.

roaring_iter != roaring_destination; ++roaring_iter) {
index += roaring_iter->second.cardinality();
}
index += roaring_destination->second.getIndex(lowBytes(x));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryceustc it is a bug to add -1 to index

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the type of index should be int128_t

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review, the bug and comments have been fixed in the latest commit.

@@ -1092,6 +1093,23 @@ public:
return result;
}

/**
* Returns the number of integers that are smaller or equal to x.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comments is misleading

* when x isn't in the set, but the rank function will return a
* non-negative number.
*/
int128_t getIndex(uint64_t x) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rank uses uint64_t for its return type. I think it might be reasonable, if every range container took 1 byte, storing a roaring bitmap of all uint64s would take 256 TB of memory, and there's a lot more overhead than 1 byte per range (0..0xFFFF)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm afraid I didn't grasp what you meant...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to use int128_t, because if you ever had an item at an index greater than an int64_t can represent, you're already out of memory. The rank method returns a uint64_t, and I argue this method should return an int64_t.

The most efficient way to get there is to have a range container for every slot in every bitmap. Each range container can cover 0x10000 items. So to get to 0x8000_0000_0000_0000 (the first index which can't be returned in an i64), you need 0x0000_8000_0000_000 range containers (140737488355328). If each range container (and the keys array, and the container type array, and malloc overhead, and...) magically took just 1 byte per container (spoiler, it will take a LOT more than that), you'd still need 128TiB of memory. There's simply no way to ever realistically have a RoaringBitmap64 with more than 2^63 items.

Copy link
Member

@lemire lemire May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My math is slightly more pessimistic. Each container can represent 65536 values. I get 2**64/(2**16) is 281474976710656. Assuming one byte per container that is... 256 TiB.

I think that Windows cannot ever support more than 6 TB of memory... https://learn.microsoft.com/fr-ca/windows/win32/memory/memory-limits-for-windows-releases?redirectedfrom=MSDN

A reasonable solution here is to use a 64-bit value and explain why you do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed integer means 2^63 rather than 2^64. I made that mistake initially, you can see I said 256 TB in my first message. Not that it matters, either way, even assuming an impossible 1 byte per container, it's not gonna fit in even virtual memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right ok. Most systems have a theoretical limit of maybe 2^56 bytes per process. Of course, this may change in the future, but before that happens, int128_t will be part of the every standard programming language.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thank you all for the professional and detailed explanations. I have changed int128_t to int64_t in the latest commit.

@lemire
Copy link
Member

lemire commented May 11, 2023

Running tests.

@lemire
Copy link
Member

lemire commented May 11, 2023

@Dr-Emann @longqimin This PR looks good to me.

Any objection to merging?

@longqimin
Copy link
Contributor

@Dr-Emann @longqimin This PR looks good to me.

Any objection to merging?

LGTM too

@lemire lemire merged commit 7174265 into RoaringBitmap:master May 12, 2023
@lemire
Copy link
Member

lemire commented May 12, 2023

Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants