Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collision of hashed IDs #4612

Closed
opadin opened this issue Oct 2, 2021 · 13 comments
Closed

Collision of hashed IDs #4612

opadin opened this issue Oct 2, 2021 · 13 comments
Labels
label/id and id stack implicit identifiers, pushid(), id stack

Comments

@opadin
Copy link

opadin commented Oct 2, 2021

I want to use ImGui in my product, but i fear the collision of the hashed IDs.

This is not a point of - for example - buttons, which i create during development. These collisions are detected right there and make no problems. The problem can arise, if there are any user generated content. Like an hierarchical scene-object treeview or a file explorer. If the user can not select a file, because another one is always selected first, that would be just annoying. But when the wrong functionality is executed because of such collision, that would be already a NoGo.

One possibility would be to change the code to accept at all places also customer IDs (not to confuse with the stringification at end with ##). There are only some places, where GetID() is called, it would be easy to adapt the functions to this. Are there any other possibilities or perhaps hints, how other users work with this 'fear'?

Greetings, and sorry, for my bad english ... not my native one.

p.s. I have alreday worked with ImGui (small app for live-testing of GLSL and some debugging ui) and it works like a charm. Thats why I do not want to reinvent the wheel again.

@PathogenDavid
Copy link
Contributor

Dear ImGui has used the same hashing algorithm (CRC32) since it's initial release in 2014. If nobody has ever had a major unexpected hash collision in the past seven years, you probably will not be the first.

One possibility would be to change the code to accept at all places also customer IDs

Keep in mind that IDs are stored in a stack. You can use PushID/PopID to create your own stack scopes. You should push an ID that is intrinsically unique before submitting UI elements which have an ID that is directly or indirectly provided by the user.


Out of curiosity I wrote a dirty little test to check how long it would take to create a collision using the pattern set out in the FAQ for submitting many widgets programmatically:

#include "imgui_internal.h"
#include <unordered_set>
static void GH4612()
{
    ImGui::Begin("GH-4612");
    std::unordered_set<ImGuiID> hashes;
    ImGuiWindow* window = ImGui::GetCurrentWindow();

    for (int i = 0; i >= 0; i++)
    {
        if (i % 1000 == 0)
        {
            printf("%d...\n", i);
        }

        ImGuiID id = window->GetIDNoKeepAlive(i);

        if (!hashes.insert(id).second)
        {
            printf("Hash collision happened for %d (hash = %d)\n", i, id);
            IM_ASSERT(false);
        }
    }
    printf("No hash collisions for all positive integers.\n");

    ImGui::End();
}

This test went through around 536,870,000 tests in ~10 minutes before it consumed so much memory that my computer locked up trying to swap things in and out of the page file and I had to force reboot it and rewrite this entire comment. 😅

I'm no statistics expert, but I think that means you're more likely to win the lottery than you are to stumble upon a hash collision in Dear ImGui.


Finally, if it's still keeping you up at night, you could pretty easily replace the implementation of ImHashData/ImHashStr with a more robust hashing algorithm like xxHash.

@ocornut
Copy link
Owner

ocornut commented Oct 2, 2021

I think eventually we'll switch to using a lightweight 64-bit hash function to reduce that risk further.

I honestly expected people to see CRC32 in profiler and that would be a reason to change, but even though CRC32 is quite unoptimal (and essentially hogs 1 KB of data in cache) that hasn't happened. Several people have suggested to replace CRC32 but seemingly everyone did it from the theoretical point of view KNOWING that CRC32 is unoptimal, not from the practical point of view of seeing CRC32 in profiler measurements.

@opadin
Copy link
Author

opadin commented Oct 2, 2021

@PathogenDavid: Thanks for your response. And mentioning the enormous usage since the very first beginning of Dear ImGui without big complains concerning the ID collision is indeed a strong hint. Your test program is even better.

I enhanced your idea, using the word-list from https://github.com/dwyl/english-words and calling ImHashStr for all words with the same stack-base-id (tested 1 til 15). For every used stack-base-id i got between twenty and thirty collisions (the word-list contains ca. 465.000 words). Mostly of them absolutely unrelated and extremly seldom used expressions. But some of them could be (even theoretically) a folder/file-name in the same namespace and therefore same stack-base-id.

Collision-Examples (in the order: Seed, Word A, Word B):

4 | acappella | bestiaries
5 | compete | stepsons
5 | undercut | unpicketed
8 | baseness | unsupernatural
11 | feinschmeckers | overloved

@ocornut : Thanks for your response, Maestro - impressive work done with this fine peace of software.

Your idea on more bits is interesting but it does not help the general question. I have no problem with the implementation of the hash method itself, but with the possibility of collisions. And even another hash method with more bits and with a more complex algorithm could (AND WOULD) have collisions. That is the nature of hash-functions. No, the current hash method is fine (it needs still fast enough to calculate thousands of widgets in every frame).

But to put it in a game/product and/or to use it for example on a explosion-hazardous area and/or to get a specific certificate the software needs to be absolutely reliable. That is not possible with the current hash-method. But that is the only thing i have seen in the code, which does not meet the high standard of all other components of Dear ImGui (I don't mean that in a bad way, I just don't have the right words here. For most users, everything seems to work fine). And the simplest reliable solution would be to let the developer provide the IDs.

I think, next days I will make a try. I will implement functions to provide additionally(!) a custom ID and try to convert the demo app to this style. If there is some interests on the implementation, I can provide it, when I am ready. Perhaps I come to another conclusion during that work.

@PathogenDavid
Copy link
Contributor

I honestly expected people to see CRC32 in profiler and that would be a reason to change, but even though CRC32 is quite unoptimal (and essentially hogs 1 KB of data in cache) that hasn't happened.

@ocornut I am not an expert at cache stuff, but in theory wouldn't the issue potentially hide in other parts of the program because the processor is smart enough to keep the CRC table warm in the cache at the expense of other things?


But to put it in a game/product and/or to use it for example on a explosion-hazardous area and/or to get a specific certificate the software needs to be absolutely reliable.

@Triseltan Speaking as someone who has worked on safety-critical software before (avionics specifically) and was responsible for a lot of the safety certification aspect of our software: Dear ImGui should never be used in a safety-critical context.

@opadin
Copy link
Author

opadin commented Oct 3, 2021

Speaking as someone who has worked on safety-critical software before (avionics specifically) and was responsible for a lot of the safety certification aspect of our software: Dear ImGui should never be used in a safety-critical context.

@PathogenDavid : I don't think, this is a place to discuss this, but I can understand your point.

I looked into the using software list and it is just astonishing. I can remember long ago (I follow the project since the beginning), when @ocornut first time mentioned that he heard, Blizzard has used Dear ImGui ... or when someone found a mention in some nintendo tools. And with Sponsors like Blizzard and Ubisoft ... who am I to overthink such successful people (Blizzard itself uses a string hash algorithm inside their mpq packages ... and rename a resource during development if there are a collision).

@ocornut ocornut added the label/id and id stack implicit identifiers, pushid(), id stack label Oct 4, 2021
@opadin
Copy link
Author

opadin commented Oct 9, 2021

The experiment with user-generated IDs.

Basically I can say: It works. However, there are three major limitations.

First: There is the well-known problem of how to generate the IDs. Since the beginning of the idea of Immediate GUIs the necessary unique assignment of a widget created 'on-the-fly' over several frames was a problem. Original ideas like the __LINE__ macro as a unique ID (difficult with multiple files or when working with scripting languages) or comparing widget bounds between two frames (doesn't work for animations) were discarded. An excellent overview of various identification options can be found at https://gist.github.com/niklas-ourmachinery/9e37bdcad5bacaaf09ad4f5bb93ecfaf. Niklas Gray is the co-developer of the engine formerly known as Bitsquid, later as Autodesk Stingray (unfortunately discontinued). Even @ocornut writes there about Dear ImGui.

Secondly, it is extremely time-consuming to change all the necessary functions. Dear ImGui is simply not designed to pass IDs (except as a new seed for the stack, or encapsulated as a string in the ### possibility). Moreover, quite a few IDs are also generated hidden in the middle of various functions (#MOVE, #CLOSE, #COLLAPSE, ##Menu_..., ##Popup_... and so on). There one would have to intervene also, since a mixing of hash values and user-generated IDs could lead again to collisions, which we want to avoid straight. Last but not least, the IDStack would become obsolete, but it is used for other operations besides ID assignment (comparisons, evaluations, etc.).

Thirdly: The particularly simple work with Dear ImGui would become a tedious, time-consuming work, because you have to generate IDs at every point, remember them (static or in objects with an additional ID field or as an extra maintained enumeration) and still have to pass them. But it is this simple, fast work with Dear ImGui that brought it to this immense success. It is thanks to @ocornut that he has consistently relied on automatic generation of the IDs right from the start of the project. And I am convinced that this was the decisive factor to prefer Dear ImGui to all other immediate mode GUIs.

In conclusion, my idea would be "somehow" implementable ... but except for the preventing collision it would create a lot of new problems and a lot of opaque code.

Lastly: I have another idea in the background, which I will present shortly.

@opadin
Copy link
Author

opadin commented Oct 10, 2021

Trying with a symbol table

Since I'm using Dear ImGui for some side projects, I'm thinking about how to eliminate collisions. At the very beginning I thought about a symbol table, like they are used in in compiler construction (but there for storing identifiers and detecting duplicate identifiers ... exactly these collisions). But then I discarded it, because the effort seemed too huge. But now I would like to use Dear ImGui professionally (and failed with the alternative user-generated IDs - see comment above), so I looked into it again.
Symbol tables are basically just hash tables whose entries contain a string (on which the hash is calculated) and corresponding data about it (in a programming language, for example, whether it's a variable or a type, what scope it's in, and related information in each case).

What makes them so fascinating to me, and just right for tracking UI elements, are the following reasons:They have been around for a very long time and have been studied since the first compilers, support scopes, are designed to find entries with specific string as fast as possible and insert new entries without wasting time, and store entries of different length while not wasting too much memory.

To make a long story short, it worked better than expected.I had to rework all GetID and PushID calls (especially tables and loading/saving the settings data was a bit tricky). It's not perfect by any means (see open issues below), but it's already working quite reliably.

How does it work:

My original plan was to continue using the IDs generated so far, just including a sort into the symbol table. On the one hand this would have had the charm that all IDs would have remained the same, but on the other hand it would have had a big drawback: All comparisons of IDs (just generated ID, ActiveID, MoveID, etc.) would have had to compare the ID everywhere first and - if they were the same - would have had to search the symbol table. Making changes everywhere would have been extremely time-consuming and also bad for further maintenance.

Fortunately, I came up with a better idea. Instead of using the generated ID, the offset in the symbol table is used (all "symbols" with the same ID are linked together in a simple concatenated list, but stored in different places in the symbol table memory).

A symbol entry is (currently) composed as follows:

struct SymbolTableEntry {
    ImU32 prev; // Ofs of SymbolTableEntry of ID stack
    ImU32 next; // Ofs of next SymbolTableEntry with same ID
    ImU32 type; // ImGuiDataType ... TODO make it smaller!
    ImU32 length; // length of entry;
    ImU8 data[1];
};

Here is a picture of the current development (left is an excerpt of the symbol table - with the info in each case to offset, stack offset aka scope, type and data ):

imgui

Open points are still

  • Clean up the table at the end of a frame. (This is still a bigger and very important construction site. Currently the symbol table is just growing. Better would be to keep only the necessary IDs (ActiveID, NavIDs, all Windows IDs) at the end of a frame and 'throw away' the rest. Unfortunately the idea with the 'offsets' is here rather hindering).
  • Possibly rework the hash function (I'm not sure about that, still I use the original algorithms from ImHashStr/ImHashData and that works great).
  • Automatic resizing of the symbol table (currently still a static memory block limited to just under one megabyte)
  • PushOverrideID - there I have to distinguish between standalone ID (as with the InstanceIds of the ImGui table widgets) or 'correctly' generated IDs (then offsets into the symbol table).
  • Optimize the data storage of the symbol table entries
  • Automated random generation of test elements/widgets in large numbers with Dear ImGui as test field to test collisions and performance.

Interesting: While testing with the symbol table I already found a first collision in the current code (Synced Tables: First Column Header of the first table and First Column Header of the second table). Here is the GIF for it. First I hover over one, then the other element, then hover + click on the first, then the second widget. (btw the symbol table version prevented exactly this collision in the test)

imgui-collision-01

p.s. I always like to give the thoughts and inner workings of my coding. If that is not desired, or there is generally not much interest in the topic of avoiding collisions, feel free to close the issue.

@ocornut
Copy link
Owner

ocornut commented Oct 10, 2021 via email

@TerensTare
Copy link

TerensTare commented Nov 8, 2021

I think eventually we'll switch to using a lightweight 64-bit hash function to reduce that risk further.

I honestly expected people to see CRC32 in profiler and that would be a reason to change, but even though CRC32 is quite unoptimal (and essentially hogs 1 KB of data in cache) that hasn't happened. Several people have suggested to replace CRC32 but seemingly everyone did it from the theoretical point of view KNOWING that CRC32 is unoptimal, not from the practical point of view of seeing CRC32 in profiler measurements.

I'm a bit late to the party and not sure if the "not-the-best-performant-hash" issue is still valid, but it can be optimized for strings known at compile time. For comparison:

// Note HashedString and Hash are made-up names
HashedString str = Hash("str"); // "str" is a string known at compile-time, calculate hash at compile time

std::string str = "str";
HashedString str = Hash(str); // str is NOT a string known at compile-time, fall back to runtime calculations

The idea is explained on this article and EnTT offers an implementation. The TL;DR is that constexpr + overload resolution is used to distinct compile-time strings and evaluate the hash at constexpr. Not sure if this can be achieved at all the language standards ImGui targets but I believe this change is beneficial to have given a considerable amount of string IDs are known at compile-time.

@ocornut
Copy link
Owner

ocornut commented Nov 8, 2021

but it can be optimized for strings known at compile time. For comparison:

No it cannot. You are misunderstanding how the ID stack system of dear imgui works. Every hash is seeded by a value depending on the context. They are not "absolute" hashes. https://github.com/ocornut/imgui/blob/master/docs/FAQ.md#q-about-the-id-stack-system

@TerensTare
Copy link

My apologies, you are right. I wasn't totally aware about how the id stack works and misunderstood it.

@ocornut
Copy link
Owner

ocornut commented Nov 8, 2021

I'll close this topic as IHMO at this point is an undesirable hogging of attention.

I do not disagree that we could eventually switch to a 64-bit hash (I think we will eventually, both to reduce collision and to use code which doesn't hog a kilobyte of data in cache) but specifically the author of this topic stated it didn't matter to them and would rather engage in reinventing a wheel they decided didn't work. This kind of research is great and feel free to post here if you want to share it, but we can't provide support for a non-issue.

@ocornut ocornut closed this as completed Nov 8, 2021
@black-square
Copy link

I think it would be nice to have some numbers for the actual probability of random collision for an ideal 32-bit hash function in dependence on the number of samples:

1%	9,300
25%	50,000
50%	77,000
75%	110,000

from https://en.wikipedia.org/wiki/Birthday_attack

The problem is quite real.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
label/id and id stack implicit identifiers, pushid(), id stack
Projects
None yet
Development

No branches or pull requests

5 participants