
Fuzz testing version 3 #242

Closed
Numpsy opened this issue Nov 19, 2024 · 6 comments · Fixed by #247
Labels
performance issues or enhancements related to the performance characteristics of the project
Milestone

Comments

@Numpsy
Contributor

Numpsy commented Nov 19, 2024

I tried running the current version 3 through SharpFuzz using the 'english.presets.doc' test file as a base file, and it generated the attached file, which seems to cause an infinite loop of some sort when trying to open it (just opening it in Structured Storage Explorer makes it hang).

For reference, v2.4 has the same issue, but the Windows native functions report that the file is corrupt without any real delay.

timeout.zip

@jeremy-visionaid
Collaborator

jeremy-visionaid commented Nov 20, 2024

Thanks for testing it out against SharpFuzz! It actually isn't an infinite loop - it will throw an exception once the maximum chain length is reached. However, we can make that determination a lot sooner by comparing to the sector count implied by the stream length rather than the maximum sector value constant.

In the general case though, if the chain has a big loop then we'd need to commit some time and memory to detect it sooner. We could do something similar to v2.4 and store a hash set of previously encountered values, but since streams could potentially be terabytes long, we'd also want an approach that avoids blowing out time and memory when checking for previously encountered entries. My initial thought would be something like binary exponential backoff, where only the values at indexes that are powers of two are retained.
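The power-of-two retention idea can be sketched as follows (a minimal Python sketch for illustration only; the library itself is C#, and `END_OF_CHAIN`, `detect_cycle`, and the dict-based FAT here are assumptions, not the actual implementation):

```python
END_OF_CHAIN = 0xFFFFFFFE  # CFB ENDOFCHAIN sector marker

def detect_cycle(fat, start, max_steps=1_000_000):
    """Walk a FAT chain from 'start', retaining only the sector IDs seen
    at power-of-two indexes. Every visited sector is checked against that
    set, so a loop is caught shortly after the walk passes a retained
    entry, while the set only grows by O(log n) insertions."""
    seen = set()
    index = 0
    current = start
    while current != END_OF_CHAIN and index < max_steps:
        if current in seen:
            return True  # revisited a retained sector: the chain loops
        if index & (index - 1) == 0:  # index is 0 or a power of two
            seen.add(current)
        current = fat[current]
        index += 1
    return False
```

A loop of length L that is entered at index s gets caught within roughly 2s + L steps (the first retained index inside the loop, plus one trip around it), so detection cost tracks where the loop sits rather than the declared stream length.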

@jeremy-visionaid jeremy-visionaid added the performance issues or enhancements related to the performance characteristics of the project label Nov 20, 2024
@jeremy-visionaid jeremy-visionaid added this to the 3.0 milestone Nov 20, 2024
jeremy-visionaid added a commit that referenced this issue Nov 20, 2024
Improve loop detection time, but limit memory usage and HashSet
operations with binary exponential backoff.

Fixes: #242
@jeremy-visionaid
Collaborator

jeremy-visionaid commented Nov 20, 2024

It's worth mentioning that the loop detection I was proposing above was more of a guard against DoS, and that validation of the payload itself would need to be handled by the client app. Though this particular issue only affects the directory entry chain anyway, since the streams declare their length in their respective directory entry and the implementation wouldn't try to read beyond the stream length (and an exception is already thrown if the chain is too short).

In v4 the detection time could also be improved since the header declares the directory sector count, so the largest valid FAT value would be the minimum of the base stream sector count and the directory sector count.

@jeremy-visionaid
Collaborator

Oh, haha, I probably shouldn't try to fix things like this late in the evening. The loop might occur anywhere, so we'd need to do something like Floyd/Brent... I wonder if the reference implementation actually does anything more sophisticated than checking the chain length against the sector count.

@jeremy-visionaid
Collaborator

I need to add something similar for the mini FAT, but it's late so I'll leave it till tomorrow when I can look at it with fresh eyes!

@Numpsy
Contributor Author

Numpsy commented Nov 20, 2024

There was a mention of Floyd's algorithm in one of the previous 'corrupted document' issues, but it didn't go anywhere.

A tangentially related thought on APIs: there was a discussion about async functions that got put off, but is it worth considering adding cancellation token support to non-async functions, which could still be used to abort potentially long-lived operations? (A caller could have one that cancels on a timer, for example.)

I just remembered that we raised something similar against the Open XML SDK, though that's basically a no-go because lower-level bits like System.IO.Packaging don't support it.

@jeremy-visionaid
Collaborator

Yeah, I think Floyd's is the intuitively obvious approach, but it would have been hard to add to 2.4 since that worked by constructing lists, so performance/memory would be worse than just using a simple HashSet. Since 3.0 works on enumerators/iterators, it's fairly straightforward to do cycle detection with just an extra uint for the "slow" path.
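For comparison, a Floyd-style walk over the same hypothetical dict-based FAT (again a Python sketch for illustration; the actual 3.0 code is C# and iterator-based) needs only the single extra "slow" value:

```python
END_OF_CHAIN = 0xFFFFFFFE  # CFB ENDOFCHAIN sector marker

def has_cycle(fat, start):
    """Floyd's tortoise-and-hare: 'fast' advances two sectors per
    iteration, 'slow' advances one. If 'fast' reaches the end marker the
    chain terminates; if the two pointers ever meet, the chain loops.
    Uses O(1) extra memory regardless of chain length."""
    slow = fast = start
    while True:
        fast = fat.get(fast, END_OF_CHAIN)
        if fast == END_OF_CHAIN:
            return False
        fast = fat.get(fast, END_OF_CHAIN)
        if fast == END_OF_CHAIN:
            return False
        slow = fat[slow]
        if slow == fast:
            return True
```

The trade-off versus the power-of-two HashSet approach is constant memory in exchange for the second (fast) traversal of the chain.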

As for cancellation tokens, my answer is similar to the Open XML SDK folks': we can't add cancellation tokens without async (an operation is only long-lived if it's blocking on I/O for some reason, and sync streams can't be cancelled). Otherwise the caller can abort whenever they like. i.e. there's just no point in adding cancellation tokens to sync stream functions IMHO - there's nothing I can add to the API that you can't already do for yourself as a caller (hence it's beyond the scope of what the API should do). Fortunately, we're also not blocked like they are - if we want to add async functions then we can, and cancellation tokens would inherently be part of that. You're welcome to create a separate ticket for implementing the async APIs.
