-
Notifications
You must be signed in to change notification settings - Fork 279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pass fuzz.F to fuzz functions #218
Comments
Dmitry's original proposal is here: https://docs.google.com/document/u/1/d/1zXR-TFL3BfnceEAWytV8bnzB2Tfp6EPFinWVJ5V4QC8/pub Relevant passage:
Looks like we're more or less on the same page here, except that I propose to keep the signature simple and uniform, and use methods to get fuzz data. Unless there are objections, I will plan to implement my proposal and see how it looks in practice. |
The critical part here is this:
I believe this is the future of fuzzing and Go can pioneer here with simplicity and easy of use (as always!). Almost all current fuzzers (notably AFL, LibFuzzer) come from security background, and there you naturally operate on a blob because that's the only thing that can come from external world (network, file, etc). But currently there is a shift towards using fuzzing during normal dev process (testing, correctness, quality, CI, etc). And in this context you have a function that accepts stuff and you want to test it, so you need that stuff to come from the fuzzer. The current solutions are both complex, inflexible and inefficient. |
But overall I agree with idea of adding a context argument as it's much more flexible and extensible. I always feel a bit nervous designing public stable APIs... Skip/Fail/Failf look reasonable. For Interesting maybe we want to pass an integer priority (how much it is interesting). But for most cases it will indeed be binary. Or maybe we want to not give this knob to user and instead rely on fuzzer becoming smarter over time? Re ExitOnCompletion, do you have any examples? Or other name alternatives to choose from? :) |
For context, Ian's proposal of using F.Useful/Discard: Skip looks better than Discard, because Skip is testing.T method. For Interesting/Useful we could also consider Priority name, esp if it accepts an int. |
golang/go#19109 (comment)
I am not sure it's Fuzz function responsibility to know about layout of files on disk. It's probably more of an infra responsibility (i.e. should stay as tool flag). |
This is an interesting one about reusing Fuzz function in unit tests: A first idea is providing a FromT function that creates a stub F from testing.T to use in unit-tests. |
Looking at libfuzzer flags in case there are others like -timeout which are actually test property:
|
I've been thinking about your proposal: func FuzzRegexp(f *testing.F, re string, data []byte, posix bool) {
// use re, data, posix
} vs what I had in mind: func FuzzRegexp(fz fuzz.F) {
re := fz.String("re")
data := fz.Bytes("data")
posix := fz.Bool("posix")
// use re, data, posix
} And I have come around to your suggestion. I want to record some of my thinking here, though, for future reference. One challenge when you have anything other than just a This interacts with the Fuzz function signature: If you want some amount of resilience in your corpus (e.g. no changes required if you merely add a new input), then you need your inputs to be keyed somehow. I had had in mind providing a key when you request them (as in the code snippet below), but as you note, someone might request strings in a loop. This makes it impossible to know statically what the shape of the corpus entries is. The downside to having this in the Fuzz function signature is that it makes the variable names significant. If you rename The ability to statically detect the desired inputs is, to my mind, decisive. And there are ways to work around a change in function signature without having to migrate the corpus. E.g. you could make a wrapper fuzz function that modifies the inputs and calls the old (presumably renamed) fuzz function. So for the moment, I'm going to implement only:
with the plan to later support adding other arguments as well. (Which will requiring figuring out about the corpus.) Incidentally, I am not sure yet whether |
I like not having knobs. And they're easier to add than drop. So I'll start without it, particularly since we aren't sure exactly what the knob should look like. It'd be nice to run some experiments to see how much the hint helps right now, so we know what we're leaving on the table. Perhaps I will do that at some point. |
Example: the Go compiler. Refactoring it to be re-usable is a gigantic task, but it is not too hard to set up a Fuzz function entry point for it that can be called only once. Name alternatives: I dunno. An alternative API is to a way to ask go-fuzz to exit the process without considering it a crash. Maybe
I think so. In the general case go-fuzz should do the right thing. But in the compiler case it is hard to do without the control. (I have been using a modified version of go-fuzz.)
Yes, I think so. Concrete use case for doing this from the Fuzz function: When fuzzing starlark, I start with the starlark interpreter but then call Python to compare results. I'd like an aggressive timeout for the first part, but then a very lax timeout for exec'ing Python, which is slow. :) I could adjust my timeout based on how long the work I'm about to do should actually take. |
I agree, although I do think there should be a reasonable default, and have that be based on the package and fuzz function name. (And also maybe its signature?) Consider the case in which you have 5 fuzz functions in a package. It'd be nice to be able to call |
Interesting, but not germane (I think) to the new fuzz function API. I'd want to think about that a bit more. |
Interesting. None strike me as obvious must-haves for v0. Let's do this incrementally. |
Based on all the conversation above, I feel reasonably confident in taking a basic step: minimal fuzz.F, accept a single byte slice in the signature. I will work on that and we can add in extra little pieces as we go. |
Another restriction: no unexported types. (Can't create those types from a generated |
But FWIW, the use cases that I have in mind can also be handled without integer priority, but by calling |
Agree. There should be a simple default. It should take Fuzz function name into account. |
Overall I think the fuzzer should try to do it's best to upgrade old corpus, but exact mechanism should be an implementation detail. Also the corpus inputs should use some simple intuitive format (json) so that users can easily upgrade corpus if they fell like so.
Yes, no interfaces. Probably no channels (at least until we understand we need them). Not sure about maps, probably we an allow them.
Interesting question. Actually since we are the compiler, potentially we can work around this. E.g. we could generate a thunk function that accepts some common format and transforms it into actual arguments and calls the target function. |
Forgot to mention one thing. |
In #218 (comment), you had said:
I think the conversation has moved past that point, but for reference, what the fzgo prototype currently defaults to is
The corpus ends up by default in I think that might be a small elaboration of what was proposed in @dvyukov's March 2017 proposal document (e.g., I'm not sure if the proposal specifically spelled out the form of |
|
@dvyukov I think in case someone supplies a Of course, that alone does not handle all possible conflicts (e.g., (And of course, the behavior could change, either in that prototype or the "real" version). |
I see. This allows to point -fuzzdir to a common location for all targets/packages. Makes sense. |
Alternatively:
But maybe that is unnecessary complexity. Separately, it sounds like there might be some desire to incorporate the FuzzFunc signature into the on-disk location, though that sounds like it might still be an open question? |
Another option is to just -fuzzdir as is and don't append anything. Then it's implied to be user (script, infrastructure) responsibility to pass proper paths in whatever convention they want. Then we can drop package name from testdata. |
Would that imply a need to also drop There might be some minor tension here between making it easy for OSS-Fuzz vs. making it easy for a human using Finally, one only tangentially related question, which really does not belong here in this #218 issue -- I have a few questions around the proposed behavior of |
I can't confirm neither deny this.
Maybe mail golang-dev@ and then write up conclusions and reference the discussion on the issue. |
This is initial work towards dvyukov#218.
The more I think about this, the less convinced I am that we should go out of our way to implement On a related note, testing.T and testing.B both have a |
Did the approach for |
Maybe I’m missing something. I’m saying I don’t see the point in embedding or implementing testing.TB. The overlap with what fuzzing actually needs is too small. We may as well have our own interface. The problem of concrete type vs interface is separate, but also real. In theory we could do much more complicated codegen to work around it. In practice I think we are better off staying abstract. The key point, though, is that I don’t think fuzz.F (future testing.F?) should be a superset of testing.T!. |
Which TB functionality does not belong to F? |
It seems to me that, of TB: F really needs: Maybe it also should have: TB has all that plus: So of the 15 TB methods, 4 are clearly useful and 2 are maybe useful. 3 of them are nonsensical ( I personally would rather have just the 4-6 useful methods, rather than trying to squeeze it into a TB-shaped box.
My initial implementation returns something from Name, and makes Helper a no-op. It's not unimplementable, it's just serious overkill--an unnecessary attempt at consistency. That's my current thinking, anyway. |
I agree that functions like SkipNow are less useful than say Skipf. But I think the main question is if we need that compatibility between F and TB or not. If we need it, then we need to accept legacy like SkipNow as well (that's how interfaces work).
That's kinda handy for short one-line inputs, but so convenient for longer inputs or if there are too many of them. There is also some boilerplate involved and also it's not too convenient to select a single input to run (to debug a newly added crasher). Currently I comment out all but one, because I am not up to giving them names and writing additional code to select subset of inputs nor wrapping into t.Run. I wonder if it's a good idea to instead allow 2/2+ directories with input corpus? If I would have something like this, I would not need convergence between F and TB. -fuzzdir/-workdir could accept several dirs too. Only one of them would be the master one and will be used to store new inputs. LibFuzzer also has:
I wonder if we could merge into the master dir always and then handle this use case as well. |
FWIW, I very much like the concept of multiple corpus locations, especially from the perspective of the #19109 proposal. Multiple corpus locations hopefully helps with what might otherwise be a few different related problems with the current #19109 proposal:
The March 2017 proposal says:
I have a version of that working in fzgo (currently unpublished / still a bit WIP). One set of datapoints is running it on https://github.com/dvyukov/go-fuzz-corpus. The median execution time isn't too bad:
but the two slowest are around 4 seconds:
4 seconds seems high enough that it might be cause for concern from the core Go proposal review team, especially if one considers that https://github.com/dvyukov/go-fuzz-corpus doesn't necessarily reflect a multi-calendar year effort to build up a corpus, I think, and people can have slower Fuzz functions, and hence there is no reason to think that 4 seconds is any type of upper bound on what people might see in the wild. Having one smaller corpus checked into the main repository of a given Go package would mostly side step that, I think, with the option to have a much bigger corpus elsewhere for when you want to do a weekend run or do it in CI or oss-fuzz, etc.
If you are going to get the benefit of your corpus as unit tests when doing a normal I think Dmitry had wondered aloud in another issue if maybe Having a smaller corpus checked into pkgpath/testdata/fuzz or similar also side-steps that problem as well for the common case of wanting to run unit tests.
This is more of a question maybe, but the March 2017 proposal says:
Maybe if the norm is a smaller corpus checked into pkgpath/testdata/fuzz, then maybe that corpus would only be updated for a crasher, but not be updated for new coverage if you just run Side note: the way it works in fzgo is currently fairly crude. It just dumps a corpus_test.go to a temp directory and Sprintfs a TestCorpus function, the heart of which is:
That means things like
Finally, sorry for the long post, but I guess it is fair to say I am enthusiastic about the concept of > 1 corpus location. |
The main fuzzing corpus should not be checked into stdlib repo, regardless of if we support multiple corpus locations or not. So long execution time should not be a problem for anybody. Buildbot could checkout the fuzzing corpus specifically, but then longer execution time is what it explicitly asks for. |
What I was thinking about is something like git submodules. Namely, we import another repo at a particular location, and that location is exactly where go tool looks for the corpus. |
This sounds reasonable. I would say a requirement.
Nice! |
@dvyukov @josharian I think you are both saying now that there is no real value in using the I wanted to confirm that is correct? FWIW, I agree as well, including the fzgo prototype already side-stepped that (by synthesizing Test functions as mentioned in the "Side note" at the end of the (too-long) comment #218 (comment)). In the golang/go#19109 discussion, I think the first mention of
But I think that was might have been more of a brainstorming comment from @dvyukov in response to someone asking if the new fuzzing functionality could fit into existing APIs such as I think the first real suggestion to start with
Read in context, I think it is clear Russ is saying testing.TB would just be step along the way to testing.F. In any event, if it is not needed for a prototype, and seems awkward, then seems like no real value in using the |
FWIW, I read Russ's comment as saying that testing.F should implement testing.TB, and then proposing a way to achieve that. But I think we should start by implementing it in whatever way makes the most practical sense. We can then have a further discussion with Russ about the value (or lack thereof) of implementing testing.TB. As long as we are thoughtful in our prototype, that shouldn't cause too much pain or wasted work. |
Yes, let's postpone implementing TB. In the end interfaces are retrofit-able in Go so we don't burn any bridges. |
Some related discussion in golang/go#19109 (comment), including:
|
@thepudds thanks a lot for the link here, I was not aware of this discussion. I indeed believe that there is great value in ability to run modern property-based testing library on top of go-fuzz. For that, ability to interactively request chunks of random data is crucial; generating it "upfront" based on fuzz function signature is too limiting, because:
I think it is much better to be able to decouple data generation from go-fuzz, and I'll be happy to go more in-depth about anything related to property-based testing. |
Part of what makes fuzzing effective is a straightforward relationship between the input byte slice and what is done with it. As an example, instrumentation notes when a comparison fails and modified the input slice to try to make it succeed. As another example, one kind of mutation is to detect an ascii-printed number and replace it with another ascii number. If the byte slice is used to generate random numbers, that relationship will be lost, and fuzzing effectiveness will diminish. That is also why the discussion here about extended signatures is focused on precise input types—because it makes the relationship between the fuzz input and what is done with it much tighter. As such, I think your proposal is not a good fit for a fuzzing engine. It might be that you would want to reuse a fuzz function for randomized testing. That makes sense and is worth exploring. But the API you suggest would not work well for the fuzzing use case. |
#218 (comment) sounds like #65 all over again. Could someone please retitle this bug to clarify it's about the |
@josharian while I agree that my proposal can make the job of the sonar harder, on the other hand, it will make the job of the mutator much easier and significantly increase its chance of generating valid input data. Crucial thing to understand is that by building on top of random data stream the right way, it is possible to have a very high hit rate of generating valid data, even when it has complex internal structure and validity requirements. All mutator has to do is either change individual bytes, or add/remove/reorder spans of data, without having to know anything about the kind or constraints of data being generated. Consider high-level mutation like "add customer to the list of customers". With span-based approach, all we need to do is place a (potentially modified) copy of a span right after it; there is no need to know anything about what a valid "customer" or "list" is; in particular, there is no need to update any kind of "list length" data, as it does not exist in the data stream. |
Let's please move the discussion of the complex args to #65. There are too many scattered discussions already. |
Another fuzz.F method pair to consider: The idea is that, when called, we would stop/start gathering coverage information. The implementation would probably be that we would swap in a dummy coverage counter, so that coverage writes would be effectively ignored. The use case is for fuzzing something like a compiler. Imagine building a generation tool that emits valid code based on a random byte slice. The parsing code is full of branches that will saturate our coverage counters, but it's not coverage we care about; it'd be nice to get past that before attending to coverage. Or I suppose for this particular case, we could also add a flag to go-fuzz-build to suppress instrumentation of a set of packages. Hmm. |
Another method to consider is AddDataCoverage(hash uint64) We are experimented with data space-guided coverage and it shows very strong result. Basically if we hash our model state and modify coverage bitmap based on this hash, then fuzzer is able to explore not only code space, but data space as well. Explained here. |
The current Fuzz function signature is
I think we should migrate it to something more like:
(That import path will obviously have to change if go-fuzz moves into the standard toolchain. Or if we migrate to github.com/go-fuzz/go-fuzz, or the like.)
I imagine starting
fuzz.F
(fuzzing.F
?) with:There's plenty more to add, e.g. key-value-based requests for bools/ints/etc instead of having to parse them out of a byte slice. But this would be a good first start.
In order to avoid people having to change their fuzz functions, I'd automatically detect the old style of signature and have go-fuzz-build insert a shim.
Discuss. :)
(P.S. I think you had a similar proposal, Dmitry. I know that I need to go look at it. Apologies.)
The text was updated successfully, but these errors were encountered: