pass fuzz.F to fuzz functions #218

Open
josharian opened this issue Mar 4, 2019 · 51 comments

Comments

@josharian
Collaborator

The current Fuzz function signature is

func FuzzSomething(b []byte) int

I think we should migrate it to something more like:

import fuzz "github.com/dvyukov/go-fuzz"

func FuzzSomething(fz fuzz.F)

(That import path will obviously have to change if go-fuzz moves into the standard toolchain. Or if we migrate to github.com/go-fuzz/go-fuzz, or the like.)

I imagine starting fuzz.F (fuzzing.F?) with:

type F interface {
  // Bytes returns a byte slice to be used as input.
  Bytes() []byte

  // Skip tells go-fuzz that this input should not be added to the corpus, and stops execution.
  // (Similar to return -1 right now.)
  // msg explains why. It is currently just a form of documentation for the user,
  // but you can imagine later gathering stats about whence all the skips.
  Skip(msg string)

  // Interesting tells go-fuzz that this input is interesting and should be given added priority.
  // Equivalent to return 1 right now, except that it does not stop execution.
  Interesting()

  // Fail reports a failure of an invariant and stops execution.
  // Equivalent to a custom panic right now.
  Fail(msg string)
  Failf(format string, args ...interface{})

  // ExitOnCompletion requests that go-fuzz exit the binary after the Fuzz function completes.
  // This dramatically impacts performance and the effectiveness of go-fuzz;
  // it should be used only when it is infeasible to write a fuzz function that can be safely
  // called multiple times.
  ExitOnCompletion()
}

There's plenty more to add, e.g. key-value-based requests for bools/ints/etc instead of having to parse them out of a byte slice. But this would be a good first start.

In order to avoid people having to change their fuzz functions, I'd automatically detect the old style of signature and have go-fuzz-build insert a shim.
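A rough sketch of what such a shim might look like, assuming a trimmed-down F with only Bytes, Skip, and Interesting. The names shim, recorder, skipSignal, and run are hypothetical and exist only for this illustration; nothing here is what go-fuzz-build actually generates.

```go
package main

import "fmt"

// F is a trimmed-down version of the interface proposed above.
// Assumption: only the methods the shim needs are shown.
type F interface {
	Bytes() []byte
	Skip(msg string)
	Interesting()
}

// skipSignal is the panic value recorder.Skip uses to stop execution.
type skipSignal struct{ msg string }

// recorder is a hypothetical F that records what the fuzz function reported.
type recorder struct {
	data        []byte
	skipped     bool
	interesting bool
}

func (r *recorder) Bytes() []byte { return r.data }

func (r *recorder) Skip(msg string) {
	r.skipped = true
	panic(skipSignal{msg})
}

func (r *recorder) Interesting() { r.interesting = true }

// shim adapts an old-style fuzz function to the new signature by
// translating its return value into calls on F.
func shim(old func([]byte) int) func(F) {
	return func(fz F) {
		switch old(fz.Bytes()) {
		case -1:
			fz.Skip("old-style fuzz function returned -1")
		case 1:
			fz.Interesting()
		}
	}
}

// run invokes a new-style fuzz function once and absorbs the Skip panic.
func run(fn func(F), data []byte) (r *recorder) {
	r = &recorder{data: data}
	defer func() {
		if v := recover(); v != nil {
			if _, ok := v.(skipSignal); !ok {
				panic(v) // a real failure, not a Skip
			}
		}
	}()
	fn(r)
	return r
}

// FuzzOld is an example old-style fuzz function.
func FuzzOld(b []byte) int {
	if len(b) == 0 {
		return -1
	}
	if b[0] == 'x' {
		return 1
	}
	return 0
}

func main() {
	for _, in := range []string{"", "x", "y"} {
		r := run(shim(FuzzOld), []byte(in))
		fmt.Printf("%q skipped=%v interesting=%v\n", in, r.skipped, r.interesting)
	}
}
```

The old return values map directly onto the new methods, which is why existing fuzz functions would not need to change.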

Discuss. :)

(P.S. I think you had a similar proposal, Dmitry. I know that I need to go look at it. Apologies.)

@josharian
Collaborator Author

Dmitry's original proposal is here: https://docs.google.com/document/u/1/d/1zXR-TFL3BfnceEAWytV8bnzB2Tfp6EPFinWVJ5V4QC8/pub

Relevant passage:

The support is added to the testing package.
User fuzz functions are added to _test.go files and start with Fuzz akin to tests and benchmarks. Fuzz function accepts two arguments: (*testing.F, data []byte).
The data []byte argument is the random input that the function is supposed to use in some way.
The new type testing.F merely implements testing.TB interface:
Log functions accumulate output for the duration of a single invocation; output is printed for failing invocations.
Error/Fatal/Fail functions denote a found bug.
Skip functions terminate current invocation without failing and mark the input as uninteresting.
testing.F type can later be extended with other functions if necessary.
The fuzz function signature can later be allowed to accept multiple randomly-generated arguments of different types. This is useful for fuzz tests that need multiple inputs, for example:
func FuzzRegexp(f *testing.F, re string, data []byte, posix bool) {

Looks like we're more or less on the same page here, except that I propose to keep the signature simple and uniform, and use methods to get fuzz data.

Unless there are objections, I will plan to implement my proposal and see how it looks in practice.

@dvyukov
Owner

dvyukov commented Mar 5, 2019

Looks like we're more or less on the same page here, except that I propose to keep the signature simple and uniform, and use methods to get fuzz data.

The critical part here is this:

The fuzz function signature can later be allowed to accept multiple randomly-generated arguments of different types. This is useful for fuzz tests that need multiple inputs, for example:
func FuzzRegexp(f *testing.F, re string, data []byte, posix bool) {

I believe this is the future of fuzzing, and Go can pioneer here with simplicity and ease of use (as always!). Almost all current fuzzers (notably AFL and LibFuzzer) come from a security background, where you naturally operate on a blob because that's the only thing that can come from the external world (network, file, etc). But there is currently a shift towards using fuzzing during the normal dev process (testing, correctness, quality, CI, etc). In this context you have a function that accepts stuff and you want to test it, so you need that stuff to come from the fuzzer. The current solutions are complex, inflexible, and inefficient.
So I would like to keep the possibility of adding more arguments open.
And while for a single []byte, passing it as an argument or via an F method is more a matter of taste, I don't see how we can conveniently pass more arguments via F (and capture their types, statically or via reflect).

@dvyukov
Owner

dvyukov commented Mar 5, 2019

But overall I agree with the idea of adding a context argument, as it's much more flexible and extensible.

I always feel a bit nervous designing public stable APIs...

Skip/Fail/Failf look reasonable.

For Interesting maybe we want to pass an integer priority (how much it is interesting). But for most cases it will indeed be binary. Or maybe we want to not give this knob to user and instead rely on fuzzer becoming smarter over time?

Re ExitOnCompletion, do you have any examples? Or other name alternatives to choose from? :)
Should it accept number of times the process can be reused?... 1 will give this behavior, but can also be set to, say, 10. Perhaps this is over-designing.
On a related note, if a test says if it can be reused or not, should it also say what timeout it needs? We currently have it as a flag to go-fuzz, but it's kind of a test property.

@dvyukov
Owner

dvyukov commented Mar 5, 2019

For context, Ian's proposal of using F.Useful/Discard:
golang/go#19109 (comment)

Skip looks better than Discard, because Skip is testing.T method.

For Interesting/Useful we could also consider Priority name, esp if it accepts an int.

@dvyukov
Owner

dvyukov commented Mar 5, 2019

golang/go#19109 (comment)
also mentions:

Could there be some default convention say a _fuzz/xxx directory (where xxx corresponds with FuzzXxx) and a method on the *testing.F object to load a different corpus from the _fuzz/ directory if necessary? It seems like it should just know where the corpus is.

I am not sure it's Fuzz function responsibility to know about layout of files on disk. It's probably more of an infra responsibility (i.e. should stay as tool flag).

@dvyukov
Owner

dvyukov commented Mar 5, 2019

This is an interesting one about reusing Fuzz function in unit tests:
golang/go#19109 (comment)

A first idea is providing a FromT function that creates a stub F from testing.T to use in unit-tests.

@dvyukov
Owner

dvyukov commented Mar 5, 2019

Looking at libfuzzer flags in case there are others like -timeout which are actually test property:
https://llvm.org/docs/LibFuzzer.html#options
Potentially relevant may be:

-max_len
Maximum length of a test input. If 0 (the default), libFuzzer tries to guess a good value based on the corpus (and reports it).
-timeout
Timeout in seconds, default 1200. If an input takes longer than this timeout, the process is treated as a failure case.
-rss_limit_mb
Memory usage limit in Mb, default 2048. Use 0 to disable the limit. If an input requires more than this amount of RSS memory to execute, the process is treated as a failure case. The limit is checked in a separate thread every second. If running w/o ASAN/MSAN, you may use ‘ulimit -v’ instead.
-malloc_limit_mb
If non-zero, the fuzzer will exit if the target tries to allocate this number of Mb with one malloc call. If zero (default) same limit as rss_limit_mb is applied.
-only_ascii
If 1, generate only ASCII (isprint + isspace) inputs. Defaults to 0.

@josharian
Collaborator Author

I've been thinking about your proposal:

func FuzzRegexp(f *testing.F, re string, data []byte, posix bool) {
  // use re, data, posix
}

vs what I had in mind:

func FuzzRegexp(fz fuzz.F) {
  re := fz.String("re")
  data := fz.Bytes("data")
  posix := fz.Bool("posix")
  // use re, data, posix
}

And I have come around to your suggestion. I want to record some of my thinking here, though, for future reference.

One challenge when you have anything other than just a []byte as the input is the corpus. What happens if you change your set of inputs? Does that invalidate your existing corpus? How does go-fuzz know, and what is the behavior? Is there a migration script of some kind? If you want to manually seed the corpus, how do you generate the entries?

This interacts with the Fuzz function signature: If you want some amount of resilience in your corpus (e.g. no changes required if you merely add a new input), then you need your inputs to be keyed somehow. I had had in mind providing a key when you request them (as in the code snippet above), but as you note, someone might request strings in a loop. This makes it impossible to know statically what the shape of the corpus entries is.

The downside to having this in the Fuzz function signature is that it makes the variable names significant. If you rename re to r, what does this mean for the existing corpus? I'm guessing that go-fuzz would die, saying something along the lines of "the corpus contains (re, data, posix), but you want (r, data, posix), you need to migrate (or use a different/new corpus)". This is a bit unexpected; we normally think of variable names as unimportant, whereas arguments to a function that take a key are more obviously important. It also means that there are restrictions on the Fuzz function signature (no interfaces, I'm guessing), which again is a bit weird.

The ability to statically detect the desired inputs is, to my mind, decisive.

And there are ways to work around a change in function signature without having to migrate the corpus. E.g. you could make a wrapper fuzz function that modifies the inputs and calls the old (presumably renamed) fuzz function.

So for the moment, I'm going to implement only:

func FuzzSomething(fz fuzz.F, data []byte)

with the plan to later support adding other arguments as well. (Which will require figuring out what to do about the corpus.)

Incidentally, I am not sure yet whether fuzz.F should be an interface or a concrete type. I had convinced myself a while ago it had to be an interface, but I no longer remember why. When I implement it I will discover (or not, and use a struct!).

@josharian
Collaborator Author

For Interesting maybe we want to pass an integer priority (how much it is interesting). But for most cases it will indeed be binary. Or maybe we want to not give this knob to user and instead rely on fuzzer becoming smarter over time?

I like not having knobs. And they're easier to add than drop. So I'll start without it, particularly since we aren't sure exactly what the knob should look like.

It'd be nice to run some experiments to see how much the hint helps right now, so we know what we're leaving on the table. Perhaps I will do that at some point.

@josharian
Collaborator Author

Re ExitOnCompletion, do you have any examples? Or other name alternatives to choose from? :)

Example: the Go compiler. Refactoring it to be re-usable is a gigantic task, but it is not too hard to set up a Fuzz function entry point for it that can be called only once.

Name alternatives: I dunno. An alternative API is a way to ask go-fuzz to exit the process without considering it a crash. Maybe f.Exit() (like os.Exit but without a return code)? Or f.ExitProcess()? Then in the compiler fuzz function you could defer f.Exit().

Should it accept number of times the process can be reused?... 1 will give this behavior, but can also be set to, say, 10. Perhaps this is over-designing.

I think so. In the general case go-fuzz should do the right thing. But in the compiler case it is hard to do without the control. (I have been using a modified version of go-fuzz.)

On a related note, if a test says if it can be reused or not, should it also say what timeout it needs? We currently have it as a flag to go-fuzz, but it's kind of a test property.

Yes, I think so. SetDeadline, perhaps? It could even be called multiple times.

Concrete use case for doing this from the Fuzz function: When fuzzing starlark, I start with the starlark interpreter but then call Python to compare results. I'd like an aggressive timeout for the first part, but then a very lax timeout for exec'ing Python, which is slow. :) I could adjust my timeout based on how long the work I'm about to do should actually take.

@josharian
Collaborator Author

I am not sure it's Fuzz function responsibility to know about layout of files on disk.

I agree, although I do think there should be a reasonable default, and have that be based on the package and fuzz function name. (And also maybe its signature?)

Consider the case in which you have 5 fuzz functions in a package. It'd be nice to be able to call go-fuzz and have it work on all 5 without having to tell it where the corpus is for each of the 5. But we at least agree that the Fuzz function shouldn't know, so I can at least proceed on the topic at hand here. :)

@josharian
Collaborator Author

A first idea is providing a FromT function that creates a stub F from testing.T to use in unit-tests.

Interesting, but not germane (I think) to the new fuzz function API. I'd want to think about that a bit more.

@josharian
Collaborator Author

libfuzzer flags

Interesting. None strike me as obvious must-haves for v0. Let's do this incrementally.

@josharian
Collaborator Author

Based on all the conversation above, I feel reasonably confident in taking a basic step: minimal fuzz.F, accept a single byte slice in the signature. I will work on that and we can add in extra little pieces as we go.

@josharian
Collaborator Author

It also means that there are restrictions on the Fuzz function signature (no interfaces, I'm guessing)

Another restriction: no unexported types. (Can't create those types from a generated main function.)

@dvyukov
Owner

dvyukov commented Mar 12, 2019

For Interesting maybe we want to pass an integer priority (how much it is interesting). But for most cases it will indeed be binary. Or maybe we want to not give this knob to user and instead rely on fuzzer becoming smarter over time?

I like not having knobs. And they're easier to add than drop. So I'll start without it, particularly since we aren't sure exactly what the knob should look like.

It'd be nice to run some experiments to see how much the hint helps right now, so we know what we're leaving on the table. Perhaps I will do that at some point.

But Interesting is one big knob in itself. So why are we adding it? ;)
I added it because it made my life easier. There was never a sound proof that it's useful. I still think it's useful, but it's still a knob. It may be replaced with a smarter trace analysis (e.g. most use cases that I have in mind boil down to progressing further in the Fuzz function).

FWIW, the use cases that I have in mind can also be handled without integer priority, but by calling Interesting multiple times. I.e. calling Interesting twice makes the input twice as interesting as the input that calls Interesting once.
EDIT: And to make it clear, I don't mean writing a helper function that calls Interesting in a loop :) I mean that Interesting can be naturally placed, say, somewhere in the middle of the Fuzz function and closer to the end of the function.
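To make the "call Interesting at natural progress points" idea concrete, here is a small sketch. The counter type is a hypothetical stand-in for fuzz.F that only counts calls, and the toy "MAGIC + length byte + payload" format is invented for the example.

```go
package main

import (
	"bytes"
	"fmt"
)

// counter is a hypothetical stand-in for fuzz.F that only counts
// how many times Interesting was called for one input.
type counter struct{ n int }

func (c *counter) Interesting() { c.n++ }

// fuzzHeader parses a toy "MAGIC<len><payload>" format and marks each
// stage the input manages to get past.
func fuzzHeader(fz *counter, data []byte) {
	if !bytes.HasPrefix(data, []byte("MAGIC")) {
		return
	}
	fz.Interesting() // got past the magic check (middle of the function)

	rest := data[len("MAGIC"):]
	if len(rest) == 0 || int(rest[0]) != len(rest)-1 {
		return
	}
	fz.Interesting() // length byte is consistent (close to the end)
}

func main() {
	for _, in := range []string{"junk", "MAGIC", "MAGIC\x02ab"} {
		c := &counter{}
		fuzzHeader(c, []byte(in))
		fmt.Printf("%q -> %d Interesting call(s)\n", in, c.n)
	}
}
```

An input that reaches both calls would count as twice as interesting as one that only passes the first check, with no explicit integer priority in the API.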

@dvyukov dvyukov closed this as completed Mar 12, 2019
@dvyukov dvyukov reopened this Mar 12, 2019
@dvyukov
Owner

dvyukov commented Mar 12, 2019

I am not sure it's Fuzz function responsibility to know about layout of files on disk.

I agree, although I do think there should be a reasonable default, and have that be based on the package and fuzz function name. (And also maybe its signature?)

Consider the case in which you have 5 fuzz functions in a package. It'd be nice to be able to call go-fuzz and have it work on all 5 without having to tell it where the corpus is for each of the 5. But we at least agree that the Fuzz function shouldn't know, so I can at least proceed on the topic at hand here. :)

Agree. There should be a simple default. It should take Fuzz function name into account.

@dvyukov
Owner

dvyukov commented Mar 12, 2019

This interacts with the Fuzz function signature: If you want some amount of resilience in your corpus (e.g. no changes required if you merely add a new input), then you need your inputs to be keyed somehow.

Overall I think the fuzzer should try to do its best to upgrade an old corpus, but the exact mechanism should be an implementation detail. Also the corpus inputs should use some simple intuitive format (json) so that users can easily upgrade the corpus themselves if they feel like it.
Speaking of the implementation details: looking at variable names looks unnecessary. I think the most common cases that it should handle are adding arguments/struct fields at the end and a limited set of type changes (e.g. between int/float/bool, or string/[]byte), and also discarding excess fields at the end.
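A minimal sketch of such a positional upgrade, under loud assumptions: each corpus entry is stored as a JSON array of argument values, and the target signature is described by a list of kind names. The format and the functions zero, convert, and upgrade are hypothetical, and the string/[]byte case is left out for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// zero returns the zero value for a supported kind name.
func zero(kind string) interface{} {
	switch kind {
	case "string":
		return ""
	case "bool":
		return false
	default: // "int", "float"
		return float64(0)
	}
}

// convert coerces v to kind for the limited set of type changes
// (int/float/bool); anything else falls back to the zero value.
func convert(v interface{}, kind string) interface{} {
	switch x := v.(type) {
	case string:
		if kind == "string" {
			return x
		}
	case bool:
		switch kind {
		case "bool":
			return x
		case "int", "float":
			if x {
				return float64(1)
			}
			return float64(0)
		}
	case float64: // encoding/json decodes every JSON number as float64
		switch kind {
		case "float":
			return x
		case "int":
			return float64(int64(x))
		case "bool":
			return x != 0
		}
	}
	return zero(kind)
}

// upgrade rewrites one corpus entry to match the new argument kinds:
// missing trailing arguments are zero-filled, extra ones are dropped.
func upgrade(entry []byte, kinds []string) ([]byte, error) {
	var old []interface{}
	if err := json.Unmarshal(entry, &old); err != nil {
		return nil, err
	}
	out := make([]interface{}, len(kinds))
	for i, k := range kinds {
		if i < len(old) {
			out[i] = convert(old[i], k)
		} else {
			out[i] = zero(k)
		}
	}
	return json.Marshal(out)
}

func main() {
	up, err := upgrade([]byte(`["a+",true]`), []string{"string", "bool", "int"})
	if err != nil {
		fmt.Println("upgrade failed:", err)
		return
	}
	fmt.Println(string(up))
}
```

Because matching is purely positional, renaming a parameter leaves the corpus untouched, which fits the point that variable names need not matter.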

It also means that there are restrictions on the Fuzz function signature (no interfaces, I'm guessing), which again is a bit weird.

Yes, no interfaces. Probably no channels (at least until we understand we need them). Not sure about maps, probably we can allow them.
The rules should be enforced during fuzzer build (so no runtime failures).

Another restriction: no unexported types. (Can't create those types from a generated main function.)

Interesting question. Actually since we are the compiler, potentially we can work around this. E.g. we could generate a thunk function that accepts some common format and transforms it into actual arguments and calls the target function.

@dvyukov
Owner

dvyukov commented Mar 12, 2019

func FuzzRegexp(f *testing.F, re string, data []byte, posix bool)

Forgot to mention one thing.
One very interesting direction is using struct field tags to provide more hints/semantic information to the fuzzer.
Both syzkaller and libprotobuf-mutator that work on structured input level allow specifying some additional constraints for fields. Say, expected range for an int variable, possible values for a string variable (so called dictionary), etc. This may also help with interfaces that represent "unions", e.g. set of some heterogeneous "actions" on the tested type.
And tags are something that we do want to have statically in the fuzzer.
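As an illustration only, hints could be attached like this. The `fuzz` tag key and its `dict=` / `range=` syntax are invented for this sketch and are not an existing go-fuzz feature. The sketch reads tags with reflect for brevity; a build-time tool would more likely read them from the source so they are known statically.

```go
package main

import (
	"fmt"
	"reflect"
)

// RegexpInput is a hypothetical structured input for a regexp fuzz
// function. The `fuzz` tag key and its syntax are assumptions.
type RegexpInput struct {
	Re    string `fuzz:"dict=a+,[0-9]*"`
	Data  []byte
	Posix bool
	Limit int `fuzz:"range=0:64"`
}

// hints returns the raw `fuzz` tag for every field of struct value v
// that declares one, keyed by field name.
func hints(v interface{}) map[string]string {
	out := map[string]string{}
	t := reflect.TypeOf(v)
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if tag, ok := f.Tag.Lookup("fuzz"); ok {
			out[f.Name] = tag
		}
	}
	return out
}

func main() {
	h := hints(RegexpInput{})
	fmt.Println("Re:", h["Re"])
	fmt.Println("Limit:", h["Limit"])
}
```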

@thepudds
Collaborator

thepudds commented Mar 12, 2019

In #218 (comment), you had said:

I am not sure it's Fuzz function responsibility to know about layout of files on disk.

I agree, although I do think there should be a reasonable default, and have that be based on the package and fuzz function name. (And also maybe its signature?)

I think the conversation has moved past that point, but for reference, what the fzgo prototype currently defaults to is pkgpath/testdata/fuzz/pkg.FuzzFunc/corpus. For example, if you have a fuzz.go such as:

package sample

func FuzzSemver(data []byte) int {
   ...
}

The corpus ends up by default in .../testdata/fuzz/sample.FuzzSemver/corpus, and a -fuzzdir flag allows the corpus to be stored elsewhere (e.g., a separate corpus repo, or just another directory in the same repo, etc.).

I think that might be a small elaboration of what was proposed in @dvyukov's March 2017 proposal document (e.g., I'm not sure if the proposal specifically spelled out the form of pkg.FuzzFunc in the file path).

@dvyukov
Owner

dvyukov commented Mar 12, 2019

testdata is already located in the package dir, so why do we need the package name a second time in pkg.FuzzFunc? Could it be just pkgpath/testdata/fuzz/FuzzFunc/corpus?

@thepudds
Collaborator

thepudds commented Mar 12, 2019

@dvyukov I think it is there for the case where someone supplies a -fuzzdir argument. In other words, -fuzzdir defaults to pkgpath/testdata/fuzz, but someone could set it to -fuzzdir=/tmp/fuzz, which would result in the corpus being in /tmp/fuzz/sample.FuzzSemver/corpus (continuing the prior example).

Of course, that alone does not handle all possible conflicts (e.g., math/rand vs. crypto/rand).

(And of course, the behavior could change, either in that prototype or the "real" version).

@dvyukov
Owner

dvyukov commented Mar 12, 2019

I see. This allows pointing -fuzzdir to a common location for all targets/packages. Makes sense.
A more principled approach would then be to include the full package path. It would avoid all conflicts (at the cost of longer paths, of course).

@thepudds
Collaborator

thepudds commented Mar 12, 2019

Alternatively:

  • an empty -fuzzdir could imply <pkgpath>/testdata/fuzz/FuzzFunc/corpus (so by default, the package's testdata directory is used, and there is no redundancy on the package name), vs.
  • a non-empty -fuzzdir such as -fuzzdir=/tmp/fuzz could mean it creates directories corresponding to the package path (e.g., /tmp/fuzz/example.com/full/package/path/FuzzFunc/corpus).

But maybe that is unnecessary complexity. Separately, it sounds like there might be some desire to incorporate the FuzzFunc signature into the on-disk location, though that sounds like it might still be an open question?

@dvyukov
Owner

dvyukov commented Mar 12, 2019

Another option is to just use -fuzzdir as is and not append anything. Then it's implied to be the user's (script's, infrastructure's) responsibility to pass proper paths in whatever convention they want. Then we can drop the package name from testdata.
I suspect that example.com/full/package/path/FuzzFunc/corpus may be too complex for systems like OSS-Fuzz. They want to simply give a directory with the corpus as is and expect the updated corpus to be there after a fuzzing session, to copy to a persistent location. They may not understand what these 6 additional levels of dirs are, with the corpus only at the end.
For automation cases the -fuzzdir location on local disk is most likely temporary. Somebody just provides the corpus in that location at fuzzer start and expects the updated corpus there. That's it. Most likely there are not even other fuzzers in that temp VM, so there are no potential path conflicts to resolve.

@thepudds
Collaborator

Another option is to just use -fuzzdir as is and not append anything. Then it's implied to be the user's (script's, infrastructure's) responsibility to pass proper paths in whatever convention they want. Then we can drop the package name from testdata.

Would that imply a need to also drop FuzzFunc from the path? Or would OSS-Fuzz be OK with that?

There might be some minor tension here between making it easy for OSS-Fuzz vs. making it easy for a human using go test or go test -fuzz . or similar. For example, a human might want to do something like export GOFLAGS='-fuzzdir=/workspace/cloned-corpus-repo' and then navigate around different packages or run something like go test ./... and have the corpus directory be found for each different package / FuzzFunc by the go tool based off of that one setting. Perhaps the answer is two different settings or otherwise slightly refactoring some of the proposed settings.

Finally, one only tangentially related question, which really does not belong here in this #218 issue: I have a few questions around the proposed behavior of go test when no fuzzing flags are set. (I have the start of a very basic form of running the corpus in fzgo as unit tests for go test when no fuzzing flags are set.) However, golang/go#19109 is already very long. Any thoughts on where / how best to ask some follow-up questions on the 2017 March proposal? I was tempted to open an issue on the fzgo repo and CC some potentially interested people (given my most immediate set of questions are around possible next steps for fzgo), but I am not sure that is the best approach.

@dvyukov
Owner

dvyukov commented Mar 13, 2019

Would that imply a need to also drop FuzzFunc from the path? Or would OSS-Fuzz be OK with that?

I can neither confirm nor deny this.

Any thoughts on where / how best to ask some follow-up questions on the 2017 March proposal?

Maybe mail golang-dev@ and then write up conclusions and reference the discussion on the issue.

josharian added a commit to josharian/go-fuzz that referenced this issue Mar 14, 2019
This is initial work towards dvyukov#218.
@josharian
Collaborator Author

The more I think about this, the less convinced I am that we should go out of our way to implement testing.TB. There's a ton of duplication, because testing and benchmarking have distinctions and needs that fuzzing does not.

On a related note, testing.T and testing.B both have a Run method. Implementing that for fuzzing would be difficult. It would require careful bookkeeping around corpuses. It would also require resetting the cover tab every time you called Run. Yet another divergence between fuzzing and testing/benchmarking.

@thepudds
Collaborator

thepudds commented Mar 29, 2019

Did the approach for testing.TB suggested in
golang/go#19109 (comment) not work out because of the issue cited a couple of comments back in #218 (comment) about the challenges of where to put a concrete implementation?

@josharian
Collaborator Author

Maybe I’m missing something. I’m saying I don’t see the point in embedding or implementing testing.TB. The overlap with what fuzzing actually needs is too small. We may as well have our own interface.

The problem of concrete type vs interface is separate, but also real. In theory we could do much more complicated codegen to work around it. In practice I think we are better off staying abstract. The key point, though, is that I don’t think fuzz.F (future testing.F?) should be a superset of testing.T!

@dvyukov
Owner

dvyukov commented Mar 29, 2019

Which TB functionality does not belong to F?
I see logging, erroring and skipping there. Skipping we need. Erroring we need in some form, we can panic, but calling methods on F looks fine too. Is it logging? It does not look fatal.
And we can return something from Name and make Helper no-op?

@josharian
Collaborator Author

Which TB functionality does not belong to F?

It seems to me that, of TB:

F really needs: Fatal, Fatalf, Skip, Skipf.

Maybe it also should have: Log, Logf.

TB has all that plus: Name, Error, Errorf, FailNow, SkipNow, Fail, Failed, Skipped, Helper.

So of the 15 TB methods, 4 are clearly useful and 2 are maybe useful. 3 of them are nonsensical (Failed, Skipped, Helper). And 3 will basically never be used (FailNow, SkipNow, Fail), just as they are basically never used by people writing tests or benchmarks; they are historical dregs.

I personally would rather have just the 4-6 useful methods, rather than trying to squeeze it into a TB-shaped box.

It does not look fatal. And we can return something from Name and make Helper no-op?

My initial implementation returns something from Name, and makes Helper a no-op. It's not unimplementable, it's just serious overkill--an unnecessary attempt at consistency.

That's my current thinking, anyway.
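For illustration, a narrow interface with just the six methods listed above might look like the sketch below. The interface and the recorder type are assumptions made for this example. One consequence worth noting: *testing.T and testing.TB already satisfy such an interface, so a fuzz function written against it could still be called from a unit test without F having to implement all of TB.

```go
package main

import (
	"fmt"
	"testing"
)

// F holds only the six TB methods listed above as useful or maybe useful.
// Assumption: this is a sketch, not the interface go-fuzz ships.
type F interface {
	Fatal(args ...interface{})
	Fatalf(format string, args ...interface{})
	Skip(args ...interface{})
	Skipf(format string, args ...interface{})
	Log(args ...interface{})
	Logf(format string, args ...interface{})
}

// Compile-time checks: the standard testing types already satisfy F.
var (
	_ F = (*testing.T)(nil)
	_ F = testing.TB(nil)
)

// recorder is a hypothetical non-testing implementation of F.
// Fatal and Skip panic here to stop the invocation; a real fuzzer
// would stop only the current input.
type recorder struct{ logs []string }

func (r *recorder) Log(args ...interface{}) { r.logs = append(r.logs, fmt.Sprint(args...)) }
func (r *recorder) Logf(format string, args ...interface{}) {
	r.logs = append(r.logs, fmt.Sprintf(format, args...))
}
func (r *recorder) Fatal(args ...interface{}) { panic("fatal: " + fmt.Sprint(args...)) }
func (r *recorder) Fatalf(format string, args ...interface{}) {
	panic("fatal: " + fmt.Sprintf(format, args...))
}
func (r *recorder) Skip(args ...interface{}) { panic("skip: " + fmt.Sprint(args...)) }
func (r *recorder) Skipf(format string, args ...interface{}) {
	panic("skip: " + fmt.Sprintf(format, args...))
}

// FuzzLen is a toy fuzz function written against the narrow interface.
func FuzzLen(f F, data []byte) {
	if len(data) == 0 {
		f.Skip("empty input")
	}
	f.Logf("got %d bytes", len(data))
}

func main() {
	r := &recorder{}
	FuzzLen(r, []byte("abc"))
	fmt.Println(r.logs)
}
```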

@dvyukov
Owner

dvyukov commented Apr 1, 2019

I agree that functions like SkipNow are less useful than say Skipf. But I think the main question is if we need that compatibility between F and TB or not. If we need it, then we need to accept legacy like SkipNow as well (that's how interfaces work).
So what was the original motivation behind converging F and TB? Is it only writing regression tests that invoke the fuzz function? That's an important one, even if just to debug a new crash. But if that's the only one, I am wondering if we can actually do something else.
Currently I tend to hardcode the inputs in the test, from recent examples:

func TestFuzz(t *testing.T) {
	for i, data := range []string{
		`test$length10(&200000000000009`,
		`test$str0(&(0x7f0000000000)='\xz+')`,
		`syz_compare(&AUTO=""/81546506777")`,
		`syz_compare(&AUTO=""/190734863281259)`,
		`syz_compare(&AUTO=""/500000)`,
		`test$vma0(&(0x7f0000000000)=0)`,
		`test$vma0(&(0x7f0000000000)=')`,
		`test$length10(&(0x7f0000009000),AUTO)`,
		`syz_compare(&AUTO=""/2712404)
mutate4()
mutate7()
mutate8()
`,
	} {
		t.Logf("test #%v: %q", i, string(data))
		Deserialize([]byte(data))
		ParseLog([]byte(data))
	}
}

That's kinda handy for short one-line inputs, but not so convenient for longer inputs or if there are too many of them. There is also some boilerplate involved, and it's not too convenient to select a single input to run (to debug a newly added crasher). Currently I comment out all but one, because I am not up to giving them names and writing additional code to select a subset of inputs, nor to wrapping them in t.Run.

I wonder if it's a good idea to instead allow 2/2+ directories with input corpus?
For example, we could read inputs from testdata/something/something, but also from -fuzzdir/-workdir if provided. Then testdata/ could contain hand-written inputs and regression tests and be checked in with the code (that's a small number of higher-quality inputs with low churn, so no different from unit tests, and it makes sense to check them in). The second dir can contain the random inputs; there are more of them, with high churn, so they are preferably checked in somewhere else (stored in an archive or something else).
Then the workflow would be to simply copy the crashing input from the second dir into testdata/ and run go test -run=file_name (if the auto-generated regression test uses t.Run then this will work auto-magically).

If I had something like this, I would not need convergence between F and TB.

-fuzzdir/-workdir could accept several dirs too. Only one of them would be the master one and will be used to store new inputs. LibFuzzer also has:

-merge
If set to 1, any corpus inputs from the 2nd, 3rd etc. corpus directories that trigger new code coverage will be merged into the first corpus directory. Defaults to 0. This flag can be used to minimize a corpus.

I wonder if we could merge into the master dir always and then handle this use case as well.

@thepudds
Collaborator

thepudds commented Apr 2, 2019

I wonder if it's a good idea to instead allow 2/2+ directories with input corpus?
For example, if we read inputs from testdata/something/something, but also from -fuzzdir/-workdir if provided.

FWIW, I very much like the concept of multiple corpus locations, especially from the perspective of the #19109 proposal.

Multiple corpus locations hopefully help with what might otherwise be a few different related problems with the current #19109 proposal:

  1. Execution time

The March 2017 proposal says:

go test runs fuzz functions as unit tests. Fuzz functions are selected with -run flag on par with tests (i.e. all by default). Fuzz functions are executed on all inputs from the corpus ...

I have a version of that working in fzgo (currently unpublished / still a bit WIP). One set of datapoints is running it on https://github.com/dvyukov/go-fuzz-corpus. The median execution time isn't too bad:

       strings:   42.959 ms to execute Fuzz function on corpus size:  456

but the two slowest are around 4 seconds:

  htmltemplate: 4605.361 ms to execute Fuzz function on corpus size: 5724
           png: 3818.827 ms to execute Fuzz function on corpus size:  267

4 seconds seems high enough that it might be cause for concern for the core Go proposal review team. I don't think https://github.com/dvyukov/go-fuzz-corpus necessarily reflects a multi-year effort to build up a corpus, and people can have slower Fuzz functions, so there is no reason to think that 4 seconds is any kind of upper bound on what people might see in the wild.

Having one smaller corpus checked into the main repository of a given Go package would mostly sidestep that, I think, with the option to have a much bigger corpus elsewhere for when you want to do a weekend run or run it in CI or oss-fuzz, etc.

  2. Discoverability, and/or how to find a separate corpus repo

If you are going to get the benefit of your corpus as unit tests when doing a normal go test ., under the March 2017 proposal, it would be fairly awkward to know where that other corpus repo is, or to remember to set and unset -fuzzdir=/some/other/repo or GOFLAGS=-fuzzdir=/some/other/repo, especially as you move between different packages.

I think Dmitry had wondered aloud in another issue if maybe go.mod could help with that problem, but I've observed a large amount of reluctance to put more information into go.mod. (Modules themselves suffer a fair amount from how two modules can't automatically find each other if you are working on more than one module at the same time, and there is no current plan or timeframe for how to resolve that, e.g., skim golang/go#27542 or golang/go#26640 for 30 sec to get a sense of the issue, if interested. In other words, one module "finding" another module is a fairly pressing issue, but even there there is reluctance to add more to go.mod).

Having a smaller corpus checked into pkgpath/testdata/fuzz or similar also sidesteps that problem for the common case of wanting to run unit tests.

  3. Running go test . probably shouldn't dirty your VCS status.

This is more of a question maybe, but the March 2017 proposal says:

go test runs fuzz functions as unit tests. ... Fuzz functions are executed on all inputs from the corpus and on some amount of newly generated inputs (for 1s in normal mode and for 0.1s in short mode).

If the norm is a smaller corpus checked into pkgpath/testdata/fuzz, then maybe that corpus would only be updated for a crasher, and not for new coverage, if you just run go test . (without pointing -fuzzdir at a larger location)?


Side note: the way it works in fzgo is currently fairly crude. It just dumps a corpus_test.go to a temp directory and Sprintfs a TestCorpus function, the heart of which is:

	for _, file := range files {
		if file.IsDir() {
			continue
		}
		t.Run(file.Name(), func(t *testing.T) {
			dat, err := ioutil.ReadFile(filepath.Join(corpusPath, file.Name()))
			if err != nil {
				t.Error(err)
			}
			fuzzer.%s(dat)   // ends up with fuzz.Fuzz(dat) here, or similar
		})

	}

That means things like -run with part of a filename works:

fzgo test -v -run=/f79c40 github.com/dvyukov/go-fuzz-corpus/png 
=== RUN   TestCorpus
=== RUN   TestCorpus/f79c40bef24b6e10e8f10b8bcab3223c26dc3110-9
--- PASS: TestCorpus (0.00s)
    --- PASS: TestCorpus/f79c40bef24b6e10e8f10b8bcab3223c26dc3110-9 (0.00s)

Finally, sorry for the long post, but I guess it is fair to say I am enthusiastic about the concept of > 1 corpus location.

@dvyukov
Owner

dvyukov commented Apr 3, 2019

Execution time

The main fuzzing corpus should not be checked into the stdlib repo, regardless of whether we support multiple corpus locations or not. So long execution time should not be a problem for anybody. Buildbot could check out the fuzzing corpus specifically, but then longer execution time is what it explicitly asks for.
So this looks somewhat orthogonal. Execution time depends on whether go test can reach the main fuzzing corpus or not. And that may or may not happen regardless of support for multiple corpus dirs. Or am I missing something?
We also have the -short flag as a potential lever for something. But I think running inputs until we exhaust a 1 second real-time threshold would unfortunately be wrong (non-deterministic). But maybe we could use it in some other smart way?
We also have a potential lever in the form of F object methods. E.g. "my inputs are slow, run only 1/10th of the corpus as unit tests". But we need to be careful not to pollute it with clumsy things. Something automatic is definitely preferable.
Also, were these tests run in parallel? We still kinda need to accommodate people with few cores, but running in parallel may mitigate the problem to some degree. E.g. short net/http tests run for 12 seconds without parallelism and 2.5s with parallelism.
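One way to get a "run only 1/N of the corpus" lever without the non-determinism of a wall-clock cutoff is to pick the subset by hashing each corpus file name. The sketch below is only an illustration under that assumption; inSample and sample are hypothetical helpers, not part of go-fuzz or any proposed F API.

```go
package main

import "hash/fnv"

// inSample reports whether the corpus file called name belongs to the 1/n
// sample. The decision depends only on the name, so every run selects the
// same subset.
func inSample(name string, n uint32) bool {
	if n <= 1 {
		return true // n of 0 or 1 means "run everything"
	}
	h := fnv.New32a()
	h.Write([]byte(name))
	return h.Sum32()%n == 0
}

// sample returns the names that fall into the 1/n sample, preserving order.
func sample(names []string, n uint32) []string {
	var out []string
	for _, name := range names {
		if inSample(name, n) {
			out = append(out, name)
		}
	}
	return out
}
```

Because selection is per file, adding a new input never changes whether an existing input is part of the sample; the sampled fraction is only approximately 1/n.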

@dvyukov
Owner

dvyukov commented Apr 3, 2019

I think Dmitry had wondered aloud in another issue if maybe go.mod could help with that problem, but I've observed a large amount of reluctance to put more information into go.mod.

What I was thinking about is something like git submodules. Namely, we import another repo at a particular location, and that location is exactly where the go tool looks for the corpus.
But modules can't do this, right? They can't import something into, say, a testdata subdir. They work at the package level if I understand correctly.
Is it feasible to have some convention for the import path? E.g. golang.org/x/crypto is the fuzzed repo, then the corpus is in golang.org/x/crypto.fuzz or something.

@dvyukov
Owner

dvyukov commented Apr 3, 2019

Running go test . probably shouldn't dirty your VCS status.

This sounds reasonable. I would say it is a requirement.

That means things like -run with part of a filename works:

Nice!

@thepudds
Collaborator

thepudds commented May 8, 2019

@dvyukov @josharian I think you are both saying now that there is no real value in using the testing.TB interface.

I wanted to confirm that is correct?

FWIW, I agree as well; the fzgo prototype already sidestepped that (by synthesizing Test functions as mentioned in the "Side note" at the end of the (too-long) comment #218 (comment)).

In the golang/go#19109 discussion, I think the first mention of testing.TB might have been by @dvyukov in golang/go#19109 (comment):

Yes, this is fittable into testing/quick. I can't make my mind as to whether we need to fit it into testing/quick or not yet.
...
But on the other hand, testing.TB seems to provide everything we need (with t.Skip as "discard this input"). So it would be nice to make them just normal tests with no new APIs. Any thoughts?

But I think that might have been more of a brainstorming comment from @dvyukov in response to someone asking if the new fuzzing functionality could fit into existing APIs such as testing/quick.

I think the first real suggestion to start with testing.TB was really more about how to prototype, when Russ mentioned in golang/go#19109 (comment):

If you want to allow the fuzzed function to take a *testing.F for error reporting in the long term, you could start with using testing.TB instead. Note that you can implement a testing.TB even though it has unexported methods, like this ...

Read in context, I think it is clear Russ is saying testing.TB would just be a step along the way to testing.F.

In any event, if it is not needed for a prototype, and seems awkward, then there seems to be no real value in using the testing.TB interface.

@josharian
Collaborator Author

FWIW, I read Russ's comment as saying that testing.F should implement testing.TB, and then proposing a way to achieve that.

But I think we should start by implementing it in whatever way makes the most practical sense. We can then have a further discussion with Russ about the value (or lack thereof) of implementing testing.TB. As long as we are thoughtful in our prototype, that shouldn't cause too much pain or wasted work.

@dvyukov
Owner

dvyukov commented May 9, 2019

Yes, let's postpone implementing TB. In the end, interfaces are retrofittable in Go, so we don't burn any bridges.
The only motivation for implementing TB is sharing code between fuzz functions and tests. But if we provide a zero-code way of doing it, then we don't need TB at all.

@thepudds
Collaborator

Some related discussion in golang/go#19109 (comment), including:

Thus, I'd like to propose to start with an interface similar to GetRandomData([]byte) from the beginning of the discussion, which later can be extended with means to specify structure:

type T interface {
        GetRandomData([]byte)  // can be called any number of times

        BeginSpan(label uint64) (id int)
        EndSpan(id int, discard bool)
}

@flyingmutant

@thepudds thanks a lot for the link here, I was not aware of this discussion.

I indeed believe that there is great value in the ability to run a modern property-based testing library on top of go-fuzz. For that, the ability to interactively request chunks of random data is crucial; generating it "upfront" based on the fuzz function signature is too limiting, because:

  • it does not allow very important "stateful" property tests to be expressed cleanly
  • there are a lot of ways even the simplest data types like []byte can be generated (e.g. "no ASCII control characters and not longer than 8 bytes")

I think it is much better to be able to decouple data generation from go-fuzz, and GetRandomData() serves this purpose nicely, while IMO not making the original binary fuzzing use case too painful.

I'll be happy to go more in-depth about anything related to property-based testing.

@josharian
Collaborator Author

Part of what makes fuzzing effective is a straightforward relationship between the input byte slice and what is done with it. As an example, instrumentation notes when a comparison fails and modifies the input slice to try to make it succeed. As another example, one kind of mutation is to detect an ASCII-printed number and replace it with another ASCII number.

If the byte slice is used to generate random numbers, that relationship will be lost, and fuzzing effectiveness will diminish. That is also why the discussion here about extended signatures is focused on precise input types—because it makes the relationship between the fuzz input and what is done with it much tighter.

As such, I think your proposal is not a good fit for a fuzzing engine.

It might be that you would want to reuse a fuzz function for randomized testing. That makes sense and is worth exploring. But the API you suggest would not work well for the fuzzing use case.
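To make the ASCII-number mutation mentioned above concrete, here is a minimal sketch of what such a mutator could look like. It is an illustration only, not go-fuzz's actual mutator code; replaceFirstNumber is a hypothetical name, and a real mutator would choose the digit run and the new value at random.

```go
package main

import "strconv"

// replaceFirstNumber finds the first run of ASCII digits in data and returns
// a copy in which that run is replaced by the decimal form of v. The boolean
// is false when data contains no digits.
func replaceFirstNumber(data []byte, v uint64) ([]byte, bool) {
	start := -1
	for i, c := range data {
		if c >= '0' && c <= '9' {
			start = i
			break
		}
	}
	if start < 0 {
		return nil, false
	}
	end := start
	for end < len(data) && data[end] >= '0' && data[end] <= '9' {
		end++
	}
	out := make([]byte, 0, len(data)+20) // 20 digits is enough for any uint64
	out = append(out, data[:start]...)
	out = strconv.AppendUint(out, v, 10)
	out = append(out, data[end:]...)
	return out, true
}
```

This only works because the digits in the input map directly to a value the program parses; if the same bytes merely seeded a random generator, rewriting them would have no predictable effect.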

@tv42
Contributor

tv42 commented May 17, 2019

#218 (comment) sounds like #65 all over again.

Could someone please retitle this bug to clarify it's about the fuzz.F part, and not "objects" as in complex inputs and not []byte? "Object" is a very vague word. "pass fuzz.F to fuzz functions", or such.

@dvyukov changed the title from "passing an object to fuzz functions" to "pass fuzz.F to fuzz functions" on May 17, 2019
@dvyukov
Owner

dvyukov commented May 17, 2019

@tv42 done

FTR, the pending PR is #223

@flyingmutant

@josharian while I agree that my proposal can make the job of the sonar harder, it will on the other hand make the job of the mutator much easier and significantly increase its chance of generating valid input data. The crucial thing to understand is that by building on top of a random data stream the right way, it is possible to have a very high hit rate of generating valid data, even when it has complex internal structure and validity requirements. All the mutator has to do is either change individual bytes, or add/remove/reorder spans of data, without having to know anything about the kind or constraints of the data being generated.

Consider a high-level mutation like "add customer to the list of customers". With the span-based approach, all we need to do is place a (potentially modified) copy of a span right after it; there is no need to know anything about what a valid "customer" or "list" is; in particular, there is no need to update any kind of "list length" data, as it does not exist in the data stream.
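A toy sketch of the span idea, under an encoding invented purely for illustration (it is not taken from the proposal): each customer is a non-zero "continue" byte followed by a one-byte ID, and a zero byte ends the list. decodeCustomers and duplicateSpan are hypothetical names.

```go
package main

// span marks the half-open byte range [start, end) consumed by one element.
type span struct{ start, end int }

// decodeCustomers reads customers until it sees a zero continuation byte or
// runs out of data. It returns the decoded IDs and the span of each customer.
func decodeCustomers(data []byte) (ids []byte, spans []span) {
	pos := 0
	for pos+1 < len(data) && data[pos] != 0 {
		ids = append(ids, data[pos+1])
		spans = append(spans, span{pos, pos + 2})
		pos += 2
	}
	return ids, spans
}

// duplicateSpan returns a copy of data with the bytes of s repeated directly
// after s. The mutator needs no knowledge of what the span encodes.
func duplicateSpan(data []byte, s span) []byte {
	out := make([]byte, 0, len(data)+s.end-s.start)
	out = append(out, data[:s.end]...)
	out = append(out, data[s.start:s.end]...)
	out = append(out, data[s.end:]...)
	return out
}
```

With the input {1, 10, 1, 20, 0}, duplicating the first span yields {1, 10, 1, 10, 1, 20, 0}, which decodes as three customers; no length field had to be patched.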

@dvyukov
Owner

dvyukov commented May 17, 2019

Let's please move the discussion of the complex args to #65. There are too many scattered discussions already.

@josharian
Collaborator Author

Another fuzz.F method pair to consider: StopCoverage() and StartCoverage().

The idea is that, when called, we would stop/start gathering coverage information. The implementation would probably be that we would swap in a dummy coverage counter, so that coverage writes would be effectively ignored.

The use case is for fuzzing something like a compiler. Imagine building a generation tool that emits valid code based on a random byte slice. The parsing code is full of branches that will saturate our coverage counters, but it's not coverage we care about; it'd be nice to get past that before attending to coverage.

Or I suppose for this particular case, we could also add a flag to go-fuzz-build to suppress instrumentation of a set of packages. Hmm.
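A minimal sketch of the "swap in a dummy coverage counter" idea, assuming instrumentation writes through a package-level pointer. The names cover, hit, and coverSize are hypothetical stand-ins and do not reflect how go-fuzz-build actually lays out its counters.

```go
package main

const coverSize = 1 << 16

var (
	realCover  [coverSize]byte // counters the fuzzer actually inspects
	dummyCover [coverSize]byte // sink for writes we want to ignore
	cover      = &realCover    // where instrumentation currently writes
)

// StopCoverage redirects coverage writes to the dummy table.
func StopCoverage() { cover = &dummyCover }

// StartCoverage points coverage writes back at the real table.
func StartCoverage() { cover = &realCover }

// hit stands in for the counter increment that instrumentation would insert
// at each basic block.
func hit(id uint32) { cover[id%coverSize]++ }
```

A fuzz function for a compiler could call StopCoverage() before generating the program from the input bytes and StartCoverage() just before compiling it. This sketch is not safe for concurrent use.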

@hrissan

hrissan commented Apr 11, 2024

Another method to consider is AddDataCoverage(hash uint64)

We have experimented with data-space-guided coverage and it shows very strong results. Basically, if we hash our model state and modify the coverage bitmap based on this hash, then the fuzzer is able to explore not only the code space, but the data space as well.

Explained here.

https://ieeexplore.ieee.org/document/9152719
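One plausible reading of AddDataCoverage, sketched below, is that the state hash simply selects a slot in the same kind of bitmap used for code coverage, so a previously unseen model state looks like new coverage to the fuzzer. The coverage type, bitmapSize, and countSet are assumptions made for illustration; the linked paper may describe a different scheme.

```go
package main

const bitmapSize = 1 << 16

// coverage is a stand-in for the fuzzer's coverage bitmap.
type coverage struct {
	bitmap [bitmapSize]byte
}

// AddDataCoverage folds a hash of the model state into the bitmap, using a
// saturating counter per slot.
func (c *coverage) AddDataCoverage(hash uint64) {
	idx := hash % bitmapSize
	if c.bitmap[idx] < 255 {
		c.bitmap[idx]++
	}
}

// countSet returns how many bitmap slots have been touched.
func (c *coverage) countSet() int {
	n := 0
	for _, b := range c.bitmap {
		if b != 0 {
			n++
		}
	}
	return n
}
```

Distinct states can collide in the same slot, so the hash should be well mixed before it is passed in.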
