introduce passive crawling #781
- JSONL update to include passive information; we can use `omitempty` for `response` and `headers`.
{
"timestamp": "2024-03-02T02:12:06.977128+05:30",
"request": {
"method": "GET",
"endpoint": "https://nuclei.projectdiscovery.io/tinyseg.min.js"
},
"response": {
"headers": {}
},
"passive": {
"source": "alienvault",
"reference": "https://otx.alienvault.com/api/v1/indicators/domain/projectdiscovery.io/url_list?page=20"
}
}
- URL dedup: as of now, duplicate lines are being returned in both CLI and JSONL output.
- CLI option validation, i.e. `-headless` and `-passive` can't be used together.
- Misc CLI updates
[INF] Enumerating passive endpoints for projectdiscovery.io
....
....
....
[INF] Found 2335 endpoints for projectdiscovery.io in 10 seconds (alienvault: 433, waybackurls: 1500, commoncrawl: 434)
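The dedup and option-validation items above could be sketched roughly as follows. This is only an illustration: the `Options` struct, its field names, and the `Dedupe` helper are hypothetical and are not taken from the katana codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// Options is a hypothetical subset of the CLI options discussed above.
type Options struct {
	Headless bool
	Passive  bool
}

// Validate rejects flag combinations that cannot work together.
func (o Options) Validate() error {
	if o.Headless && o.Passive {
		return errors.New("-headless and -passive can't be used together")
	}
	return nil
}

// Dedupe returns urls with repeated entries removed, keeping first-seen order.
func Dedupe(urls []string) []string {
	seen := make(map[string]struct{}, len(urls))
	out := make([]string, 0, len(urls))
	for _, u := range urls {
		if _, ok := seen[u]; ok {
			continue // already emitted once
		}
		seen[u] = struct{}{}
		out = append(out, u)
	}
	return out
}

func main() {
	urls := []string{
		"https://hackerone.com/reports/508459",
		"https://hackerone.com/reports/361438",
		"https://hackerone.com/reports/508459",
	}
	fmt.Println(Dedupe(urls))
	fmt.Println(Options{Headless: true, Passive: true}.Validate())
}
```

This compares exact strings only, so `http://` and `https://` variants of the same path would still count as distinct; whether those should be merged is a separate decision.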
$ go run . -u hackerone.com -ps
__ __
/ /_____ _/ /____ ____ ___ _
/ '_/ _ / __/ _ / _ \/ _ /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/
projectdiscovery.io
[INF] Current katana version v1.0.5 (latest)
[INF] Enumerating passive endpoints for https://hackerone.com
https://hackerone.com/reports/118582.json
http://hackerone.com/sh1yo
https://www.hackerone.com/cookies
https://hackerone.com/fetlife/hacktivity
https://hackerone.com/assets/static/main_js-vEWx0Z9c.js
https://hackerone.com/reports/508459
https://hackerone.com/reports/361438
...
[INF] Found 171036 endpoints for https://hackerone.com in 1m0.417674709s (alienvault: 1198, commoncrawl: 16274, waybackarchive: 153564)
JSON output example:
{
"timestamp": "2024-03-03T18:51:52.700414+03:00",
"request": {
"method": "GET",
"endpoint": "https://hackerone.com/reports/118582.json"
},
"passive": {
"source": "alienvault",
"reference": "https://otx.alienvault.com/api/v1/indicators/domain/hackerone.com/url_list?page=1"
}
}
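The JSON above has no `response` key, which is what `omitempty` can achieve. A minimal sketch follows, assuming hypothetical `Result`, `Request`, `Response`, and `Passive` types that only mirror the field names in the sample (the timestamp is left out for brevity). Note that `encoding/json` does not treat a struct value as empty, so the optional parts are pointers here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types that mirror the JSON field names shown above.
type Request struct {
	Method   string `json:"method"`
	Endpoint string `json:"endpoint"`
}

type Response struct {
	StatusCode int               `json:"status_code,omitempty"`
	Headers    map[string]string `json:"headers,omitempty"`
}

type Passive struct {
	Source    string `json:"source"`
	Reference string `json:"reference"`
}

type Result struct {
	Request Request `json:"request"`
	// Pointers so that omitempty can drop the whole object when it is nil.
	Response *Response `json:"response,omitempty"`
	Passive  *Passive  `json:"passive,omitempty"`
}

// Encode renders one result as a single JSONL line.
func Encode(r Result) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	line, err := Encode(Result{
		Request: Request{Method: "GET", Endpoint: "https://hackerone.com/reports/118582.json"},
		Passive: &Passive{Source: "alienvault", Reference: "https://otx.alienvault.com/api/v1/indicators/domain/hackerone.com/url_list?page=1"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```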
I'm a bit dubious about passive crawling capabilities within an active crawler. Anyway,
I would suggest the following changes:
- Replace `RegexUrlExtractor` with a package-level `regexp` and a global function:
package extractor
var re = regexp.MustCompile(...)
func Extract(...) ... { ... }
- With a passive result, create a Response object with a 200 status code and populate it with any additional data (headers, body, etc.) returned by the various providers, in case there is anything more than the bare URL.
  - Probably it would be enough to parse the URL and assign it to the response, in order to benefit from potential future changes in the output formatting logic.
What do you think?
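For illustration, the first suggestion might look like the sketch below. It uses `package main` only so it runs on its own (the review proposes `package extractor`), and the pattern and the `Extract` signature are assumptions, since the review elides both.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlRe is compiled once at package load instead of once per extractor value.
// The pattern is a deliberately simple assumption, not the one used in the PR.
var urlRe = regexp.MustCompile(`https?://[^\s"'<>]+`)

// Extract returns every URL-looking substring found in text.
func Extract(text string) []string {
	return urlRe.FindAllString(text, -1)
}

func main() {
	body := `{"url": "https://hackerone.com/sh1yo"} see http://hackerone.com/cookies now`
	for _, u := range Extract(body) {
		fmt.Println(u)
	}
}
```

A compiled `*regexp.Regexp` is safe for concurrent use, so a single package-level value can be shared by all passive sources.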
I made the first change but need clarification on the second one. If I make the change, it'll look like this:
{
"timestamp": "2024-03-11T11:54:05.711507+03:00",
"request": {
"method": "GET",
"endpoint": "https://hackerone.com/reports/508459"
},
"response": {
"status_code": 200,
"headers": {},
"body": "https://hackerone.com/reports/508459"
},
"passive": {
"source": "alienvault",
"reference": "https://otx.alienvault.com/api/v1/indicators/domain/hackerone.com/url_list?page=1"
}
}
What do you say? @Mzack9999 @ehsandeep
As the crawler is organized so that each request has a matching response, the existence of a URL within a passive source probably means that there was a matching response at some point:
type Response struct {
	Resp *http.Response `json:"-"` // <------- http.Response{StatusCode: 200, URL: url.Parse(passiveURL)}
	...
	StatusCode int `json:"status_code,omitempty"` // <----- 200
	RootHostname string `json:"-"` // <---- Domain Name
	...
}
I believe this might be useful for existing uses as an SDK, as well as for future integrations with other providers that return more data than the mere URL (for example headers, body, etc.).
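A rough sketch of that idea is shown below. The `Response` type is a trimmed, hypothetical copy of the struct quoted above, and `FromPassiveURL` is an invented helper name. Two details differ from the inline comments: `http.Response` has no `URL` field, so the parsed URL is attached through `Request.URL`, and `RootHostname` is filled with the plain hostname as a simplification.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// Response is a hypothetical, trimmed-down mirror of the struct quoted above.
type Response struct {
	Resp         *http.Response `json:"-"`
	StatusCode   int            `json:"status_code,omitempty"`
	RootHostname string         `json:"-"`
}

// FromPassiveURL builds a synthetic 200 response for a URL reported by a
// passive source, so downstream output code can treat it like a crawled one.
func FromPassiveURL(rawURL string) (*Response, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	if u.Hostname() == "" {
		return nil, fmt.Errorf("passive URL %q has no host", rawURL)
	}
	req := &http.Request{Method: http.MethodGet, URL: u}
	return &Response{
		Resp:         &http.Response{StatusCode: http.StatusOK, Request: req},
		StatusCode:   http.StatusOK,
		RootHostname: u.Hostname(), // assumption: hostname stands in for the root domain
	}, nil
}

func main() {
	resp, err := FromPassiveURL("https://hackerone.com/reports/508459")
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.StatusCode, resp.RootHostname, resp.Resp.Request.URL.Path)
}
```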
lgtm - Hopefully mixing active/passive detection will not generate too much ambiguity
Closes #139