
introduce passive crawling #781

Merged
merged 18 commits into from
Mar 20, 2024
Conversation

dogancanbakir
Member

Closes #139

go run . -u hackerone.com -ps

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/                                                   

                projectdiscovery.io

[INF] Current katana version v1.0.5 (latest)

https://hackerone.com/figdann
https://hackerone.com/0xf0e
https://hackerone.com/alerigord
https://hackerone.com/assets/static/main_js--0ks61pg.js
https://www.hackerone.com/company-news/introducing-program-levels-hacker-friendly-practices-improve-program-results
https://hackerone.com/davebb
https://www.hackerone.com/for-hackers/gtm.js
https://www.hackerone.com/application-security/how-industrys-first-hacker-powered-api-helps-hackers-automate-workflows
https://www.hackerone.com/core/themes/stable/css/system/components/progress.module.css?qdmjt3
https://hackerone.com/vampz
https://hackerone.com/bazookaa
https://hackerone.com/deep_aman_knp
https://hackerone.com/gosterweil
https://hackerone.com/navreet2514
https://hackerone.com/wutchumeanman
https://hackerone.com/yuri_almeida
https://hackerone.com/assets/static/js/main.ef92e845.js
https://hackerone.com/dingbat
https://hackerone.com/fotrosthefallenangel1087654321
https://hackerone.com/geetagirij01928374655647382910
https://hackerone.com/hack-us-h1c/hacktivity
https://hackerone.com/comehere123
https://hackerone.com/mastro_titta
https://hackerone.com/torproject?view_policy=true
https://hackerone.com/assets/constants-2a67188703cc082c42bcdb271bbbce028c0236e2dadb2792f73fb06898029882.js
https://hackerone.com/assets/frontend.71af6e628f7de0c794afe06389599db3.css
https://hackerone.com/assets/static/display_invitation_expiry-phs2ip__.js
https://hackerone.com/assets/static/js/main.3c6b6de1.js
https://hackerone.com/devinbileck
...

^C[INF] - Ctrl+C pressed in Terminal
[INF] Creating resume file: /Users/dogancanbakir/.config/katana/resume-cneqptna2ua0iajq0t3g.cfg

@dogancanbakir dogancanbakir self-assigned this Feb 27, 2024
A code-scanning alert on pkg/engine/passive/httpclient/httpclient.go was dismissed.
@dogancanbakir dogancanbakir linked an issue Feb 27, 2024 that may be closed by this pull request
@dogancanbakir dogancanbakir marked this pull request as ready for review February 29, 2024 06:33
Member

@ehsandeep ehsandeep left a comment


  • JSONL output update to include passive information; we can use `omitempty` for `response` and `headers`.
{
  "timestamp": "2024-03-02T02:12:06.977128+05:30",
  "request": {
    "method": "GET",
    "endpoint": "https://nuclei.projectdiscovery.io/tinyseg.min.js"
  },
  "response": {
    "headers": {}
  },
  "passive": {
    "source": "alienvault",
    "reference": "https://otx.alienvault.com/api/v1/indicators/domain/projectdiscovery.io/url_list?page=20"
  }
}
  • URL dedup: as of now, duplicate lines are being returned in both CLI and JSONL output.
  • CLI option validation, e.g. `-headless` and `-passive` can't be used together.
  • Misc CLI updates
[INF] Enumerating passive endpoints for projectdiscovery.io
....
....
....
[INF] Found 2335 endpoints for projectdiscovery.io in 10 seconds (alienvault: 433, waybackurls: 1500, commoncrawl: 434)

@dogancanbakir
Member Author

dogancanbakir commented Mar 3, 2024

$ go run . -u hackerone.com -ps                

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/                                                   

                projectdiscovery.io

[INF] Current katana version v1.0.5 (latest)
[INF] Enumerating passive endpoints for https://hackerone.com
https://hackerone.com/reports/118582.json
http://hackerone.com/sh1yo
https://www.hackerone.com/cookies
https://hackerone.com/fetlife/hacktivity
https://hackerone.com/assets/static/main_js-vEWx0Z9c.js
https://hackerone.com/reports/508459
https://hackerone.com/reports/361438
...
[INF] Found 171036 endpoints for https://hackerone.com in 1m0.417674709s (alienvault: 1198, commoncrawl: 16274, waybackarchive: 153564)

json output example:

{
    "timestamp": "2024-03-03T18:51:52.700414+03:00",
    "request": {
        "method": "GET",
        "endpoint": "https://hackerone.com/reports/118582.json"
    },
    "passive": {
        "source": "alienvault",
        "reference": "https://otx.alienvault.com/api/v1/indicators/domain/hackerone.com/url_list?page=1"
    }
}

Member

@Mzack9999 Mzack9999 left a comment


I'm a bit dubious about passive crawling capabilities within an active crawler. Anyway,
I would suggest the following changes:

  • Replace RegexUrlExtractor with a package-defined regexp and a global method:
package extractor

var re = regexp.MustCompile(...)
func Extract(...) ... { ... }
  • For a passive result, create a Response object with a 200 status code and populate it with any additional data (headers, body, etc.) returned by the various providers, in case there is anything more than the bare URL. It would probably be enough to parse the URL and assign it to the response, so that we benefit from potential future changes in the output-formatting logic.

What do you think?

@dogancanbakir
Member Author

dogancanbakir commented Mar 11, 2024

I made the first change but need clarification on the second one. If I make the change, it'll look like this:

{
    "timestamp": "2024-03-11T11:54:05.711507+03:00",
    "request": {
        "method": "GET",
        "endpoint": "https://hackerone.com/reports/508459"
    },
    "response": {
        "status_code": 200,
        "headers": {},
        "body": "https://hackerone.com/reports/508459"
    },
    "passive": {
        "source": "alienvault",
        "reference": "https://otx.alienvault.com/api/v1/indicators/domain/hackerone.com/url_list?page=1"
    }
}

What do you say? @Mzack9999 @ehsandeep

@Mzack9999
Member

As the crawler is organized so that each request has a matching response, the existence of a URL within a passive source probably means that there was a GET or HEAD request that received a 200 response (the body should remain empty if not available from the third-party source). Based on this assumption, I think we should populate the following fields of the navigation response:

type Response struct {
	Resp         *http.Response `json:"-"`                     // <------- http.Response{StatusCode: 200, URL: url.Parse(passiveURL)}
	...
	StatusCode   int            `json:"status_code,omitempty"` // <----- 200
	RootHostname string         `json:"-"`                     // <---- domain name
	...
}

I believe this might be useful both for existing SDK use cases and for future integrations with providers that return additional data beyond the mere URL (for example headers, body, etc.).

Member

@Mzack9999 Mzack9999 left a comment


LGTM - hopefully mixing active/passive detection will not generate too much ambiguity.

@ehsandeep ehsandeep merged commit 50865cf into dev Mar 20, 2024
13 checks passed
@ehsandeep ehsandeep deleted the introduce_passive_crawling branch March 20, 2024 17:27
Successfully merging this pull request may close these issues.

Passive crawling from external sources support