Option to store http responses to file #185

edoardottt · 2022-11-16T11:35:21Z

This PR adds the following two options (context: #177):

   -sr, -store-response              store http response to output directory
   -srd, -store-response-dir string  store http response to custom directory

So far, the PR creates the output directory (katana_responses by default, configurable with the -srd option), then a subfolder for each domain (e.g. katana_responses/www.edoardoottavianelli.it), and finally a file named after the SHA-1 hash of the URL.
E.g.:
For the URL https://www.edoardoottavianelli.it/post/post6/post6.html, the file created is katana_responses/www.edoardoottavianelli.it/3270ccbc882c5239fcaa7c801503df606e8979be (SHA-1 hash of the URL).

The problem is that the output is built upon the struct Result:

// Result is a result structure for the crawler
type Result struct {
	// Timestamp is the current timestamp
	Timestamp time.Time `json:"timestamp,omitempty"`
	// Method is the method for the result
	Method string `json:"method,omitempty"`
	// Body contains the body for the request
	Body string `json:"body,omitempty"`
	// URL is the URL of the result
	URL string `json:"endpoint,omitempty"`
	// Source is the source for the result
	Source string `json:"source,omitempty"`
	// Tag is the tag for the result
	Tag string `json:"tag,omitempty"`
	// Attribute is the attribute for the result
	Attribute string `json:"attribute,omitempty"`
}

and this struct is not suitable for this type of output. I would like to write something similar to meg output:

▶ head -n 20 ./out/example.com/45ed6f717d44385c5e9c539b0ad8dc71771780e0
http://example.com/robots.txt

> GET /robots.txt HTTP/1.1
> Host: example.com

< HTTP/1.1 404 Not Found
< Expires: Sat, 06 Jan 2018 01:05:38 GMT
< Server: ECS (lga/13A2)
< Accept-Ranges: bytes
< Cache-Control: max-age=604800
< Content-Type: text/*
< Content-Length: 1270
< Date: Sat, 30 Dec 2017 01:05:38 GMT
< Last-Modified: Sun, 24 Dec 2017 06:53:36 GMT
< X-Cache: 404-HIT

<!doctype html>
<html>
<head>
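
A meg-like layout such as the one above could be produced roughly like this. It is a sketch under assumptions: megStyle and prefixLines are hypothetical names, and the request and response header blocks are assumed to be available as plain strings.

```go
package main

import (
	"fmt"
	"strings"
)

// prefixLines prepends prefix to every line of a header block and
// normalizes CRLF line endings to LF.
func prefixLines(prefix, block string) string {
	var b strings.Builder
	for _, line := range strings.Split(strings.TrimRight(block, "\r\n"), "\n") {
		b.WriteString(prefix + strings.TrimSuffix(line, "\r") + "\n")
	}
	return b.String()
}

// megStyle is a hypothetical formatter: URL first, then the request head
// marked with "> ", the response head marked with "< ", then the raw body.
func megStyle(url, reqHead, respHead, body string) string {
	return url + "\n\n" +
		prefixLines("> ", reqHead) + "\n" +
		prefixLines("< ", respHead) + "\n" +
		body
}

func main() {
	fmt.Print(megStyle(
		"http://example.com/robots.txt",
		"GET /robots.txt HTTP/1.1\r\nHost: example.com\r\n",
		"HTTP/1.1 404 Not Found\r\nContent-Length: 1270\r\n",
		"<!doctype html>\n",
	))
}
```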

Do you have any suggestions? @ehsandeep @Mzack9999

This PR closes #177.


ehsandeep

Hi @edoardottt,

Thank you for working on this feature. We can keep this uniform with other PD tools (httpx/proxify/nuclei).

Here is an example format from httpx that we can replicate for katana.

echo example.com | httpx -sr

cat output/example.com.txt


HTTP/1.1 200 OK
Connection: close
Accept-Ranges: bytes
Age: 545788
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Sat, 19 Nov 2022 10:36:59 GMT
Etag: "3147526947"
Expires: Sat, 26 Nov 2022 10:36:59 GMT
Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
Server: ECS (dcb/7EEF)
Vary: Accept-Encoding
X-Cache: HIT

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

ehsandeep · 2022-11-19T10:50:49Z

@edoardottt the example above was for the response format. For the filenames, we are changing httpx to follow the meg format in projectdiscovery/httpx#848, and we should adopt the same here.

edoardottt · 2022-11-20T10:14:22Z

Hi @ehsandeep, just a little bit of context.
I'm trying to add the feature with as few changes as possible, but there are some constraints.

1. In headless mode it is currently not possible to store the responses, because the responses provided in headless mode lack some important information (such as the status and the protocol used) and I don't know how to store them. It might be possible, but the format of the responses wouldn't be the same. Let me know how you want to handle that.
2. katana prints the results on the command line, but it doesn't perform an HTTP request for all of them. Using a proxy, I see 12 results on the CLI and 6 requests made by katana, so the results in the responses folder and the results printed on the CLI don't match. This happens because, if the depth level is 2 (the default), katana will see some URLs at depth 3 and print them because they are in scope, but it won't crawl them (and so won't perform HTTP requests for them).

For this I've used https://www.edoardoottavianelli.it/ as the test case.

Test command

echo "https://www.edoardoottavianelli.it/" | ./katana -sr -proxy http://127.0.0.1:8888

Katana CLI output:

...
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
https://www.edoardoottavianelli.it/blog.html
https://www.edoardoottavianelli.it/cve.html
https://www.edoardoottavianelli.it/aboutme.html
https://www.edoardoottavianelli.it/cv.html
https://www.edoardoottavianelli.it/
https://www.edoardoottavianelli.it/index.html
https://www.edoardoottavianelli.it/CVE-2022-44019/index.html
https://www.edoardoottavianelli.it/CVE-2022-41392/index.html
https://www.edoardoottavianelli.it/post/post7/post7.html
https://www.edoardoottavianelli.it/post/post6/post6.html
https://www.edoardoottavianelli.it/post/post5/post5.html
https://www.edoardoottavianelli.it/post/post1/post1.html

Proxify logs

/tmp/logs> head -n 1 www.edoardoottavianelli.it:443-*
==> www.edoardoottavianelli.it:443-cdsvqfe2g8ihqu0g3pdg.txt <==
GET / HTTP/1.1

==> www.edoardoottavianelli.it:443-cdsvqfm2g8ihqu0g3pe0.txt <==
GET /cve.html HTTP/1.1

==> www.edoardoottavianelli.it:443-cdsvqfu2g8ihqu0g3peg.txt <==
GET /blog.html HTTP/1.1

==> www.edoardoottavianelli.it:443-cdsvqg62g8ihqu0g3pf0.txt <==
GET /aboutme.html HTTP/1.1

==> www.edoardoottavianelli.it:443-cdsvqge2g8ihqu0g3pfg.txt <==
GET /cv.html HTTP/1.1

==> www.edoardoottavianelli.it:443-cdsvqgm2g8ihqu0g3pg0.txt <==
GET / HTTP/1.1

Demo

$> echo "https://projectdiscovery.io/" | ./katana -sr

$> cat katana_responses/index.txt

katana_responses/projectdiscovery.io/60da1e66fe7802e77cc27eb06d24509a936d2b25.txt https://projectdiscovery.io/ (200 OK)
katana_responses/projectdiscovery.io/ca419688a1b91baf51417038bcf5d170b73220ee.txt https://projectdiscovery.io/app.bundle.css (200 OK)
katana_responses/projectdiscovery.io/73cc72a0568d8943b4cc46ee8e258f538dd4f25d.txt https://projectdiscovery.io/app.js (200 OK)

$> head -n 35 katana_responses/projectdiscovery.io/60da1e66fe7802e77cc27eb06d24509a936d2b25.txt

https://projectdiscovery.io/


GET / HTTP/1.1
Host: projectdiscovery.io
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36


HTTP/1.1 200 OK
X-Timer: S1668940326.392399,VS0,VE114
Content-Type: text/html; charset=utf-8
Cache-Control: max-age=600
Strict-Transport-Security: max-age=0; preload
X-Content-Type-Options: nosniff
Date: Sun, 20 Nov 2022 10:32:06 GMT
Access-Control-Allow-Origin: *
Via: 1.1 varnish
Connection: keep-alive
Age: 0
X-Cache: HIT
X-Cache-Hits: 1
Cf-Ray: 76d0850fdec35a1f-MXP
Nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
Last-Modified: Mon, 07 Nov 2022 14:59:55 GMT
X-Served-By: cache-mxp6929-MXP
Vary: Accept-Encoding
Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=dZHfumZ2a17gP3V%2BnmXiDRAACRkQEBN6goKGQfmqvhUBR6J0cnBqJyyg80pJn7bFK6%2BgIEw2jw2t7MrJ4zpGRwJtlvOl0YduCrfFRVDMT9%2BdRTcVmBPhr21HQ2UTNJz0rTDgRg4%3D"}],"group":"cf-nel","max_age":604800}
Expires: Sun, 20 Nov 2022 09:16:04 GMT
X-Proxy-Cache: MISS
X-Fastly-Request-Id: 854ff2bf4c9674ae0a90974e6794e694baf0aeca
X-Github-Request-Id: C01A:122BC:245AB82:2582642:6379EDFC
Cf-Cache-Status: DYNAMIC
Server: cloudflare

<!doctype html><html lang="en"><head><meta charset="utf-8"/><meta content="width=device-width" name="viewport"/><title>Projectdiscovery.io</title><link rel="preconnect" href="https://fonts.gstatic.com"><link href="https://fonts.googleapis.com/css2?family=Nunito+Sans&family=Poppins&family=Montserrat&family=Open+Sans&display=swap" rel="stylesheet"><script async src="https://www.googletagmanager.com/gtag/js?id=UA-165996103-1"></script><script>var _iub = _iub || [];


ehsandeep

@edoardottt we can handle headless responses in another ticket, as that requires further investigation.

The CLI behavior you mentioned is expected: the purpose of the project is to collect all the endpoints, so depending on the depth it will always print the discovered URLs in the output without visiting them. This behavior will become controllable when we introduce a -validate option in the future.

A minor bug identified by @wdahlenburg at #177 (comment) can also be fixed in this PR: include the POST body when writing a response to disk.

ehsandeep

@edoardottt I noticed that the crawl time improved and dropped by a lot. Is there anything specific you'd like to point out?

Thanks again for working on this.

ShubhamRasal

lgtm - suggesting small change

pkg/output/output.go

ehsandeep and others added 3 commits November 9, 2022 17:05

Added SECURITY.md

22fa3fe

Merge pull request #163 from projectdiscovery/dev

4301a61

v0.0.2

Option to store http responses to file (#177)

210bc81

edoardottt marked this pull request as draft November 16, 2022 11:35

Remove debug output

c756c97

ehsandeep reviewed Nov 19, 2022

edoardottt added 2 commits November 20, 2022 10:54

Add store response option

af206b0

Merge branch 'projectdiscovery:main' into store-resp

359c1dc

edoardottt marked this pull request as ready for review November 20, 2022 09:58

edoardottt requested a review from ehsandeep November 20, 2022 09:58

tarunKoyalwar linked an issue Nov 27, 2022 that may be closed by this pull request

Option to store http responses to file #177

Closed

edoardottt added 5 commits November 29, 2022 17:36

Merge pull request #1 from edoardottt/dev

f14c83b

chore(deps): bump github.com/projectdiscovery/retryablehttp-go (#186)

update

8ca3631

Merge branch 'store-resp' into dev

c7492f3

Merge pull request #2 from edoardottt/dev

88d5b60

Dev

update sum

0322183

ehsandeep requested changes Dec 6, 2022

edoardottt added 4 commits December 6, 2022 15:26

update

b23c510

Merge branch 'dev' into store-resp

9cb173a

update

75d6526

fix test

dc04116

edoardottt requested a review from ehsandeep December 6, 2022 14:38

ehsandeep approved these changes Dec 7, 2022

ehsandeep requested a review from ShubhamRasal December 7, 2022 13:58

ShubhamRasal requested changes Dec 8, 2022

update output

aa739f4

edoardottt requested a review from ShubhamRasal December 8, 2022 08:06

ehsandeep approved these changes Dec 8, 2022

ehsandeep merged commit 96210d8 into projectdiscovery:dev Dec 8, 2022

edoardottt deleted the store-resp branch December 8, 2022 08:46
