Update regex.go #249
update Regex #239 |
+1 for this PR to get accepted; relative URLs aren't being captured properly as of now. |
@yuzhe-Mortal, this seems like a good regex, but since it directly affects the number of results returned by katana, we have to add more unit tests to validate that we aren't missing any potential endpoints. |
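One way to approach the validation requested above is a small table of inputs and expected matches. The sketch below is only illustrative: it assumes the page-body pattern assembled from the BodyC0..BodyC2 fragments posted further down in this thread, and the `endpointCase` type, `sampleCases` table, and `failingCases` helper are hypothetical names, not part of katana.

```go
package main

import (
	"fmt"
	"regexp"
)

// pageBodyRegex is assembled from the BodyC0..BodyC2 fragments posted in this thread.
var pageBodyRegex = regexp.MustCompile(`(?:(` +
	`(?:[\.]{1,2}/[A-Za-z0-9-_/\\?&@\.?=%]+)` +
	`|(https?://[A-Za-z0-9_\-\.]+([\.]{0,2})?\/[A-Za-z0-9-_/\\?&@\.?=%]+)` +
	`|(/[A-Za-z0-9-_/\\?&@\.%]+\.(aspx?|action|cfm|cgi|do|pl|css|x?html?|js(p|on)?|pdf|php5?|py|rss))` +
	`))`)

// endpointCase is a hypothetical table entry: an HTML fragment and the
// first endpoint the regex is expected to return ("" means no match).
type endpointCase struct {
	name, body, want string
}

// sampleCases are illustrative inputs, not taken from katana's test suite.
var sampleCases = []endpointCase{
	{"relative link", `<a href="./docs/intro">`, "./docs/intro"},
	{"absolute URL", `<a href="https://example.com/a/b?x=1">`, "https://example.com/a/b?x=1"},
	{"path with extension", `<script src="/static/app.js">`, "/static/app.js"},
	{"plain text", `<p>hello world`, ""},
}

// failingCases describes every case whose first match differs from want.
func failingCases(cases []endpointCase) []string {
	var failed []string
	for _, c := range cases {
		if got := pageBodyRegex.FindString(c.body); got != c.want {
			failed = append(failed, fmt.Sprintf("%s: got %q, want %q", c.name, got, c.want))
		}
	}
	return failed
}

func main() {
	failed := failingCases(sampleCases)
	for _, f := range failed {
		fmt.Println("FAIL", f)
	}
	fmt.Printf("%d of %d cases failed\n", len(failed), len(sampleCases))
}
```

The same table could be moved into a `_test.go` file and extended with real-world URLs as they are reported, so that any later change to the pattern is checked against them.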
@tarunKoyalwar I need specific URL examples to modify the regex |
Hi, I've hosted here a fragment of a JS file that has relative URLs in it. LinkFinder finds the relative URLs, where katana does not.
|
@gabriel-schneider-vtex @tarunKoyalwar
var (
BodyA0 = `(?:`
BodyB0 = `(`
BodyC0 = `(?:[\.]{1,2}/[A-Za-z0-9-_/\\?&@\.?=%]+)`
BodyC1 = `|(https?://[A-Za-z0-9_\-\.]+([\.]{0,2})?\/[A-Za-z0-9-_/\\?&@\.?=%]+)`
BodyC2 = `|(/[A-Za-z0-9-_/\\?&@\.%]+\.(aspx?|action|cfm|cgi|do|pl|css|x?html?|js(p|on)?|pdf|php5?|py|rss))`
BodyB1 = `)`
BodyA1 = `)`
// pageBodyRegex extracts endpoints from page body
pageBodyRegex = regexp.MustCompile(BodyA0 + BodyB0 + BodyC0 + BodyC1 + BodyC2 + BodyB1 + BodyA1)
JsA0 = `(?:"|'|\s)`
JsB0 = `(`
JsC0 = `((https?://[A-Za-z0-9_\-\.]+(:\d{1,5})?)+([\.]{1,2})?/[A-Za-z0-9/\-_\.\\%]+([\?|#][^"']+)?)`
JsC1 = `|((\.{1,2}/)?[a-zA-Z0-9\-_/\\%]+\.(aspx?|js(on|p)?|html|php5?|html|action|do)([\?|#][^"']+)?)`
JsC2 = `|((\.{0,2}/)[a-zA-Z0-9\-_/\\%]+(/|\\)[a-zA-Z0-9\-_]{3,}([\?|#][^"|']+)?)`
JsC3 = `|((\.{0,2})[a-zA-Z0-9\-_/\\%]{3,}/)`
JsB1 = `)`
JsA1 = `(?:"|'|\s)`
// relativeEndpointsRegex is the regex to find endpoints in js files.
relativeEndpointsRegex = regexp.MustCompile(JsA0 + JsB0 + JsC0 + JsC1 + JsC2 + JsC3 + JsB1 + JsA1)
)
katana.exe -u https://gbrls.space/katana.js -o D:\Desktop\1.json -v -d 10
__ __
/ /_____ _/ /____ ____ ___ _
/ '_/ _ / __/ _ / _ \/ _ /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.2
projectdiscovery.io
[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
start Crawl,input: https://gbrls.space/katana.js
[js] [404 Not Found] https://gbrls.space/management/commerce/orders/ [bodylen:555] [Source:https://gbrls.space/katana.js]
[js] [404 Not Found] https://gbrls.space/generic/ [bodylen:555] [Source:https://gbrls.space/katana.js]
[js] [404 Not Found] https://gbrls.space/management/commerce/orders/paymentmethods/ [bodylen:555] [Source:https://gbrls.space/katana.js]
[js] [404 Not Found] https://gbrls.space/installments/ [bodylen:555] [Source:https://gbrls.space/katana.js]
[js] [404 Not Found] https://gbrls.space/payment/ [bodylen:555] [Source:https://gbrls.space/katana.js] |
I've also made the code more readable. Please let me know if there are other cases. |
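To make it easier to try further cases, here is a minimal sketch of how the JS pattern from the comment above can be run outside katana. It assumes the JsA0..JsA1 fragments are concatenated exactly as posted; the `extractJSEndpoints` helper and the `sampleJS` input are hypothetical and only mimic the kind of paths seen in the crawl output.

```go
package main

import (
	"fmt"
	"regexp"
)

// relativeEndpointsRegex is the JS pattern from this thread, joined into one expression.
// Group 1 is the endpoint; the surrounding quote or whitespace is not captured.
var relativeEndpointsRegex = regexp.MustCompile(`(?:"|'|\s)` + `(` +
	`((https?://[A-Za-z0-9_\-\.]+(:\d{1,5})?)+([\.]{1,2})?/[A-Za-z0-9/\-_\.\\%]+([\?|#][^"']+)?)` +
	`|((\.{1,2}/)?[a-zA-Z0-9\-_/\\%]+\.(aspx?|js(on|p)?|html|php5?|html|action|do)([\?|#][^"']+)?)` +
	`|((\.{0,2}/)[a-zA-Z0-9\-_/\\%]+(/|\\)[a-zA-Z0-9\-_]{3,}([\?|#][^"|']+)?)` +
	`|((\.{0,2})[a-zA-Z0-9\-_/\\%]{3,}/)` +
	`)` + `(?:"|'|\s)`)

// sampleJS is an invented fragment, not the file hosted at gbrls.space.
const sampleJS = `fetch("/management/commerce/orders/");
load('../api/v1/users');
import("https://example.com:8080/app/main.js?v=2");`

// extractJSEndpoints returns capture group 1 of every match, in order of appearance.
func extractJSEndpoints(js string) []string {
	var out []string
	for _, m := range relativeEndpointsRegex.FindAllStringSubmatch(js, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	for _, e := range extractJSEndpoints(sampleJS) {
		fmt.Println(e)
	}
}
```

Because the closing delimiter is consumed by each match, two endpoints that share a single separating space or quote may not both be reported; that is worth keeping in mind when adding test inputs.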
lgtm !
|
Increase crawl coverage by optimizing the regular expressions
In katana-main/katana-main/pkg/utils/regex.go, links in web pages are extracted with regular expressions.
This is implemented by pageBodyRegex and relativeEndpointsRegex.
However, the current expressions miss some endpoints, for example:
http://www.google.com:8080
https://www.google.com/images/%E4%BA%A7%E5%93%81%E4%B8%AD%E5%BF%83/
/1.php
The regular expressions proposed here should help with these cases.
Reference: https://github.com/yuzhe-Mortal/tool/blob/main/Reptile.py
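The three example endpoints above can be turned into a quick coverage check. The sketch below is an assumption-laden illustration: it uses the two patterns as posted in the review comments of this thread (which may differ from the ones in the diff itself), and the `captured` helper and `descriptionExamples` list are hypothetical names. A sample counts as captured by the body pattern if the first match equals the whole sample, and by the JS pattern if group 1 equals the sample when it is wrapped in double quotes.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are copied from the comments in this thread.
var pageBodyRegex = regexp.MustCompile(`(?:(` +
	`(?:[\.]{1,2}/[A-Za-z0-9-_/\\?&@\.?=%]+)` +
	`|(https?://[A-Za-z0-9_\-\.]+([\.]{0,2})?\/[A-Za-z0-9-_/\\?&@\.?=%]+)` +
	`|(/[A-Za-z0-9-_/\\?&@\.%]+\.(aspx?|action|cfm|cgi|do|pl|css|x?html?|js(p|on)?|pdf|php5?|py|rss))` +
	`))`)

var relativeEndpointsRegex = regexp.MustCompile(`(?:"|'|\s)` + `(` +
	`((https?://[A-Za-z0-9_\-\.]+(:\d{1,5})?)+([\.]{1,2})?/[A-Za-z0-9/\-_\.\\%]+([\?|#][^"']+)?)` +
	`|((\.{1,2}/)?[a-zA-Z0-9\-_/\\%]+\.(aspx?|js(on|p)?|html|php5?|html|action|do)([\?|#][^"']+)?)` +
	`|((\.{0,2}/)[a-zA-Z0-9\-_/\\%]+(/|\\)[a-zA-Z0-9\-_]{3,}([\?|#][^"|']+)?)` +
	`|((\.{0,2})[a-zA-Z0-9\-_/\\%]{3,}/)` +
	`)` + `(?:"|'|\s)`)

// descriptionExamples are the endpoints listed in the PR description.
var descriptionExamples = []string{
	"http://www.google.com:8080",
	"https://www.google.com/images/%E4%BA%A7%E5%93%81%E4%B8%AD%E5%BF%83/",
	"/1.php",
}

// captured reports whether each pattern returns the sample in full.
func captured(sample string) (byBody, byJS bool) {
	byBody = pageBodyRegex.FindString(sample) == sample
	// The JS pattern needs a delimiter on both sides, so quote the sample.
	m := relativeEndpointsRegex.FindStringSubmatch(`"` + sample + `"`)
	byJS = m != nil && m[1] == sample
	return byBody, byJS
}

func main() {
	for _, s := range descriptionExamples {
		b, j := captured(s)
		fmt.Printf("%-70s body=%v js=%v\n", s, b, j)
	}
}
```

With the patterns as transcribed here, the percent-encoded path and `/1.php` appear to be captured by both, while the bare `host:port` form without a trailing path is not, since both patterns require a `/` after the host. That case may deserve its own test before merging.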