Increase crawling content by optimizing regular expressions #238

Closed
yuzhe-Mortal opened this issue Dec 14, 2022 · 1 comment · Fixed by #249
Labels
Status: Completed Nothing further to be done with this issue. Awaiting to be closed. Type: Enhancement Most issues will probably ask for additions or changes.

Comments

@yuzhe-Mortal
Contributor

yuzhe-Mortal commented Dec 14, 2022

Increase crawling content by optimizing regular expressions

In katana-main/katana-main/pkg/utils/regex.go, links in web pages are extracted with regular expressions, namely pageBodyRegex and relativeEndpointsRegex.
However, these expressions miss some endpoints.

For example:

http://www.google.com:8080
https://www.google.com/images/%E4%BA%A7%E5%93%81%E4%B8%AD%E5%BF%83/
/1.php

The following regular expression should help:

    (?:"|'|\s)                            # Start delimiter: quote or whitespace
    (
      ((?:[a-zA-Z]{1,10}://|//)           # Match a scheme (1-10 letters) or //
      [^"'/]{1,}\.                        # Match a domain name (any character + dot)
      [a-zA-Z]{2,}[^"']{0,})              # The domain extension and/or path
      |
      ((?:/|\.\./|\./)                    # Start with /, ../, ./
      [^"'><,;| *()(%%$^/\\\[\]]          # Next character can't be...
      [^"'><,;|()]{1,})                   # Rest of the characters can't be...
      |
      ([a-zA-Z0-9_\-/]{1,}/               # Relative endpoint with /
      [a-zA-Z0-9_\-/\.]{1,}               # Resource name
      \.(?:[a-zA-Z]{1,4}|action)          # Rest + extension (length 1-4 or action)
      (?:[\?|#][^"|']{0,}|))              # ? or # mark with parameters
      |
      ([a-zA-Z0-9_\-/]{1,}/               # Relative endpoint without extension
      [a-zA-Z0-9_\-/]{3,}                 # At least 3 characters
      (?:[\?|#][^"|']{0,}|))              # ? or # mark with parameters
      |
      ([a-zA-Z0-9_\-\.]{1,}               # Filename
      \.(?:php|asp|aspx|jsp|json|
           action|html|js|txt|xml|do)     # . + extension
      (?:[\?|#][^"|']{0,}|))              # ? or # mark with parameters
    )
    (?:"|'|\s)                            # End delimiter: quote or whitespace

Reference: https://github.com/yuzhe-Mortal/tool/blob/main/Reptile.py
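
To make the proposal easier to try out, here is a minimal Python sketch that compiles the pattern in verbose mode and runs it over the three example endpoints. It is not katana's code (katana is written in Go). The names `ENDPOINT_REGEX` and `extract_endpoints` are made up for illustration, the sample inputs are assumptions, and the leading delimiter is written as `(?:"|'|\s)` on the assumption that the parentheses are meant to balance.

```python
import re

# Pattern from this issue, compiled with re.VERBOSE so layout whitespace
# and trailing comments are ignored (whitespace and # inside character
# classes are still literal).
ENDPOINT_REGEX = re.compile(r"""
  (?:"|'|\s)                            # start delimiter: quote or whitespace
  (
    ((?:[a-zA-Z]{1,10}://|//)           # scheme or //
    [^"'/]{1,}\.                        # domain name
    [a-zA-Z]{2,}[^"']{0,})              # domain extension and/or path
    |
    ((?:/|\.\./|\./)                    # starts with /, ../ or ./
    [^"'><,;| *()(%%$^/\\\[\]]          # next character can't be...
    [^"'><,;|()]{1,})                   # rest of the characters can't be...
    |
    ([a-zA-Z0-9_\-/]{1,}/               # relative endpoint with /
    [a-zA-Z0-9_\-/\.]{1,}               # resource name
    \.(?:[a-zA-Z]{1,4}|action)          # extension (length 1-4 or action)
    (?:[\?|#][^"|']{0,}|))              # ? or # with parameters
    |
    ([a-zA-Z0-9_\-/]{1,}/               # relative endpoint without extension
    [a-zA-Z0-9_\-/]{3,}
    (?:[\?|#][^"|']{0,}|))
    |
    ([a-zA-Z0-9_\-\.]{1,}               # filename
    \.(?:php|asp|aspx|jsp|json|
         action|html|js|txt|xml|do)     # . + extension
    (?:[\?|#][^"|']{0,}|))
  )
  (?:"|'|\s)                            # end delimiter: quote or whitespace
""", re.VERBOSE)


def extract_endpoints(text):
    """Return the endpoint captured by group 1 for every match in text."""
    return [match.group(1) for match in ENDPOINT_REGEX.finditer(text)]


if __name__ == "__main__":
    # Hypothetical snippets wrapping the three example endpoints in quotes.
    samples = [
        '<a href="http://www.google.com:8080">home</a>',
        "<img src='https://www.google.com/images/"
        "%E4%BA%A7%E5%93%81%E4%B8%AD%E5%BF%83/'>",
        "fetch('/1.php')",
    ]
    for sample in samples:
        print(extract_endpoints(sample))
```

Each match consumes its closing delimiter, so two endpoints separated by a single shared delimiter are not both found in one pass.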

@yuzhe-Mortal yuzhe-Mortal added the Type: Enhancement Most issues will probably ask for additions or changes. label Dec 14, 2022
@yuzhe-Mortal
Contributor Author

Why can't we combine pageBodyRegex and relativeEndpointsRegex?
Where there is no single quote, double quote, or space delimiter, endpoints are extracted using HTML tags.
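
As a rough illustration of tag-based extraction for values that have no quote delimiter, the sketch below uses Python's standard `html.parser` to collect URL-bearing attributes. It is not katana's implementation. `LinkCollector`, `extract_tag_links`, and the `URL_ATTRIBUTES` set are hypothetical names and choices made for this example.

```python
from html.parser import HTMLParser

# Attributes assumed to carry URLs; this list is illustrative, not exhaustive.
URL_ATTRIBUTES = {"href", "src", "action"}


class LinkCollector(HTMLParser):
    """Collect values of URL-bearing attributes from start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent.
        for name, value in attrs:
            if name in URL_ATTRIBUTES and value:
                self.links.append(value)


def extract_tag_links(html):
    """Return attribute values found in the given HTML, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    # Unquoted attribute values, which a quote-delimited regex would skip.
    print(extract_tag_links("<a href=/1.php>x</a>"))
    print(extract_tag_links("<img src=http://www.google.com:8080>"))
```

A parser reads the attribute value directly from the tag, so it does not depend on a quote or space surrounding the endpoint.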

@ehsandeep ehsandeep reopened this Dec 16, 2022
@ehsandeep ehsandeep linked a pull request Dec 16, 2022 that will close this issue
@ehsandeep ehsandeep linked a pull request Mar 4, 2023 that will close this issue
@ehsandeep ehsandeep added the Status: Completed Nothing further to be done with this issue. Awaiting to be closed. label Mar 4, 2023
@ehsandeep ehsandeep added this to the katana v.0.0.4 milestone Mar 4, 2023