Increase crawling content by optimizing regular expressions #238

yuzhe-Mortal · 2022-12-14T02:50:28Z


In katana-main/katana-main/pkg/utils/regex.go, links in web pages are extracted with regular expressions, namely pageBodyRegex and relativeEndpointsRegex.
However, these patterns miss some endpoints.

For example:

http://www.google.com:8080
https://www.google.com/images/%E4%BA%A7%E5%93%81%E4%B8%AD%E5%BF%83/
/1.php

The following regular expression would catch these cases:

    (?:"|'|\s)                              # Start delimiter: quote or whitespace
    (
      ((?:[a-zA-Z]{1,10}://|//)             # Match a scheme [a-zA-Z]{1,10} or //
      [^"'/]{1,}\.                          # Match a domain name (any character + dot)
      [a-zA-Z]{2,}[^"']{0,})                # The domain extension and/or path
      |
      ((?:/|\.\./|\./)                      # Start with /, ../, or ./
      [^"'><,;| *()(%%$^/\\\[\]]            # Next character can't be...
      [^"'><,;|()]{1,})                     # Rest of the characters can't be...
      |
      ([a-zA-Z0-9_\-/]{1,}/                 # Relative endpoint with /
      [a-zA-Z0-9_\-/\.]{1,}                 # Resource name
      \.(?:[a-zA-Z]{1,4}|action)            # Rest + extension (length 1-4 or action)
      (?:[\?|#][^"|']{0,}|))                # ? or # mark with parameters
      |
      ([a-zA-Z0-9_\-/]{1,}/                 # Relative endpoint with /, no extension
      [a-zA-Z0-9_\-/]{3,}                   # Rest of the path, 3+ characters
      (?:[\?|#][^"|']{0,}|))                # ? or # mark with parameters
      |
      ([a-zA-Z0-9_\-\.]{1,}                 # Filename
      \.(?:php|asp|aspx|jsp|json|
           action|html|js|txt|xml|do)       # . + extension
      (?:[\?|#][^"|']{0,}|))                # ? or # mark with parameters
    )
    (?:"|'|\s)                              # End delimiter: quote or whitespace

Reference: https://github.com/yuzhe-Mortal/tool/blob/main/Reptile.py
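
As a rough check, the pattern can be run against the three example endpoints. This is a minimal Python sketch, not katana code: it assumes the leading delimiter is meant to be a single non-capturing group `(?:"|'|\s)`, and the function name `extract_endpoints` and the sample HTML are made up for illustration.

```python
import re

# The pattern from this issue, compiled in verbose mode so the layout
# whitespace is ignored. Assumption: the start delimiter is one
# non-capturing group, which makes group 1 the extracted endpoint.
ENDPOINT_REGEX = re.compile(r"""
  (?:"|'|\s)                            # start delimiter: quote or whitespace
  (
    ((?:[a-zA-Z]{1,10}://|//)           # scheme or //
    [^"'/]{1,}\.                        # domain name up to a dot
    [a-zA-Z]{2,}[^"']{0,})              # domain extension and/or path
    |
    ((?:/|\.\./|\./)                    # starts with /, ../ or ./
    [^"'><,;| *()(%%$^/\\\[\]]          # next character can't be one of these
    [^"'><,;|()]{1,})                   # rest of the characters
    |
    ([a-zA-Z0-9_\-/]{1,}/               # relative endpoint with /
    [a-zA-Z0-9_\-/\.]{1,}               # resource name
    \.(?:[a-zA-Z]{1,4}|action)          # extension (length 1-4 or action)
    (?:[\?|#][^"|']{0,}|))              # optional ? or # with parameters
    |
    ([a-zA-Z0-9_\-/]{1,}/               # relative endpoint with /, no extension
    [a-zA-Z0-9_\-/]{3,}                 # rest of the path, 3+ characters
    (?:[\?|#][^"|']{0,}|))              # optional ? or # with parameters
    |
    ([a-zA-Z0-9_\-\.]{1,}               # filename
    \.(?:php|asp|aspx|jsp|json|
         action|html|js|txt|xml|do)     # known extension
    (?:[\?|#][^"|']{0,}|))              # optional ? or # with parameters
  )
  (?:"|'|\s)                            # end delimiter: quote or whitespace
""", re.VERBOSE)


def extract_endpoints(text):
    """Return the endpoint (capture group 1) of every match, in order."""
    return [m.group(1) for m in ENDPOINT_REGEX.finditer(text)]


# Hypothetical page body containing the three endpoints from this issue.
sample = '''
<a href="http://www.google.com:8080">home</a>
<img src='https://www.google.com/images/%E4%BA%A7%E5%93%81%E4%B8%AD%E5%BF%83/'>
<form action="/1.php">
'''

if __name__ == "__main__":
    for endpoint in extract_endpoints(sample):
        print(endpoint)
```

Go's regexp package has no verbose flag, so using this pattern in regex.go would mean removing the comments and layout whitespace first. Note also that each match consumes its closing delimiter, so two endpoints separated by a single space cannot both match.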


yuzhe-Mortal · 2022-12-14T03:08:16Z

Why can't we combine pageBodyRegex and relativeEndpointsRegex into a single pattern?
Where an endpoint is not delimited by a single quote, double quote, or space, it can be extracted from the HTML tags instead.
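
To illustrate the tag-based fallback, here is a minimal Python sketch using the standard library's `html.parser`. It is not how katana does it: the class name `LinkAttributeParser`, the helper `extract_from_tags`, and the choice of the `href`, `src`, and `action` attributes are assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkAttributeParser(HTMLParser):
    """Collect values of URL-bearing attributes, quoted or unquoted."""

    # Assumed set of attributes that usually carry endpoints.
    URL_ATTRS = {"href", "src", "action"}

    def __init__(self):
        super().__init__()
        self.endpoints = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent.
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                self.endpoints.append(value)


def extract_from_tags(html):
    """Return attribute-based endpoints found in the given HTML string."""
    parser = LinkAttributeParser()
    parser.feed(html)
    parser.close()
    return parser.endpoints


if __name__ == "__main__":
    # Unquoted attribute values give a delimiter-based regex little to anchor on.
    page = "<a href=/1.php>x</a><img src=http://www.google.com:8080>"
    print(extract_from_tags(page))
```

Because the parser reads attribute values directly, it does not depend on quotes or whitespace surrounding the endpoint.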

yuzhe-Mortal added the "Type: Enhancement" label Dec 14, 2022

yuzhe-Mortal closed this as completed Dec 15, 2022

ehsandeep reopened this Dec 16, 2022

ehsandeep linked a pull request Dec 16, 2022 that will close this issue: update Regex #239 (Closed)

ehsandeep linked a pull request Mar 4, 2023 that will close this issue: Update regex.go #249 (Merged)

ehsandeep added the "Status: Completed" label Mar 4, 2023

ehsandeep added this to the katana v.0.0.4 milestone Mar 4, 2023

ehsandeep closed this as completed Mar 20, 2023
