Multiple concurrent crawler with split output. Asking if there is interest in completing my Fork. #82

Open
maxime4000 opened this issue Nov 28, 2023 · 0 comments


@steve8708 To gauge interest: I have done a big refactoring of the codebase to integrate these features:

  • excludeSelectors : Remove elements that you don't want in the output data
  • Cleaner output : Remove some
  • Refactoring of the full code
  • Concurrency
  • Multiple config
  • Config parsing now sets defaults for options that are not defined.
  • ProgressBar logging
  • Sub Routing namings
  • Output is now generated in its own folder
  • Changed output.json to output/data.json
  • Fix .gitignore
  • Added Prettier to the project. (I wouldn't mind reverting that if it's not wanted)
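As an illustration of the "defaults if not defined" and "own folder" items above, here is a minimal sketch. It is not the fork's actual code: the `CrawlConfig` shape, the `resolveConfig` and `outputPath` helpers, the default values, and the per-config subfolder layout are all assumptions made for the example. Only `excludeSelectors` and the `output/data.json` file name come from the list above.

```typescript
// Hypothetical config shape; only excludeSelectors is named in the issue.
interface CrawlConfig {
  name: string; // assumed: label used to keep each section's output separate
  url: string;
  maxPagesToCrawl?: number;
  excludeSelectors?: string[];
  outputDir?: string;
}

type ResolvedConfig = Required<CrawlConfig>;

// Assumed defaults, chosen only for illustration.
const DEFAULT_MAX_PAGES = 50;
const DEFAULT_OUTPUT_DIR = "output";

// Fill in every optional field so the rest of the code never sees undefined.
function resolveConfig(config: CrawlConfig): ResolvedConfig {
  return {
    name: config.name,
    url: config.url,
    maxPagesToCrawl: config.maxPagesToCrawl ?? DEFAULT_MAX_PAGES,
    excludeSelectors: config.excludeSelectors ?? [],
    outputDir: config.outputDir ?? DEFAULT_OUTPUT_DIR,
  };
}

// Assumed layout: one subfolder per config, each holding its own data.json.
function outputPath(config: ResolvedConfig): string {
  return `${config.outputDir}/${config.name}/data.json`;
}
```

With several configs, each one would be resolved independently, so a section such as "about" would end up in `output/about/data.json` under this assumed layout.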

Things that would be required to fully "complete" the PR:

  • CLI full support
  • Terminal logs fixed. (Mostly INFO and ERROR logs from PlaywrightCrawler)

My needs:

I wanted to create a knowledge base for Godot, but wanted to separate each section into its own file. I managed to do that with multiple configs. Now that this is done and I have the output I needed, I am not interested in fixing the logging part. The logs were useful when I ran into an error, but not that helpful otherwise, imo.

Current state

So the current changes are big and about 90% finished. Nonetheless, I think they are an improvement, just not a "fully stable" and complete one... Everything that was added is functional, but I still have issues with the terminal output. If the lines get wrapped, the output gets ugly. Nx has a similar issue with its run-many CLI, so I don't know if it's VS Code, the terminal, or the library... I'm just not interested in completing the feature.

> @builder.io/[email protected] build
> tsc

Crawling started.
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | getting_started | 10/33 (L: 50, F: 33) | ETA: 101s | /getting_started/step_by_step/instancing.html
███████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | tutorials | 9/50 (L: 50, F: 327) | ETA: 268s | /tutorials/best_practices/godot_interfaces.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | contributing | 9/47 (L: 50, F: 47) | ETA: 248s | /contributing/development/index.html INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":6323,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":56909,"requestsTotal":9,"crawlerRuntimeMillis":60560,"retryHistogram":[9]}
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████░░░░░░░░ | getting_started | 26/33 (L: 50, F: 33) | ETA: 28s | /getting_started/first_3d_game/03.player_movement_code.html
█████████████████████░░░░░░░░░░░░░░░░░░░ | tutorials | 26/50 (L: 50, F: 327) | ETA: 91s | /tutorials/editor/managing_editor_features.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
█████████████████████░░░░░░░░░░░░░░░░░░░ | contributing | 26/50 (L: 50, F: 57) | ETA: 92s | /contributing/development/debugging/using_sanitizers.html INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4464,"requestsFinishedPerMinute":13,"requestsFailedPerMinute":0,"requestTotalDurationMillis":116054,"requestsTotal":26,"crawlerRuntimeMillis":120568,"retryHistogram":[26]}
████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | getting_started | 33/33 (L: 50, F: 33) | ETA: 0s | Completed
██████████████████████████████████████░░ | tutorials | 47/50 (L: 50, F: 327) | ETA: 8s | /tutorials/3d/procedural_geometry/arraymesh.html
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
███████████████████████████████████░░░░░ | contributing | 44/50 (L: 50, F: 73) | ETA: 19s | /contributing/documentation/class_reference_primer.html INFO Sta ████████████████████████████████████████ | about | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | getting_started | 33/33 (L: 50, F: 33) | ETA: 0s | Completed
████████████████████████████████████████ | tutorials | 50/50 (L: 50, F: 327) | ETA: 0s | Completed
████████████████████████████████████████ | community | 7/7 (L: 50, F: 7) | ETA: 0s | Completed
████████████████████████████████████████ | contributing | 50/50 (L: 50, F: 73) | ETA: 0s | Completed

I made this multi progress bar because, with concurrent crawling, the log was hard to follow. With it, progress is easier to follow, but when other log output (errors, info, and so on) is printed in the meantime, it's a mess...
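For readers who want to reproduce the line format shown in the output above, here is a rough sketch. It is not the fork's implementation. It assumes that "L" is the page limit and "F" the number of pages found, and that the displayed total is the smaller of the two, which matches the pasted output. `renderBar` and `fitToWidth` are hypothetical helper names. `fitToWidth` shows one possible way to avoid the wrapping problem by truncating each line to the terminal width.

```typescript
// Assumed reading of the fields in the pasted output.
interface CrawlProgress {
  name: string;
  done: number;
  limit: number; // "L" in the output (assumption: page limit)
  found: number; // "F" in the output (assumption: pages found so far)
  etaSeconds: number;
  current: string; // current path, or "Completed"
}

const BAR_WIDTH = 40; // the bars in the pasted output are 40 cells wide

// Build one status line in the same layout as the pasted output.
function renderBar(p: CrawlProgress): string {
  const total = Math.min(p.limit, p.found);
  const ratio = total > 0 ? Math.min(p.done / total, 1) : 0;
  const filled = Math.round(ratio * BAR_WIDTH);
  const bar = "█".repeat(filled) + "░".repeat(BAR_WIDTH - filled);
  return `${bar} | ${p.name} | ${p.done}/${total} (L: ${p.limit}, F: ${p.found}) | ETA: ${p.etaSeconds}s | ${p.current}`;
}

// Truncate a line so it never exceeds the terminal width and cannot wrap.
// Assumes every character occupies one terminal cell (not true for CJK or emoji).
function fitToWidth(line: string, columns: number): string {
  const chars = Array.from(line);
  if (chars.length <= columns) return line;
  if (columns <= 1) return chars.slice(0, Math.max(columns, 0)).join("");
  return chars.slice(0, columns - 1).join("") + "…";
}
```

In a real terminal the width would typically come from `process.stdout.columns`, which is undefined when the output is not a TTY, so a fallback width would be needed.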

The issue:

When this "type" of line is emitted by PlaywrightCrawler, it breaks the multi progress bar:

INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4525,"requestsFinishedPerMinute":12,"requestsFailedPerMinute":0,"requestTotalDurationMillis":113118,"requestsTotal":25,"crawlerRuntimeMillis":120511,"retryHistogram":[25]}

The multi progress bar display then gets corrupted. I don't understand terminals and Playwright well enough to know exactly what to change to fix this.
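One possible direction, sketched here without any claim that it fits the fork: hold back log lines while the bars are being drawn and print them once the bars are finished, so that nothing writes into the middle of the bar area. `BarAwareLogger` is a hypothetical class. How PlaywrightCrawler's own logger would be routed into `log()` is deliberately left out, since that depends on the logging library's API.

```typescript
type Writer = (text: string) => void;

// Queues messages while progress bars own the terminal, then flushes them.
class BarAwareLogger {
  private readonly write: Writer;
  private pending: string[] = [];
  private barsActive = false;

  constructor(write: Writer) {
    this.write = write;
  }

  // Call before the first bar is drawn.
  startBars(): void {
    this.barsActive = true;
  }

  // Prints immediately when no bars are shown, otherwise queues the message.
  log(message: string): void {
    if (this.barsActive) {
      this.pending.push(message);
    } else {
      this.write(message + "\n");
    }
  }

  // Call after the last bar has completed; emits queued messages in order.
  stopBars(): void {
    this.barsActive = false;
    for (const message of this.pending) {
      this.write(message + "\n");
    }
    this.pending = [];
  }
}
```

The trade-off is that messages are delayed until the crawl ends. Printing them above the bars on each redraw would be more useful during long runs, but needs cooperation from whatever draws the bars.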

Why am I asking?

I have no interest in fixing the terminal output since I got what I wanted, but the changes as a whole are an improvement. So I am asking: could I open a PR and let someone else fix the remaining issue in that PR and push it? I guess the concurrency part could be omitted, and that would make the PR "complete".

Other changes that I can omit if not wanted:

I use a "modern" Prettier config, and my editor formats with my own config if none exists in the repo I'm working on. I set up Prettier in the project because my editor was already changing the formatting whenever I saved. I'm OK with reverting this, though I could also push it. I also copied in some files that configure this, since I wasn't planning to make big changes, but I'm willing to remove those too if there's no interest.

Here's a visual preview:

image
image

  • Won't push the config changes tho. (Maybe only the typing)