
feat: Add spiderqueue configuration option #476

Merged 3 commits into master on Mar 10, 2023

Conversation

@jpmckinney (Contributor) commented Mar 8, 2023

closes #197

See #475 for updating the default spider queue.
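
For context, here is a minimal sketch (not the code in this PR) of how a dotted-path spider queue option could be resolved at startup. The scrapyd.spiderqueue.SqliteSpiderQueue default reflects Scrapyd's existing module layout, but the section name, option name, and wiring below are assumptions for illustration:

```python
# Illustrative sketch only, not the implementation merged in this PR.
# It assumes a "spiderqueue" option holding a dotted class path.
from configparser import ConfigParser
from importlib import import_module

DEFAULT_SPIDERQUEUE = "scrapyd.spiderqueue.SqliteSpiderQueue"


def load_object(path):
    """Import an object from a dotted path like 'pkg.module.ClassName'."""
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)


config = ConfigParser()
config.read("scrapyd.conf")
queue_cls = load_object(
    config.get("scrapyd", "spiderqueue", fallback=DEFAULT_SPIDERQUEUE)
)
```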

codecov bot commented Mar 8, 2023

Codecov Report

Merging #476 (040cc67) into master (68cb43c) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #476      +/-   ##
==========================================
- Coverage   87.31%   87.30%   -0.01%     
==========================================
  Files          41       41              
  Lines        1876     1875       -1     
==========================================
- Hits         1638     1637       -1     
  Misses        238      238              
| Flag | Coverage Δ |
|------|------------|
| unittests | 87.30% <100.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| scrapyd/jobstorage.py | 100.00% <100.00%> (ø) |
| scrapyd/spiderqueue.py | 95.45% <100.00%> (+0.21%) ⬆️ |
| scrapyd/tests/test_spiderqueue.py | 100.00% <100.00%> (ø) |
| scrapyd/utils.py | 89.25% <100.00%> (+0.08%) ⬆️ |


@pspsdev commented Mar 9, 2023

Probably not the correct solution, as different types of queues might need more arguments, like redis_host, redis_port, etc., not just the DB path. I'm wondering what the quickest and simplest way would be to improve Scrapyd performance for users who run high polling rates. Maybe add support for :memory: as dbpath, so SQLite runs in memory? Currently the code always produces a file-based database: scrapyd/spiderqueue.py defaults to :memory: correctly, but that can never happen because upstream code always sets a file path.

@jpmckinney (Contributor, Author) commented Mar 9, 2023

Ah, good call, we need to move the dbs_dir parsing to the queue's initialization.

For Redis, PostgreSQL, etc., dbs_dir can still be a single string, e.g. a connection string like redis://user:pass@host:port/database or postgresql://user:pass@host:port/database.

Edit: Or, we can pass the full config object to the queues. As-is, it's not out of the question for alternative queues to just read their configuration from the environment, rather than from the config file.
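
As a rough sketch of the connection-string idea (one possible shape, not the merged behaviour), a queue could branch on the scheme of dbs_dir; which schemes are recognised and how their parts map to queue settings are assumptions for illustration:

```python
# Sketch of the connection-string idea, not the merged implementation.
from urllib.parse import urlparse


def queue_backend(dbs_dir):
    parsed = urlparse(dbs_dir)
    if parsed.scheme in ("redis", "postgresql"):
        # e.g. redis://user:pass@host:port/database
        return {
            "scheme": parsed.scheme,
            "host": parsed.hostname,
            "port": parsed.port,
            "database": parsed.path.lstrip("/"),
        }
    # Otherwise treat dbs_dir as a filesystem directory (or ":memory:").
    return {"scheme": "sqlite", "path": dbs_dir}


print(queue_backend("redis://user:pass@localhost:6379/0"))
print(queue_backend("dbs"))
```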

@jpmckinney (Contributor, Author) commented Mar 9, 2023

> Currently the code always produces a file-based database: scrapyd/spiderqueue.py defaults to :memory: correctly, but that can never happen because upstream code always sets a file path.

I would say that its use of :memory: is incorrect, as it will cause all projects to use the same DB, which is not the intent. The connection string would need to be something like file:project1?mode=memory (see https://www.sqlite.org/inmemorydb.html).

Edit: Never mind, I misread some other documentation:

> Every :memory: database is distinct from every other. So, opening two database connections each with the filename ":memory:" will create two independent in-memory databases.
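
A quick demonstration of that behaviour with Python's built-in sqlite3 module; the named in-memory URI form follows the SQLite documentation linked above:

```python
import sqlite3

# Two plain ":memory:" connections get two independent databases.
a = sqlite3.connect(":memory:")
b = sqlite3.connect(":memory:")
a.execute("CREATE TABLE q (x)")
print(b.execute("SELECT name FROM sqlite_master").fetchall())  # [] -- 'b' is empty

# A named in-memory database with a shared cache can be shared between
# connections, e.g. one per project.
c = sqlite3.connect("file:project1?mode=memory&cache=shared", uri=True)
d = sqlite3.connect("file:project1?mode=memory&cache=shared", uri=True)
c.execute("CREATE TABLE q (x)")
print(d.execute("SELECT name FROM sqlite_master").fetchall())  # [('q',)]
```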

A commit was added to the pull request: …o spider queue implementation. Respect :memory: and URL values for dbs_dir.
@jpmckinney (Contributor, Author) commented
Okay, custom queues should be more extensible with the latest commit.
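
To illustrate what "more extensible" could look like for a third-party queue, here is a hypothetical in-memory queue that receives the config object; its constructor signature and method names are assumptions for the sketch, not the exact interface Scrapyd requires:

```python
# Hypothetical custom queue, illustrating the "pass the config to the queue"
# idea. The (config, project) constructor and the method names are
# assumptions for this sketch, not an interface guaranteed by Scrapyd.
import heapq
import itertools


class MemorySpiderQueue:
    def __init__(self, config, project):
        # With the full config object available, a queue can read any option
        # it needs (e.g. a Redis host) instead of only a database path.
        self.project = project
        self._counter = itertools.count()
        self._heap = []  # entries: (-priority, insertion order, name, args)

    def add(self, name, priority=0.0, **spider_args):
        heapq.heappush(self._heap, (-priority, next(self._counter), name, spider_args))

    def pop(self):
        if not self._heap:
            return None
        _, _, name, spider_args = heapq.heappop(self._heap)
        return dict(spider_args, name=name)

    def count(self):
        return len(self._heap)
```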

@pspsdev commented Mar 9, 2023

@jpmckinney this looks good. What else needs to be done to get it merged?

@jpmckinney jpmckinney merged commit 538357c into master Mar 10, 2023
@jpmckinney jpmckinney deleted the 197-spiderqueue branch March 10, 2023 16:39
jpmckinney added a commit that referenced this pull request on Jul 23, 2024: …an scrapyd.sqlite.initialize function (unnecessary support for dbs_dir URLs added in #476)
Successfully merging this pull request may close these issues: Configurable spider queue class