Code refactoring and speed optimizations #24
…s dep manager, move to .env config
…s, update basic dict types
…atter instructions
…add exponential backoff decorator
Code refactoring and speed optimizations
general comments:
- move tests to one folder
- add instructions on how to run tests
- add test coverage percentage
- run black
- add docstrings
- add pylint to the project - https://docs.pylint.org/index.html
- add commands for running pylint
Waiting for some feedback and further discussion on these suggestions! @SanGreel
Also, while debugging, I noticed a special case: when Dialog is of type Group or Channel, then Telethon's Therefore, all messages from groups or channels have a user_id in Semantically, does this make sense? Should this be left as is, or modified so that it isn't null even in case of private chat? @SanGreel
General overview
This PR introduces a major update from v0.01 to v0.02, which comes with a lot of new features.
Improved general code structure
settings.py, which allows for easy setup and explains some of the variables. Among the most important settings is the concurrent dialog processing count, which allows fine-grained tuning of the speed and proved useful when adjusting to Telegram's request blocking (429 Too Many Requests).
Speed optimizations
The initial version of the script, while utilizing the asynchronous nature of the Telethon library, didn't use it to full capacity.
Now that separate dialogs are also processed asynchronously, the speed has increased drastically.
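A minimal sketch of the idea, not the PR's actual code: several dialogs are processed at once, but never more than a configured number at a time. The names `CONCURRENT_DIALOGS`, `download_dialog`, and `download_all` are hypothetical, and the download itself is simulated with a short sleep instead of a Telethon call.

```python
import asyncio

# Hypothetical setting name; the PR's settings.py may call it something else.
CONCURRENT_DIALOGS = 3


async def download_dialog(dialog_id: int) -> int:
    """Stand-in for the real per-dialog download (no network access here)."""
    await asyncio.sleep(0.01)  # simulates waiting on Telegram
    return dialog_id


async def download_all(dialog_ids, limit=CONCURRENT_DIALOGS):
    """Process dialogs concurrently, never more than `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)
    running = 0  # dialogs currently being downloaded
    peak = 0     # highest number of simultaneous downloads observed

    async def bounded(dialog_id):
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            try:
                return await download_dialog(dialog_id)
            finally:
                running -= 1

    # gather() returns results in the same order as the inputs.
    results = await asyncio.gather(*(bounded(d) for d in dialog_ids))
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(download_all(range(10)))
    print(results, peak)
```

Lowering the limit is the natural knob to turn when Telegram starts answering with 429 Too Many Requests; raising it trades safety for throughput.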
Note: all the following tests (unless specified otherwise) were conducted locally on my own data. It consists of 328 Telegram dialogs, of which around 5-10 contain more than 100k messages. These measurements are not very precise and should be taken with a grain of salt.
0_download_dialogs_list.py:
- Initial version: 1m41s
- Optimized version: 1m9s
1_download_dialogs_data.py (on 5 chats with >100k msgs):
- Initial version: 40k msgs/hour
- Improved version: 200k msgs/hour
When testing for another user (kind regards to the tester @velosypedno), the speed (after editing the concurrency parameter) reached 340k msgs/hour.
Speed becomes very important as the number of chats starts to increase. To show the difference: the optimized version took around 10–12 hours, whereas the original would have taken a whopping 52 hours!
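As a rough sanity check, the quoted figures can be compared with each other. The sketch below only reuses the numbers stated above and assumes that the 52-hour estimate scales inversely with the measured message rate, which may not be how it was actually derived.

```python
# Figures quoted in the measurements above.
list_initial_s = 1 * 60 + 41    # 1m41s
list_optimized_s = 1 * 60 + 9   # 1m9s
data_initial_rate = 40_000      # msgs/hour
data_improved_rate = 200_000    # msgs/hour
original_total_hours = 52

list_speedup = list_initial_s / list_optimized_s
data_speedup = data_improved_rate / data_initial_rate

# Assumption: total run time is inversely proportional to the message rate.
projected_hours = original_total_hours / data_speedup

print(f"dialog list speedup: {list_speedup:.2f}x")       # 1.46x
print(f"dialog data speedup: {data_speedup:.1f}x")       # 5.0x
print(f"projected optimized run: {projected_hours:.1f} h")  # 10.4 h
```

The projected 10.4 hours falls inside the reported 10–12 hour range, so the numbers are at least consistent with one another.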
Important updates
Each of the following changes was made with improvement in mind; all of them are debatable and can be reverted upon request.
- In the initial version, the from_id parameter for the second script was saved as a representation of a Python object (e.g., PeerUser(user_id=...)). Now we instead retrieve the ID from any Telethon object (see TypePeer) using a Telethon utility function (see telethon.utils.get_peer_id), which led to some changes in the IDs. The next part is retrieved directly from the documentation:
- Instead of the previous dialog types ("Private dialog", "Group", "Channel" and "" for unknown type), more standardized dialog types were introduced, which are listed in an Enum (see dict_types): "private", "group", "channel" and "unknown".
- Migrated to the poetry (see here) package manager instead of the default pip. It proved to be more user-friendly and better at resolving dependencies.
- Used the ruff (see here) formatter instead of black. While code styles are highly debatable, it is a lot faster and more responsive than black.
Possible improvements
threading support to increase throughput. Sadly, I could neither use nor test it myself, as I received a lot of 429 status codes, but it proved to be useful for other users.
Summary
All in all, each improvement was made with a possible use case in mind. Feel free to leave your comments and suggest improvements!
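For anyone holding data exported with the initial version, the dialog type change listed under Important updates could be handled with a small mapping. This is only a sketch: the class name `DialogType`, the `LEGACY_LABELS` table, and the `migrate_dialog_type` helper are hypothetical and are not taken from the PR's dict_types module; only the old and new label strings come from the description above.

```python
from enum import Enum


class DialogType(Enum):  # hypothetical name; the PR keeps its Enum in dict_types
    PRIVATE = "private"
    GROUP = "group"
    CHANNEL = "channel"
    UNKNOWN = "unknown"


# Labels used by the initial version, mapped to the new standardized values.
LEGACY_LABELS = {
    "Private dialog": DialogType.PRIVATE,
    "Group": DialogType.GROUP,
    "Channel": DialogType.CHANNEL,
    "": DialogType.UNKNOWN,
}


def migrate_dialog_type(old_label: str) -> str:
    """Translate an old label into the new value; unrecognized labels become "unknown"."""
    return LEGACY_LABELS.get(old_label, DialogType.UNKNOWN).value


if __name__ == "__main__":
    for label in ("Private dialog", "Group", "Channel", ""):
        print(repr(label), "->", migrate_dialog_type(label))
```

Falling back to "unknown" for unrecognized labels is a design choice made here for safety; raising an error instead would surface unexpected values earlier.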