Code refactoring and speed optimizations #24

Hukyl · 2024-10-26T13:54:05Z

General overview

This PR introduces a major update from v0.01 to v0.02, which comes with a lot of new features.

Improved general code structure

Each entity/class now lives in its own file and folder, which gives a clearer logical structure and makes it easier to extend the functionality, for example with new export types.
Added type hints and type declarations for basic entities such as Message and Dialog.
Refactored some functions.
Moved the settings to a separate file, settings.py, which simplifies setup and documents what some of the variables do. One of the most important settings is the number of dialogs processed concurrently, which allows fine-tuning the speed and proved useful when adapting to Telegram's request blocking (429 Too Many Requests).
Added CONTRIBUTING.md for future contributors.
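
As a rough illustration of the settings idea above, a concurrency value could be read from the environment with a safe fallback. This is only a sketch: the variable name `CONCURRENT_DIALOGS`, the default of 5 and the function `load_concurrent_dialogs` are hypothetical and are not taken from the PR's settings.py.

```python
import os
from typing import Mapping

# Hypothetical default; the value actually used in the PR is not known here.
DEFAULT_CONCURRENT_DIALOGS = 5


def load_concurrent_dialogs(env: Mapping[str, str] = os.environ) -> int:
    """Read the (hypothetical) CONCURRENT_DIALOGS setting, with a fallback."""
    raw = env.get("CONCURRENT_DIALOGS", "")
    try:
        value = int(raw)
    except ValueError:
        # Missing or non-numeric value: use the default instead of crashing.
        return DEFAULT_CONCURRENT_DIALOGS
    # At least one dialog has to be processed at a time.
    return max(1, value)
```

Lowering the value is the obvious knob to turn when the server starts answering with 429.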

Speed optimizations

The initial version of the script, while built on the asynchronous Telethon library, didn't use it to full capacity.
Now that separate dialogs are also processed concurrently, the speed has increased drastically.
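
The general idea of processing several dialogs at once, with an upper bound on how many run in parallel, can be sketched with `asyncio.Semaphore`. This is not the PR's code: `download_dialog` is a hypothetical stand-in that only sleeps, and the `concurrency` argument plays the role of the concurrent dialog count setting.

```python
import asyncio


async def download_dialog(dialog_id: int) -> int:
    """Hypothetical stand-in for downloading one dialog's messages."""
    await asyncio.sleep(0.01)  # simulates waiting on the network
    return dialog_id


async def download_all(dialog_ids: list[int], concurrency: int) -> list[int]:
    """Process dialogs concurrently, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(dialog_id: int) -> int:
        # Only `concurrency` coroutines can be inside this block at once.
        async with semaphore:
            return await download_dialog(dialog_id)

    # gather() returns the results in the same order as the inputs.
    return await asyncio.gather(*(bounded(d) for d in dialog_ids))


if __name__ == "__main__":
    print(asyncio.run(download_all([1, 2, 3, 4, 5], concurrency=2)))
```

A small bound keeps the request rate predictable, while a larger one trades a higher risk of rate limiting for throughput.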

Note: unless specified otherwise, all of the following tests were run locally on my own data, which consists of 328 Telegram dialogs, of which around 5-10 contain >100k messages. The measurements are not precise and should be taken with a grain of salt.

`0_download_dialogs_list.py`:

Initial version: 1m41s
Optimized version: 1m9s

`1_download_dialogs_data.py` (on 5 chats with >100k msgs):

Initial version: 40k msgs/hour
Improved version: 200k msgs/hour
When testing for another user (kind regards to the tester @velosypedno), the speed (after editing the concurrency parameter) reached 340k msgs/hour.

Speed becomes very important as the number of chats grows. To show the difference: the optimized version took around 10–12 hours, whereas the original would have taken a whopping 52 hours!

Important updates

Each of the following changes was made with improvement in mind; all of them are debatable and can be reverted upon request.

In the initial version, the from_id parameter in the second script was saved as the representation of a Python object (e.g., PeerUser(user_id=...)). Now the ID is retrieved from any Telethon peer object (see TypePeer) using the Telethon utility function telethon.utils.get_peer_id, which led to some changes in the IDs. The following is quoted directly from the documentation:
Convert the given peer into its marked ID by default.
This "mark" comes from the "bot api" format, and with it the peer type
can be identified back. User ID is left unmodified, chat ID is negated,
and channel ID is "prefixed" with -100:
- user_id
- -chat_id
- -100channel_id
The original ID and the peer type class can be returned with
a call to :meth:resolve_id(marked_id).
The previous dialog types ("Private dialog", "Group", "Channel" and "" for an unknown type) were replaced with more standardized ones, listed in an Enum (see dict_types): "private", "group", "channel" and "unknown".
Migrated to the poetry package manager (see here) instead of the default pip. It proved to be more user-friendly and better at resolving dependencies.
Switched from the black formatter to ruff (see here). While code styles are highly debatable, ruff is a lot faster and more responsive than black.
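
To make the marked ID format quoted above more concrete, here is a small standalone sketch of the same rule. It deliberately does not import Telethon: `PeerKind`, `mark_id` and `unmark_id` are hypothetical helpers, and the sketch assumes the "-100" prefix is applied as an arithmetic offset of 10**12, which reproduces the quoted format for 10-digit channel IDs. In real code, `telethon.utils.get_peer_id` and `resolve_id` should be used instead.

```python
from enum import Enum

# Assumption: the "-100" prefix is an offset of 10**12, so a 10-digit
# channel ID reads as "-100<channel_id>".
CHANNEL_OFFSET = 10**12


class PeerKind(Enum):  # hypothetical helper, not a Telethon type
    USER = "user"
    CHAT = "chat"
    CHANNEL = "channel"


def mark_id(kind: PeerKind, raw_id: int) -> int:
    """User ID is left unmodified, chat ID is negated, channel ID gets -100."""
    if kind is PeerKind.USER:
        return raw_id
    if kind is PeerKind.CHAT:
        return -raw_id
    return -(CHANNEL_OFFSET + raw_id)


def unmark_id(marked_id: int) -> tuple[PeerKind, int]:
    """Recover the peer kind and the original ID from a marked ID."""
    if marked_id >= 0:
        return PeerKind.USER, marked_id
    if marked_id < -CHANNEL_OFFSET:
        return PeerKind.CHANNEL, -marked_id - CHANNEL_OFFSET
    return PeerKind.CHAT, -marked_id
```

For example, under these assumptions a channel with ID 1234567890 is stored as -1001234567890, while a user with ID 12345 keeps the ID 12345.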

Possible improvements

Add threading support to increase throughput. Sadly, I could neither use nor test it myself, as I received a lot of 429 status codes, but it proved useful for other users.
Further improve code style and separation of responsibilities.
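
Since rate limiting (429) is the main obstacle mentioned above, one common mitigation is to retry a request with exponentially growing delays. The following is a minimal sketch of that technique, not the implementation from this PR: `TooManyRequests` is a hypothetical stand-in for whatever rate-limit error the client raises, and `with_backoff` and its parameters are illustrative names.

```python
import asyncio
import functools
import random


class TooManyRequests(Exception):
    """Hypothetical stand-in for a rate-limit (429) error."""


def with_backoff(max_attempts: int = 5, base_delay: float = 1.0):
    """Retry an async function, doubling the delay after every failure."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return await func(*args, **kwargs)
                except TooManyRequests:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts, let the caller decide
                    # Delays grow as 1x, 2x, 4x, ... of the base delay,
                    # plus a little jitter so retries do not align.
                    delay = base_delay * 2**attempt
                    await asyncio.sleep(delay + random.uniform(0, delay / 10))

        return wrapper

    return decorator
```

Decorating a download coroutine with `@with_backoff(max_attempts=4, base_delay=2.0)` would then retry it up to three times before giving up.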

Summary

All in all, each improvement was made with a possible use case in mind. Feel free to leave your comments and suggest improvements!

SanGreel

general comments:

move tests to one folder
add instructions on how to run tests
add test coverage percentage
run black
add docstrings
add pylint to the project - https://docs.pylint.org/index.html
add commands for running pylint

1_download_dialogs_data.py

telegram_data_downloader/processor/message_downloader.py

telegram_data_downloader/settings.py

telegram_data_downloader/utils.py

Hukyl · 2024-11-23T23:19:34Z

Please consider using ruff instead of pylint and black.

In short, it is a blazingly fast (yeah, it's written in Rust) code linter and formatter. It's 99% compatible with black (doc source), and supports some error codes from pylint (here is a list of all supported). Maybe we could use ruff for linting and formatting, run pylint before committing, and suggest this to other contributors as a contributing guideline.
About docstrings: I suggest adding only basic ones, as I believe code should be mostly self-documenting. I'll therefore try to improve the naming and function return types to achieve that, while providing a basic docstring describing what each method does.
Perhaps keep unit tests in the same folder as the file they are testing? This is a project structuring technique I've adopted from Golang; it lowers the cognitive load when running tests, puts more emphasis on the test file path (useful for pytest) and reduces the naming requirements. That being said, integration and e2e tests IMO should still be placed in the tests/ dir in the project root.
Testing instructions are already present in CONTRIBUTING.md. Would it be better to move them to README.md?
It appears that you cannot fetch all reactions to a message (i.e., with the setting set to None), see here. I still exported the setting to the config file, though.

Waiting for some feedback and further discussion on these suggestions! @SanGreel

Hukyl · 2024-11-25T12:36:12Z

Also, while debugging, I noticed a special case: when a Dialog is of type Group or Channel, Telethon's message.from_id is not empty, but when it is a Private Dialog, from_id is None. The user can still be identified by dialog.id, which is the user's ID.

Therefore, all messages from groups or channels have a user ID in the from_id field, but messages from private chats don't. Again, the user can still be identified by the dialog_id field.

Semantically, does this make sense? Should this be left as is, or modified so that from_id isn't null even for private chats? @SanGreel
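
If the second option were chosen, the fallback could look roughly like the sketch below. `ExportedMessage` and `resolve_sender_id` are hypothetical names, and the sketch assumes that a missing from_id in a private chat always refers to the other participant; outgoing messages would need separate handling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExportedMessage:  # hypothetical shape of an exported message
    dialog_id: int
    from_id: Optional[int]


def resolve_sender_id(message: ExportedMessage, is_private: bool) -> Optional[int]:
    """Return from_id, falling back to dialog_id for private chats."""
    if message.from_id is not None:
        return message.from_id
    # Assumption: in a private chat the dialog ID is the user's ID.
    return message.dialog_id if is_private else None
```

Keeping the fallback in one helper would leave the raw from_id untouched in the export while still giving consumers a non-null sender for private chats.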

Hukyl and others added 14 commits October 22, 2024 23:45

refactor: split project into basic independent entities, use Poetry as dep manager, move to .env config

b942c48

refactor: update first script and remove unused files

412cd94

fix: remove pdb after testing

04eafd0

refactor: update download dialog data script, add concurrency settings, update basic dict types

e976da7

formatting: update README.md, add CONTRIBUTING.md, remove unused formatter instructions

310e692

refactor: update default export paths, normalize logging messages and levels

4664167

refactor: include specific dialog ids import note to README.md

1bc3589

refactor: export concurrent dialog number as a setting

d52d28b

refactor: move message iterator initialization to a separate method, add exponential backoff decorator

94bce1c

bugfix: verify poetry installation env variables

f762ae4

Merge pull request #1 from Hukyl/refactor/restructuring-and-optimization

ede2adf

Code refactoring and speed optimizations

bugfix: file export format, add tests

a28784d

bugfix: actually export messages from channels

cad270e

formatting: add testing guideline to CONTRIBUTING.md

a4202c7

Hukyl changed the title ~~Code refactoring and optimizations~~ Code refactoring and speed optimizations Oct 30, 2024

Hukyl added 2 commits November 3, 2024 18:16

enhance: add message takeout to increase throughput

bb56e55

fix: iter_message() return typing and message to_id in private chats

37db23c

SanGreel requested changes Nov 21, 2024

Hukyl added 2 commits November 24, 2024 03:18

codestyle: add pylint to dev dependencies, reformat files

f50ede0

formatting: add test coverage, update formatters info

a02040a

codestyle: add broadcast channel reactions log message, add docstrings

ea07d3e
