Make the lzreposync loop over channels and synchronize them using a Status column #9551
base: master
Conversation
lzreposync will be a spacewalk-repo-sync replacement written in Python. It uses a src layout and a pyproject.toml. The target Python version is 3.11; compatibility with older Python versions is explicitly not a goal.
Added the remote_path column, which holds the remote path/URL of a given package. This information will help locate the package on the remote repository later and download it.
Added a boolean argument that controls whether we call header.hdr.fullFilelist(). We added this argument to disable the header.hdr.fullFilelist() call only for the lzreposync service.
The inspect.getargspec() method is deprecated in Python 3; it can be replaced by inspect.getfullargspec().
import_signatures is a boolean argument that specifies whether to execute the _import_signatures() method. We added this parameter to disable _import_signatures() for the lzreposync service.
Parse the rpm repository's primary.xml package metadata file using the pulldom XML parser for memory-efficient parsing. Note that some attributes in the returned parsed object are faked and may be filled in elsewhere; the faking is done because those attributes are required by the importer service.
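For illustration, a minimal pulldom sketch of this streaming style (the element names follow the repodata primary.xml format; the helper name is hypothetical, not the PR's actual parser):

```python
from xml.dom import pulldom


def iter_primary_packages(primary_xml):
    """Lazily yield <package> DOM nodes from a primary.xml stream."""
    doc = pulldom.parse(primary_xml)
    for event, node in doc:
        if event == pulldom.START_ELEMENT and node.tagName == "package":
            # Materialize only this one <package> element, keeping memory flat.
            doc.expandNode(node)
            name = node.getElementsByTagName("name")[0].firstChild.data
            yield name, node
```

Only one package element is ever expanded at a time, which is what makes this approach memory-efficient for very large primary.xml files.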
Parse the rpm repository's filelists.xml metadata file using the pulldom XML parser for memory-efficient parsing. The parser reads the given filelists.xml file (normally gzip-compressed) and caches the filelist information of each package in a separate file in the cache directory, using the package's hash as the filename, with no file extension.
Using both primary_parser and filelists_parser, return the full packages' metadata, package by package, using lazy parsing. Note that some attributes are faked because we cannot fetch them yet, and they are required by the package importer later on. However, we can fake them efficiently, using little memory.
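A hedged sketch of how such a lazy merge might look, assuming the filelists parser has already cached one JSON file per package hash as described above (all names are hypothetical):

```python
import json
import os


def iter_full_packages(primary_parser, cache_dir):
    """Yield one complete package dict at a time, merging in the filelist
    cached under the package's hash by the filelists.xml parser."""
    for pkg in primary_parser:  # assumed to yield dicts with a 'checksum' key
        cache_file = os.path.join(cache_dir, pkg["checksum"])
        if os.path.exists(cache_file):
            with open(cache_file, encoding="utf-8") as fh:
                pkg["filelist"] = json.load(fh)
        yield pkg
```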
Parsed the update-info.xml file and imported the parsed patches/updates into the database. We reused pretty much the same code from the old Reposync class.
Import the parsed rpm and debian packages into the database in batches, and associate each package with the corresponding channel.
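A minimal sketch of the batching idea (the batch size and the importer calls are assumptions, not the PR's actual API):

```python
from itertools import islice


def import_in_batches(packages, importer, batch_size=500):
    """Drain a lazy package generator in fixed-size batches so that
    memory use stays bounded regardless of repository size."""
    it = iter(packages)
    while batch := list(islice(it, batch_size)):
        importer.import_packages(batch)  # hypothetical importer call
        importer.associate_with_channel(batch)  # hypothetical channel link
```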
Parsed the debian Packages metadata file in a lazy way, yielding the metadata of each package separately.
Parsed debian's Translation file, which contains the full descriptions of packages grouped by Description-md5, and cached the parsed descriptions in a cache directory.
Using both packages_parser and translation_parser, return the full packages' metadata, package by package, using lazy parsing. Also set the debian repository's information in a DebRepo class. A simplified parsing sketch follows.
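For illustration, a minimal lazy parser for the stanza format of a debian Packages file (simplified: real files also need gzip/xz decompression and continuation-line handling):

```python
def iter_package_stanzas(fh):
    """Lazily yield one {field: value} dict per blank-line-separated stanza."""
    stanza = {}
    for line in fh:
        line = line.rstrip("\n")
        if not line:  # a blank line terminates the current stanza
            if stanza:
                yield stanza
                stanza = {}
        elif not line.startswith((" ", "\t")):
            key, _, value = line.partition(":")
            stanza[key] = value.strip()
    if stanza:  # the last stanza may not be followed by a blank line
        yield stanza
```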
Given the channel label, fetch the repository's important information from the database and store it in a temporary RepoDTO object.
Added the necessary command line arguments. Identify the target repositories, prepare the data structures, and execute the lazy synchronization of repositories/packages.
Added a new dependency, python-gnupg, used to verify repo signatures.
Ignored two linting complaints about raising exceptions, following the approach in the old reposync. We could enhance the code instead of doing this, though.
This commit completes almost all the logic and use cases of the new lazy reposync. **Note** that this commit will be restructured and possibly divided into smaller and more convenient commits. This commit is for review purposes.
Seemingly this error happened because we reached the maximum number of unclosed db connections. We thought this might be because the close() method in the Database class was not implemented, so rhnSQL.closeDB() was not closing any connection. However, we are still hesitant about whether this is the root cause of the problem, because the old (current) reposync was using it without any error.
This is the latest and almost final version of the lzreposync service (gpg sig check not complete). It contains pretty much all the necessary tests, including those for the updates/patches import. Some of the remaining TODOs are either code enhancements or unclear concepts that will be discussed with the team. Of course, this commit will be split into smaller ones later, after rebase.
- Removed some TODOs.
- Replaced some SQL queries with equivalent ones using JOIN...ON.
- Some other minor cleanup.
Optimized some code by replacing classes and methods in some places with free functions. Consolidated the debian repo parsing.
Completed the gpg signature check for rpm repositories, mainly for the repomd.xml file. This is done by downloading the signature file from the remote rpm repository and running gpg verify to check the repomd.xml file against its signature, using the gpg keys already added on the filesystem. So if you haven't already added the required gpg keyring on your system, you won't be able to verify the repo. Ideally, run this version directly on the uyuni-server, because the gpg keyring will probably be present there.
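A hedged sketch of that check with python-gnupg (the file layout follows the usual repodata convention; the function name, keyring location, and exception type are assumptions):

```python
import gnupg


def verify_repomd(repomd_path, signature_path, gnupg_home):
    """Verify repomd.xml against its detached signature (repomd.xml.asc),
    using keys already imported into the given gpg home directory."""
    gpg = gnupg.GPG(gnupghome=gnupg_home)
    with open(signature_path, "rb") as sig:
        verified = gpg.verify_file(sig, repomd_path)  # detached-signature check
    if not verified:
        raise RuntimeError(f"repomd.xml signature check failed: {verified.status}")
```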
makedirs() in uyuni.common.fileutils now accepts relative paths that consist of only a directory name or paths with trailing slashes.
Completed the gpg signature check for debian repositories. If you haven't already added the required gpg keyring on your system, you won't be able to verify the repo and will normally get a GeneralRepoException. Ideally, run this version directly on the uyuni-server, because the gpg keyring will probably be present there.
Mocked the SPACEWALK_GPG_HOMEDIR value to `~/.gnupg/`, which is gpg's default directory, in order to run the gpg tests outside the uyuni-server.
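For illustration, one way such a mock could look in a test (the patched module path is hypothetical):

```python
import os
from unittest import mock


# Point the service at gpg's default keyring so the tests can run
# outside the uyuni-server; the module path below is hypothetical.
@mock.patch("lzreposync.gpg_check.SPACEWALK_GPG_HOMEDIR",
            os.path.expanduser("~/.gnupg/"))
def test_repo_signature_check():
    ...
```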
Made the lzreposync service continuously loop over the existing channels and synchronize the corresponding repositories. Added a status column in the rhnchannel table to indicate the sync status of a given channel. Also added some helper arguments to the service that allow us to perform test operations, like creating a test channel and associating repositories with it, etc. A sketch of the loop follows.
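A minimal sketch of that loop, assuming the status values discussed below and hypothetical DB helpers passed in as callables:

```python
import time

SLEEP_TIME = 2  # seconds between two sync passes (see the discussion below)


def run_forever(fetch_pending_channels, sync_channel, update_status):
    """Continuously pick up channels that need syncing, tracking progress
    through the rhnchannel status column."""
    while True:
        for channel in fetch_pending_channels():
            label = channel["channel_label"]
            update_status(label, "in_progress")
            try:
                sync_channel(label)
                update_status(label, "done")
            except Exception:
                # Never leave a channel stuck in 'in_progress', or future
                # sync passes would skip it.
                update_status(label, "failed")
        time.sleep(SLEEP_TIME)
```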
Commits 0fd16e3 to 1b7cd07
```python
from spacewalk.common.repo import GeneralRepoException
from spacewalk.satellite_tools.repo_plugins.deb_src import DebRepo

SLEEP_TIME = 2  # Num of seconds between one sync and another
```
2 seconds seems very short. I have no idea what time we would need here though.
Yeah, I agree, 2 seconds isn't much for the lzreposync to look for updated channels to sync. But I think we also shouldn't make it too long, so users don't wait too much for the sync. I'm also not sure yet what the best interval would be.
By the way, will the user have the ability to explicitly trigger the sync job for their own newly updated channels? Something like the On-Demand download strategy?
PostgreSQL also has a notification mechanism. I know we are using it in some other parts of our code, but I don't know how that works.
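For context, that mechanism is presumably LISTEN/NOTIFY; a minimal psycopg2 sketch (the connection string and notification channel name are made up):

```python
import select

import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=susemanager")  # connection string is made up
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute("LISTEN channel_changed;")  # notification channel name is made up

while True:
    # Block until the server pushes a notification instead of polling.
    if select.select([conn], [], [], 60) != ([], [], []):
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            print("sync requested for:", notify.payload)
```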
In what context would we be using the notification mechanism?
```python
parser.add_argument(
    "--create-content-source",
    help="Adding a new content source to channel by specifying 'channel_label', 'source_url', 'source_label' and 'type'\n"
    "Eg: --create-content-source test_channel https://download.opensuse.org/update/leap/15.5/oss/ leap15.5 yum",
    dest="source_info",
    type=str,
    nargs=4,
)
```
I don't understand why you want to create a channel from the command line. While looping, you take basically no parameter other than the DB connection and maybe the sleep time, and act on the available channels.
That's true; it's just temporary so I'm able to test the service. We can remove it before merging the PR. And maybe I shouldn't have committed it in the first place.
```python
    )
    _display_failed_repos(ret)

except KeyboardInterrupt:
```
Is this the only interruption we need?
Honestly, I'm still not completely sure about it. I also think we should be careful about the channel's final status after any interruption in general, and not leave the channel in a blocking status like `in_progress`, because in that case it would be ignored by future syncs. It still needs some refinement.
Maybe in those cases the status should be set to `failed`?
That's true... I was just thinking about a rare scenario in which all the repositories of a given channel get completely synced, but the interruption happens before the database status is updated to `done`. However, the probability of that happening is extremely low, so I think setting the status to `failed` should be good enough.
Now, in terms of implementation, I would like to get your opinion on these two possible approaches:
First one
Add another try...except block inside the for loop, to be able to get the label of the channel currently being synced and update its status:

```python
...
for channel in all_channels:
    ch_label = channel["channel_label"]
    try:
        sync_channel_workflow(ch_label)
    except KeyboardInterrupt:
        update_status(ch_label, "failed")
```

Second one
Initialize a variable above the first try block and set it to None: `curr_chnl_label = None`. Then it gets updated inside the for loop with the label of the channel currently being synced:

```python
curr_chnl_label = None
try:
    ...
    for channel in all_channels:
        curr_chnl_label = channel["channel_label"]
        ...
except KeyboardInterrupt:
    update_status(curr_chnl_label, "failed")
```

However, in the latter case we should add a None check.
Both seem to have the same complexity, but I'm not sure which is the better practice.
If we are not sure about the actual status, "failed" looks appropriate to me. Better to run it twice than have bad data.
For the implementation, I don't think it makes a difference. Pick either of them 😄
```diff
@@ -423,6 +423,7 @@ class OracleBackend(Backend):
     "checksum_type_id": DBint(),
     "channel_access": DBstring(10),
     "update_tag": DBstring(128),
+    "status": DBstring(11),
```
It would be more efficient to reduce the status to a single character in case there is a lot of data. @rjmateus any opinion?
PostgreSQL also supports enums (occupying 4 bytes): https://www.postgresql.org/docs/current/datatype-enum.html
We can use enums or just 1 char. Both should be fine.
Should I add the enum type creation script inside rhnPackage.sql, in the schema upgrade script, or in both?
I think it needs to be defined in both places: one for new installations and the other for upgrades (only if it does not exist).
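For illustration, an idempotent sketch of the upgrade-script side (the type and value names are assumptions; the fresh-install schema file would carry the plain CREATE TYPE):

```sql
-- Hypothetical enum for the rhnchannel sync status.
DO $$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'channel_sync_status') THEN
        CREATE TYPE channel_sync_status AS
            ENUM ('pending', 'in_progress', 'done', 'failed');
    END IF;
END
$$;

ALTER TABLE rhnchannel ADD COLUMN IF NOT EXISTS status channel_sync_status;
```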
What does this PR change?
Made the lzreposync service continuously loop over the existing channels and synchronize the corresponding repositories.
Added a `status` column in the rhnchannel table to indicate the sync status of a given channel.
Also added some helper arguments to the service that allow us to perform test operations, like creating a test channel and associating repositories with it, etc.
GUI diff
No difference.
Documentation
No documentation needed: only internal and user-invisible changes
DONE
Test coverage
ℹ️ If a major new functionality is added, it is strongly recommended that tests for the new functionality are added to the Cucumber test suite
No tests: add explanation
No tests: already covered
Unit tests were added
Cucumber tests were added
DONE
Links
Issue(s): #
Port(s): # add downstream PR(s), if any
Changelogs
Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository
If you don't need a changelog check, please mark this checkbox:
If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below).
Re-run a test
If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:
Before you merge
Check How to branch and merge properly!