-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add -o option to only use a specific source for an AS-SET or Route Set #70
Conversation
Add -o option to use a specific source for an AS-SET or Route Set. Add missing close() calls for file descriptors Align return/exit codes Add some missing error messages
…pported server data sources Reorder call to bgpq_get_irrd_sources to ensure we can get list of supported server data sources
Quick question, why not make this feature available via |
I didn't want to change existing behavior in case it breaks some existing automation for some users. |
Great work! I have one question: Shouldn't this behavior be the new default as opposed to make this a non-default optional? The output of I think changing the default behavior into something more secure and more inline what we would actually expect (honoring the SOURCE:: format) would a good thing. I'm actually more afraid what the non-zero exit codes and stderr output does for complex automation in a lot of cases (the example with |
I think there are two different goals here:
In my opinion those are two different goals and while related, I would rather not make one big change and put them behind the same optional flag. I'd make goal 1 default, frankly. However I'd certainly not marry goal 2 with goal 1, because it would hamper the adoption of goal 1. If we do this, people will either:
I mean here are a few networks (using peeringdb's IRR as-set/route-set), some bigger, some smaller, even some content that trigger errors. And I did not search long at all:
With IPv6 it gets a lot worse (which is pretty much expected since not every IPv4 network is running an IPv6 network):
So the amount of errors in real life on an typical Internet exchange, even for as-set's of content networks (I'm not even talking the obvious worse case: large Tier 2 networks) would just be mind boggling to the point where there is no way other then to dump it to /dev/null. At least for now. Do we want to actually output errors to stderr to improve RPSL accuracy? Probably, but that's a real long term goal, while honoring So the discussion should probably be must goal 1 be optional or can we just honor small technical nitpicks:
|
@lukastribus Thanks for the in-depth review! Regarding two goals So yeah I had a similar thought to you, there are kind of two goals in this PR which could be separated. The reason I didn't split them is because when specifying a specific data source for an AS-SET or route set, and that data source doesn't exist, then we need to throw and error and return non-zero exist status. What would be the point in being able to specify a data source for an object if we don't throw an error when the data source doesn't exist? Also, I can foresee quite a bit of discussion. Should we all just jump on a Teams call / Zoom meeting / ? (I'm on central European time) Regarding making -o the default behavior I already mentioned that I didn't want to break existing implementations due to the change in exit status codes. But there is another reason I forgot to mention. If you re-read this comment, bgpq4 normally uses a recursive (We are discussing writing/raising this PR for irrd internally, and then one for bgpq4 to use that new query type, but it would be in a few weeks/or couple of months, and we need a fix sooner, hence this bgpq4 PR). |
I think the point is to get data from the correct Unresolved dependencies on the other hand is a different beast and is not new at all. bgpq3/4 always ignored unresolved dependencies #1, because that is how filters are created (use correct data to permit, and reject everything else). Honoring
What is the operational consequence here? What is the operator supposed to do specifically with the 778 errors from generating v6 filters for Akamai Prolexic, 389 errors for Cloudflare, or over 9000 errors for HE?
I think this error output belongs to verbose mode. Unresolved dependencies are very normal in RPSL, we would basically return non-zero exit codes and stderr all the time. Should the operator ignore the filter output on stdout if bgpq4 returns non-zero? Probably, but then filter creation would not even work for FAANG companies... Now we have different exit codes, based on whether we specify
I just don't see how this would actually work in practice, without ignoring stderr and return code. I'd like to talk about how operators actually use return codes and stderr output in real life scenarios on an Internet Exchanges while peering with different networks. What do we do with tens of thousands of errors on every bgpq4 run? What do we do with basically omni present non-zero return codes?
Thanks for clarifying. I'd predict a very gradual load increase though, considering that new codes paths in bgpq4 and especially |
I had a phone call with James, we agreed to make |
I am traveling this weekend so I will refactor the PR then. Just to confirm:
|
Ok thanks.
Just one request: can you fix the double newline after each error in stderr? Those empty lines in stderr are not pretty.
Thank you James! |
@lukastribus |
Resolving PR review feedback regarding coding style
Resolving PR review feedback regarding coding style
To save re-reading the whole thread, this is the current state of the PR:
It would be trivial at a later date to add support for the "::" notation for ASNs too, so that an AS-SET can have members like "RIPE:AS1122334455". First we need to get the syntax of the IRR DB "members" field updated to support the "::" notation (that's next on my to-do list). |
It seems there is a bug for hierarchical as-set names such as
|
@job hmm, this was working for me, I will test again. |
Fixed now... These are all working for me (please check for your self):
|
I think we are almost there, there seems to be a difference, compare:
with
is this expected or desirable? |
This was discussed further up this thread, but this thread is getting a bit long so I'll recap here for you: The SHA256 hash is confirming that the stdout output is the same (the generated prefix list). I generated a bunch of different prefixes with the original bgpq4 and this PR version and The difference in your example output is that currently bgpq4 doesn't produce any error messages to stderr when there are problems, so I added some error message which should have already been there (in my opinion):
OK, but why are there different error messages when using this PR version of bgpq4 against the same AS-SET: When I run the current/published version In both the existing version of bgpq4 and this PR version of bgpq4, when NOT using the When using the These added error messages statements could be...
What do you prefer? |
Main Feature
This PR adds a new CLI option (
-o
) to specify that only specific data sources should be used for AS-SET and Route-Set expansion. Those data sources are specified using the "::" notation e.g., "RADB::AS-GOOGLE". This is to resolve #5, specifically this implements "Part 1" of the idea mentioned in this comment (a separate PR could follow at a later date for "Part 2").A user may already specify one or more data sources using the
-S
option. In the case that-S
is used, the given data sources are saved as the default data source. If-S
isn't used, the irrd server is queried for it's list of sources and these are saved as the default data source.When using the new
-o
option e.g.,bgpq4 -o RIPE::AS-MCKAY
with no-S
option, RIPE is set to the data source in this example, then a query is sent to irrd to perform a non-recursive expansion of AS-MCKAY, and then for each member of the AS-SET, one of two things happens;This continues down the tree, honouring any sources found along the way using the "::" syntax, and using the default data sources when no specific source is given.
Other stuff
There are quite a few things that could be improve but opening a massive PR which changes lots of things isn't helpful. These are the other things this PR fixes "along the way" in order to get the main feature working nicely:
There are missing
close()
calls for file descriptors used, I have added someclose(fd)
calls.In the following places the program stops what it is doing and returns or exits without any error message:
https://github.com/bgp/bgpq4/blob/main/expander.c#L200
https://github.com/bgp/bgpq4/blob/main/expander.c#L203
https://github.com/bgp/bgpq4/blob/main/expander.c#L734
https://github.com/bgp/bgpq4/blob/main/expander.c#L844
https://github.com/bgp/bgpq4/blob/main/expander.c#L850
I have added error messages to these locations so that they aren't silent errors.
err()
when there is an error or issue, in others it callsexit()
, and in other places it callssx_report(SX_FATAL, ...)
and thenexit(1)
which is confusing becausesx_report(SX_FATAL, ...)
callsexit(-1)
so the extra exit with a different return code is confusing to debug:https://github.com/bgp/bgpq4/blob/main/expander.c#L402
https://github.com/bgp/bgpq4/blob/main/expander.c#L745
https://github.com/bgp/bgpq4/blob/main/expander.c#L811
https://github.com/bgp/bgpq4/blob/main/expander.c#L814
I have removed
exit()
calls after a fatal error.Inside expander.c the following functions return 1 on success and 0 on failure/error:
These two functions return 0 on success and failure:
The are the two worker functions in expander.c, the former for pipeline mode and later for non-pipeline mode. I have changed these last two functions to return 1 on success and 0 on failure/error when there is an error inside these function, so that there is consistent success/error status codes across all functions in expander.c.
At no point where any of the above functions are called, is the return status actually checked. Also the status of the callback functions used by
bgpq_read()
andbgpq_expand_irrd()
isn't checked. Also, the "main" function in expander.c calledint bgpq_expand()
is hard coded to return 1. bgpq4 checks the return status of this function inmain()
, but because the function returns 1 (and for this function 1 means "success"), bgpq4 always exits with exit-status 0 even though there are in fact errors occurring (unless the error is right at the start like parsing the CLI args/opts).I have added checks to the calls used by the new
-o
option in expander.c, so that the return status is passed through each function to the final exit status of bgpq4 (from callback function -> tobgpq_read()
/bgpq_expand_irrd()
-> tobgpq_expand()
-> tomain()
.The combination of points 2, 3 and 4 above is that now when I run
bgpq4 -o RADB::AS-GOOGLE
, bgpq4 generates as much of the prefix list as it can, but due to some missing data in irrd an error is printed and I have a non-zero exit status. So I can bring up a peering session with AS-GOOGLE AND I know I need to tell them they have a problem with their IRR data.Without this, which is how bgpq4 currently operates, users are not told that a specific source wasn't valid for example. This is an example: https://github.com/bgp/bgpq4/blob/main/expander.c#L683
AS-MCKAY is in RIPE, it contains an AS-SET in ARIN though, so prefixes are missing from this prefix set, but the exist status is 0. I know this is kind of by design, but I think it's a bit miss-leading for the user, there should at least be an error to say "one of the members of AS-MCKAY was not found in the the RIPE data".
This is from my PR:
This hasn't changed the default behaviour of bgpq4, but when using the new
-o
option an error message is at least shown to the user and the exit status is non-zero so if they are using bgpq4 in an automated way and logging the output, they can check their logs to at least get a clue that something needs a deeper look.Testing
I've tried to test this as thoroughly as I can on the CLI with all different combinations. I'm happy it's working. We are now loading this binary onto our prefix generation server, which generates prefix lists towards all our peers and customers every night, so we will also do some production testing too.
Example: Generate the prefix-list for a specific peer and specify the source for the peers' AS-SET or Route-Set (prefix-list excluded with grep for brevity). Here we can see there are errors, but the prefix list is generated for everything that is possible (just as before), the difference being that AS-GOOGLE was specifically picked from RADB, but downstream members used the irrd servers default data sources (which is all data sources):
Example: Thrown an error when the AS-SET isn't available from the specified data source:
Example: Specify the data sources to use for all member objects, the AS-SET specified on the CLI which will only use RIPE: