Support Base 64 URL #1658
Comments
What we're considering is no longer base64. It should really be a different API named |
@GrabYourPitchforks not exactly - that's merely a canonical name. From an encoding and decoding perspective they are just different alphabets at the lowest level. You literally just exchange 2 characters in the respective encoding and decoding maps (e.g. 4 numbers). It took me far more time to understand the existing implementation than it did to make the actual change. Mostly, I wanted to get the discussion going. There are several ways this proposed change could be implemented. If we split out the API into a new type called Base64Url, I think that's more than sufficient. That's actually how I implemented it in my solution. After realizing that this is a common scenario, I thought it would be appropriate to formally discuss supporting it for others to leverage. The method signatures between Base64 and Base64Url would be exactly the same or, at least, they certainly can be. If we were to take this approach, then it feels like most of the existing Base64 implementation should be refactored into some other internal type (ex: Base64Core), which can be shared between the two implementations (since they would both be static classes). The internal implementation would then accept the appropriate alphabet, leaving the rest of the encoding and decoding algorithm the same. Thoughts? |
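To make the "exchange 2 characters" point concrete, here is a minimal sketch that compares the two alphabets from RFC 4648. The constant names are illustrative assumptions and are not taken from the System.Memory source.

```csharp
using System;

// Assumption: these strings are the two alphabets defined in RFC 4648.
// The names are hypothetical and do not come from the existing implementation.
const string Standard = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
const string UrlSafe  = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Report every index at which the two alphabets disagree.
for (int i = 0; i < Standard.Length; i++)
{
    if (Standard[i] != UrlSafe[i])
    {
        Console.WriteLine($"index {i}: '{Standard[i]}' vs '{UrlSafe[i]}'");
    }
}
```

Only indices 62 and 63 differ, which is why a shared core that takes the alphabet as input looks plausible, at least for scalar code.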
This depends how padding is handled (especially for base64Url). Furthermore the current implementation is vectorized. The same holds for the option with the user specified alphabet. |
@gfoidl the padding need not be handled directly in the encoding/decoding, which makes things simpler; the RFC also says as much. I agree, however, that most people will expect the padding to be handled automatically, which is why my original suggestion was to move that responsibility outside of the encoding/decoding process. If we want to include that in the Base64Url class, that's fine by me, but that is merely encapsulating some very basic behavior after encoding and before decoding. Code is worth a thousand words. I've thrown up the gist of what I'm talking about. I'm well aware that the implementation uses vectors to map the alphabets. Like you, I merely changed the mapping of the alphabet here and here (on this line and this line). I didn't change anything else. I suspect you changed things to make it work with .NET Standard, which is not one of my goals. In pure terms of implementing the RFC specification, this is functionally correct. I acknowledge that consumers will expect padding to be automatically handled, which is why you can see everything come together in my rendition of Base64Url.cs. I'm not saying that my implementation should necessarily be the winning design. Step one is to agree that these two things are very, very similar and that it makes sense for the core implementation to not only live together, but that the two have a shared implementation. Exactly which part of the API is publicly exposed and what it looks like is wide open for discussion. The specification implies that there could be alternate alphabets. These are the only two I know of. I'm completely onboard with closing the door on open-ended alphabets. There's no need to over-engineer this. I'm content with saying these are the only two alphabets supported and they are internally mapped. If that somehow ever changes in the future, there seems to be a straightforward path as to how that would be achieved. |
In general I agree with you.
👍
I don't know to which change you refer. In the linked project (in my last comment) most changes are for perf, and the vectorized approach -- especially for decoding -- is different for base64 and base64Url. |
Happy to update the title. What should it say now? "Support Base 64 URL"? I didn't dissect your implementation, but it clearly looked different and I saw compiler directives. Maybe there's something I don't understand about the encoding and decoding process, which would negate my initial assumptions. My understanding of the specification and the vectorized implementation is as follows: Encoding: Define the alphabet as a 64 element vector where each element represents the alphabet character. A bunch of optimized techniques and math are then used to calculate which character to use while encoding. Decoding: Define the decoding map as a 256 (technically only 255 is needed) element vector where each element represents the encoded character which maps back to the corresponding encoding alphabet character. A bunch more optimized techniques and math are used to decode the text. The differences between the encoding and decoding is merely using I'll reiterate that padding need not be handled directly in the encoding/decoding process (refer to my example). Unless I've completely missed something else, that means the only difference between the encoding/decoding process is the alphabet-to-vector map. Conceptually, this feels like it can be handled by refactoring to a parameter or some other method. I recognize these are very low-level APIs and performance sensitive. I defer to bigger brains than mine as to what the best approach is, but it seems to me that choosing vector A or B or passing it as a ref parameter is quite cheap. I re-reviewed your implementation and it seems you are accounting for the padding directly in the encoding/decoding process. That is not what I'm suggesting, but maybe that's a better way to go. In that case, we end up with two distinct encoding/decoding implementations for each alphabet, where the Base 64 URL implementation directly accounts for padding too. The underlying algorithms are quite complex and advanced. 
My original thinking was that we don't need different algorithms, just different alphabets where padding can be handled outside of the encoding/decoding process for Base 64 URL (the spec even shows I'm onboard to support two different, optimized algorithms if that's the agreed upon approach, it just seems like it's unnecessary maintenance overhead and complexity. |
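As a rough illustration of the encoding and decoding maps described above, the sketch below derives a 256-entry decoding table from a 64-character alphabet. `BuildDecodingMap` is a hypothetical helper written for this example; the real implementation hard-codes its tables and uses SIMD paths that are not shown here.

```csharp
using System;

// Hypothetical helper: builds a 256-entry decoding table from a 64-character alphabet.
// Entries are -1 for byte values that are not part of the alphabet.
static sbyte[] BuildDecodingMap(string alphabet)
{
    if (alphabet.Length != 64)
    {
        throw new ArgumentException("Alphabet must contain exactly 64 characters.", nameof(alphabet));
    }

    var map = new sbyte[256];
    Array.Fill(map, (sbyte)-1);
    for (int i = 0; i < alphabet.Length; i++)
    {
        // The alphabet character is the index; its 6-bit value is the entry.
        map[alphabet[i]] = (sbyte)i;
    }
    return map;
}

sbyte[] urlMap = BuildDecodingMap("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_");
Console.WriteLine(urlMap['-']); // 62
Console.WriteLine(urlMap['_']); // 63
Console.WriteLine(urlMap['+']); // -1, because '+' is not in the URL-safe alphabet
```

With tables like these, a scalar decoder for either variant only needs to be handed a different map. Whether that holds for the vectorized paths is exactly the open question in this thread.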
Sounds good. We're going down to implementation details...I think it's too early for this. First there should be a consensus about the API design / shape. Maybe we mean different things by "vectorized". When I say "vectorized" I mean SIMD. You mean "array of data" (here a lookup table (LUT)). For padding: I understood what you suggested 😉 But all users of base64Url (in the .NET landscape) use it without padding (e.g. ASP.NET Core), so this should be handled directly. At least with a default param -- so a user could opt in to keep the padding. With pure scalar code and w/o padding in mind, you're absolutely right that the implementations are equivalent, just with different LUTs. Hence it makes sense to share as much code as possible. As said before, it's heavily optimized so I'm afraid it's not that easy without perf-loss or introducing even more complex code by playing JIT-tricks. I'll wait for some info from the API designers here. |
Title updated. Alright ... we are agreed. I guess we enter the holding pattern for stakeholders to decide. I'm personally not married to any code or design, I'd just like to see all of the perf goodness flow from the lowest levels up, which currently isn't happening. 😉 |
Any news on this one? It doesn't seem to be a big deal... A standardized System.Memory.Base64Url will be useful... |
(I opened what's effectively a dupe of this as #66841. Oops!) I'd like to propose an alternative API shape for this: instead of the two options above, just create a new class Just having two separate classes that share code lets people make a simpler decision: do I want to encode in base64 or base64url? And having a separate class might also make it easier to have different defaults for padding between base64 and base64url if options are added to control padding. So, I propose option 3: "make it its own class":
namespace System.Buffers.Text
{
public static class Base64Url
{
public static System.Buffers.OperationStatus DecodeFromUtf8(System.ReadOnlySpan<byte> utf8, System.Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
public static System.Buffers.OperationStatus DecodeFromUtf8InPlace(System.Span<byte> buffer, out int bytesWritten);
public static System.Buffers.OperationStatus EncodeToUtf8(System.ReadOnlySpan<byte> bytes, System.Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
public static System.Buffers.OperationStatus EncodeToUtf8InPlace(System.Span<byte> buffer, int dataLength, out int bytesWritten);
public static int GetMaxDecodedFromUtf8Length(int length);
public static int GetMaxEncodedToUtf8Length(int length);
}
} |
@TravisSpomer the term "alphabet" is the vernacular of the spec RFC 4648. I don't really care what it's called in the code, but some level of symmetry and meaning tends to be self-describing. Where I was really going with in terms of an alphabet is in relation to how the spec calls them out. I'm more than onboard with having a restricted public surface area. My thought process was that there could be the ability to swap out alphabets at a lower level. As the spec calls out, there are more than just 2 possible alphabets. If we're only interested in having these two public approaches, then I don't have any resistance to restricting it that way. There's no sense in over-engineering it. I posted a Gist (above) which is extracted from a working implementation. I didn't post the unit tests for it, but have those handy if we were to move forward with a PR. I believe the only real hold up on this issue is an agreed upon API design. I've contributed my 2¢, but I'm not married to anything. In comparing the implementations between Base64 and Base64Url (as provided thus far), the only real low-level difference is the encoding and decoding maps/table/vectors. Handling padding, be it trim or add, is easily handled at a higher API level. As it relates to this issue, the allocation sizes are exactly the same. Trimming a simple I'm totally onboard with having 2 laser-focused public static classes for I would ❤️ to see this land as a supported API. There's a wide benefit to client and server web stacks, in particular. According to the linked .NET 7 roadmap, this issue is queued on the backlog to potentially land. @jeffhandley, is there something we can do to help drive this? |
Sorry, "What's an 'alphabet' in this context?" and the neighboring questions were supposed to be theoretical questions a person trying to comprehend the API would ask, not a literal question. I know what it means here, because I've read through that spec, but a normal person has not and will not. 🙂 |
Roger that. Apologies, I didn't mean to spec-splain. 😛 |
@brentschmaltz, could you share what that would translate to in terms of your ideal API in the proposal above? |
@stephentoub consider a protocol, such as SHR which is a JWS inside a JWS (the envelope). When we hit that claim, we will remember where it was in the buffer, and pass a Span to the decoder when it is time to process. We don't have to worry about escaping as the value must be base64url encoded or we will fault on the decoding. Something simple like:
|
That would be |
Perfect, love it. |
Could we make it NS 2.0? This is so heavily used in the Azure SDK and we keep duplicating this all over the place. |
This type is needed for .NET Standard 2.0. System.Memory.nupkg is the most obvious destination, but we may have problems with that which require a net new package.
namespace System.Buffers.Text;
public static class Base64Url
{
public static int GetMaxDecodedLength(int base64Length);
public static int GetEncodedLength(int bytesLength);
public static OperationStatus EncodeToUtf8(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
public static int EncodeToUtf8(ReadOnlySpan<byte> source, Span<byte> destination);
public static bool TryEncodeToUtf8(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
public static byte[] EncodeToUtf8(ReadOnlySpan<byte> source);
public static OperationStatus EncodeToChars(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool isFinalBlock = true);
public static int EncodeToChars(ReadOnlySpan<byte> source, Span<char> destination);
public static bool TryEncodeToChars(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
public static char[] EncodeToChars(ReadOnlySpan<byte> source);
public static string EncodeToString(ReadOnlySpan<byte> source);
public static bool TryEncodeToUtf8InPlace(Span<byte> buffer, int dataLength, out int bytesWritten);
public static OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true);
public static int DecodeFromUtf8(ReadOnlySpan<byte> source, Span<byte> destination);
public static bool TryDecodeFromUtf8(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
public static byte[] DecodeFromUtf8(ReadOnlySpan<byte> source);
public static OperationStatus DecodeFromChars(ReadOnlySpan<char> source, Span<byte> destination, out int charsConsumed, out int bytesWritten, bool isFinalBlock = true);
public static int DecodeFromChars(ReadOnlySpan<char> source, Span<byte> destination);
public static bool TryDecodeFromChars(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);
public static byte[] DecodeFromChars(ReadOnlySpan<char> source);
public static int DecodeFromUtf8InPlace(Span<byte> buffer);
public static bool IsValid(ReadOnlySpan<char> base64UrlText);
public static bool IsValid(ReadOnlySpan<char> base64UrlText, out int decodedLength);
public static bool IsValid(ReadOnlySpan<byte> utf8Base64UrlText);
public static bool IsValid(ReadOnlySpan<byte> utf8Base64UrlText, out int decodedLength);
} |
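For readers who want to see the shape above in use, here is a short sketch that relies only on two signatures from the listing (`EncodeToString` and `DecodeFromChars`). It assumes the type is available as `System.Buffers.Text.Base64Url`, either in a runtime that ships it or through the Microsoft.Bcl.Memory package mentioned later in this thread; it will not compile on older targets without that package.

```csharp
using System;
using System.Buffers.Text;

// Two bytes chosen so that standard Base64 would produce '+' and '/'.
byte[] payload = { 0xfb, 0xff };

// Encode with the URL-safe alphabet; no '=' padding is emitted.
string encoded = Base64Url.EncodeToString(payload);
Console.WriteLine(encoded); // -_8

// Decode back to the original bytes.
byte[] decoded = Base64Url.DecodeFromChars(encoded);
Console.WriteLine(BitConverter.ToString(decoded)); // FB-FF
```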
Are we going to follow the established |
Reopening to track the .NET Standard 2.0. port
Sounds like we are |
The new Microsoft.Bcl.Memory package for netstandard 2.0 (and .NET 8) support will release with preview 7. |
Thanks ! |
[Theory]
/* ✅ */ [InlineData("hello!")] // Contains invalid character '!'
/* ✅ */ [InlineData("aGk===")] // Too much padding
/* ✅ */ [InlineData("aGk==x")] // Invalid character 'x' after valid padding
/* ✅ */ [InlineData("jaGVsbG8=")] // Length not a multiple of 4
/* ❌ */ [InlineData("aGk")] // Length not a multiple of 4
public void TestInvalidInputs(string base64Url)
{
ReadOnlySpan<byte> base64UrlBytes = Encoding.UTF8.GetBytes(base64Url);
Assert.False(Base64Url.IsValid(base64UrlBytes));
}
The last one should return false from |
Base64Url allows for padding to be omitted (see the "allow missing padding" option on that site) |
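The length rule behind that answer can be sketched as follows. `HasPlausibleLength` is a hypothetical helper for illustration only; it checks lengths and padding counts and is not how `Base64Url.IsValid` is implemented (it ignores whitespace and does not validate characters at all).

```csharp
using System;

// Hypothetical helper: checks only the length and padding rules, not the characters.
static bool HasPlausibleLength(string text)
{
    string unpadded = text.TrimEnd('=');
    int padding = text.Length - unpadded.Length;

    // More than two trailing '=' characters can never be correct.
    if (padding > 2) return false;

    // When padding is present, the padded length must be a multiple of 4.
    if (padding > 0) return text.Length % 4 == 0;

    // When padding is omitted, a remainder of 1 cannot encode whole bytes.
    return unpadded.Length % 4 != 1;
}

Console.WriteLine(HasPlausibleLength("aGk"));       // True: padding omitted, remainder 3
Console.WriteLine(HasPlausibleLength("aGk="));      // True: padded to a multiple of 4
Console.WriteLine(HasPlausibleLength("aGk==="));    // False: too much padding
Console.WriteLine(HasPlausibleLength("jaGVsbG8=")); // False: padded, but length 9
```

Under this reading, "aGk" is acceptable because three characters carry two whole bytes, so the missing "=" loses no information.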
The netstandard support is now released with the https://www.nuget.org/packages/Microsoft.Bcl.Memory package. @KrzysztofCwalina @am11 please give it a try and leave feedback if you have any. |
Updated by @MihaZupan on 2024-02-27
Proposed API
Original issue
The Base64 implementation in System.Memory provides excellent low-level optimizations for RFC 4648, but it currently uses a fixed alphabet. Minor refactoring would enable additional use cases on top of the existing implementation.
Rationale and Usage
The most obvious use case for this change is to support the encoding variant known as Base 64 URL, also described in RFC 4648 §5.
This excerpt from the RFC describes the difference between the standard Base 64 alphabet and the Base 64 URL alphabet.
I have already been able to verify the encoding and decoding in a copy of the current implementation by merely changing the 2 relevant characters in the alphabet mappings at:
Today, this logic is suboptimally implemented in at least:
Furthermore, the encoding is generic and has use cases outside of ASP.NET (e.g. you shouldn't have to reference ASP.NET to use it).
Proposed API Change
The existing Base64.EncodeToUtf8 and Base64.DecodeFromUtf8 should each add a new method overload that allows one of the following:
1. A new enumeration of allowed alphabets (ex: Base64Alphabet) which internally maps to well-known alphabet spans
2. Allow any custom alphabet to be supplied as ReadOnlySpan<sbyte> that must have an exact length of 64 for encoding and 256 for decoding
   a. One or more new types could be provided with static properties for the well-known alphabet spans
Details
Option 1 requires less validation, but has less flexibility. Option 2 has more flexibility, including scenarios not described here, but requires more validation before using the alphabet.
Although the standard Base 64 and Base 64 URL are technically the same encoding with different alphabets, Base 64 URL typically does not include padding. RFC 4648 §3.2 indicates this is allowed (as it's explicitly stated), but there is no need to make that concession in this API. Including the = character for padding in Base 64 URL encoding is still correct.
Both approaches would have a similar looking API:
Figure 1: Supply an alphabet enumeration
Figure 2: Supply a custom alphabet
The support for trimming off and re-adding padding should be implemented separately. This could be in a separate Base64Url class or, perhaps more appropriately, added as a new type in System.Text.Encodings.Web.
Padding is generally very cheap to deal with compared to other implementation methods. Trimming involves walking the tail end of the span while there are padding characters and then slicing it off. Re-padding fills the end of the span buffer before decoding. Neither operation requires additional allocations.
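The trimming and re-padding steps described above could look roughly like this. `TrimPadding` and `Repad` are hypothetical names for this sketch and are not part of the proposal; both work on an existing buffer and allocate nothing themselves.

```csharp
using System;
using System.Text;

// Hypothetical helper: walk back over trailing '=' bytes and slice them off.
static ReadOnlySpan<byte> TrimPadding(ReadOnlySpan<byte> utf8)
{
    int end = utf8.Length;
    while (end > 0 && utf8[end - 1] == (byte)'=')
    {
        end--;
    }
    return utf8.Slice(0, end);
}

// Hypothetical helper: fill the tail of the buffer with '=' up to the next multiple of 4.
// Returns the padded length; the caller must supply a buffer with enough room.
static int Repad(Span<byte> buffer, int dataLength)
{
    int paddedLength = (dataLength + 3) & ~3;
    if (paddedLength > buffer.Length)
    {
        throw new ArgumentException("Buffer is too small to hold the padding.", nameof(buffer));
    }
    buffer.Slice(dataLength, paddedLength - dataLength).Fill((byte)'=');
    return paddedLength;
}

byte[] buffer = new byte[8];
Encoding.ASCII.GetBytes("aGk=").CopyTo(buffer, 0);

ReadOnlySpan<byte> trimmed = TrimPadding(buffer.AsSpan(0, 4));
Console.WriteLine(trimmed.Length); // 3

int padded = Repad(buffer, trimmed.Length);
Console.WriteLine(Encoding.ASCII.GetString(buffer, 0, padded)); // aGk=
```

This sketch does not reject an unpadded length with a remainder of 1, which can never be valid; a real implementation would need to check for that before re-padding.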
Open Questions