-
Notifications
You must be signed in to change notification settings - Fork 6
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A convention for complex numbers? #369
Comments
Re question 1: I think CF is a good place for this, however... the inertia you will encounter for getting any of these added is probably going to be large. CF is very slowly adopting netcdf4 features, e.g. strings were added just today. When looking though the mailing list, proposals for standard names take the form "imaginary_part_of_fourier_transform_of_air_pressure_wrt_time" (and corresponding "real_part" names). None of these have made it into the actual standard name list. These would be using option 2b above. The netcdf4-python docs use option 2a above as an example (compound types): https://unidata.github.io/netcdf4-python/netCDF4/index.html#section10 This method seems to be the way that the library authors think you should store data like this. Option 2b seemed to be the one favored in a mailing list 2010 discussion, but there was no conclusion. This was about 2 years after netcdf4 was introduced and option 2a was "unavailable" or at least too new. http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/003517.html Option 2c is what the CFRadial folks have proposed with an attribute which indicates that the last dim is two reals representing a complex. https://cf-trac.llnl.gov/trac/ticket/169 |
It's nice to see that somebody has tried each of these approaches before! With a compound dtype, I can use netCDF4-Python to create complex values that can be read by h5py: import netCDF4
import numpy
import h5py
size = 3
datac = numpy.exp(1j*(1.+numpy.linspace(0, numpy.pi, size)))
with netCDF4.Dataset("complex.nc","w") as f:
complex128 = numpy.dtype([("r",numpy.float64), ("i",numpy.float64)])
complex128_t = f.createCompoundType(complex128, "complex128")
x_dim = f.createDimension("x_dim", None)
v = f.createVariable("var", complex128_t, "x_dim")
data = numpy.empty(size,complex128)
data["r"] = datac.real
data["i"] = datac.imag
v[:] = data
with h5py.File('complex.nc') as f:
print(f['var'][...]) But unfortunately it does not seem to be possible to do it the other way around -- netCDF-C can't read the sort of custom dtypes that h5py creates. This could presumably be fixed either in netCDF-C or in h5py. I think my favored solution would be option 2c -- the extra dimensions for real/imag parts feels like the most efficient/compact solution. It's backwards compatible with netCDF3 (still useful for many purposes!) and also with newer format like zarr (though zarr also supports complex values directly). From the perspective of a tool like xarray it avoids the issue of needing to combine attributes from two different variables into a single variable. |
I agree that option 2c is likely the best to maximize compatibility across different version of netCDF, etc. Splitting an imaginary number into two variables is (to me) counterintuitive and counterproductive. An attribute, such as the CF-Radial |
What are the use cases for storing complex numbers in polar form? I guess it allows you to load half the data from disk, if you only need one of the terms? From a simplicity perspective, splitting along the last dimension into real and imaginary components feels the cleanest to me -- these map directly into numeric types. This avoids the issue of specifying multiple units in a single string, which feels a little strange to me. If you need to keep track of different units for the magnitude and phase, then I think multiple variables would be appropriate. |
With the caveat that none of my work involves complex numbers: I ask this because if this is an "encoding" then perhaps the units wouldn't really apply until after you do the "decode" step. The representation of the values "at rest" aren't as connected as the original CF Radial proposal would imply. Similar to how |
This issue is certainly of interest in the CF community and has been discussed before, most recently by CF-Radial, as mentioned. I favour option 2b, because
Whether we should add complex numbers to CF in some way also depends on their being a use-case which is strong enough for CF to adopt, since we don't add things to CF solely because we can imagine the possible need for them. |
As a maintainer of domain agnostic software that interprets a subset of CF conventions, I would like to have a standard way to store arrays of complex values in a single netCDF file, that doesn't require adopting all of CF to be useful. In xarray's data model, we can already represent multi-dimensional labeled arrays of complex values as a single variable, and these are what users want to store to netCDF. Most of these users don't particularly care about CF conventions (e.g., they are physicists not climate scientists), but I'd like them to be able to store "more standard" data to disk, so it has a better chance of being readable by other software. So if we use multiple variables for complex values, I'd like to standards for:
The advantage of approach 2(c) is that it side-steps all of these issues, because it's impossible to use different metadata for the different components. I'm sure there are valid use cases for using different units for real/imaginary components or storing data in polar form. I agree that for these use cases, it makes sense to store data in separate arrays, and I would encourage creating domain specific conventions for how to store particular sets of variables. But these feel like a different use case: data that could be represented as complex values, not data that always is complex valued. |
Another point in favor of 2c, IMO, is that one can readily read in the data from a netCDF file and create a view of this data using a client language's native support for complex values with no copies. |
Ping @mike-dixon and @piyushrpt, who were involved in the discussion on the CF-Trac website. |
Just wanted to chime in. While waiting for resolution on this, the following features have been added to GDAL primarily with SAR geospatial datasets in mind: NETCDF4 HDF5 These implement approach 2a. It allows us to work well with native numpy types in python as well as with complex values support in C++. We have been focusing more on using CF conventions with HDF5 as it allows us to float16 for our applications. |
I just had my attention drawn to this. By way of trying to move this along, a bit of a summary and restatement. The original suggestion from @shoyer was that there were three options: a) compound data types, b) real and imaginary parts in separate arrays, and c) real and imaginary parts in a multi-dimensional array. This was effectively because he had correctly rejected underlying library support (netCDF doesn't do it, effectively because HDF doesn't do it, yet). A couple of quick comments
Jonathan has said we need a use case, and cited cf-radial. Maybe we need to revisit that? But in any case, this feels like a real chicken and egg issue: we don't have a use case for complex numbers because we don't have support for them ... So, revisiting the options,
My take on the discussion so far is that there is agreeement we should do this, and more support for (c) than the other options. We have a CF meeting in a couple of weeks, can we knock this one off there? |
I'm a research software engineer from the plasma science community where we regularly use complex numbers, and I've seen all three conventions in use in production research software. I've also been trying to push some kind of solution forward in the last few months. I strongly support option a, using a compound datatype. This is supported in the netCDF4 API, released in 2008, so should be widely available! Option b, split variables, and option c, a new dimension, are very unlikely to make it into either netcdf or HDF5 at the library level. The h5netcdf implementation, which uses h5py to write netcdf-compliant HDF5 files directly, also uses a compound datatype However, the downside of picking any convention at the library level is that there will still be plenty of existing files (and software) using the others, as well as variations on the chosen format (mostly different names for the real/imaginary components). This has been bothering me so much I've started work on a (proof-of-concept for now) drop-in extension to netCDF with C, C++, Fortran, and Python APIs. The idea is to handle all possible conventions for reading (and eventually appending) so that users and developers don't have to care about which one a particular file is using. It also writes native complex numbers to compound datatypes, checking if the type already exists in the file and creating it if not. The plan is to support reading as many existing representations as possible in order to make it completely painless for an application to switch convention.
|
A compound data type would be OK for CF metadata if it was real and imaginary parts, but not if it was polar, because the two components have different units, and would need different standard names. |
Yes, I had meant to reference the original trac ticket where this part of the problem was originally discussed. |
I am not sure we can get a nirvana here, the past is the past, and given we had no convention, I don't feel obliged to be backwards compatible, existing software will work with existing files, and new software can surely cope. I'm also not sure that we can hope to predict what HDFx will do, and when it will do it. That said, I don't have a preference for (a) or (c) at the moment, I was just trying to reflect what seemed like a preference of (c) thus far. However, that preference may have been biased by the fact that when this ticket was open (a) would have been much harder than it is now. |
I was wondering what compound types actually look like, and found this CDL example, which I've copied below. In particular, it shows a convention for how different units can be applied to different variable components.
|
Although I see the appeal, I think that |
Would it really break existing software? If we are going to introduce compound types for our variable types, it's not obvious to me that folks would be trying to interpret the compound type units (as opposed to the component types) with udunits? |
This is just a summary of current implementation in GDAL to demonstrate how this can be done. Here is the example HDF5 file used to test complex compound datatypes for Float{16,32,64}. The file can be found here: The relevant code for determining data types can be found here: Example netcdf file: https://github.com/OSGeo/gdal/blob/c095015224753fc1f082b50c2d36cf7f1a74a4bd/autotest/gdrivers/data/netcdf/complex.nc This is the relevant code in the netcdf driver, which interprets CF metadata while recognizing the data type correctly: |
It could be a problem for any software that scanned the variables in the file to find one with particular units, for instance, or which was harvesting metadata and expected units attributes to be scalar strings.
|
My 2 cents having worked with a lot of SAR data in both HDF5 / netcdf formats ... Complex numbers are a fundamental data type nowadays in scientific computing. Support for general compound datatypes is challenging and probably a harder challenge to reach consensus on. But, w.r.t complex numbers:
It might also become important to support certain specific datatypes - Float16, N-bit integers etc going forward considering the volume of SAR data that is going to be generated by new missions and increased desire from this user community to use xarray. |
@piyushrpt I'm not sure whether your point (2) above is arguing that the string has to be a single valued units string or not? I've just dived into a rabbit hole about whether or not a complex number is a scalar, suffice to say people can earnestly argue that it is, and that it isn't. However, it certainly appears that if we are going to use complex numbers as scalars, then we can't have multiple units. So I think part of this discussion comes down to deciding whether for us, a complex number is a scalar. Recognising that in practice, communities we care about, may not treat it as one (i.e. they have different units for the different components). Of course, CF does not have to support their conventions, but we should not get in the way of them being able to build on top of CF. I guess my point is we should see what CFradial2 is doing now. |
I fully agree with what you say @bnlawrence and apologies for not clearly stating it. Yes, I personally believe complex numbers are scalars in their original (real, imag) form - readily usable with spectral operations etc. Any other form (amp/ intensity, phase) is a derived form and can already be represented using scalar data types already, as in done in CFRadial. End users have to transform these other forms of complex numbers into (real, imag) for spectral operations anyway. Including support for fundamental |
I too agree that a variable containing a complex number would be OK for CF if the whole variable could be described with CF metadata just as for any other kind of single variable i.e. a single |
We discussed this today at a breakout session of the CF annual conference. Three options were considered:
By way of an interim summary (before plenary discussion), we think that of the three options listed above, only the first (compound data types) and the third (array with trailing 2 dimensions for real and imaginary parts) should be considered. On balance there is probably a very slight preference for the array option as it might, for some languages, simplify handling the data in memory and it doesn't require implementors in "non-python languages" (netcdf4-python supports them directly) to have to do heavy lifting around the compound data types. There is strong support for considering complex numbers to be scalars and allowing only one unit (not least because one can argue that complex numbers themselves do not have units, only the real part should). This of course could have implications for supporting applications which have a different conventions, which leads us to specific actions: we intend to a) further investigate what CFradial2 is doing, and what an evolution of that convention might do if we did have complex numbers, b) investigate why many radar communities are using HDF transient data types directly and avoiding NetCDF altogether, and c) look at other potential use-cases in the CF community, particularly around spectral transform models, time series analysis, and storing refractive indices. Who might use these things, and how urgent is it to them? When we are clear on the implications of the choices (compound data types versus arrays, and scalars with at most one unit versus vectors with multiple units) we think the actual convention itself will be easy to propose and agree, but we will want to do it in conjunction with at least one actual use-case. |
@piyushrpt It might be that you are able to argue cogently against the assumption that arrays will be easier than compound datatypes for implementation (R was raised as a potential issue)? Also, if you have anything useful to say about why some radar communities have not followed the CFradial2 approach and bypassed hte lack of direct complex number support and have used transient types in HDF, that'd be really helpful. The group of us discussing this were aware of the practice, but did not have a detailed understanding. |
@ZedThree Can you please explain and/or point us to more detail about your netcdf implementation of reading what I presume are HDF transient data types? How would one write them if not using h5netcdf? |
The long and short of it is that it should work completely automatically in the next version of netCDF. It's implemented in the C library, so should then be available in all other language wrappers immediately. It will allow reading files created by h5netcdf containing complex or bool variables, although they won't be converted to the language-native types (i.e. you'll get a structure with The technical details: When creating compound types in general in HDF5, it's optional whether or not to commit the datatype to the file(/group). If it's not committed, then the definition of the datatype is stored inside the variable itself and isn't given a separate name -- this is a "transient type". Committing the type is an extra step that's completely optional when using HDF5, which is why I suspect many libraries and applications skip it. NetCDF, on the other hand, always commits the type and gives it a name. When reading a file with HDF5 directly, the library doesn't actually care where a variable's datatype is stored, so querying the type of a variable always works. The type is either native/built-in, stored directly in the variable, or is a reference to a named type stored elsewhere in the file. NetCDF reads all named types in a file immediately on opening it, stores them in some structure, and then when reading or querying a variable, looks up its type in this structure. So previously, when a variable had a transient type, this look-up failed. My fix was basically to just fallback to reading the type from the variable. Separately, I'm working on a tool to make both reading and writing to/from language-native complex numbers trivial, with APIs for C, C++, Fortran, and Python. I've not looked at R, but if this is a blocker, I can probably find an R expert to help work something out. |
Thanks for the summary of the discussion yesterday, @bnlawrence. I was in another Hackathon discussion session but would have really liked to have joined this one too, so it is useful to see your outline. If someone wouldn't mind summarising this bit that wasn't covered, briefly, why was option (2) (which from the thread - particularly this comment - I believe is '[a] single UDUNITS-compatible string attribute on the compound data type variable') considered not a good option by the discussion group? I also had one question in response to something you mentioned, namely:
I am potentially being a devil's advocate here, but to me, something which has concrete given units (here, the real part) and something unitless (the complex part, in your hypothetical argument) do indeed have different units - like comparing 0 to any non-zero number, by analogy. For example, in our >>> from cfunits import Units
>>> u1 = Units("m") # taking metres as an example for the real part's units
>>> u1
<Units: m>
>>> u2 = Units()
>>> u2 # special unit-less object
<Units: >
>>> u1.equals(u2)
False So if only one unit will be allowed for a complex number representation (and I feel like intuitively that seems like a very wise idea to me so would agree with the group yesterday, without having a specific use case in mind), does there need to be a way to specify whether the complex part has the same units as the real part or else is unitless, which I think should be the only two options possible, though for different applications we might want to allow either case rather than enforcing one, so providing configurability somehow (some sort of Boolean flag, maybe)? |
@sadielbartholomew Wrt your first question, I don't think option [2] from that thread was an option in itself, it was an additional comment on top of a recommendation to use compound data types. So it falls into the "treat complex numbers as scalars, and hence, by definition of a scalar, allow them to have only one unit" ... ... and that's really the bottom line wrt to your second question. I think the argument that only the real value has units is a bit of a red herring, and I probably shouldn't have included it. |
@ZedThree You said:
This is presumably for writing to NetCDF or HDF files using compound data types under the hood? |
Would this tool work in conjunction, somehow, with whatever else we're using to create datasets (e.g. netCDF4-python in my case)? Thanks, |
@bnlawrence Yes, exactly. It should also work with the other backing file types that support the netCDF4 format. I could also make it configurable to use a complex dimension when writing. @davidhassell Yep, the Python API is built on top of netCDF4-python, so you can literally do It's currently proof-of-concept, but it does work and you try it out here. Packages coming soon! |
Have you thought about a pull request directly into netCDF4-python? In any case, for other readers coming to this thread, you should look at this link from @ZedThree above; it's a great summary of complex number issues. I think we'd be pretty silly to deviate from the analysis (and "blessed solution") he provides. |
Hi All
FYI - I want to let you know about our decisions in CfRadial.
We are storing spectra, which can be represented as:
power - i.e. a single scalar
power and phase - i.e. 2 scalars
a complex pair, also 2 scalars
For the complex case, we chose to store these as 2 separate variables.
We are now working on a CF convention for radar time series data - the
so-called In-phase and Quadrature (I/Q) pairs.
Complex numbers apply here as well. We are adopting the same approach - 2
separate variables.
We will probably use the ancillary_variables attribute to
indicate the connection between the 2 arrays.
Thanks
Mike Dixon
…On Fri, Oct 6, 2023 at 2:32 AM Bryan Lawrence ***@***.***> wrote:
@davidhassell <https://github.com/davidhassell> Yep, the Python API is
built on top of netCDF4-python, so you can literally do import nc_complex
as netCDF4 and not touch the rest of your code. For other languages, it
will be similar, though sometimes function names may have to be changed.
Have you thought about a pull request directly into netCDF4-python?
In any case, for other readers coming to this thread, you should look at
this link <https://github.com/PlasmaFAIR/nc-complex> from @ZedThree
<https://github.com/ZedThree> above; it's a great summary of complex
number issues. I think we'd be pretty silly to deviate from the analysis
(and "blessed solution") he provides.
—
Reply to this email directly, view it on GitHub
<#369>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABTOD74UJ6UFZD5ZEPVN3BTX5662NAVCNFSM4IVKCMG2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCNZVGAYTSNJTGUYA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Hi Mike,
My initial thought on this is it would be better to use a different attribute, rather than to overload an existing one with a use for which it wasn't intended ... but it'd be good first to make sure that I understand what you have in mind. Could you possibly post a CDL snippet giving an example? Many thanks, |
Hi David
I would welcome the addition of attributes specifically for this purpose.
Mike
…On Thu, Oct 12, 2023 at 7:42 AM David Hassell ***@***.***> wrote:
Hi Mike,
We will probably use the ancillary_variables attribute to indicate the
connection between the 2 arrays.
My initial thought on this is it would be better to use a different
attribute, rather than to overload an existing one with a use for which it
wasn't intended ... but it'd be good first to make sure that I understand
what you have in mind. Could you possibly post a CDL snippet giving an
example?
Many thanks,
David
—
Reply to this email directly, view it on GitHub
<#369>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABTOD73YOIZGSQVUTUCUBYTX67XTFANCNFSM4IVKCMGQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Dear @mike-dixon On 6th October, you wrote
I think that choice will work well in CF. @davidhassell replied on 12th
meaning a new attribute, other than Best wishes Jonathan |
to be clear for me, when you say: """ Is that two complex numbers, or one complex number (which requires 2 values to represent. I think you mean the latter, but want to make sure. Anyway, conceptually, a complex number IS a scalar -- so to the degree possible, it would be really great to treat it as such in CF. If that is not possible to do properly (a complex type). and we need to use two variables, I think they should be defined specifically as a complex number, rather than by overloading an existing concept. Which ai think is what @davidhassell is suggesting. |
I think the latter, too, i.e. a complex number is conceptually a scalar, but in netCDF needs two real numbers to represent it, one of which is implicitly multiplied by
Yes. Although pending @mike-dixon's CDL CF-radial example, I'm not yet sure what the new structure would look like. |
I'd just like to plug my nc-complex library again. The C API is now production ready, and support for it has been integrated into netcdf4-python directly and will be available in their next release. I'm currently working on the C++ and Fortran APIs, and they will be available soon. The library supports using a dimension or a compound type to represent complex numbers (and multiple different conventions), and is completely interoperable with the standard netCDF library. It's also trivial to use: switch the I haven't yet added support for complex numbers as two separate variables, as there are some difficult questions like how to represent the name of the variable, should attributes be merged, and so on. A lot more functions would also need to be wrapped in order to handle accessing two variables as a single one. @mike-dixon Is there some discussion available somewhere as to why CF-Radial has chosen to use two variables? I'd like to understand the benefits of using them over a separate dimension or a compound datatype. |
FYI... HDF Group has just published an implementation RFC for float16 and complex number datatypes in libhdf5. You can provide comments or suggestions at HDFGroup/hdf5#3339. |
I think this is great news! Given that we haven't yet established a CF convention for complex yet, it seems the obvious thing to do is adopt the HDF complex -- assuming it gets into netcdf soon. :-) As for CF-Radial -- I'm still confused about whether they are storing complex numbers, or 2D-values-that-have-a-magnitude-and-direction (now that I wrote that -- 2D vectors), which can be mathematically represented as complex. Which seems to me a different thing. For example, in my line of work, we often need to represent wind speed and direction -- but I've never seen anyone call that a complex number. And if CF, we don't use the speed-direction form, standardizing on the eastward-northward form instead. |
A brief update, as a couple of things have happened recently: netcdf4-python can now read and write complex numbers. There's a new The RFC to add native complex numbers to HDF5 is progressing, and is likely to be in version 1.15. It will also have support for no-op conversions to/from compound datatype representations. This hopefully gives us some more pressure to add native complex numbers in netCDF itself. My nc-complex library is an extension that adds support for complex numbers for C and C++. I need to find some time to add Fortran support too. |
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
This may not come up often for Climate and Forecast use-cases, but many physical scientists are interested in storing arrays of complex numbers. This has come up with some regularity for xarray users, e.g., see pydata/xarray#3297 for the most recent issue.
There does not appear to be a standard for how to store complex numbers in netCDF files, so scientists are currently presented with two poor options:
invalid_netcdf=True
with h5netcdf: https://github.com/shoyer/h5netcdf#invalid-netcdf-filesI would love to resolve this with a standard way to store complex values in netCDF file, so xarray doesn't have to invent its own standard or encourage writing HDF5 files that aren't valid in netCDF.
Some questions:
a. Use compound data types in a single array, like h5py and HDF5.jl (see complex number support JuliaIO/HDF5.jl#558). Advantages: this is an existing standard already in use. Disadvantages: requires the HDF5 data model; netCDF-C lacks support for reading some types compound data types (NetCDF unable to read some HDF5 enums Unidata/netcdf-c#267), including the convention used by h5py.
b. Store real and imaginary parts in separate arrays, with some sort of metadata convention for indicating that they are the same logical array. Advantages: easy to interpret merely by inspection; easy to access real or imaginary parts separately. Disadvantages: a new convention; reading data into complex dtypes in memory will likely require an additional copy to combine real/imaginary parts.
c. Store real and imaginary parts in a single array with an extra dimension of length 2 at the end for real and imaginary parts. Advantages: still backwards compatible with old netCDF formats; can be memory mapped without a copy. Disadvantages: slightly less self-explanatory (user needs to understand the dimension mapping); adds an extra dangling dimension of length two in the dataset.
The text was updated successfully, but these errors were encountered: