Fix Parkes/Tempo2 parsing confusion #1320
Addresses #1319 |
This fix is pretty minimal, but I don't know enough about the TOA formats to know if it will interfere with something else. If it doesn't sound problematic I can add appropriate tests. |
Codecov Report
@@ Coverage Diff @@
## master #1320 +/- ##
=======================================
Coverage 61.64% 61.64%
=======================================
Files 89 89
Lines 20174 20174
Branches 3614 3614
=======================================
Hits 12436 12436
Misses 6961 6961
Partials 777 777
|
In the discussion on #1231 @scottransom was pretty adamant that just because you saw a FORMAT 1 you couldn't assume that the TOAs in the rest of the file were all TEMPO2. I think that should be exactly what that line means, but I think we need the opposing opinion. |
Yes, I understand that. But if this file is parsed properly by tempo2 then there should be a way to get PINT to do it. Even if you have to specify the format manually in |
@aarchiba No, that wasn't what my point was. The |
That last task is what I was trying to do here, but it's complicated. There is a mix in terminology between format (tempo2, Parkes, Princeton) and the Perhaps we should discuss this at an upcoming call. |
Is it possible we could obtain some genuine examples of mixed-format TOA files to use as test cases? I haven't seen an example where FORMAT 1 is followed by non-tempo2-format TOAs, but I know my experience with pulsar timing is a little specialized. If we had a clear specification of what files were supposed to look like, we could write some hypothesis code to really hammer it with potentially ambiguous cases, but if our only specification is "whatever the current parser accepts" then it is impossible for it to have a bug and testing is pointless. It certainly seems like a bug to me if some valid tempo2-format files that start with FORMAT 1 and are followed by valid tempo2-format TOAs cannot be correctly read by PINT, but that appears to be the current state of affairs. |
We can easily make test files with multiple TOA types (and we could do it by simply combining some of our current .tim files!). And I agree that this is a bug and should be fixed, if possible. The mixed files that I've seen are archival non-tempo2 format files (using Parkes or Princeton formats) where people have simply cut and pasted tempo2-style TOAs on at the end, because those more recent TOAs are what they were sent from other folks. So there is a "FORMAT 1" (possibly, but maybe not always) in the middle of a file. However, I might have seen the opposite behavior as well, where someone has a "FORMAT 1" file with tempo2 TOAs and then copies and pastes Princeton/Parkes format TOAs into it. According to the intentions of the tempo2 developers, I think that both of these cases would "officially" be user error. But most of the time it is easy to make simple guesses as to what types of TOAs are actually there and parse them correctly. So I do think we should try to do that if it's not too difficult. |
This is why I specified "genuine": what combinations are actually occurring in the wild? It is certain that there will be valid tempo2 TOAs that also parse as other formats, with varying degrees of sensibleness. The more we can narrow the plausible situations where we permit weird combinations, the more reliably we can parse tempo2 TOAs. Which, as far as I can tell, are the Right Answer going forward. As David and I both independently discovered, perfectly reasonable tempo2 files can fail to be parsed correctly because "but sometimes..." There is a danger to flexible file parsing: the two clock file formats we parse are very poorly specified, and as a result you can read a file in as the wrong format and receive no error messages, getting clock corrections that are simply absent, vastly too large, or vastly too small. The first two cases we have a way to detect, but clock corrections that are inadvertently zero are very easy to fail to notice. Likewise, I would much rather be informed that my .tim files are malformed than have TOAs with typos simply dropped or misinterpreted. Here are a few possible situations to improve the user experience:
All of these are sort of hypothetical benefits unless we know what sorts of files our users want to read. My own experience has been generating my own TOAs in tempo2 format, or using such, usually with lots of flags so other formats are hopeless. |
My proposed fix was very minor and only handled a small range of cases. But I agree in principle with @aarchiba. In particular, being able to say "I know this is the format, follow it strictly" would be good, but I think it would require some rewrites of how the parsing is done (nomenclature for format vs. other syntax), and the rest of it would need even more. Where are the best references to the different formats? Are there, e.g., particular types of commands allowed in one but not others? It might be best for me to close this PR (again, it only fixes a particular case) and move this discussion back to the issue? |
To my knowledge, this is the best reference on TOA formats: http://tempo.sourceforge.net/ref_man_sections/toa.txt I have a ping out to Ingrid and David N about old-school flags used with Parkes/Princeton format TOA files. I think people used "-i" flags as info flags sometimes. What I don't know is if they used other things. I personally don't think we need to get too complicated with this. I'd recommend giving a Warning if we see Tempo2 TOAs added into a file with other format TOAs. And we can check to see, for instance, if the decimal point requirements for Princeton/Parkes/ITOA apply to a string or a number. And maybe any TOA line that has any flags (except "-i"?) gets automatically treated as T2 if it parses correctly. I don't think I like the idea of super strict parsing by default, at least. There are files in the wild that would break that, and I think we can do a really good job with only a little bit of extra effort (and checking of values). |
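A minimal sketch of the heuristic suggested above, assuming Tempo2 TOA lines have the whitespace-separated layout `name freq MJD error site` followed by `-flag value` pairs. The helper names (`parse_tempo2_fields`, `has_tempo2_flags`, `warn_if_mixed`) are hypothetical and are not part of PINT's API:

```python
import warnings


def parse_tempo2_fields(line):
    """Split a line as 'name freq MJD error site [-flag value ...]'.

    Returns a dict of fields, or None if the line does not fit that layout.
    """
    tokens = line.split()
    if len(tokens) < 5:
        return None
    name, freq, mjd, err, site, *rest = tokens
    try:
        float(freq), float(mjd), float(err)
    except ValueError:
        return None
    if len(rest) % 2:  # flags must come in '-key value' pairs
        return None
    flags = {}
    for key, value in zip(rest[::2], rest[1::2]):
        if len(key) < 2 or not key.startswith("-"):
            return None
        flags[key[1:]] = value
    return {"name": name, "freq": freq, "mjd": mjd, "error": err,
            "site": site, "flags": flags}


def has_tempo2_flags(line, ignored=("i",)):
    """True if the line parses as Tempo2 and carries a flag other than -i."""
    fields = parse_tempo2_fields(line)
    return fields is not None and any(k not in ignored for k in fields["flags"])


def warn_if_mixed(toa_lines):
    """Warn when flagged Tempo2 TOAs share a file with other TOA lines."""
    flagged = [has_tempo2_flags(line) for line in toa_lines]
    mixed = any(flagged) and not all(flagged)
    if mixed:
        warnings.warn("Tempo2-style TOAs are mixed with other TOA formats")
    return mixed
```

Note that in this sketch a Tempo2 line with no flags (or only `-i`) is not claimed as Tempo2; it would still have to go through whatever column-based checks the parser applies.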
That reference actually gives a clear and simple way to distinguish between the four formats:
We certainly don't implement these rules. They don't leave room for comments, for one thing, or JUMPs and the like. But it is an understandable starting point, and as Scott says it is the best reference available. Looking at the TEMPO2 source code, it looks like it treats files as containing either "FORMAT" and tempo2-format TOAs or a mix of the other three types: https://bitbucket.org/psrsoft/tempo2/src/9f4f29abe564a3f907f8b97cef79385011391a23/readTimfile.C#lines-343 Looking at the TEMPO source code, it seems to assume everything after a FORMAT 1 is tempo2 format: https://sourceforge.net/p/tempo/tempo/ci/master/tree/src/arrtim.f#l217 (also it doesn't follow the rules above from its own documentation because they mention exceptions) |
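The TEMPO behaviour described above (everything after a FORMAT 1 is treated as tempo2 format) can be sketched as a small state machine. This is an illustrative sketch, not code taken from TEMPO or PINT; `label_lines` and the labels it yields are hypothetical, the sample lines are made up, and comments, JUMPs, and other commands are deliberately left out:

```python
def label_lines(lines):
    """Yield (label, line) pairs, switching to Tempo2 after a FORMAT 1 line."""
    in_tempo2 = False
    for raw in lines:
        line = raw.rstrip("\n")
        tokens = line.split()
        if not tokens:
            yield "Blank", line
        elif [t.upper() for t in tokens] == ["FORMAT", "1"]:
            in_tempo2 = True
            yield "Command", line
        elif in_tempo2:
            # Declared format wins: no column-based guessing after FORMAT 1.
            yield "Tempo2", line
        else:
            # Princeton, Parkes, or ITOA: left to column-based rules.
            yield "Legacy", line


sample = [
    " psr1 1400.000 50000.1234567890123 0.00 1.50 7",
    "FORMAT 1",
    " obs.fits 1400.0 56789.123 1.5 gbt -fe L-wide",
]
print([label for label, _ in label_lines(sample)])
# ['Legacy', 'Command', 'Tempo2']
```

This handles the archival case mentioned earlier in the thread (older-format TOAs with tempo2-style TOAs pasted on after a FORMAT 1), but it would mislabel Princeton or Parkes TOAs pasted in after a FORMAT 1.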
If TOA lines are supposed to be Tempo2 format but start with a space and have a "." in spot 41, then they may accidentally be parsed as Parkes format.
This tries to fix that.
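To make the collision concrete, here is a minimal sketch. `looks_like_parkes` is a hypothetical helper that encodes only the check described above (leading space, "." at zero-based index 41); it is not PINT's actual parser, and the TOA line is made up:

```python
def looks_like_parkes(line):
    """Hypothetical check: leading space and '.' at zero-based index 41."""
    return line.startswith(" ") and len(line) > 41 and line[41] == "."


# Made-up Tempo2-style TOA (name freq MJD error site flags), indented one space.
# With a 25-character name, the MJD's decimal point lands on index 41.
name = "J1234+5678_56789_0001.fit"
tempo2_line = f" {name} 1400.000 56789.123456789 1.50 gbt -fe Rcvr1_2"

print(tempo2_line[36:51])                       # 56789.123456789 (the MJD field)
print(looks_like_parkes(tempo2_line))           # True, although the line is Tempo2
print(looks_like_parkes(tempo2_line.lstrip()))  # False once the indent is gone
```

The same line flips between the two interpretations depending only on its leading whitespace and the length of the file name, which is why a purely column-based test is fragile for Tempo2-format TOAs.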