Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Parkes/Tempo2 parsing confusion #1320

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dlakaplan
Copy link
Contributor

If TOA lines are supposed to be Tempo2 format but start with a space and have a "." in spot 41, then they may accidentally be parsed as Parkes format.

This tries to fix that.

@dlakaplan
Copy link
Contributor Author

Addresses #1319

@dlakaplan
Copy link
Contributor Author

This fix is pretty minimal, but I don't know enough about the TOA formats to know if it will interfere with something else. If it doesn't sound problematic I can add appropriate tests.

@codecov
Copy link

codecov bot commented Jun 30, 2022

Codecov Report

Merging #1320 (c225f3d) into master (f21a68a) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1320   +/-   ##
=======================================
  Coverage   61.64%   61.64%           
=======================================
  Files          89       89           
  Lines       20174    20174           
  Branches     3614     3614           
=======================================
  Hits        12436    12436           
  Misses       6961     6961           
  Partials      777      777           
Impacted Files Coverage Δ
src/pint/toa.py 80.77% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f21a68a...c225f3d. Read the comment docs.

@aarchiba
Copy link
Contributor

aarchiba commented Jul 1, 2022

In the discussion on #1231 @scottransom was pretty adamant that just because you saw a FORMAT 1 you couldn't assume that the TOAs in the rest of the file were all TEMPO2. I think that should be exactly what that line means, but I think we need the opposing opinion.

@dlakaplan
Copy link
Contributor Author

Yes, I understand that. But if this file is parsed properly by tempo2 then there should be a way to get PINT to do it. Even if you have to specify the format manually in get_TOAs() or similar.

@scottransom
Copy link
Member

@aarchiba No, that wasn't what my point was. The FORMAT 1 at the beginning of a file is supposed to mean that that file contains only tempo2-style TOAs. That is completely correct. The problem is that there are many .tim files floating around that have mixed format TOAs in them! And FORMAT 1 lines scattered in and out of the files. So people have started abusing this "standard". My point was that we shouldn't assume that just because FORMAT 1 is set, that all TOAs will be tempo2, unless we want PINT to abort while reading (and I think that both TEMPO and Tempo2 do read those files correctly, although I should double-check that). If this is all the case, then any tweaks that make our identification of TOA formats on a line-by-line basis better, should be a good thing. (IMO)

@dlakaplan
Copy link
Contributor Author

That last task is what I was trying to do here, but it's complicated. There is a mix in terminology between format (tempo2, Parkes, Princeton) and the fmt variable (which also includes values like comment and command). One solution would be to impose a TOA format on the whole file but the machinery to do that isn't there. Another would be to more intelligently fall back between different formats that are ambiguous, but that also needs a bit of an architecture change (since the format identification & toa parsing are handled in the same place). I could solve either of those, but both would be some interface changed.

Perhaps we should discuss this at an upcoming call.

@aarchiba
Copy link
Contributor

aarchiba commented Jul 4, 2022

Is it possible we could obtain some genuine examples of mixed-format TOA files to use as test cases? I haven't seen an example where FORMAT 1 is followed by non-tempo2-format TOAs, but I know my experience with pulsar timing is a little specialized.

If we had a clear specification of what files were supposed to look like, we could write some hypothesis code to really hammer it with potentially ambiguous cases, but if our only specification is "whatever the current parser accepts" then it is impossible for it to have a bug and testing is pointless.

It certainly seems like a bug to me if some valid tempo2-format files that start with FORMAT 1 and are followed by valid tempo2-format TOAs cannot be correctly read by PINT, but that appears to be the current state of affairs.

@scottransom
Copy link
Member

We can easily make test files with multiple TOA types (and we could do it by simply combining some of our current .tim files!). And I agree that this is a bug and should be fixed, if possible.

The mixed files that I've seen are archival non-tempo2 format files (using Parkes or Princeton formats) where people have simply cut and pasted tempo2-style TOAs on at the end, because those more recent TOAs are what they were sent from other folks. So there is a "FORMAT 1" (possibly, but maybe not always) in the middle of a file. However, I might have seen the opposite behavior as well, where someone has a "FORMAT 1" file with tempo2 TOAs and then copies and pastes Princeton/Parkes format TOAs into them.

According to the intentions of the tempo2 developers, I think that both of these cases would "officially' be user error. But most of the time it is easy to make simple guesses as what types of TOAs are actually there and parse them correctly. So I do think we should try and do that if not too difficult.

@aarchiba
Copy link
Contributor

aarchiba commented Jul 5, 2022

We can easily make test files with multiple TOA types (and we could do it by simply combining some of our current .tim files!). And I agree that this is a bug and should be fixed, if possible.

The mixed files that I've seen are archival non-tempo2 format files (using Parkes or Princeton formats) where people have simply cut and pasted tempo2-style TOAs on at the end, because those more recent TOAs are what they were sent from other folks. So there is a "FORMAT 1" (possibly, but maybe not always) in the middle of a file. However, I might have seen the opposite behavior as well, where someone has a "FORMAT 1" file with tempo2 TOAs and then copies and pastes Princeton/Parkes format TOAs into them.

According to the intentions of the tempo2 developers, I think that both of these cases would "officially' be user error. But most of the time it is easy to make simple guesses as what types of TOAs are actually there and parse them correctly. So I do think we should try and do that if not too difficult.

This is why I specified "genuine": what combinations are actually occurring in the wild? It is certain that there will be valid tempo2 TOAs that also parse as other formats, with varying degrees of sensibleness. The more we can narrow the plausible situations where we permit weird combinations, the more reliably we can parse tempo2 TOAs. Which, as far as I can tell, are the Right Answer going forward. As David and I both independently discovered, perfectly reasonable tempo2 files can fail to be parsed correctly because "but sometimes..."

There is a danger to flexible file parsing: the two clock file formats we parse are very poorly specified, and as a result you can read a file in as the wrong format and receive no error messages, getting clock corrections that are simply absent, vastly too large, or vastly too small. The first two cases we have a way to detect, but clock corrections that are inadvertently zero are very easy to fail to notice. Likewise, I would much rather be informed that my .tim files are malformed than have TOAs with typos simply dropped or misinterpreted.

Here are a few possible situations to improve the user experience:

  • TOA-reading routines could allow users to specify the format, in which case the parsing could be strict, or leave it at "take a guess" in which case it would do its best to cope with random mishmashes. (PINT allows easy conversion, so if you can read a random mishmash and verify that it worked you can write it out as tempo2.)
  • TOA-reading routines could assume FORMAT 1 means "everything after this is tempo2"; one could even allow "FORMAT 0" or something to indicate the end of tempo2 format.
  • Mishmashes could have their values sanity-checked, for example a frequency of 57123.456 MHz is probably an MJD in the wrong field so you probably have the wrong format. We could also have a clearly defined order of preference, so that a valid tempo2 TOA wasn't allowed to look like a Princeton TOA. We'd then have to make sure our output routines never produced ambiguous TOAs.
  • Comments could be restricted to starting with "C" (we'd have to make sure this didn't clash with tempo2 format "name" fields starting with C) or "#" and then unrecognized lines could produce exceptions rather than being treated as comments.

All of these are sort of hypothetical benefits unless we know what sorts of files our users want to read. My own experience has been generating my own TOAs in tempo2 format, or using such, usually with lots of flags so other formats are hopeless.

@dlakaplan
Copy link
Contributor Author

My proposed fix was very minor and only handled a small range of cases. But I agree in principle with @aarchiba . In particular, being able to say that "I know this is the format, follow it strictly" would be good but I think it would require some rewrites of how the parsing is done (nomenclature for format vs. other syntax), and the rest of it but need even more. Where are the best references to the different formats? are there e.g., particular types of commands allowed in one but not others?

It might be best for me to close this PR (again, it only fixes a particular case) and move this discussion back to the issue?

@scottransom
Copy link
Member

To my knowledge, this is the best reference on TOA formats: http://tempo.sourceforge.net/ref_man_sections/toa.txt

I have a ping out to Ingrid and David N about old-school flags used with Parkes/Princeton format TOA files. I think people used "-i" flags as info flags sometimes. What I don't know is if they used other things.

I personally don't think we need to get too complicated with this. I'd recommend giving a Warning if we see Tempo2 TOAs added into a file with other format TOAs. And we can check to see, for instance, if the decimal point requirements for Princeton/Parkes/ITOA apply to a string or a number. And maybe any TOA line that has any flags (except "-i"?) gets automatically treated as T2 if it parses correctly.

I don't think I like the idea of super strict parsing by default, at least. There are files in the wild that would break that, and I think we can do a really good job with only a little bit of extra effort (and checking of values).

@aarchiba
Copy link
Contributor

aarchiba commented Jul 6, 2022

That reference actually gives a clear and simple way to distinguish between the four formats:

  • tempo2-format TOA files must start with FORMAT 1 (presumably this implies everything else in the file is tempo2-format, though it doesn't quite say that)
  • Otherwise, if columns 1 and 2 are not blank, it is ITOA format
  • Otherwise, if column 1 is blank, it's Parkes format
  • Otherwise, if column 2 is blank, it's Princeton format

We certainly don't implement these rules. They don't leave room for comments, for one thing, or JUMPs and the like. But it is an understandable starting point, and as Scott says it is the best reference available.

Looking at the TEMPO2 source code, it looks like it treats files as containing either "FORMAT" and tempo2-format TOAs or a mix of the other three types: https://bitbucket.org/psrsoft/tempo2/src/9f4f29abe564a3f907f8b97cef79385011391a23/readTimfile.C#lines-343

Looking at the TEMPO source code, it seems to assume everything after a FORMAT 1 is tempo2 format: https://sourceforge.net/p/tempo/tempo/ci/master/tree/src/arrtim.f#l217 (also it doesn't follow the rules above from its own documentation because they mention exceptions)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants