Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the source of the dt=0 error when using adaptive time step #2103

Open
wants to merge 1 commit into
base: release-v4.6.1
Choose a base branch
from

Conversation

JeroBnd
Copy link
Contributor

@JeroBnd JeroBnd commented Sep 11, 2024

Fixed the source of the error dt=0 in BC and out time steps adjust.

TYPE: bug fix

KEYWORDS: time, step, adaptative

SOURCE: Jeronimo Bande (IDING SAS, Córdoba, Argentina)

DESCRIPTION OF CHANGES:
Problem:
When adjusting the time step toward BC and output times, occasionally time step is erroneously set to 0 causing model to stop.

Solution:
The algorithm to determine the time step near BC and output times is modified to correctly set the time step.

ISSUE: Fixes #1560

LIST OF MODIFIED FILES: /dyn_em/adapt_timestep_em.F

TESTS CONDUCTED:

  1. Do mods fix problem? Yes.
  2. The Jenkins tests are all passing.

RELEASE NOTE: Corrected adaptive time step being occasionally set to 0 (which causes the model to stop) on BC and output times.

…pted to mitigate this effect.

Fixed the source of the error dt=0 in BC and out time steps adjust.
The division by 2 of  tmpTimeInterval caused the time to be out of precision for odd values of time_to_bc and time_to_output.
@weiwangncar
Copy link
Collaborator

The regression test results:

Test Type              | Expected  | Received |  Failed
= = = = = = = = = = = = = = = = = = = = = = = =  = = = =
Number of Tests        : 23           24
Number of Builds       : 60           57
Number of Simulations  : 158           150        0
Number of Comparisons  : 95           86        0

Failed Simulations are: 
None
Which comparisons are not bit-for-bit: 
None

@dudhia
Copy link
Collaborator

dudhia commented Sep 11, 2024

To help us review this, can you add more explanation of the fix in the description section?

@weiwangncar weiwangncar changed the base branch from master to release-v4.6.1 September 11, 2024 18:57
@JeroBnd
Copy link
Contributor Author

JeroBnd commented Sep 11, 2024

Hello.

The adaptive time step module operates with a precision of 1/100 seconds, resulting in simulation times with the same precision.

The process that determines the dtInterval involves several steps, checking various conditions. One of these conditions is the precision of 1/100 seconds.

When adjusting the time step to boundary conditions (BC) and output times, the algorithm divides a temporary time interval (with 1/100 sec precision) by two and assigns it to dtInterval. When this temporary time interval has an odd value, the precision changes to 1/200 sec, and the next simulation time also has a precision of 1/200 sec.

In the next step, adjacent to the BC or output time, the algorithm truncates the 1/200 sec precision, setting the simulation time to 1/200 sec before the BC or output time (without the mitigation done in #1398).

The following dtInterval, which is 1/200 sec, is then truncated to a precision of 1/100 sec, resulting in a dtInterval of 0.

This effect is mitigated in #1398 but does not address the underlying source of the problem.

@weiwangncar
Copy link
Collaborator

@JeroBnd Can you expand IDING SAS? Are you working with Kugler who posted the issue?

@JeroBnd
Copy link
Contributor Author

JeroBnd commented Sep 11, 2024

I am not working with Kugler. I am from Córdoba, Argentina. IDING SAS is a startup. We provide services to APRHI (Provincial Administration of Water Resources) for reservoir management. In this context, we are operationally running a high-resolution weather forecast ensemble with WRF.

I found this bug while trying to debug an error caused by myself in the namelist.input that did not throw a warning.

There are several things to do in the adapt_timestep module...

@weiwangncar
Copy link
Collaborator

@JeroBnd Thanks for the info. I tested one of the cases Kugler had problem with, the em_b_wave case, your change didn't help. Did you test that case, using his namelist.input file?

@JeroBnd
Copy link
Contributor Author

JeroBnd commented Sep 12, 2024

@weiwangncar, the adapt_timestep module is not prepared to handle this kind of idealized case. The algorithm uses the remaining time until the boundary condition is applied, with a counter that resets when the boundary condition is used.

In this idealized case, the counter starts with a value of 10800 (equivalent to 3 hours), but when the supposed boundary condition should occur, it doesn’t, and the counter continues decreasing into negative values, causing the simulation to terminate prematurely.

I wrote a small patch to fix this, but the module should be reconsidered. It uses several variables from the namelist.input without properly checking them.

@JeroBnd JeroBnd changed the title Fixed the source of the dt=0 error when using adaptive time step Fix the source of the dt=0 error when using adaptive time step Sep 12, 2024
@weiwangncar
Copy link
Collaborator

@JeroBnd Did you encounter a problem associated with dt=0? If so, what version of the code is it?

@JeroBnd
Copy link
Contributor Author

JeroBnd commented Sep 13, 2024

@weiwangncar I am doing one-way nesting using Ndown and adaptive time step. The boundary conditions (BC) are taken from domain 1 to domain 2 every 30 minutes (1800 seconds).

However, when running WRF for domain 2 with a mistake in the namelist.input file, specifically with the variable interval_seconds set to 3600 seconds, one random run out of the 24-member ensemble usually fails with a CFL error and a segmentation fault or stops at NOAH MP.

For this to happen, two conditions must occur simultaneously:

1 - The time interval between the current simulation time and the next BC time is an odd value (resulting a running time precision of 1/200 sec).

2 - The BC time does not match the BC time derived from interval_seconds, which in my case should occur at times ending in 30 minutes, such as 1:30.

When both of these conditions occur, the mitigation algorithm for dt=0 (#1398) sets the dtInterval to match the BC time derived from interval_seconds, generating a time interval of 30 minutes, which crashes the run.

@weiwangncar
Copy link
Collaborator

@brianreen We wonder if you could help review this PR? Thanks.

@brianreen
Copy link
Contributor

Yes, I will help review this PR.

@brianreen
Copy link
Contributor

@JeroBnd

Does this issue only occur when interval_seconds is set incorrectly?

Since I did not think #154 changed code that sets dtInterval, could you clarify what lines of code you are referring to when you say "the mitigation algorithm for dt=0 (#154)"?

@weiwangncar
Copy link
Collaborator

@JeroBnd Can you address Brian's comment?

@JeroBnd
Copy link
Contributor Author

JeroBnd commented Oct 8, 2024

@brianreen, @weiwangncar
Yes, when you use an incorrect interval_seconds, the round-off error derived from dtInterval = tmpTimeInterval / 2 becomes apparent.
The code from lines 340 to 354 uses a tricks to handle this problem, using the interval_seconds variable.
The proposed code fixes the round-off issue, and the additional ticks are no longer needed.

The reference #154 is incorrect, you are right. It should say #1398, but I actually want to refer to lines 341-354.

@brianreen
Copy link
Contributor

@JeroBnd @weiwangncar

The proposed change appears to make the following changes from the standard version of the code:

A) If between one and two (preliminary) timesteps from the target time (BC or output) then set the timestep to:
Standard: The time to the target time rounded to hundredths of a second and then divided by 2
Proposed: The time to the target time divided by 2 and then rounded to hundredths of a second

B) If less than one (preliminary) timestep from the target time (BC or output) then set the timestep to:
Standard: The time to the target time
Proposed: The time to the target time rounded to hundredths of a second

However, while the adaptive timestep code does sometimes round timesteps to hundredths of a second it does not always do so. For example, if two domains are running and adaptation_domain=1, the inner domain timestep will be adjusted so that the next outer domain timestep aligns with an inner domain timestep. In this process there is no guarantee that the inner domain timestep will be representable in hundredths of a second. One can also get timesteps that are not representable in hundredths of a second if the user specifies either the minimum or maximum allowed timesteps using fractions that are not representable in hundredths of a second. Because timesteps do not have to be representable as hundredths of a second, the model time also does not have to be representable as hundredths of a second. This means that proposed change B is problematic because it can result in the timestep not reaching the target time (BC or output).

@JeroBnd
Copy link
Contributor Author

JeroBnd commented Oct 14, 2024

@brianreen @weiwangncar
You are right; you found new sources of numeric errors.
If you look at lines 166-181, you’ll see that every time step starts rounding the previous time step to the hundredths of a second. Therefore, the last time step must also be in hundredths of a second to avoid losing time during the rounding.
In conclusion, every time step should have this precision.

@brianreen
Copy link
Contributor

@JeroBnd @weiwangncar

It does not seem to me that these are sources of numeric error.

Lines 166-172 enforce the precision but only for max_increase_factor. In the case where the tentative current timestep is greater than the product of the max_increase_factor (adjusted to precision of hundredths) times the previous timestep, this code updates the tentative current timestep to this product. If the previous timestep is not representable by that precision, there is no guarantee that the updated current timestep will be representable by that precision. This code does not assume that the previous timestep is representable by hundredths of a second.

Lines 174-181 do enforce the precision on the tentative current timestep. However, the minimum and maximum timestep criteria are applied after this, as is the adjustment to ensure that outer domain timesteps align with an inner domain timesteps, and thus the final current timestep will not necessarily match this precision.

Based on this, I am not yet seeing how time will be "lost" as a result of rounding in lines 166-181. Thus far, it seems that the standard code functions correctly as long as interval_seconds is set correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

adaptive time step leads to early termination and dt=0
4 participants