Fix the source of the dt=0 error when using adaptive time step #2103

JeroBnd · 2024-09-11T03:19:48Z

Fixed the source of the error dt=0 in BC and out time steps adjust.

TYPE: bug fix

KEYWORDS: time, step, adaptative

SOURCE: Jeronimo Bande (IDING SAS, Córdoba, Argentina)

DESCRIPTION OF CHANGES:
Problem:
When adjusting the time step toward BC and output times, occasionally time step is erroneously set to 0 causing model to stop.

Solution:
The algorithm to determine the time step near BC and output times is modified to correctly set the time step.

ISSUE: Fixes #1560

LIST OF MODIFIED FILES: /dyn_em/adapt_timestep_em.F

TESTS CONDUCTED:

Do mods fix problem? Yes.
The Jenkins tests are all passing.

RELEASE NOTE: Corrected adaptive time step being occasionally set to 0 (which causes the model to stop) on BC and output times.

…pted to mitigate this effect. Fixed the source of the error dt=0 in BC and out time steps adjust. The division by 2 of tmpTimeInterval caused the time to be out of precision for odd values of time_to_bc and time_to_output.

weiwangncar · 2024-09-11T16:59:01Z

The regression test results:

Test Type              | Expected  | Received |  Failed
= = = = = = = = = = = = = = = = = = = = = = = =  = = = =
Number of Tests        : 23           24
Number of Builds       : 60           57
Number of Simulations  : 158           150        0
Number of Comparisons  : 95           86        0

Failed Simulations are: 
None
Which comparisons are not bit-for-bit: 
None

dudhia · 2024-09-11T16:59:44Z

To help us review this, can you add more explanation of the fix in the description section?

JeroBnd · 2024-09-11T20:22:37Z

Hello.

The adaptive time step module operates with a precision of 1/100 seconds, resulting in simulation times with the same precision.

The process that determines the dtInterval involves several steps, checking various conditions. One of these conditions is the precision of 1/100 seconds.

When adjusting the time step to boundary conditions (BC) and output times, the algorithm divides a temporary time interval (with 1/100 sec precision) by two and assigns it to dtInterval. When this temporary time interval has an odd value, the precision changes to 1/200 sec, and the next simulation time also has a precision of 1/200 sec.

In the next step, adjacent to the BC or output time, the algorithm truncates the 1/200 sec precision, setting the simulation time to 1/200 sec before the BC or output time (without the mitigation done in #1398).

The following dtInterval, which is 1/200 sec, is then truncated to a precision of 1/100 sec, resulting in a dtInterval of 0.

This effect is mitigated in #1398 but does not address the underlying source of the problem.

weiwangncar · 2024-09-11T21:16:55Z

@JeroBnd Can you expand IDING SAS? Are you working with Kugler who posted the issue?

JeroBnd · 2024-09-11T21:39:38Z

I am not working with Kugler. I am from Córdoba, Argentina. IDING SAS is a startup. We provide services to APRHI (Provincial Administration of Water Resources) for reservoir management. In this context, we are operationally running a high-resolution weather forecast ensemble with WRF.

I found this bug while trying to debug an error caused by myself in the namelist.input that did not throw a warning.

There are several things to do in the adapt_timestep module...

weiwangncar · 2024-09-12T00:08:31Z

@JeroBnd Thanks for the info. I tested one of the cases Kugler had problem with, the em_b_wave case, your change didn't help. Did you test that case, using his namelist.input file?

JeroBnd · 2024-09-12T12:44:24Z

@weiwangncar, the adapt_timestep module is not prepared to handle this kind of idealized case. The algorithm uses the remaining time until the boundary condition is applied, with a counter that resets when the boundary condition is used.

In this idealized case, the counter starts with a value of 10800 (equivalent to 3 hours), but when the supposed boundary condition should occur, it doesn’t, and the counter continues decreasing into negative values, causing the simulation to terminate prematurely.

I wrote a small patch to fix this, but the module should be reconsidered. It uses several variables from the namelist.input without properly checking them.

weiwangncar · 2024-09-12T19:45:44Z

@JeroBnd Did you encounter a problem associated with dt=0? If so, what version of the code is it?

JeroBnd · 2024-09-13T12:19:37Z

@weiwangncar I am doing one-way nesting using Ndown and adaptive time step. The boundary conditions (BC) are taken from domain 1 to domain 2 every 30 minutes (1800 seconds).

However, when running WRF for domain 2 with a mistake in the namelist.input file, specifically with the variable interval_seconds set to 3600 seconds, one random run out of the 24-member ensemble usually fails with a CFL error and a segmentation fault or stops at NOAH MP.

For this to happen, two conditions must occur simultaneously:

1 - The time interval between the current simulation time and the next BC time is an odd value (resulting a running time precision of 1/200 sec).

2 - The BC time does not match the BC time derived from interval_seconds, which in my case should occur at times ending in 30 minutes, such as 1:30.

When both of these conditions occur, the mitigation algorithm for dt=0 (#1398) sets the dtInterval to match the BC time derived from interval_seconds, generating a time interval of 30 minutes, which crashes the run.

weiwangncar · 2024-09-13T15:27:13Z

@brianreen We wonder if you could help review this PR? Thanks.

brianreen · 2024-09-13T20:17:54Z

Yes, I will help review this PR.

brianreen · 2024-09-19T11:07:32Z

@JeroBnd

Does this issue only occur when interval_seconds is set incorrectly?

Since I did not think #154 changed code that sets dtInterval, could you clarify what lines of code you are referring to when you say "the mitigation algorithm for dt=0 (#154)"?

weiwangncar · 2024-10-08T20:54:48Z

@JeroBnd Can you address Brian's comment?

JeroBnd · 2024-10-08T21:51:37Z

@brianreen, @weiwangncar
Yes, when you use an incorrect interval_seconds, the round-off error derived from dtInterval = tmpTimeInterval / 2 becomes apparent.
The code from lines 340 to 354 uses a tricks to handle this problem, using the interval_seconds variable.
The proposed code fixes the round-off issue, and the additional ticks are no longer needed.

The reference #154 is incorrect, you are right. It should say #1398, but I actually want to refer to lines 341-354.

brianreen · 2024-10-10T19:27:24Z

@JeroBnd @weiwangncar

The proposed change appears to make the following changes from the standard version of the code:

A) If between one and two (preliminary) timesteps from the target time (BC or output) then set the timestep to:
Standard: The time to the target time rounded to hundredths of a second and then divided by 2
Proposed: The time to the target time divided by 2 and then rounded to hundredths of a second

B) If less than one (preliminary) timestep from the target time (BC or output) then set the timestep to:
Standard: The time to the target time
Proposed: The time to the target time rounded to hundredths of a second

However, while the adaptive timestep code does sometimes round timesteps to hundredths of a second it does not always do so. For example, if two domains are running and adaptation_domain=1, the inner domain timestep will be adjusted so that the next outer domain timestep aligns with an inner domain timestep. In this process there is no guarantee that the inner domain timestep will be representable in hundredths of a second. One can also get timesteps that are not representable in hundredths of a second if the user specifies either the minimum or maximum allowed timesteps using fractions that are not representable in hundredths of a second. Because timesteps do not have to be representable as hundredths of a second, the model time also does not have to be representable as hundredths of a second. This means that proposed change B is problematic because it can result in the timestep not reaching the target time (BC or output).

JeroBnd · 2024-10-14T15:48:28Z

@brianreen @weiwangncar
You are right; you found new sources of numeric errors.
If you look at lines 166-181, you’ll see that every time step starts rounding the previous time step to the hundredths of a second. Therefore, the last time step must also be in hundredths of a second to avoid losing time during the rounding.
In conclusion, every time step should have this precision.

brianreen · 2024-10-25T15:05:51Z

@JeroBnd @weiwangncar

It does not seem to me that these are sources of numeric error.

Lines 166-172 enforce the precision but only for max_increase_factor. In the case where the tentative current timestep is greater than the product of the max_increase_factor (adjusted to precision of hundredths) times the previous timestep, this code updates the tentative current timestep to this product. If the previous timestep is not representable by that precision, there is no guarantee that the updated current timestep will be representable by that precision. This code does not assume that the previous timestep is representable by hundredths of a second.

Lines 174-181 do enforce the precision on the tentative current timestep. However, the minimum and maximum timestep criteria are applied after this, as is the adjustment to ensure that outer domain timesteps align with an inner domain timesteps, and thus the final current timestep will not necessarily match this precision.

Based on this, I am not yet seeing how time will be "lost" as a result of rounding in lines 166-181. Thus far, it seems that the standard code functions correctly as long as interval_seconds is set correctly.

weiwangncar changed the base branch from master to release-v4.6.1 September 11, 2024 18:57

weiwangncar added the release-v4.6.1 label Sep 11, 2024

This was referenced Sep 12, 2024

Fix the source of the dt=0 error when using adaptive time step in em_b_wave ideal test case #2106

Open

Fix the dtInterval adjunstment on output time using adaptive and JeroBnd/WRF#1

Merged

JeroBnd changed the title ~~Fixed the source of the dt=0 error when using adaptive time step~~ Fix the source of the dt=0 error when using adaptive time step Sep 12, 2024

weiwangncar added Develop Branch and removed release-v4.6.1 labels Oct 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix the source of the dt=0 error when using adaptive time step #2103

Fix the source of the dt=0 error when using adaptive time step #2103

JeroBnd commented Sep 11, 2024 •

edited by weiwangncar

Loading

weiwangncar commented Sep 11, 2024

dudhia commented Sep 11, 2024

JeroBnd commented Sep 11, 2024 •

edited by weiwangncar

Loading

weiwangncar commented Sep 11, 2024

JeroBnd commented Sep 11, 2024

weiwangncar commented Sep 12, 2024

JeroBnd commented Sep 12, 2024

weiwangncar commented Sep 12, 2024

JeroBnd commented Sep 13, 2024 •

edited by weiwangncar

Loading

weiwangncar commented Sep 13, 2024

brianreen commented Sep 13, 2024

brianreen commented Sep 19, 2024

weiwangncar commented Oct 8, 2024

JeroBnd commented Oct 8, 2024

brianreen commented Oct 10, 2024

JeroBnd commented Oct 14, 2024

brianreen commented Oct 25, 2024

Fix the source of the dt=0 error when using adaptive time step #2103

Are you sure you want to change the base?

Fix the source of the dt=0 error when using adaptive time step #2103

Conversation

JeroBnd commented Sep 11, 2024 • edited by weiwangncar Loading

weiwangncar commented Sep 11, 2024

dudhia commented Sep 11, 2024

JeroBnd commented Sep 11, 2024 • edited by weiwangncar Loading

weiwangncar commented Sep 11, 2024

JeroBnd commented Sep 11, 2024

weiwangncar commented Sep 12, 2024

JeroBnd commented Sep 12, 2024

weiwangncar commented Sep 12, 2024

JeroBnd commented Sep 13, 2024 • edited by weiwangncar Loading

weiwangncar commented Sep 13, 2024

brianreen commented Sep 13, 2024

brianreen commented Sep 19, 2024

weiwangncar commented Oct 8, 2024

JeroBnd commented Oct 8, 2024

brianreen commented Oct 10, 2024

JeroBnd commented Oct 14, 2024

brianreen commented Oct 25, 2024

JeroBnd commented Sep 11, 2024 •

edited by weiwangncar

Loading

JeroBnd commented Sep 11, 2024 •

edited by weiwangncar

Loading

JeroBnd commented Sep 13, 2024 •

edited by weiwangncar

Loading