Segfault and CFL Violation Recovery Tricks

Posted: Thu Sep 27, 2012 2:56 pm
by pattim
I love how easy it is to run a model with EMS!! The trouble I have is that I like to do high resolution runs and runs with high resolution nests, and any time I go below ~5km grid resolution, I start getting segfaults in my models. There usually isn't a warning of CFL violation unless my grid is really bad.

About half the time I can get rid of segfaults by increasing TIME_STEP_SOUND in run_timestep.conf from its default of 4. I've gone as high as 16, and that seems to help, although going above 8 slows the simulation substantially. Sometimes choosing different microphysics or cloud physics also helps, as does *slightly* increasing the grid spacing. However, I have not yet been able to run a simulation below ~1.5 km resolution without a segfault (I have tried the same simulation on three different computers!).
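
A rough sketch of what raising TIME_STEP_SOUND does, assuming it sets the number of short acoustic sub-steps taken per model time step (so each sub-step gets shorter as the value goes up). The 340 m/s sound speed, the 1 km grid spacing and the 6 s time step are assumed example values, not numbers from the runs above, and the function names are made up for illustration.

```python
# Assumed typical near-surface speed of sound (m/s); an illustrative constant.
SOUND_SPEED_MS = 340.0


def acoustic_substep(dt_s, time_step_sound):
    """Length in seconds of one acoustic sub-step of the model time step."""
    return dt_s / time_step_sound


def acoustic_courant(dt_s, time_step_sound, dx_m, c_s=SOUND_SPEED_MS):
    """Horizontal Courant number of a sound wave over one acoustic sub-step."""
    return c_s * acoustic_substep(dt_s, time_step_sound) / dx_m


# Example: 1 km grid with a 6 s time step, for the values mentioned above.
for n in (4, 8, 16):
    print(n, acoustic_substep(6.0, n), acoustic_courant(6.0, n, 1000.0))
```

Doubling the value halves the acoustic Courant number but also doubles the number of sub-steps to compute, which fits with the slowdown seen above 8.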

I think it would be great if others can share their experiences with this problem on this "Sharing Knowhow" forum! One idea might be substituting the executables from WRF 3.4 into EMS. (I don't know if later versions of WRF are more stable or not.)

Tell us your magic!!!!
Patricia :)

Re: Segfault Recovery Tricks

Posted: Thu Sep 27, 2012 3:05 pm
by Antonix
I did not understand what you mean!
Do you want the 3.4 binaries to be available for everyone??
I do not think the segfault problem is due to compiling, version, PC or anything like that...
If you give me (us) more information we can talk it through.
see ... now you do a simulation (25-5-1) as I do :ugeek:
I am sure that you will not have problems, Let's work together and see what happens.

Re: Segfault Recovery Tricks

Posted: Thu Sep 27, 2012 3:20 pm
by pattim
Hi Antonix: I'm doing a 27:9:3 simulation right now - all the defaults - except I turned on one-way nesting and CU_PHYSICS = 5,5,5 (Grell 3D which is supposed to be OK for DX < 10km). It segfaulted about 7 model hours into the simulation. I reran with TIME_STEP_SOUND = 8 and it segfaulted 6 hours into the simulation. I just reset my memory to 1333MHz (it's 1600MHz memory) to try to rule out a memory issue. These are all AMD machines I'm using: Phenom II x6 and Opteron 6180SE

I really don't know if there are any bugs in 3.2 WRF that were fixed in 3.4.1 that were related to segfaulting. The EMS segfault error *does* say it may be related to array dimensioning (or something like that).

I will try setting up a 25:5:1 like you say - just using all the defaults or MYJ? Which Cu?

Thank you!!

Re: Segfault Recovery Tricks

Posted: Thu Sep 27, 2012 3:41 pm
by meteoadriatic
Hello, I had real trouble with stability in my last project. I ended up with every run failing with a segmentation fault or a CFL violation. There were multiple issues... I still don't know exactly what all the causes were, but I found some of them for sure. At least my project is now stable.

1) What I found is that one of my RAM modules was defective. That caused segmentation faults most of the time. You can check with memtest from a live CD, or if you don't want to stop your computer there is prime95, which can help in testing hardware.

2) Boundary issues in nests when 2-way nesting is used. This caused a lot of trouble in nests near the boundaries, in the relaxation zone: huge noise patterns, especially in the top 5 sigma levels. When I switched feedback to 0, the boundary noise and CFL violations went away.

3) Too steep topography. I applied a little more smoothing in geogrid and it helped to stabilize CFL conditions over steep terrain.

4) Of course, time step. Now I use the adaptive time step; it shortens the time step when the CFL condition is close to being violated, so it runs better than a fixed time step at the default 6xDX.

5) If you have vertical velocity that violates the CFL condition you can apply some damping in the dynamics... for example these can help:
w_damping = 1
damp_opt = 1 or 3
dampcoef = 0.1 or more
epssm = 0.3 or more

6) Try to avoid putting domain boundaries near steep orography. If you can't avoid it, use more smoothing passes in the geogrid table before you create the domain.
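
The fixed time step rule from point 4 and the CFL check from point 5 can be sketched in Python. This is only an illustration: it assumes the rule of thumb of 6 s per km of grid spacing applied to each domain (that is, nest time step ratios equal to the grid ratios), and the vertical velocity and layer thickness in the example are made-up numbers, not values from any run in this thread. The function names are hypothetical.

```python
def fixed_time_steps(dx_km, factor=6.0):
    """Time step in seconds for each domain, using dt = factor * DX (DX in km)."""
    return [factor * dx for dx in dx_km]


def courant_number(speed_ms, dt_s, spacing_m):
    """Courant number: fraction of one grid cell crossed in one time step."""
    return abs(speed_ms) * dt_s / spacing_m


# The two nest configurations discussed in this thread (grid spacing in km).
for nest in ([27, 9, 3], [25, 5, 1]):
    print(nest, fixed_time_steps(nest))

# Hypothetical updraft of 5 m/s through a 100 m thick model layer:
# compare an 18 s time step with a halved 9 s time step.
print(courant_number(5.0, 18.0, 100.0))
print(courant_number(5.0, 9.0, 100.0))
```

A Courant number near or above 1 means the air crosses a whole grid cell in a single step, which is the situation that a shorter time step or extra damping is meant to avoid.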

I didn't find v3.4 any more stable than previous versions. No difference here.

Also, if you fix all of the above issues (1-6) and still have segmentation faults or CFL violations, then it is time to try other physics or dynamics options... and lastly, if you get an error immediately after model start, that indicates an initialization data issue.

That's all I can think of right now, but of course there can be other issues that I'm not aware of as well.

Re: Segfault Recovery Tricks

Posted: Thu Sep 27, 2012 3:59 pm
by Antonix
So, I had the same problems as meteoadriatic, and I solved them the same way.
Follow all of meteoadriatic's advice, and in particular pay close attention to the RAM.
It is important that the modules are of good quality and stable. I suggest memtest.
Run it for at least 4 cycles and you will be sure that the RAM is stable.
If it is stable, try the other solutions.

Since you have tried it on other computers, you should change the domain as meteoadriatic describes. Be very careful where you place the edges (mountains),
because the noise that is generated can make the prediction much worse.

Re: Segfault Recovery Tricks

Posted: Thu Sep 27, 2012 9:07 pm
by pattim
OK, I'm getting ready to try this, but I thought I would let everyone know that I just tried to run a 2km global run on my 48-cpu Opteron machine - it has registered server memory and is known to be very stable (I run ModelE on it all the time). This 2km run segfaulted. So there are known segfault issues.
Antonix wrote:now you do a simulation (25-5-1) as I do :ugeek:
I am sure that you will not have problems, Let's work together and see what happens.
Antonix, I am going to try this on my opteron 48-CPU box right now and see if it segfaults. I assume I am supposed to use all the defaults. (The only orographic boundary in my domain is the shore.)

Re: Segfault Recovery Tricks

Posted: Thu Sep 27, 2012 9:54 pm
by Antonix
In modeling, the problem is not the terrain... but the edges.
WRF-ARW, unlike MM5, treats the edges in a very "noisy" way.
This could be the problem, so carry out the following steps:
1) create a domain as we told you
2) smooth the topography
3) create nested domains with edges far apart (a "geostrophic" distance)
4) create the domain in such a way that it does not intersect too many topographical variations
5) send me the domain and I will try to run it myself
6) check the memory
7) check which executable the WRF installation uses; you must use the right one for each type of CPU you are using (i7, AMD Istanbul, etc.)

I never had these problems except in one case:
I installed wrf-ems, and then I updated the operating system kernel.
I think your problems are "computer" problems, not modeling ones (which is worse).

Re: Segfault Recovery Tricks

Posted: Fri Sep 28, 2012 2:16 am
by pattim
Thank you very much, guys! That has given me much to think about. I tried a 27km-9km-3km nest and it ran with no segfaults, but when I tried a 25-5-1 it segfaults after about 7 or 8 sweeps. Both were default settings (except 1-way nesting) and had about 5000km x 3000km primary domain size.

I use dwiz so I don't know how to smooth the terrain. I may try it again on the middle of the Sahara (no orography). There are also different qualities of grib data in different parts of the world, and different quality of land data - maybe that would affect the results?

Re: Segfault Recovery Tricks

Posted: Fri Sep 28, 2012 7:30 am
by meteoadriatic
pattim wrote:I use dwiz so I don't know how to smooth the terrain.
dwiz is just a graphical interface that runs various routines. When it runs geogrid it is exactly the same as if you ran geogrid by hand. So, open GEOGRID.TBL.ARW (or NMM) and there you have three places where you can change smoothing:

name = HGT_M
name = HGT_U
name = HGT_V

in all three there is:
smooth_option = .....

you can simply add more smooth_passes there (for example try 3, or even 5...) or you can also change the method... take a look in the WPS manual for details.

When you change that in the table file, simply recreate the domain.
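
If you would rather script the change than edit by hand, here is a minimal Python sketch. It assumes the table entries look like the sample text below (a `name = ...` line followed by a `smooth_option = ...; smooth_passes=N` line, with entries separated by lines of `=`), so check your own table file first. The sample entries, the `OTHER_FIELD` name and the `set_smooth_passes` helper are made up for illustration and are not part of EMS or WPS.

```python
import re

# Illustrative stand-in for a geogrid table; the real file has more keys per entry.
SAMPLE = """\
===============================
name = HGT_M
        smooth_option = smth-desmth; smooth_passes=1
===============================
name = OTHER_FIELD
        smooth_option = smth-desmth; smooth_passes=1
===============================
"""


def set_smooth_passes(table_text, passes, names=("HGT_M", "HGT_U", "HGT_V")):
    """Return table_text with smooth_passes set to `passes` for the given fields only."""
    current = None  # name of the entry we are currently inside
    out = []
    for line in table_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("==="):
            current = None  # separator line: the previous entry has ended
        else:
            match = re.match(r"name\s*=\s*(\S+)", stripped)
            if match:
                current = match.group(1)
        if current in names:
            # Rewrite only the number that follows "smooth_passes=".
            line = re.sub(r"(smooth_passes\s*=\s*)\d+",
                          lambda m: m.group(1) + str(passes), line)
        out.append(line)
    return "".join(out)


print(set_smooth_passes(SAMPLE, 3))
```

To use it on a real table, read the file into a string, keep a backup copy, write the returned text back, and then recreate the domain as described above.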

Re: Segfault Recovery Tricks

Posted: Fri Sep 28, 2012 7:42 am
by Antonix
I remain of the view that it is a problem with the binaries or the hardware configuration.
The problems are too widespread across your simulations.
But are you using ARW or NMM??
I advise you to reinstall wrf-ems, downloading the right version for your processor.

1) Are you using ARW or NMM??
2) Which Linux distribution do you use??
3) Have you tried using 1-way nesting??