Segfault and CFL Violation Recovery Tricks

pattim
Posts: 157
Joined: Sun Jun 24, 2012 8:42 pm
Location: Los Angeles, CA, USA

Re: Segfault Recovery Tricks

Post by pattim » Fri Sep 28, 2012 7:07 pm

Thank you very, very much for the discussion so far, guys!

I have been using only ARW, on two different Phenom II 1090T boxes and on an Opteron 6180SE box. I am using OpenSuSE 12.1 x64. The Opteron box is a server with registered ECC memory. All boxes have run ModelE global climate models successfully.

I think these two AMD processors should be highly compatible with standard x64 linux builds, but I don't think Robert builds specifically for either Opterons or Phenoms. Maybe I should just build WRF3.4.1 and try that, since it is automatically built for my computers? (Does anyone have an online directory for the builds? That should tell me what versions are currently available.)

Don't you think it's interesting from a code perspective that I only have segfaults when I go below 3km resolution?

EDIT: OK, I just ran a 27-9-3 km run and it ran OK, but when I tried 27-9-3-1 it segfaults. I had CU scheme turned off for all domains and only one-way nesting. So it is looking very much like a solution stability problem. I will try some settings, like increasing TIME_STEP_SOUND - that has helped before.

EDIT2: I set TIME_STEP_SOUND from 4 to 16 and the 27-9-3-1 km simulation has run further. I have to wait and see if it runs to completion.
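
A rough way to see why raising TIME_STEP_SOUND can help: the value is the number of acoustic (sound-wave) substeps per model time step, so a larger value shortens each substep and lowers the sound-wave Courant number. The Python sketch below is only illustrative. The function name `acoustic_courant`, the 340 m/s sound speed, and the 6 s time step on a 1 km grid are assumptions, not values taken from these runs.

```python
def acoustic_courant(dt_s, time_step_sound, dx_m, sound_speed_ms=340.0):
    """Rough horizontal Courant number of one acoustic substep.

    dt_s            : model time step in seconds
    time_step_sound : number of acoustic substeps per model time step
    dx_m            : horizontal grid spacing in metres
    sound_speed_ms  : assumed speed of sound (hypothetical round value)
    """
    d_tau = dt_s / time_step_sound          # length of one acoustic substep
    return sound_speed_ms * d_tau / dx_m


# Hypothetical 1 km nest with a 6 s time step.
for n in (4, 8, 16):
    print(n, acoustic_courant(6.0, n, 1000.0))
```

Doubling TIME_STEP_SOUND halves the estimate, at the cost of more acoustic substeps per time step.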

Post by pattim » Sat Sep 29, 2012 9:25 pm

Thanks for your suggestions, guys!! Having success like I did on my Opteron machine, I tried my 2 Phenom II machines with Memtest and sure enough, there was one "bad" memory stick in each machine. By "bad" I mean that it would not run reliably at DDR3-1600, although it ran OK with DDR3-1333 timing. So I replaced those memory modules and only had one problem - a kernel panic due to too much overclocking (4GHz). So I clocked it down to 3.9GHz and it is very stable and now my 1km runs work OK with TIME_STEP_SOUND = 8 (I still get a segfault and EMS crash if it is set to the default, which is 4). But they are SLOWWWWWW....

It's too bad that EMS segfaults when it should instead trap the error from having TIME_STEP_SOUND too low. But at least my three computers are now all acting consistently!!!!!!

Thanks again for your suggestions, I am going to be playing around with your suggestions for the next week.

Also, I found the unstable overclocking by booting Windows and running Intel CPU Burn.

Patti :D

Antonix
Posts: 256
Joined: Fri Oct 16, 2009 8:53 am

Post by Antonix » Sun Sep 30, 2012 8:22 pm

As I said, any change or slight damage to the hardware turns into errors.
You cannot search for an error in the code if you are not certain of the hardware.
If you remove the overclock, you will not get segfaults.
Eliminate the overclock completely: return the machine to its stock settings and run a test.
Before running WRF on an overclocked PC, you must be sure that the CPU and RAM are rock solid through at least 24 hours of stress testing.
PS: Overclocking gets the best performance out of a machine, but it damages not only the CPU, which perhaps suffers the least, but also does much damage to the RAM and other devices.
Unless you have a motherboard made for overclocking, you should never use it.
Let us know how it goes ;) ;) ;)

Post by pattim » Sun Sep 30, 2012 9:46 pm

I used Memtest86 to determine which memory modules would not work at DDR3-1600. I replaced those, then ran Memtest86 on the memory at DDR3-1600 for 24 hours and verified no errors. So I know the memory is good.

I installed the sensors Linux package so I could watch my CPU temperatures during EMS runs. I found the temperatures were too high, so I lowered the overclock frequency. Now the memory and CPU temperatures are OK.

My segfaults are gone with 1km resolution! But I must set TIME_STEP_SOUND=8. If I set TIME_STEP_SOUND=4, then my 1km runs have segfaults.

I will try 2km runs and see if a lower TIME_STEP_SOUND is possible.

Post by Antonix » Sun Sep 30, 2012 10:24 pm

Sorry, but I do not agree with the procedure (though maybe I did not understand how you are working right now).
I find it more practical to start from a stock configuration and only then proceed with the overclock.
That said, I think overclocking is not a good solution, at either the numerical or the "physical" level.
Put everything back to default, run the tests, and then try a run.

meteoadriatic
Posts: 1512
Joined: Wed Aug 19, 2009 10:05 am

Post by meteoadriatic » Mon Oct 01, 2012 11:02 am

If you want to overclock, you need to pay close attention to cooling. The CPU, RAM, and northbridge are particularly sensitive and will have trouble with temperature when overclocked. You will also probably need higher voltages to keep the system stable, but when you turn various voltages up (or the motherboard does that automatically, if set that way), much more heat is produced.

I personally avoid any overclocking above the components' rated figures. Then I can even lower the voltages a bit to keep the system cooler. And even then I like to use large, good-quality cooling devices.

Post by pattim » Wed Oct 03, 2012 10:14 pm

Hi Meteo: I did verify that the segfaults occur even when not overclocked, and I am very careful with cooling. So I think you are right about this being associated with grid/orography and even grid/nest interactions. I usually only get it with grid sizes of 2 km or smaller. I am not exactly sure how to troubleshoot it. I'm trying some different runs to see if I can find a good example to talk about on this forum.

Patricia

Post by meteoadriatic » Thu Oct 04, 2012 7:42 am

Is there anything in your rsl_error log files about why the model stopped? I believe there are CFL violations... you need to check all the rsl_error files, not just the first one, to find CFL violations.
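
Since each MPI task writes its own log, a small script can save checking the files one by one. This is a minimal Python sketch, assuming the logs follow the usual WRF naming `rsl.error.NNNN` in the current directory (EMS may keep them elsewhere, so adjust the pattern). The function names are made up for illustration, and the regular expression assumes the "points exceeded cfl" message format that WRF prints.

```python
import glob
import re
from collections import Counter

# Matches lines such as "157 points exceeded cfl=2 in domain d01".
CFL_RE = re.compile(r"(\d+)\s+points exceeded cfl=\S+ in domain (d\d+)")


def count_cfl_lines(lines):
    """Return a Counter mapping domain name to number of CFL warning lines."""
    counts = Counter()
    for line in lines:
        match = CFL_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def scan_rsl_files(pattern="rsl.error.*"):
    """Scan every file matching `pattern`; return {filename: Counter}."""
    report = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            counts = count_cfl_lines(handle)
        if counts:                      # keep only files with CFL warnings
            report[path] = counts
    return report


if __name__ == "__main__":
    for path, counts in scan_rsl_files().items():
        print(path, dict(counts))
```

Files that print nothing here had no CFL warnings; any file that does show up is worth reading around the reported times.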

Post by pattim » Wed Oct 31, 2012 2:46 pm

meteoadriatic wrote: Do you have in your rsl_error log files anything why model stopped? I believe there are cfl violations... you need to check all rsl_error files, not just first one to find cfl violations.

Yes, I have to look more often for CFL violations. I am currently doing a simulation of the 2011 April tornadoes and got segfaults when using the Grell (5) cumulus scheme (CU_PHYSICS = 5,5), but when I switched to CU_PHYSICS = 2,0 - the segfault did not occur.

Code:

                
WRF EMS MODEL RUN SUMMARY FOR US-Tornados-001
                                     WRF REAL program         WRF ARW core            
    ------------------------------------------------------------------------------
     System Name                                Scheduled Processors
       OS121-1.site              :          06                     06                    
     Total Processors            :          06                     06                    
     Domain Decomposition        :        1 x 6                    1 x 6                   
     Active Domains                  Domain 01                Domain 02
    ------------------------------------------------------------------------------

     Domain & Run Information           
       Domain Type               :  Limited Area             Limited Area            
       Primary Time Step         :  75 Seconds               25 Seconds              
       Grid dimensions (NX x NY) :  200 x 172                301 x 268               
       Vertical Layers (NZ)      :  45                       45                      
       Grid Spacing              :  15.00km                  5.00 km                 
       Top of Model Atmosphere   :  50mb                     50mb                    
       Parent Domain             :  NA                       Domain 01               
       Nesting Feedback          :  Feedback Off             Feedback Off            

     Timing Information                                 
       Start Date                :  2011 Apr 25 06:00 UTC    2011 Apr 25 06:00 UTC   
       End Date                  :  2011 Apr 28 06:00 UTC    2011 Apr 28 06:00 UTC   
       Simulation Length         :  72 Hours                 72 Hours                
       Boundry Update Freq       :  06 Hours                

     File Output Information           
       File Output Freq          :  30 Minutes               15 Minutes              
       Output File Format        :  netCDF                   netCDF                  
       Adjust Output Times       :  Yes                      Yes                     

     Model Physics                  
       Cumulus Scheme            :  Betts-Miller-Janjic      None                    
       Microphysics Scheme       :  Lin et al.               Lin et al.              
       PBL Scheme                :  Mellor-Yamada-Janjic     Mellor-Yamada-Janjic    
       Land Surface Scheme       :  Noah 4-Layer LSM         Noah 4-Layer LSM        
       Number Soil Layers        :  4                        4                       
       Surface Layer Physics     :  Monin-Obukhov (Janjic)   Monin-Obukhov (Janjic)  
       Long Wave Radiation       :  RRTM                     RRTM                    
       Short Wave Radiation      :  Dudhia Scheme            Dudhia Scheme           
       Cloud Effects             :  Cloud Effects On         Cloud Effects On        
       Topographic Shading       :  Shading Effects Off      Shading Effects Off     

     ARW Core Model Dynamics     
       Dynamics                  :  Non-Hydrostatic          Non-Hydrostatic         
       Gravity Wave Drag         :  Off                      Off                     
       Time-Integration Scheme   :  Runge-Kutta 3rd Order    Runge-Kutta 3rd Order   
       Diffusion Sheme           :  Simple Diffusion         Simple Diffusion        
       6th-order Diffusion       :  6th-Order W/O Up-Gradient  6th-Order W/O Up-Gradient 
       6th-order Diffusion Rate  :  0.12                     0.12                    
       Eddy Coefficient Scheme   :  2D 1st Order Closure     2D 1st Order Closure    
       Damping Option            :  No Damping               No Damping              
       W Damping                 :  W Damping On             W Damping On            
       Horiz Momentum Advection  :  5th Order                5th Order               
       Horiz Scalar Advection    :  5th Order                5th Order               
       Vert Momentum Advection   :  3rd Order                3rd Order               
       Vert Scalar Advection     :  3rd Order                3rd Order               
       Sound Time Step Ratio     :  8 to 1 (Large TS)        8 to 1 (Large TS)       
       Moisture Advection Option :  Positive-Definite        Positive-Definite       
       Scalar Advection Option   :  Positive-Definite        Positive-Definite       
       TKE Advection Option      :  Positive-Definite        Positive-Definite   

Post by meteoadriatic » Tue Nov 06, 2012 3:30 pm

In yesterday's case I still had a big issue with CFL violations and the model crashing, but I have solved the problem successfully. Here is the full report.

I had reproducible crashes of my ARW model at the same geographical point and at exactly the same time. I had this configuration:

Code:

&time_control
 start_year                 = 2012, 2012
 start_month                = 09, 09
 start_day                  = 16, 16
 start_hour                 = 00, 00
 start_minute               = 00, 00
 start_second               = 00, 00
 end_year                   = 2012, 2012
 end_month                  = 09, 09
 end_day                    = 20, 20
 end_hour                   = 00, 00
 end_minute                 = 00, 00
 end_second                 = 00, 00
 interval_seconds           = 10800
 input_from_file            = T, T
 history_interval           = 1440, 60
 history_outname            = "wrfout_d<domain>_<date>"
 frames_per_outfile         = 1, 1
 io_form_history            = 2
 io_form_input              = 2
 io_form_restart            = 2
 io_form_boundary           = 2
 io_form_auxinput2          = 2
 restart                    = F
 restart_interval           = 11520
 auxhist1_outname           = "sfcout_d<domain>_<date>"
 auxhist1_interval          = 0, 0
 frames_per_auxhist1        = 1, 1
 io_form_auxhist1           = 2
 auxhist2_outname           = "auxhist2_d<domain>_<date>"
 auxhist2_interval          = 0, 0
 frames_per_auxhist2        = 1, 1
 io_form_auxhist2           = 5
 fine_input_stream          = 0, 2
 adjust_output_times        = T
 debug_level                = 0
/

&domains
 time_step                  = 30
 time_step_fract_num        = 0
 time_step_fract_den        = 10
 max_dom                    = 2
 s_we                       = 1, 1
 e_we                       = 200, 206
 s_sn                       = 1, 1
 e_sn                       = 190, 196
 s_vert                     = 1, 1
 e_vert                     = 55, 55
 dx                         = 20000.0000, 4000.0000
 dy                         = 20000.0000, 4000.0000
 grid_id                    = 1, 2
 parent_id                  = 1, 1
 i_parent_start             = 1, 91
 j_parent_start             = 1, 69
 parent_grid_ratio          = 1, 5
 parent_time_step_ratio     = 1, 5
 feedback                   = 0
 smooth_option              = 1
 grid_allowed               = T, T
 max_dz                     = 1000.
 numtiles                   = 1
 nproc_x                    = -1
 nproc_y                    = -1
 num_metgrid_soil_levels    = 4
 num_metgrid_levels         = 27
 interp_type                = 2
 extrap_type                = 2
 t_extrap_type              = 2
 use_levels_below_ground    = T
 use_surface                = T
 lagrange_order             = 1
 zap_close_levels           = 500
 lowest_lev_from_sfc        = F
 force_sfc_in_vinterp       = 1
 sfcp_to_sfcp               = F
 smooth_cg_topo             = F
 use_tavg_for_tsk           = F
 p_top_requested            = 5000
 use_adaptive_time_step     = T
 step_to_output_time        = T
 target_cfl                 = 1.2, 1.2
 max_step_increase_pct      = 5, 51
 starting_time_step         = 60, 12
 max_time_step              = 200, 40
 min_time_step              = 60, 12
/

&physics
 mp_physics                 = 95, 8
 cu_physics                 = 1, 0
 kfeta_trigger              = 2
 cudt                       = 5, 0
 sf_sfclay_physics          = 11, 11
 sf_surface_physics         = 2, 2
 num_soil_layers            = 4
 num_land_cat               = 20
 num_soil_cat               = 16
 bl_pbl_physics             = 1, 1
 topo_wind                  = 1, 1, 1
 bldt                       = 0, 0
 ra_lw_physics              = 1, 1
 ra_sw_physics              = 2, 2
 radt                       = 20, 20
 sst_skin                   = 1
/

&dynamics                
 rk_ord                     = 3
 w_damping                  = 1
 diff_opt                   = 1
 km_opt                     = 4
 diff_6th_opt               = 2
 diff_6th_factor            = 0.10
 base_temp                  = 290.
 damp_opt                   = 0
 zdamp                      = 5000.
 dampcoef                   = 0.01
 khdif                      = 0
 kvdif                      = 0
 smdiv                      = 0.1
 emdiv                      = 0.01
 epssm                      = 0.1
 non_hydrostatic            = T
 gwd_opt                    = 1
 h_mom_adv_order            = 5
 h_sca_adv_order            = 5
 v_mom_adv_order            = 3
 v_sca_adv_order            = 3
 moist_adv_opt              = 1
 scalar_adv_opt             = 1
 chem_adv_opt               = 0
 tke_adv_opt                = 0
/

&bdy_control             
 spec_bdy_width             = 5
 spec_zone                  = 1
 relax_zone                 = 4
 specified                  = T, F
 nested                     = F, T
/

&namelist_quilt          
 nio_tasks_per_group        = 0
 nio_groups                 = 1
/

These were the errors from my log files, always at the same location and level (i,j,k: 77,93,54). Level 54 is near the top of the model, which has 55 levels, and the location i=77, j=93 is close to the centre of the domain.

Code:

Timing for main (dt= 60.00): time 2012-11-05_12:37:47 on domain   1:    6.45040 elapsed seconds
Timing for main (dt= 30.00): time 2012-11-05_12:38:17 on domain   2:    2.05627 elapsed seconds
Timing for main (dt= 30.00): time 2012-11-05_12:38:47 on domain   2:    2.07021 elapsed seconds
Timing for main (dt= 60.00): time 2012-11-05_12:38:47 on domain   1:    6.35270 elapsed seconds
 d01 2012-11-05_12:38:47+34/**          157 points exceeded cfl=2 in domain d01 
 at time 2012-11-05_12:38:47+34/** hours
 d01 2012-11-05_12:38:47+34/**  MAX AT i,j,k:           77          93          
 54 vert_cfl,w,d(eta)=   732.0633      -324237.7      5.2468888E-03
Timing for main (dt= 30.00): time 2012-11-05_12:39:17 on domain   2:    2.06971 elapsed seconds
Timing for main (dt= 30.00): time 2012-11-05_12:39:47 on domain   2:    2.06886 elapsed seconds
Timing for main (dt= 60.00): time 2012-11-05_12:39:47 on domain   1:    6.34816 elapsed seconds
 d01 2012-11-05_12:39:47+34/**          314 points exceeded cfl=2 in domain d01 
 at time 2012-11-05_12:39:47+34/** hours
 d01 2012-11-05_12:39:47+34/**  MAX AT i,j,k:           17          34          
 54 vert_cfl,w,d(eta)=   651.3358      -391281.1      5.2468888E-03
Timing for main (dt= 12.00): time 2012-11-05_12:39:59 on domain   2:    1.89663 elapsed seconds
Timing for main (dt= 12.00): time 2012-11-05_12:40:11 on domain   2:    1.82588 elapsed seconds

So when you have CFL errors near the top of the model, it is probably an issue with model dynamics, not physics. After some investigation and trial and error, I found that changing the upper damping option from a layer of increased diffusion (damp_opt = 1) to the implicit gravity-wave damping layer (damp_opt = 3) fixes the issue, and the model could run smoothly past 2012-11-05_12:38.
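
To check where the worst CFL points sit, the "MAX AT i,j,k" entries can be pulled out of the log and compared with e_vert. A minimal Python sketch, assuming the wrapped two-line format shown in the log above. The names `cfl_maxima` and `near_model_top` and the margin of 5 levels are hypothetical choices for illustration.

```python
import re

# Matches the wrapped "MAX AT i,j,k" entry; \s+ also spans the line break.
MAX_RE = re.compile(
    r"MAX AT i,j,k:\s+(\d+)\s+(\d+)\s+(\d+)\s+vert_cfl,w,d\(eta\)=\s*(\S+)"
)


def cfl_maxima(log_text):
    """Return a list of (i, j, k, vert_cfl) tuples found in the log text."""
    return [
        (int(i), int(j), int(k), float(cfl))
        for i, j, k, cfl in MAX_RE.findall(log_text)
    ]


def near_model_top(k, e_vert, margin=5):
    """True if level k lies within `margin` levels of the top (e_vert)."""
    return k >= e_vert - margin


SAMPLE = """\
 d01 2012-11-05_12:38:47+34/**  MAX AT i,j,k:           77          93
 54 vert_cfl,w,d(eta)=   732.0633      -324237.7      5.2468888E-03
"""

for i, j, k, cfl in cfl_maxima(SAMPLE):
    where = "near model top" if near_model_top(k, e_vert=55) else "lower levels"
    print(f"i={i} j={j} k={k} vert_cfl={cfl} ({where})")
```

Maxima that cluster in the top few levels point toward the upper damping settings, while maxima lower down may call for a look at the time step or the terrain instead.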

When changing damp_opt, I also had to change dampcoef, because the model uses this value differently depending on the damp_opt choice. Here is an extract from the ARW manual:

Code:

damp_opt 	upper level damping flag
0 without damping
1 with diffusive damping; maybe used for real-data cases (dampcoef nondimensional ~ 0.01 - 0.1)
2 with Rayleigh damping (dampcoef inverse time scale [1/s], e.g. 0.003)
3 with w-Rayleigh damping (dampcoef inverse time scale [1/s] e.g. 0.2; for real-data cases)

So I selected dampcoef = 0.2. I also changed zdamp from 5000 to 10000, but I think it is not needed.
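
The pairing of damp_opt and dampcoef is easy to get wrong after switching options, so a quick sanity check may help. The Python sketch below encodes only the example values from the manual extract above. The function name and the factor-of-5 tolerance are assumptions for illustration, not limits enforced by WRF.

```python
# damp_opt -> (description, (low, high) example dampcoef from the manual extract)
DAMP_GUIDE = {
    0: ("no damping", None),
    1: ("diffusive damping (nondimensional)", (0.01, 0.1)),
    2: ("Rayleigh damping (1/s)", (0.003, 0.003)),
    3: ("w-Rayleigh damping (1/s)", (0.2, 0.2)),
}


def dampcoef_plausible(damp_opt, dampcoef, factor=5.0):
    """True if dampcoef is within `factor` of the manual's example values."""
    _, example = DAMP_GUIDE[damp_opt]
    if example is None:                 # damp_opt = 0: dampcoef is not used
        return True
    low, high = example
    return low / factor <= dampcoef <= high * factor


print(dampcoef_plausible(3, 0.2))    # True: matches the manual's example
print(dampcoef_plausible(3, 0.01))   # False: value left over from another option
```

A False result only means the value is far from the documented examples, so it is a prompt to double-check rather than proof of a mistake.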

This is the resulting namelist configuration:

Code:

&time_control
 start_year                 = 2012, 2012
 start_month                = 11, 11
 start_day                  = 06, 06
 start_hour                 = 00, 00
 start_minute               = 00, 00
 start_second               = 00, 00
 end_year                   = 2012, 2012
 end_month                  = 11, 11
 end_day                    = 10, 10
 end_hour                   = 00, 00
 end_minute                 = 00, 00
 end_second                 = 00, 00
 interval_seconds           = 10800
 input_from_file            = T, T
 history_interval           = 1440, 60
 history_outname            = "wrfout_d<domain>_<date>"
 frames_per_outfile         = 1, 1
 io_form_history            = 2
 io_form_input              = 2
 io_form_restart            = 2
 io_form_boundary           = 2
 io_form_auxinput2          = 2
 restart                    = F
 restart_interval           = 11520
 auxhist1_outname           = "sfcout_d<domain>_<date>"
 auxhist1_interval          = 0, 0
 frames_per_auxhist1        = 1, 1
 io_form_auxhist1           = 2
 auxhist2_outname           = "auxhist2_d<domain>_<date>"
 auxhist2_interval          = 0, 0
 frames_per_auxhist2        = 1, 1
 io_form_auxhist2           = 5
 fine_input_stream          = 0, 2
 adjust_output_times        = T
 debug_level                = 0
/

&domains
 time_step                  = 30
 time_step_fract_num        = 0
 time_step_fract_den        = 10
 max_dom                    = 2
 s_we                       = 1, 1
 e_we                       = 200, 206
 s_sn                       = 1, 1
 e_sn                       = 190, 196
 s_vert                     = 1, 1
 e_vert                     = 55, 55
 dx                         = 20000.0000, 4000.0000
 dy                         = 20000.0000, 4000.0000
 grid_id                    = 1, 2
 parent_id                  = 1, 1
 i_parent_start             = 1, 91
 j_parent_start             = 1, 69
 parent_grid_ratio          = 1, 5
 parent_time_step_ratio     = 1, 5
 feedback                   = 0
 smooth_option              = 1
 grid_allowed               = T, T
 max_dz                     = 1000.
 numtiles                   = 1
 nproc_x                    = -1
 nproc_y                    = -1
 num_metgrid_soil_levels    = 4
 num_metgrid_levels         = 27
 interp_type                = 2
 extrap_type                = 2
 t_extrap_type              = 2
 use_levels_below_ground    = T
 use_surface                = T
 lagrange_order             = 1
 zap_close_levels           = 500
 lowest_lev_from_sfc        = F
 force_sfc_in_vinterp       = 1
 sfcp_to_sfcp               = F
 smooth_cg_topo             = F
 use_tavg_for_tsk           = F
 p_top_requested            = 5000
 use_adaptive_time_step     = T
 step_to_output_time        = T
 target_cfl                 = 1.2, 1.2
 max_step_increase_pct      = 5, 51
 starting_time_step         = 60, 12
 max_time_step              = 200, 40
 min_time_step              = 60, 12
/

&physics
 mp_physics                 = 95, 8
 cu_physics                 = 1, 0
 kfeta_trigger              = 2
 cudt                       = 5, 0
 sf_sfclay_physics          = 11, 11
 sf_surface_physics         = 2, 2
 num_soil_layers            = 4
 num_land_cat               = 20
 num_soil_cat               = 16
 bl_pbl_physics             = 1, 1
 topo_wind                  = 1, 1
 bldt                       = 0, 0
 ra_lw_physics              = 1, 1
 ra_sw_physics              = 2, 2
 radt                       = 20, 20
 sst_skin                   = 1
/

&dynamics                
 rk_ord                     = 3
 w_damping                  = 1
 diff_opt                   = 1
 km_opt                     = 4
 diff_6th_opt               = 2
 diff_6th_factor            = 0.12
 base_temp                  = 290.
 damp_opt                   = 3
 zdamp                      = 10000.
 dampcoef                   = 0.2
 khdif                      = 0
 kvdif                      = 0
 smdiv                      = 0.1
 emdiv                      = 0.01
 epssm                      = 0.1
 non_hydrostatic            = T
 gwd_opt                    = 1
 h_mom_adv_order            = 5
 h_sca_adv_order            = 5
 v_mom_adv_order            = 3
 v_sca_adv_order            = 3
 moist_adv_opt              = 1
 scalar_adv_opt             = 1
 chem_adv_opt               = 0
 tke_adv_opt                = 0
/

&bdy_control             
 spec_bdy_width             = 5
 spec_zone                  = 1
 relax_zone                 = 4
 specified                  = T, F
 nested                     = F, T
/

&namelist_quilt          
 nio_tasks_per_group        = 0
 nio_groups                 = 1
/
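
The two namelists above differ in only a handful of entries, which is easy to miss when reading them side by side. A minimal Python sketch of a namelist diff follows, assuming simple `key = value` lines inside `&group ... /` blocks. `parse_namelist` and `diff_namelists` are hypothetical helpers, not part of WRF or EMS, and quoted strings containing commas are not handled.

```python
def parse_namelist(text):
    """Parse a simple Fortran namelist into {group: {key: value_string}}."""
    groups, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("&"):
            current = groups.setdefault(line[1:].strip(), {})
        elif line == "/":
            current = None
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            # Normalise spacing so "1, 1" and "1,1" compare equal.
            current[key.strip()] = ",".join(v.strip() for v in value.split(","))
    return groups


def diff_namelists(old, new):
    """Return sorted (group, key, old_value, new_value) for changed entries."""
    changes = []
    for group in sorted(set(old) | set(new)):
        a, b = old.get(group, {}), new.get(group, {})
        for key in sorted(set(a) | set(b)):
            if a.get(key) != b.get(key):
                changes.append((group, key, a.get(key), b.get(key)))
    return changes


# Demo using a few &dynamics entries from the two configurations above.
before = parse_namelist("&dynamics\n damp_opt = 0\n zdamp = 5000.\n dampcoef = 0.01\n/\n")
after = parse_namelist("&dynamics\n damp_opt = 3\n zdamp = 10000.\n dampcoef = 0.2\n/\n")
for group, key, old_value, new_value in diff_namelists(before, after):
    print(f"&{group} {key}: {old_value} -> {new_value}")
```

Reading both full namelist files into `parse_namelist` would list every changed entry, including ones not mentioned in the text.
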

Hope it helps.
