Segmentation Fault

All issues/questions about the EMS v3.4 package, please ask here.
robncyns
Posts: 6
Joined: Sat Apr 12, 2014 12:37 am

Segmentation Fault

Post by robncyns » Thu Apr 17, 2014 3:22 pm

I'm running a nested grid through South Texas and keep getting segmentation faults a few hours into the forecast. It's a 4 km domain with one nest, initialized from the NAM, with 30-minute and 15-minute output intervals.

I get this in the dump that I archive from the run...

! Possible problem as system return code was 255

! Your WRF simulation (PID 7624) returned an exit status of 255, which is never good.

System Signal Code (SN) : 127 (Unknown Signal) with Core Dump

! While perusing the log/run_wrfm.log file I determined the following:

It appears that your run failed due to a Segmentation Fault on your
system. This failure typically occurs when the EMS attempts to access a
region of memory that has not been allocated. Most often, segmentation
faults are due to an array bounds error or accessing memory through a NULL
pointer. Either way, this is an issue that needs to be corrected by the
developer.

So, if you want this problem fixed send your log files along with the
namelist.wrfm and namelist.wps files to Robert.Rozumalski@noaa.gov, just
because he cares.


! Here are the last few lines from the run_wrfm.log file:

= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@n001.bw01.calpine.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@n001.bw01.calpine.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@n001.bw01.calpine.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:5@n006.bw01.calpine.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:5@n006.bw01.calpine.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:5@n006.bw01.calpine.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@n003.bw01.calpine.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:2@n003.bw01.calpine.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@n003.bw01.calpine.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:3@n004.bw01.calpine.com] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:3@n004.bw01.calpine.com] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:3@n004.bw01.calpine.com] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@calpine-wx.bw01.calpine.com] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@calpine-wx.bw01.calpine.com] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@calpine-wx.bw01.calpine.com] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@calpine-wx.bw01.calpine.com] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion



The namelist file is below, and I'm using the windfarm option. I'm stumped because I can run this at home on a PC, but on a 250+ core cluster it blows up. Any thoughts?

&time_control
start_year = 2014, 2014
start_month = 04, 04
start_day = 17, 17
start_hour = 00, 00
start_minute = 00, 00
start_second = 00, 00
end_year = 2014, 2014
end_month = 04, 04
end_day = 18, 18
end_hour = 12, 12
end_minute = 00, 00
end_second = 00, 00
interval_seconds = 10800
input_from_file = T, T
history_interval = 30, 15
history_outname = "wrfout_d<domain>_<date>"
frames_per_outfile = 1, 1
io_form_history = 2
io_form_input = 2
io_form_restart = 2
io_form_boundary = 2
io_form_auxinput2 = 2
restart = F
restart_interval = 4320
auxhist1_outname = "auxhist1_d<domain>_<date>"
auxhist1_interval = 0, 0
frames_per_auxhist1 = 1, 1
io_form_auxhist1 = 2
auxhist2_outname = "auxhist2_d<domain>_<date>"
auxhist2_interval = 0, 0
output_diagnostics = 0
auxhist3_outname = "wrfxtrm_d<domain>_<date>"
auxhist3_interval = 0, 0
frames_per_auxhist2 = 1, 1
io_form_auxhist2 = 2
auxinput4_inname = "wrflowinp_d<domain>"
auxinput4_interval = 360, 360
io_form_auxinput4 = 2
fine_input_stream = 0, 2
adjust_output_times = T
reset_simulation_start = F
cycling = F
iofields_filename = "my_iofields_list.txt"
ignore_iofields_warning = T
diag_print = 0
debug_level = 0
/

&domains
time_step = 20
time_step_fract_num = 0
time_step_fract_den = 10
time_step_dfi = 60
max_dom = 2
s_we = 1, 1
e_we = 150, 136
s_sn = 1, 1
e_sn = 150, 136
s_vert = 1, 1
e_vert = 40, 40
dx = 4000.0000, 1333.3334
dy = 4000.0000, 1333.3334
grid_id = 1, 2
parent_id = 1, 1
i_parent_start = 1, 53
j_parent_start = 1, 53
parent_grid_ratio = 1, 3
parent_time_step_ratio = 1, 3
feedback = 0
smooth_option = 0
grid_allowed = T, T
max_dz = 1000.
numtiles = 1
nproc_x = -1
nproc_y = -1
num_metgrid_soil_levels = 4
num_metgrid_levels = 40
interp_type = 2
extrap_type = 2
t_extrap_type = 2
use_levels_below_ground = T
use_surface = T
lagrange_order = 1
zap_close_levels = 500
lowest_lev_from_sfc = F
force_sfc_in_vinterp = 1
sfcp_to_sfcp = F
smooth_cg_topo = F
use_tavg_for_tsk = F
aggregate_lu = F
rh2qv_wrt_liquid = T
rh2qv_method = 1
p_top_requested = 5000
vert_refine_fact = 1
use_adaptive_time_step = F
/

&dfi_control
dfi_opt = 0
/

&physics
cu_physics = 0, 0
cudt = 0, 5
mp_physics = 2, 2
mp_zero_out = 0
mp_zero_out_thresh = 1.e-8
mp_tend_lim = 10.
no_mp_heating = 0
do_radar_ref = 1
shcu_physics = 0, 0
bl_pbl_physics = 5, 5
bldt = 0, 0
grav_settling = 0, 0
topo_wind = 0, 0
mfshconv = 0
sf_sfclay_physics = 5, 5
sf_surface_physics = 2, 2
num_land_cat = 24
num_soil_cat = 16
num_soil_layers = 4
surface_input_source = 1
rdmaxalb = T
rdlai2d = F
tmn_update = 0
sf_urban_physics = 0, 0
ra_lw_physics = 1, 1
ra_sw_physics = 1, 1
radt = 4, 4
ra_call_offset = 0
swrad_scat = 1
slope_rad = 0, 0
topo_shading = 0, 0
icloud = 1
co2tf = 1
sst_skin = 1
sst_update = 0
seaice_threshold = 271
fractional_seaice = 0
prec_acc_dt = 0, 0.
bucket_mm = -1
bucket_j = -1
windturbines_spec = "windfarm.input"
/

&noah_mp
/

&dynamics
non_hydrostatic = T, T
gwd_opt = 0
rk_ord = 3
h_mom_adv_order = 5, 5
h_sca_adv_order = 5, 5
v_mom_adv_order = 3, 3
v_sca_adv_order = 3, 3
moist_adv_opt = 1, 1
moist_adv_dfi_opt = 0
scalar_adv_opt = 1, 1
momentum_adv_opt = 1, 1
chem_adv_opt = 1, 1
tke_adv_opt = 1, 1
diff_opt = 1
km_opt = 4
km_opt_dfi = 1
w_damping = 1
diff_6th_opt = 2, 2
diff_6th_factor = 0.12, 0.12
damp_opt = 0
zdamp = 5000., 5000.
dampcoef = 0.2, 0.2
khdif = 0, 0
kvdif = 0, 0
time_step_sound = 0, 0
do_avgflx_em = 0, 0
do_avgflx_cugd = 0, 0
smdiv = 0.1, 0.1
emdiv = 0.01, 0.01
epssm = 0.1, 0.1
top_lid = F, F
mix_isotropic = 0, 0
mix_upper_bound = 0.1, 0.1
rotated_pole = F
tke_upper_bound = 1000., 1000.
sfs_opt = 0, 0
m_opt = 0, 0
iso_temp = 0.
tracer_opt = 0, 0
tracer_adv_opt = 0, 0
/

&scm
scm_force = 0
scm_force_dx = 4000.
num_force_layers = 8
scm_lu_index = 2
scm_isltyp = 4
scm_vegfra = 0.5
scm_canwat = 0.0
scm_lat = 37.600
scm_lon = -96.700
scm_th_adv = .true.
scm_wind_adv = .true.
scm_qv_adv = .true.
scm_vert_adv = .true.
/

&fdda
grid_fdda = 0
/

&tc
insert_bogus_storm = F
remove_storm = F
num_storm = 1
latc_loc = -999.
lonc_loc = -999.
vmax_meters_per_second = -999.
rmax = -999.
vmax_ratio = -999.
/

&fire
/

&bdy_control
spec_bdy_width = 5
spec_zone = 1
relax_zone = 4
spec_exp = 0
specified = T, F
nested = F, T
/

&grib2
/

&namelist_quilt
nio_tasks_per_group = 0
nio_groups = 1
/

smartie
Posts: 97
Joined: Sat May 21, 2011 7:34 am

Re: Segmentation Fault

Post by smartie » Sat Apr 19, 2014 9:14 am

I'm not familiar with the wind turbine options or the exact nature of the errors you're seeing (they're related to MPICH). However, I see you're using the MYNN PBL and surface-layer physics (option 5). I've found this to be a superior PBL scheme, and in tests it outperforms e.g. MYJ against surface obs. It does seem to require more memory, though, and in some setups seems to be prone to failure. I'd change this first and see if an MYJ run goes to completion. If MYNN is the problem, then I'd change some of the other physics too; maybe try a combination like the operational HRRR: Thompson MP, MYNN PBL, etc.

I also see your inner nest is pretty small, which can lead to problems, but no doubt you're aware of that. I'd probably run a cumulus scheme on the 4 km mother domain as well; the HRRR uses Grell, IIRC.
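In case it helps, here's a minimal sketch of that physics swap as &physics namelist entries. I'm going from memory on the option numbers (MYJ PBL and the Eta/MYJ surface layer should both be option 2, Thompson microphysics option 8, Grell 3D option 5), so double-check them against your WRF version. The first two lines are the MYJ test; the last two are only if you want to move toward an HRRR-like combination:

&physics
bl_pbl_physics = 2, 2 ! MYJ PBL in place of MYNN (5, 5)
sf_sfclay_physics = 2, 2 ! Eta/MYJ surface layer to match the PBL
mp_physics = 8, 8 ! Thompson MP, as in the HRRR (optional for the first test)
cu_physics = 5, 0 ! Grell 3D on the 4 km parent only (optional)
/

Keep the rest of your &physics block as it is. If the MYJ run completes, that points at MYNN (or its memory footprint); if it still seg-faults, the problem is probably elsewhere.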

