How to run UEMS/WRF v19 (19.7.2) on a cluster with Slurm, step by step

ramousta
Posts: 8
Joined: Fri Jun 28, 2019 4:29 pm

How to run UEMS/WRF v19 (19.7.2) on a cluster with Slurm, step by step

Post by ramousta » Wed Jul 31, 2019 4:10 am

First of all, I will explain the big picture:
1. Geogrid will be run via one sbatch script by ems_domain.
2. Ungrib, avgtsfc and metgrid will be run sequentially by ems_prep in the same sbatch script, since they depend on each other.
3. Real and wrf will be run sequentially by ems_run in the same sbatch script, since they depend on each other.
4. When we execute ems_domain, ems_prep and ems_run, the namelist files and the scripts containing links to the required data will be created and saved automatically in the sbatch directory (uems/runs/domain_name/sbatch). In addition, the sbatch script will be generated and sent to the queue.
5. The sbatch script will use the files and scripts generated in step 4.
6. The files that need to be modified are:
  • UEMSwrkstn located in uems/etc/modulefiles/
  • template_uemsBsub.parallel located in uems/data/tables/lsf/
  • template_uemsBsub.serial located in uems/data/tables/lsf/
  • Elsf.pm located in uems/strc/Uutils/
  • Dprocess.pm located in uems/strc/Udomain/
  • Pungrib.pm located in uems/strc/Uprep/
  • Pinterp.pm located in uems/strc/Uprep/
  • Rexe.pm located in uems/strc/Urun/
Let’s begin:
I. ********************* UEMSwrkstn located in uems/etc/modulefiles/UEMSwrkstn **********************
Edit the file and change:
1. the batch-system flags and the MAXIMUM number of nodes for each step:
  • setenv LSF_SYS 1
    # LSF_MANAGED_RESOURCES
    setenv LSF_MANAGED_RESOURCES 1
    #MAXIMUM number of nodes
    setenv LSF_NODEMAX_DOMAIN 1
    setenv LSF_NODEMAX_PREP 1
    setenv LSF_NODEMAX_RUN 8
    setenv LSF_NODEMAX_POST 2
2. the name of the partition to use (defq, in my case):
  • setenv LSF_QUEUE_SHARED defq
    setenv LSF_QUEUE_EXCLUSIVE defq
3. the command to execute sbatch script:
  • setenv LSF_BSUB_COMMAND "sbatch"
4. create a ~/.modulerc directory and copy the file into it (~/ means your home directory, here /home/rachid/):
  • mkdir ~/.modulerc
    cp uems/etc/modulefiles/UEMSwrkstn ~/.modulerc
    chmod 755 ~/.modulerc/UEMSwrkstn
5. create a script that initializes the environment-modules package:
gedit modules.csh
Copy and paste these lines into the file:

Code: Select all

#-------------------------------------------------------#
# system-wide csh.modules
# Initialize modules for all csh-derivative shells
#-------------------------------------------------------#
  if ($?tcsh) then
	  source /cm/local/apps/environment-modules/4.0.0/Modules/default/init/tcsh
  else
	  source /cm/local/apps/environment-modules/4.0.0/Modules/default/init/csh
  endif
#-------------------------------------------------------#
6. create a ~/.sh/ directory and move the script into it:
  • mkdir -p ~/.sh
    chmod 755 modules.csh
    mv modules.csh ~/.sh/
7. Add these lines to .cshrc file to load the module automatically:
  • #UEMSwrkstn Module
    source ~/.sh/modules.csh
    module load ~/.modulerc/UEMSwrkstn
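Before going further, it can save debugging time to verify that the environment the module is supposed to provide is actually in place. This is only a sketch: the export line stands in for what "module load ~/.modulerc/UEMSwrkstn" sets, so the script runs anywhere.

Code: Select all

```shell
#!/bin/bash
# Sanity check for the UEMSwrkstn settings. On the cluster the variables
# come from the module; they are exported here only to keep the sketch
# self-contained.
export LSF_SYS=1 LSF_BSUB_COMMAND="sbatch" LSF_QUEUE_SHARED="defq"

missing=0
for var in LSF_SYS LSF_BSUB_COMMAND LSF_QUEUE_SHARED; do
    if [ -z "${!var}" ]; then
        echo "ERROR: $var is not set -- did the UEMSwrkstn module load?"
        missing=1
    fi
done
[ "$missing" -eq 0 ] && echo "UEMSwrkstn environment looks OK"
```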

II. ******************** template_uemsBsub.parallel located in uems/data/tables/lsf/******************

1. Edit the file and replace all the contents by:

Code: Select all

#!/bin/bash
#  Note: Lines beginning with '#' (comments)  are ignored when writing the
#        final batchjob file. Additionally, this template file was designed
#        for use on the NWS WCOSS and may need to be changed for your system.
#
#  COMMENTS_START - Start of Comments section (not required)
#  COMMENTS_STOP  - End of Comments section (not required)
#  
#  BSUBS_START    - Start of the BSUB section (required)
#  BSUBS_STOP     - End of the BSUB section (required)
#
#  UEMSCWD        - Working directory - just in case
#  UEMSEXEC       - The location for the command (string) to be executed (required)
#
#  Everything else in this file is left intact in the same relative location
#
COMMENTS_START
COMMENTS_STOP

BSUBS_START
BSUBS_STOP
module load cuda91/toolkit/9.1.85
source /cm/shared/apps/intel/ips_2017/bin/compilervars.sh -arch intel64
source /cm/shared/apps/gromacs/intel/64/2018.3/bin/GMXRC

export LANG=en_US

UEMSCWD
sleep 10 
ADD_1
ADD_2
SECSS=`date +%s`
mpirun -np SLURM_NTASKS  REALEXE  
sleep 11
ADD_3
mpirun -np SLURM_NTASKS  UEMSEXEC 
sleep 10
ADD_4
ADD_5
SECSE=`date +%s`
TSECS=$((SECSE-SECSS))
echo "UEMS parallel job completed in $TSECS seconds ($?)"
2. Save and exit.
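As an aside, the SECSS/SECSE pair in the template is the usual epoch-seconds timing idiom, and it is easy to try on its own. In this sketch the sleep stands in for the mpirun commands:

Code: Select all

```shell
#!/bin/bash
# The timing idiom from the template: record epoch seconds before and
# after the work, then report the difference. 'sleep 2' stands in for
# the mpirun commands.
SECSS=$(date +%s)
sleep 2
SECSE=$(date +%s)
TSECS=$((SECSE-SECSS))
echo "UEMS parallel job completed in $TSECS seconds ($?)"
```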

III. ******************** template_uemsBsub.serial located in uems/data/tables/lsf/******************
1. Edit the file and replace all the contents by:

Code: Select all

#!/bin/bash
#  Note: Lines beginning with '#' (comments)  are ignored when writing the
#        final batchjob file. Additionally, this template file was designed
#        for use on the NWS WCOSS and may need to be changed for your system.
#
#  COMMENTS_START - Start of Comments section (not required)
#  COMMENTS_STOP  - End of Comments section (not required)
#  
#  BSUBS_START    - Start of the BSUB section (required)
#  BSUBS_STOP     - End of the BSUB section (required)
#
#  UEMSCWD        - Working directory - just in case
#  UEMSEXEC       - The location for the command (string) to be executed (required)
#
#  Everything else in this file is left intact in the same relative location
#
COMMENTS_START
COMMENTS_STOP

BSUBS_START
BSUBS_STOP


export LANG=en_US

UEMSCWD
sleep 2 
ADD1
ADD2

SECSS=`date +%s`
UEMSEXEC
sleep 2 
ADD_1
ADD_2

SECSE=`date +%s`
TSECS=$((SECSE-SECSS))

echo "UEMS serial job completed in $TSECS seconds ($?)"
2. Save and exit.

IV. ******************** Elsf.pm located in uems/strc/Uutils/******************
1. We don’t want to create an sbatch file for avgtsfc, metgrid, or real_arw.exe, so replace (around line 222):

Code: Select all

   return %lsf if $lsf{error} = &WriteBatchjobFile(\%lsf);
    
    $lsf{uemsexec}  = "$ENV{LSF_BSUB_COMMAND} < $lsf{batchfile}";
By:

Code: Select all

#cra
   if ($lsf{procexe} ne "$ENV{EMS_BIN}/avgtsfc" and $lsf{procexe} ne "$ENV{EMS_BIN}/metgrid" and $lsf{procexe} ne "$ENV{EMS_BIN}/real_arw.exe > log/run_real1.log"){
    return %lsf if $lsf{error} = &WriteBatchjobFile(\%lsf);
    
    $lsf{uemsexec}  = "$ENV{LSF_BSUB_COMMAND} < $lsf{batchfile}";
}# No batch file for  avgtsfc/metgrid and real
2. Replace the whole sub WriteBatchjobFile with this one (from line 235 to the end):

Code: Select all

sub WriteBatchjobFile {
#  ===================================================================================
#    Write the batchjob file to be submitted to the scheduler on an LSF system. A
#    template file for either a serial or an mpi batch job is read and then written
#    to $lsf{batchfile} with the appropriate information. The input hash containing 
#    the necessary variables is created in &ConfigureProcessLSF.
#  ===================================================================================
#
    my $date    = `date -u`; chomp $date;
    my @lines   = ('#!/bin/bash');

    my $lref = shift;  my %lsf = %$lref;  #  Created in &ConfigureProcessLSF


    #----------------------------------------------------------------------------------
    #  Specify the template file to use
    #----------------------------------------------------------------------------------
    #
    my $template = $lsf{serial} ? 'template_uemsBsub.serial' : 'template_uemsBsub.parallel';
       $template = $lsf{template} if $lsf{template};  #  Allow for user to specify template

    my $temppath = "$ENV{EMS_DATA}/tables/lsf/$template";
    my $bjtype   = $lsf{serial} ? 'serial' : 'mpi';

    #----------------------------------------------------------------------------------
    #  Create the comments block
    #----------------------------------------------------------------------------------
    #
    my @comnts = ();
    push @comnts, '----------------------------------------------------------------------------------';
    push @comnts, "  This $bjtype UEMS batch job file is used to run the $lsf{process} routine.";
    push @comnts, "  All routines are compiled as static executables.";
    push @comnts, "";
    push @comnts, "  UEMS:     $lsf{uemsver}";
    push @comnts, "  CREATED:  $date";
    push @comnts, "  TEMPLATE: $template";
    push @comnts, "  AUTHOR:   Robert.Rozumalski\@noaa.gov";
    push @comnts, '----------------------------------------------------------------------------------';
    push @comnts, "";
    $_ = "## $_" foreach @comnts;

    #----------------------------------------------------------------------------------
    #  Create the BSUB block. The order is not important but some entries are 
    #  restricted to either serial or parallel jobs.
    #----------------------------------------------------------------------------------
    #

    my @bsubs = ();

    push @bsubs, "#SBATCH --comment= $lsf{prjcode}" if $lsf{prjcode};
    push @bsubs, "#SBATCH --partition=$lsf{queue}"   if $lsf{queue};
    push @bsubs, "#SBATCH -J $lsf{jobname}" if $lsf{jobname};

    if ($lsf{serial}) {
### -- specify that we need 2GB of memory per core/slot -- 
        push @bsubs, "#SBATCH --mem-per-cpu=$lsf{reqmem}";
### -- specify affinity: number of "cores" for each "process" --
        push @bsubs, "#SBATCH --ntasks=1 --ntasks-per-node=$lsf{cpnode}";      
    } else {
##Not required in sbatch    
##        push @bsubs, "#BSUB -a intelmpi";
       push @bsubs, "#SBATCH --ntasks=$lsf{ncores}";     # Number of MPI ranks
       push @bsubs, "#SBATCH --ntasks-per-core=1";  # How many tasks on each node
#request exclusive node allocation:only makes sure that there will be no other jobs running on your nodes.          
        push @bsubs, "#SBATCH --exclusive";
    }
#wall-clock time limit    
    push @bsubs, "#SBATCH --time=23:50:00"                if $lsf{wtime};
    push @bsubs, "#SBATCH -o $lsf{logstd}"                 if $lsf{logstd};
    push @bsubs, "#SBATCH -e $lsf{logerr}"                 if $lsf{logerr};


    #----------------------------------------------------------------------------------
    #  Create the batchjob file with the information and comments
    #----------------------------------------------------------------------------------
    #
    open (my $ifh, '<', $temppath) || return "Error: Can't open template file for reading - $temppath";

    while (<$ifh>) {
        chomp; s/^\s*//g; s/\s+$//g;
        next if /^#/;
        next if /_STOP/i;
        if (/COMMENTS_START/i) {@lines = (@lines,@comnts); next;}
        if (/BSUBS_START/i)    {@lines = (@lines,@bsubs);  next;}
        s/UEMSCWD/cd $lsf{workdir}/g;
        
    if ($lsf{procexe} eq "$ENV{EMS_BIN}/geogrid") {
        s{ADD_1}{./sbatch/link_geogrid.csh};
        s{ADD_2}{rm -fr sbatch/slurm-geogrid.out};
        s{mpirun -np SLURM_NTASKS  REALEXE}{};
        s{sleep 11}{};
        s{ADD_3}{}; 
        s{ADD_4}{mv slurm* sbatch/slurm-geogrid.out}; 
        s{ADD_5}{mkdir -p log/geo/;rm -f log/geo/* static/*.TBL namelist.wps;mv geogrid.log* log/geo/}; 
        s{SLURM_NTASKS}{$lsf{ncores}}; 
        
        s{UEMSEXEC}{$lsf{procexe} > log/run_geogrid1.log}g;
        
    } elsif ($lsf{procexe} eq "$ENV{EMS_BIN}/ungrib") {
        s{ADD1}{./sbatch/link_ungrib.csh};
        s{ADD2}{rm -fr sbatch/slurm-ungrib.out};
        s{ADD_1}{mv ungrib.log log/run_ungrib2.log;./sbatch/link_metgrid1.csh};
        s{ADD_2}{./sbatch/link_metgrid2.csh}; 
        
        s{UEMSEXEC}{$lsf{procexe} > log/run_ungrib1.log}g;
        
    } elsif ($lsf{procexe} eq "$ENV{EMS_BIN}/wrfm_arw.exe > log/run_wrfm1.log") {
        s{ADD_1}{./sbatch/link_real.csh;};
        s{ADD_2}{rm -f sbatch/slurm-wrf.out};
        s{REALEXE}{$ENV{EMS_BIN}/real_arw.exe > log/run_real1.log};
        s{ADD_3}{./sbatch/link_wrf.csh; mkdir -p log/real/;rm -f log/real/*;mv rsl.* log/real/}; 
        s{ADD_4}{mkdir -p log/wrf/;rm -f log/wrf/*;mv rsl.* log/wrf/;mv slurm* sbatch/slurm-wrf.out}; 
        s{ADD_5}{./sbatch/link_rm.csh}; 
        s{SLURM_NTASKS}{$lsf{ncores}}; 
        
        s/UEMSEXEC/$lsf{procexe}/g;
        
}

        push @lines, $_;
    }  close $ifh;

    #----------------------------------------------------------------------------------
    #  Write the batchjob file out to $lsf{batchfile}
    #----------------------------------------------------------------------------------
    #
#cra    
    
    open (my $ofh, '>', $lsf{batchfile}) || return "Error: Can't open batchjob file for writing - $lsf{batchfile}";
    print $ofh "$_\n" foreach @lines;
    close $ofh;
#cra
    my $stat;
    $stat = `cp -fr $lsf{batchfile} $lsf{workdir}/sbatch/`;
    $stat = `chmod 755 $lsf{workdir}/sbatch/*`;

    return "Error: Problem while writing to batchjob file - $lsf{batchfile}" unless -s $lsf{batchfile};

return '';
}
3. Save and exit.
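To see what the Slurm translation of the old #BSUB block looks like, you can write out a header like the one this sub emits for a parallel job and count the directives. All the values below are hypothetical examples, not output from UEMS:

Code: Select all

```shell
#!/bin/bash
# Sketch of the #SBATCH block that the modified WriteBatchjobFile emits
# for a parallel (non-serial) job. All values are hypothetical examples.
cat > uemsBsub.header <<'EOF'
#SBATCH --partition=defq
#SBATCH -J UEMS-geogrid
#SBATCH --ntasks=20
#SBATCH --ntasks-per-core=1
#SBATCH --exclusive
#SBATCH --time=23:50:00
EOF
ndirectives=$(grep -c '^#SBATCH' uemsBsub.header)
echo "$ndirectives directives"   # prints: 6 directives
```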
V. ********************* Dprocess.pm located in uems/strc/Udomain/ ******************

1. Around line 1882, the code creates the symbolic links for the geogrid table and the namelist.
After these lines

Code: Select all

    symlink $geotbl => 'static/GEOGRID.TBL';
    symlink $wpsnl  => 'namelist.wps';
Add this

Code: Select all

       system "mkdir -p sbatch"; #create sbatch directory if doesn't exist
       open(my $FILE,">sbatch/link_geogrid.csh") || die "Cannot create sbatch/link_geogrid.csh for writing: $!";
       print $FILE "#!/bin/csh -f \n";
       my $lcom;
       $lcom="ln -sf $geotbl static/GEOGRID.TBL" ; 
       print $FILE "$lcom  \n";   
       $lcom="ln -sf $wpsnl namelist.wps" ; 
       print $FILE "$lcom  \n";   
       print $FILE "echo '**** run geogrid ****' \n";
       close($FILE);
       system "chmod 755 sbatch/link_geogrid.csh";
2. Save and exit.
VI. ******************** Pungrib.pm located in uems/strc/Uprep/ ******************
1. It creates a script file that creates the symbolic links (for the grib files and the Vtable).
2. It copies the namelist to sbatch/namelist.wps.ungrib and updates the script file.
Replace (from line 238 to 278)

Code: Select all

#-------------------------------------------------------------------------------------
        #  Save the bc data frequency information for metgrid
        #-------------------------------------------------------------------------------------
        #
        $masternl{SHARE}{interval_seconds}[0] = $intrs if $dstruct->useid == 1 or $dstruct->useid == 3;

        
        if (@{$dstruct->gribs}) { #  Process GRIB files unless WRF intermediate files are being used

            #-------------------------------------------------------------------------------------
            #  Begin the pre "running of the ungrib" festivities by creating a link "GRIBFILE.???"
            #  to each of the GRIB files, followed by the ceremonial "writing of the namelist 
            #  file."  A good time will be had by all.
            #-------------------------------------------------------------------------------------
            #
            my $ID = 'AAA';
            foreach my $grib (sort @{$dstruct->gribs}) {$grib = File::Spec->abs2rel($grib); symlink $grib, "$emsrun{dompath}/GRIBFILE.$ID"; $ID++;}
    
            symlink $dstruct->useid == 4 ? File::Spec->abs2rel($dstruct->lvtable) : File::Spec->abs2rel($dstruct->vtable) => 'Vtable';
    
    
            #-------------------------------------------------------------------------------------
            #  The namelist should be written to the top level of the domain directory
            #  and not /static; otherwise you will loose important information.
            #-------------------------------------------------------------------------------------
            #
            open (my $lfh, '>', $wpsnl);
            print $lfh  "\&share\n",
                         "  start_date = \'$start\'\n",
                         "  end_date   = \'$stop\'\n",
                         "  interval_seconds = $intrs\n",
                         "  debug_level = 0\n",
                         "\/\n\n",

                         "\&ungrib\n",
                         "  out_format = \'WPS\'\n",
                         "  prefix     = \'./$wpsprd/$ucdset\'\n",
                         "\/\n\n"; 
            close $lfh;
By

Code: Select all

#-------------------------------------------------------------------------------------
        #  Save the bc data frequency information for metgrid
        #-------------------------------------------------------------------------------------
        #
        $masternl{SHARE}{interval_seconds}[0] = $intrs if $dstruct->useid == 1 or $dstruct->useid == 3;

#Cra line 245 create a script file that creates symbolic links
       system "mkdir -p sbatch"; #create sbatch directory if doesn't exist
       open(my $FILE,">sbatch/link_ungrib.csh") || die "Cannot create sbatch/link_ungrib.csh for writing: $!";
       print $FILE "#!/bin/csh -f \n";
       my $lcom;
       my $vtab;
        
        if (@{$dstruct->gribs}) { #  Process GRIB files unless WRF intermediate files are being used

            #-------------------------------------------------------------------------------------
            #  Begin the pre "running of the ungrib" festivities by creating a link "GRIBFILE.???"
            #  to each of the GRIB files, followed by the ceremonial "writing of the namelist 
            #  file."  A good time will be had by all.
            #-------------------------------------------------------------------------------------
            #
#            my $ID = 'AAA';
#            foreach my $grib (sort @{$dstruct->gribs}) {$grib = File::Spec->abs2rel($grib); symlink $grib, "$emsrun{dompath}/GRIBFILE.$ID"; $ID++;}
    
#            symlink $dstruct->useid == 4 ? File::Spec->abs2rel($dstruct->lvtable) : File::Spec->abs2rel($dstruct->vtable) => 'Vtable';
            my $ID = 'AAA';
            foreach my $grib (sort @{$dstruct->gribs}) {$grib = File::Spec->abs2rel($grib); symlink $grib, "$emsrun{dompath}/GRIBFILE.$ID"; $lcom="ln -sf $grib $emsrun{dompath}/GRIBFILE.$ID" ; print $FILE "$lcom  \n"; $ID++;}
    
            symlink $dstruct->useid == 4 ? File::Spec->abs2rel($dstruct->lvtable) : File::Spec->abs2rel($dstruct->vtable) => 'Vtable';
#cra
          if($dstruct->useid == 4){
           $vtab=File::Spec->abs2rel($dstruct->lvtable);
           }else{
           $vtab=File::Spec->abs2rel($dstruct->vtable);
           }
            $lcom="ln -sf $vtab Vtable" ; print $FILE "$lcom  \n";   
             
            #-------------------------------------------------------------------------------------
            #  The namelist should be written to the top level of the domain directory
            #  and not /static; otherwise you will loose important information.
            #-------------------------------------------------------------------------------------
            #
            open (my $lfh, '>', $wpsnl);
            print $lfh  "\&share\n",
                         "  start_date = \'$start\'\n",
                         "  end_date   = \'$stop\'\n",
                         "  interval_seconds = $intrs\n",
                         "  debug_level = 0\n",
                         "\/\n\n",

                         "\&ungrib\n",
                         "  out_format = \'WPS\'\n",
                         "  prefix     = \'./$wpsprd/$ucdset\'\n",
                         "\/\n\n"; 
            close $lfh;

#cra a line 279 Pungrib.pm
    my $stat;
    $stat = `cp -fr namelist.wps $emsrun{dompath}/sbatch/namelist.wps.ungrib`;
#Cra line 282 update script file 
     print $FILE "cp sbatch/namelist.wps.ungrib  namelist.wps \n";
     print $FILE "echo '**** run ungrib ****' \n";
     close($FILE);
     system "chmod 755 sbatch/link_ungrib.csh";

3. Save and exit.
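For reference, the GRIBFILE.AAA, GRIBFILE.AAB, ... naming that both the Perl loop and the generated link_ungrib.csh reproduce can be sketched like this. The grib file names are hypothetical, and Perl's string increment ($ID++ turning AAA into AAB, AAC, ...) is emulated here with a fixed suffix list:

Code: Select all

```shell
#!/bin/bash
# Sketch of the WPS GRIBFILE.??? link convention created for ungrib.
# Hypothetical grib file names; the suffix list stands in for Perl's
# magic string increment.
mkdir -p gribdemo
touch gribdemo/gfs_f000.grb2 gribdemo/gfs_f003.grb2 gribdemo/gfs_f006.grb2

suffixes=(AAA AAB AAC)
i=0
for grib in gribdemo/gfs_*.grb2; do
    ln -sf "$(basename "$grib")" "gribdemo/GRIBFILE.${suffixes[$i]}"
    i=$((i+1))
done
ls gribdemo/GRIBFILE.*   # lists GRIBFILE.AAA, .AAB and .AAC
```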
VII. ******************** Pinterp.pm located in uems/strc/Uprep/ ******************
1. It copies the namelist to sbatch/namelist.wps.metgrib.
2. It creates two script files that restore the namelist file, create symbolic links for the METGRID table, move TAVGSFC to wpsprd/, and add the command to execute avgtsfc if we call it beforehand.
3. Add these lines around line 157, where the namelist is updated:

Code: Select all

#cra 
    my $stat;
    $stat = `cp -fr namelist.wps $emsrun{dompath}/sbatch/namelist.wps.metgrib`;
#Cra line 156 create links to METGRID table and  TAVGSFC 
     open(my $FILE1,">sbatch/link_metgrid1.csh") || die "Cannot create sbatch/link_metgrid1.csh for writing: $!";
     print $FILE1 "#!/bin/csh -f \n";
     print $FILE1 "sleep 2 \n";
     print $FILE1 "cp sbatch/namelist.wps.metgrib  namelist.wps \n";
#Cra
     open(my $FILE,">sbatch/link_metgrid2.csh") || die "Cannot create sbatch/link_metgrid2.csh for writing: $!";
     print $FILE "#!/bin/csh -f \n";
4. In line 217, before

Code: Select all

#----------------------------------------------------------------------------------
        #  Start the process, or not
        #----------------------------------------------------------------------------------
        #
        &Ecomm::PrintMessage(1,11+$Uprep{arf},144,1,0,"Calculating mean surface temperatures for missing water values - ");
add these lines. We need to update static/namelist.wps for sbatch; otherwise we get the error "there is no initialization file that matches the requested simulation start date" when we run ems_run:

Code: Select all

    if ($ENV{LSF_SYS}==1) {
    #----------------------------------------------------------------------------------
    #  While the information is fresh, update the static/namelist.wps file.
    #---------------------------------------------------------------------------------- 
    #
    if (&Others::Hash2Namelist($emsrun{wpsnl},"$ENV{EMS_DATA}/tables/wps/namelist.wps",%masternl) ) {
        my $file = "$ENV{EMS_DATA}/tables/wps/namelist.wps";
        $ENV{PMESG} = &Ecomm::TextFormat(0,0,84,0,0,'The Hash2Namelist routine',"BUMMER: Problem writing $file");
        return 1;
    }
} #if ($ENV{LSF_SYS}==1)
5. Around line 248, inside unless ($conf{noaltsst}), we need to add “if ($ENV{LSF_SYS} eq 0){” before “if (my $status = &Ecore::SysExecute($Proc{uemsexec}, $tlog2)) {”; otherwise we get an error message and the program stops. We don’t want UEMS to run the requested process and check for errors, because the sbatch job is in the queue and takes some time to run.
Replace:

Code: Select all

  if (my $status = &Ecore::SysExecute($Proc{uemsexec}, $tlog2)) {
By

Code: Select all

if ($ENV{LSF_SYS} eq 0){
        if (my $status = &Ecore::SysExecute($Proc{uemsexec}, $tlog2)) {
and don’t forget to close the if block around line 334, just before “&Ecomm::PrintMessage(0,0,96,0,1,"Success");”:

Code: Select all

}#if ($ENV{LSF_SYS} eq 0)
        &Ecomm::PrintMessage(0,0,96,0,1,"Success");     
6. Around line 337, after

Code: Select all

        &Ecomm::PrintMessage(0,0,96,0,1,"Success");
        system "mv $emsrun{dompath}/TAVGSFC $emsrun{wpsprd}";
Add these lines:

Code: Select all

#Cra ($conf{noaltsst})
        print $FILE1 "echo '**** run avgtsfc ****' \n";
        print $FILE1 "$Puems{procexe} > log/run_avgtsfc1.log \n";
        print $FILE1 "mv $emsrun{dompath}/TAVGSFC $emsrun{wpsprd} \n";
        print $FILE1 "mv logfile.log log/run_avgtsfc2.log  \n";
7. Around line 348, before

Code: Select all

#==================================================================================
    #  Time to run the metgrid routine, which is the first time we have to account
    #  for the number of CPUs allocated for use with the UEMS. The primary purpose
    #  for this bit of caution is to ensure that the domain is not over-decomposed,
    #  which may lead to a segmentation fault floating point error. There is another
    #  potential issue related to IO.
    #==================================================================================
Add this:

Code: Select all

#Cra close sbatch/link_metgrid1.csh
     close($FILE1);
     system "chmod 755 sbatch/link_metgrid1.csh";
#==================================================================================
    #  Time to run the metgrid routine, which is the first time we have to account
    #  for the number of CPUs allocated for use with the UEMS. The primary purpose
    #  for this bit of caution is to ensure that the domain is not over-decomposed,
    #  which may lead to a segmentation fault floating point error. There is another
    #  potential issue related to IO.
    #==================================================================================
8. Around line 364, after

Code: Select all

#==================================================================================
    #

    #  Create a link from the default metgrid tables to the local directory
    #
    &Others::rm('METGRID.TBL'); symlink $metgtbl => 'METGRID.TBL';
Add this:

Code: Select all

#Cra Create a link from the default metgrid tables to the local directory
     print $FILE "sleep 2 \n";
     print $FILE "ln -sf $metgtbl METGRID.TBL \n";
9. Around line 474, replace

Code: Select all

if (my $status = &Ecore::SysExecute($Proc{uemsexec}, $mlog2)) {
By

Code: Select all

#Cra 
        print $FILE "echo '**** run metgrid ****' \n";
        print $FILE "$Puems{procexe} > log/run_metgrid1.log \n";
        print $FILE "rm -f GRIBFILE* Vtable  logfile.log namelist.wps wpsprd/GFS* *.TBL wpsprd/TAVGSFC \n";
        print $FILE "mv slurm* sbatch/slurm-ungrib.out \n";
        print $FILE "mv metgrid.log log/run_metgrid2.log \n";

     close($FILE);
     system "chmod 755 sbatch/link_metgrid2.csh";

#Cra Don't do it for $ENV{LSF_SYS}=1, otherwise we get error message
    if ($ENV{LSF_SYS} eq 0){
    if (my $status = &Ecore::SysExecute($Proc{uemsexec}, $mlog2)) {
This updates link_metgrid2.csh. Again, we don’t want UEMS to run the requested process and check for errors, because the sbatch job is in the queue and takes some time to run.
Don’t forget to close the if statement around line 588, before

Code: Select all

} #if ($ENV{LSF_SYS} eq 0)

    #----------------------------------------------------------------------------------
    #  We're not safe yet - A child process (metgrid) can be terminated and not
    #  report an error. Check the files just in case.
    #----------------------------------------------------------------------------------
10. Save and exit.
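The recurring “if ($ENV{LSF_SYS} eq 0)” guard in this section and the next follows one simple rule, sketched here in shell: only execute a process directly when no batch system manages it, because under Slurm the real work happens later, inside the queued job. The run_process function and the values are hypothetical stand-ins:

Code: Select all

```shell
#!/bin/bash
# Shell analogue of the if ($ENV{LSF_SYS} eq 0) guards: execute a process
# directly only when no batch system is in use; otherwise leave the work
# to the queued sbatch job. run_process is a hypothetical stand-in.
export LSF_SYS=1                 # 1 = batch system in use (UEMSwrkstn)
run_process() { echo "running $1 directly"; }

if [ "$LSF_SYS" -eq 0 ]; then
    run_process metgrid
    status="ran-directly"
else
    status="left-for-sbatch"
fi
echo "$status"   # prints: left-for-sbatch
```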
VIII. ********************* Rexe.pm located in uems/strc/Urun/ *****************
1. Around line 114, before “if (my $err = &Ecore::SysExecute($Proc{uemsexec},$Puems{rlog2})) {”: for $ENV{LSF_SYS}=1 we don’t want UEMS to run the requested process and check for errors, because the sbatch job is in the queue and takes some time to run.
Replace

Code: Select all

    if (my $err = &Ecore::SysExecute($Proc{uemsexec},$Puems{rlog2})) {
with

Code: Select all

#Cra
    if ($ENV{LSF_SYS} eq 0){
    if (my $err = &Ecore::SysExecute($Proc{uemsexec},$Puems{rlog2})) {
don’t forget to close the if statement just before “# Only rename rsllog file if successful” (around line 241):

Code: Select all

} #if $ENV{LSF_SYS}

    #----------------------------------------------------------------------------------
    #  Only rename rsllog file if successful
    #----------------------------------------------------------------------------------
    #
2. The simulation has not started yet, so don’t run Check4Success; otherwise we get errors.
Add this around line 462, before “if ($err = $ENV{EMSERR} || $we || $rc || &Others::Ret10to01(&Elove::Check4Success($Puems{rsllog}))) {”:

Code: Select all

#Cra 
    if ($ENV{LSF_SYS} eq 0){

    if ($err = $ENV{EMSERR} || $we || $rc || &Others::Ret10to01(&Elove::Check4Success($Puems{rsllog}))) {
3. Before (line 762)

Code: Select all

   if ($proc eq 'wrfm') {
Add this:

Code: Select all

##Cra
    if ($ENV{LSF_SYS} eq 0){

    if ($proc eq 'wrfm') {
close the if before “@{$Puems{phytbls}} = @{$Urun{emsrun}{tables}{physics}};”

Code: Select all

} #if $ENV{LSF_SYS} line 781

4. After these lines (1341)

Code: Select all

#----------------------------------------------------------------------------------
    #  Step 2. If running WRF REAL, create links to the WPS files in the wpsprd/
    #          directory.  We only want to create links to the domains included 
    #          in the simulation.
    #----------------------------------------------------------------------------------
    #
    if (%{$Process{wpsfls}}) {
        foreach my $d (sort {$a <=> $b} keys %{$Process{domains}}) {
            foreach my $f (@{$Process{wpsfls}{$d}}) {symlink "wpsprd/$f" => $f; push @delete => $f;}
        }
    }
Add these lines:

Code: Select all

#Cra create scripts that contain namelists, links to the WPS files and links to the tables required by the physics schemes 
    if (%{$Process{wpsfls}}) {
#Cra create links to the WPS files 
     open(my $FILERM,">sbatch/link_rm.csh") || die "Cannot create sbatch/link_rm.csh for writing: $!";

     open(my $FILE,">sbatch/link_real.csh") || die "Cannot create sbatch/link_real.csh for writing: $!";
     my $lcom;
     print $FILE "#!/bin/csh -f \n";
     print $FILERM "#!/bin/csh -f \n";
        foreach my $d (sort {$a <=> $b} keys %{$Process{domains}}) {
            foreach my $f (@{$Process{wpsfls}{$d}}) {$lcom="ln -sf wpsprd/$f $f " ;print $FILE "$lcom  \n"; print $FILERM "rm -f $f  \n"; push @delete => $f;}
        }
#Cra Create links to the tables required by the physics schemes 
    foreach my $pt (@{$Process{phytbls}}) {
        my $l = &Others::popit($pt);
        $lcom="ln -sf  $pt $l " ;
        print $FILE "$lcom  \n";
        print $FILERM "rm -f $l  \n";
        push @delete => $l;
    }
# requesting that the WRF model do an internal decomposition of your domain across the number of processors being used.
     print $FILE "sed -i 's/ nproc_x                    = 4/ nproc_x                    = -1/g'  static/namelist.real  \n";
     print $FILE "sed -i 's/ nproc_y                    = 16/ nproc_y                    = -1/g'  static/namelist.real  \n";
     print $FILE "ln -sf static/namelist.real namelist.input  \n";
     print $FILE "echo '**** Run Real ****' \n";
     close($FILE);

     open(my $FILE,">sbatch/link_wrf.csh") || die "Cannot create sbatch/link_wrf.csh for writing: $!";
     print $FILE "#!/bin/csh -f  \n";
     print $FILE "sed -i 's/ nproc_x                    = 4/ nproc_x                    = -1/g'  static/namelist.wrfm  \n";
     print $FILE "sed -i 's/ nproc_y                    = 16/ nproc_y                    = -1/g'  static/namelist.wrfm  \n";
     print $FILE "ln -sf static/namelist.wrfm namelist.input \n";
     print $FILE "echo '**** Run Wrf ****' \n";
     close($FILE);
     
     print $FILERM "rm -f namelist.input \n";
     close($FILERM);
     system "chmod 755 sbatch/link_wrf.csh  sbatch/link_rm.csh sbatch/link_real.csh";
        }
5. Save and exit.
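About the two sed lines written into link_real.csh and link_wrf.csh above: setting nproc_x and nproc_y to -1 asks WRF/REAL to choose the domain decomposition internally instead of the fixed 4 x 16 split. A standalone sketch on a dummy namelist (the spacing here is simplified; in the real static/namelist.real the sed pattern must match the file's exact fixed-width formatting):

Code: Select all

```shell
#!/bin/bash
# Demonstrates the nproc_x/nproc_y edit on a dummy namelist. -1 tells
# WRF to do its own internal decomposition across the available ranks.
cat > namelist.demo <<'EOF'
 nproc_x = 4
 nproc_y = 16
EOF
sed -i 's/ nproc_x = 4/ nproc_x = -1/' namelist.demo
sed -i 's/ nproc_y = 16/ nproc_y = -1/' namelist.demo
nminus=$(grep -c -- '= -1' namelist.demo)
echo "$nminus lines set to -1"   # prints: 2 lines set to -1
```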
One last thing: don’t run a UEMS update, because you will lose all of these changes!
I hope these instructions are clear. Next time I will show you examples of the created scripts.
A big thanks to Pr. Robert Rozumalski for making all our lives easier with this precious tool.
Rachid
Last edited by ramousta on Wed Jul 31, 2019 4:24 pm, edited 1 time in total.

ramousta
Posts: 8
Joined: Fri Jun 28, 2019 4:29 pm

Re: How to run UEMS/WRF v19 on a cluster with Slurm, step by step

Post by ramousta » Wed Jul 31, 2019 3:54 pm

Log files will be located in log/:

Code: Select all

$ ls log/
geo/                  real/             run_avgtsfc2.log  run_metgrid2.log  run_ungrib1.log  run_wrfm1.log  uems_system.info
prep_ungrib2-gfs.log  run_avgtsfc1.log  run_metgrid1.log  run_real1.log     run_ungrib2.log  run_wrfm2.log  wrf/
The sbatch scripts and their related scripts will be created inside the sbatch directory:

Code: Select all

$ ls sbatch/
link_geogrid.csh*   link_metgrid2.csh*  link_rm.csh*      link_wrf.csh*          namelist.wps.ungrib*  slurm-ungrib.out*  uemsBsub.ungrib*
link_metgrid1.csh*  link_real.csh*      link_ungrib.csh*  namelist.wps.metgrib*  slurm-geogrid.out*    uemsBsub.geogrid*  uemsBsub.wrfm*
Note that all of these files are generated automatically for you, and the sbatch job is submitted to the queue at the same time.
The listings below are reproduced to give you an idea of how the work is done. We will take a close look at each file:
1. uemsBsub.geogrid launches geogrid. This script calls sbatch/link_geogrid.csh

Code: Select all

$ m sbatch/uemsBsub.geogrid
#!/bin/bash
## ----------------------------------------------------------------------------------
##   This mpi UEMS batch job file is used to run the geogrid routine.
##   All routines are compiled as static executables.
## 
##   UEMS:     19.6.1
##   CREATED:  Wed Jul 31 00:31:58 UTC 2019
##   TEMPLATE: template_uemsBsub.parallel
##   AUTHOR:   Robert.Rozumalski@noaa.gov
## ----------------------------------------------------------------------------------
## 

#SBATCH --comment=WFOEMS-T2O
#SBATCH --partition=longq
#SBATCH -J UEMS-geogrid
#SBATCH --ntasks=20
#SBATCH --ntasks-per-core=1
#SBATCH --exclusive
#SBATCH --time=23:50:00
#SBATCH -o log/uems_geogrid.std
#SBATCH -e log/uems_geogrid.err
module load cuda91/toolkit/9.1.85
source /cm/shared/apps/intel/ips_2017/bin/compilervars.sh -arch intel64
source /cm/shared/apps/gromacs/intel/64/2018.3/bin/GMXRC

export LANG=en_US

cd /data/r.moustabchir/uems/runs/northafrica
sleep 10
./sbatch/link_geogrid.csh
rm -fr sbatch/slurm-geogrid.out
SECSS=`date +%s`



mpirun -np 20  /data/r.moustabchir/uems/bin/geogrid > log/run_geogrid1.log
sleep 10
mv slurm* sbatch/slurm-geogrid.out
mkdir -p log/geo/;rm -f log/geo/* static/*.TBL namelist.wps;mv geogrid.log* log/geo/
SECSE=`date +%s`
TSECS=$((SECSE-SECSS))
echo "UEMS parallel job completed in $TSECS seconds ($?)"
sbatch/link_geogrid.csh script:

Code: Select all

$ m ./sbatch/link_geogrid.csh
#!/bin/csh -f 
ln -sf ../../../data/tables/wps/GEOGRID.TBL.ARW static/GEOGRID.TBL  
ln -sf static/namelist.wps namelist.wps  
echo '**** run geogrid ****' 
2. uemsBsub.ungrib launches ungrib, avgtsfc and metgrid. This script calls sbatch/link_ungrib.csh, sbatch/link_metgrid1.csh and sbatch/link_metgrid2.csh

Code: Select all

 $ m sbatch/uemsBsub.ungrib
#!/bin/bash
## ----------------------------------------------------------------------------------
##   This serial UEMS batch job file is used to run the ungrib routine.
##   All routines are compiled as static executables.
## 
##   UEMS:     19.6.1
##   CREATED:  Wed Jul 31 00:48:34 UTC 2019
##   TEMPLATE: template_uemsBsub.serial
##   AUTHOR:   Robert.Rozumalski@noaa.gov
## ----------------------------------------------------------------------------------
## 

#SBATCH --comment=WFOEMS-T2O
#SBATCH --partition=longq
#SBATCH -J UEMS-ungrib
#SBATCH --mem-per-cpu=2048
#SBATCH --ntasks=1 --ntasks-per-node=20
#SBATCH --time=23:50:00
#SBATCH -o log/uems_ungrib.std
#SBATCH -e log/uems_ungrib.err


export LANG=en_US

cd /data/r.moustabchir/uems/runs/northafrica
sleep 2
./sbatch/link_ungrib.csh
rm -fr sbatch/slurm-ungrib.out

SECSS=`date +%s`
/data/r.moustabchir/uems/bin/ungrib > log/run_ungrib1.log
sleep 2
mv ungrib.log log/run_ungrib2.log;./sbatch/link_metgrid1.csh
./sbatch/link_metgrid2.csh

SECSE=`date +%s`
TSECS=$((SECSE-SECSS))

echo "UEMS serial job completed in $TSECS seconds ($?)"
The sbatch/link_ungrib.csh script is:

Code: Select all

 $ m sbatch/link_ungrib.csh
#!/bin/csh -f 
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f000 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAA  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f003 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAB  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f006 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAC  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f009 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAD  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f012 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAE  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f015 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAF  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f018 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAG  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f021 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAH  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f024 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAI  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f027 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAJ  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f030 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAK  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f033 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAL  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f036 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAM  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f039 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAN  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f042 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAO  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f045 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAP  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f048 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAQ  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f051 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAR  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f054 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAS  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f057 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAT  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f060 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAU  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f063 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAV  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f066 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAW  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f069 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAX  
ln -sf grib/19073000.gfs.t00z.0p50.pgrb2f072 /data/r.moustabchir/uems/runs/northafrica/GRIBFILE.AAY  
ln -sf ../../data/tables/vtables/Vtable.GFS Vtable  
cp sbatch/namelist.wps.ungrib  namelist.wps 
echo '**** run ungrib ****' 
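The GRIBFILE.AAA, GRIBFILE.AAB, ... names above follow the usual WPS convention of base-26 letter suffixes. A small sketch (my own helper, not part of UEMS) that reproduces the sequence for n files:

```shell
#!/bin/bash
# Generate the GRIBFILE.AAA, GRIBFILE.AAB, ... suffix sequence used when
# linking GRIB input files, in the same style as the WPS link_grib utility.
grib_suffixes() {
  local n=$1 i l1 l2 l3
  local letters=(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z)
  for ((i = 0; i < n; i++)); do
    l1=${letters[$(( i / 676 ))]}          # rolls over every 26*26 files
    l2=${letters[$(( (i / 26) % 26 ))]}    # rolls over every 26 files
    l3=${letters[$(( i % 26 ))]}           # increments each file
    echo "GRIBFILE.${l1}${l2}${l3}"
  done
}
```

For the 25 GFS files above, grib_suffixes 25 yields GRIBFILE.AAA through GRIBFILE.AAY.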

sbatch/link_metgrid1.csh is

Code: Select all

$ m sbatch/link_metgrid1.csh
#!/bin/csh -f 
sleep 2 
cp sbatch/namelist.wps.metgrib  namelist.wps 
echo '**** run avgtsfc ****' 
/data/r.moustabchir/uems/bin/avgtsfc > log/run_avgtsfc1.log 
mv /data/r.moustabchir/uems/runs/northafrica/TAVGSFC /data/r.moustabchir/uems/runs/northafrica/wpsprd 
mv logfile.log log/run_avgtsfc2.log  
and sbatch/link_metgrid2.csh

Code: Select all

$ m sbatch/link_metgrid2.csh
#!/bin/csh -f 
sleep 2 
ln -sf ../../data/tables/wps/METGRID.TBL.ARW METGRID.TBL 
echo '**** run metgrid ****' 
/data/r.moustabchir/uems/bin/metgrid > log/run_metgrid1.log 
rm -f GRIBFILE* Vtable  logfile.log namelist.wps wpsprd/GFS* *.TBL wpsprd/TAVGSFC 
mv slurm* sbatch/slurm-ungrib.out 
mv metgrid.log log/run_metgrid2.log 
3. uemsBsub.wrfm launches real and wrf. This script calls sbatch/link_real.csh, sbatch/link_wrf.csh and ./sbatch/link_rm.csh

Code: Select all

 $ m sbatch/uemsBsub.wrfm
#!/bin/bash
## ----------------------------------------------------------------------------------
##   This mpi UEMS batch job file is used to run the wrfm routine.
##   All routines are compiled as static executables.
## 
##   UEMS:     19.6.1
##   CREATED:  Wed Jul 31 02:07:13 UTC 2019
##   TEMPLATE: template_uemsBsub.parallel
##   AUTHOR:   Robert.Rozumalski@noaa.gov
## ----------------------------------------------------------------------------------
## 

#SBATCH --comment=WFOEMS-T2O
#SBATCH --partition=longq
#SBATCH -J UEMS-wrfm
#SBATCH --ntasks=20
#SBATCH --ntasks-per-core=1
#SBATCH --exclusive
#SBATCH --time=23:50:00
#SBATCH -o log/uems_wrfm.std
#SBATCH -e log/uems_wrfm.err
module load cuda91/toolkit/9.1.85
source /cm/shared/apps/intel/ips_2017/bin/compilervars.sh -arch intel64
source /cm/shared/apps/gromacs/intel/64/2018.3/bin/GMXRC

export LANG=en_US

cd /data/r.moustabchir/uems/runs/northafrica
sleep 10
./sbatch/link_real.csh;
rm -f sbatch/slurm-wrf.out
SECSS=`date +%s`
mpirun -np 20  /data/r.moustabchir/uems/bin/real_arw.exe > log/run_real1.log
sleep 11
./sbatch/link_wrf.csh; mkdir -p log/real/;rm -f log/real/*;mv rsl.* log/real/
mpirun -np 20  /data/r.moustabchir/uems/bin/wrfm_arw.exe > log/run_wrfm1.log
sleep 10
mkdir -p log/wrf/;rm -f log/wrf/*;mv rsl.* log/wrf/;mv slurm* sbatch/slurm-wrf.out
./sbatch/link_rm.csh
SECSE=`date +%s`
TSECS=$((SECSE-SECSS))
echo "UEMS parallel job completed in $TSECS seconds ($?)"
sbatch/link_real.csh is

Code: Select all

 $ m sbatch/link_real.csh
#!/bin/csh -f 
ln -sf wpsprd/met_em.d01.2019-07-30_00:00:00.nc met_em.d01.2019-07-30_00:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-30_03:00:00.nc met_em.d01.2019-07-30_03:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-30_06:00:00.nc met_em.d01.2019-07-30_06:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-30_09:00:00.nc met_em.d01.2019-07-30_09:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-30_12:00:00.nc met_em.d01.2019-07-30_12:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-30_15:00:00.nc met_em.d01.2019-07-30_15:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-30_18:00:00.nc met_em.d01.2019-07-30_18:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-30_21:00:00.nc met_em.d01.2019-07-30_21:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-31_00:00:00.nc met_em.d01.2019-07-31_00:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-31_03:00:00.nc met_em.d01.2019-07-31_03:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-31_06:00:00.nc met_em.d01.2019-07-31_06:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-31_09:00:00.nc met_em.d01.2019-07-31_09:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-31_12:00:00.nc met_em.d01.2019-07-31_12:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-31_15:00:00.nc met_em.d01.2019-07-31_15:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-31_18:00:00.nc met_em.d01.2019-07-31_18:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-07-31_21:00:00.nc met_em.d01.2019-07-31_21:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-08-01_00:00:00.nc met_em.d01.2019-08-01_00:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-08-01_03:00:00.nc met_em.d01.2019-08-01_03:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-08-01_06:00:00.nc met_em.d01.2019-08-01_06:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-08-01_09:00:00.nc met_em.d01.2019-08-01_09:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-08-01_12:00:00.nc met_em.d01.2019-08-01_12:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-08-01_15:00:00.nc met_em.d01.2019-08-01_15:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-08-01_18:00:00.nc met_em.d01.2019-08-01_18:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-08-01_21:00:00.nc met_em.d01.2019-08-01_21:00:00.nc   
ln -sf wpsprd/met_em.d01.2019-08-02_00:00:00.nc met_em.d01.2019-08-02_00:00:00.nc   
ln -sf wpsprd/met_em.d02.2019-07-30_00:00:00.nc met_em.d02.2019-07-30_00:00:00.nc   
ln -sf  /data/r.moustabchir/uems/data/tables/wrf/physics/radn/CAMtr_volume_mixing_ratio CAMtr_volume_mixing_ratio   
ln -sf  /data/r.moustabchir/uems/data/tables/wrf/physics/radn/RRTM_DATA RRTM_DATA   
ln -sf  /data/r.moustabchir/uems/data/tables/wrf/physics/lsm/GENPARM.TBL GENPARM.TBL   
ln -sf  /data/r.moustabchir/uems/data/tables/wrf/physics/lsm/LANDUSE.TBL LANDUSE.TBL   
ln -sf  /data/r.moustabchir/uems/data/tables/wrf/physics/lsm/MPTABLE.TBL MPTABLE.TBL   
ln -sf  /data/r.moustabchir/uems/data/tables/wrf/physics/lsm/SOILPARM.TBL SOILPARM.TBL   
ln -sf  /data/r.moustabchir/uems/data/tables/wrf/physics/lsm/VEGPARM.TBL VEGPARM.TBL   
ln -sf  /data/r.moustabchir/uems/data/tables/wrf/physics/lsm/URBPARM.TBL URBPARM.TBL   
ln -sf  /data/r.moustabchir/uems/data/tables/wrf/physics/lsm/URBPARM_UZE.TBL URBPARM_UZE.TBL   
sed -i 's/ nproc_x                    = 4/ nproc_x                    = -1/g'  static/namelist.real  
sed -i 's/ nproc_y                    = 16/ nproc_y                    = -1/g'  static/namelist.real  
ln -sf static/namelist.real namelist.input  
echo '**** Run Real ****' 
sbatch/link_wrf.csh file is

Code: Select all

 $m sbatch/link_wrf.csh
#!/bin/csh -f  
sed -i 's/ nproc_x                    = 4/ nproc_x                    = -1/g'  static/namelist.wrfm  
sed -i 's/ nproc_y                    = 16/ nproc_y                    = -1/g'  static/namelist.wrfm  
ln -sf static/namelist.wrfm namelist.input 
echo '**** Run Wrf ****' 
and ./sbatch/link_rm.csh

Code: Select all

 $ m sbatch/link_rm.csh
#!/bin/csh -f 
rm -f met_em.d01.2019-07-30_00:00:00.nc  
rm -f met_em.d01.2019-07-30_03:00:00.nc  
rm -f met_em.d01.2019-07-30_06:00:00.nc  
rm -f met_em.d01.2019-07-30_09:00:00.nc  
rm -f met_em.d01.2019-07-30_12:00:00.nc  
rm -f met_em.d01.2019-07-30_15:00:00.nc  
rm -f met_em.d01.2019-07-30_18:00:00.nc  
rm -f met_em.d01.2019-07-30_21:00:00.nc  
rm -f met_em.d01.2019-07-31_00:00:00.nc  
rm -f met_em.d01.2019-07-31_03:00:00.nc  
rm -f met_em.d01.2019-07-31_06:00:00.nc  
rm -f met_em.d01.2019-07-31_09:00:00.nc  
rm -f met_em.d01.2019-07-31_12:00:00.nc  
rm -f met_em.d01.2019-07-31_15:00:00.nc  
rm -f met_em.d01.2019-07-31_18:00:00.nc  
rm -f met_em.d01.2019-07-31_21:00:00.nc  
rm -f met_em.d01.2019-08-01_00:00:00.nc  
rm -f met_em.d01.2019-08-01_03:00:00.nc  
rm -f met_em.d01.2019-08-01_06:00:00.nc  
rm -f met_em.d01.2019-08-01_09:00:00.nc  
rm -f met_em.d01.2019-08-01_12:00:00.nc  
rm -f met_em.d01.2019-08-01_15:00:00.nc  
rm -f met_em.d01.2019-08-01_18:00:00.nc  
rm -f met_em.d01.2019-08-01_21:00:00.nc  
rm -f met_em.d01.2019-08-02_00:00:00.nc  
rm -f met_em.d02.2019-07-30_00:00:00.nc  
rm -f CAMtr_volume_mixing_ratio  
rm -f RRTM_DATA  
rm -f GENPARM.TBL  
rm -f LANDUSE.TBL  
rm -f MPTABLE.TBL  
rm -f SOILPARM.TBL  
rm -f VEGPARM.TBL  
rm -f URBPARM.TBL  
rm -f URBPARM_UZE.TBL  
rm -f namelist.input 
That's it. Good luck! If you have any questions, let me know!
Rachid


Re: How to run UEMS/WRF v19 (19.7.2) in a cluster with slurm Step by step

Post by ramousta » Thu Aug 01, 2019 11:03 pm

You can download all modified files in this link:
https://drive.google.com/file/d/1LG3ebn ... s=5d436cb7
The tar contains the 7 files that I modified. Steps to follow:
  • 1. download the tar file
    2. go to the top-level directory of uems
    3. untar the file
I made some minor modifications to the files. You can easily locate all modified parts in each file by searching for #Cra

Code: Select all

$ cd ~/uems
$ tar xzvf uem_modifiedFol.tar.gz 
strc/
strc/Udomain/
strc/Udomain/Dprocess.pm
strc/Uutils/
strc/Uutils/Elsf.pm
strc/Uprep/
strc/Uprep/Pungrib.pm
strc/Uprep/Pinterp.pm
strc/Urun/
strc/Urun/Rexe.pm
data/tables/lsf/
data/tables/lsf/template_uemsBsub.parallel
data/tables/lsf/template_uemsBsub.serial


Re: How to run UEMS/WRF v19 (19.7.2) in a cluster with slurm Step by step

Post by ramousta » Mon Aug 05, 2019 1:21 am

Hi everybody!
Today I will share with you the problems I ran into when running UEMS/WRF and how I resolved them. The big issues were:
1. The model was running slowly on the cluster.
2. I never got the partition I asked for when creating the sbatch script.
3. I never got the number of nodes I requested when creating the sbatch script; I always got only 1 node.


1. The model was running slowly on the cluster.
While googling the problem I learned that WRF runs faster when built with the Intel compilers.
Listing the files in the bin folder shows that Pr. Rozumalski already includes Intel builds of the executables:

Code: Select all

$ ls ~/uems/bin/
avgtsfc*       bufrstns*        emsbufr*             emsupp*
geogrid*       geogrid.intel*   metgrid*             metgrid.intel*
modlevs*       real_arw.exe*    real_arw.exe.intel*  ungrib*
ungrib.intel*  wrfm_arw.exe*    wrfm_arw.exe.intel*
First, I save a copy of the non-Intel versions:

Code: Select all

$ cp ~/uems/bin/geogrid ~/uems/bin/geogrid.1
$ cp ~/uems/bin/metgrid ~/uems/bin/metgrid.1
$ cp ~/uems/bin/ungrib ~/uems/bin/ungrib.1
$ cp ~/uems/bin/real_arw.exe ~/uems/bin/real_arw.exe.1
$ cp ~/uems/bin/wrfm_arw.exe ~/uems/bin/wrfm_arw.exe.1
Then I make the Intel versions the default executables:

Code: Select all

$ cp ~/uems/bin/geogrid.intel ~/uems/bin/geogrid
$ cp ~/uems/bin/metgrid.intel ~/uems/bin/metgrid
$ cp ~/uems/bin/ungrib.intel ~/uems/bin/ungrib
$ cp ~/uems/bin/real_arw.exe.intel ~/uems/bin/real_arw.exe
$ cp ~/uems/bin/wrfm_arw.exe.intel ~/uems/bin/wrfm_arw.exe
Now I can run these executables without any problem using mpiexec.gf.
The new problem is that mpiexec.gf does not take the number of nodes into account, so I need to use mpirun.
However, when I ran real_arw.exe with mpirun I got an error that stopped real.exe due to inconsistent land-use categories:

Code: Select all

----------------- ERROR -------------------
namelist : NUM_LAND_CAT = 0
input files : NUM_LAND_CAT = 21 (from geogrid selections).
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 464
Mismatch between namelist and wrf input files for dimension NUM_LAND_CAT
I could not find any answer to this problem, but I found a working solution: run real with mpiexec.gf and wrf with mpirun.
In this case I fixed ntasks for mpiexec.gf at 12, because with a large value (something like 70) I got a domain-decomposition error from real.exe:

Code: Select all

taskid: 0 hostname: node13
 module_io_quilt_old.F        2931 T
 Ntasks in X            1 , ntasks in Y           37
   For domain            1 , the domain size is too small for this many processors, or the decomposition aspect ratio is poor.
   Minimum decomposed computational patch size, either x-dir or y-dir, is 10 grid cells.
  e_we =   240, nproc_x =    1, with cell width in x-direction =  240
  e_sn =   210, nproc_y =   37, with cell width in y-direction =    5
  --- ERROR: Reduce the MPI rank count, or redistribute the tasks.
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:    1983
NOTE:       1 namelist settings are wrong. Please check and reset these options
-------------------------------------------
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
forrtl: error (69): process interrupted (SIGINT)
So at line 401 of Elsf.pm I use:

Code: Select all

#Cra:--- if I use $lsf{cpnode} I got error for real: ERROR: Reduce the MPI rank count, or redistribute the tasks.
        s{SLURM_NTASKS}{12}; 
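The decomposition error can be anticipated before submitting: as the message says, each decomposed patch must be at least 10 grid cells in both x and y. A small sketch (my own helper, using the e_we=240, e_sn=210 grid from the error above) to test a candidate nproc_x x nproc_y split:

```shell
#!/bin/bash
# Check whether an nproc_x x nproc_y split of an e_we x e_sn grid meets
# WRF's minimum decomposed patch size of 10 grid cells in each direction.
# (Sketch; the helper name check_decomp is hypothetical.)
check_decomp() {
  local e_we=$1 e_sn=$2 nproc_x=$3 nproc_y=$4 min=10
  local px=$((e_we / nproc_x)) py=$((e_sn / nproc_y))
  if [ "$px" -ge "$min" ] && [ "$py" -ge "$min" ]; then
    echo "ok: patches ~${px}x${py} cells"
  else
    echo "too small: reduce the MPI rank count or rebalance nproc_x/nproc_y"
  fi
}
```

check_decomp 240 210 1 37 reports the split as too small (210/37 = 5 cells in y), while a 12-rank split such as check_decomp 240 210 4 3 passes, which is consistent with ntasks=12 working.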
2. I never got the partition I asked for, and the total number of nodes after running the script was always 1.
For that problem, I found that for OpenMPI we need lines of the following form:

Code: Select all

module load mpi/openmpi/<placeholder_for_version>
mpirun --bind-to core --map-by core -report-bindings my_par_program

In my case I added these lines to the parallel template:

Code: Select all

source /cm/shared/apps/intel/ips_2017/bin/compilervars.sh -arch intel64 -platform linux
module load openmpi/intel/3.1.2

and:

Code: Select all

mpirun --bind-to core:overload-allowed --map-by core UEMSEXEC 
You can see that I use --bind-to core:overload-allowed rather than --bind-to core, because otherwise I got an error like:

Code: Select all

-------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        node8
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
The right way to request a partition name and node count is with a command of this form:

Code: Select all

 sbatch -p partition_name -N nodes_number -n ntasks --time=hh:mm:ss job_sbatch.sh
So in my case I use:

Code: Select all

#Cra : 
   $lsf{uemsexec}  = "sbatch -p $lsf{queue} -N  $lsf{totalnodes}  -n  $lsf{ncores} --time=$lsf{wtime}  < $lsf{batchfile}";
in Elsf.pm instead of

Code: Select all

  $lsf{uemsexec}  = "$ENV{LSF_BSUB_COMMAND} < $lsf{batchfile}";
3. One last thing:
You can modify the total number of cores, total nodes, partition name and wall-clock time limit. Please refer to lines 179 and 219 in the Elsf.pm file for more details:

Code: Select all

#cra line 179: If you want to define the total number of cores and total nodes, you can do it here. Uncomment these two lines and add the values you want. In this case you also need to let WRF choose an internal decomposition of your domain across the number of processors being used: go to Rexe.pm and set nproc_x = -1 and nproc_y = -1 in the real and wrf namelists to avoid errors (uncomment lines 1364,1365,1374 and 1365).
#    $Puems{totalcores}=64;
#    $Puems{totalnodes}=8;
and

Code: Select all

#Cra line 219
# If you want to change the partition name and wall-clock time limit: $lsf{queue} is the partition name and $lsf{wtime} the wall-clock time.
#    $lsf{queue}="shortq";
#    $lsf{wtime}="06:00:00";

I updated the tar file in the previous post to take into account the changes I made.

If you want to use MPI parallel programs with mpiexec.hydra, please refer to the link below; it has all the information you need:
https://wiki.scc.kit.edu/hpc/index.php/ ... Batch_Jobs


Re: How to run UEMS/WRF v19 (19.7.2) in a cluster with slurm Step by step

Post by ramousta » Thu Aug 08, 2019 6:13 am

Today I will share with you the steps I use to run ems_post under slurm to generate GRIB and GrADS files. Two steps are needed:

1. First step: run ems_post --domain 02 to create grib files for domain 02.
2. Second step: run ems_post --noupp --grads --domain 02 to create grads files.

Note that if you pass the --grads flag on the first step, the GrADS files will not be created, because the sbatch job is still in the queue and there are no GRIB files to use yet. So we need to run ems_post twice: the first time to create the GRIB files, and a second time with the --noupp --grads flags once the job has finished.
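The two passes can be chained from the command line by waiting until the first job leaves the queue. This is only a sketch: it assumes the submitted job is named UEMS-emsupp (the -J value in the generated uemsBsub.emsupp), and QUEUE_CMD is an injectable hook I added so the loop can be exercised without a cluster; on a real system the default squeue call is used.

```shell
#!/bin/bash
# Poll until no Slurm job with the given name remains in the queue (sketch).
# QUEUE_CMD overrides the queue query for testing; the default is squeue.
wait_for_job() {
  local name=$1 interval=${2:-60}
  local cmd=${QUEUE_CMD:-"squeue -h -u $USER --name"}
  # grep -q . succeeds while the queue listing is non-empty
  while $cmd "$name" | grep -q .; do
    sleep "$interval"
  done
}

# Intended use (hypothetical; job name taken from uemsBsub.emsupp):
#   ems_post --domain 02                  # pass 1: GRIB files
#   wait_for_job UEMS-emsupp
#   ems_post --noupp --grads --domain 02  # pass 2: GrADS files
```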

**************************************** Parts to be modified in Grib2.pm located under uems/strc/Upost/ ****************************************

The main file to modify is Grib2.pm. I also made some modifications to Elsf.pm to take ems_post into account.

Two scripts will be created automatically: one generates links to the needed files, the other deletes these files at the end of the process.
I also created a Perl file to rename the GRIB files at the end.
1. The Perl file postGrib.pl, located in uems/scr.
I placed a copy inside the directory uems/scr/:

Code: Select all

$ m postGrib.pl
#!/usr/bin/perl
require 5.008;
use strict;
use warnings;
use English;

use Cwd 'abs_path';
use FindBin qw($RealBin);
use lib (abs_path("../../../../strc/Uutils"), abs_path("../../../../strc/Upost"));

use Ecore;
use Eenv;
use Others;
use Outils;

    my $wkdir="./";
    my $gribfname="xxx";
    my $domin="yyy";
    my $postcore="zzz";
    my $filetype="uuu";
    my $postarf="www";
    my $mesag;
    my @Grib;
    my @Delet;
    my @Deloop;

   foreach my $ngrib (sort &Others::FileMatch($wkdir,'EMSPRS',1,1)) {

       if (-z $ngrib) {
           &Ecomm::PrintMessage(0,14+$postarf,144,1,1,sprintf('Problem  : %-34s (%.2f MBs)',"$ngrib is zero size   ",&Others::Bytes2MB(&Others::FileSize($ngrib))));
           next;
}

       my @pairs = &Outils::Grib2DateStrings("$wkdir/$ngrib");
          @pairs = (@pairs,"WD:$domin","CORE:$postcore","KEY:$filetype");

       my $grib2 = &Others::PlaceholderFill($gribfname,@pairs);

          $mesag = (system "mv $wkdir/$ngrib $wkdir/$grib2") ? "Failed (mv code $?)" : $grib2; my $err = $?;
          &Outils::ProcessInterrupt('emsupp',2,$wkdir,\@Delet,\@Deloop,[$ngrib]) if $err == 2;

          push @Grib => "$wkdir/$grib2" unless $err;

          &Ecomm::PrintMessage(0,14+$postarf,144,1,0,sprintf('Completed: %-34s (%.2f MBs)',$mesag,&Others::Bytes2MB(&Others::FileSize($grib2))));
    }

&Ecore::SysExit(0,$0);
2. For file Grib2.pm.

At line 217 I make a copy of the UPP control file (fort.14) and create link_grib.csh and link_grib_rm.csh:

Code: Select all

#Cra make a copy o fort.14  in sbatch directory
    system "mkdir -p $Post{grib}{dpost}/../../sbatch"; #create sbatch directory if doesn't exist

     open(my $FILESC,">$Post{grib}{dpost}/../../sbatch/link_grib.csh") || die "Cannot create sbatch/link_grib.csh for writing: $!";
     open(my $FILESCRM,">$Post{grib}{dpost}/../../sbatch/link_grib_rm.csh") || die "Cannot create sbatch/link_grib_rm.csh for writing: $!";
     print $FILESC "#!/bin/csh -f \n";
     print $FILESCRM "#!/bin/csh -f \n";

   system "cp -fr $Post{grib}{dpost}/fort.14 $Post{grib}{dpost}/../../sbatch/";
   print $FILESC "cp -fr $Post{grib}{dpost}/../../sbatch/fort.14  $Post{grib}{dpost}/ \n"; print $FILESCRM "rm -f $Post{grib}{dpost}/fort.* \n";
From line 365 to 448 I update the two script files:

Code: Select all

#Cra create scripts that contain links to the required files

#    push @Deletes => $Proc{procsfile} if $Proc{procsfile};
#    my $local = &Others::popit($NetCDFs[0]); &Others::rm($local); symlink $NetCDFs[0] => $local;
#    my ($init, @verfs) = &Others::FileFcstTimes($local,'netcdf'); &Others::rm($local);
#    my $indx = @NetCDFa - @NetCDFs;
     my $lcom;
  
    push @Deletes => $Proc{procsfile} if $Proc{procsfile};
    my $local = &Others::popit($NetCDFs[0]); &Others::rm($local); symlink $NetCDFs[0] => $local;
    $lcom="ln -sf $NetCDFs[0] $local " ;print $FILESC "$lcom  \n"; print $FILESCRM "rm -f $local  \n";
    my ($init, @verfs) = &Others::FileFcstTimes($local,'netcdf'); &Others::rm($local);
    my $indx = @NetCDFa - @NetCDFs;
     
    

    #----------------------------------------------------------------------------------
    #  If $indx is non-zero, it means the autopost is running, so a link to the 
    #  previous netcdf file, $NetCDFa[$indx-1], must be created for calculating 
    #  bucket amounts.
    #----------------------------------------------------------------------------------
    #
#Cra
#    if ($indx) {$local = &Others::popit($NetCDFa[$indx-1]); &Others::rm($local); symlink $NetCDFa[$indx-1] => $local;}
    if ($indx) {$local = &Others::popit($NetCDFa[$indx-1]); &Others::rm($local); symlink $NetCDFa[$indx-1] => $local;
    $lcom="ln -sf  $NetCDFa[$indx-1] $local " ;print $FILESC "$lcom  \n"; print $FILESCRM "rm -f $local  \n";}
    
        my $mono = (@verfs > 1) ? 1 : 0; my $ntimes = $mono ? @verfs : @NetCDFs; my $ntimespi = $ntimes + $indx;
    my $type = $mono ? 'mono' : 'single'; 
       $type = 'auxhist'  if  $ftype eq 'auxhist'; 
       $type = 'hailcast' if  $ftype eq 'hailcast';
    my $core = $Post{core} eq 'arw' ? 'NCAR' : 'NCEP';
    my $pacc = 60.0*$Post{grib}{pacc};

    my $epocit = &Others::CalculateEpochSeconds($init);  #  Number of epoc seconds for $init


    #=================================================================================
    #  Begin the processing of netCDF into GRIB2 files. Begin by creating links to
    #  any necessary tables. Note that both the $init & @verfs date/time string are 
    #  of the format YYYYMMDDHHMNSS (20170506030000).
    #=================================================================================
    #
#Cra
#    foreach (@Links) {my $l = &Others::popit($_); symlink $_ => $l; push @Deletes => "$Post{grib}{dpost}/$l";}
    foreach (@Links) {my $l = &Others::popit($_); symlink $_ => $l;
    $lcom="ln -sf  $_ $l " ;print $FILESC "$lcom  \n"; print $FILESCRM "rm -f $l  \n"; 
    push @Deletes => "$Post{grib}{dpost}/$l";}

    #  Open the UEMS UPP input file for writing
    #
    push @Deloops, "$Post{grib}{dpost}/$infile";
    open (my $ifh, '>', "$Post{grib}{dpost}/$infile");

    for my $ntime (1 .. $ntimes) {

        $|    = 1;
        my $i = $ntime-1;

        #-----------------------------------------------------------------------------
        #  Retrieve the name of the netCDF file to process. $ptfile hold the filename
        #  to process
        #-----------------------------------------------------------------------------
        #
        my $ptfile = File::Spec->abs2rel($mono ? $NetCDFs[0] : $NetCDFs[$i]);
        my $ptloc  = &Others::popit($ptfile);
        my $ptdate = &Others::DateStringWRF2DateString($ptloc);

        
        #-----------------------------------------------------------------------------
        #  Create the link from the emsprd/grib directory to the netcdf file in wrfprd.
        #-----------------------------------------------------------------------------
        #
#Cra
        $lcom="ln -sf  $ptfile $ptloc " ;print $FILESC "$lcom  \n"; print $FILESCRM "rm -f $ptloc  \n"; 

        unless (symlink $ptfile => $ptloc) {
            &Ecomm::PrintMessage(0,1,64,0,1,'Problem');
            &Ecomm::PrintMessage(0,1,144,1,2,sprintf("Link creation to wrfprd/$ptloc Failed (%s) - Return",$!));
            return ();
        }
        push @Deloops => "$Post{grib}{dpost}/$ptloc";

At line 487 I make a copy of the file emsupp.in in the sbatch directory, then update and close the two script files:

Code: Select all

#cra 
    system "cp -fr $Post{grib}{dpost}/emsupp.in $Post{grib}{dpost}/../../sbatch/";
#Cra update script file 
     print $FILESC "cp $Post{grib}{dpost}/../../sbatch/emsupp.in  $Post{grib}{dpost}/. \n";
     print $FILESCRM "rm -f   $Post{grib}{dpost}/emsupp.in \n";
     close($FILESC);
     close($FILESCRM);
     system "chmod 755 $Post{grib}{dpost}/../../sbatch/link_*";
At line 613 I update the Perl file:

Code: Select all

#Cra update the template perl file to  rename the new GRIB files.
    my $filna="$Post{grib}{dpost}/../../sbatch/postGrib.pl";
    system "cp -fr $Post{grib}{dpost}/../../../../scr/postGrib.pl $filna";

      system "sed -i 's/xxx/$Post{grib}{fname}/g'  ${filna}";
      system "sed -i 's/yyy/$dom/g'  ${filna}";
      system "sed -i 's/zzz/$Post{core}/g'  ${filna}";
      system "sed -i 's/uuu/$ftype/g'  ${filna}";
      system "sed -i 's/www/$Post{arf}/g'  ${filna}";
**************************************** Created files ****************************************

In sbatch/ you will get the necessary scripts for uemsBsub.emsupp to run:

Code: Select all

 link_grib.csh          link_grib_rm.csh       postGrib.pl          uemsBsub.emsupp
uemsBsub.emsupp is submitted automatically when you type, for example: ems_post --domain 02.
uemsBsub.emsupp looks like this:

Code: Select all

$ m uemsBsub.emsupp
#!/bin/bash
## ----------------------------------------------------------------------------------
##   This mpi UEMS batch job file is used to run the emsupp routine.
##   All routines are compiled as static executables.
## 
##   UEMS:     19.6.1
##   CREATED:  Thu Aug  8 05:55:44 UTC 2019
##   TEMPLATE: template_uemsBsub.parallel
##   AUTHOR:   Robert.Rozumalski@noaa.gov
## ----------------------------------------------------------------------------------
## 

#SBATCH --comment=WFOEMS-T2O
#SBATCH --partition=shortq
#SBATCH -J UEMS-emsupp
#SBATCH -N 1
#SBATCH -n 32
#SBATCH --ntasks-per-node=32
#SBATCH --ntasks-per-core=1
#SBATCH --exclusive
#SBATCH --time=04:30:00
#SBATCH -o ../../log/uems_emsupp.std
#SBATCH -e ../../log/uems_emsupp.err

export LANG=en_US
source /cm/shared/apps/intel/ips_2017/bin/compilervars.sh -arch intel64 -platform linux
module load openmpi/intel/3.1.2
module load mpich/ge/gcc/64/3.2.1
ulimit -s unlimited

cd /data/r.moustabchir/uems/runs/nortafr/emsprd/grib
sleep 11
./../../sbatch/link_grib.csh

SECSS=`date +%s`
mpiexec.gf  -np 18  /data/r.moustabchir/uems/bin/emsupp > ../../log/post_emsupp1_wrfoutd02.log

./../../sbatch/postGrib.pl;./../../sbatch/link_grib_rm.csh; mv slurm* ../../sbatch/slurm-post.out




SECSE=`date +%s`
TSECS=$((SECSE-SECSS))
echo "UEMS parallel job completed in $TSECS seconds ($?)"
link_grib.csh is

Code: Select all

$ m link_grib.csh         
#!/bin/csh -f 
cp -fr /data/r.moustabchir/uems/runs/nortafr/emsprd/grib/../../sbatch/fort.14  /data/r.moustabchir/uems/runs/nortafr/emsprd/grib/ 
ln -sf /data/r.moustabchir/uems/runs/nortafr/wrfprd/wrfout_d02_2019-08-06_00:00:00 wrfout_d02_2019-08-06_00:00:00   
ln -sf  /data/r.moustabchir/uems/data/tables/post/grib2/params_grib2_tbl_new params_grib2_tbl_new   
ln -sf  /data/r.moustabchir/uems/data/tables/post/grib2/post_avblflds.xml post_avblflds.xml   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_00:00:00 wrfout_d02_2019-08-06_00:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_01:00:00 wrfout_d02_2019-08-06_01:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_02:00:00 wrfout_d02_2019-08-06_02:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_03:00:00 wrfout_d02_2019-08-06_03:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_04:00:00 wrfout_d02_2019-08-06_04:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_05:00:00 wrfout_d02_2019-08-06_05:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_06:00:00 wrfout_d02_2019-08-06_06:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_07:00:00 wrfout_d02_2019-08-06_07:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_08:00:00 wrfout_d02_2019-08-06_08:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_09:00:00 wrfout_d02_2019-08-06_09:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_10:00:00 wrfout_d02_2019-08-06_10:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_11:00:00 wrfout_d02_2019-08-06_11:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_12:00:00 wrfout_d02_2019-08-06_12:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_13:00:00 wrfout_d02_2019-08-06_13:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_14:00:00 wrfout_d02_2019-08-06_14:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_15:00:00 wrfout_d02_2019-08-06_15:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_16:00:00 wrfout_d02_2019-08-06_16:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_17:00:00 wrfout_d02_2019-08-06_17:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_18:00:00 wrfout_d02_2019-08-06_18:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_19:00:00 wrfout_d02_2019-08-06_19:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_20:00:00 wrfout_d02_2019-08-06_20:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_21:00:00 wrfout_d02_2019-08-06_21:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_22:00:00 wrfout_d02_2019-08-06_22:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-06_23:00:00 wrfout_d02_2019-08-06_23:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_00:00:00 wrfout_d02_2019-08-07_00:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_01:00:00 wrfout_d02_2019-08-07_01:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_02:00:00 wrfout_d02_2019-08-07_02:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_03:00:00 wrfout_d02_2019-08-07_03:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_04:00:00 wrfout_d02_2019-08-07_04:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_05:00:00 wrfout_d02_2019-08-07_05:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_06:00:00 wrfout_d02_2019-08-07_06:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_07:00:00 wrfout_d02_2019-08-07_07:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_08:00:00 wrfout_d02_2019-08-07_08:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_09:00:00 wrfout_d02_2019-08-07_09:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_10:00:00 wrfout_d02_2019-08-07_10:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_11:00:00 wrfout_d02_2019-08-07_11:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_12:00:00 wrfout_d02_2019-08-07_12:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_13:00:00 wrfout_d02_2019-08-07_13:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_14:00:00 wrfout_d02_2019-08-07_14:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_15:00:00 wrfout_d02_2019-08-07_15:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_16:00:00 wrfout_d02_2019-08-07_16:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_17:00:00 wrfout_d02_2019-08-07_17:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_18:00:00 wrfout_d02_2019-08-07_18:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_19:00:00 wrfout_d02_2019-08-07_19:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_20:00:00 wrfout_d02_2019-08-07_20:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_21:00:00 wrfout_d02_2019-08-07_21:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_22:00:00 wrfout_d02_2019-08-07_22:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-07_23:00:00 wrfout_d02_2019-08-07_23:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_00:00:00 wrfout_d02_2019-08-08_00:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_01:00:00 wrfout_d02_2019-08-08_01:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_02:00:00 wrfout_d02_2019-08-08_02:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_03:00:00 wrfout_d02_2019-08-08_03:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_04:00:00 wrfout_d02_2019-08-08_04:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_05:00:00 wrfout_d02_2019-08-08_05:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_06:00:00 wrfout_d02_2019-08-08_06:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_07:00:00 wrfout_d02_2019-08-08_07:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_08:00:00 wrfout_d02_2019-08-08_08:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_09:00:00 wrfout_d02_2019-08-08_09:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_10:00:00 wrfout_d02_2019-08-08_10:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_11:00:00 wrfout_d02_2019-08-08_11:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_12:00:00 wrfout_d02_2019-08-08_12:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_13:00:00 wrfout_d02_2019-08-08_13:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_14:00:00 wrfout_d02_2019-08-08_14:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_15:00:00 wrfout_d02_2019-08-08_15:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_16:00:00 wrfout_d02_2019-08-08_16:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_17:00:00 wrfout_d02_2019-08-08_17:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_18:00:00 wrfout_d02_2019-08-08_18:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_19:00:00 wrfout_d02_2019-08-08_19:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_20:00:00 wrfout_d02_2019-08-08_20:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_21:00:00 wrfout_d02_2019-08-08_21:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_22:00:00 wrfout_d02_2019-08-08_22:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-08_23:00:00 wrfout_d02_2019-08-08_23:00:00   
ln -sf  ../../wrfprd/wrfout_d02_2019-08-09_00:00:00 wrfout_d02_2019-08-09_00:00:00   
cp /data/r.moustabchir/uems/runs/nortafr/emsprd/grib/../../sbatch/emsupp.in  /data/r.moustabchir/uems/runs/nortafr/emsprd/grib/. 
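The long run of hourly ln -sf lines above could be generated with a short loop instead (a sketch, assuming GNU date and the same d02 start/end times as in the example):

```shell
#!/bin/bash
# Sketch: recreate the 73 hourly wrfout_d02 links (2019-08-06 00 UTC through
# 2019-08-09 00 UTC) with a loop instead of one ln -sf line per file.
# Run from emsprd/grib; assumes GNU date for the "+N hours" arithmetic.
for h in $(seq 0 72); do
    stamp=$(date -u -d "2019-08-06 00:00:00 UTC +${h} hours" +%Y-%m-%d_%H:%M:%S)
    ln -sf "../../wrfprd/wrfout_d02_${stamp}" "wrfout_d02_${stamp}"
done
```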
link_grib_rm.csh is:

Code: Select all

$ m link_grib_rm.csh   
#!/bin/csh -f 
rm -f /data/r.moustabchir/uems/runs/nortafr/emsprd/grib/fort.* 
rm -f wrfout_d02_2019-08-06_00:00:00  
rm -f params_grib2_tbl_new  
rm -f post_avblflds.xml  
rm -f wrfout_d02_2019-08-06_00:00:00  
rm -f wrfout_d02_2019-08-06_01:00:00  
rm -f wrfout_d02_2019-08-06_02:00:00  
rm -f wrfout_d02_2019-08-06_03:00:00  
rm -f wrfout_d02_2019-08-06_04:00:00  
rm -f wrfout_d02_2019-08-06_05:00:00  
rm -f wrfout_d02_2019-08-06_06:00:00  
rm -f wrfout_d02_2019-08-06_07:00:00  
rm -f wrfout_d02_2019-08-06_08:00:00  
rm -f wrfout_d02_2019-08-06_09:00:00  
rm -f wrfout_d02_2019-08-06_10:00:00  
rm -f wrfout_d02_2019-08-06_11:00:00  
rm -f wrfout_d02_2019-08-06_12:00:00  
rm -f wrfout_d02_2019-08-06_13:00:00  
rm -f wrfout_d02_2019-08-06_14:00:00  
rm -f wrfout_d02_2019-08-06_15:00:00  
rm -f wrfout_d02_2019-08-06_16:00:00  
rm -f wrfout_d02_2019-08-06_17:00:00  
rm -f wrfout_d02_2019-08-06_18:00:00  
rm -f wrfout_d02_2019-08-06_19:00:00  
rm -f wrfout_d02_2019-08-06_20:00:00  
rm -f wrfout_d02_2019-08-06_21:00:00  
rm -f wrfout_d02_2019-08-06_22:00:00  
rm -f wrfout_d02_2019-08-06_23:00:00  
rm -f wrfout_d02_2019-08-07_00:00:00  
rm -f wrfout_d02_2019-08-07_01:00:00  
rm -f wrfout_d02_2019-08-07_02:00:00  
rm -f wrfout_d02_2019-08-07_03:00:00  
rm -f wrfout_d02_2019-08-07_04:00:00  
rm -f wrfout_d02_2019-08-07_05:00:00  
rm -f wrfout_d02_2019-08-07_06:00:00  
rm -f wrfout_d02_2019-08-07_07:00:00  
rm -f wrfout_d02_2019-08-07_08:00:00  
rm -f wrfout_d02_2019-08-07_09:00:00  
rm -f wrfout_d02_2019-08-07_10:00:00  
rm -f wrfout_d02_2019-08-07_11:00:00  
rm -f wrfout_d02_2019-08-07_12:00:00  
rm -f wrfout_d02_2019-08-07_13:00:00  
rm -f wrfout_d02_2019-08-07_14:00:00  
rm -f wrfout_d02_2019-08-07_15:00:00  
rm -f wrfout_d02_2019-08-07_16:00:00  
rm -f wrfout_d02_2019-08-07_17:00:00  
rm -f wrfout_d02_2019-08-07_18:00:00  
rm -f wrfout_d02_2019-08-07_19:00:00  
rm -f wrfout_d02_2019-08-07_20:00:00  
rm -f wrfout_d02_2019-08-07_21:00:00  
rm -f wrfout_d02_2019-08-07_22:00:00  
rm -f wrfout_d02_2019-08-07_23:00:00  
rm -f wrfout_d02_2019-08-08_00:00:00  
rm -f wrfout_d02_2019-08-08_01:00:00  
rm -f wrfout_d02_2019-08-08_02:00:00  
rm -f wrfout_d02_2019-08-08_03:00:00  
rm -f wrfout_d02_2019-08-08_04:00:00  
rm -f wrfout_d02_2019-08-08_05:00:00  
rm -f wrfout_d02_2019-08-08_06:00:00  
rm -f wrfout_d02_2019-08-08_07:00:00  
rm -f wrfout_d02_2019-08-08_08:00:00  
rm -f wrfout_d02_2019-08-08_09:00:00  
rm -f wrfout_d02_2019-08-08_10:00:00  
rm -f wrfout_d02_2019-08-08_11:00:00  
rm -f wrfout_d02_2019-08-08_12:00:00  
rm -f wrfout_d02_2019-08-08_13:00:00  
rm -f wrfout_d02_2019-08-08_14:00:00  
rm -f wrfout_d02_2019-08-08_15:00:00  
rm -f wrfout_d02_2019-08-08_16:00:00  
rm -f wrfout_d02_2019-08-08_17:00:00  
rm -f wrfout_d02_2019-08-08_18:00:00  
rm -f wrfout_d02_2019-08-08_19:00:00  
rm -f wrfout_d02_2019-08-08_20:00:00  
rm -f wrfout_d02_2019-08-08_21:00:00  
rm -f wrfout_d02_2019-08-08_22:00:00  
rm -f wrfout_d02_2019-08-08_23:00:00  
rm -f wrfout_d02_2019-08-09_00:00:00  
rm -fr   /data/r.moustabchir/uems/runs/nortafr/emsprd/grib/emsupp.in 
postGrib.pl is:

Code: Select all

$ m postGrib.pl  
#!/usr/bin/perl
require 5.008;
use strict;
use warnings;
use English;

use Cwd 'abs_path';
use FindBin qw($RealBin);
use lib (abs_path("../../../../strc/Uutils"), abs_path("../../../../strc/Upost"));

use Ecore;
use Eenv;
use Others;
use Outils;

    my $wkdir="./";
    my $gribfname="YYMMDDHHMN_wrfout_arw_d02.grb2fFXFMFS";
    my $domin="02";
    my $postcore="arw";
    my $filetype="wrfout";
    my $postarf="0";
    my $mesag;
    my @Grib;
    my @Delet;
    my @Deloop;

   foreach my $ngrib (sort &Others::FileMatch($wkdir,'EMSPRS',1,1)) {

       if (-z $ngrib) {
           &Ecomm::PrintMessage(0,14+$postarf,144,1,1,sprintf('Problem  : %-34s (%.2f MBs)',"$ngrib is zero size   ",&Others::Bytes2MB(&Others::FileSize($ngrib))));
           next;
       }

       my @pairs = &Outils::Grib2DateStrings("$wkdir/$ngrib");
          @pairs = (@pairs,"WD:$domin","CORE:$postcore","KEY:$filetype");

       my $grib2 = &Others::PlaceholderFill($gribfname,@pairs);

          $mesag = (system "mv $wkdir/$ngrib $wkdir/$grib2") ? "Failed (mv code $?)" : $grib2; my $err = $?;
          &Outils::ProcessInterrupt('emsupp',2,$wkdir,\@Delet,\@Deloop,[$ngrib]) if $err == 2;

          push @Grib => "$wkdir/$grib2" unless $err;

          &Ecomm::PrintMessage(0,14+$postarf,144,1,0,sprintf('Completed: %-34s (%.2f MBs)',$mesag,&Others::Bytes2MB(&Others::FileSize($grib2))));
    }

&Ecore::SysExit(0,$0);
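The filename template used above can be illustrated with a plain shell substitution. The placeholder meanings below (YYMMDDHHMN = cycle date, FXFMFS = forecast hour/minute/second) are my reading of the UEMS naming, not official documentation:

```shell
# Hypothetical illustration of the rename postGrib.pl performs on each
# EMSPRS file: fill the date/forecast placeholders in the template.
cycle="1908060000"   # YYMMDDHHMN -- assumed: cycle date, 2019-08-06 00:00 UTC
fhr="060000"         # FXFMFS     -- assumed: forecast hour/min/sec (here f06)
template="YYMMDDHHMN_wrfout_arw_d02.grb2fFXFMFS"
out=${template/YYMMDDHHMN/$cycle}
out=${out/FXFMFS/$fhr}
echo "$out"          # -> 1908060000_wrfout_arw_d02.grb2f060000
```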
**************************************** Create grads files ****************************************

Since GRIB files are now available in the emsprd directory, you can run ems_post again with the flags --noupp --grads to create only the GrADS files.
For example, to process domain 02, type:
ems_post --noupp --grads --domain 02

I updated the tar file I provided before to take into account all the changes I made.
I replaced mpirun with mpiexec because mpiexec works without any errors and makes the job run much faster than mpirun.

lcana
Posts: 75
Joined: Wed Nov 30, 2011 4:34 pm

Re: How to run UEMS/WRF v19 (19.7.2) in a cluster with slurm Step by step

Post by lcana » Wed Aug 21, 2019 2:07 pm

Hi,

thanks for sharing your work. Very nice of you!

All the best,

Luis

ramousta
Posts: 8
Joined: Fri Jun 28, 2019 4:29 pm

Re: How to run UEMS/WRF v19 (19.7.2) in a cluster with slurm Step by step

Post by ramousta » Thu Aug 22, 2019 3:09 am

Thank you Luis.
Comparison between mpirun, srun and mpiexec.hydra

For mpirun
Sometimes I got the error:
node05.192508PSM2 can't open hfi unit: -1 (err=23)
I found on some websites that this strange error is related to a low locked-memory limit.
Others said that:
" This may occur on TMI when libEnsemble Python processes have been launched to a node and these, in turn, launch jobs on the node; creating too many processes for the available contexts. Note that while processes can share contexts, the system is confused by the fact that there are two phases, first libEnsemble processes and then sub-processes to run user jobs. "
Using export I_MPI_FABRICS=shm:dapl solves the problem, but I think it makes the run very slow.
in sbatch file:

Code: Select all

......
export I_MPI_FABRICS=shm:dapl 
......
mpirun -n 32   /data/r.moustabchir/uems/bin/geogrid > log/run_geogrid1.log
mpirun doesn't work for real. Instead I use:

Code: Select all

mpiexec.gf  -np 18  /data/r.moustabchir/uems/bin/real_arw.exe > log/run_real1.log
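Since the sources quoted above blame a low locked-memory limit for the PSM2 error, raising that limit in the sbatch script may also be worth a try (an untested sketch; whether "unlimited" is permitted depends on the cluster's limits.conf):

```shell
# Untested sketch: raise the locked-memory limit alongside the stack limit,
# near the top of the sbatch script, before the mpirun/mpiexec line.
ulimit -s unlimited        # stack, as in the generated scripts above
ulimit -l unlimited 2>/dev/null \
    || echo "locked-memory limit could not be raised (check limits.conf)"
ulimit -l                  # print the limit actually in effect
```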
For mpiexec.hydra
You also need to export I_MPI_FABRICS=shm:dapl, otherwise you get the error:
node05.192508PSM2 can't open hfi unit: -1 (err=23)
in sbatch file:

Code: Select all

export I_MPI_FABRICS=shm:dapl 
......
mpiexec.hydra -bootstrap slurm -np 32  /data/r.moustabchir/uems/bin/geogrid > log/run_geogrid1.log
mpiexec.hydra -bootstrap doesn't work for real either. Instead I use:

Code: Select all

mpiexec.gf  -np 18  /data/r.moustabchir/uems/bin/real_arw.exe > log/run_real1.log
For srun
srun --mpi=pmi2 does not need I_MPI_FABRICS=shm:dapl. srun runs much faster than mpirun and mpiexec, without errors, and it also works with real.
in sbatch file:

Code: Select all

srun  --mpi=pmi2    /data/r.moustabchir/uems/bin/real_arw.exe > log/run_real1.log
NOTE:
  • I use only the Intel versions of the geogrid, ungrib, metgrid, real and wrf executables.
  • srun is the best solution.
  • When submitting a job, the desired MPI plugin can be selected from those available via srun --mpi=list. The default for pmix will be the highest version of the library:

Code: Select all

$  srun --mpi=list
srun: MPI types are...
srun: none
srun: openmpi
srun: pmi2
That's it.
Next time I will share with you a new post on how to automate UEMS/WRF under Slurm: i.e. doing all the steps in the same sbatch file and running it automatically every day at a fixed hour, even when users are not allowed to schedule and execute periodic jobs on the cluster with the cron utility.
Rachid

Post Reply