cluster - UEMS install required on each node?

Posted: Thu Oct 04, 2018 4:26 am
by pattim
Dear UEMS users: I'm having an odd issue on a cluster. I set up and tested passwordless ssh on my mini-cluster (two 12-CPU boxes). It works well, but UEMS crashes with an error I don't understand. Does UEMS have to be installed on all nodes in a cluster? I didn't see that suggested anywhere in the docs.
If so, this could cause issues on heterogeneous clusters. But I think it may be an issue with my calling syntax, which isn't clear from the docs or the .conf file text. Is this a known error, and is my calling syntax (NODECPUS) correct?

The error originally was:
MPICH executables not found on linux-68t1 (/run/media/patti/00_uems/uems/util/mpich2/bin)!

So I copied that directory tree to node1 (linux-68t1) and received this error:
UEMS executables not found on linux-68t1 (/run/media/patti/dc648309-128b-489f-a9c7-387482966210/Users/00_uems/uems/bin)!
(UEMS is installed on a removable drive on linux-68t0)

My head-node is linux-68t0 - and that's where UEMS is installed.

In run_ncpus.conf:
REAL_NODECPUS = local:10
WRFM_NODECPUS = linux-68t0:10,linux-68t1:10

I also tried:
ems_run --nodes linux-68t0:10,linux-68t1:10 --length 24 --cycle 6 --domain 4


           *  Simulation start and end times:

              Domain         Start                   End              Parent
                 1     2018-10-03_06:00:00     2018-10-04_06:00:00       
                 2     2018-10-03_06:00:00     2018-10-04_06:00:00      1
                 3     2018-10-03_06:00:00     2018-10-04_06:00:00      2
                 4     2018-10-03_06:00:00     2018-10-04_06:00:00      3

              Primary domain simulation length will be 24 hours.

           *  Gathering system information for running WRF REAL

           *  Gathering system information for running WRF ARW

           ☠  UEMS executables not found on linux-68t1 (/run/media/patti/dc648309-128b-489f-a9c7-387482966210/Users/00_uems/uems/bin)!

         !  Oh Poop! There is a problem with one or more hosts requested - Exit

Posted: Sun Oct 07, 2018 1:02 am
by pattim
Nope - not required. Turns out the UEMS directory must be shared ("exported") to all compute nodes. Problem solved. Did I miss that in the docs somewhere or is that just something a cluster-builder knows intrinsically? :roll:
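
For anyone hitting the same "executables not found" error, a quick pre-flight check along these lines might help. This is only a rough sketch, not anything that ships with UEMS: `list_hosts`, `dir_exists_on`, `check_uems_dir` and the `SSH_CMD` override are made-up names, it assumes passwordless ssh already works, and it assumes `local` in a node list means the machine you are running on. It just asks each host in a NODECPUS-style list whether it can see a given directory at the same path.

```shell
# Hypothetical helpers, not part of UEMS.

# Print one host name per line from a NODECPUS-style list,
# e.g. "linux-68t0:10,linux-68t1:10" (the ":ncpus" part is dropped).
list_hosts() {
    printf '%s\n' "$1" | tr ',' '\n' | sed 's/:.*$//'
}

# Succeed if directory $2 exists on host $1.
# "local" is assumed to mean the machine running this script.
# SSH_CMD may be set to substitute another command for ssh (dry runs).
# Note: paths containing spaces are not handled on the remote side.
dir_exists_on() {
    if [ "$1" = "local" ]; then
        test -d "$2"
    else
        # BatchMode makes ssh fail instead of prompting for a password.
        ${SSH_CMD:-ssh} -o BatchMode=yes "$1" test -d "$2"
    fi
}

# Report ok/MISSING for directory $2 on every host in list $1.
# Returns non-zero if any host cannot see the directory.
check_uems_dir() {
    status=0
    for host in $(list_hosts "$1"); do
        if dir_exists_on "$host" "$2"; then
            echo "ok $host"
        else
            echo "MISSING $host"
            status=1
        fi
    done
    return "$status"
}
```

With my setup the call would look like `check_uems_dir linux-68t0:10,linux-68t1:10 /run/media/patti/dc648309-128b-489f-a9c7-387482966210/Users/00_uems/uems/bin`. A MISSING line for a compute node suggests the directory is not exported to (or not mounted on) that node at the same path.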

Posted: Wed Oct 10, 2018 4:31 pm
by lcana
Hi Pattim,

Thanks for sharing this with us. It's interesting how you set up UEMS on a cluster. BTW, which distro did you use for building it? Rocks cluster perhaps?



Posted: Thu Nov 15, 2018 11:38 pm
by pattim
Hi - I just used my regular go-to Linux distro (in my case, openSUSE Leap 15.0). There is a picture on the FB page I created...

I have done a little more. The setup was pretty easy once the shared UEMS directory (which lives on the head node) was exported - openSUSE has the YaST system admin tool, which makes this straightforward and minimizes mistakes. I used a standard 1Gb router. You also need ssh open (and passwordless...) from the head node to the compute nodes. Then you have to use this syntax in run_ncpus.conf:
REAL_NODECPUS = local:10

But in trying to get better performance, I found you can get cheap InfiniBand cards on eBay - so I decided to try that. You do get better and more stable performance as the communication goes from 1Gb on a router to 10Gb on InfiniBand. You don't need a "switch" or a "router" for InfiniBand to work - you just cable the machines together. But you still have to use a similar ncpus setting - I changed the IP address a little bit.
REAL_NODECPUS = local:10

I think this works out to "TCP/IP over InfiniBand" - it's what I could figure out easily. I think you can use a faster protocol, but when I tested mine, it ran at about 9.5Gb/sec.

Posted: Fri Nov 16, 2018 9:14 am
by lcana
Thanks pattim!

Really helpful.