27 March 2012

115. Very Simple Python Queue Manager

I suppose we can call it the VSPQM, which sounds a bit like a Roman initialism, akin to SPQR.

I've spent the past few days trying to get to grips with the Sun Gridengine (SGE) but have given up for now. While it seems capable, it's overkill for my purposes, especially given how difficult it is just to configure. It's a bit like my experience with OpenDX, a very capable plotting program which I couldn't make work to my satisfaction in spite of being one of the lucky few in possession of the "Open DX -- Paths to Visualisation" book.

Long story short -- I wrote a small script in python. It
- reads a file, list, containing the names of shell scripts
- executes the shell scripts, job1.sh..jobn.sh, sequentially: when one script finishes, the next one is started
- allows jobs to be added to and removed from list during execution

It's a 'dumb' script -- it does not try to balance jobs across nodes or look for idle cpus/cores. It just executes one job after the other, and marks each job as done after execution.

To test it:
create a file called list and put the following lines in it:
pi40.sh
pi400.sh
pi2000.sh
The scripts are the following:

pi40.sh
echo "pi to 40 decimals"
echo "scale=40; 4*a(1)" | bc -l -q
echo "done"
pi400.sh
echo "scale=400; 4*a(1)" | bc -l -q
pi2000.sh
echo "scale=2000; 4*a(1)" | bc -l -q
The python code for vspqm.py is below

I've aliased my vspqm (edit ~/.bashrc):
alias vspqm='/home/me/work/vspqm/vspqm.py'
Then sourced ~/.bashrc

Launch in the directory you keep your list file using
me@beryllium:~/work/vspqm/jobs$ vspqm list > log &
[1] 23925
me@beryllium:~/work/vspqm/jobs$ cat log
pi to 40 decimals
3.1415926535897932384626433832795028841968
done
3.141592653589793238462643383279502884197169399375105820974944592307\
[..]
3.141592653589793238462643383279502884197169399375105820974944592307\
81640628620899862803482534211706798214808651328230664709384460955058\
[..]

An nwchem example would be
list:
ac.sh
bn.sh
ac.sh:
cd acetone/
mpirun -n 4 nwchem ac.nw>ac.out
cd ../
bn.sh:
cd benzene/
mpirun -n 4 nwchem bn.nw>bn.out
cd ../
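
Writing one wrapper script per molecule by hand gets tedious once the list grows. Below is a minimal sketch of how the two scripts and the list above could be generated, assuming each job lives in its own directory and the input file is named after a short stem. The helpers job_script and queue_list are hypothetical names for illustration and are not part of vspqm.

```python
def job_script(directory, stem, nprocs=4):
    """Return the text of a wrapper script laid out like ac.sh above."""
    return (
        "cd %s/\n"
        "mpirun -n %d nwchem %s.nw>%s.out\n"
        "cd ../\n" % (directory, nprocs, stem, stem)
    )


def queue_list(stems):
    """Return the contents of the list file, one script name per line."""
    return "".join("%s.sh\n" % stem for stem in stems)


# Assumed layout: stem -> directory holding <stem>.nw
jobs = {"ac": "acetone", "bn": "benzene"}
for stem in sorted(jobs):
    print("%s.sh:" % stem)
    print(job_script(jobs[stem], stem))
print("list:")
print(queue_list(sorted(jobs)))
```

The functions only build strings; writing them to ac.sh, bn.sh and list is left to the caller so nothing is overwritten by accident.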


Our python queue manager (which we'll call vspqm.py and chmod +x to make executable) is below. Don't forget to change #!/usr/bin/python2.4 if necessary -- I use 2.4 on ROCKS and 2.7 on Debian testing/wheezy

#!/usr/bin/python2.4
# rudimentary queue manager. Handles a single node,
# submitting a series of jobs in sequence. use python v2.4-2.7
import os
import time
import sys
infile=sys.argv[1]
print "pyqm v 0.0.3"
def launchjob(job):
        i=0
        print "######"
        job=job.rstrip('\n')
     
        i=os.system("sh "+job)
        if i==0:
                print "Job successful"
        else:
                print "Job failed"
        print "######"
        return i
def remake_list(infile):
        qfile=open(infile,"w")
        bakfile=open(infile+".bak",'r')
        for i in bakfile:
                qfile.write(i)
        return 0
def rewind(infile):
        qfile=open(infile,"w")
        bakfile=open(infile+".bak",'r')
        for i in bakfile:
                qfile.write(i[1:])
        return 0
def get_next_job(infile):
        qfile=open(infile,"r")
        bakfile=open(infile+".bak",'w')
        lines=""
        job=""
        for line in qfile:
                if line[0]=="*":
                        print "Marked as done: ",line[1:]
                if line[0]!="*" and job=="":
                        print "Launching: ", line
                        job=line
                        line="*"+line
                lines+=line
        bakfile.write(lines)
        qfile.close()
        bakfile.close()
        return job
def main(infile):
        jobs=1
        while (jobs==1):
                newjob=get_next_job(infile)
                remake_list(infile)
                if newjob!="":
                        jobs=1
                        echojob=launchjob(newjob)
                else:
                        print "No more jobs found at "+str(time.asctime())    
                        jobs=0
        return 0

if __name__ == "__main__":
        main(infile)
        rewind(infile)
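
For machines that only have Python 3, the same idea can be sketched roughly as follows. This is not a drop-in replacement for the script above but an illustration under a few assumptions: it rewrites the list in place instead of going through a .bak file, it skips blank lines, and the function names (take_next_job, run_queue and so on) are made up for this sketch. The runner argument exists only so that the loop can be tried out without launching real jobs.

```python
import subprocess
import time


def run_script(job):
    """Run one shell script with sh and return its exit status."""
    return subprocess.call(["sh", job])


def take_next_job(path):
    """Mark the first pending line in the list with '*' and return its name.

    Returns None when every non-blank line is already marked.
    """
    with open(path) as handle:
        lines = handle.readlines()
    for index, line in enumerate(lines):
        name = line.strip()
        if name and not name.startswith("*"):
            lines[index] = "*" + line
            with open(path, "w") as handle:
                handle.writelines(lines)
            return name
    return None


def rewind(path):
    """Strip the leading '*' from marked lines so the list can be reused."""
    with open(path) as handle:
        lines = handle.readlines()
    with open(path, "w") as handle:
        handle.writelines(l[1:] if l.startswith("*") else l for l in lines)


def run_queue(path, runner=run_script):
    """Run jobs one at a time, re-reading the list before each job."""
    results = []
    while True:
        job = take_next_job(path)
        if job is None:
            print("No more jobs found at " + time.asctime())
            break
        print("Launching: " + job)
        status = runner(job)
        print("Job successful" if status == 0 else "Job failed")
        results.append((job, status))
    rewind(path)
    return results
```

Because the list is read again before every job, lines added to or removed from it while a job is running are picked up on the next pass, which is the same behaviour the original script relies on.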

20 March 2012

114. Nwchem 6.0 with openmpi support on debian testing

I still haven't managed to compile a working version of Nwchem 6.1 on Debian 64 bit regardless of whether I'm using mpich or openmpi. The number of posts relating to compiling nwchem is steadily growing, but I'd rather have posts which are almost, but not quite, identical if that makes it unambiguous for the average user how to build and use nwchem.

Anyway, since I'm using openmpi on my rocks cluster(s), I figure I might as well start using openmpi on debian too. In addition, the only way you can get nwchem 6.0 to work with mpich2 on debian seems to be by using the old v1.2 package which causes problems of its own (see apt-pinning).

Note: See here for information about python support: http://verahill.blogspot.com.au/2012/04/adding-python-support-to-nwchem-under.html

Long story short -- nwchem with openmpi:
mkdir ~/tmp
sudo apt-get install openmpi-bin libopenmpi-dev
wget http://www.nwchem-sw.org/images/Nwchem-6.0.tar.gz
tar -xvf Nwchem-6.0.tar.gz
cd nwchem-6.0/

export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/home/me/tmp/nwchem-6.0
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=all
export USE_MPI=y
export USE_MPIF=y
export MPI_LOC=/usr/lib/openmpi/lib
export MPI_INCLUDE=/usr/lib/openmpi/include
export LIBRARY_PATH=$LIBRARY_PATH:/usr/lib/openmpi/lib
export LIBMPI="-lmpi -lopen-rte -lopen-pal -ldl -lmpi_f77 -lpthread"
cd $NWCHEM_TOP/src
make clean
make nwchem_config
make FC=gfortran

This will take a good 20-30 minutes.


Your binary will be in nwchem-6.0/bin/LINUX64/

Finally, see whether openmpi is already in your LD_LIBRARY_PATH

echo $LD_LIBRARY_PATH
/lib/openmm:/usr/lib/nvidia-cuda-toolkit:/usr/lib/nvidia
If not, edit ~/.bashrc and add
export LD_LIBRARY_PATH=/usr/lib/openmpi/lib:$LD_LIBRARY_PATH
export PATH=$PATH:/home/me/tmp/nwchem-6.0/bin/LINUX64
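
Eyeballing a long LD_LIBRARY_PATH is error-prone, so the check can also be scripted. A small sketch, assuming the openmpi libraries live in /usr/lib/openmpi/lib as above; in_search_path is a hypothetical helper name.

```python
import os


def in_search_path(directory, value):
    """Return True if directory is one of the colon-separated entries in value."""
    wanted = os.path.normpath(directory)
    entries = [entry for entry in value.split(":") if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


current = os.environ.get("LD_LIBRARY_PATH", "")
if in_search_path("/usr/lib/openmpi/lib", current):
    print("openmpi is already in LD_LIBRARY_PATH")
else:
    print("add /usr/lib/openmpi/lib to LD_LIBRARY_PATH in ~/.bashrc")
```

Comparing normalised entries rather than searching for a substring avoids false positives from similarly named directories and treats a trailing slash as the same path.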


113. Using ECCE to run nwchem jobs

EDIT: This post is getting messier as I'm hammering things out...but I've gotten everything to work in the end, so please persist.  The workflow described below is not the ideal one, but it'll get you started. I'll link here when I put up a newer, more reasonable tutorial.

EDIT2: I'm really warming to ECCE as I'm learning more about it. I still think it'd be nice if it was open source, and I can't understand why it has to be reliant on csh (which is pretty much broken on ROCKS, and uncomfortable at the best of times), but it's pretty neat once you've got all the details ironed out. Error feedback/report could be better though.

EDIT 3: ECCE is going open source in the (northern) summer of 2012! As users we no longer have any excuses to complain.

Here's a quick introduction to getting started with using ECCE as the interface to nwchem, similar to how gaussview can be used to set up gaussian jobs.

This presumes that you've set up ECCE and preferably compiled your own version of nwchem:
http://verahill.blogspot.com.au/2012/03/ecce-on-debian-but-not-on-rockscentos.html
http://verahill.blogspot.com.au/2012/03/nwchem-61-with-openmpi-on-rocks.html
http://verahill.blogspot.com.au/2012/01/debian-testing-64-wheezy-nwhchem.html


##Important##
Once I had figured all of this out I rebuilt nwchem and re-installed ecce in the proper locations. You might want to do the same.

A. If you're going to use several nodes you should put nwchem in the same position in the file system hierarchy on all nodes e.g.
/opt/nwchem/nwchem-6.0/bin/LINUX64/nwchem

Also, make sure you share a folder (see how to use NFS) between the nodes which you can use for run time files e.g. /work

EDIT 4: This (probably) isn't necessary. In fact, using NFS in the wrong way will slow things down.

Set the permissions right (chown your user and set to 777 -- 755 is enough for nfs sharing between debian nodes, but between ROCKS and Debian you seem to need 777), and open your firewall on all ports for communication between the nodes.

B. Make sure that ECCE_HOME has been set in ~/.bashrc e.g.
export ECCE_HOME=/opt/ecce/apps

and in ~/.cshrc
setenv ECCE_HOME /opt/ecce/apps

C.
edit /opt/ecce/apps/siteconfig/submit.site (location depends on where you install ecce)
Change lines 65+ from
#NWChemCommand {
#  $nwchem $infile > $outfile
#}
to (for multiple nodes)
NWChemCommand {
mpirun -hostfile /work/hosts.list -n $totalprocs --preload-binary /opt/nwchem/nwchem-6.0/bin/LINUX64/nwchem $infile > $outfile
}
This uses mpirun for parallel job submission and assumes you have a hosts file in /work. For running on a single node you can use


NWChemCommand {
mpirun  -n $totalprocs $nwchem  $infile > $outfile
}

Use either --preload-binary /opt/nwchem/nwchem-6.0/bin/LINUX64/nwchem or $nwchem -- see what works for you. You probably can't preload the binary if you're running different linux distros (e.g. debian and centos) on the nodes.
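
To see what command line a template like the single-node one turns into, the placeholders can be filled in by hand. The sketch below uses Python's string.Template purely as an illustration, because it happens to share the $name syntax; ECCE does its own substitution, and the processor count and file names here are made-up values.

```python
from string import Template

# The single-node NWChemCommand body from submit.site, copied verbatim.
SINGLE_NODE = Template("mpirun  -n $totalprocs $nwchem  $infile > $outfile")

command = SINGLE_NODE.substitute(
    totalprocs=4,                                         # assumed core count
    nwchem="/opt/nwchem/nwchem-6.0/bin/LINUX64/nwchem",
    infile="sicl4.nw",                                    # hypothetical file names
    outfile="sicl4.out",
)
print(command)
```

Running the resulting line by hand in a terminal is a quick way to check that mpirun and the nwchem path work before blaming ECCE for a failed launch.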

My hosts.list looks like this:

tantalum slots=4 max_slots=4
beryllium slots=4 max_slots=5

Make sure that you don't accidentally put 2 jobs on node 0, then 2 jobs on node 1, then another 2 jobs on node 0, since they won't be consecutively numbered and will crash armci. You can avoid this by setting slots and max_slots to the same number.
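
The slots check is easy to automate. A minimal sketch, assuming a hostfile in the simple "host key=value ..." layout shown above with integer values; parse_hostfile and mismatched_slots are hypothetical helper names.

```python
def parse_hostfile(text):
    """Parse 'host key=value ...' lines into (host, options) pairs."""
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        options = {}
        for field in fields[1:]:
            key, _, value = field.partition("=")
            options[key] = int(value)
        hosts.append((fields[0], options))
    return hosts


def mismatched_slots(hosts):
    """Return the hosts whose max_slots differs from slots."""
    return [
        host for host, opts in hosts
        if opts.get("max_slots", opts.get("slots")) != opts.get("slots")
    ]


HOSTS = """\
tantalum slots=4 max_slots=4
beryllium slots=4 max_slots=5
"""

hosts = parse_hostfile(HOSTS)
print(sum(opts["slots"] for _, opts in hosts))  # total slots: 8
print(mismatched_slots(hosts))                  # ['beryllium']
```

With the file as listed, beryllium is flagged because its max_slots is larger than its slots.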


D.
You may have to edit /etc/openmpi/openmpi-mca-params.conf if you have several (real or virtual) interfaces and add e.g.


btl=tcp,sm,self
btl_tcp_if_include=eth1,eth2
btl_tcp_if_exclude=eth0,virtbr0


Start ECCE:
First start the server
csh /home/me/tmp/ecce/ecce-v6.2/server/ecce-utils/start_ecce_server
then launch ecce

ecce

This will launch what the ecce people call the 'gateway':
The Gateway

0. Make sure you've got your machine set up
Click on Machine browser
Make sure that you can connect to the node e.g. by clicking on disk usage

Set the application paths. Don't fiddle with nodes -- just change number of processors to the total for all nodes.



1. Draw SiCl4 
Click on the Builder in the Gateway, which gives you the following:
The builder window

Click on More to get the periodic table which gives you access to Si

Select Geometry -- here, Tetrahedral

Si -- with four 'nubs' (yup, that's what the ecce ppl call them)

Time to attach Cl atoms to the nubs. Select Cl and pick Terminal geometry.

Click on a 'nub' to replace it with a Cl

And do it until you've replaced all 'nubs'. Hold down right mouse button to rotate

Click on the broom next to the bond menu on the right to pre-optimize  the structure using MM

And save. You will probably be limited to saving your jobs in folders below the ecce  folder.


2. Set up your job
Click on the Organizer icon in the 'gateway', which takes you here:

Click on the first icon, Editor

Focus on selecting Theory and Run type. Here we'll do a geometry optimisation.

Click on Details for Theory

Click on Details for Run type

Constraints are optional

In the organizer, click on the third icon to set the basis set. Atoms that are defined for a particular basis set are indicated by an orange lower-right corner

You can get Details about the basis set

If you don't have a Navy Triangle you can't run. Click on Editor and see what might be wrong.

Ready to run. Click on Launch.
4. Running
I'm still working on enabling more than a single core...
Once you've clicked on launch you'll get

If you click on viewer you can monitor the job

Optimization in progress
5. Re-launch a job at higher theory
In the Organizer, select your last job and then click on Edit, Duplicate Setup with Last Geometry
You then get a copy to edit

Change the basis set, save, then click on Final Edit

This is the nwchem input file in a vim instance

Add a line to the end, saying task scf freq, to calculate the vibrations (there's another job option called geovib which does optim+freq, but here we do it by hand)
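
If you'd rather script this than edit in vim, appending the line is a one-liner in spirit. A small sketch, assuming the input is available as plain text; add_task is a hypothetical helper and the two-line input is a stand-in, not the file ECCE generates.

```python
def add_task(nw_input, task="task scf freq"):
    """Return the input text with one extra task line appended at the end."""
    lines = nw_input.rstrip("\n").split("\n")
    if lines[-1].strip().lower() == task:
        return nw_input  # already present as the final line
    return "\n".join(lines + [task]) + "\n"


example = "start sicl4\ntask scf optimize\n"  # stand-in input text
print(add_task(example))
```

The function leaves the text alone if the last line is already the requested task, so running it twice does not queue the frequency calculation twice.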

Launch

Running...

You can now look at the vibrations

And you can visualise MOs -- here's the HOMO which looks like all isolated p orbitals on the chlorine

You can also calculate 'properties'

These include GIAO shielding

Performance:
Here's phenol (scf/6-31g*) across three gigabit-linked nodes. The dotted line denotes node boundaries.


Here's a number of alkanes (scf/6-31g) on 4 cores on a single node: