
Cluster HOW-TO


Cluster information
Account setup
Compiling and running jobs
Tips for Optimizing Application Performance on the Cluster

Cluster information

The CTBP cluster (ctbp1.ucsd.edu) consists of 120 Dell PowerEdge 2650 nodes, each with two 2.8 GHz Xeon processors and 1 GB or 2 GB of RAM. The cluster uses gigabit interconnects and runs the NPACI Rocks clustering software.

The Sun Grid Engine (SGE) queuing system has been installed and configured on the cluster. All non-interactive jobs must be submitted through SGE. See the SGE How-To for more information. Running non-interactive jobs outside of the queuing system is a violation of the CTBP acceptable use policy.


Account setup

If you see this message when running your jobs:
"/usr/X11R6/bin/xauth: error in locking authority file /home/username/.Xauthority"
add these lines to your ~/.ssh/config file:
Host compute*
ForwardX11 no
Host c?-*
ForwardX11 no

Compiling and running jobs

In addition to the GNU compilers, the high-performance Intel C/C++/F90/F95 compilers have been installed on the cluster front-end (ctbp1.ucsd.edu). The installation also includes the Intel Linux Debugger (LDB).

To use the compilers, modify your Makefiles to use icc as the C/C++ compiler and ifort as the F77/F90 compiler. You don't have to set PATH or any other environment variables; icc and ifort should work immediately.

Documentation for the Intel compilers, debugger and libraries can be found in /soft/linux/share/intel/compiler80/docs. Please read the license information in this directory before using the compilers.

Note: If your code uses standard Unix services (etime, call exit(), etc.), don't forget to link it with the -Vaxlib option (which is not the default!).

There are currently several versions of the MPI libraries installed on the cluster: MPICH compiled with Intel's icc/ifc (/opt/mpich/intel), MPICH compiled with GNU's gcc/g77 (/opt/mpich/gnu) and MPICH-MPD compiled with gcc/g77 (/opt/mpich-mpd). You have to link your application with the MPICH library that was built with the same compiler you use for your code. You also have to use the mpirun that belongs to that MPICH installation.

For example, if you want to use the Intel compilers (which is recommended) you need to use mpif90 (or mpif77 or mpicc) from /opt/mpich/intel/bin to compile your code. To actually launch it on the cluster you have to use /opt/mpich/intel/bin/mpirun in your SGE script.
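
The matching rule above can be kept in one place in a build or job script. The following is a minimal sketch, not a supported site tool: the helper name mpich_prefix and the variable names are hypothetical, the three prefixes are the ones listed above, and the bin/ layout under /opt/mpich/gnu and /opt/mpich-mpd is assumed to mirror /opt/mpich/intel/bin.

```shell
# Hypothetical helper: map a compiler family to its MPICH install prefix,
# so the compile wrapper and mpirun always come from the same build.
mpich_prefix() {
    case "$1" in
        intel) echo /opt/mpich/intel ;;
        gnu)   echo /opt/mpich/gnu ;;
        mpd)   echo /opt/mpich-mpd ;;
        *)     echo "unknown MPI build: $1" >&2; return 1 ;;
    esac
}

MPI_HOME=$(mpich_prefix intel)      # Intel-compiled MPICH (the recommended choice)
MPIF90="$MPI_HOME/bin/mpif90"       # use this wrapper to compile and link
MPIRUN="$MPI_HOME/bin/mpirun"       # use this launcher in the SGE script
echo "compile with: $MPIF90"
echo "launch with:  $MPIRUN"
```

Deriving both paths from a single prefix makes it hard to compile with one MPICH build and launch with another by accident.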


Tips for Optimizing Application Performance on the Cluster

General tips

  • Try to use the local disk on the nodes for scratch files. Local disk access is much faster (5-8 times) than data transfer over NFS to your home directory. Each node has a /scratch filesystem that can be used for this purpose. /scratch is persistent storage: files saved there are preserved across a node reboot or crash. Using /scratch will most likely speed up your job significantly, especially if it is I/O bound.

    General strategy for file staging, i.e., using /scratch for your job:

    • Create a temporary directory in /scratch on the node.
    • Copy all input files from your home directory to the temporary directory on the node.
    • Start your application from the temporary directory.
    • After your application is done, copy all files back to your home directory.
    • Delete the temporary directory.

      All of this can be accomplished by copying the following lines into your SGE script and adapting them:

      # create a temporary directory
      mkdir -p /scratch/your_username
      # copy the input files to the node
      cp /home/your_username/path/to/your/files/* /scratch/your_username
      cd /scratch/your_username
      
      # now execute your application
      /some/path/application
      
      # copy the results back and clean up
      cp /scratch/your_username/* /home/your_username/path/to/your/files
      rm -rf /scratch/your_username
      
  • Set the SGE walltime resource carefully. Try to estimate the upper limit of your job's execution time and set walltime to this value (plus some margin). Grossly overestimating this value can hold your job in the idle queue, and a job with a shorter walltime might be run ahead of yours.
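
The file-staging steps from the first tip can also be wrapped in a small shell function. This is a minimal sketch under stated assumptions, not an official cluster script: the function name stage_and_run and the SCRATCH_ROOT variable are hypothetical. It uses mktemp -d so that two jobs of the same user on one node do not share a directory, and it removes the scratch directory even when the application fails.

```shell
# Hypothetical helper: stage inputs to node-local scratch, run a command
# there, copy everything back, and always remove the scratch directory.
# SCRATCH_ROOT defaults to /scratch (the per-node filesystem described above).
stage_and_run() {
    local src_dir=$1; shift          # directory holding the input files
    local scratch_root=${SCRATCH_ROOT:-/scratch}
    local work_dir
    local status=0

    # unique per-job directory under the scratch root
    work_dir=$(mktemp -d "$scratch_root/${USER:-job}.XXXXXX") || return 1

    # copy the inputs, then run the command from inside the scratch directory
    cp -r "$src_dir"/. "$work_dir"/ || { rm -rf "$work_dir"; return 1; }
    ( cd "$work_dir" && "$@" ) || status=$?

    # copy all files back to the source directory and clean up
    cp -r "$work_dir"/. "$src_dir"/
    rm -rf "$work_dir"
    return $status
}
```

In an SGE script this would be called as, for example, stage_and_run /home/your_username/path/to/your/files /some/path/application; the function returns the application's exit status.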

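For the walltime tip, the "estimate plus some margin" arithmetic can be scripted. The sketch below is illustrative only: the helper name walltime_with_margin and the 20% default margin are assumptions, and it further assumes that the limit is requested in HH:MM:SS form (as with SGE's standard h_rt resource); check the SGE How-To for the resource name used on this cluster.

```shell
# Hypothetical helper: turn an estimated runtime (in minutes) plus a safety
# margin (in percent, default 20) into an HH:MM:SS wall-clock limit.
walltime_with_margin() {
    local est_min=$1
    local margin_pct=${2:-20}
    local total_sec=$(( est_min * 60 * (100 + margin_pct) / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total_sec / 3600 )) $(( total_sec % 3600 / 60 )) $(( total_sec % 60 ))
}

walltime_with_margin 90      # 90 minutes + 20% margin, prints 01:48:00
```

The printed value can then be placed in the resource request of the job script instead of a generous guess.
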
Running parallel applications

  • Don't run your parallel jobs on more than 8 CPUs. Most applications do not scale well on Ethernet-interconnected Beowulf clusters, and there is a big penalty for inter-process communication. The sweet spot for most applications appears to be around 4-6 CPUs. Increasing the number of CPUs beyond this doesn't decrease the wall clock time and can actually increase it (the child processes spend too much time communicating with each other and with the master process).
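
One way to find the sweet spot for your own code is to time a short, representative run at a few CPU counts and compare the parallel efficiency. The following is a rough sketch: the helper name parallel_efficiency is hypothetical and the timings are made up for illustration, not measured on this cluster.

```shell
# Hypothetical helper: parallel efficiency in percent (integer, rounded down),
# from the wall-clock time on 1 CPU (t1) and on n CPUs (tn), both in seconds.
#   efficiency = 100 * t1 / (n * tn)
parallel_efficiency() {
    local t1=$1
    local tn=$2
    local n=$3
    echo $(( 100 * t1 / (n * tn) ))
}

# Made-up timings for illustration only.
parallel_efficiency 1200 400 4    # prints 75
parallel_efficiency 1200 300 8    # prints 50
```

If the efficiency drops sharply between two CPU counts, the smaller count is usually the better request, since the extra CPUs are mostly spent on communication.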


Please direct any questions or comments related to this web page to ctbp-help @ ctbp.ucsd.edu
Last modified: September 19 2008 10:50:00 am.