Contributed by Riccardo Di Meo (EUIndiaGRID Project - ICTP).

The original page, with other material, can be found here

A gsiftp version of this exercise can be found here: Matmul - gsiftp version

Reserve SMP nodes requires Python 2.5 on the user interface

Overview

This example consists in the execution of a simple set of matrix multiplications on the grid, using the Goto Blas library in order to automatically optimize the computation on smp processors and reserve SMP nodes to submit the job on a glite WMS-based grid.

The important concepts pointed out here are:

  1. using threads in scientific computing is often very easy
  2. through the use of the Goto Blas library, very good performances and scalability can be obtained automatically
  3. though the grid doesn't provide the facilities to to submit threaded code, using the reserve SMP nodes it is actually possible to execute such code easily and in a safe way.
  4. the two submission scripts, the simple way in BASH and a real-time feedback using Python and XMLRPC show that porting an application in such a way that it benefits from reserve_smp_nodes can be performed with various investment of time and with different results, and that both the advanced and the average user can tackle the task.

All steps and the meaning of each operation are explained in detail, thus making this a good introduction to the next exercises (which are more syntetic).

The application

We developed a simple test program matmul_bench, the source of which can be found here:

 http://www.ictp.it/~dimeo/matmul_example/matmul_bench.f90

it's only action is to repeatedly perform matrix multiplications between different matrices of increasing size using the BLAS library and print, for each matrix size, the throughput (number of floating point operations per seconds) and the time required to perform the repetitions.

What's more important, is that, though the program has been decently written, no special optimization was implemented and no parallelization was included in the code at all.

This latter point is important, since we are going to rely completely on the cleverness of the GotoBLAS library for both optimization (as it's natural) and parallelization: the Goto implementation of the BLAS library can, in fact, use multiple threads on a SMP and/or multi-core machine to speed up the it's own routines.

This is a somewhat strange approach, since programs are usually parallelized manually by the user (e.g. through the use of threads or the insertion of MPI calls) where with this library, code using the BLAS routines is automatically split between different processes which will run on different CPUs, if available (this usually cannot be done, due to performances and tuning issues, between processes running on different nodes).

To download the GotoBLAS library or know more about it, check:

 http://www.tacc.utexas.edu/resources/software/#blas

Using the reserve_smp_nodes, we will run this program (which is thread enabled and automagically optimized) to the grid and try to get an idea of the performance gain obtained by running the code on a single node with multiple processors: to do that, we will try to approaches, the first very simple, using a plain .sh file (much like the examples pointed in the reserve_smp_nodes page) and a second, written in python and based on a client-server architecture with interactivity and real time feedback.

Keep in mind, that this is, as far as i know, the only way in which multi thread jobs can be run on the grid.

Files required to execute the examples

Precompiled binaries, statically linked, required to run this example can be found in a bzip2 compressed package at the address:

 http://www.ictp.it/~dimeo/matmul_example/matmul_binaries.tar.bz2

This archive contains two binaries: matmul and matmul_short, both performing the same operations, though the first takes ~ 2 hours on a P4 2.4Ghz where the second only 5-10 minutes on the same processor.

The source of the aforementioned binaries can be found at the address:

 http://www.ictp.it/~dimeo/matmul_example/matmul_bench.f90

though recompilation of the code should not be necessary, and the scripts used to run the code with the reserve_smp_nodes can be found here:

 http://www.ictp.it/~dimeo/matmul_example/matmul_plain_lcg.sh

and here (for the advanced XMLRPC version of the scripts):

 http://www.ictp.it/~dimeo/matmul_example/matmul_xmlrpc_server
 http://www.ictp.it/~dimeo/matmul_example/matmul_xmlrpc_client

The reserve SMP nodes version which will be used for this example can be found here:

 http://www.ictp.it/~dimeo/reserve_smp_nodes-1.5.tar.bz2

Simple straightforward version ( matmul_plain_lcg.sh).

This very simple script can be used to execute matmul with reserve_smp_nodes and provides a bare bones example about porting applications for it.

The script performs the following operations:

  1. Gets the matmul binary from the grid (both binary and location are specified on the command line)
  2. Sets the permissions for the executable and the OMP_NUM_THREADS variable appropriately
  3. Run the matmul code and saves the standard output of matmul and other logging information to 2 local files
  4. Sends the files with the output and logs in the grid, marking it with an identifier passed as argument

The arguments that should be passed to it are, as specified in the comments inside the script:

  1. A name used to identify the output of the run once it will be saved in a SE
  2. a gsiftp location pointing to a directory for both read and write, where the matmul binaries will be downloaded and where the script will dump the output files.
  3. The name of the matmul binary which will be used for the execution (in our case only matmul and matmul_short, though other programs could be in principle be used, as long as they are executed in similar ways)

Executing matmul with the matmul_plain_lcg.sh script

Enter the reserve_smp_nodes-1.5 directory and download there the matmul_plain_lcg.sh script.

Dump the matmul and matmul_short binaries to a SE (for such a simple test, the /tmp directory will do, don't use it for other purposes though!):

$ wget -c -t0 http://www.ictp.it/~dimeo/matmul_example/matmul_plain_lcg.sh
$ wget -c -t0 http://www.ictp.it/~dimeo/matmul_example/matmul_binaries.tar.bz2
           '''(...)'''
$ tar xvjf matmul_binaries.tar.bz2
$ lcg-cr -d se-3.grid.seed  -l lfn:/grid/gridseed/someuser/matmul_short file:`pwd`/matmul_short
$ lcg-cr -d se-3.grid.seed  -l lfn:/grid/gridseed/someuser/matmul file:`pwd`/matmul

The former steps will be required only once, and are performed in order to put all requirements in place.

At this point we are almost ready to run the reserve_smp_nodes: we just need a file with will specify the tasks to be submitted to the grid.

Since we would like to run a scalability test, we would like to use a task list like this one (which we will call tasks.txt):

! ,
1 ./matmul_plain_lcg.sh 1proc,lfn:/grid/gridseed/someuser/,matmul_short
2 ./matmul_plain_lcg.sh 2proc,lfn:/grid/gridseed/someuser/,matmul_short
3 ./matmul_plain_lcg.sh 3proc,lfn:/grid/gridseed/someuser/,matmul_short
4 ./matmul_plain_lcg.sh 4proc,lfn:/grid/gridseed/someuser/,matmul_short

which would give us a complete description about how our code scales (if we will be able to reserve the required CPUs), however, since the virtual cluster doesn't have virtual nodes with more than 2 CPU, we will use one like this one:

! ,
1 ./matmul_plain_lcg.sh 1proc,lfn:/grid/gridseed/someuser/,matmul_short
2 ./matmul_plain_lcg.sh 2proc,lfn:/grid/gridseed/someuser/,matmul_short

(the former one was used for a benchmark on the EUIndiaGRID infrastructure. More on that later).

Now we only need to find a cluster which we know has multiple CPUs on a single node (we will use ce-1.grid.seed:2119/jobmanager-lcgpbs-gridseed in our example, which has 2 processors per node), start the reserve_smp_nodes and cross our fingers ;-) ):

 $ ./reserve_smp_nodes -T 1000 -J tasks.txt -r ce-1.grid.seed:2119/jobmanager-lcgpbs-gridseed -j 10

With the latter command we are submitting 10 jobs (though only 3 are required, in total, by our tasks, to account for the job loss associated with WNs partially owned by other users) to the specified queue and waiting for 1000 seconds (after the first contact with a WN) at most for them to start running.

The output should be similar to this one:


 $ ./reserve_smp_nodes -T 1000 -J tasks.txt -r ce-1.grid.seed:2119/jobmanager-lcgpbs-gridseed -j 10
 Checking port 24433...
 Starting to receive...
 All jobs correctly submitted!
 ** New connection established from 10.10.1.2:52078
    + Hostname received: ce-1wn2.grid.seed
 ** New connection established from 10.10.1.1:48900
    + Hostname received: ce-1wn1.grid.seed
 ** New connection established from 10.10.1.1:48903
    + Hostname received: ce-1wn1.grid.seed
 Fitting a 2 job for my 2CPUs node
 After '__best_matches' i have 1 tasks and 0 cpu free on the node

 Got a task for 2 proc to fit in the node
 Script './matmul_plain_lcg.sh' sent.
 Closing socket ce-1wn1.grid.seed (10.10.1.1:48903)
 Closing socket ce-1wn1.grid.seed (10.10.1.1:48900)
 ** New connection established from 10.10.1.2:52083
    + Hostname received: ce-1wn2.grid.seed
 Fitting a 1 job for my 2CPUs node
 After '__best_matches' i have 1 tasks and 1 cpu free on the node

 Got a task for 1 proc to fit in the node
 Script './matmul_plain_lcg.sh' sent.
 Closing socket ce-1wn2.grid.seed (10.10.1.2:52078)
 All tasks have been assigned!
 Out of the receiving cycle
 Closing the remaining resources
 2 tasks submitted for execution

In this case, the reservation worked smoothly, assignign each task before the limit.

However since the former is not a representative examle for a prouction scenario, to give a more complete idea of what the reserve SMP nodes can do, we post here the output of an analogous submission, on the production grid:


 $ ./reserve_smp_nodes -T 1000 -J tasks.txt -r ce-1.grid.seed:2119/jobmanager-lcgpbs-gridseed -j 10
 Checking port 23594...
 Starting to receive...
 All jobs correctly submitted!
 ** New connection established from 131.111.66.212:32863
    + Hostname received: farm012
 ** New connection established from 131.111.66.237:32779
    + Hostname received: farm037
 ** New connection established from 131.111.66.212:32875
           (...)
    + Hostname received: farm018
 ** New connection established from 131.111.66.244:32878
    + Hostname received: farm044
 Timeout hit: about to fit the tasks into the available resources
 Out of the receiving cycle
 Resources available:
 Node       131.111.66.217 1/4 owned.
 Node       131.111.66.212 3/4 owned.
           (...)
 Node       131.111.66.215 2/4 owned.
 - sending a 3-task to a 3-processors node
 Script './matmul_plain_lcg.sh' sent.
 Closing socket farm012 (131.111.66.212:32887)
 Closing socket farm012 (131.111.66.212:32875)
 Closing socket farm012 (131.111.66.212:32863)
   task sent
 - sending a 2-task to a 2-processors node
 Script './matmul_plain_lcg.sh' sent.
 Closing socket farm015 (131.111.66.215:33118)
 Closing socket farm015 (131.111.66.215:33106)
   task sent
 - sending a 1-task to a 1-processors node
 Script './matmul_plain_lcg.sh' sent.
 Closing socket farm020 (131.111.66.220:33073)
   task sent
 Fit of the remaining resources terminated
 6 more cpu executing 3 more tasks
 '''Some taks have not been assigned:
         * 4 CPUs:  - script: ./matmul_plain_lcg.sh args: 4proc se-01.grid.sissa.it/tmp/ matmul_short'''
 Closing the remaining resources
 3 tasks submitted for execution

with more tasks, and with some of them assigned at the timeout.

As you can see, digging among the output (in this version some debugging messages have been left in the code, they are likely to disappear in the next versions), the reserve_smp_nodes was able to gather enough CPUs to fit 3 tasks, the ones of 1, 2 and 3 CPUs, where we weren't able go send the 4 CPU task (which may be due to a number of reasons: we didn't submitted enough jobs, the cluster had all nodes already occupied by at least one job or we didn't waited enough).

At this point, the directory at the location lfn:/grid/gridseed/someuser/ should be periodically inspected for the files:

 1proc_matmul_short.txt
 1proc_output.txt
 2proc_matmul_short.txt
 2proc_output.txt

to appear. The files ending with _output.txt, containing only log info can be ignored, where the files ending like _matmul_short.txt contain the output of the command:

 $ /usr/bin/time matmul_short

Here is a chunk of 2proc_matmul_short.txt:


   256  6331.981     0.265
   286  6535.527     0.358
   316  7339.364     0.430
           (...)
   946  6959.145    12.165
   976  7142.297    13.017
  1006  7091.956    14.356
 264.94user 1.16system 2:21.41elapsed 188%CPU (0avgtext+0avgdata 0maxresident)k
 0inputs+0outputs (0major+68918minor)pagefaults 0swaps

The output can be divided in 2 parts: the lines with the 3 column output are printed by matmul_short, each line is the profiling of a number of matrix multiplication (the first column is the size of the matrices involved, the second is the throughput, in mflops, and the third is the number of seconds required to perform the multiplications).

Confronting the output for different processors, should show that the scalability is very good (where the perfect situation is when, roughly doubling the number of CPU involved in the computation halves the time required to perform it).

Here we present the results of the test on the production grid (where more than 2 CPUs per node can be collected):

Graph of the performances of matmul_short for 1, 2 and 3 cpus

The last 2 lines are instead the output of the /usr/bin/time command, which shows how much time was taken to execute the code, how much CPU time was consumed and the average percentage of the CPU involved (which can be far more than 100, if more than 1 CPU is involved, as in our example).

A quick and enlightening feedback about the performances of the GotoBLAS library comes also from the last 2 lines: the average usage of the CPU should be, for an N processors simulation, near N*100% and the wall time required to run the binary on it should be near T/N seconds, where T is the time required to run the same binary on a single processor of the same architecture.


 ==> 1proc_matmul_short.txt <==
 247.92user 0.71system 4:18.13elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
 0inputs+0outputs (0major+68896minor)pagefaults 0swaps

 ==> 2proc_matmul_short.txt <==
 264.94user 1.16system 2:21.41elapsed 188%CPU (0avgtext+0avgdata 0maxresident)k
 0inputs+0outputs (0major+68918minor)pagefaults 0swaps

Keep however in mind that the previous test is far too small and inaccurate to give a correct picture about the scalability of the code: more data should have been gathered and the matmul binary should have been used instead of it's short alternative.

A more complex example (with real-time feedback) using XMLRPC

Though the previous example is probably enough in most situations (and may be easily adapted to other codes), we will demonstrate here another way in which the porting of the matmul program has been done, which due to it's complexity will probably be more fit for the advanced users.

While for some tasks the previous solution may be the better one (since it allows for a kind of "fire and forget" submission model, very suited for many jobs), in other situations users would like to have a direct feedback about how a simulation is going, and even be able to interact with it in real time: in such scenarios, the former example is clearly too simple.

While the feedback problem could have been solved easily using the logging tool (also on this site), a two way interaction with the user is another story...

To give some clues about how such situations can be handled, in this example we use the SimpleXMLRPCServer module present in any standard python installation to set up a server (which can be run on any machine with inbound connectivity, though we will assume it will be running on the UI) which will receive the connections from another script (the client side), this time running on the WNs.

While, as in the previous exercise, the purpose of the script on the client is to prepare the environment for the matmul binary to start (keep in mind that the required cpus are already reserved as soon as the script starts, since reserve_smp_nodes handles this), it's very different the way in which such operation is performed, namely all the data is directly downloaded/upoaded from/to the UI, and the logging messages are printed directly on the user's screen.

One of the consequences of this setup, is that no SE is used at all: this is a mixed blessing, since while it is easier to create a simulation and the output is retrieved in real time, the time and bandwidth required to send the input data (in this case the executable only) to the WNs may make this workflow un-practical (though nothing forbids to mix this approach with the previous one).

Starting the interactive simulation: server side

Running the server without arguments returns us:

 $ ./matmul_xmlrpc_server
 Wrong number of arguments:
         ./matmul_xmlrpc_server <password> <listening port>

To start the simulation, first we need a suitable port for your server to listen on. To get it, the correct approach would be to use the command netstat, and pick a port among the available ones.

A more pragmatic approach to this problem, though, is to a) pick random number between 22000 and 25000 b) start the server and if it complains about the port being busy, going back to a)... let's say we will run the server on the port 22017.

To protect the server against requests from possibly malicious attackers or even against the risk of running more than one simulation on it at one time a password is also required: we will use foo, this time.

Though it's not specified, we also need the matmul binaries to be somewhere on the server filesystem: we will assume that they are in the same directory where we will start the server itself.

 $ ./matmul_xmlrpc_server foo 22017

and we can, at this point, leave the server where it is: it will do nothing until an incoming connection will be received from one of our jobs: submitting them will be our next task.

All this has to be performed only once.

Starting the interactive simulation: client side

Switch to another terminal (leave the server pending); starting the client returns:

 $ ./matmul_xmlrpc_client
 Wrong number of arguments:
         ./matmul_xmlrpc_client <password> <host> <port>

The password and port will be the same used for the server, since they must match, the host is the fully resolved hostname of the machine where the server is running (the UI in our case).

Therefore, a task file for this job (tasks2.txt) could be like this:

 2:./matmul_xmlrpc_client:foo;ui-1.grid.seed;22017

Note that multiple lines in this file should point to different servers and employ different ports and passwords.

At this point we can launch, as we already mentioned in the previous example, the reserve_smp_nodes:

 $ ./reserve_smp_nodes -T 1000 -J tasks2.txt -r ce-1.grid.seed:2119/jobmanager-lcgpbs-gridseed -j 10
           (...)

At this point we just have to wait that our reservation terminates successfully and then switch to the terminal with the server: if we did everything right, almost immediately the server will come to life and ask us on behalf of the client the name of the binary to run (matmul or matmul_short):


 $ ./matmul_xmlrpc_server foo 22017
 Starting...
 File to run execute?

which will be sent to the client and then executed while we will receive the results in real time:


 $ ./matmul_xmlrpc_server foo 22017
 Starting...
 File to run execute? ./matmul_short
 Getting the executable...
 The executable is 1155398 bytes long
 Setting the permissions
 Setting 'OMP_NUM_THREADS' to 2
 About to run the code.
   256  7952.493     0.211
   286  7380.820     0.317
   316  9094.878     0.347
   346  7021.704     0.590
   376  7467.056     0.712
   406  7693.523     0.870
             (...)
   826  7397.896     7.618
   856  7269.874     8.628
   886  7423.823     9.369
   916  7263.463    10.581
   946  7183.501    11.785
   976  7315.936    12.708
  1006  7275.386    13.994
 248.65user 1.16system 2:06.97elapsed 196%CPU (0avgtext+0avgdata 0maxresident)k
 0inputs+0outputs (0major+68918minor)pagefaults 0swaps
 Program terminated: shut down the server
 or run another simulation

The submission can then be repeated (you don't need to start the server every time) in order to execute different computations or the server shut down and the session terminated.

Note that, with very little effort, the client could be modified in order to do as many computation as we want, thus relieving us from re-submitting the jobs and trying the reservation again: this is a winning strategy for short interactive jobs which take less than 24 hours to execute (like most interactive computations) since it allows us to use better the resources and rely less on the reservation strategy!!!

We invite the advanced users to inspect the scripts and modifying them for their needs: however they have been made in such a generic way that any program can be launched with them (even /usr/bin/date :-)), and that any GotoBLAS enabled one can benefit from the multi-threaded environment (as long as the output is written to the standard output).

Since the wrapper used doesn't affects the performances, we will not investigate the results and the comments at the end of the first example still apply.