It's possible to use GNU Parallel to run jobs on multiple nodes of a cluster, or even across otherwise unrelated machines over the internet, as long as each node is reachable from the "master" node (where you run the parallel command) via SSH.

The following example gives a sense of the syntax of a cross-machine parallel command:

parallel --sshlogin 3/:,10/gc2,10/gc3 --basefile stoplist --basefile topics.101-200.title --basefile AP_88-89 --return mu:{1},fbD:{2},fbT:{3},lam:{4} --cleanup "IndriRunQuery stoplist topics.101-200.title -trecFormat=true -index=AP_88-89 -rule=method:d,mu:{1} -fbDocs={2} -fbTerms={3} -fbOrigWeight={4} > mu:{1},fbD:{2},fbT:{3},lam:{4}" ::: 50 250 500 1000 2500 5000 10000 ::: 5 10 20 40 50 75 100 ::: 5 10 20 40 50 75 100 ::: $(seq 0 0.1 1)

The key points:

  • --sshlogin #parallelJobs/hostname,...
  • --basefile file_needed_on_remote_machine
  • --return files_from_remote_machine
  • --cleanup (remove --basefile and --return files from remote machines)

Now in painstaking detail:

  • --sshlogin indicates that the jobs should be spread across the specified machines: the master node (denoted by a colon), gc2, and gc3. In this example, up to 3 jobs run concurrently on the master node and up to 10 each on gc2 and gc3.
    • 10/gc2 means run 10 jobs on the machine named gc2. gc2 can be any host name accessible via SSH.
    • If you want to run a job on your master node, the best option is to use the colon, :. You can also specify localhost, but that will cause issues with file transfers (described below) and is probably less efficient.
    • If the remote machine requires a password, parallel will prompt for it.
    • Different remote machines can be assigned different numbers of jobs, so machines with different numbers of CPUs can work at different rates. If no number is specified, parallel will attempt to detect the number of CPUs (but in my experience it fails and defaults to 1 job).
  • The example above uses --basefile to send each required file (stoplist, topics.101-200.title, and the AP_88-89 index) to each remote machine via rsync before the execution of IndriRunQuery begins. Because IndriRunQuery requires a bunch of environment setup, I have assumed above that it is already available in the $PATH of each machine. However, if IndriRunQuery were a regular script that could be easily distributed, it could also be included as a --basefile.
    • Note that --cleanup will remove each of the --basefile files as well as each of the --return files (more on --return below).
    • This is where the colon vs. localhost distinction comes into play. If you specify localhost instead of the colon, the --cleanup option will delete your basefiles from your master node as well as from the remote machines, i.e. it will permanently delete them from every machine. If you use the colon, parallel knows not to delete those files from the master node.
  • Notice that the stdout redirection is within the quotes in the command above. This is because each file name depends on the parameters passed by parallel to each instance of the IndriRunQuery command. However, because stdout is redirected within the parallelized command, the output files end up scattered across localhost, gc2, and gc3. You can get all these files back onto your master node using the --return option.
    • --return specifies the names of the files produced on the remote machines to be rsync'ed back to your master node. In the above example, every file produced by the parallelized IndriRunQuery command will be copied back to the master node when the job is complete.
    • --return only copies the files back; to also remove the files from the remote machines, you can specify --cleanup, which will delete them after they have been copied to the master node.
    • If the output of your parallelized command should ultimately be placed into a single file, this is much simpler: just redirect stdout after the parameters. The code will be run on each machine, but the output will be written to a single file on your master node. For example:

      parallel --sshlogin 3/:,10/gc2 echo {} ::: $(seq 0 200) > numbers.txt

      will write a single file, numbers.txt, to your master node containing the numbers 0 through 200 (possibly out of order). If you do not redirect stdout, it will of course be printed to your terminal.
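The per-parameter output-file pattern described above can be sketched with a toy job. This is a hedged sketch, not taken from the original example: here the sshlogin is the colon (master node only), where --return and --cleanup are harmless no-ops, so the same command can be dry-run locally; in real use you would substitute a remote host such as 10/gc2.

```shell
# Each job writes its own file named after its parameter value.
# With a remote sshlogin, --return out.{} would rsync each file
# back to the master node and --cleanup would remove the remote copy.
parallel --sshlogin 2/: --return out.{} --cleanup \
  "echo processed {} > out.{}" ::: a b c

cat out.a out.b out.c
# prints "processed a", "processed b", "processed c"
```

Note that --return accepts the same replacement strings ({}, {1}, {2}, ...) as the command itself, which is how parallel knows which file each job produced.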

I also like to use the --eta option when running parallel to get job progress (even across multiple machines) and an estimated time to completion. I've never known the estimate to be remotely accurate, but it gives me some feedback that the job is progressing. Certainly not required, but a worthwhile thing to do in my opinion.
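A minimal sketch of --eta, again using the colon (master node only) so it can be tried without any remote hosts; the progress/ETA display goes to stderr, so it does not pollute redirected stdout:

```shell
# --eta prints a running job count and estimated completion time
# to stderr; the jobs and their stdout behave exactly as before.
parallel --eta --sshlogin 2/: echo {} ::: 10 20 30 40 > out.txt
```

The same flag works unchanged with a multi-host --sshlogin list, reporting progress across all machines combined.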
