Getting started
The examples used throughout the workshop have been designed to run quickly. Phylogenomic investigations in practice are often much larger and may necessitate the use of computing clusters. If this is your first time using a cluster or command line software, some activities are provided below to help you get the most out of the computational exercises. There are four goals this morning:
- Log on to the UK Crop Diversity HPC
- Perform basic UNIX operations
- Run jobs interactively and with submission scripts
- Understand responsible use of the HPC
Some or all of this may be very familiar. These short instructions are about getting everybody on the same page for advancing later in the workshop. Please take time to help each other throughout the workshop too.
A few necessary downloads for your local computer are covered on the pre-workshop page, but the cluster has all of the necessary software for the analyses ready for you.
Log on to and navigate the UK Crop Diversity HPC
Now you will log into the cluster. Then you will copy materials from a shared folder to your user directory on the cluster. All analyses will happen in your scratch space. The code block below assumes you have followed the key authentication instructions for the HPC.
Note: `YOUR_USER_NAME` is your user name for the HPC.
ssh gruffalo
cd /mnt/shared/scratch/YOUR_USER_NAME
pwd
ls
mkdir network_workshop
cd network_workshop
pwd
ls
wget https://github.com/gtiley/RBG-Networks/raw/main/exercises/introduction.tgz
tar -xzf introduction.tgz
cd introduction
pwd
ls
Several things happened here. We changed directory with `cd` and checked the location of the present working directory with `pwd`. We then listed files with `ls`. These are UNIX commands that will help us navigate the cluster. We also made a directory named `network_workshop` with `mkdir`; it is important to avoid special characters and spaces when naming directories and files on the cluster. Finally, some files can be downloaded from elsewhere on the internet with `wget`.
Notice that a file exists in the `introduction` directory. Let's have a look and edit it with `nano hello.txt`. You can only use your arrow keys to move around, but you can edit the text directly on the cluster without any graphical software, which is a handy skill to have. Text editors such as `nano` have various capabilities, but the important things to remember are `control + o` to save and `control + x` to exit. Try to open, edit, and save changes to `hello.txt` before closing. You can admire your changes to the file with the command `less`. Try `less hello.txt` to view the file contents; you exit `less` by pressing `q`.
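For reference, that whole exercise is just two commands (keystrokes shown as comments):
nano hello.txt    # edit the file; control + o saves, control + x exits
less hello.txt    # view the file; press q to quit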
Try making a new file called `my_dreams.txt`. To do this, run `nano my_dreams.txt`. You could type something like "I am an expert on phylogenetic networks!" and then save and exit.
Let's combine two files with the concatenate command `cat`. Run `cat hello.txt my_dreams.txt > the_truth.txt`. Now look at the new file you created!
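Putting the last few steps together (what you type inside nano is up to you):
nano my_dreams.txt
cat hello.txt my_dreams.txt > the_truth.txt
less the_truth.txt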
A couple of other important commands to know are copy (`cp`) and move (`mv`). Try the following:
cp the_truth.txt copied_file.txt
ls
mv the_truth.txt moved_file.txt
ls
`cp` works like the name says it will: it copied the file and kept the original. `mv` renamed the file in this case. You can copy and move files to other directories too, and it is not necessary to change the file name when you do.
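For example, to copy or move files into another directory while keeping their names (the backup directory here is just an illustration):
mkdir backup
cp copied_file.txt backup/
mv moved_file.txt backup/
ls backup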
Now you can access the cluster, move files around, and create or edit them.
Moving files between your computer and the cluster
At some point, you will want to move files from your user directory on the cluster to your computer, or the other way around. There are some graphical programs that might be helpful, but we will work directly from the command line using the secure copy command `scp`. First, log off of the cluster and go back to your own computer with `exit`.
Let’s start by downloading materials for today. Open your command-line prompt (using git-bash or similar for Windows users). I suggest making a folder to organize all of the class materials in one place.
Note: These steps are happening on your computer and not the cluster
cd ~
ls
mkdir network_workshop
cd network_workshop
pwd
scp -r gruffalo:/mnt/shared/scratch/YOUR_USER_NAME/network_workshop/introduction .
ls
The entire `introduction` directory should now be on your computer. The whole directory was downloaded because we used the `-r` flag, which means recursive. Feel free to create new files or change existing ones with your graphical text editor (BBEdit, Notepad++, or otherwise) and save them. You can put files from your computer onto the cluster with `scp` too.
cd introduction
scp *.txt gruffalo:/mnt/shared/scratch/YOUR_USER_NAME/network_workshop/introduction
If using `scp` with the `gruffalo` alias did not work, try providing the full address:
scp -r YOUR_USER_NAME@gruffalo.cropdiversity.ac.uk:/mnt/shared/scratch/YOUR_USER_NAME/network_workshop/introduction .
scp *.txt YOUR_USER_NAME@gruffalo.cropdiversity.ac.uk:/mnt/shared/scratch/YOUR_USER_NAME/network_workshop/introduction
Running submission scripts
On a cluster, there are login nodes and compute nodes. You should never run an actual analysis that requires notable memory or disk space on a login node (where we have been this whole time). Instead, a scheduler is used to manage requests from users and to queue and execute them in an orderly manner.
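If you are curious about what the scheduler is managing, two standard SLURM commands show the available partitions and the current queue:
sinfo                      # list partitions and the state of their nodes
squeue -u YOUR_USER_NAME   # list your own queued or running jobs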
Log back on, go to your `introduction` directory, and run `ls`. We want to look at `test_submission.sh`; you can do this with `nano`, `less`, or even `cat`. I will not define all of the SLURM directives here, but we will discuss them in real time.
#!/bin/bash
#SBATCH --job-name=test_submission
#SBATCH --output=test_submission.log
#SBATCH --mail-user=g.tiley@kew.org
#SBATCH --mail-type=FAIL,END
#SBATCH --time=6:00:00
#SBATCH --mem-per-cpu=2G
#SBATCH --cpus-per-task=4
#SBATCH --partition=debug
[[ -d $SLURM_SUBMIT_DIR ]] && cd $SLURM_SUBMIT_DIR
echo "Date = $(date)"
echo "Hostname = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"
for i in {1..10}
do
echo "begin step $i"
done
Do not forget to change your email address! If you forget, I will find out. After having a look and editing, submit the script with `sbatch test_submission.sh`.
The job will complete quickly, but if you are fast enough you might find it in the queue with `squeue -u YOUR_USER_NAME`. Check the log file to see that it ran correctly. There should be some information about resource allocation and a list of numbers 1 to 10.
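Once the job has finished, you can view the log file named in the --output directive, for example:
less test_submission.log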
Most of our activities today will happen interactively. This is largely for educational purposes, but it can be helpful for checking analyses are working as intended before submitting a job for days or weeks. Try allocating some resources on an interactive node like this:
srun --cpus-per-task=4 --mem-per-cpu=2G --partition=debug --pty bash
You are no longer on the login node; you should now be on a compute node. Here you can safely execute the simple counting bash script with `./count.sh`.
If the script did not run for you, you might need to change the permissions to make it executable. Check the permissions of the files in the directory, make `count.sh` executable with `chmod`, then check the permissions again and execute:
ls -l
chmod u+x count.sh
ls -l
./count.sh
You should now see the numbers 1 to 10 printed to your screen on the cluster. When you are finished, type `exit` to leave the compute node and return to the login node. Make sure that you are always running analyses on a compute node, either by allocating resources with `srun` or by submitting a job to the scheduler with `sbatch`. You should now be able to log on to the cluster, find your way through directories, edit files, and submit jobs responsibly. A topic that will come up throughout the workshop is allocating the number of CPUs correctly so that you get the benefits of working on a compute cluster.
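As a sketch of what matching the allocation to the program looks like inside a submission script (the program name and its thread flag below are hypothetical; check each program's documentation for the real option):
#SBATCH --cpus-per-task=4
# ask the program for the same number of threads that SLURM allocated
some_program --threads $SLURM_CPUS_PER_TASK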
Making software for the workshop available
Instructions for making all necessary software available on the cluster are here.
Some additional notes about scripts
The details will not come up too much over the workshop, but at various points you might be asked to run scripts. This is often a catch-all term for programs written in human-readable languages that are interpreted at run time, often with good functionality for handling text data. The most popular scripting language is Python, but Perl still glues together many bioinformatic applications, and bash scripting becomes a necessity when working on clusters. The notes below are not required but may be interesting or provide context for some.
Interpreted Code - Scripting
Most genomics applications happen, at least in part, with scripting. Scripting languages are great for performing operations on strings or text, like ACGT. They are also not bad for a human to understand, and they hide the messiness of turning human-readable code into machine-readable code (compiling); that is the work of the interpreter, which happens in real time. Scripting languages include Python, Perl, and Ruby. You can also accomplish a fair bit with bash and awk (see the one-liner below), but my heartfelt recommendation would be to dedicate some time to learning Python if you are new to this.
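As a small taste of an awk one-liner (a sketch, assuming the *.params files are tab-separated parameter/value tables with a header row, as they are treated in the scripts below; run it from inside the scripts folder):
awk -F'\t' '$1 != "Parameter" {print FILENAME, $1, $2}' ../data/*.params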
Let's see some simple scripts in action and hint at how they might be useful. In `data/`, you will find three `*.params` files with some results from a model.
In omics work, we are often iterating over many things (e.g. loci, individuals, populations, bootstrap replicates) to do something repetitive. Let's execute three different scripts that loop over the `*.params` files.
loopFiles.sh - a simple bash script
The first line is a shebang, which lets your computer know which interpreter program to use. Our first example uses bash, which will be available on any UNIX system. First, all of the `*.params` files are collected into a single list. We then iterate over the elements of that list, print each element to the screen, and quit the script.
#!/bin/bash
fileList="../data/*.params"
for i in $fileList
do
echo "$i"
done
Try executing `loopFiles.sh` on the cluster with your newly acquired skills. We can execute programs that are not in our path by specifying their location. If we are in the same folder as the program we want to execute, we would run `./loopFiles.sh`, where `.` means here. The file names should print to your screen.
loopFiles.pl - give me Perl
Perl is a popular scripting language, famously the glue of the internet, and it played a large role in early genomics applications. It is still widely used as glue, but it has waned in popularity relative to various R packages and Python. It will give you more flexibility than bash in the long term and can be quick to learn. Here, we use the glob function to get an array of the file names. We then loop over the array elements from the starting position (0) to the end (2) by getting the number of elements in the array with scalar and subtracting 1.
#!/usr/bin/perl -w
@fileList = glob("../data/*.params");
for $i (0..(scalar(@fileList)-1))
{
print "$fileList[$i]\n";
}
exit;
We could make this script executable as we did with the bash script, or we could simply run:
perl loopFiles.pl
loopFiles.py - Python and its libraries
Python is a relatively recent language, but it is the bedrock of most new bioinformatic applications. There has been a lot of development on improving abstraction, and this is supported by many libraries (or modules). These are groups of functions that you let Python know you want to use with the `import <module>` syntax. Here we load two very basic modules, `sys` and `os`, plus a third one that we actually use, `glob`! We can use functions from modules by writing `<module>.<function>()`, so we see `glob.glob()` here.
#!/usr/bin/env python3
import os
import sys
import glob
fileList = glob.glob("../data/*.params")
for i in range(0,len(fileList)):
print(fileList[i])
exit;
Let’s run it
python loopFiles.py
If that did not work, `python` on your system likely points to Python 2, and this script is written for Python 3. Systems may differentiate the two by requiring that Python 3 be specified:
python3 loopFiles.py
Using scripts to retrieve information from files
Scripting is a helpful way to get results from our inevitable thousands of output files. Here are a couple of examples in Perl and Python that build upon looping over the file list. Now each file is opened and processed line by line to extract the relevant information. Our goal is to make one table with the parameter values from each params file.
getResults.pl - a regex approach
Scripting languages can use regular expressions (regex) to find patterns in strings. Good text editors can find and replace with regex too. You can use regex to save the pieces of a string you care about and work with those further.
#!/usr/bin/perl -w
%data = ();
@fileList = glob("../data/*.params");
for $i (0..(scalar(@fileList)-1))
{
# print "$fileList[$i]\n";
open FH1,'<',"$fileList[$i]";
while(<FH1>)
{
if (/(\S+)\s+(\S+)/)
{
$parameter = $1;
$value = $2;
if ($parameter ne "Parameter")
{
push @{$data{$parameter}}, $value;
}
}
}
close FH1;
}
print "File";
foreach $parameter (sort(keys(%data)))
{
print "\t$parameter";
}
print "\n";
for $i (0..(scalar(@fileList)-1))
{
print "$fileList[$i]";
foreach $parameter (sort(keys(%data)))
{
print "\t$data{$parameter}[$i]";
}
print "\n";
}
exit;
Scripting languages give you access to helpful data structures. Here, I make a hash called `data`, which is denoted by the `%`. Hashes have two parts, the key and the value. This is different from an array, where you only need to know the element number to access the value; with a hash, the key can be a string. And here, I actually make a hash of arrays, where each key (a, b, c) gives us the values from the three different files. I then loop back over the data structure to print a matrix that we might work with in R.
perl getResults.pl
getResults.py - splitting lines and tuples for keys
Python can use regex too, but here I simply apply some prior knowledge about the params files to extract what I want. Python also has hashes, but in Python the data structure is called a dictionary, or dict. In this case, we actually build a two-dimensional dictionary where each key is a tuple.
#!/usr/bin/env python3
import os
import sys
import glob
data = {}
fileList = glob.glob("../data/*.params")
for i in range(0,len(fileList)):
InFile = open (fileList[i], 'r')
for Line in InFile:
Line = Line.strip('\n')
ElementList = Line.split('\t')
if (len(ElementList) > 1) and (ElementList[0] != "Parameter"):
data[(i,ElementList[0])] = ElementList[1]
print('File\tParameter\tValue',end='\n')
for j,k in sorted(data.keys()):
print(fileList[j],'\t',k,'\t',data[(j,k)],end='\n')
exit;
You might notice that when printing here, we access the keys a bit more efficiently than in the Perl case and print out a long-format table that would be more appropriate for tidyverse R packages, so you can start to see how things fit together.
python3 getResults.py
Compiled Code
Examples of compiled languages are C, C++, and Rust. They implement low-level functions compared to interpreted/scripting languages (e.g. it would take some creativity to re-implement the glob function), but they are more efficient with memory allocation and potentially faster. C and C++ underlie many of the workhorses of the genomics field, such as BWA, BLAST, RAxML, and BPP.
There is a silly example available for us to practice with. We will do this on the cluster, since getting a good C compiler set up can be time-consuming and frustrating for Windows users. It should work on your own systems for Mac and Linux users, though.
So, go into the `compiledExample` folder. Since you might still be in `scripts`, all you need to do is run:
cd ../compiledExample
To generate an executable program from the C code, we will run the gcc compiler as follows:
gcc -Wall -o betaSolver betaSolver.c solveBeta.c
`-Wall` tells gcc to print all warnings, which you will often see when building popular software applications. Our source code is in two different files, `betaSolver.c` and `solveBeta.c`, and we use `-o` to name the output program `betaSolver`.
You should now have the compiled C program `betaSolver`. Give it a try! Pull up the help menu and then see if you can get the program to do what it should.
./betaSolver -h
You almost never compile programs by invoking gcc yourself, though; that would leave a lot of wiggle room for user error. Thus, programs often come with a Makefile. There is one here to compile the `betaSolver` program for you and save some typing. Let's see it in action.
We will first delete the application, then run the makefile to re-compile it.
rm betaSolver
make
There will often be frustrating moments when compiling a program that you want to use: you run make, it stops compiling with errors, and it gives you very cryptic messages. As programs become very complex, there can be many external libraries that a program depends on. A step that happens during compiling is linking all of the bits of code scattered across a computer into a single machine-readable program. Let's break our program.
make clean
mv betaSolver.h wrong.h
make
Makefiles sometimes come with a clean option so that all of the compiled code is removed. We then change the name of our helper file to `wrong.h` so the compiler no longer finds the correct one. You will now get an error that stops the compiler. This is an easy issue to diagnose from the error message, but sometimes it is not. Often it is a missing header file or library when using a large shared cluster with piecemeal installs.
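To put things right again, restore the header file's original name and rebuild:
mv wrong.h betaSolver.h
make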
Wrap up
You have now logged onto the cluster, edited some files, and moved them around. You have transferred files between your computer and the cluster. You have successfully run some scripts and compiled a program.
Pro Tip
For easy cluster access, be sure you have made the suggested edits to the SSH config file on your local computer. Instructions are provided by the Crop Diversity Cluster here.
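As a rough sketch (the exact contents, including any key and port settings, are covered in the Crop Diversity documentation; the entries below are illustrative), your ~/.ssh/config might contain something like:
Host gruffalo
    HostName gruffalo.cropdiversity.ac.uk
    User YOUR_USER_NAME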