We want to continue building computational skills from yesterday while exploring how ASTRAL can be used for species tree estimation. Today’s goals:

  1. Write a bash script to analyze a few hundred gene trees with IQTREE quickly
  2. Use those gene trees to estimate an MSC tree with ASTRAL

Continuing with IQTREE

We will continue working with our baobab data. Two new folders have appeared for you:~/PhylogenomicsWorkshop/workshopData/Fasta and ~/PhylogenomicsWorkshop/analysis/MSC. The new data folder has all of the alignments from Karimi et al. (2020)1.

We are going to use the ASTRAL method to estimate a species tree using the MSC. Remember, this works by estimating each gene tree with ML first, and then giving those gene trees to astral. There are over 300 alignments though! We can automate this process with a simple bash script. First you will have to create a file with nano

cd ~/PhylogenomicsWorkshop/analysis/MSC
nano loopFiles.sh

loopFiles.sh - a simple bash script

#!/bin/bash                                                                                                                   

fileList="../../workshopData/Fasta/*.fasta"
for i in $fileList
do
  echo "$i"
done

Now save the file!

CTRL + O
ENTER
CTRL + Z

The first line is called a shebang. This is letting your computer know which program language to use. Our first example is using bash, which will be available on any UNIX system. First, all of the *.fasta files are collected into a single array or list. We then iterate over the number of elements in that array, print the element to the screen, and quit the script.

We now need to make our script executable with the chmod command, then we can use it.

chmod u+x loopFiles.sh
./loopFiles.sh

You should see all of the Fasta file names printed to the screen.

We can modify this file to make it run IQTREE for us on all of the genes without having to type very much. First copy the file to have a new name, and then open it with nano to edit it.

cp loopFiles.sh loopIQTREE.sh
nano loopIQTREE.sh

The new bash script should look like

#!/bin/bash

fileList="../../workshopData/Fasta/*.fasta"
for i in $fileList
do
  eval "~/PhylogenomicsWorkshop/workshopPrograms/iqtree-2.1.3-Linux/bin/iqtree2 -s $i -m MFP -B 1000"
done

Now the script is ready and our work is done. We will run the script. We do not have to use chmod again since we copied a file that was already executable.

./loopIQTREE.sh

IQTREE will now run for every fasta file and create the output in ../../data/Fasta

We now need to put all of the tree files into one file that ASTRAL can analyze

cat ../../workshopData/Fasta/*.treefile > genetrees.txt

Have a look at the new genetrees.txt file with less and see that you now have all of the genetrees together, with one on each line of the file.

We can now analyze run astral

java -jar ~/PhylogenomicsWorkshop/workshopPrograms/ASTRAL-5.7.1/Astral/astral.5.7.1.jar -i genetrees.txt -o astral.tre 2>astral.log

What happened is that we gave astral the gene trees with the option “-i”, specified the output file name with “-o” and redirected what it would normally print to the screen to a file with 2> astral.log. The file names could be anything.

Remember that the MSC can treat multiple individuals of a species as a population for species tree estimation. That is accomplished by giving astral a mapping file with the “-a” option. I have pre-formatted one for you.

java -jar ~/PhylogenomicsWorkshop/workshopPrograms/ASTRAL-5.7.1/Astral/astral.5.7.1.jar  -i genetrees.txt -o astral.grouped.tre -a namemap.astral.txt 2>astral.grouped.log

Tomorrow we can have a look at the trees and compare results across analyses.

  1. Karimi N, et al. 2020. Reticulate evolution helps explain apparent homoplasy in floral biology and pollination in baobabs (Adansonia; Bombacoideae; Malvaceae). Systemaic Biology. 69:462-478.