MSC with ASTRAL
We want to continue building computational skills from yesterday while exploring how ASTRAL can be used for species tree estimation. Today’s goals:
- Write a bash script to analyze a few hundred gene trees with IQTREE quickly
- Use those gene trees to estimate an MSC tree with ASTRAL
Continuing with IQTREE
We will continue working with our baobab data. Two new folders have appeared for you:~/PhylogenomicsWorkshop/workshopData/Fasta
and ~/PhylogenomicsWorkshop/analysis/MSC
. The new data folder has all of the alignments from Karimi et al. (2020)1.
We are going to use the ASTRAL method to estimate a species tree using the MSC. Remember, this works by estimating each gene tree with ML first, and then giving those gene trees to astral. There are over 300 alignments though! We can automate this process with a simple bash script. First you will have to create a file with nano
cd ~/PhylogenomicsWorkshop/analysis/MSC
nano loopFiles.sh
loopFiles.sh - a simple bash script
#!/bin/bash
fileList="../../workshopData/Fasta/*.fasta"
for i in $fileList
do
echo "$i"
done
Now save the file!
CTRL + O
ENTER
CTRL + Z
The first line is called a shebang. This is letting your computer know which program language to use. Our first example is using bash, which will be available on any UNIX system. First, all of the *.fasta
files are collected into a single array or list. We then iterate over the number of elements in that array, print the element to the screen, and quit the script.
We now need to make our script executable with the chmod command, then we can use it.
chmod u+x loopFiles.sh
./loopFiles.sh
You should see all of the Fasta file names printed to the screen.
We can modify this file to make it run IQTREE for us on all of the genes without having to type very much. First copy the file to have a new name, and then open it with nano to edit it.
cp loopFiles.sh loopIQTREE.sh
nano loopIQTREE.sh
The new bash script should look like
#!/bin/bash
fileList="../../workshopData/Fasta/*.fasta"
for i in $fileList
do
eval "~/PhylogenomicsWorkshop/workshopPrograms/iqtree-2.1.3-Linux/bin/iqtree2 -s $i -m MFP -B 1000"
done
Now the script is ready and our work is done. We will run the script. We do not have to use chmod again since we copied a file that was already executable.
./loopIQTREE.sh
IQTREE will now run for every fasta file and create the output in ../../data/Fasta
We now need to put all of the tree files into one file that ASTRAL can analyze
cat ../../workshopData/Fasta/*.treefile > genetrees.txt
Have a look at the new genetrees.txt
file with less and see that you now have all of the genetrees together, with one on each line of the file.
We can now analyze run astral
java -jar ~/PhylogenomicsWorkshop/workshopPrograms/ASTRAL-5.7.1/Astral/astral.5.7.1.jar -i genetrees.txt -o astral.tre 2>astral.log
What happened is that we gave astral the gene trees with the option “-i”, specified the output file name with “-o” and redirected what it would normally print to the screen to a file with 2> astral.log. The file names could be anything.
Remember that the MSC can treat multiple individuals of a species as a population for species tree estimation. That is accomplished by giving astral a mapping file with the “-a” option. I have pre-formatted one for you.
java -jar ~/PhylogenomicsWorkshop/workshopPrograms/ASTRAL-5.7.1/Astral/astral.5.7.1.jar -i genetrees.txt -o astral.grouped.tre -a namemap.astral.txt 2>astral.grouped.log
Tomorrow we can have a look at the trees and compare results across analyses.
-
Karimi N, et al. 2020. Reticulate evolution helps explain apparent homoplasy in floral biology and pollination in baobabs (Adansonia; Bombacoideae; Malvaceae). Systemaic Biology. 69:462-478. ↩