PLK1

Multiple homologs, getting better timing by adding some species.

We have seen that genomes contain many homologs and large gene families. To find out how these gene families evolved, it is useful to make gene trees, which should tell us when these proteins duplicated. To time the duplications, however, we also need to add species. For this question we are specifically interested in the timing of the duplications of PLK1, PLK2, PLK3, and PLK4.

Get the polo kinase 1 (PLK1) sequence from http://www.uniprot.org/uniprot/P53350.fasta. We will use phmmer because it is normally faster than BLAST, and because it is a useful website to have seen. We will search Ensembl human because (a) for human in particular it is a consolidated genome set and (b) it lets us easily distinguish splice variants from paralogs: each hit is denoted by its Ensembl gene identifier, and different genes have different identifiers. Do a phmmer search at https://www.ebi.ac.uk/Tools/hmmer/search/phmmer and change the database to Ensembl human. If EBI/HMMER is down, you can do this exercise with NCBI BLAST instead. In that case use RefSeq as the consolidated genome set (it does still contain splice variants) and increase the word size for some extra speed; even so, be prepared to wait quite long.
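If you prefer to fetch the query sequence from a script rather than the browser, the following is a minimal sketch. It assumes the UniProt link above still serves plain FASTA text (it may redirect to a newer address, which urlopen follows). The function names are made up for this example, and fetch_fasta needs network access.

```python
from urllib.request import urlopen


def uniprot_fasta_url(accession):
    """Build the FASTA URL in the same form as the link used in this exercise."""
    return f"http://www.uniprot.org/uniprot/{accession}.fasta"


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_fasta(accession):
    """Download one entry and parse it (assumes the URL returns FASTA text)."""
    with urlopen(uniprot_fasta_url(accession)) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling fetch_fasta("P53350") should give a one-element list holding the PLK1 header and sequence, which you can paste into the phmmer search box.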
We want to collect the kinase family members that are most closely related to PLK1. One small but annoying disadvantage of the current implementation of the HMMER3 package is that a protein with multiple hits, even if each is much less significant by itself, will rise to the top of the hit table. To visualize where our query hits, press the Customize button at the upper right of the output table and check the checkbox for Hit Positions. You should observe that e.g. "ribosomal protein S6 kinase A3" is amongst the high-scoring hits because it has two kinase domains rather than one. Its high rank thus reflects its two kinase domains, not a particularly close relationship to PLK1.
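The check you do by eye with Hit Positions can also be written down as a rule: a hit is suspicious when two separate regions of the target align to the same stretch of the query. The sketch below illustrates that rule only. The dictionary layout and all coordinates are invented for this example and do not correspond to the real phmmer output format.

```python
from itertools import combinations


def ranges_overlap(a, b):
    """True if two inclusive (start, end) ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def query_region_hit_twice(domains):
    """True if two separate hit regions align to the same stretch of the query."""
    return any(
        ranges_overlap(d1["query"], d2["query"])
        for d1, d2 in combinations(domains, 2)
    )


# Invented coordinates, for illustration only (not real phmmer output).
hits = [
    {"name": "one kinase domain plus a second, different region",
     "domains": [{"query": (50, 300), "target": (80, 330)},
                 {"query": (410, 590), "target": (500, 680)}]},
    {"name": "two kinase domains",
     "domains": [{"query": (50, 300), "target": (70, 320)},
                 {"query": (60, 300), "target": (420, 670)}]},
]

# Keep only hits whose regions cover different parts of the query.
kept = [h["name"] for h in hits if not query_region_hit_twice(h["domains"])]
print(kept)
```

Checking for overlap on the query, rather than simply counting hit regions, avoids throwing away proteins that match the query in two different places, for example a kinase domain plus a second conserved region.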
Now collect the 6 best hits (including PLK1 itself), taking the above into account: exclude proteins that score high because of multiple kinase domains, and select different genes, not just different transcripts. Collecting these sequences is some work because the identifiers of the hits lead to Ensembl gene entries. There you have to show the transcript table, from which you can either try to find the Ensembl protein entry or click on the UniProt link and, in UniProt, change the format to FASTA. Do not forget to include polo kinase 1 itself! (You can get it directly from the UniProt link above, saving you the trouble of clicking around on Ensembl.)
Put these sequences in a FASTA file and, if necessary, rename the sequences so that their names reflect the species and their gene name / functional description. Go to Clustal Omega and make an alignment and a tree. Look at the tree in the Clustal Omega server itself or on iTOL. Which kinases seem more closely related to which? When do you think these kinases duplicated?
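Renaming by hand is fine for six sequences, but it can also be scripted. The sketch below assumes headers in the UniProt style, which carry the organism in an OS= field and the gene name in a GN= field. Headers from Ensembl look different and simply fall back to their first word here. The function names and the Genus_species_GENE label format are choices made for this example.

```python
import re


def short_label(header):
    """Turn a UniProt-style FASTA header into 'Genus_species_GENE'.

    Falls back to the first word of the header when OS= or GN= is missing.
    """
    species = re.search(r"OS=(.+?)\s+OX=", header)
    gene = re.search(r"GN=(\S+)", header)
    if not (species and gene):
        words = header.lstrip(">").split()
        return words[0] if words else "unnamed"
    genus_species = "_".join(species.group(1).split()[:2])
    return f"{genus_species}_{gene.group(1)}"


def relabel_fasta(text):
    """Rewrite every header line of a FASTA text; sequence lines are untouched."""
    out = []
    for line in text.splitlines():
        out.append(">" + short_label(line) if line.startswith(">") else line)
    return "\n".join(out) + "\n"


# Header written in the UniProt style; check it against the file you downloaded.
example = ">sp|P53350|PLK1_HUMAN Serine/threonine-protein kinase PLK1 OS=Homo sapiens OX=9606 GN=PLK1 PE=1 SV=1"
print(short_label(example))
```

Short labels without spaces or special characters tend to survive the alignment and tree programs unchanged, which makes the tree much easier to read.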
Now do a new phmmer search, again with PLK1 as the query, but change the database to UniProt and restrict the species to Lottia gigantea (a sea snail). Again collect the top 6 hits from this species, given the above considerations (especially the multiple hit regions; I think alternative splicing is not a big consideration for this query). Again make a tree. View it in iTOL, and annotate it in terms of duplications, speciations and losses (you can do this on paper or in a piece of drawing software). What is the timing of the different duplications?
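To check your hand annotation, one simple rule you can apply is species overlap: an internal node is called a duplication if the species found on its two sides overlap, and a speciation otherwise. The sketch below applies that rule to a rooted tree written as nested pairs. It is not part of the procedure described here: it assumes the tree is rooted and binary, assumes leaf names of the form species|gene, and does not infer losses. The toy tree uses placeholder names and is not the answer to this exercise.

```python
def leaf_species(label):
    """Species prefix of a leaf label such as 'sp1|A' (assumed naming scheme)."""
    return label.split("|", 1)[0]


def annotate(node, events):
    """Label each internal node below `node` by the species-overlap rule.

    Appends (event, leaves) to `events` in post-order and returns
    (species_set, leaves) for the subtree.
    """
    if isinstance(node, str):
        return {leaf_species(node)}, [node]
    left, right = node
    left_species, left_leaves = annotate(left, events)
    right_species, right_leaves = annotate(right, events)
    # Shared species on both sides means the split cannot be a pure speciation.
    event = "duplication" if left_species & right_species else "speciation"
    leaves = left_leaves + right_leaves
    events.append((event, leaves))
    return left_species | right_species, leaves


# Placeholder tree with two species (sp1, sp2) and two gene families (A, B).
toy_tree = ((("sp1|A1", "sp1|A2"), "sp2|A"), ("sp1|B", "sp2|B"))
events = []
annotate(toy_tree, events)
for event, leaves in events:
    print(event, leaves)
```

In the toy tree the root is a duplication that predates the split of sp1 and sp2, while the A1/A2 node is a duplication specific to sp1. The result depends entirely on where the tree is rooted, so decide on the root before trusting any labels.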
Now add a plant (e.g. Arabidopsis thaliana) and Naegleria gruberi to the alignment. Again make a tree using the procedure outlined above. Does it change the interpretation of the tree?
Search for PLK1 in Ensembl and look at the gene tree and the paralog table. Compare these to your results.