Multiple homologs, getting better timing by adding some species.

In previous exercises we have seen that genomes contain many homologs, often in the form of big gene families. To find out how these gene families evolved it is useful to make gene trees, which should tell us when these proteins duplicated. To time the duplications, however, we also need to add sequences from other species.
  1. Get the polo kinase 1 (PLK1) sequence from UniProt. We will use phmmer because it is normally faster than BLAST. We will search Ensembl human because it is a consolidated genome set and because it allows us to easily distinguish splice variants from paralogs. If EBI/HMMER is down, you can do this exercise with NCBI BLAST instead. In that case use RefSeq as the consolidated genome set (it does still contain splice variants) and increase the word size for some extra speed; even so, be prepared to wait quite long. Do a phmmer search at the EBI HMMER server and change the database to Ensembl human.

    We want to collect the kinase family members that are most closely related to PLK1. One small but annoying disadvantage of the current implementation of HMMER is that a protein with multiple hits will rise to the top of the hit table, even if each of those hits is much less significant by itself. To visualize where our query hits, press the Customize button at the upper right of the output table and check the checkbox for Hit Positions.

    Now collect the best 6 hits (including PLK1 itself), taking the above into account. That means the hits must be different genes, not splice variants of the same gene. It also means keeping hits where one region in our query corresponds to one region in the hit, or where two regions in our query correspond to two regions in the hit, but not hits where one region in our query corresponds to two regions in the hit. Collecting these sequences is some work because the identifiers of the hits lead to Ensembl gene entries. There you have to show the transcript table, from which you can either try to find the Ensembl protein entry or click on the UniProt link and in UniProt change the format to FASTA. Do not forget to include polo kinase 1 itself (which you can get directly from the UniProt link above, saving you the trouble of clicking around on Ensembl).

    Put these sequences in a FASTA file and, if necessary, rename the sequences so that their names reflect the species and the gene name or functional description. Go to Clustal Omega and make an alignment and a tree. Look at the tree directly on the Clustal Omega server. Which kinases seem more closely related to which?

  2. Now do a new phmmer search, again with PLK1 as the query, but change the database to UniProt and restrict the species to Lottia gigantea. Collect the top 6 hits from this species as well, given the above considerations (especially the multiple regions, because I think alternative splicing is not a big consideration for this query). Add them to the human sequences and again make a tree. View it in iTOL, and annotate it in terms of duplications, speciations and losses (you can do this on paper or in a graphics program).
  3. Now add a plant (e.g. Arabidopsis thaliana), an oomycete OR Naegleria gruberi to the alignment and the tree using the procedure outlined above. Does it change the interpretation of the tree?
  4. Search for PLK1 in Ensembl and look at the gene tree and the paralog table. Compare this to your results.
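
The rule from step 1 for deciding which hits to keep (one query region matching one hit region is fine, two matching two is fine, one query region matching two hit regions is not) can be written down as a small check. The sketch below is only an illustration: the coordinates are made up, the function names and the 20-residue overlap threshold are assumptions of this example, and in practice you would read the regions off the Hit Positions column by eye.

```python
from itertools import combinations


def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def is_one_to_one(regions, min_overlap=20):
    """Check the hit regions of a single target protein.

    regions: list of (query_start, query_end, target_start, target_end).
    Returns False when two regions cover (largely) the same part of the
    query, i.e. one query region corresponds to two regions in the hit.
    min_overlap is an arbitrary threshold chosen for this sketch.
    """
    for (qs1, qe1, _, _), (qs2, qe2, _, _) in combinations(regions, 2):
        if overlap(qs1, qe1, qs2, qe2) >= min_overlap:
            return False
    return True


# Made-up coordinates, for illustration only (not real phmmer output).
hits = {
    "hit_A": [(50, 300, 40, 290)],                        # one region to one region
    "hit_B": [(50, 300, 10, 260), (410, 590, 500, 680)],  # two regions to two regions
    "hit_C": [(50, 300, 10, 260), (60, 295, 400, 640)],   # one query region, two hit regions
}

keep = [name for name, regions in hits.items() if is_one_to_one(regions)]
print(keep)
```

With these example values, hit_A and hit_B are kept and hit_C is dropped, because both of its regions match the same stretch of the query.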