Number of RhoGAP29 homologs in the human genome

As explained in the lecture, some gene families are very big while others are smaller. You would think that getting the size of a gene family in a certain organism should be easy, and in some ways it is. In these exercises we hope to give you an idea of the kind of issues that need to be solved in order to obtain an estimate of a gene family's size. Note that it is a general bioinformatics problem that even a simple question can be asked in a myriad of ways, and that it is a useful skill to be able to reflect on how the choice of method impacts the result.

Also, for the mini projects we want you to estimate the number of members of the gene family of the protein you will be doing your mini project on.

  1. RhoGAPs are one of the many large protein families and domains in the human genome, and by no means the biggest. A RhoGAP is given here. Put the sequence in SMART (without ticking all the checkboxes) or Pfam and inspect its domain composition.
  2. How many homologs are in the human genome? We will first use BLAST. Go to Ensembl BLAST. For "sequence data" select protein. For "search against" select protein. For "maximum number of hits to report" select at least 1000. Be patient: BLAST at Ensembl is quite slow.
  3. Press "view results". Scroll down to see the pattern of hits across the human chromosomes. Look at the tabular output. How many hits do you have? Are these all different genes? (Look at the second column.)
  4. How do we now get the total number of RhoGAP homologs in the human genome? Press the "download what you see" option in the upper right corner of the table. This list contains duplicates, so count the number of unique hits in a scripting language like R or Python. Alternatively, open the file in Excel, copy the column containing the names of the hits and paste them into the "list 1" field of the Venny webtool (see ). You should get your number in a Venn diagram of a single set. How many RhoGAP homologs does the human genome seem to contain?
  5. Do the same in NCBI BLAST. To search only human sequences, restrict your search to human in the "Organism" field. You can decide for yourself whether you want to search against NR or against RefSeq. What do you see? Explain the difference. For this you may need to read a little about the NR database.
  6. Now let's see if we can also get an estimate for the number of RhoGAPs using profiles. Find the RhoGAP model in Pfam (either via text searching or via sequence searching with our query gene). Then go to the "curation and model" entry and download the RhoGAP model to your machine. Then go to the profile search site and select human Ensembl as the database. Search with the RhoGAP model. Look at two hits (out of multiple) with identical identifiers (e.g. ENSG00000180448.10). What is different about the alignments of these hits? (Click on the ">" symbol to fold open an alignment.) Download the results as a tab-delimited file. Again filter for unique hits, either via a scripting language or by opening the file in Excel and pasting the list at .
  7. Take the method from above that you trust most (or find easiest) and do the same thing for med11 (the profile or the protein). Contrast the results with RhoGAP. If you want, try to find a family that you think should be bigger than RhoGAP and see if you can confirm it.
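
For the counting in steps 4 and 6, a script only needs to read one column of the downloaded table and keep each identifier once. The Python sketch below is one possible way to do this, not the required solution. It assumes the download is a delimited text file with a header row; the column name `gene_id`, the function name `unique_hits` and the example rows are hypothetical, so replace them with the header and file you actually get.

```python
import csv
import io


def unique_hits(handle, column, delimiter="\t"):
    """Return the set of distinct, non-empty values in one column of a delimited table."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header {reader.fieldnames}")
    hits = set()
    for row in reader:
        # Short rows yield None for missing fields; treat those as empty.
        value = (row[column] or "").strip()
        if value:
            hits.add(value)
    return hits


# Made-up table standing in for a downloaded result file.
# The identifiers and the column name "gene_id" are placeholders, not real output.
example = io.StringIO(
    "gene_id\tevalue\n"
    "GENE_A\t1e-50\n"
    "GENE_A\t1e-10\n"
    "GENE_B\t1e-05\n"
)

hits = unique_hits(example, column="gene_id")
print(len(hits), "unique hits:", sorted(hits))
```

To use it on a real download, open the file with `open("your_file.tsv", newline="")` and pass the handle instead of `example`; set `delimiter=","` if the file is comma-separated. Whether you count gene, transcript or protein identifiers changes the answer, which is exactly the kind of choice these exercises ask you to reflect on.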