The whole reproductive process is truly awesome and may at first glance seem incredibly efficient, but ... accidents will happen. And every family researcher should be grateful that they do! In short, sometimes during the replication process, when the cell is busy reconstructing the missing half of the railway line that has just been split up the middle, it makes a mistake. And that mistake is called a mutation.[1]

There are different types of mutation (e.g. SNP = single nucleotide polymorphism) but the mutations most relevant to Y-DNA testing are the ones that involve STRs (short tandem repeats).[2] An STR is a sequence of two or more base pairs (the sleepers on the railway track) that is repeated back to back along the DNA railway track (e.g. TTAGC, TTAGC, TTAGC, TTAGC ... 4 repeats of 5 'railway sleepers'). The sequence can range from 2 to 50 base pairs, is typically in a non-coding region, and may repeat, for example, anything from 6 to 30 times.
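
To make the idea of a 'repeat value' concrete, here is a minimal sketch in Python that counts how many times a motif repeats back to back in a stretch of DNA. The function name and the flanking letters around the repeats are made up for illustration; real lab software is far more sophisticated than this.

```python
def count_tandem_repeats(dna: str, motif: str) -> int:
    """Return the longest run of back-to-back copies of `motif` in `dna`."""
    best = 0
    for start in range(len(dna)):
        run = 0
        pos = start
        # Keep stepping forward one motif at a time while the motif repeats.
        while dna.startswith(motif, pos):
            run += 1
            pos += len(motif)
        best = max(best, run)
    return best


# Hypothetical stretch of DNA: made-up flanking letters around 4 copies of TTAGC.
sample = "GGC" + "TTAGC" * 4 + "ATG"
repeat_value = count_tandem_repeats(sample, "TTAGC")
print(repeat_value)
```

The number printed is the 'repeat value' that would be reported for this marker.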

However, during meiosis, the cell may inadvertently add or subtract a sequence from the DNA so that instead of repeating 5 times, the sequence may repeat 7 times in each of the 4 new sex cells. This is a mutation. And it will be passed from father to son in the Y-containing sperm. The father will have a repeat value of 5 for this STR, whereas the son will have a repeat value of 7. And if there are no replication errors when it comes time for the son to father children, then his STR with 7 repeats will be passed on to his children, and in turn their children, and so on, until another accident occurs resulting in another mutation. But that may be 500 years down the line.

By identifying repeats of a specific sequence at specific locations on the Y chromosome, it is possible to create a personal genetic signature (haplotype) for an individual. Thus, these STRs can serve as genetic markers that can also establish the degree of similarity between two people. STR analysis has become the prevalent analysis method for determining genetic profiles in forensic cases and it is the main genetic tool when trying to establish common ancestry between individuals in surname projects such as our own.

Each marker has a range of values - some have a narrow range (e.g. 10-14) whereas others can have a very broad range (e.g. 6-27). The range of values per marker is documented at www.genebase.com/in/dnaMarkerDetail.php

What does a haplotype look like?

There are currently over 10,000 published STR sequences in the human genome. That's roughly 200 per chromosome, or one sequence every 2 miles on the London to Aberdeen railway line. For genealogical purposes, about 100 STRs in total are currently used by the various DNA testing companies to characterise Y-chromosome haplotypes (personal genetic signatures). FamilyTreeDNA (FTDNA), for example, offer a 12-marker, 25-marker, 37-marker, 67-marker, and more recently (April 2011) a 111-marker test; DNA-heritage offered a 23-marker and 43-marker test (before they were bought by FTDNA in June 2011). And there are other companies that offer a variety of other tests (see http://www.isogg.org/wiki/List_of_DNA_testing_companies). Each company and each test has its pros and cons.

The test results are presented as a series of Y-DNA haplotypes (example above). This is a list of the various markers (STRs) with their respective repeat values underneath for each of the individuals tested. The table above contains real (anonymised) data taken from the actual results of the Spearin Y-DNA project. Only 9 markers are shown but the results are identical for Tom, Dick and Harry, whereas Sam differs from the rest by 1 repeat on the marker DYS446. Sam is therefore said to differ from the rest by a genetic distance of 1.[3] The greater the genetic distance between two people, the less likely it is that there is a close relationship between them.
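
The genetic distance described here can be sketched in a few lines of Python. The repeat values below are hypothetical and are NOT the Spearin project results; only the idea that Sam differs from the others by 1 repeat on DYS446 comes from the text, and which man carries 12 and which carries 13 is an assumption. The counting rule used (add up the absolute differences, marker by marker) is one simple convention; testing companies may treat some markers differently.

```python
# Hypothetical haplotypes for illustration only (not the real project data).
HAPLOTYPES = {
    "Tom":   {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS446": 13},
    "Dick":  {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS446": 13},
    "Harry": {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS446": 13},
    "Sam":   {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS446": 12},
}


def genetic_distance(a: dict, b: dict) -> int:
    """Sum the absolute repeat differences over the markers both men tested."""
    shared_markers = a.keys() & b.keys()
    return sum(abs(a[marker] - b[marker]) for marker in shared_markers)


for name in ("Dick", "Harry", "Sam"):
    print("Tom vs", name, "=", genetic_distance(HAPLOTYPES["Tom"], HAPLOTYPES[name]))
```

Only markers that both men have tested are compared, which is why a 12-marker result can still be lined up against a 37-marker result.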

But which came first? Did Sam's ancestor develop a mutation and split away from the 'parent group'? Or did Sam's ancestor belong to the 'parent group' and the other three participants' ancestor was the one who started the 'splinter group'? In other words, which haplotype is older - the one with the DYS446 value of 12, or the one with a value of 13? The good news is that there are various analytical techniques, based on mutation rates and probability analyses, that can help to answer these questions.[4] The bad news is that this science is still so young that the results are not as accurate as one would like.

There are two terms worth knowing before we continue. The first is the Most Recent Common Ancestor (MRCA), which refers to the most recent ancestor shared by two individuals. For two brothers, their MRCA is their father; for two second cousins, their MRCA is their great grandfather. The second is the Most Distant Known Ancestor (MDKA), which refers to the patriarch at the top of your family tree, beyond whom you hit your Brick Wall.

2. What is the estimate for when they shared a common ancestor (this is the MRCA)?

3. How are different branches of the same genetic family related? And who is more related to whom?

Let's look at each question in turn. The closer two people match on their haplotype, the more likely they are to be related. If they are a perfect match they are probably related to each other in the very recent past (say the last 100-300 years). And the more markers they have tested and match on, the stronger the probability (i.e. a 37-marker match indicates a much stronger probability than a 12-marker match). In the Spearin Y-DNA project, 3 of the first 4 participants were exact matches on 43 markers, indicating a very close match and a high probability of sharing a common ancestor in the past 300 years. In our particular Surname Project, the implication of a positive answer to this first question is that the participant can claim a connection to the London Spering's.

Next, we could estimate the TMRCA (Time to MRCA) by either using FTDNA's Time Predictor tool (TiP)[5] or Dean McGee's Y-utility probability matrix.[6] There are other similar tools that could be used and there is no consensus currently on which is the best, but the two mentioned base their calculations on the known mutation rates of the STR markers on the Y-chromosome. If we know how frequently a mutation is likely to occur in a particular marker, we can calculate the 'Time to Most Recent Common Ancestor' based on that marker.[7] Say, for example, the mutation rate for marker DYS446 above was once every 300 years on average, this would mean that the split between the two groups (haplotypes) probably occurred sometime in the previous 300 years. The problem is it could have happened in the previous generation, or it could have happened in the early 1700s - not a very narrow range. If we know from documentary research that there is no link between Sam's tree and the other three trees going back to about 1800, this would mean that Sam's group probably split away from the other group sometime between 1700 and 1800. Probably.
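
To see why the range is so wide, here is a rough Python sketch of the single-marker reasoning. It is not how TiP or the Y-utility works; it simply assumes that each father-to-son transmission has the same small chance of a mutation, that a generation lasts 30 years, and that the marker mutates once every 300 years on average as in the example. All three figures, and the function name, are assumptions for illustration.

```python
def prob_at_least_one_mutation(years_per_mutation: float,
                               years_back: float,
                               lines: int = 1,
                               years_per_generation: float = 30.0) -> float:
    """Chance that a marker has changed at least once since the common ancestor.

    `lines` is the number of lines of descent being counted: 1 for a single
    line, 2 when comparing two living men who each descend from the ancestor.
    """
    # Chance of a mutation in any one father-to-son transmission (assumed constant).
    rate_per_generation = years_per_generation / years_per_mutation
    transmissions = lines * years_back / years_per_generation
    return 1 - (1 - rate_per_generation) ** transmissions


for years in (30, 100, 200, 300, 500):
    p = prob_at_least_one_mutation(300, years)
    print(f"{years:>3} years back: {p:.0%} chance of at least one mutation")
```

Even at 300 years the chance of having seen a mutation is well short of certainty, and it is already appreciable after a single generation, which is the point the paragraph above makes in words.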

And this estimate is based on only 1 marker. With 37 markers, the probability estimate can be much more exact. One would think that the accuracy of the result would improve with more markers (e.g. with the 67-marker test) but it appears that there isn't a huge increase in additional accuracy above 37 markers. However, testing more than 37 markers has other advantages that we will discuss below (FTDNA introduced a new 111-marker test in April 2011).

Before we leave the TMRCA calculations, it is important to appreciate that mutation rates can differ between markers by a factor of several thousand! In the sample haplotype from the project results above, the mutation rates of the various markers are expressed as a ratio, relative to the mutation rate of DYS426. One can see that marker DYS439 mutates 53 times faster than DYS426. Put simply, some markers may mutate once every 100 years, others once every 5000 years ... so the interpretation of a genetic distance of 1 very much depends on which marker we are talking about. Data in the example is taken from Chandler, Journal of Genetic Genealogy, 2006,[8] but watch this space because the science is constantly being revised as more data becomes available.
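
The arithmetic behind relative mutation rates is simple enough to sketch in Python. The ratio of 53 for DYS439 relative to DYS426 comes from the text; the baseline of one mutation every 5000 years for DYS426 is purely an assumed figure, chosen to match the 'once every 5000 years' end of the range mentioned above.

```python
# Relative mutation rates, expressed as multiples of the DYS426 rate.
RELATIVE_RATE = {"DYS426": 1, "DYS439": 53}

# Assumption for illustration: DYS426 mutates about once every 5000 years.
BASELINE_YEARS_PER_MUTATION = 5000


def years_per_mutation(marker: str) -> float:
    """A marker that mutates N times faster waits N times less between mutations."""
    return BASELINE_YEARS_PER_MUTATION / RELATIVE_RATE[marker]


for marker in RELATIVE_RATE:
    print(marker, "mutates roughly once every", round(years_per_mutation(marker)), "years")
```

Under that assumption, a mismatch on the fast marker says far less about the age of a split than a mismatch on the slow one.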

The third question, how closely are the various branches related, is a very interesting one and still the topic of much debate. Theoretically it should be possible to build a 'Mutation History' family tree that shows when mutations occurred and which branches arose from individuals bearing that mutation. The science behind this is discussed in the next section - DNA Family Trees.

Limitations of Y-DNA testing using STRs

Just when you think everything is hunky dory, something comes along and throws a spanner in the works. Or in this case, several spanners, including reverse mutations, multi-step mutations, parallel mutations, multiple-copy STRs, lack of knowledge and different labs behaving in different ways.

Biological Problems

Reverse or back mutations are exactly what they sound like. First there is a mutation one way, and then there is a mutation back to the original. So for example, in 1580 a Nicholas Sperynge with a DYS446 value of 12 passes on a mutation with a DYS446 value of 13 to one of his sons (Luke), and thus a new subgroup is formed. This son's descendants bear this mutation for 10 generations until the conception of Matthew Spearin in 1830 when another mutation occurs, but instead of mutating forward (to give a DYS446 value of 14), the mutation goes backwards to a DYS446 value of 12 i.e. the same as Nicholas Sperynge back in 1580. This is a reverse mutation. The problem it causes is that Matthew's descendants may look like they belong to the 'parent group' headed by patriarch Nicholas Sperynge from 1580, when in fact they belong to a much younger (more distantly related) subgroup from 1830 headed by Matthew. In our example from the actual Spearin data, it may be that any or all of the three participants with exact 43-marker matches have had a reverse mutation in their ancestors' past and are in fact more distantly related to each other than first meets the eye.
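
The effect of a reverse mutation can be shown with a toy Python example. It follows the DYS446 values in the story above (12, then 13, then back to 12); the function name is made up for illustration.

```python
def repeat_history(start: int, steps: list) -> list:
    """Apply each mutation step in turn and record the value after each one."""
    history = [start]
    for step in steps:
        history.append(history[-1] + step)
    return history


# Nicholas has 12; one step up gives 13; one step back down gives 12 again.
history = repeat_history(12, [+1, -1])

observed_distance = abs(history[-1] - history[0])  # what a comparison would show
actual_mutations = len(history) - 1                # what really happened

print("values over time:", history)
print("observed distance:", observed_distance)
print("actual mutations:", actual_mutations)
```

The comparison sees no difference at all, even though two mutations have taken place along the line.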

Secondly, most mutations (95%) are 'single step' mutations i.e. the STR repeat value goes up or down by a value of 1. However, sometimes it changes by a value of 2 (in about 5% of cases, 1 in every 20 times) and very rarely by multiple steps. This can cause confusion if you are expecting single-step mutations.

Another fly in the ointment is the possibility that two distinct lines develop the same mutation at some point in their evolution. Even though they evolved separately, by bearing the same mutation it looks as if their descendants are closely related. In this situation they will be grouped together under the mistaken belief that they are genetically more closely related than they actually are.

A fourth limitation of STR testing is that sometimes there are several copies of the same marker. In other words, the same STR occurs at several places along the genetic railway line - once in London, twice on the outskirts of Birmingham, and once in Glasgow. But the way these markers are analysed means it's impossible to tell which one came from where, so I don't know if I'm comparing the one from London with the one from Glasgow. This isn't a problem if the marker values are the same throughout e.g. values on DYS464 for two individuals of 14-14-14-14 and 14-14-14-14. However, if there are any variations in the numbers (e.g. 13-14-14-14) then it is impossible to know if the two sets are the same or different. For example, the second set may also be reported as 13-14-14-14 when its correct order is actually 14-13-14-14, indicating a mismatch and a genetic distance of 2. But this is impossible to tell with current testing procedures.
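
The ambiguity with multi-copy markers can be sketched in Python using the DYS464 values from the example above. Neither function below is claimed to be what any lab or matching tool actually does; they simply contrast a copy-by-copy comparison (which needs the true order) with a comparison that ignores order (which is all the reported values allow).

```python
from itertools import permutations


def positional_distance(a, b) -> int:
    """Compare copy 1 with copy 1, copy 2 with copy 2, and so on."""
    return sum(abs(x - y) for x, y in zip(a, b))


def order_free_distance(a, b) -> int:
    """Smallest distance over every possible ordering of the second set."""
    return min(positional_distance(a, p) for p in permutations(b))


first_man = (13, 14, 14, 14)
second_man_true_order = (14, 13, 14, 14)  # the order we cannot actually observe

print("if the true order were known:", positional_distance(first_man, second_man_true_order))
print("from the reported values only:", order_free_distance(first_man, second_man_true_order))
```

The two answers differ, which is exactly why a mismatch hidden inside a multi-copy marker can go unnoticed.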

So, how do we handle this? How do we find out what is the probability of each of the following for each of the markers: 1) reverse mutations; 2) multi-step mutations; 3) parallel mutations? We then also have the issue of multiple copy markers - what do we do with these? These are all relevant questions, but how relevant are they if we are trying to connect people who lived in the last 500 years? Are they relevant at all for this type of analysis? John Robb thinks maybe not! Basically, the chances of these happening in the past 500 years may be remote and therefore NOT relevant. His article on the subject sets out the argument. Hopefully the answers to these questions will become clearer over time.

Logistical Problems

There are several other challenges currently facing genetic genealogy. Firstly, as new markers are discovered, it takes some time before their mutation rates can be calculated (because this depends on testing sufficient samples to arrive at an estimate of the mutation rate). And these mutation rates are necessary for further refining any calculation of the time to most recent common ancestor (see Whit Athey's editorial in http://www.jogg.info/52/files/Intro.pdf).

Another problem is the current lack of standardisation of calculations of time to most recent common ancestor. James Irvine has suggested a standardised approach (in his 2010 article in the Journal of Genetic Genealogy) but it remains to be seen if this is endorsed on a wide scale (see http://www.jogg.info/62/files/Irvine.pdf). Despite these current drawbacks, a lot can be learned from genetic testing and as the field is constantly changing, many of these glitches will be ironed out in due course.

Lab-related Problems

Lastly, there are several problems relating to how different labs approach the testing of Y-DNA:

different labs test different markers (see http://www.gendna.net/ydnacomp.htm)
different labs report the results in different orders
different labs use different values for markers with the same result (and these need to be corrected for comparison across labs - see http://www.smgf.org/ychromosome/marker_standards.jspx)
different labs use different names for the same marker (see http://en.wikipedia.org/wiki/List_of_Y-STR_markers and http://freepages.genealogy.rootsweb.com/~thefridays/laboratories.html)
different sources report different mutation rates for the same markers (see mutation rates 1 to 111 markers.xls at http://groups.yahoo.com/group/ISOGG/files/)
some of the mutation rates for the more recently introduced markers are not as yet reported

So, you can see that the science is still developing and improvements will need to be made all the time.

Some interesting bits & pieces

In 389 father-son pairs, 6% of the sons received a mutation from their fathers on only one marker while less than 1% of the population had mutations on two markers. Based solely on this study, double marker changes do occur; however, they do not occur in a significant number of the population. All samples resulted in single repeat mutations except one sample which contained a two repeat loss at Y-GATA-H4. Ref: http://www.cstl.nist.gov/strbase/pub_pres/Decker_YfilerMutationRate.pdf

[4] For more information on interpreting results, see http://www.familytreedna.com/faq/answers/default.aspx?faqid=19, http://www.familytreedna.com/faq/answers/default.aspx?faqid=36, and http://www.familytreedna.com/faq/answers/default.aspx?faqid=9#895

[5] See http://www.familytreedna.com/faq-tip.aspx

[6] See http://www.mymcgee.com/tools/yutility.html

[7] See http://www.familytreedna.com/faq/answers/default.aspx?faqid=21#788, and http://www.familytreedna.com/faq/answers/default.aspx?faqid=21#694

[8] See http://www.jogg.info/22/Chandler.pdf