116 Full-Sequence mtDNA Results and Subclades

From Haplogroup K Project

December 16, 2007

 

On May 24, 2007, I published a diagram and discussion showing the relationships between 42 mtDNA Haplogroup K Project members with full-sequence mtDNA (FGS) results. I had previously published diagrams for 35 FGS results on March 27, 2007, and seven on August 3, 2006. Those documents, plus several mentioned below, and many others may be found on the K Project website. The K Project now has 116 members with FGS results. Four more such tests are in progress, but the results are not due back until next year. So I have now created a new cladogram from the 116 results. In the previous three versions, I had used the actual control-region (HVR) mutation lists and the defining coding-region mutations for each member. No private or even branch coding-region mutations were used, out of respect for the privacy of the members’ results in the area which might have mutations with medical implications.

However, for this new diagram, I have used a different format. You might have noticed that the 42-member diagram almost completely filled the page. If I used the old format, the new 116-member diagram would be illegible. Instead, I’m using the cladogram format which I previously used on June 15, 2007, for the 305 members with high-resolution (HVR1 + HVR2) results and mostly predicted subclade designations. In this format, each subclade is shown by a circle sized in proportion to its representation among the K Project’s 116 members with FGS results. The two cladograms look similar, but there are subtle differences which I will explain. The cladogram was created using Tom Glad’s mtDNAtool and the Fluxus-Engineering Network software. I will also provide some general findings from the FGS results as they apply to the subclades.

The cladogram begins at the top with a node labeled K, which represents the ancestral haplotype of our haplogroup. I have previously added a MitoSearch entry, ARFHH, which lists all the mutations down to the founding of K as measured against the Cambridge Reference Sequence (CRS). Every K, so far, is in one of the major divisions K1 or K2, which are defined by coding-region mutations. Defining mutations are not shown on the cladogram; for those see Dr. Doron Behar’s K tree in Figure 1 of his 2006 paper.

Working right to left, three subclades of the major division K2 are shown, but they are different from the three on the June cladogram. There is no longer a K2+, or any other subclade with a plus sign; all subclades here have an assigned designation. Every K2 so far not in K2a (ten members) or the lower K2a2a (two) has been determined to be in K2b (four). K2b is difficult to predict since it is defined by coding-region mutations. No members have been found in the lower divisions of K2a, except for those in the smallest Ashkenazi subclade K2a2a. The latter is easily predicted from HVR results by the 512C mutation. Also, no K Project member has been determined to be in K2c.

The major K1 division is defined by two coding-region mutations. It has been discussed before that Ötzi the Iceman was an undifferentiated K1. We now have two members designated as simply K1. Their HVR and coding-region mutations look completely different, so a new “K1d” subclade is not on the horizon.

In the major K1c group, defined by 498-, all members not in K1c2 (six members), defined by the addition of 16320T, have been located so far in K1c1 (three) – with one person in K1c1b.

K1b is defined by one coding-region mutation. All members are in either K1b1a (three) or K1b2 (six), both of which are easily predicted by HVR mutations. No members are found in K1b1b or K1b1c.

Perhaps it’s just coincidence, but I note that all three major subclades discussed so far – K2, K1c and K1b – do not have members designated at that level (so-called “empty subclades”) and that all three are split into two lower subclades of roughly equal size.

Now we get to K1a, defined by 497T, which is over 60% of K and is vastly more complicated. Here there are no empty subclades and the sizes of the lower subclades are very different. Dr. Behar divided K1a into nine next-level subclades. So far we do not have members in four of those: K1a5, K1a6, K1a7 and K1a8. However, I have discussed in several places two new large subclades discovered in members of our K Project and on FTDNA’s MitoSearch. I have given those, naturally enough, the provisional designations K1a10 and K1a11. Be aware that these two may have different designations in a future version of the K tree.

On the K Project website, you will notice that there are currently 23 members with the subclade designation K1a. On the cladogram, however, there are only two members shown at the undifferentiated K1a level. Twelve of the other members are shown on the cladogram under either K1a10 or K1a11. Nine others are shown at the node labeled “195C” for the HVR2 mutation above K1a9 on Behar’s K tree. I have previously referred to those as either “Pre-K1a10” or “Pre-K1a9” depending on whether or not they have one or more insertions at position 524. They are combined here, since the 524 insertions are not currently used in subclade designations. Of the nine, eight have the insertions. Many of those predicted to be in Pre-K1a9 because of 195C turned out to be in the K1a1 or K1a4 groups. So while Pre-K1a10 has turned out to be a robust, well-populated subclade, Pre-K1a9 has not.

Below the 195C node are the K1a9 and provisional K1a10 nodes, each of which is defined by a single HVR mutation: 16524G and 16048A, respectively. K1a9 is the second largest Ashkenazi subclade, while its “sister” or perhaps “first cousin” subclade K1a10 appears to have been founded in Ireland. K1a9 has five members in the Project and K1a10 has seven. The Project website News tab lists several documents related to the connection between the two subclades.

Going now to numerical order, the largest part of K1a is under K1a1, of which there is only one undifferentiated example. K1a1 is defined by a single coding-region mutation. K1a1a, defined by another coding-region mutation, currently has four members and no lower branches. K1a1b also has four members, but it has two lower subclades. K1a1b1 has five members. By far the largest Ashkenazi subclade, K1a1b1a, has 15 members. The latter subclade is “bi-modal”; that is, it has almost equal numbers in haplotypes with and without the additional 16223T. All K Project examples have 114T, which is shown below the subclade designation on Behar’s K tree. K1a1b1a is generally predicted by 16234T, but one person in the Project is missing that mutation (a back mutation), and 16234T is also found in several other places within K1a1. My recent article in the Journal of Genetic Genealogy mentions 114T and other mutations affected by heteroplasmy.

Only one Project member is in K1a2, which is defined by one coding-region mutation. I might point out that three of the four examples used by Behar for his tree were from a study of Finnish mtDNA, but our example lists an origin in the British Isles.

K1a3, defined by one coding-region mutation, has one member in the Project. K1a3a has three members. The latter subclade on Behar’s tree is defined by a coding-region mutation plus 16093C. However, only one of the three in K1a3a has that mutation. FTDNA has generally ignored the requirement for this mutation in this and several other subclades. 16093C is the HVR1 mutation most subject to heteroplasmy. (Overall, that “honor” goes to the position 309 insertions in HVR2, but Behar didn’t use those in his K tree.)

As I have discussed before, the K1a4 group is the major surprise revealed by the full-sequence testing. It is usually not predicted by HVR mutations. In the earlier FGS diagrams, there were no members of K1a4 or K1a4a shown. Now we have four in K1a4 and one in K1a4a. A full dozen members are in the next lower subclade K1a4a1. Two members have 16261T and three have 16245T; both these HVR1 mutations appear to be predictors for the subclade. But that leaves seven members of K1a4a1 with no HVR mutations useful for prediction. In fact, we have one high-resolution haplotype which has confirmed members in K1a, K1a3 and K1a4a1. A good rule might be that if there are no mutations in a haplotype which may be used to predict a subclade, then exact HVR matches are not useful in determining relationships.
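The HVR predictors named so far can be gathered into a small lookup. The following Python sketch is illustrative only: it is not the method used by FTDNA or the K Project, the function names are hypothetical, and the sample haplotypes are made up for the example. It covers only the single HVR mutations described above as predictors, leaves out all coding-region definitions, and ignores the exceptions already noted, such as 16234T turning up elsewhere within K1a1.

```python
# Illustrative lookup of HVR mutations that this discussion names as
# subclade predictors. Coding-region definitions are deliberately omitted.
HVR_PREDICTORS = {
    "512C": "K2a2a",
    "16524G": "K1a9",
    "16048A": "K1a10",     # provisional designation
    "16234T": "K1a1b1a",   # "generally" predictive; exceptions exist
    "16261T": "K1a4a1",    # appears to be a predictor
    "16245T": "K1a4a1",    # appears to be a predictor
}


def candidate_subclades(haplotype):
    """Return the subclades suggested by any predictor mutation, in order found."""
    found = []
    for mutation in haplotype:
        subclade = HVR_PREDICTORS.get(mutation)
        if subclade is not None and subclade not in found:
            found.append(subclade)
    return found


def exact_match_is_informative(haplotype):
    """Apply the rule of thumb: with no predictor mutation present,
    an exact HVR match says little about the subclade."""
    return len(candidate_subclades(haplotype)) > 0


# Hypothetical haplotypes, invented for this example only.
with_predictor = ["16224C", "16311C", "16519C", "16524G"]
without_predictor = ["16224C", "16311C", "16519C"]

print(candidate_subclades(with_predictor))
print(exact_match_is_informative(without_predictor))
```

A haplotype with no entry in the lookup yields an empty candidate list, which is the case where only full-sequence testing can place the member.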

As mentioned above, there are no K Project members – or FTDNA customers that I am aware of – who are in subclades K1a5, K1a6, K1a7 or K1a8. K1a9 and K1a10 were discussed above.

The provisional subclade K1a11, with five members, is defined by both HVR mutations (16129A, 16T, 150T and 199C) and coding-region mutations, with 16T being the key mutation, since it shows up in no other subclade and rarely in any other haplogroup. The unusual nine-base-pair-deletion mutation in the coding region and other aspects of this provisional subclade have been discussed in other documents under the News tab on our website.
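As a cross-check on the member counts quoted in this discussion, the per-subclade numbers can be tallied. A minimal Python sketch follows. The grouping is one reading of the text, the variable names are hypothetical, and it assumes that the one K1c1b member is counted in addition to the three in K1c1, which is the reading under which the figures sum to 116.

```python
# Member counts as quoted in the text, grouped by major division.
# Assumption: the single K1c1b member is separate from the three in K1c1.
COUNTS = {
    "K2": {"K2a": 10, "K2a2a": 2, "K2b": 4},
    "K1 (undifferentiated)": {"K1": 2},
    "K1c": {"K1c1": 3, "K1c1b": 1, "K1c2": 6},
    "K1b": {"K1b1a": 3, "K1b2": 6},
    "K1a": {
        "K1a": 2, "195C node": 9, "K1a9": 5, "K1a10": 7,
        "K1a1": 1, "K1a1a": 4, "K1a1b": 4, "K1a1b1": 5, "K1a1b1a": 15,
        "K1a2": 1, "K1a3": 1, "K1a3a": 3,
        "K1a4": 4, "K1a4a": 1, "K1a4a1": 12,
        "K1a11": 5,
    },
}

# Total per major division, then the overall total.
group_totals = {group: sum(members.values()) for group, members in COUNTS.items()}
total = sum(group_totals.values())

# Share of all FGS results that fall under K1a.
k1a_share = group_totals["K1a"] / total

# Members carrying the plain "K1a" designation on the website:
# undifferentiated K1a, the 195C node, and provisional K1a10 and K1a11.
k1a = COUNTS["K1a"]
designated_k1a = k1a["K1a"] + k1a["195C node"] + k1a["K1a10"] + k1a["K1a11"]

for group, count in group_totals.items():
    print(f"{group}: {count}")
print(f"Total: {total}, K1a share: {k1a_share:.0%}")
```

Under those assumptions the total comes to 116 and the K1a group accounts for 79 members, or about 68%, consistent with the “over 60%” figure given above. The four groups carrying the plain K1a designation sum to 23, matching the website count.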

A few general comments are in order. In the K Project, 120 FGS tests have been ordered, with 116 results back. We are very near 121, which was the number of sequences used by Dr. Behar to create the current K tree. The FTDNA database also has an unknown number of K results from customers not in the K Project. Out of the 127 FamilyTreeDNA submissions to the federal GenBank DNA database, 23 are from K. Review of 106 FGS results available to me shows many exact and partial coding-region matches between K Project members and the sequences used by Behar. Therefore, the next version of the K tree should be more precise and will no doubt include additional lower subclades.

Also observed from the available sequences is that different subclades “act” differently when their FGS results are compared. First, since Dr. Behar’s paper focused on Ashkenazi mtDNA, the three Ashkenazi subclades – K1a1b1a, K1a9 and K2a2a – are already well-defined and no new lower subclades may be necessary. But that only partly explains why K1a1b1a members are far more likely to have multiple exact matches across all 16,569 base pairs than those in other subclades. In fact, they are more likely to differ in the HVR than in the coding region. Neighbors K1a9 and K1a10 exhibit similar behavior; more often than not they have no coding-region mutations below the K1 level. But K1a10 has obvious branches forming in the HVR. In contrast, members of K1a4a1 rarely have exact matches in the HVR or the coding region, but numerous partial matches lead to new branches. In other subclades, members have exact HVR matches, but only partial coding-region matches, leading to new branches.

© 2007 William R. Hurst

Administrator, mtDNA Haplogroup K Project