Please refer to: https://en.wikipedia.org/wiki/Sample_size_determination
Specifically, you need a sample size of 100 for 90% certainty, and a sample of approximately 1000 for 97% certainty. Ultimately, how much error can we afford and what would be a reasonable sample size to determine mtDNA group 'in all probability'?
Moreover, in sampling, you need the sample out of people with the Exact Line!! Because this profile is listed as U5b - this entails a sample from her to the exact same daughters in all generations! And that assumes that all lines are correct for those people. (That means - it is unlikely to be ever found... except if you dig up 950 graves... (assuming all 50 people who are alive are contributing) i.e. the test is flawed to start with...)
Questions: Did someone gather 1000+ samples for any of the exact lines? (I think unlikely - and that is the number to hold it to a high likelihood). Were all samples U5b? How have you allowed for 'errors'? What is the statistical strength of the test?
U5b would in my mind, today, only hold true if someone took her sample physically. On the profile there are no sources to where the test was done and how the samples were collected? If her profile is publicly attributed to be U5b, surely that should be sourced? I recently saw a 'Barry' Y-DNA result - I think just over 100 men contributed and there were three distinct groups although they share the same progenitor. Confusing - that test must have a very low reliability :(
All users should be aware of the dangers of 'extrapolating' without empirical evidence. The strength of the test needs to be shown; otherwise the public display of mtDNA should be curbed.
Jan, you are correct - these are projections only, based on the assumption that the paper trail represented by the genealogy line is biologically correct. (A safer bet with mtDNA than with Y DNA:-))
We have assumed that this caveat is obvious, but perhaps we should be more specific about it on the project.
I'm out for the day today, but will come back to discuss in more depth.
Please have a look here too:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4599962/
I think the conclusion (bias, weak statistical relevance) may be applicable too?
I also see that FTDNA only has +- 575 000 total tests in its world wide database... claiming to be the largest.
I will provide comments here in a bit. I'm reading this paper you posted Jan.
The Barry Project is interesting and highlights some of the issues that can come up with Y-DNA. The Barry results imply that 3-4% of all births in each generation of Barrys since their arrival in Ireland involve a discontinuity in the paternal line - adoption (many people took in nieces, nephews, children of neighbors), out of wedlock birth, incorrectly attributed paternity.
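To get a feel for what a 3-4% rate per generation does to a long paper line, here is a rough Python sketch. It assumes the rate is constant and that each generation is independent, which the Barry Project does not claim; the generation counts are hypothetical.

```python
def chance_line_intact(rate_per_generation: float, generations: int) -> float:
    """Probability that no discontinuity occurs in any generation of one line.

    Assumes every generation carries the same, independent risk.
    """
    return (1.0 - rate_per_generation) ** generations


# Illustrative rates taken from the 3-4% figure; generation counts are made up.
for rate in (0.03, 0.04):
    for generations in (4, 8, 12):
        p = chance_line_intact(rate, generations)
        print(f"rate={rate:.0%} generations={generations:2d} intact={p:.1%}")
```

Under these assumptions a 12-generation paternal line has roughly a 69% chance of being unbroken at 3% per generation, which is why several distinct Y-DNA groups under one surname is not surprising.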
We must take care when interpreting genetic results for genealogy. Having a good solid paper trail to start with is essential. There are strategies to employ when using genetics for genealogy that can help identify ancestors in a reliable way.
Jan, I'm not sure how the Wikipedia reference pertains here? Statistical significance in experimental designs and random sampling isn't pertinent to the manner in which the DNA results of a self selected population of collateral cousin descendants would be used to dis/prove their assertion of a common ancestor.
It's been a while since I did Stats, and I've never studied population genetics, so I wasn't familiar with all the terms, but I still don't think that paper is making the point you want it to make here.
I think you are misunderstanding how that population genetics study of random samples of people with the same surname in an area differs from a study of a known genealogical line. Dis/proving that a purported mtDNA progenitor is/n't correct, doesn't require thousands of descendants as a sample set for statistical significance. It only requires a number of cousins who can trace their genealogical line back to her.
In the Appel Botha study, it was something like 10 or 20 people, even though the Botha surname is amongst the most common in SA.
Look at the article referenced on the https://www.geni.com/projects/Bothas-who-are-DNA-descendants-of-Fer... 'Bothas who are DNA descendants of Ferdinandus Appel' project:
http://repository.up.ac.za/bitstream/handle/2263/32007/Greeff_Appel...
=Moreover, in sampling, you need the sample out of people with the Exact Line!! Because this profile is listed as U5b - this entails a sample from her to the exact same daughters in all generations! And that assumes that all lines are correct for those people. =
=On the profile there are no sources to where the test was done=
Francoise Martinet is the daughter to mother to mother line (is that what you mean by exact same daughters?) of Private User, who has uploaded the Test Result under documents on the profile.
=That means - it is unlikely to be ever found... except if you dig up 950 graves... (assuming all 50 people who are alive are contributing) i.e. the test is flawed to start with...)=
I think the Appel/Botha article is useful in showing how digging up graves is not the only method of proving ancestral / DNA veracity.
If you look at the South African mtDNA - Female Progenitors Project http://www.geni.com/projects/South-African-mtDNA-Female-Progenitors..., you will see that she is also the mother line ancestor of Daan Botes - who has yet to have a genetic test done, but when he does - this will add a level of dis/proof to the genealogy that helps us analyse the veracity of the tree.
So, the idea of publicizing the DNA results that pertain to our Progenitor profiles, is to get the SA users interested in having their DNA tested, in order to help us to confirm the lines.
On the project, we went to some trouble to find other Geni users who were also motherline descendants of the same progenitor. This is in the hope that they'll be inspired to have themselves tested too.
Should we point out in the Curator note that we haven't dug up her grave? >Not possible when dealing with Prog profiles which usually contain a lot of other data on a Curator note of limited characters.
Is this dishonest?
>Well I hadn't thought so, given that the position of virtually every ancestor on our tree contains the same implicit caveat: '(As far as the data shows) this is my ... grandparent.' Very few of us have the 'empirical evidence' of DNA proof from samples of even our grandparents to prove the paper trail relationship beyond doubt.
So, '(As far as the data shows) this is her mtDNA Haplogroup.' If more data becomes available to contradict this, we will update.
That doesn't seem to me to be unreasonable, or needing to be 'curbed'.
I was unclear whether the test & question was about mtDNA or Y-DNA.
For mtDNA, as I understand it, you may actually only need to compare test results from a few living persons; stats are not relevant, it's a direct chromosomal comparison.
Here's an example study
https://www.familytreedna.com/public/HanksDNAProject/default.aspx?s...
I think the same is true of Y DNA.
(The proof that a significant chunk - was it as much as half? of all living SA Bothas are actually descendants of an Appel, not a Botha, came from comparing the DNA of 10 to 20 living men) http://repository.up.ac.za/bitstream/handle/2263/32007/Greeff_Appel...,
Without DNA from the grave, stats are always involved. It's only 'a direct chromosomal comparison' if you have DNA from the body. In that article: "The rarity of the X1c haplogroup makes these matching samples more definitive." is the stats reference.
Of course, Y DNA has a far greater chance of being extrapolated incorrectly, because the paper trail depends on taking the mother's word on the father's identity.
Far more difficult for her to pretend she's not the mother. (Although that does happen too.)
Wow! What a fabulous discussion!
I am about to go out, so this will be brief. I promise to come back with a response that has more meat.
Most importantly, for any DNA test to be useful, we need to have a detailed family tree going at least 4 generations back and we need solid documentation for each ancestor.
The important thing for finding genealogically relevant matches is matching high numbers of STR markers. The more STR markers you include in your test, the better. If you match someone at 67 markers, there is a good chance you are closely related. For example, say John matches Joe at 67 markers with only 3 differences. This suggests that they may descend from two different sons of a common male ancestor and that three cumulative Y STR mutations have occurred since that time in these two men's lines.
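A minimal sketch of the comparison described above, in Python. It simply counts the markers on which two men differ; the testing companies use their own distance rules, so treat this as an illustration only. The marker names are real Y-STR names, but the repeat values for John and Joe are invented.

```python
def genetic_distance(a: dict[str, int], b: dict[str, int]) -> tuple[int, int]:
    """Return (markers_compared, markers_differing) for markers both men tested."""
    shared = sorted(set(a) & set(b))
    differing = sum(1 for marker in shared if a[marker] != b[marker])
    return len(shared), differing


# Hypothetical repeat counts; a real 67-marker comparison works the same way.
john = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11, "DYS385a": 11}
joe = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 10, "DYS385a": 11}

compared, differing = genetic_distance(john, joe)
print(f"{differing} differences over {compared} shared markers")
```

Only markers present in both results are compared, so a 37-marker test can still be checked against a 67-marker one.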
Stay tuned, more to come........
Roland Henry Baker, III I'm wondering if I could trouble you to take a look at the study http://repository.up.ac.za/bitstream/handle/2263/32007/Greeff_Appel and comment on Jan's questions, if you have a chance?
What the Wikipedia article is discussing is population statistics. We have a pond with 1,000 fish. Some fish are blue, some fish are green and some fish are yellow. How many fish do we have to catch in order to determine the ratio of each color of fish in the pond? This is population statistics 101. That has nothing to do with determining the mtDNA haplotype of a specific known ancestor. Population statistics does not apply here. Population statistics would apply for example if we took a total population of a city which included individuals of several known mtDNA haplogroups and wanted to calculate the ratio of people who belonged to each of these mtDNA haplogroups. For example, in the second article we examine a population of all the people of certain surnames in 26 cities that were listed in the phone book. We then examine DNA data from a sample of people from each of these surnames (and surnames with similar spellings) and sort them into separate groups based on these data. That first step involves population statistics. Then we try to construct a number of founding genetic groups. Then we try to compare these two sets of data. Very interesting but this also has nothing to do with determining the mtDNA haplotype of a specific ancestor.
There are a number of surname projects. For example the Baker surname project. I have no reason to believe if I sample all the Bakers in the United States at random that they will belong to the same Y haplogroup. I could sample the entire population of Bakers at random and then try to determine how many distinct haplogroups they belong to and in what ratio. But that also doesn’t have anything to do with determining the Y haplotype of a specific known ancestor.
What we are attempting to do is take a known proven genealogy and use known rules of inheritance to determine the haplogroup of specific members of a given generation. If I know the male in the first generation F(1) is of Y haplogroup R1a then I know that each of his male offspring will be R1a as well, assuming that the time between F(1) and F(n) is much shorter than the timescale over which the haplogroup in question diverges. We know that R1a split from R1b about 25,000 years ago. So I can compare F(1) to F(12) over, say, a 400 year period and safely assume that the genotype is constant. I can determine the haplogroup of a male member of F(12) and use that to deduce the haplogroup of the original F(1) male.
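For the pond question, the usual textbook answer can be sketched in a few lines of Python. This uses the normal approximation for estimating a proportion; the 95% confidence level (z = 1.96) and the worst-case proportion of 0.5 are assumptions of the sketch, not figures from the thread.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error when estimating a proportion from n random samples."""
    return z * math.sqrt(p * (1.0 - p) / n)


for n in (100, 1000):
    print(f"n={n:4d} margin of error = +/-{margin_of_error(n):.1%}")
```

With these assumptions, 100 fish give a margin of about 10 percentage points and 1,000 fish about 3, which may be where the 100 and 1000 figures come from. Either way the calculation describes how well a ratio in a population is estimated, not how certain a single ancestor's haplogroup is.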
What is in question is the validity of the genealogy. We assume that the genealogy is correct. And that is why the genotype should always be posted along with information about the line of descent. That way any independent researcher can investigate the accuracy of the genealogy. Without a line of descent the haplogroup call is useless to researchers. Likewise the type of test taken and the place the test was taken should also be included.
As far as genealogy these DNA data are no different than any other primary source.
Erica for some reason I couldn't access http://repository.up.ac.za/bitstream/handle/2263/32007/Greeff_Appel. So I based my answer on the Wikipedia page and the PLOS article posted above. I hope that is helpful. If not please feel free to ask me further questions.
I’d like to make a suggestion and that would be that GENI allow users to attach DNA test results to their own personal profiles (and only their own personal profiles, not to their ancestors). And then let these results propagate automatically up their tree to their ancestors where any researcher can see these results, the type of test taken and where it was taken and they can trace it to the person who took the test. Some ancestors will then have several DNA test results attached to them and any researcher can compare those results at the testing company in question, contact the person who submitted the results, verify the genealogy to their own personal satisfaction and resolve any conflicting data. Subsequently a discussion may be initiated if it appears a non-paternal event or genealogical error has occurred. It would be relatively easy to implement. I can enter my GEDMATCH kit #, my testing company and attach it to my profile on GENI. I can separately attach my mtDNA or Y DNA results with all the relevant information, etc. And then these results propagate up my tree a finite number of relevant generations and only to the relevant ancestors. If I visit ancestor John Doe I will see six GEDMATCH kits and two Y DNA kits. I can follow where they came from – do my own sanity check on the genealogy and do a comparison to my own results on GEDMATCH or FTDNA, etc. But we would not directly post DNA results to our ancestors. These might be included in the profile biography of an ancestor with related links to further details. But the primary attachment of test results would always be the profile of the test taker. Only discussions of these data would be included on our ancestor’s profiles. The data itself would be automatically propagated and displayed as a list like a project or on a separate tab.
So for example if Joe X from Virginia decides Mayflower passenger Z has a new daughter and attaches her to passenger Z’s profile, and Erica finds this new daughter and disconnects her, then the DNA data would automatically be disconnected as well.
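Here is a small Python sketch of how that propagation could work. It is not how GENI or any other site actually stores data: the `Profile` class, the field names and the 20-generation limit are all hypothetical, and the U5b/FTDNA values are just placeholders. The key idea is that results are stored only on the tester and looked up through the tree each time, so breaking a link removes them from the ancestor without any extra step.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)
class Profile:
    name: str
    mother: Optional["Profile"] = None
    father: Optional["Profile"] = None
    # Tests live only on the tester's own profile: (kind, haplogroup, company).
    tests: list = field(default_factory=list)


def line_ancestors(tester: Profile, kind: str, max_generations: int = 20) -> Iterator[Profile]:
    """Walk the mother line for mtDNA, otherwise the father line (treated as Y-DNA)."""
    parent_attr = "mother" if kind == "mtDNA" else "father"
    current = getattr(tester, parent_attr)
    for _ in range(max_generations):
        if current is None:
            return
        yield current
        current = getattr(current, parent_attr)


def results_shown_on(ancestor: Profile, all_profiles: list) -> list:
    """Collect every test whose direct line currently reaches this ancestor."""
    shown = []
    for person in all_profiles:
        for kind, haplogroup, company in person.tests:
            if any(a is ancestor for a in line_ancestors(person, kind)):
                shown.append((person.name, kind, haplogroup, company))
    return shown


# Hypothetical four-generation mother line.
progenitor = Profile("Progenitor A")
daughter = Profile("Daughter", mother=progenitor)
granddaughter = Profile("Granddaughter", mother=daughter)
tester = Profile("Living tester", mother=granddaughter,
                 tests=[("mtDNA", "U5b", "FTDNA")])
people = [progenitor, daughter, granddaughter, tester]

print(results_shown_on(progenitor, people))  # the tester's result is listed

granddaughter.mother = None                  # a curator disconnects the line
print(results_shown_on(progenitor, people))  # nothing is listed any more
```

Because nothing is copied onto the ancestor, there is no stale haplogroup left behind when the genealogy is corrected.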
As an aside I have had my mtDNA sequenced by two separate companies (FTDNA and FullGenomes) and these data analyzed by four separate groups and out of over 16,000 base-pairs compared not a single discrepancy was found.
Information about what test was taken and what company did the test is important because accuracy can vary and some tests have known issues. When determining a sequence the lab will actually sequence the same segment of DNA a number of times and compare the resulting data. The number of times they sequence the same segment is called the “coverage.” The higher the coverage the more expensive the test. So for example you would pay more for 30x coverage than for 2x coverage but the results for the former will be more accurate than the results for the latter. By comparing the differences between each raw sequence a “call” can be made about the actual sequence. The call is reported with a specific probability and these data can be important especially if the probability of the call is low. In some cases I have found conflicting STR or SNP calls only to learn that the call probability on that STR was low.
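As a toy illustration of why coverage matters, the Python sketch below makes a call at one position by simple majority vote over the reads. Real labs use base quality scores and statistical models, so the fraction returned here is only a stand-in for a call probability, and the reads are invented.

```python
from collections import Counter


def call_base(observations: str) -> tuple[str, float]:
    """Majority-vote call at one position.

    `observations` holds one letter per read covering the position and must
    not be empty. Returns the most common base and the fraction of reads
    that agree with it.
    """
    counts = Counter(observations)
    base, count = counts.most_common(1)[0]
    return base, count / len(observations)


# Hypothetical reads at a single position.
print(call_base("A" * 29 + "G"))  # 30x coverage with one stray read
print(call_base("AG"))            # 2x coverage, the two reads disagree
```

At 30x a single bad read barely moves the result, while at 2x one bad read leaves the call at a coin toss.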
The point is the more information that can be provided about the test the better.
Here is an example of what I am referring to on Wikitree:
http://www.wikitree.com/wiki/Baker-17919
On this profile on the left side I see two different Y DNA results and nine different GEDMATCH kits with information about where the test was taken. I can see in this profile two children and I can note that a Y DNA test was taken by a descendant of each of these two children and comparing these results I can see that both of these men belong to the same extremely rare haplotype. I know that this haplotype occurs in less than 1% of the population. This information is useful to me. I can then take the GEDMATCH kits and compare them on GEDMATCH. I can visit AncestryDNA and FTDNA and compare matches using their respective tools. I can contact the tester. I can trace the results to the tester and do a sanity check on their genealogy.
Note all of these DNA tests are attached to the profiles of the tester and not the ancestor. They then automatically propagate up the tree.
If I find an error in the genealogy of a descendant I can disconnect that line and the DNA results automatically disappear from the profile.
To me this looks like the future of "one world tree" style genealogy sites.
GENI should adopt this.
Here we go with the correct link for
http://repository.up.ac.za/bitstream/handle/2263/32007/Greeff_Appel...
Appel Botha Cornelitz - the abc of a 300 year old divorce case
Jaco M. Greeff & Christos Erasmus, Dept. of Genetics, University of Pretoria
I would be interested in Noelle & Roland's comments on the study.
Here's a discussion
https://www.geni.com/discussions/151927?msg=1059959
Display DNA as a field in profiles
Roland, yes I agree with you if you have a rare DNA group you can say so with more certainty... so maybe a more simple general example...
If you compare the DNA of 3 individuals - A, B and C...
A is this ancestor (Francoise Martinet, SM/PROG b1659 - c1701 - more than 4 generations ago so aDNA tests not useful) with unknown DNA - and we try to prove the paper trail.
B and C are living individuals, who have done their non-autosomal DNA tests, with a direct paper trail to maternal A.
I agree 100% that should the DNA of A be known, then if B's DNA matches A's DNA - the test has been extremely useful. The paper trail of B to A is therefore proven and you need just one match. As in the Lincoln case where there were only a few used.
However, what happened on this profile is that the DNA of individual A is not known - but with DNA we can find it - a reality - but with a large enough sample. The question is how many individuals B, C, D, ... do you need to 'prove' the DNA of A.
From statistics 101 :) you should have the numbers I quoted previously (and the link to the Wikipedia article) where A's DNA is unknown and needs to be estimated.
Maybe the actual numbers (and not names) of matches and non-matches, should be given too - that would be great?
How about more than one DNA field calculated (and automated) from descendants' DNA tests - with the numbers of 'hits' for each of mt, Y and A - I am just speculating - but it can be a great source to find mistakes (or otherwise prove it correct) in the WFT...
I would further like to point out that mt and Y only test a very small set of your ancestors - will not bother you with the maths, but there is nothing that I can do with DNA to prove my relationship to this profile (as my many times great-grandmother but not direct maternal) so my interest is merely that those who can, do so accurately as it impacts 1000s (potentially billions) of individuals.
The problem is exactly that if one says 'my tree and DNA is correct' with only your results, with A unknown, then you would 'disconnect individuals' that does not match. You need more than a few results.
If this discussion needs to be moved to another discussion or project, could a C please do so?
Jan, reread Roland's first post - which is an explanation of why your link to a definition of statistical significance isn't a useful specific comment on the process of attributing mt or Y DNA to a given ancestor on a family tree.
Then re read the Botha Appel paper and think about it a bit.
Your triangulation A, B, C scenario still requires A to be dug up. Once A has been dug up, no statistical intervention is required at all. You don't need B and C to be compared when you can compare C to A as well. DNA mutation over 400 years is a drop in the ocean.
=The problem is exactly that if one says 'my tree and DNA is correct' with only your results, with A unknown, then you would 'disconnect individuals' that does not match. You need more than a few results.=
This is exactly NOT the problem. If you only had your results, then what results would you be using to 'disconnect people who don't match'?
If two purported descendants of a genealogically determined mtDNA ancestor produce different mtDNA results, then the kind of study that happened on the Appel Botha line ensues. That is very exciting and really good for the veracity of our genealogy tree.
The fact that the Appel Botha result rendered thousands of SA genealogical DVNumbers incorrect and resulted in calls for graves to be dug up, isn't proof that thousands of DNA tests are required for statistical significance in this case. It's a misunderstanding of Stats 101 :-) to assume that.
Sharon, you don't understand my last message :( Roland's example is not the same as my example and therefore his mail as to why it is not statistically relevant is not relevant.
If you assign a DNA where you don't show the statistical relevance, people (either directly or indirectly) would question the accuracy and may as result 'break' a tree that was entirely correct, but the assumed DNA was incorrect.
Jan, all these answers are about your initial post:
=:Please refer to:https://en.wikipedia.org/wiki/Sample_size_determination
Specifically, you need a sample size of 100 for 90% certainty, and a sample of approximately 1000 for 97% certainty. Ultimately, how much error can we afford and what would be a reasonable sample size to determine mtDNA group 'in all probability'?=
explaining to you why this assumption is completely incorrect, and a misunderstanding of both statistical significance and its application to genetic genealogy.
=If you assign a DNA where you don't show the statistical relevance, people (either directly or indirectly) would question the accuracy and may as result 'break' a tree that was entirely correct, but the assumed DNA was incorrect.=
makes no sense. What statistical relevance are you talking about?
The reliability of the extrapolated DNA data depends entirely on the reliability of the paper trail that has created the family tree (including, in the SA case, the genealogical DVNumbers you have been arguing should be in the Suffix Field of every profile, and stridently messaging me about in private.)
Since we don't require users to post the statistical reliability of the sources they've used to create the family tree , there is no way to calculate the statistical reliability of the extrapolated DNA based on these sources. Surely you can see this?
(It would have been interesting if the Botha DNA analysis had gone on to calculate the statistical reliability of the DVNumbers, as the DNA results rendered thousands of Botha DVNumbers completely incorrect.)