Science. 2018 Apr 13;360(6385):171-175.
doi: 10.1126/science.aam9309. Epub 2018 Mar 1.

Quantitative analysis of population-scale family trees with millions of relatives

Joanna Kaplanis et al. Science.

Abstract

Family trees have vast applications in fields as diverse as genetics, anthropology, and economics. However, the collection of extended family trees is tedious and usually relies on resources with limited geographical scope and complex data usage restrictions. We collected 86 million profiles from publicly available online data shared by genealogy enthusiasts. After extensive cleaning and validation, we obtained population-scale family trees, including a single pedigree of 13 million individuals. We leveraged the data to partition the genetic architecture of human longevity and to provide insights into the geographical dispersion of families. We also report a simple digital procedure to overlay other data sets with our resource.


Figures

Fig. 1. Overview of the collected data
(A) The basic algorithmic steps to form valid pedigree structures from the input data available via the Geni API. Gray: profiles; red: marriages (see fig. S2 for a comprehensive overview). The last step shows an example of a real pedigree from the website with ~6,000 individuals spanning about 7 generations. (B) The size distribution of the largest 1,000 family trees after data cleaning, sorted by size.
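As a rough illustration of how a tree-size distribution like the one in panel B could be computed, the sketch below groups profiles into connected components with a union-find structure and sorts the component sizes. It is not the authors' cleaning procedure (that is described in fig. S2). The function name `tree_sizes`, the profile identifiers, and the link list are hypothetical.

```python
from collections import Counter


def tree_sizes(profiles, links):
    """Return the sizes of connected family trees, largest first.

    profiles: iterable of profile identifiers (hypothetical IDs).
    links: iterable of (id_a, id_b) pairs, e.g. parent-child or marriage
           links that are assumed to have already passed data cleaning.
    """
    parent = {p: p for p in profiles}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two trees

    counts = Counter(find(p) for p in parent)
    return sorted(counts.values(), reverse=True)


# Toy input: one family of three, one couple, one isolated profile.
toy_profiles = ["a", "b", "c", "d", "e", "f"]
toy_links = [("a", "c"), ("b", "c"), ("a", "b"), ("d", "e")]
print(tree_sizes(toy_profiles, toy_links))  # [3, 2, 1]
```

Union-find keeps the grouping close to linear in the number of links, which matters at the scale of tens of millions of profiles. A real pipeline would also need to resolve conflicting or cyclic relationships before this step.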
Fig. 2. Analysis and validation of demographic data
(A) Distribution of life expectancy per year. The colors correspond to the frequency of profiles of individuals who died at a certain age for each year. Stars indicate deaths at military age during the Civil War, WWI, and WWII. (B) The expected lifespan in Geni (black) and in the Oeppen & Vaupel study (red, ref. 30) as a function of year of death. (C) Comparison of the lifespan distributions in Geni (black) and the Human Mortality Database (HMD, red) (also see fig. S5A). (D) The geographic distribution of the annotated place-of-birth information. Every pixel corresponds to a profile in the dataset. (E) Validation of geographical assignment by historical trends. Top: the cumulative distribution of profiles since 1500 for each city on a logarithmic scale as a function of time. Bottom: year of first settlement in the city.
Fig. 3. The genetic architecture of longevity
(A) The regression (red) of child longevity on mid-parent longevity (longevity is defined as the difference between age at death and the expected lifespan). Black: the average longevity of children binned by the mid-parent value. Gray: estimated 95% confidence intervals. (B) The estimated narrow-sense heritability (red squares) with 95% confidence intervals (black bars) obtained by the mid-parent design, stratified by the average decade of birth of the parents. (C) The correlation of a trait as a function of IBD under strict additive (h², orange), squared (V_AA, purple), and cubic (V_AAA, green) epistasis architectures after dominance adjustments. (D) The average longevity correlation as a function of IBD (black circles) grouped in 5% increments (gray: 95% CI) after adjusting for dominance. Dotted line: the extrapolation of the models towards MZ twins from the Danish Twin Registry (red circle).
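The mid-parent design in panels A and B rests on a standard quantitative-genetics result: the slope of the regression of offspring values on the average of the two parents estimates narrow-sense heritability. The sketch below shows that calculation, together with the textbook expectation for the correlation between relatives under additive and additive-by-additive epistatic variance (dominance ignored). The function names and toy numbers are hypothetical. The authors' actual estimation, including covariate handling and confidence intervals, is not reproduced here.

```python
def midparent_slope(father, mother, child):
    """Least-squares slope of child values on mid-parent values.

    Inputs are longevity values, assumed here to be already expressed as
    age at death minus the expected lifespan for that person.
    """
    mid = [(f + m) / 2 for f, m in zip(father, mother)]
    n = len(mid)
    mean_x = sum(mid) / n
    mean_y = sum(child) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mid, child))
    sxx = sum((x - mean_x) ** 2 for x in mid)
    return sxy / sxx


def expected_correlation(ibd, h2, v_aa=0.0, v_aaa=0.0):
    """Expected trait correlation for relatives sharing a fraction `ibd`.

    h2, v_aa and v_aaa are the additive, pairwise-epistatic and
    three-way-epistatic variance components as fractions of the
    phenotypic variance. Dominance and shared environment are ignored
    (a simplifying assumption of this sketch).
    """
    return ibd * h2 + ibd ** 2 * v_aa + ibd ** 3 * v_aaa


# Toy data constructed so that child = 0.25 * mid-parent exactly.
fathers = [-4.0, 2.0, 6.0, 0.0]
mothers = [0.0, -2.0, 2.0, 4.0]
children = [-0.5, 0.0, 1.0, 0.5]
print(midparent_slope(fathers, mothers, children))

# Full siblings or parent-child pairs share about half their genome IBD.
print(expected_correlation(0.5, h2=0.2))
```

Because epistatic terms scale with higher powers of IBD, the three architectures diverge most for close relatives, which is why the extrapolation towards MZ twins (IBD of 1) is informative.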
Fig. 4. Analysis of familial dispersion
(A) The median distance [log10(x+1)] between father-offspring places of birth (cyan), mother-offspring places of birth (red), and the marital radius (black) as a function of time (average year of birth). (B) The rate of change in the country of birth for father-offspring (cyan) or mother-offspring (red) pairs, stratified by major geographic areas. (C) The average IBD [log2] between couples as a function of average year of birth. Individual dots represent the measured average per year. The black line denotes the smooth trend using locally weighted regression. (D) The IBD of couples as a function of marital radius. The blue line denotes the best linear regression line in log-log space.
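To make the distance measure in panel A concrete, the sketch below computes a great-circle distance between two places of birth and the median of the log10(x+1) transform over a set of pairs. Using the haversine formula, a spherical Earth of radius 6371 km, and kilometres as the unit are assumptions of this example; the caption does not state how the authors measured distance. The function names and coordinates are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_log_distance(pairs):
    """Median of log10(distance + 1) over ((lat, lon), (lat, lon)) pairs.

    The +1 keeps pairs born in the same place (distance 0) defined
    under the logarithm.
    """
    return median(math.log10(haversine_km(*a, *b) + 1) for a, b in pairs)


# Hypothetical parent-offspring birthplace pairs (degrees).
toy_pairs = [
    ((52.0, 5.0), (52.0, 5.0)),     # same place
    ((52.0, 5.0), (52.5, 5.5)),     # nearby town
    ((52.0, 5.0), (40.7, -74.0)),   # transatlantic move
]
print(median_log_distance(toy_pairs))
```

The marital radius in the same panel could be computed the same way by passing the birthplaces of the two spouses as a pair.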

References

    1. Fisher RA, Trans. R. Soc. Edinb. 52, 399–433 (1919).
    2. Wright S, J. Agric. Res. 20, 557–585 (1921).
    3. Tenesa A, Haley CS, Nat. Rev. Genet. 14, 139–149 (2013).
    4. Kong A et al., Nat. Genet. 40, 1068–1075 (2008).
    5. Lowe JK et al., PLoS Genet. 5, e1000365 (2009).