Inferring population structure¶
Here we provide an implementation of the EPISTRUCTURE algorithm by Rahmani et al. [1] for inferring population structure from methylation without the need for genotyping data. The EPISTRUCTURE algorithm calculates components that are correlated with the ancestry information of the samples in the data by applying principal component analysis (PCA) on a set of pre-defined list of sites that were found to capture a high level of genetic information. The EPISTRUCTURE components can be added as covariates in a downstream analysis.
Note
The example commands described bellow assume that the user generated GLINT files with covariates file and phenotypes file.
Note
The reference list of sites were found based on European individuals.
EPISTRUCTURE¶
–epi:
Computes the EPISTRUCTURE components and generates a file titled epistructure.pcs.txt with the output.
For example:
python glint.py --datafile datafile.glint --epi
will compute the EPISTRUCTURE components of the data.
Note
EPISTRUCTURE leverages polymorphic sites in order to capture the genetic and therefore the ancesty information in the data better. Therefore, we recommend to avoid removing polymorphic sites (–rmpoly) before applying EPISTRUCTURE.
Note
In case of data from heterogeneous source (e.g., blood), we suggest to account for type composition (see –covar).
Note
Use –epi together with –gsave in order to generate a new version of GLINT files with the computed EPISTRUCTURE components (these will be included in the datafile.samples.txt file).
Note
Use –out in order to change the default output name.
–covar:
Selects covariates to use in the calculation of the EPISTRUCTURE components. Considering highly dominant genome-wide effectors such as cell type composition (in case of heterogeneous tissue) is expected to improve the correlation of the EPISTRUCTURE components with the cell type composition.
For example:
python glint.py --datafile datafile.glint --epi --covar c1 c2 c3
will compute the EPISTRUCTURE components while accounting for the covariates c1, c2 and c3. The names of the covariates are defined by the headers in the datafile.samples.txt file associated with the datafile.glint. For more details see GLINT files.
Note
Use the argument –covarfile in order to provide covariates that were not included in the datafile.glint file or in case where a textual version of the data is used rather than a .glint file.
–savepcs
Selectes the number of EPISTRUCTURE components to output (default is 1).
For example:
python glint.py --datafile datafile.glint --epi --savepcs 2
will compute the first two EPISTRUCTURE components of the data.
[1] | Rahmani, Elior, Liat Shenhav, Regev Schweiger, Paul Yousefi, Karen Huen, Brenda Eskenazi, Celeste Eng et al. “Genome-wide methylation data mirror ancestry information.” bioRxiv (2016): 066340. |