--- title: "The M statistic: A simple method to measure the impact of systematic heterogeneity in GWAS meta-analysis." author: "Lerato E. Magosi" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{The M statistic: A simple method to measure the impact of systematic heterogeneity in GWAS meta-analysis.} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, echo = FALSE, message = FALSE} # Global options knitr::opts_chunk$set(collapse = T, comment = "#>", fig.path="fig/") options(tibble.print_min = 4L, tibble.print_max = 4L) ``` ## Background * Statistical heterogeneity can be described as consistent study effect-size differences among participating studies in a meta-analysis, across the array of genetic variants examined. * The getmstatistic function `?getmstatistic` computes M statistics to quantitatively describe systematic heterogeneity in meta-analysis. * M statistics are useful for identifying outlier studies which show null effects or consistently show stronger/weaker effects than average. ## Properties of the M-statistic * M's power to detect systematic heterogeneity patterns increases with the number and strength of association of variants examined in the meta-analysis. * M statistics can be calculated using a panel of known lead variants independently associated with the trait of interest or the GWAS significant lead variants in a meta-analysis. * Furthermore, M's power is relatively indepenent of the number of studies in a meta-analysis. * The M statistic can be applied to meta-analyses with few studies (e.g. 10 studies). #### See the M statistic paper for details on the statistical theory: Magosi LE, Goel A, Hopewell JC, Farrall M, on behalf of the CARDIoGRAMplusC4D Consortium (2017) Identifying systematic heterogeneity patterns in genetic association meta-analysis studies. PLoS Genet 13(5): e1006755. [https://doi.org/10.1371/journal.pgen.1006755](https://doi.org/10.1371/journal.pgen.1006755). #### This vignette walks through the steps involved in carrying out an M analysis --- ## A few tips for a succesful M analysis Prior to conducting an M analysis: * ensure that the variants that will be used to conduct the M analysis are independently associated with the trait of interest i.e. in linkage equilibrium. * ensure that the effect alleles for participating studies in the meta-analysis are aligned. This can be achieved by "flipping" the beta-values (i.e. multiply by -1) of misaligned effect alleles. * apply a study-level genomic corrrection to the standard errors to minimize false positives. For example: `gcse <- se * sqrt(lambda)` ## Data: heartgenes214 #### We shall use the heartgenes214 dataset for our M analysis. ``` heartgenes214 is a multi-ethnic GWAS meta-analysis dataset for coronary artery disease. It comprises summary data (effect-sizes and their corresponding standard errors) for 48 studies (68,801 cases and 123,504 controls), at 214 lead variants independently associated with coronary artery disease (P < 0.00005, FDR < 5%). Of the 214 lead variants, 44 are genome-wide significant (p < 5e-08). The meta-analysis dataset is based on individuals of: African American, Hispanic American, East Asian, South Asian, Middle Eastern and European ancestry. ``` #### Note: * The beta-values in heartgenes214 have already been flipped to align effect alleles. * Also, the standard errors have been genomically controlled at the study-level. #### The data was sourced from: Magosi LE, Goel A, Hopewell JC, Farrall M, on behalf of the CARDIoGRAMplusC4D Consortium (2017) Identifying systematic heterogeneity patterns in genetic association meta-analysis studies. PLoS Genet 13(5): e1006755. [https://doi.org/10.1371/journal.pgen.1006755](https://doi.org/10.1371/journal.pgen.1006755). ## The basics: Conducting an *M* analysis The getmstatistic function requires the following variables for an M analysis: * A numeric vector of beta-values i.e. study effect-sizes. * A numeric vector of standard errors. * A character vector of variant/SNP names. * A character vector of study names. ``` # Load libraries ------------------------------------------------ library(getmstatistic) # for calculating M statistics library(gridExtra) # for generating tables # Basic M analysis using heartgenes214 -------------------------- # heartgenes214 is a multi-ethnic GWAS meta-analysis dataset for coronary artery disease. # ?heartgenes214 to view the dataset documentation. head(heartgenes214) # View the structure of the heartgenes214 dataset str(heartgenes214) # Run M analysis on all 214 lead variants # To view getmstatistic documentation ?getmstatistic or ?getm # Set directory to store plots, this can be a temporary directory e.g. `plots_dir <- tempdir()` # or a path to a directory of choice e.g. `plots_dir <- "~/Downloads"` plots_dir <- "~/Downloads" getmstatistic_results <- getmstatistic(heartgenes214$beta_flipped, heartgenes214$gcse, heartgenes214$variants, heartgenes214$studies, save_dir = plots_dir) getmstatistic_results # Explore results generated by getmstatistic -------------------- # Retrieve dataset of M statistics dframe <- getmstatistic_results$M_dataset head(dframe) str(dframe) # Retrieve dataset of stronger than average studies (significant at 5% level) getmstatistic_results$influential_studies_0_05 # Retrieve dataset of weaker than average studies (significant at 5% level) getmstatistic_results$weaker_studies_0_05 # Retrieve number of studies and variants getmstatistic_results$number_studies getmstatistic_results$number_variants # Retrieve expected mean, sd and critical M value at 5% significance level getmstatistic_results$M_expected_mean getmstatistic_results$M_expected_sd getmstatistic_results$M_crit_alpha_0_05 ``` ## Understanding the plots generated by getmstatistic ![getmstatistic_plots](fig/getmstatistic_vignette_fig_even_smaller.png) M scatterplot: M statistics for each study in the meta-analysis (Y- axis) are plotted against the average variant effect size (expressed as odds ratios) (X-axis) in each study. The dashed lines indicate the Bonferroni corrected 5% significance threshold (M= ±0.483) to allow for multiple testing of 48 studies. Studies showing weaker (M < 0) than average genetic effects can be distinguished from those showing stronger (M > 0) than average effects. ## A step further: Exploring available options ``` # Run M analysis on GWAS significant lead variants -------------------- # Subset the GWAS significant variants (p < 5e-08) in heartgenes214 heartgenes44 <- subset(heartgenes214, heartgenes214$fdr214_gwas46 == 2) # Set directory to store plots, this can be a temporary directory e.g. `plots_dir <- tempdir()` # or a path to a directory of choice e.g. `plots_dir <- "~/Downloads"` plots_dir <- "~/Downloads" # Exploring getmstatistic options: # Estimate heterogeneity using "REML", default is "DL" # Modify x-axis of M scatterplot # Run M analysis verbosely getmstatistic_results <- getmstatistic(heartgenes44$beta_flipped, heartgenes44$gcse, heartgenes44$variants, heartgenes44$studies, save_dir = plots_dir, tau2_method = "REML", x_axis_increment_in = 0.03, x_axis_round_in = 3, produce_plots = TRUE, verbose_output = TRUE) getmstatistic_results ``` ## Generating M statistic forest plots ``` library(metafor) # for conducting meta-analysis library(dplyr) # for sorting data.frames # We shall use the results from the M analysis of the 214 lead variants in heartgenes214 # As a reminder, here is the command below: # Set directory to store plots, this can be a temporary directory e.g. `plots_dir <- tempdir()` # or a path to a directory of choice e.g. `plots_dir <- "~/Downloads"` plots_dir <- "~/Downloads" getmstatistic_results <- getmstatistic(heartgenes214$beta_flipped, heartgenes214$gcse, heartgenes214$variants, heartgenes214$studies, save_dir = plots_dir) # Sort getmstatistic_results dataframe by M statistics getm_res_srtd <- dplyr::arrange(getmstatistic_results$M_dataset, M) head(getm_res_srtd) # Add on the case and control fields # First, drop duplicate study_names in the sorted getmsatistic_results getm_res_srtd_nodups <- subset(getm_res_srtd, getm_res_srtd$snp == 5) # Checking dimensions to confirm that we have 48 studies str(getm_res_srtd_nodups) # Second, drop duplicate study_names in heartgenes heartgenes214_nodups <- subset(heartgenes214, heartgenes214$variants == "chr14:75614504:I") str(heartgenes214_nodups) # Third, merge the two dataframes to add on the cases and controls getm_res_plus_cases_ctrls <- merge(getm_res_srtd_nodups, heartgenes214_nodups[, c("studies", "cases", "controls")], by.x="study_names_in", by.y="studies") # Sort data.frame by M statistics getm_res_plus_cases_ctrls_srtd <- dplyr::arrange(getm_res_plus_cases_ctrls, M) # Compute inverse-variance weighted fixed effects model # using fixed effects model to get variable box size metafor_results_fe <- metafor::rma.uni(yi = getm_res_plus_cases_ctrls_srtd[, "M"], sei = getm_res_plus_cases_ctrls_srtd[, "M_se"], weighted = T, slab = sprintf("%02.0f", getm_res_plus_cases_ctrls_srtd[, "study"]), method = "FE") metafor_results_fe # Compute inverse-variance weighted random effects model metafor_results_dl <- metafor::rma.uni(yi = getm_res_plus_cases_ctrls_srtd[, "M"], sei = getm_res_plus_cases_ctrls_srtd[, "M_se"], weighted = T, knha = T, slab = sprintf("%2.0f", getm_res_plus_cases_ctrls_srtd[, "study"]), method = "DL") metafor_results_dl # Plotting: # set margins par(mar=c(4,4,1,2)) # generate forest plot forest(metafor_results_fe, xlim=c(-1.8, 1.6), at=c(-1, -0.5, 0, 0.5, 1), cex=0.66, xlab = "M statistic", ilab=cbind(getm_res_plus_cases_ctrls_srtd$cases, getm_res_plus_cases_ctrls_srtd$controls), ilab.xpos = c(-1.4,-1.1), ilab.pos = c(2,2), addfit=F) # Adding labels text(1.6, 50, "M statistic [95% CI]", pos=2, cex=0.66) text(-1.6, 50, "Cases", pos=4, cex=0.66) text(-1.39, 50, "Controls", pos=4, cex=0.66) text(-1.8, 50, "Study", pos=4, cex=0.66) ``` ![getmstatistic_plots](fig/getmstatistic_vignette_forestplot_fig_even_smaller.png)