The impact of population heterogeneity on risk estimation in genetic counseling

Background: Genetic counseling has been an important tool for evaluating and communicating disease susceptibility for decades, and it has been applied to predict risks for a wide class of hereditary disorders. Most diseases are complex in nature and are affected by multiple genes and environmental conditions; it is highly likely that DNA tests alone do not capture all the genetic factors responsible for a disease, so that persons classified into the same risk group by DNA testing could actually have different disease susceptibilities. Ignoring population heterogeneity may lead to biased risk estimates, whereas additional information on population heterogeneity may improve the precision of such estimates.
Methods: Although DNA tests are widely used, few studies have investigated the accuracy of the predicted risks. We examined the impact of population heterogeneity on predicted disease risks by simulating three different heterogeneity scenarios and studied the precision and accuracy of the risks estimated from a logistic regression model that ignored population heterogeneity. We then incorporated information about population heterogeneity into the original model and investigated the resulting improvement in the accuracy of risk estimation.
Results: We found that heterogeneity in one or more categories could lead to biased estimates not only in the "contaminated" categories but also in other homogeneous categories. Incorporating information about population heterogeneity into the original model greatly improved the accuracy of risk estimation.
Conclusions: Our findings imply that without thorough knowledge about the genetic basis of the disease, risks estimated from DNA tests may be misleading. Caution should be taken when evaluating the predicted risks obtained from genetic counseling. On the other hand, the improved accuracy of risk estimates after incorporating population heterogeneity information into the model points to a promising direction for genetic counseling, since more and more new techniques are being invented and disease etiology is being better understood.


Background
With the in-depth study of modern genetics, its principles have been discovered and applied widely in clinical settings. Many diseases have been found to "run in families", exhibiting simple Mendelian inheritance patterns, such as muscular dystrophy and Huntington's disease. Advances in knowledge about the genetic basis of disease enable the expansion of DNA testing both for diagnosis and for prediction of disease susceptibility beyond simply inherited traits. Demand for and expectations of genetic counseling keep increasing over time. Many methodologies have been developed to estimate the age of onset [1][2][3] and lifetime risk or recurrence of hereditary disorders [4][5][6][7]. These have been successfully applied to determine a person's risk of developing a genetic disease or the risk of having a child with a genetic disease [8,9]. However, there are cases where predictive tests based on family history cannot give a satisfactory assessment. For example, BRCA-1 and BRCA-2 are believed to be breast cancer susceptibility genes; Begg [10] has pointed out that lifetime risk estimates for breast cancer derived from samples of multiple-case families are not always applicable to new BRCA-1 or BRCA-2 positive women who request genetic counseling. Because patients undergoing predictive DNA testing usually have no symptoms or clinical presentation, it is particularly important for this type of DNA testing to give precise estimates. A wrong answer in either direction has long-term effects on the individual or family and could lead to irreversible life decisions, e.g. prophylactic mastectomy. Therefore, identifying the causes of biased estimates and their impact on the predicted risks should be an important goal of contemporary genetic counseling.
In this paper, we investigated mechanisms for generating biased risk estimates in heterogeneous populations. We assumed that some individuals in certain groups were "contaminated" by having a very low probability, as a result of unmeasured factors, of getting the disease. If individual contamination status is properly taken into account, the estimated risk should be close to its true value. However, if contamination status is overlooked, the estimated risk will be biased. A dichotomous disease was modeled before and after this latent contamination factor was incorporated into a logistic regression model. The accuracy of the estimates was explored by computing relative bias and root mean square error (RMSE) of the estimated risks using simulated data sets.

Simulations
To find out the effect of population heterogeneity on the estimated disease risks, we carried out a set of simulations. We assumed that the individual disease risks were estimated based on genotype at a biallelic locus (L1) and their exposure to a fixed, dichotomous environmental factor (E1). However, the phenotype resulting from L1 could be overridden by the genotype at another unscreened locus (L2). In addition, interactions between E1 and the joint genotype at L1 and L2 were assumed to be present.
For the two alleles, A and B, at L1, there were three genotypes: AA, AB and BB. Individual genotypes at L1 were simulated assuming Hardy-Weinberg Equilibrium (HWE) with both alleles having equal frequencies. Individual exposure to E1 was randomly determined using a population exposure rate of 0.20. Disease status was determined by the following logistic regression model:
$$\log\frac{p}{1-p} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3,$$
where $p$ is the probability of disease, design variable $x_1$ indicated whether the individual genotype was AA, design variable $x_2$ indicated whether the genotype was BB, and independent variable $x_3$ denoted whether the individual was exposed to the environmental risk factor E1. Allele A was partially dominant to allele B, and AA individuals were most likely to have the disease. In addition, individuals exposed to E1 were more likely to be affected than individuals with the same genotype not exposed to E1. The coefficients of the model were set as follows: $\alpha = -2.197$, $\beta_1 = 1.35$, $\beta_2 = -0.747$ and $\beta_3 = 0.811$; the model contained no genotype-environment interaction term. The coefficients $\beta_1$ and $\beta_2$ measure the impact of genotypes AA and BB relative to AB.
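The six category risks implied by these coefficients can be checked with a short sketch. It assumes only the additive logistic model and the coefficient values stated above; the function and constant names (`true_risk`, `HWE_FREQ`) are illustrative and do not come from the original analysis.

```python
import math

# Coefficients as stated in the text.
ALPHA, BETA_AA, BETA_BB, BETA_E = -2.197, 1.35, -0.747, 0.811

# Hardy-Weinberg genotype frequencies when both alleles have frequency 0.5.
HWE_FREQ = {"AA": 0.25, "AB": 0.50, "BB": 0.25}


def true_risk(genotype: str, exposed: bool) -> float:
    """Disease probability under the additive logistic model (no interaction)."""
    x1 = 1 if genotype == "AA" else 0   # AA versus the AB reference
    x2 = 1 if genotype == "BB" else 0   # BB versus the AB reference
    x3 = 1 if exposed else 0            # exposure to E1
    logit = ALPHA + BETA_AA * x1 + BETA_BB * x2 + BETA_E * x3
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    for genotype in HWE_FREQ:
        for exposed in (False, True):
            label = "exposed" if exposed else "unexposed"
            print(f"{genotype} {label}: {true_risk(genotype, exposed):.3f}")
```

With these values the unexposed risks come to roughly 0.30 (AA), 0.10 (AB) and 0.05 (BB), and the exposed risks to roughly 0.49, 0.20 and 0.11, which is consistent with the ordering AA > AB > BB described in the text.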
To study the impact of population heterogeneity on predictive risks, we looked at three different contamination scenarios. In the first scenario, contamination occurred only in individuals with the AB genotype who were not exposed to E1. This could happen when there was interaction between some external environment and a joint genotype. For example, individuals with AB genotype at L1 and CC genotype at L2 could get the disease with a very low probability if not exposed to E1. In the second scenario, we assumed that contamination was present in two different categories: AB individuals exposed and not exposed to E1. This could occur if the disease phenotype resulting from the AB genotype at L1 was masked by a genotype at the second locus, CC, so that all the individuals with genotype AC/BC had very low disease susceptibilities in the absence of genotype-environment interaction. In the presence of genotype-environment interaction, some AC/BC individuals might express the phenotype normally under one environmental condition but not the other, so that the proportions of contaminated individuals were different in the two categories. In the last scenario, contamination occurred in AB individuals not exposed to E1 and in AA individuals exposed to E1, which was again possible when there was interaction between the environment and some (but not all) joint genotypes.
With contamination properly taken into account, the accuracy of predicted risks should be improved. To investigate potential improvement of the predicted risks, we also estimated the risks by incorporating the contamination factor into the logistic regression model (full model).
For contamination in a single category, an independent variable $x_4$ was used in the full model to denote whether the individual was contaminated or not. Disease status was determined by the full model as follows:
$$\log\frac{p}{1-p} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4,$$
where $p$ is the probability of disease and $\beta_4$ is the coefficient of the contamination indicator. Likewise, for contamination in two categories, an additional independent variable $x_5$ was used in the full model to denote whether the individual was contaminated or not in the second category.
In each contamination scenario, the proportion of contaminated individuals (the contamination factor) varied from 0 to 0.8 in steps of 0.2. The disease risk of contaminated individuals was set to 0.01. Each data set consisted of 600 unrelated individuals with simulated genotypes, environmental exposure status, contamination status and affection status. For each parameter set, 1000 replicate data sets were simulated.
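The data-generating steps described in this section can be sketched as follows for the first scenario (contamination only in unexposed AB individuals). This is a minimal illustration under the stated parameter values, not the original simulation program; the function name `simulate_dataset`, the record layout and the use of Python's `random` module are assumptions made for the example.

```python
import math
import random

# Parameter values stated in the text.
ALPHA, BETA_AA, BETA_BB, BETA_E = -2.197, 1.35, -0.747, 0.811
P_EXPOSED = 0.20          # population exposure rate for E1
CONTAMINATED_RISK = 0.01  # disease risk of contaminated individuals


def simulate_dataset(n=600, r=0.2, seed=None):
    """Simulate one data set for scenario 1 (hypothetical helper).

    r is the proportion of unexposed AB individuals who are contaminated.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Two independent allele draws with equal frequencies give HWE genotypes.
        genotype = "".join(sorted(rng.choice("AB") for _ in range(2)))
        exposed = rng.random() < P_EXPOSED
        # Only unexposed AB individuals can be contaminated in this scenario.
        contaminated = genotype == "AB" and not exposed and rng.random() < r
        if contaminated:
            risk = CONTAMINATED_RISK
        else:
            logit = (ALPHA + BETA_AA * (genotype == "AA")
                     + BETA_BB * (genotype == "BB") + BETA_E * exposed)
            risk = 1.0 / (1.0 + math.exp(-logit))
        rows.append({
            "genotype": genotype,
            "exposed": exposed,
            "contaminated": contaminated,
            "affected": rng.random() < risk,
        })
    return rows


if __name__ == "__main__":
    data = simulate_dataset(n=600, r=0.4, seed=1)
    print(sum(d["affected"] for d in data), "affected of", len(data))
```

Repeating the call with different seeds for each value of `r` in 0, 0.2, ..., 0.8 would give replicate data sets of the kind described above; the other two scenarios would need a second contamination proportion and a different choice of contaminated categories.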

Statistical analysis
Two logistic regression models were used to fit the simulated data sets. The reduced model had two covariates denoting the genotype and one covariate indicating environmental exposure status. The full model had two genotype covariates, one environmental covariate and one or two additional covariates indicating the individual's contamination status. Maximum likelihood estimates of the coefficients were computed using SAS version 8.0. To quantify how far the estimated disease risks were from the true risks, we calculated the RMSE and relative bias of the predicted risks over the 1000 replicate data sets. Relative bias was defined as the bias of the estimated risk divided by the true risk, to facilitate interpretation of the bias on an appropriate scale.
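The two accuracy measures can be written out as a short sketch. The text defines relative bias as bias divided by the true risk; the RMSE formula below is the usual root of the mean squared deviation from the true risk across replicates, which is assumed here because the text does not spell it out. The function names are illustrative.

```python
import math


def relative_bias(estimates, true_risk):
    """Mean estimated risk minus the true risk, divided by the true risk."""
    mean_estimate = sum(estimates) / len(estimates)
    return (mean_estimate - true_risk) / true_risk


def rmse(estimates, true_risk):
    """Root mean square error of the estimated risks around the true risk."""
    squared_errors = [(e - true_risk) ** 2 for e in estimates]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Three made-up replicate estimates for a category whose true risk is 0.10.
    replicate_estimates = [0.08, 0.09, 0.10]
    print(relative_bias(replicate_estimates, 0.10))
    print(rmse(replicate_estimates, 0.10))
```

In the setting described above, `estimates` would hold the 1000 predicted risks for one genotype-exposure category, one per replicate data set.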

Results
Tables 1 and 2 list the RMSE and relative bias of the disease risks estimated using the reduced and full models, averaged over the 1000 replicate data sets, when contamination occurred in AB individuals not exposed to the environmental factor. When there was no contamination, the relative biases were small (less than 1.8 percent) in all six categories in both models. This implied that both the reduced and the full model gave accurate estimates in the absence of contamination. When contamination occurred, both the relative biases and the RMSE of the risks estimated from the reduced model increased in all categories as the proportion of contaminated individuals grew. The predicted risks increased in BB individuals exposed to the environment and in AA individuals exposed to the environment; they decreased in the other four categories. The logit of the contaminated category corresponded to the intercept of the logistic model. Since the estimated coefficients of the logistic regression model were interdependent, a change in one parameter led to changes in all the other parameters. Thus all the predicted risks deviated from their true values as a result of contamination in a single category, though the relative bias increased fastest in the contaminated category. In the contaminated category, the predicted risk differed greatly from its true value (a 15 percent difference) even with 20 percent contaminated individuals. The deviation reached nearly 60 percent when the proportion of contaminated individuals reached 0.8. When contamination status was incorporated into the model (full model), the relative biases and RMSE remained small in all six categories despite the increasing proportion of contaminated individuals. The relative bias was less than 3 percent in all categories even with 80 percent contaminated individuals.
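How contamination in one category spreads to the others through the shared coefficients can be illustrated with a large-sample sketch. Instead of fitting 1000 simulated replicates, it fits the reduced model to the expected category proportions by Newton-Raphson, which approximates what the maximum likelihood fit converges to as the sample size grows. This is an assumption-laden illustration rather than the original SAS analysis: the helper names (`fit_reduced`, `solve`, `expit`) are hypothetical, and relative bias is measured against the uncontaminated model risk in each category.

```python
import math

ALPHA, B_AA, B_BB, B_E = -2.197, 1.35, -0.747, 0.811
GENO_FREQ = {"AA": 0.25, "AB": 0.50, "BB": 0.25}  # HWE, allele frequency 0.5
P_EXPOSED = 0.20
CONTAMINATED_RISK = 0.01


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_reduced(r, iters=50):
    """Relative bias per category when a fraction r of unexposed AB is contaminated."""
    cells = []
    for g, freq in GENO_FREQ.items():
        for e in (0, 1):
            x = [1.0, float(g == "AA"), float(g == "BB"), float(e)]
            clean = expit(ALPHA + B_AA * x[1] + B_BB * x[2] + B_E * x[3])
            # Observed risk is a mixture only in the contaminated category.
            p = (1 - r) * clean + r * CONTAMINATED_RISK if (g, e) == ("AB", 0) else clean
            w = freq * (P_EXPOSED if e else 1 - P_EXPOSED)
            cells.append(((g, e), x, w, p, clean))
    beta = [0.0] * 4
    for _ in range(iters):  # Newton-Raphson on the expected log-likelihood
        score = [0.0] * 4
        info = [[0.0] * 4 for _ in range(4)]
        for _, x, w, p, _ in cells:
            pi = expit(sum(b * v for b, v in zip(beta, x)))
            for i in range(4):
                score[i] += w * x[i] * (p - pi)
                for j in range(4):
                    info[i][j] += w * pi * (1 - pi) * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    return {key: (expit(sum(b * v for b, v in zip(beta, x))) - clean) / clean
            for key, x, _, _, clean in cells}


if __name__ == "__main__":
    for key, rb in fit_reduced(0.2).items():
        print(key, round(rb, 3))
```

The dictionary returned by `fit_reduced(0.2)` is keyed by (genotype, exposure) with exposure coded 0 or 1, and its entries can be compared with the pattern reported above: a negative relative bias in the contaminated category and nonzero biases of either sign in the categories that contain no contaminated individuals.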
Tables 3 and 5 present our findings for the impact of two-category contamination on predictive risks. In general, the relative biases and RMSE increased with increasing proportions of contamination. However, there were cases where they decreased with increasing proportions of contamination. For example, in scenario three the absolute value of the relative bias of the estimated risk of AB individuals not exposed to the environmental factor decreased with increasing contamination in AA individuals exposed to the environment, as shown in Table 5. Indeed, the relative bias varied from -0.109 to 0.008 when $r_{AB,NE}$, the proportion of contaminated AB individuals not exposed to the environment, equaled 0.2. This indicated that with increasing contamination, the predicted risk fell at first, but then rose and became larger than its true value. This was possible because the predicted risks were functions of the model coefficients, and contamination caused different coefficients to change in different directions. Therefore, the predicted risks could move in either direction with increasing contamination. In the third scenario, when the contamination proportions in both categories were 0.2, the estimated risk for AB individuals exposed to the environment was reduced by 13 percent even though there was no contamination in this category. When the two contamination factors were 0.6 and 0.8, the estimated risk for this category decreased by nearly 50 percent.
Tables 4 and 6 list the average relative bias and RMSE of the disease risks for the second and third contamination scenarios estimated using the full model. As in the one-category contamination case, when additional knowledge about the individual's contamination status was available, the relative biases and RMSE were greatly improved in all categories. They remained small despite the increasing proportion of contaminated individuals. In both scenarios, the largest relative biases were about 3 percent even with 80 percent contaminated individuals. These results suggested that the additional contamination covariate was effective in modeling population heterogeneity: the difference in individual disease susceptibility was properly accounted for, and knowledge about contamination could improve the accuracy of the predicted risks.

Discussion
Rapid developments in genetics have an increasing impact on medical practice. Genetic counseling has made it possible to predict an individual's risk for complex genetic diseases that do not cleanly follow Mendelian inheritance patterns. However, predictive tests based on family history cannot always give a satisfactory assessment because of the complexity of human diseases; "one size fits all" techniques appear to be problematic. With incomplete information about disease etiology, individuals classified in the same group by DNA tests might have different susceptibilities to a disease.
The problem of model misspecification is not new. Several studies have investigated the asymptotic relative efficiency (ARE) of misspecified models in testing association between exposure and response [11][12][13]. ARE can be defined loosely as the ratio of the sample sizes required by the correctly specified test and the misspecified test to attain the same power. Among these studies, Begg and Lagakos explored the consequences of model misspecification in logistic regression and showed both theoretically and numerically that models with missing or incorrect covariates required larger sample sizes than the correct models to achieve the same power in testing association between the exposure and response [13]. Although efforts have been made to study the effect of model misspecification, little attention has been paid to the problem in the context of genetic counseling, where predicting disease risk is the primary goal. In this paper, we studied the impact of population heterogeneity on predicted disease risks. A logistic regression model was fitted assuming the individual contamination status was unknown (contamination status is a missing covariate). We quantified the bias of the predicted risks as a function of the level of population contamination through a simulation study. Our results showed that contamination in one or more categories could cause the estimated risks in all categories to deviate from their true values. The departure could be in either direction and the biases were unpredictable. We focused our simulations on three specified situations, though the results could be easily generalized to other scenarios. This implies that without thorough knowledge about the genetic basis of the disease, risks estimated from DNA tests may be misleading. Because the human body is complicated and disease mechanisms are intricate, it is hard to detect contamination status for many genetic disorders. Therefore, caution should be taken when evaluating the predicted risks obtained from genetic counseling.
Our simulation using the full model did show that major improvements could be made if individual contamination status was available and incorporated into the prediction model. This points to a promising direction for genetic counseling, since more and more new techniques are being invented and genetic disorders are being better understood.

Conclusions
Our simulation results showed that heterogeneity in one or more categories could lead to biased estimates of disease risk not only in the "contaminated" categories but also in other homogeneous categories. The predicted risks could move in either direction and the biases were unpredictable. These findings imply that without thorough knowledge about the genetic basis of the disease, risks estimated from DNA tests may be misleading. Caution should be taken when evaluating the predicted risks obtained from genetic counseling.