A comprehensive analysis of common genetic variation in prolactin (PRL) and PRL receptor (PRLR) genes in relation to plasma prolactin levels and breast cancer risk: the Multiethnic Cohort

Background Studies in animals and humans clearly indicate a role for prolactin (PRL) in breast epithelial proliferation, differentiation, and tumorigenesis. Prospective epidemiological studies have also shown that women with higher circulating PRL levels have an increased risk of breast cancer, suggesting that variability in PRL may also be important in determining a woman's risk. Methods We evaluated genetic variation in the PRL and PRL receptor (PRLR) genes as predictors of plasma PRL levels and breast cancer risk among African-American, Native Hawaiian, Japanese-American, Latina, and White women in the Multiethnic Cohort Study (MEC). We selected single nucleotide polymorphisms (SNPs) from both the public (dbSNP) and private (Celera) databases to construct high density SNP maps that included up to 20 kilobases (kb) upstream of the transcription initiation site and 10 kb downstream of the last exon of each gene, for a total coverage of 59 kb in PRL and 210 kb in PRLR. We genotyped 80 SNPs in PRL and 173 SNPs in PRLR in a multiethnic panel of 349 unaffected subjects to characterize linkage disequilibrium (LD) and haplotype patterns. We sequenced the coding regions of PRL and PRLR in 95 advanced breast cancer cases (19 of each racial/ethnic group) to uncover putative functional variation. A total of 33 and 60 haplotype "tag" SNPs (tagSNPs) that allowed for high predictability ($R_h^2$ ≥ 0.70) of the common haplotypes in PRL and PRLR, respectively, were then genotyped in a multiethnic breast cancer case-control study of 1,615 invasive breast cancer cases and 1,962 controls in the MEC. We also assessed the association of common genetic variation with circulating PRL levels in 362 postmenopausal controls without a history of hormone therapy use at blood draw.
Because of the large number of comparisons being performed, we used a relatively stringent type I error criterion (p < 0.0005) for evaluating the significance of any single association; this corrects for approximately 100 independent tests, close to the number of tagSNPs genotyped for both genes. Results We observed no significant associations between PRL and PRLR haplotypes or individual SNPs in relation to breast cancer risk. A nominally significant association was noted between prolactin levels and a tagSNP (tagSNP 44, rs2244502) in intron 1 of PRL. This SNP showed approximately a 50% increase in levels in minor allele homozygotes compared with major allele homozygotes. However, this association was not significant (p = 0.002) using our type I error criterion to correct for multiple testing, nor was this SNP associated with breast cancer risk (p = 0.58). Conclusion In this comprehensive analysis covering 59 kb of the PRL locus and 210 kb of the PRLR locus, we found no significant association between common variation in these candidate genes and breast cancer risk or plasma PRL levels. The LD characterization of PRL and PRLR in this multiethnic population provides a framework for studying these genes in relation to other disease outcomes that have been associated with PRL, as well as for larger studies of plasma PRL levels.


Background
Prolactin (PRL) is an essential regulator of mammary development, acting synergistically with a wide variety of hormones during puberty and pregnancy [1,2]. Early studies in animals first demonstrated that prolactin could induce spontaneous mammary tumors [3][4][5][6]. Results from in vitro studies support the findings from animal studies and suggest that PRL stimulates proliferation [7][8][9][10], increases cell motility and cytoskeleton alterations [11], and promotes angiogenesis [12] in human breast cells. Prolactin receptor (PRLR), found in both normal and malignant breast tissue, has been reported to be slightly more prevalent in malignant tissue [13]. Though early clinical studies of patients treated with bromocriptine, an inhibitor of pituitary PRL, found no association with breast cancer, recent evidence of autocrine/paracrine regulation [14,15] of PRL in extra-pituitary tissue provides further support for a possible role of PRL in tumorigenesis.
There are few prospective epidemiological studies evaluating plasma PRL levels and breast cancer risk. The largest prospective cohort study of postmenopausal women reported a 34% increase in risk of breast cancer when comparing top to bottom quartiles (> 12 vs. < 7.4 ng/mL) of PRL levels [16]; these findings were similar to results from an earlier study reporting a non-significant increase in risk of 1.34, based on a smaller sample size [17]. Two smaller studies of postmenopausal women also reported a positive association, but these were also non-significant [18,19]. Results from case-control studies [20][21][22][23][24][25][26][27] give conflicting results and are difficult to interpret due to the retrospective nature of blood collection. There have been limited prospective data on prolactin levels and breast cancer risk among premenopausal women [18,19,28] until recently; the Nurses' Health Study reported a nonsignificant 30% increase in breast cancer risk among premenopausal women when comparing top to bottom quartiles (> 17.6 vs. < 9.8 ng/mL) of PRL levels among 377 cases and 786 controls [29].
In humans, the PRL gene lies on chromosome 6 and is approximately 10 kilobases (kb) in length with five coding exons [30]. An additional non-coding first exon has been described that lies 5.8 kb upstream of the pituitary promoter site [31]. This distal promoter region has been associated with extra-pituitary expression of PRL, described in a variety of tissues including decidua, lymphocytes, and breast tissue. Depending on promoter usage, PRL mRNAs may differ slightly in length but encode the same mature polypeptide protein hormone [32].
The human PRLR gene is located on chromosome 5, is approximately 180 kb in length, and was originally described as having 10 exons, of which exons 3-10 are coding exons [33]. Recently, six alternative non-coding first exons have been described; their functions are unknown, but they have been found to be expressed in human ovary, testis, liver, breast tissue, and breast cells [34,35]. In addition, an exon 11 located 15 kb downstream of exon 10 has been reported; alternative splicing of exons 10 and 11 appears to produce novel short forms of the receptor that may be involved in signaling pathways distinct from those of the common long form [36,37].
Previous studies have demonstrated that genetic polymorphisms in candidate genes can lead to variations in plasma levels of encoded proteins [38,39]. In this study, we combined two approaches to test the hypothesis that genetic variation in PRL and PRLR is associated with plasma PRL levels and breast cancer risk: sequencing of the coding regions to identify common missense variation, and haplotype-based analyses to characterize common patterns of genetic variation across each locus. Tests of association were performed in a large case-control study of breast cancer among African-American (AA), Native Hawaiian (NH), Japanese-American (JA), Latina (LA), and White (WH) women in the prospective Multiethnic Cohort Study (MEC). To our knowledge, this is the first comprehensive study of common genetic variation in PRL and PRLR genes in relation to breast cancer risk and plasma PRL levels in a multiethnic population.

Characterization of Genetic Variation at PRL and PRLR loci
We genotyped 80 SNPs in PRL and 173 SNPs in PRLR (approximately 1 SNP every 1 kb) to characterize linkage disequilibrium (LD) and haplotype patterns in a multiethnic panel of 349 unaffected subjects (69-70 of each of the 5 racial/ethnic populations in the MEC). We characterized genetic variation across 59 kb of the PRL locus, from 24 kb upstream of PRL's alternative first exon 1a (5.8 kb upstream of the pituitary promoter site) to 20 kb downstream of exon 5, using 80 common (minor allele frequency, MAF, ≥ 5%) SNPs (Additional File 1, Table S1). In PRL, we observed three regions of LD (blocks 1, 3, and 4; see the Methods section for a description of the criteria used to define LD block regions) and one 19 kb region ("pseudo-block 2") with little evidence of LD. Based on the dense coverage across this 19 kb region (1 common SNP every < 1 kb, on average), we decided to construct haplotypes to test associations with common variation (Figure 1, Additional File 1, Table S1). In this region, the multivariate squared correlation, $R_s^2$, between the selected tagSNPs and all SNPs examined in the multiethnic panel was ≥ 0.70 in all ethnic groups, which suggests that unmeasured SNPs in this region are most likely well predicted by our set of tags. Thus, we describe four regions in PRL: block 1 (SNPs 1-24; 14 kb), "block" 2 (SNPs 25-45; 19 kb), block 3 (SNPs 46-59; 7 kb), and block 4 (SNPs 61-77; 14 kb). In general, block sizes in PRL were similar among racial/ethnic groups (Additional Files 2, 3, 4, 5, 6).
Of the 60 tagSNPs selected in PRLR, we were unable to genotype four in the case-control study because Illumina assays could not be designed. All four lie in block 1: SNP6 (rs9986182), SNP12 (rs9292582), SNP24 (rs6451192), and SNP29 (rs7701473). As a result, we could not distinguish between haplotypes 1A1, 1A2, and 1A3 in LA (frequencies 16.9%, 6.4%, and 6.6%), between haplotypes 1A1 and 1A3 in AA (9.2% and 2.2%), or between 1A1 and 1A2 in NH (17.2% and 4.5%) and in WH (34.6% and 5.9%) (Additional File 1, Table S9). Block 1 spans 14.2 kb and lies 142 kb upstream of the start codon in exon 3. Aside from block 1 of PRLR, the predicted common haplotype frequencies in the multiethnic panel were similar to those observed in the larger case-control sample (Additional File 1, Tables S8-S11). Therefore, only haplotypes with ≥ 5% frequency in cases or controls, per each racial/ethnic group, are shown in Additional File 1, Tables S10 and S11. To assess how well the selected tagSNPs perform in capturing the common SNPs that were not selected as tagSNPs in each population, we calculated multi-marker $R^2$ measures for both genes [40]. For PRL, the fraction of SNPs predicted with a multi-marker $R^2$ > 0.7 was 89%, 93%, 98%, 100%, and 100% for AA, NH, JA, LA, and WH, respectively. For PRLR (even without the four tagSNPs), the fraction of SNPs captured with multi-marker $R^2$ > 0.7 was 84%, 92%, 90%, 92%, and 93%. Thus, the selected tagSNPs capture most of the SNPs evaluated in the LD characterization phase, and based on the high-density SNP coverage in this study (1 SNP every ~1 kb, on average), we expect these tags to also predict the vast majority of all common alleles in these genes.
[Figure caption: Linkage disequilibrium (LD) plot across the prolactin receptor (PRLR) locus for all racial/ethnic groups combined]
We sequenced the exons and splice-site regions of PRL and PRLR in germline DNA from 95 advanced breast cancer cases (19 of each racial/ethnic group). PRL and PRLR sequencing confirmed only one missense SNP, Ile 100 Val (rs16871473) in exon 5 of PRLR. The SNP was observed most commonly among Native Hawaiians (MAFs, 11%, 15%, 5%, 1%, and 2% in AA, NH, JA, LA, and WH, respectively) (Additional File 1, Table S2). A previously reported missense SNP in exon 6 of PRLR (Ile 170 Leu) was monomorphic in all ethnic groups [41]. For PRL, we discovered a low frequency synonymous SNP in exon 3 (A+444152G). We were also able to validate a previously reported synonymous SNP in exon 5 (rs6239), but not a synonymous SNP in exon 2 (rs6240) or a missense SNP in exon 4 (rs6238) (Additional File 1, Table S1).

Case-control analysis
The distribution of breast cancer risk factors among the 1,615 breast cancer cases and 1,962 controls was consistent with the patterns observed in the overall cohort and has been previously published [42] (Additional File 1, Table S3). We tested the independent effects of each tagSNP for PRL and PRLR in the case-control population (Additional File 1, Tables S4 and S5). Odds ratios (ORs) and 95% confidence intervals (CIs) were estimated for each tagSNP using unconditional logistic regression adjusted for age and ethnicity (co-dominant effects are reported in the manuscript; detailed genotype-specific effects are shown in the tables). Because of the large number of comparisons being performed, we used a relatively stringent type I error criterion (p < 0.0005) for evaluating the significance of any single association. (This "corrects" for performing approximately 100 independent tests, close to the number of tagSNPs genotyped for both genes.) The strongest associations between individual SNPs and breast cancer risk were with SNP34 (rs9466314) in "block 2" of PRL (co-dominant effect OR, 1.48; 95% CI, 1.00-2.18; p = 0.049) and SNP49 (rs34024951) in block 3 of PRLR (co-dominant effect OR, 0.85; 95% CI, 0.73-0.99; p = 0.032) (Table 1). Of note, SNP34 in PRL was only observed among AAs, with a MAF of 6% in cases and 5% in controls in our sample. The missense Ile 100 Val SNP in PRLR was not associated with breast cancer risk (Additional File 1, Table S5).
We performed haplotype analyses using the most common haplotype as the reference group (Additional File 1, Tables S10 and S11); results were similar when we used all other haplotypes as the reference group (data not shown).
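As an illustrative check, and not part of the original analysis, a reported OR and 95% CI can be converted back into an approximate z statistic and two-sided p-value. The sketch below assumes that the intervals are Wald intervals, symmetric on the log-odds scale, which the text does not state; the function name `wald_p_from_or_ci` is hypothetical.

```python
import math

def wald_p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate two-sided Wald p-value from an OR and its 95% CI.

    Assumption: the CI was computed as exp(beta +/- z_crit * SE),
    so SE can be recovered from the width of the log-scale interval.
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = beta / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# SNP34 in PRL: OR 1.48, 95% CI 1.00-2.18 (reported p = 0.049)
z_prl, p_prl = wald_p_from_or_ci(1.48, 1.00, 2.18)
# SNP49 in PRLR: OR 0.85, 95% CI 0.73-0.99 (reported p = 0.032)
z_prlr, p_prlr = wald_p_from_or_ci(0.85, 0.73, 0.99)

print(f"PRL SNP34:  z = {z_prl:.2f}, p = {p_prl:.3f}")
print(f"PRLR SNP49: z = {z_prlr:.2f}, p = {p_prlr:.3f}")
```

Small differences from the published p-values are expected because the published ORs and CIs are rounded to two decimals. Both values lie far above the p < 0.0005 criterion.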

Plasma prolactin level analysis
Among the 362 postmenopausal controls in the biomarker analysis, the median plasma PRL level was 8.1 ng/mL. Prolactin levels did not vary by race/ethnicity, before or after adjusting for potential confounders: parity, age at first pregnancy, body mass index, family history of breast cancer, and menopause age and type (p-heterogeneity = 0.447) (data not shown). The strongest association between a single SNP and PRL levels was with SNP44 (rs2244502) of PRL, which showed approximately a 50% increase in levels between minor allele homozygotes versus major allele homozygotes (Additional File 1, Tables S6 and S7). We also observed nominally significant associations between prolactin levels and seven SNPs in PRL (SNP33, SNP34, SNP39, SNP44, SNP54, SNP62, SNP65) and two SNPs (SNP73, SNP148) in PRLR (Table 2). None of these associations were significant at the p < 0.0005 level.

Discussion
We genotyped a high density of SNPs to characterize the haplotype structure of PRL and PRLR genes, using the criterion for haplotype-based studies described by Gabriel et al. [43] and the multivariate $R_h^2$ statistic [44] to provide high predictability of the common haplotypes in PRL and PRLR. We found that in almost all ethnic groups and for both genes, the selected tagSNPs performed well in predicting the common SNPs typed in the LD characterization phase (average multi-marker $R^2$ = 0.95) and the common haplotypes defined by the tagSNPs (average minimum $R_h^2$ = 0.87).
Assuming an average multi-marker $R^2$ = 0.90 between causal alleles and tagSNPs or haplotype predictors, we had 96% power to detect relative risks of 1.29 per haplotype or genotype copy with 10% frequency, allowing for a 5% type I error rate. However, given the large number of statistical tests for each gene, we expected several false positive associations. Under a more stringent type I error criterion (p < 0.0005), the detectable relative risk, at 90% power, for a dominant allele with 10% frequency is 1.45 per copy. By ethnic group, we had 78-82% power to detect large ORs ≥ 2.1 (except in NH, ORs ≥ 3.0) at this significance level. The purpose of this study, however, was to assess shared common genetic variation across ethnic groups. For PRL levels among 362 controls, only fairly large differences in mean levels could be detected with good power. For example, after correcting for 100 comparisons (i.e., using p < 0.0005), we estimate that we had 90% power to detect an association between PRL levels and a common (10%) variant only when that variant was associated with approximately a 50% change in mean levels per genotype/haplotype copy.
A recent German study of 441 cases and 552 controls reported an increase in breast cancer risk associated with genetic variation in PRL: rs1341239 (SNP35) (OR, 1.67; 95%CI, 1.11-2.50 for homozygous individuals) and rs12210179 (OR, 2.09; 95%CI, 1.23-3.52), which we did not genotype in our sample. SNP35 has been shown to be functionally significant in relation to Systemic Lupus Erythematosus (SLE) [45,46]. Vaclavicek et al. reported that rs12210179 does not lie within any transcription binding site and is in high LD (|D'| = 0.91) with SNP35 [47]. Among Whites in the MEC, SNP35 is well predicted by tagSNP33 (pairwise $R^2$ = 0.86). Using HapMap data [48], rs12210179 is common (27%) among Caucasians (vs. Yorubans 4%, Japanese 1%) and, for Caucasians, is well predicted by tagSNP43 (pairwise $R^2$ = 1.00). Though we did not test these SNPs directly in our study, using these "surrogate" tagSNPs, we did not find any significant association with breast cancer risk among Whites (tagSNP33: OR 0.96; 95%CI, 0.80-1.16).
Vaclavicek et al. also reported a TGTG haplotype in PRL comprised of rs1341239 (SNP35), rs12210179 (not genotyped in our sample), rs2244502 (tagSNP44), and rs1205960 (tagSNP56) associated with breast cancer risk (OR, 1.42; 95%CI, 1.07-1.90) [47]. This haplotype falls in "block" 2 and block 3 of our characterization of the PRL locus (Additional File 1, Table S1). Using 11 tagSNPs for "block 2" (multi-marker $R^2$ = 0.79-1.00 for Whites) and 7 tagSNPs for block 3 (multi-marker $R^2$ = 0.92-1.00 for Whites), we did not observe an association with breast cancer risk (Additional File 1, Table S10). We used "surrogate" tagSNPs 33, 43, 44, and 56 to best approximate the TGTG haplotype but did not observe an association between common surrogate haplotypes and breast cancer risk among Whites (global test p = 0.78) or overall (global test p = 0.70). Further studies are needed to directly evaluate the TGTG haplotype in relation to breast cancer risk, especially among Whites.
We found that tagSNP34 (2.1 kb upstream of SNP35 in the promoter region of PRL) had the strongest association with risk of breast cancer (p = 0.049). It is possible that this SNP may be functionally significant as both SNP34 and SNP35 lie in the distal extra-pituitary promoter region of prolactin. However, this SNP was only observed among AAs, with a minor allele frequency (MAF) of 6% in cases and 5% in controls in our sample. Further studies are needed to assess the relevance of this finding. The strongest association in PRL between a haplotype and breast cancer risk was with haplotype 3I in block 3 (p = 0.036). This haplotype was only observed in JA and NH, and the association with risk was confined to JA.
For PRLR, the only missense SNP previously described in relation to breast cancer risk is a Leu 150 Ile SNP in exon 6 which was reported in 2 of 38 cases in a Turkish study [41]. In our large sample, this SNP was monomorphic; however, it is possible that it is rare or only observed in certain populations.
Vaclavicek et al. also reported a protective TCC haplotype in PRLR (OR, 0.69; 95%CI, 0.54-0.89; p = 0.004) using just three tagSNPs. The TCC haplotype consists of rs13354826 (not genotyped in our sample, block 2), rs9292573 (SNP59, block 3), and rs37389 (SNP141, block 7). In Whites, these SNPs are well predicted: rs13354826 (tagSNPs 7 and 35, HapMap data, multi-marker $R^2$ = 1.00), SNP59 (tagSNP55, pairwise $R^2$ = 1.00), and SNP141 (tagSNP139, pairwise $R^2$ = 0.94). We used "surrogate" tagSNPs 7, 35, 55, and 139 to approximate the TCC haplotype and found that the common haplotypes comprised of these surrogate SNPs were not significantly associated with risk. Though we are unable to form a direct prediction of the TCC haplotype, we believe that our approach is comprehensive enough to have detected a true association within this region of the strength reported by Vaclavicek et al. Using 56 tagSNPs across high density coverage of 210 kb of the PRLR locus (25 kb upstream of first alternative exon E1_3 to 10 kb downstream of exon 11), we did not find an association between SNPs or haplotypes in PRLR and breast cancer risk.
We did not generate convincing evidence of an association between PRL levels and common genetic variation in PRL and PRLR, although our study was limited by small sample size. The most significant p-value was 0.002 for SNP44 in PRL, which corresponds to a 48% increase in PRL levels in minor allele homozygotes relative to major allele homozygotes. The Nurses' Health Study (NHS) [16] demonstrated that a > 1.6-fold difference between upper and lower quartiles of PRL levels was associated with a 34% increase in breast cancer risk. We did not observe an association between breast cancer risk and SNP44 (p = 0.575). However, even if the association between SNP44 and prolactin levels were correct, and assuming a direct influence of genetically determined prolactin levels on breast cancer risk consistent with the Nurses' Health Study, the 48% increase in PRL levels for minor allele homozygotes of SNP44 would still only correspond to a 10% risk increase between carriers and non-carriers of two copies. Such an increase in risk is not detectable in this study with reasonable power, which could explain the apparent lack of association between SNP44 and breast cancer risk in this study. Further studies in larger samples are needed to definitively assess the relationship between this polymorphism, plasma PRL levels, and breast cancer. In addition, our results may not be generalizable to premenopausal women since we only included postmenopausal women in our analysis. Prolactin levels have been shown to decline slightly among postmenopausal women compared to premenopausal women [2]. However, the NHS evaluated prolactin levels among premenopausal and postmenopausal women and found no difference in risk of breast cancer by menopausal status: premenopausal (RR 1.3, 95% CI 0.9-1.9) vs. postmenopausal (RR 1.3, 95% CI 1.0-1.8) women [16,29]. It is unclear whether we could draw similar conclusions from our study population.
Strengths of this study include the large case-control sample size, comprehensive assessment of LD block structure, and tagSNP selection providing excellent prediction of nearly all SNPs or common haplotypes across five racial/ethnic populations. However, ethnic-specific risks and associations with plasma PRL levels should be interpreted with caution, due to the small number of subjects in these groups. Further studies using larger samples of PRL levels are needed to assess the relationship with polymorphisms in the PRL and PRLR genes and, in particular, to validate the association observed between PRL levels and SNP44 in PRL.

Conclusion
This is the largest and most comprehensive study of common genetic variation in PRL pathway genes in relation to breast cancer risk and plasma PRL levels. In contrast to a recent study of PRL and PRLR in relation to breast cancer, we observed no strongly significant associations with breast cancer risk. We also did not find an association between common genetic variation in PRL or PRLR and circulating plasma PRL levels. Our results emphasize the importance of using high density genotyping to adequately characterize genes for use in association studies and caution against false positive results when interpreting these data. Though we did not observe an association with breast cancer risk, results from our study provide a framework for future association studies of PRL pathway genes in relation to other diseases (such as Systemic Lupus Erythematosus) and for larger studies of plasma PRL levels.

Subjects
The MEC consists of over 215,000 men and women in Hawaii and Los Angeles (with additional African-Americans from elsewhere in California) and has been previously described in detail [49]. The cohort is mainly comprised of five self-described racial-ethnic populations: Native Hawaiians, Japanese-Americans and Whites from Hawaii, and African-Americans, Japanese-Americans and Latinos from Los Angeles. Between 1993 and 1996, participants entered the MEC by completing a self-administered mail questionnaire that asked detailed information about dietary habits, demographic factors, personal behaviors, history of prior medical conditions, family history of common cancers, and for women, reproductive history and exogenous hormone use. The participants were between the ages 45 and 75 when they entered the cohort.
Incident cancers in the MEC are identified by record linkage to the Hawaii Tumor Registry, the Cancer Surveillance Program for Los Angeles County, and the California State Cancer Registry. These population-based tumor registries participate in the National Cancer Institute's Surveillance, Epidemiology and End Results (SEER) program of cancer registration which is known to have an excellent (98%) case ascertainment. From the registries we also obtained information about stage of disease at diagnosis. Breast cancer cases were classified as "advanced" cases when diagnosed with invasive/non-localized disease (SEER stage ≥ 2) at diagnosis.
Beginning in 1996, blood samples were collected from incident breast cancer cases. At this time, blood collection was also initiated in a random sample of MEC participants to serve as a control pool for genetic analyses. The participation rates for providing a blood sample were ≥ 65% for cases and controls. Demographic characteristics related to socio-economic status and acculturation (e.g. age at cohort entry, education, place of birth, and years living in the United States) were similar among those who provided a blood sample and women in the entire cohort. Eligible breast cancer cases in this study consisted of women with incident breast cancer diagnosed after enrollment in the MEC through April 2002. Controls were women without breast cancer prior to entry into the cohort and without a cancer diagnosis up to April 2002, and were frequency matched to cases by age and ethnicity. Because < 6% of cohort members have moved outside of Hawaii and Los Angeles between enrollment (1993-1996)
Subjects included in the analysis of plasma PRL levels were a random sample of the controls in the case-control panel. A total of 500 postmenopausal women with previously collected biospecimens (100 in each ethnic group) were included. Women reporting hormone therapy use at blood draw were excluded (n = 128), and individuals with PRL levels that were 2.5-fold outside the normal range were excluded (n = 10).

Gene Sequencing
We sequenced the exons and splice-site regions of PRL and PRLR in germline DNA from 95 advanced breast cancer cases (19 of each racial/ethnic group). We used DNA samples from advanced cases to increase the probability of discovering single nucleotide polymorphisms (SNPs) that are biologically relevant to breast cancer. Sequencing was performed using ABI BigDye terminator chemistry on the ABI 3730 DNA Analyzer (Applied Biosystems, Foster City, CA). The PolyPhred program was used to identify polymorphisms with manual review by at least two observers, and all putative coding variants were validated by genotyping in the same panel of advanced cases and in the multiethnic panel (discussed below).

Characterization of Linkage Disequilibrium and Haplotype Patterns
We used a haplotype-based approach to study common variation in PRL and PRLR in the MEC, previously described elsewhere [42]. We selected single nucleotide polymorphisms (SNPs) from both the public (National Center of Biotechnology Information [50]) and private (Celera [51]) databases to construct high density SNP maps that included up to 20 kilobases (kb) upstream of the transcription initiation site and 10 kb downstream of the last exon of each gene, for a total coverage of 59 kb in PRL and 210 kb in PRLR. Block structure was assessed using SNPs with MAF ≥ 10%. Blocks were initially defined following alignment across racial/ethnic groups; borders were characterized by SNPs at the extreme ends of the block in any one ethnic group, except for African-Americans, whose block sizes, as expected, were modestly smaller than the other groups. We tested the suitability of this block definition by evaluating whether SNPs surrounding presumed block borders modified the number or identity of common haplotypes estimated within the blocks; changes in the number of haplotypes and the introduction of recombinant haplotypes would indicate whether SNPs were spanning a potentially important site of historical recombination and guided us in redefining a block boundary.
We genotyped common SNPs (MAF > 5% in at least one racial/ethnic group) at a density of 1 SNP every ~1 kb on average across the locus, all known missense SNPs in public databases, and all missense SNPs newly identified in our sequencing effort. In total, 139 (PRL) and 276 (PRLR) SNPs were selected and genotyped in a multiethnic panel of 349 women in the MEC without a history of cancer (n = 69-70 per racial-ethnic group). This sample size allows > 99% power to detect common haplotypes (≥ 5% frequency) that are shared across all ethnic groups, and about 90% power to detect common ethnic-specific haplotypes. Of these SNPs, 36 (PRL) and 74 (PRLR) were identified as monomorphic and 17 (PRL) and 22 (PRLR) genotyped poorly (SNPs missing genotype data for ≥ 25% of samples or out of Hardy-Weinberg equilibrium in more than one of the populations, p ≤ 0.01). This left 80 (PRL) and 173 (PRLR) SNPs with MAF ≥ 5% in at least one racial-ethnic group to be included in the haplotype analysis.
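The quality-control rule above can be sketched in a few lines. This is a minimal illustration, assuming a 1-degree-of-freedom chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (the text does not say which test was used); the function names and the genotype counts in the example are hypothetical.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 0.0, 1.0
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def snp_fails_qc(counts_by_group, missing_fraction,
                 max_missing=0.25, hwe_alpha=0.01):
    """Apply the stated rule: fail if >= 25% of genotypes are missing, or if
    the SNP departs from HWE (p <= 0.01) in more than one population."""
    if missing_fraction >= max_missing:
        return True
    n_out = sum(1 for counts in counts_by_group.values()
                if hwe_chi2_pvalue(*counts)[1] <= hwe_alpha)
    return n_out > 1

# Hypothetical counts (AA, AB, BB) for one SNP in two populations
example = {"group_1": (49, 42, 9), "group_2": (50, 0, 50)}
print(snp_fails_qc(example, missing_fraction=0.02))
```

With only 69-70 subjects per group, an exact test may be preferable to the chi-square approximation when the minor allele is rare; the chi-square version is used here only for brevity.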
The |D'| and $r^2$ statistics were used to assess pairwise linkage disequilibrium (LD) between the common SNPs. Within regions of strong LD [43], haplotype frequency estimates were constructed from the genotype data in the multiethnic panel (one ethnicity at a time) using the expectation-maximization (E-M) algorithm of Excoffier and Slatkin [52]. The squared correlation ($R_h^2$) between the true haplotypes (h) and their estimates was then calculated as described by Stram et al. [44]. "Tagging" SNPs (tagSNPs) for the case-control study were then chosen by finding the minimum set of SNPs for each ethnic group that would have $R_h^2$ > 0.7 for all common haplotypes with an estimated frequency of ≥ 5%. TagSNP selection was performed using the tagSNPs program [53].
Multi-marker and pairwise $R^2$ values between tagSNPs and unmeasured SNPs were calculated using the Tagger algorithm [40] in Haploview and the slightly more general method given in Stram 2004 [54].
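For readers unfamiliar with the pairwise LD statistics used in this section, the standard definitions of |D'| and r² for two biallelic SNPs can be sketched as follows. The function name and the example frequencies are illustrative and are not taken from the study data.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (|D'|, r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2
    p_a, p_b: frequencies of allele A (SNP 1) and allele B (SNP 2)
    """
    d = p_ab - p_a * p_b                      # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2

# Allele A (10%) occurs only on haplotypes that carry allele B (30%):
# no recombinant haplotype is seen, so |D'| = 1, yet r^2 is well below 1
# because the allele frequencies differ.
d_prime, r2 = pairwise_ld(p_ab=0.10, p_a=0.10, p_b=0.30)
print(f"|D'| = {d_prime:.2f}, r^2 = {r2:.2f}")
```

The example illustrates why both statistics are reported: |D'| reflects historical recombination and is used to define blocks, whereas r² measures how well one SNP predicts another, which is the quantity relevant to tagging.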

Genotyping
DNA for all subjects was extracted from white blood cell fractions using the Qiagen Blood Kit (Qiagen, Chatsworth, CA). SNP genotyping in the multiethnic panel was performed using the Sequenom (Sequenom Inc, San Diego, CA) platform. TagSNP genotyping in the breast cancer cases and controls was performed by the 5' nuclease TaqMan allelic discrimination assay (ABI7900) and the Illumina (Illumina Inc, San Diego, CA) platforms. Replicate blinded quality control samples (5%) were included to assess reproducibility of the genotyping procedure; the concordance was ≥ 99.7% for all platforms.

Plasma Prolactin Measurements
Prolactin was measured using a double-antibody immunoradiometric assay from Diagnostic System Laboratories (Webster, Texas) in hormone analysis laboratories at the International Agency for Research on Cancer. The assay was performed in multiple batches with equal numbers of each population in each batch. The theoretical sensitivity (as stated by the manufacturer) is 0.1 ng/mL. Mean intra- and inter-batch coefficients of variation were 5.4% and 12.8%, respectively, using 25-microliter sample volumes. Plasma PRL levels have been shown to be stable in whole blood for 24-48 hours [55]. In the MEC, time from blood collection to processing was no more than six hours.

Statistical Analysis
Haplotype frequencies among breast cancer cases and controls were estimated using the tagSNPs selected to distinguish the common haplotypes (≥ 5% frequency) for each ethnic group in the multiethnic panel as described [56]. The E-M algorithm was used to estimate haplotype frequencies for the tagSNPs in the combined dataset (cases + controls) and individual estimates of haplotype count (expected number of copies of each haplotype carried by each individual) from the E-M were outputted to an external file and merged with case-control status. These estimates were then used as explanatory variables in logistic regression models.
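The E-M step described above can be illustrated with a compact sketch for unphased genotypes. This is not the tagSNPs program used in the study; it is a simplified version that assumes biallelic SNPs coded as 0, 1, or 2 copies of one allele, no missing data, and Hardy-Weinberg proportions, and the function name `em_haplotypes` is hypothetical.

```python
from itertools import product

def em_haplotypes(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies and per-person expected haplotype counts.

    genotypes: list of tuples, one per person, giving allele counts (0/1/2)
    at each SNP. Returns (freqs, dosages), where dosages[i][h] is the expected
    number of copies (0 to 2) of haplotype h carried by person i.
    """
    n_snps = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_snps))
    freqs = {h: 1.0 / len(haps) for h in haps}   # uniform starting values

    # Ordered haplotype pairs compatible with each person's genotype
    pairs_by_person = [
        [(h1, h2) for h1 in haps for h2 in haps
         if all(a + b == g for a, b, g in zip(h1, h2, geno))]
        for geno in genotypes
    ]

    dosages = []
    for _ in range(n_iter):
        # E-step: posterior weight of each compatible pair under current freqs
        dosages = []
        totals = dict.fromkeys(haps, 0.0)
        for pairs in pairs_by_person:
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            norm = sum(weights)
            person = dict.fromkeys(haps, 0.0)
            for (h1, h2), w in zip(pairs, weights):
                person[h1] += w / norm
                person[h2] += w / norm
            dosages.append(person)
            for h in haps:
                totals[h] += person[h]
        # M-step: new frequencies are expected counts over 2N chromosomes
        new_freqs = {h: totals[h] / (2 * len(genotypes)) for h in haps}
        converged = max(abs(new_freqs[h] - freqs[h]) for h in haps) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs, dosages

# Toy data: two SNPs; the last person is a double heterozygote (phase ambiguous)
toy = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)]
freqs, dosages = em_haplotypes(toy)
print({h: round(f, 3) for h, f in freqs.items()})
print({h: round(d, 3) for h, d in dosages[-1].items()})
```

The per-person expected counts (`dosages`) play the role of the haplotype count estimates that are merged with case-control status and entered as explanatory variables in the logistic regression models.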
As shown empirically [57], the majority of common variation is shared across racial and ethnic populations [57,58], while the biological effects on risk for the majority of common disease-associated alleles have also been shown to be consistent across populations [59]. These observations justify pooling genetic data across racial and ethnic populations if no heterogeneity is noted. To assess the consistency of genetic effects across populations, we first tested for heterogeneity across racial-ethnic groups prior to pooling genetic data. These tests were performed using a likelihood ratio test following the inclusion of an interaction term between each haplotype (or SNP) and ethnicity in the logistic regression model. Pooled odds ratios (ORs) and 95% confidence intervals (CIs) were then estimated for each haplotype and tagSNP using unconditional logistic regression adjusted for age and ethnicity. Because of the large number of comparisons being performed, we used a relatively stringent type I error criterion (p < 0.0005) for evaluating the significance of any single association. (This "corrects" for performing approximately 100 independent tests, close to the number of tagSNPs genotyped for both genes.)
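The p < 0.0005 criterion is what a Bonferroni correction yields for 100 tests at a family-wise error rate of 0.05. The short sketch below makes the arithmetic explicit and applies the threshold to the p-values highlighted in this paper; the 0.05 family-wise level is our reading of the text rather than something it states outright, and the dictionary labels are only for illustration.

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return family_alpha / n_tests

threshold = bonferroni_threshold(0.05, 100)

# Nominal p-values reported in the text
reported = {
    "PRL SNP34 vs. breast cancer risk": 0.049,
    "PRLR SNP49 vs. breast cancer risk": 0.032,
    "PRL SNP44 vs. plasma PRL level": 0.002,
}
for label, p in reported.items():
    verdict = "significant" if p < threshold else "not significant"
    print(f"{label}: p = {p} -> {verdict} at p < {threshold:.4f}")
```

Because tagSNPs within a block are correlated, the effective number of independent tests is likely smaller than the number of SNPs, so this threshold is conservative.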
We used the methods described by Zaykin et al. to perform global tests of association between haplotypes and cancer risk within each LD block and to estimate haplotype-specific odds ratios [60]. ORs were estimated for each common haplotype using the most common haplotype as the reference group and for each SNP using the more common genotype as the reference group. We also performed the haplotype analyses using all other haplotypes as the reference group and performed individual SNP analyses for co-dominant effects, both of which yielded similar results (data not shown). Because further adjustment for study area (Hawaii or Los Angeles) and the established breast cancer risk factors (first-degree family history of breast cancer, body mass index, parity, age at first birth, age at menarche, type and age at menopause, use of hormone replacement therapy, and alcohol consumption) did not impact our results, we only present results from the age- and ethnicity-adjusted models.
We also calculated the effect of SNPs and estimated haplotypes on plasma PRL levels using generalized linear models adjusted for continuous (age, anthropometry) and categorical (reproductive history) variables. The hormone measurements were log-transformed to best approximate a normal distribution. These values were transformed back to the original physiologic scale for presentation. Means are presented as least-squares means (LS means). For all analyses, dominant, co-dominant, and recessive models were fitted.
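The log transformation and back-transformation can be illustrated with a small sketch. It assumes natural logarithms (the base is not stated in the text) and omits the covariate adjustment used for the LS means; the function names and sample values are hypothetical.

```python
import math

def geometric_mean(values):
    """Mean on the log scale, transformed back to the original units."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))

def percent_change(log_difference):
    """Convert a difference in mean log levels into a percent change."""
    return (math.exp(log_difference) - 1.0) * 100.0

# Hypothetical PRL levels (ng/mL) for two genotype groups
major_homozygotes = [6.0, 8.0, 9.0, 12.0]
minor_homozygotes = [9.0, 12.0, 13.5, 18.0]

gm_major = geometric_mean(major_homozygotes)
gm_minor = geometric_mean(minor_homozygotes)
diff = math.log(gm_minor) - math.log(gm_major)
print(f"geometric means: {gm_major:.2f} vs {gm_minor:.2f} ng/mL")
print(f"percent change: {percent_change(diff):.0f}%")
```

A back-transformed mean of log values is a geometric mean, so a difference between groups on the log scale is naturally reported as a ratio or percent change, as with the approximately 50% increase described for SNP44.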
The haplotype frequencies and counts were estimated using the tagSNPs program [53]. All other statistical analyses were conducted using SAS version 9.1 (SAS Institute, Cary, NC).