Original PDF Flash format An-Introduction-to-Genetic-Data-Analysis-Using-SAS/Genetics  


An Introduction To Genetic Data Analysis Using SAS/Genetics

An Introduction to Genetic Data Analysis
Using SAS/Genetics
Wendy Czika, Xiang Yu, and Russell D. Wolfinger
SAS Institute Inc.
Cary, North Carolina, USA
librium (HWE); and measures of linkage disequilib-
rium (LD) between pairs of markers.
These types
Abstract
of analyses can be accomplished with the ALLELE
and HAPLOTYPE procedures. Once these marker
The Human Genome Project and other recent
features have been examined, marker-disease as-
genome research efforts have intensified the already
sociation tests can be performed on either random
vital relationship between the fields of genetics and
samples from populations of unrelated individuals
statistics. One important application of the analysis
who are either affected with a disease or unaf-
of genetic data is locating genes that affect complex
fected, or on families with an affected child meeting
traits, human diseases in particular. While statistical
certain requirements. Several different case-control
procedures already in SAS ® are capable of certain
and family-based tests can be implemented by the
analyses, the unique nature of these data demands
CASECONTROL and FAMILY procedures, respec-
customized tools to fully accommodate the genetic
tively. Smoothing and multiple testing adjustments are
dissection of a trait or disease. SAS/Genetics offers
then available to be applied to the p-values created by
new statistical procedures specifically tailored for ge-
these tests in order to improve power and/or control
netic data stored in a commonly used format. This pa-
type I error using the PSMOOTH procedure.
per discusses these methods and illustrates the fea-
tures of the software with an example.
Theory
Introduction
Examining Marker Properties
The methods described in this section are discussed
Genetic markers, while often nonfunctional, play a
in detail by Weir (1996) unless otherwise cited. Some
key role in the mapping of human disease genes.
of the most basic descriptive statistics for genetic
Single nucleotide polymorphisms (SNPs) in particu-
markers are the sample allele and genotype frequen-
lar are facilitating the creation of high-resolution ge-
cies. A marker locus M may have a series of alleles
netic maps when used in linkage and association
M
studies.
SAS/Genetics provides a framework and
u, u = 1, ..., k, which produce genotypes of the form
M
the tools for analyzing markers and their relationship
u/Mv .
In a sample of n individuals, the count of
individuals with genotype M
with a dichotomous trait (affection status of a disease
u/Mv or Mv /Mu is nuv .
The number n
e.g.). While many of these methods exploit the bial-
u of copies of allele Mu can then be
found directly by summation: n
n
lelic nature of SNPs in their design, multiallelic adap-
u = 2nuu +
v>u
uv .
The sample allele and genotype frequencies are writ-
tations are available when studying markers such as
ten respectively as ˜
p
microsatellites with more than two alleles.
u = nu/(2n) and ˜
Puv = nuv/n.
The variance of the sample allele frequency ˜
pu is cal-
Before examining marker-disease associations, ini-
culated using the multinomial distribution of genotype
tial calculations characterizing features of the mark-
counts.
ers should be performed. Properties of interest about
The variance of the sample genotype frequency ˜
P
each marker include measures of marker informative-
uv
is not generally calculated; instead, a maximum likeli-
ness, which are useful for determining which mark-
hood estimate (MLE) of the Hardy-Weinberg disequi-
ers may be most valuable for association or link-
librium (HWD) coefficient D
age studies; estimates of allele, genotype, and hap-
uv for alleles Mu and Mv
lotype frequencies; tests for Hardy-Weinberg equi-
1

is calculated as
k
Het
=
1 −
˜
Puu
˜
u=1
ˆ
P
D
uv − ˜
pu ˜
pv,
u = v
uv =
˜
p
˜
k
u ˜
pv − 1 P
2
uv ,
u = v
Div
=
1 −
˜
p2u
and its variance can be estimated using Fisher’s ap-
u=1
proximate variance formula. Under ideal population
These measures give an indication of the amount of
conditions, the two alleles an individual receives, one
heterozygosity in the sample at each marker. Markers
from each parent, are independent so that Duv = 0.
with higher values of these measures tend to be more
This statement about allelic independence within loci
informative and of more use in association and link-
is called Hardy-Weinberg equilibrium (HWE). Forces
age studies because there is more allelic variation.
such as selection, mutation, and migration in a pop-
ulation or nonrandom mating can cause departures
Often, tests and measures based on haplotypes can
from HWE.
be more powerful than those using genotypes alone.
The set of genetic material an individual receives from
Two methods are available for testing a marker for
each parent contains an allele at every locus, and a
HWE, both which can accommodate any number of
set of alleles on one chromosome is called a haplo-
alleles. These methods are testing the hypothesis
type. However, it is often the case that haplotypes
that Puu = p2 and P
u
uv = 2pupv , u = v, that is Duv = 0,
are unknown. You may have the genotypic informa-
for all u, v = 1, ..., k.
tion on an individual at several markers, for example
The chi-square goodness-of-fit test can be used to
you know an individual has genotype A/a at locus 1
test markers for HWE:
and B/b at locus 2. But you do not know which set
of haplotypes that individual received from his or her
(nuu − n˜
p2 )2
parents: A-B and a-b, or A-b and a-B. Clearly, as more
X2
=
u
T

p2
loci are examined this problem becomes more com-
u
u
plicated. Haplotype frequencies then must be esti-
(nuv − 2n˜
pu ˜
pv)2
+
mated when haplotypes are not observed. For pairs
2n˜
pu ˜
pv
u
v>u
of biallelic markers, the MLE of a haplotype frequency
can be obtained analytically by solving a cubic equa-
This chi-square statistic has k(k − 1)/2 degrees of
tion. For multilocus haplotypes or markers with more
freedom where k is the number of alleles at the
than two alleles, an analytic solution is not tractable.
marker locus.
The expectation-maximization (EM) algorithm, an it-
erative process, can be used to arrive at MLEs of
The exact test given by Guo and Thompson (1992)
haplotype frequencies, assuming HWE (Excoffier and
is based on the conditional probability of genotype
Slatkin 1995; Hawley and Kidd 1995; Long, Williams,
counts given allelic counts and the hypothesis of al-
and Urbanek 1995).
lelic independence. The test statistic is
In order to perform the EM algorithm, haplotype fre-
n! 2h
nu!
quencies must first be initialized to some starting val-
T =
u
(2n)!
n
ues. These values can be generated randomly, uni-
u,v
uv !
formly assigned, or calculated in a number of different
where h =
n
ways. For a sample of n individuals, suppose the ith
u
v>u
uv is the number of heterozy-
gous individuals. Significance levels are calculated
individual has genotype Gi. Genotype Gi has prob-
by the permutation procedure. The 2n alleles are ran-
ability Pi in the population, which is calculated in the
domly permuted to form new sets of n genotypes, and
expectation step (E-step) as
the proportion of times the value of the statistic from
the permuted data exceeds the value from the actual
Pi =
fjf ci
j
data is the estimated significance level.
j∈Hi
Three measures of marker informativeness are the
where fj is the frequency in the population of the
polymorphism information content (PIC) (Botstein et
jth possible haplotype hj and f ci is the frequency of
j
al. 1980), (observed) heterozygosity, and allelic diver-
the haplotype hci that together with h
j
j constitutes the
sity (expected heterozygosity). They are calculated
genotype Gi. Now the log likelihood can be evaluated
as follows:
as
k
k−1
k
n
PIC
=
1 −
˜
p2 −

p2 ˜
p2
u
u v
log L =
Pi
u=1
u=1 v=u+1
i=1
2

For each iteration after the first, the change in log
frequency of allele Nv and ˜
q2 the frequency for “not
likelihood from the previous iteration(s) is measured
Nv.” All measures have the same numerator, an esti-
against a convergence criterion to determine whether
mate of D = p11p22 −p12p21, which is equivalent to the
the process needs to continue, or the maximum like-
LD coefficient for biallelic markers. The five measures
lihood estimates have been reached. If the process
are
must continue, the maximization step (M-step) is used
as follows to create the haplotype frequencies for the
r
=
D/(p1p2q1q2)1/2
next iteration:
D
=
D/Dmax
n
1
m
δ
=
D/(q
ij fj f ci
1p22)
j
fj = 2n
P
d
=
D/(q
i
1q2)
i=1
Q
=
D/(p11p22 + p12p21)
where mij is the number of times haplotype hj occurs
in genotype Gi. The log likelihood will increase with
where Dmax = min(p1q2, q1p2) for D > 0 and Dmax =
each iteration of the EM algorithm until the maximum
min(p1q1, q2p2) otherwise. Estimates of measures are
is reached. This maximum may be a local one, not a
calculated by replacing parameters with their appro-
global maximum, and thus the EM algorithm can be
priate estimates.
performed several times using different sets of start-
ing values for the haplotype frequencies; the best log
Exploring Marker-Trait Relationships
likelihood will likely be the global maximum.
There are several different case-control tests that can
Haplotype frequencies, either observed or estimated,
be used to test for association between a genetic
can be used in a variety of ways to glean more infor-
marker and a binary trait. With cases (affected in-
mation from your marker data. The primary applica-
dividuals) and controls (unaffected individuals) rep-
tion is for examining linkage disequilibrium between
resenting the two rows, contingency tables can be
markers. Haplotype frequencies can be expressed as
created using genotype or allele categories for the
columns. Tests for differences in these frequencies
puv = pupv + Duv for haplotype Mu-Nv at loci M and
N.
between case and control groups can be performed
Duv is the linkage, or gametic, disequilibrium (LD)
coefficient. When
using the usual Pearson chi-square. Another type of
Duv = 0 (the haplotype frequency
is the product of the individual allele frequencies), the
test that can be performed is Armitage’s trend test
markers are said to be in linkage equilibrium, that
(1955), which uses the genotype contingency table
is they are transmitted independently. Though there
in a different manner than Pearson’s test; this is a
are many other factors that may affect disequilibrium,
test specifically for additive allele effects, whereas the
there is a general expectation that the amount of link-
Pearson chi-square based on the genotype contin-
age disequilibrium is inversely related to the distance
gency table is testing for both additive and dominance
between the two loci.
effects.
Duv = 0 may be an indication
that two loci are far apart on a chromosome, or even
located on two different chromosomes. Chi-square
Table 1.
Genotype Distribution for Case-Control Sample
approximations and permutation exact tests are avail-
able for testing whether Duv = 0. When examining
Number of M1 alleles
LD across more than two marker loci, a likelihood ra-
0
1
2
Total
tio test can be used for testing whether any linkage
Case
r0
r1
r2
R
disequilibrium exists among the loci.
Control
s0
s1
s2
S
Total
n
There are also five commonly used linkage disequi-
0
n1
n2
N
librium measures that can be calculated for each pair
of alleles M
Using Table 1, the following chi-square statistics can
u and Nv , described by Devlin and Risch
(1995): the correlation coefficient r, the population at-
be created:
tributable risk δ, Lewontin’s D , the proportional dif-
ference d,and Yule’s Q. Since these measures are
N [N (r1 + 2r2) − R(n1 + 2n2)]2
X2
=
designed for biallelic markers, the measures are cal-
T
R(N − R)[N (n1 + 2n2) − (n1 + 2n2)2]
culated for each allele at locus M with each allele at
2N [N (r1 + 2r2) − R(n1 + 2n2)]2
locus N, where all other alleles at each loci are com-
X2
=
A
R(N − R)[2N (n
bined together to represent one allele. Thus for each
1 + 2n2) − (n1 + 2n2)2]
2
allele Mu in turn, ˜
p1 will be used as the frequency
(N ri − Rni)2
(N si − Sni)2
X2
=
+
of allele M
G
u and ˜
p2 represents the frequency of “not
N Rni
N Sni
i=0
Mu”; similarly for each Nv in turn, ˜
q1 represents the
3

for the trend, allele, and genotype tests, which have 1,
mial is used to form the Z-score statistic
1, and 2 df, respectively (Sasieni 1997). Extensions
for multiallelic markers are straightforward for the al-
b − b+c
2
lele and genotype tests: the Pearson chi-square test
Z =
b+c
is carried out on the contingency table containing one
4
column for each allele or genotype category. The de-
grees of freedom for a 2xm table is (m − 1). Slager
where b is the number of M1 alleles in all affected chil-
and Schaid (2001) offer a multiallelic version of the
dren from heterozygous parents and c is the number
trend test to accommodate markers with more than
of M2 alleles in affected children from heterozygous
two alleles; this test still has the same degrees of free-
parents.
dom as the allele test.
The z score procedure given by Spielman and Ewens
Contingency tables can also be created using haplo-
(1998) is used to calculate p-values for the S-TDT.
types as columns. Since the number of haplotypes,
This test can be applied to families where there is
and thus degrees of freedom, can grow quite large
at least one affected sibling and one unaffected sib-
for multilocus haplotypes, an alternative approach is
ling, and not all siblings have the same genotype.
to construct for each haplotype one contingency table
The z score, whose two-sided p-value is approxi-
containing a column for that haplotype and a single
mated using the normal distribution, is calculated as

column containing all other haplotypes. Directly ob-
z = (Y − A)/ V . Y represents the total observed
served or estimated haplotype counts can be used
number of M1 alleles in the affected siblings. For t to-
to calculate the Pearson statistic that has an asymp-
tal siblings in the family, a affected and u unaffected,
totic χ2 distribution. Additionally, a likelihood ratio test
1
and r that are M1M1 and s that are M1M2, summing
based on the log likelihoods from the EM algorithm
over families gives
can be performed over all haplotypes at the markers
of interest simultaneously. Such tests can be more
A
=
(2r + s)a/t
powerful than the single-marker association tests.
au[4r(t − r − s) + s(t − s)]
The disadvantage with using case-control tests is that
V
=
t2(t − 1)
differences in allele, genotype, or haplotype frequen-
cies between groups of cases and controls due to
causes other than linkage may be found to be sig-
as the expected value and variance of Y , respectively.
nificant. That is, an association between the trait and
marker may not be an indication of the marker’s prox-
The S-TDT and TDT can be combined for situations
imity to the trait gene. Family-based tests offer a so-
when some, but not all, of the parents have been
lution to this problem: they test for linkage in the pres-
genotyped: the TDT is applied to all families that
ence of association. These tests are performed using
meet its requirements. The S-TDT is then applied to
either family trios (parents and at least one affected
the remaining families that meet the requirements de-
child) or sib pairs consisting of one affected and one
scribed in the preceding paragraph. Using the nota-
unaffected sibling.
tion already given for these tests, the z score for the
combined test can then be written as
For all tests, it is assumed that the marker has two
alleles, M1 and M2. Extensions to multiallelic markers
(Y + b) − (A + b+c )
are made by performing the tests on each allele in
z =
2
turn, with the current allele being considered to be
V + b+c
4
M1 and all other alleles considered to be M2, but are
not discussed here.
The SDT (Horvath and Laird 1998) is a sign test used
The TDT (Spielman, McGinnis, and Ewens 1993) an-
on discordant sibling pairs. As with the S-TDT, one af-
alyzes families where both parents have been geno-
fected sibling and one unaffected sibling are required
typed for the marker and at least one is heterozygous.
to be in each family, but unlike the S-TDT, the SDT
The TDT tests for equality between the proportion of
remains a valid test of linkage and association when
times a heterozygous parent transmits the M
the sibship is larger.
1 allele
to an affected child and the proportion of times a het-
Continuing the notation from the previous tests, for a
erozygous parent transmits the M2 allele to an af-
affected siblings in a family and u unaffected siblings
fected child. The normal approximation to the bino-
in a family, define for each family in the data the av-
erage number of M1 alleles among affected siblings
and unaffected siblings respectively as ma = Y /a
and mu = [(2r + s) − Y ] /u. Then for each family,
4

d = ma − mu and summing over each family gives
analysis of simulated data from GAW12 (Wijsman et
S =
sgn(d), where sgn(d)=1 for d > 0, 0 for d = 0,
al. 2001) is presented. To perform the analysis, you
and -1 for d < 0. The SDT statistic is then defined
need to create data sets from three separate data
as T = S2/W where W =
(sgn(d))2. This statistic
files. The first contains the marker genotype data;
has an asymptotic χ2 distribution, and this is used to
this file is named Genmrk19.1 and contains individ-
1
obtain p-values for the SDT. This sibship test can also
uals from a replicate of the general population geno-
be combined with the TDT, creating a test that can
typed at 105 markers from chromosome 19. Here are
potentially use more of the data (Horvath and Laird
the first 10 rows and several columns of the data:
1998).
1
1
4/ 7
1/ 5
7/ 1
1/ 1
2/ 3 ...
The RC-TDT (Knapp 1999) takes the combined S-
1
2
8/ 7
2/ 7
4/ 7
4/ 8
3/ 3 ...
TDT a step further by reconstructing missing parental
1
3
9/ 2
5/ 1
6/ 4
3/ 5
3/ 3 ...
1
4
2/ 3
1/ 6
3/ 3
1/ 3
4/ 4 ...
genotypes when possible in order to make use of
1
5
3/ 4
7/ 3
7/ 6
5/ 3
1/ 2 ...
more families. The RC-TDT can be applied to fami-
1
6
7/ 8
4/ 1
7/ 3
8/ 3
4/ 4 ...
1
7
3/ 9
5/ 4
6/ 3
5/ 5
3/ 2 ...
lies with at least one affected child that meet one of
1
8
1/ 8
5/ 1
8/ 8
4/ 3
3/ 4 ...
the following conditions:
1
9
3/ 3
5/ 1
7/ 7
3/ 9
3/ 3 ...
1
10
8/ 4
4/ 4
2/ 7
3/ 2
1/ 4 ...
• Both parents are typed with at least one het-
The first two columns contain the pedigree and indi-
erozygous for M1.
vidual IDs, respectively. They are followed by geno-
• One parent is typed, the other can be recon-
types at each of 105 markers (only 5 shown here),
structed, and at least one parent is heterozy-
with the two alleles comprising each marker geno-
gous for M
type separated by a “/”. The actual data set contains
1.

1497 rows, each representing an individual. These
Both parents’ genotypes are missing but can be
data can easily be read into a SAS data set using the
reconstructed, and at least one parent is het-
following code:
erozygous for M1.
• At least one parental genotype is missing and
data markers;
cannot be reconstructed, but the conditions for
infile ’Genmrk19.1’ delimiter=’/ ’ lrecl=640;
input ped id a1-a210;

the S-TDT are met
run;
As with the S-TDT, a z score is created using the
This data set needs to be combined with the phe-
statistic Y , but Knapp calculates a different expected
notype information, contained in the file Genpheno.1.
value e and variance v of Y , which take into account
Here are the first 15 rows of this file:
the bias created by the genotype reconstruction, to

form the z score over all families z = (Y − e)/ v.
1
1
0
0 1 D
1
2
0
0 2 D
Adjusting p-Values
1
3
0
0 2 D
1
4
0
0 1 D
1
5
0
0 1 D
In association studies, there can potentially be thou-
1
6
0
0 2 D
sands of markers being analyzed simultaneously. In
1
7
0
0 2 D
1
8
0
0 2 D
addition to the multiple testing issues that must be
1
9
0
0 1 D
taken into consideration, there is also the fact that
1
10
0
0 2 A 37 320
37.78 0 U
16.15 ...
1
11
0
0 1 A 40
2
269.39 0 U
10.41 ...
many of these tests are correlated due to LD. This
1
12
0
0 1 A 40
3
15.28 0 A 40 12.86 ...
correlation can be used as a smoothing tool to help
1
13
0
0 1 A 49
4
14.16 0 U
15.63 ...
1
14
0
0 1 A 64
5
33.17 1 U
17.52 ...
eliminate false-positives that are found. Various meth-
1
15
0
0 1 A 65
6
45.96 1 U
21.63 ...
ods can be used to combine p-values from neighbor-
ing markers into a new p-value for each marker by
The first two columns of these data are the same as
implementing the sliding window approach described
those in the marker data, the ID variables. They are
by Zaykin et al. (2002). Since this method does not
followed by two columns of parental IDs identifying the
change the number of tests being performed, multi-
father and mother of the individual. The next column
ple testing corrections such as Bonferroni and Sidak
represents gender, and the sixth column contains a
(1967) can then be applied to the new set of p-values.
“D” for deceased individuals and “A” for alive individ-
uals. Note that more columns follow only for living in-
Data Analysis
dividuals. Of these columns, only the fifth column fol-
lowing the living status column is needed in this anal-
To help demonstrate the statistical methods that are
ysis; this column represents the disease status of the
available in the SAS/Genetics software, a sample
individual, “A” for individuals affected with the disease
5

and “U” for unaffected individuals. These data can be
run;
read into a SAS data set and then combined with the
proc print data=ld noobs;
marker data as follows:
run;
data pheno;
Since there are 105 markers in this data set, the
infile ’Genpheno.1’ missover;
VAR statement contains 210 variables. By default,
input ped id father mother gender living $ age
hh ef1 ef2 status $ age_o 45-46 q1-q5;
three tables will be created.
The first is a marker
keep ped id father mother status;
summary table, containing measures of marker in-
run;
formativeness: the polymorphism information content
proc sort data=markers;
(PIC), heterozygosity, and allelic diversity; the num-
by ped id;
run;
ber of alleles and number of individuals typed at each
marker; and statistics for the HWE test. The test for
data markertrt;
merge pheno markers;
HWE is performed by default using the asymptotic chi-
by ped id;
square test; in this analysis, the exact version is per-
run;
formed as well by permuting the marker alleles, with
the number of permutations indicated by the EXACT=
Additionally, you can use the marker names provided.
option. Here is the “Marker Summary” table, shown
Here are the first 10 of 105 rows of the file Map19:
for the first 5 markers only:
D19G001
0.70
D19G002
1.42
The ALLELE Procedure
D19G003
2.51
Marker Summary
D19G004
4.50
D19G005
4.87
Number
Number
D19G006
6.24
of
of
Hetero-
Allelic
Locus
Indiv
Alleles
PIC
zygosity
Diversity
D19G007
6.46
D19G008
7.09
D19G001
165
9
0.8235
0.8606
0.8430
D19G009
10.85
D19G002
165
7
0.7873
0.8061
0.8137
D19G003
165
8
0.8011
0.8242
0.8233
D19G010
11.51
D19G004
165
9
0.7986
0.8242
0.8211
D19G005
165
4
0.5956
0.7273
0.6560
Marker Summary
which can be read into a SAS data set using the code
--------------Test for HWE--------------
Chi-
Pr >
Prob
data map;
Locus
Square
DF
ChiSq
Exact
infile ’Map19’;
D19G001
21.3950
36
0.9744
0.9464
input name $ location;
D19G002
27.4028
21
0.1579
0.1941
run;
D19G003
23.1209
28
0.7270
0.8453
D19G004
23.5607
36
0.9451
0.9525
D19G005
7.7220
6
0.2592
0.3280
This data set can be specified in the NDATA= option
of the ALLELE, HAPLOTYPE, CASECONTROL, and
The next two tables are the allele and genotype fre-
FAMILY procedures to provide names for the mark-
quency tables.
In addition to the frequency esti-
ers in the displayed output and output data sets these
mates themselves, these tables contain an estimate
procedures create. These four procedures require a
of the standard deviation of the frequencies. When
VAR statement that identifies the columns of marker
the BOOTSTRAP= option is included in the PROC
alleles to analyze, with the two columns of alleles for
ALLELE statement, bootstrap confidence intervals for
each marker listed consecutively.
the frequency estimates are formed using the num-
For the ALLELE, HAPLOTYPE, and CASECONTROL
ber of bootstrap samples specified in the option and
procedures, which assume individuals have been ran-
a confidence level of 0.05 by default, or the level that
domly sampled from the population, a subset of the
is provided in the ALPHA= option. These tables have
data containing only founders (identified by having the
been suppressed in this analysis with the NOFREQ
value 0 for each of their parental IDs) who are living
option.
are used.
An output data set is created with the OUTSTAT=
Examining Marker Properties
option in the PROC ALLELE statement. This data
set contains the statistics from the HWE tests in the
The ALLELE procedure shown in the following code
rows where Locus1 and Locus2 are the same, and
calculates some basic summary statistics for each
statistics for a test for LD between pairs of markers
marker included in the analysis.
in the rows where these two variables differ. Since
the HAPLO=EST option is included in this code, the
LD test and measures are calculated using maximum
proc allele data=founders ndata=map outstat=ld nofreq
likelihood estimates of the two-locus haplotype fre-
exact=10000 haplo=est corrcoeff dprime;
var a1-a210;
quencies for each pair of alleles at the two loci. Only
6

the asymptotic chi-square test is used because there
in the following section.
The CUTOFF= option re-
is no exact test available when haplotype frequencies
quests that only haplotypes with an estimated fre-
are estimated. The following SAS output shows a
quency greater than the number specified be dis-
subset of this output data set:
played. As shown in the following table, out of more
than 18,000 possible haplotypes, only 12 have an es-
timated frequency greater than 0.01:
OUTSTAT= Data Set from PROC ALLELE
Locus1
Locus2
NIndiv
Test
ChiSq
DF
ProbChi
D19G001
D19G001
165
HWE
21.3950
36
0.97441
The HAPLOTYPE Procedure
D19G001
D19G002
165
LD
84.4138
48
0.00091
D19G001
D19G003
165
LD
68.0223
56
0.13017
Haplotype Frequencies
D19G001
D19G004
165
LD
60.9314
64
0.58569
D19G001
D19G005
165
LD
34.6481
24
0.07379
Standard
95% Confidence
D19G002
D19G002
165
HWE
27.4028
21
0.15791
Number
Haplotype
Freq
Error
Limits
D19G002
D19G003
165
LD
49.8788
42
0.18866
D19G002
D19G004
165
LD
54.4225
48
0.24330
1
2-1-7-3-4
0.01212
0.00603
0.00030
0.02395
D19G002
D19G005
165
LD
24.1505
18
0.15016
2
2-1-7-8-4
0.01212
0.00603
0.00030
0.02395
D19G003
D19G003
165
HWE
23.1209
28
0.72696
3
5-5-7-9-3
0.01212
0.00603
0.00030
0.02395
4
7-4-6-4-3
0.01212
0.00603
0.00030
0.02395
5
8-1-8-3-4
0.01212
0.00603
0.00030
0.02395
6
8-3-3-9-4
0.01479
0.00665
0.00174
0.02783
7
8-3-7-3-3
0.01515
0.00673
0.00195
0.02835
The five LD measures previously described are of-
8
8-5-3-3-3
0.01818
0.00737
0.00375
0.03262
9
9-4-6-3-3
0.01212
0.00603
0.00030
0.02395
fered by PROC ALLELE and calculated for each
10
9-5-4-3-3
0.02121
0.00794
0.00564
0.03678
11
9-5-7-6-3
0.01212
0.00603
0.00030
0.02395
pair of alleles at two loci when requested. The ta-
12
9-6-4-3-4
0.01212
0.00603
0.00030
0.02394
ble reporting these measures is only included in the
output when one or more of these measures have
The OUT= data set provides for each possible hap-
been specified. Below is a subset of the “Linkage
lotype pair within an individual given the individual’s
Disequilibrium Measures” table that displays the es-
genotype, the probability that the haplotype pair com-
timated haplotype frequencies, LD coeffiecients, and
prises that genotype. Here, the first 10 rows of this
the two requested measures, the correlation coeffi-
output data set are shown:
cient r and Lewontin’s D .
OUT= Data Set from PROC HAPLOTYPE
The ALLELE Procedure
ID
a1
a2
a3
a4
a5
a6
a7
a8
a9
a10
HAPLOTYPE1
HAPLOTYPE2
PROB
Linkage Disequilibrium Measures
1
8
4
4
4
2
7
3
2
1
4
4-4-7-2-1
8-4-2-3-4
1.00000
LD
Corr
Lewontin’s
2
5
9
3
5
3
4
2
3
4
3
5-3-3-2-4
9-5-4-3-3
1.00000
Locus1
Locus2
Haplotype
Frequency
Coeff
Coeff
D’
3
8
2
5
1
6
3
3
5
3
4
2-1-6-5-4
8-5-3-3-3
1.00000
4
7
8
5
3
8
4
5
3
3
4
7-5-8-3-3
8-3-4-5-4
1.00000
D19G071
D19G072
1-1
0.0360
0.008
0.064
0.205
5
9
2
2
5
7
6
9
3
2
4
2-2-6-9-2
9-5-7-3-4
1.00000
D19G071
D19G072
1-2
0.1198
-0.016
-0.067
-0.115
6
2
7
1
4
6
7
8
4
4
3
2-1-7-8-4
7-4-6-4-3
1.00000
D19G071
D19G072
1-3
0.0968
0.020
0.106
0.192
7
7
7
6
6
1
4
9
5
3
1
7-6-1-5-1
7-6-4-9-3
0.25000
D19G071
D19G072
1-4
0.1711
-0.010
-0.041
-0.056
7
7
7
6
6
1
4
9
5
3
1
7-6-1-5-3
7-6-4-9-1
0.25000
D19G071
D19G072
2-1
0.0110
-0.009
-0.080
-0.454
7
7
7
6
6
1
4
9
5
3
1
7-6-1-9-1
7-6-4-5-3
0.25000
D19G071
D19G072
2-2
0.1323
0.035
0.163
0.170
7
7
7
6
6
1
4
9
5
3
1
7-6-1-9-3
7-6-4-5-1
0.25000
D19G071
D19G072
2-3
0.0575
0.002
0.013
0.019
D19G071
D19G072
2-4
0.0940
-0.036
-0.160
-0.279
D19G071
D19G072
3-1
0.0000
-0.006
-0.086
-1.000
D19G071
D19G072
3-2
0.0195
-0.011
-0.078
-0.353
D19G071
D19G072
3-3
0.0196
0.003
0.023
0.033
The first six individuals have only one possible haplo-
D19G071
D19G072
3-4
0.0528
0.012
0.086
0.231
D19G071
D19G072
4-1
0.0165
0.004
0.045
0.080
type pair given their genotype based on the inferred
D19G071
D19G072
4-2
0.0568
-0.002
-0.009
-0.027
D19G071
D19G072
4-3
0.0117
-0.021
-0.143
-0.645
haplotype frequencies.
For the seventh individual,
D19G071
D19G072
4-4
0.1008
0.023
0.118
0.218
there are four equally probable haplotype pairs.
Exploring Marker-Trait Relationships
In order to accommodate the frequency estima-
tion of haplotypes spanning more than two loci,
The CASECONTROL, FAMILY, and HAPLOTYPE
SAS/Genetics also offers the HAPLOTYPE proce-
procedures are available for testing for associations
dure. This procedure implements the EM algorithm
between dichotomous traits and markers. The fol-
to calculate maximum likelihood estimates of the hap-
lowing code demonstrates how to implement PROC
lotype frequencies assuming HWE with the following
CASECONTROL:
code:
proc casecontrol data=founders ndata=map outstat=cctests
genotype trend;
proc haplotype data=founders cutoff=0.01 out=outhap;
var a1-a210;
var a1-a10;
trait status;
trait status;
run;
run;
proc print data=cctests (obs=10) noobs;
run;

Here, haplotypes at the first five markers only are
examined. One of the tables included in the output
In addition to the VAR statement previously de-
by default is the “Haplotype Frequencies” table. The
scribed, PROC CASECONTROL has a TRAIT state-
results from using the TRAIT statement are shown
ment where a dichotomous trait such as disease
7

status is identified.
Three options are available in
OUTSTAT= Data Set from PROC FAMILY
the PROC CASECONTROL statement that indicate
ChiSq
df
Locus
TDT
TDT
ProbTDT
which case-control tests should be performed. The
D19G001
7.9903
8
0.43441
available tests are the allele case-control test (not
D19G002
8.2224
6
0.22226
D19G003
6.4880
7
0.48406
specified here), the genotype case-control test, and
D19G004
1.7472
8
0.98780
D19G005
1.8750
3
0.59875
the linear trend test. Note that the case-control statis-
D19G006
12.7737
14
0.54442
D19G007
4.1984
6
0.64985
tics all test the hypothesis of no association between
D19G008
9.1880
7
0.23944
D19G009
2.8388
3
0.41715
each marker and trait. The output data set that is cre-
D19G010
7.4416
11
0.76228
ated by this procedure contains the test statistics for
each test and each marker. A subset of the output
In the previous section, a TRAIT statement was
data set is displayed here:
specified in PROC HAPLOTYPE like the one in the
CASECONTROL and FAMILY procedures. This in-
OUTSTAT= Data Set from PROC CASECONTROL
vokes the likelihood ratio test for marker-trait associ-
ChiSq
ChiSq
df
df
Prob
Prob
ation using the estimated haplotype frequencies, per-
Locus
Genotype
Trend
Genotype
Trend
Genotype
Trend
formed over all haplotypes at the markers. The results
D19G001
32.3041
5.8754
44
8
0.90414
0.66118
D19G002
22.9640
7.5170
27
6
0.68695
0.27566
are displayed in the “Test for Marker-Trait Association”
D19G003
25.3857
5.0110
35
7
0.88367
0.65862
D19G004
26.7125
4.2236
44
8
0.98158
0.83641
table:
D19G005
13.0445
0.8921
9
3
0.16060
0.82732
D19G006
74.4421
21.0026
119
14
0.99954
0.10157
D19G007
23.4462
9.4603
27
6
0.66082
0.14930
D19G008
31.7414
6.8149
35
7
0.62621
0.44841
Test for Marker-Trait Association
D19G009
7.0928
3.3052
9
3
0.62746
0.34692
D19G010
47.1247
9.0602
77
11
0.99712
0.61634
Trait
Trait
Num
Chi-
Pr >
Number
Value
Obs
DF
LogLike
Square
ChiSq
1
U
120
1763
-1132
2
A
45
812
-360.27066
The FAMILY procedure has a similar syntax to that
Combined
165
2281
-1620
127.7227
1.0000
of the CASECONTROL procedure, with the only dif-
ferences being the tests and options offered in the
This table shows the p-value from the χ2
distribu-
PROC FAMILY statement, an option available in the
294
tion. An exact p-value can also be calculated by cre-
TRAIT statement, and the addition of an ID statement
ating new samples with the trait values permuted and
to designate the names of the individual and family
comparing the chi-square statistic of the new samples
identifiers, as shown in the following code:
to the one from the original sample.
proc family data=markertrt ndata=map outstat=famtests
Adjusting p-Values
tdt contcorr;
var a1-a210;
In this example, the case-control and family-based
trait status / affected=’A’;
id ped id father mother;

tests are each performed on over 100 markers. The
run;
PSMOOTH procedure offers two types of p-value ad-
proc print data=famtests (obs=10) noobs;
justments for these test results: smoothing methods
run;
to take into account p-values from neighboring, and
likely correlated, markers and multiple testing correc-
For the family-based tests, the value of the TRAIT
tions to account for the number of tests being per-
variable that is considered “affected” can have a sig-
formed. The following code invokes Simes’ method
nificant impact on the test results. By default, PROC
(1986) for smoothing the p-values using a bandwidth
FAMILY uses the second value of the TRAIT variable
of 10, which translates into a sliding window contain-
encountered in the input data set as the value cor-
ing 21 markers. The two methods available to correct
resonding to affected; alternatively, to ensure that the
for multiple testing are Bonferroni and Sidak. These
proper value is used, you can specify the appropriate
methods, had they been included as options in the
value in the /AFFECTED= option of the TRAIT state-
PROC statement, would have adjusted the smoothed
ment. As seen in the code, up to four identifiers may
p-values according to the number of markers tested
be included in the ID statement: the pedigree, indi-
(105 in this scenario). The VAR statement indicates
vidual, and two parental IDs. The pedigree variable
which column of p-values to adjust, and the ID state-
need not be specified if all individual IDs are unique.
ment lists the variables to include in the output data
There are several family-based tests available includ-
set along with the original and adjusted p-values.
ing those to accommodate missing parental geno-
types, but only the TDT is used in this analysis since
proc psmooth data=famtests out=sm_famtests simes bw=10;
the parental genotypes have been obtained. An out-
var ProbTDT;
id Locus;

put data set with a similar structure to the one created
run;
by PROC CASECONTROL is produced by PROC
proc print data=sm_famtests (obs=10) noobs;
FAMILY.
run;
8

The following is a subset of the output data set cre-
The color of the squares on the off-diagonal of the plot
ated by PROC PSMOOTH:
indicates the p-value range for the LD test. The sym-
bols on the diagonal represent both the HWE p-value
range using color, and the marker-disease associa-
OUT= Data Set from PROC PSMOOTH
tion p-value range from the TDT using the shape of
ProbTDT_
Locus
ProbTDT
S10
the symbol. In order to see the exact values of the
D19G001
0.43441
0.79427
p-values, the pop-up window can be viewed as dis-
D19G002
0.22226
0.86647
D19G003
0.48406
0.90088
played. This plot provides, at a glance, an indica-
D19G004
0.98780
0.88933
D19G005
0.59875
0.95285
tion of the extent that LD between markers extends,
D19G006
0.54442
0.93819
D19G007
0.64985
0.92063
which regions show a significant association with dis-
D19G008
0.23944
0.95115
D19G009
0.41715
0.94493
ease status, and for which markers Hardy-Weinberg
D19G010
0.76228
0.92263
proportions hold.
Evaluating Results
Conclusion
Instead of forcing you to scrutinize four separate out-
As shown in the preceding example, SAS/Genetics
put data sets, SAS/Genetics includes the %TPLOT
offers resources for analyzing genetic marker data
macro, which creates a graphical representation of
that go well beyond the methods currently available
the genetic marker test results. This plot contains dif-
in SAS. Procedures specifically designed for marker
ferent symbols and colors of the symbols representing
data provide a seamless transition from data collec-
certain p-value ranges. On a single plot, results from
tion to analysis and supply tools needed for the asso-
testing single markers for HWE and associations with
ciation mapping of a complex trait or disease. Gene-
the trait and pairs of markers for LD are combined.
mapping is further facilitated by a a graphical repre-
Here is a sample invocation of the macro using the
sentation of p-values in a distinct triangular plot.
output data sets that have been created in the pre-
ceding examples:
References
%tplot( ld, famtests, ProbTDT );
Armitage, P. (1955), “Tests for Linear Trends in
Proportions and Frequencies,” Biometrics, 11,
The first argument to the %TPLOT macro is the out-
375–386.
put data set from PROC ALLELE that contains the
HWE and LD test results.
The second argument
Botstein, D., White, R.L., Skolnick, M., and Davis,
is the output data set from PROC FAMILY contain-
R.W. (1980), “Construction of a Genetic Linkage
ing p-values for the family-based association test, the
Map in Man Using Restriction Fragment Length
TDT. Note that output data sets from either PROC
Polymorphisms,” American Journal of Human
CASECONTROL or PROC PSMOOTH can also be
Genetics, 32, 314–331.
used for the second argument, or any user-created
Curtis, D., Miller, M.B., and Sham, P.C. (1999),
data set containing p-values. The third argument in-
“Combining the Sibling Disequilibrium Test and
dicates which column of p-values from the data set
Transmission/Disequilibrium Test for Multiallelic
specified in the second argument to use in the plot.
Markers,” American Journal of Human Genetics,
The following is the color plot that is produced by this
64, 1785–1786.
macro:
Devlin, B. and Risch, N. (1995), “A Comparison of
Linkage Disequilibrium Measures for Fine-Scale
Mapping,” Genomics, 29, 311–322.
Excoffier, L. and Slatkin, M. (1995), “Maximum-
Likelihood Estimation of Molecular Haplotype
Frequencies in a Diploid Population,” Molecular
Biology and Evolution,
12, 921–927.
Hawley, M.E. and Kidd, K.K. (1995), “HAPLO: A
Program Using the EM Algorithm to Estimate the
Frequencies of Multi-site Haplotypes,” Journal of
Heredity,
86, 409–411.
Horvath, S. and Laird, N.M. (1998), “A Discordant-
Sibship Test for Disequilibrium and Linkage: No
9

Need for Parental Data,” American Journal of
Human Genetics,
63, 1886–1897.
Knapp, M. (1999),“The Transmission/Disequilibrium
Acknowledgments
Test
and
Parental-Genotype
Reconstruction:
The simulated GAW12 data are supported by NIGMS
The
Reconstruction-Combined
Transmission/
grant GM31575.
Disequilibrium Test,” American Journal of Human
Genetics,
64, 861–870.
Contact Information
Long, J.C., Williams, R.C., and Urbanek, M. (1995),
“An E-M Algorithm and Testing Strategy for
Wendy Czika
Multiple-Locus Haplotypes,” American Journal of
SAS Insititute Inc.
Human Genetics, 56: 799–810.
R-3212 SAS Campus Drive
Cary, NC 27513
Sasieni, P.D. (1997), “From Genotypes to Genes:
Doubling the Sample Size,” Biometrics,
53,
Email: wendy.czika@sas.com
1253–1261.
Sidak,
Z.
(1967),
“Rectangular
Confidence
Regions for the Means of Multivariate Normal
Distributions,” Journal of the American Statistical
Association,
62, 626–633.
Simes,
R.J.
(1986),
“An
Improved
Bonferroni
Procedure for Multiple Tests of Significance,”
Biometrika, 73, 751-754.
Slager, S.L. and Schaid, D.J. (2001), “Evaluation
of Candidate Genes in Case-Control Studies:
A Statistical Method to Account for Related
Subjects,” American Journal of Human Genetics,
68, 1457–1462.
Spielman,
R.S.
and
Ewens,
W.J.
(1998),
“A
Sibship Test for Linkage in the Presence of
Association: The Sib Transmission/Disequilibrium
Test,” American Journal of Human Genetics, 62,
450–458.
Spielman,
R.S.,
McGinnis,
R.E.,
and Ewens,
W.J. (1993), “Transmission Test for Linkage
Disequilibrium:
The Insulin Gene Region and
Insulin-dependent Diabetes Mellitus (IDDM),”
American Journal of Human Genetics,
52,
506–516.
Weir,
B.S.
(1996),
Genetic
Data
Analysis
II,
Sunderland, MA: Sinauer Associates, Inc.
Wijsman, E.M., et al. (2001), “Analysis of Complex
Genetic Traits:
Applications to Asthma and
Simulated Data,” Genetic Epidemiology,
21,
S1–S853.
Zaykin, D.V., Zhivotovsky, L.A., Westfall, P.H., and
Weir, B.S. (2002), “Truncated Product Method for
Combining P -values,” Genetic Epidemiology, 22,
170–185.
SAS and SAS/Genetics are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other
countries. ® indicates USA registration.
10

Document Outline