Original PDF Flash format a-global-geometric-framework-for-nonlinear-dimensionality-reduction  


A Global Geometric Framework For Nonlinear Dimensionality Reduction

R E P O R T S
23; right 36, 13, and 27); superior frontal gyrus (left
18. Matched-pair t tests (two-tailed) were used to test
29. S. P. Mewaldt, M. M. Ghoneim, Pharmacol. Biochem.
9, 31, and 45; right 17, 35, and 37).
the significance of drug-related changes in the vol-
Behav. 10, 1205 (1979).
17. Although the improvement in WM performance with
ume of regions of interest that showed significant
30. M. Petrides, Philos. Trans. R. Soc. London Ser. B 351,
cholinergic enhancement was a nonsignificant trend
response contrasts.
1455 (1996).
in the current study (P
0.07), in a previous study
19. H. Sato, Y. Hata, H. Masui, T. Tsumoto, J. Neuro-
31. M. E. Hasselmo, E. Fransen, C. Dickson, A. A. Alonso,
(9) with a larger sample (n
13) the effect was
physiol. 55, 765 (1987).
Ann. N.Y. Acad. Sci. 911, 418 (2000).
highly significant (P
0.001). In the current study,
20. M. E. Hasselmo, Behav. Brain Res. 67, 1 (1995).
32. M. M. Mesulam, Prog. Brain Res. 109, 285 (1996).
we analyzed RT data for six of our seven subjects
21. M. G. Baxter, A. A. Chiba, Curr. Opin. Neurobiol. 9,
33. R. T. Bartus, R. L. Dean III, B. Beer, A. S. Lippa, Science
because the behavioral data for one subject were
178 (1999).
217, 408 (1985).
unavailable due to a computer failure. The difference
34. N. Qizilbash et al., JAMA 280, 1777 (1998).
22. B. J. Everitt, T. W. Robbins, Annu. Rev. Psychol. 48,
in the significance of the two findings is simply a
649 (1997).
35. J. V. Haxby, J. Ma. Maisog, S. M. Courtney, in Mapping
result of the difference in sample sizes. A power
and Modeling the Human Brain, P. Fox, J. Lancaster, K.
analysis shows that the size of the RT difference and
23. R. Desimone, J. Duncan, Annu. Rev. Neurosci. 18, 193
Friston, Eds. ( Wiley, New York, in press).
variability in the current sample would yield a signif-
(1995).
36. We express our appreciation to S. Courtney, R. Desi-
icant result (P
0.01) with a sample size of 13.
24. P. C. Murphy, A. M. Sillito, Neuroscience 40, 13
mone, Y. Jiang, S. Kastner, L. Latour, A. Martin, L.
During the memory trials, mean RT was 1180 ms
(1991).
Pessoa, and L. Ungerleider for careful and critical
during placebo and 1119 ms during physostigmine.
25. M. Corbetta, F. M. Miezin, S. Dobmeyer, G. L. Shul-
review of the manuscript. We also thank M. B. Scha-
During the control trials, mean RT was 735 ms during
man, S. E. Peterson, J. Neurosci. 11, 2383 (1991).
piro and S. I. Rapoport for input during early stages of
placebo and 709 ms during physostigmine, a differ-
26. J. V. Haxby et al., J. Neurosci. 14, 6336 (1994).
this project. This research was supported by the
ence that did not approach significance (P
0.24),
27. A. Rosier, L. Cornette, G. A. Orban, Neuropsychobiol-
National Institute on Mental Health and National
suggesting that the effect of cholinergic enhance-
ogy 37, 98 (1998).
Institute on Aging Intramural Research Programs.
ment on WM performance is not due to a nonspecific
28. M. E. Hasselmo, B. P. Wyble, G. V. Wallenstein, Hip-
increase in arousal.
pocampus 6, 693 (1996).
7 August 2000; accepted 15 November 2000
A Global Geometric Framework
The classical techniques for dimensional-
ity reduction, PCA and MDS, are simple to
for Nonlinear Dimensionality
implement, efficiently computable, and guar-
anteed to discover the true structure of data
Reduction
lying on or near a linear subspace of the
high-dimensional input space (13). PCA
finds a low-dimensional embedding of the
Joshua B. Tenenbaum,1* Vin de Silva,2 John C. Langford3
data points that best preserves their variance
as measured in the high-dimensional input
Scientists working with large volumes of high-dimensional data, such as global
space. Classical MDS finds an embedding
climate patterns, stellar spectra, or human gene distributions, regularly con-
that preserves the interpoint distances, equiv-
front the problem of dimensionality reduction: finding meaningful low-dimen-
alent to PCA when those distances are Eu-
sional structures hidden in their high-dimensional observations. The human
clidean. However, many data sets contain
brain confronts the same problem in everyday perception, extracting from its
essential nonlinear structures that are invisi-
high-dimensional sensory inputs—30,000 auditory nerve fibers or 106 optic
ble to PCA and MDS (4, 5, 11, 14 ). For
nerve fibers—a manageably small number of perceptually relevant features.
example, both methods fail to detect the true
Here we describe an approach to solving dimensionality reduction problems
degrees of freedom of the face data set (Fig.
that uses easily measured local metric information to learn the underlying
1A), or even its intrinsic three-dimensionality
global geometry of a data set. Unlike classical techniques such as principal
(Fig. 2A).
component analysis (PCA) and multidimensional scaling (MDS), our approach
Here we describe an approach that com-
is capable of discovering the nonlinear degrees of freedom that underlie com-
bines the major algorithmic features of PCA
plex natural observations, such as human handwriting or images of a face under
and MDS—computational efficiency, global
different viewing conditions. In contrast to previous algorithms for nonlinear
optimality, and asymptotic convergence guar-
dimensionality reduction, ours efficiently computes a globally optimal solution,
antees—with the flexibility to learn a broad
and, for an important class of data manifolds, is guaranteed to converge
class of nonlinear manifolds. Figure 3A illus-
asymptotically to the true structure.
trates the challenge of nonlinearity with data
lying on a two-dimensional “Swiss roll”: points
A canonical problem in dimensionality re-
ality may be quite high (e.g., 4096 for these
far apart on the underlying manifold, as mea-
duction from the domain of visual perception
64 pixel by 64 pixel images), the perceptually
sured by their geodesic, or shortest path, dis-
is illustrated in Fig. 1A. The input consists of
meaningful structure of these images has
tances, may appear deceptively close in the
many images of a person’s face observed
many fewer independent degrees of freedom.
high-dimensional input space, as measured by
under different pose and lighting conditions,
Within the 4096-dimensional input space, all
their straight-line Euclidean distance. Only the
in no particular order. These images can be
of the images lie on an intrinsically three-
geodesic distances reflect the true low-dimen-
thought of as points in a high-dimensional
dimensional manifold, or constraint surface,
sional geometry of the manifold, but PCA and
vector space, with each input dimension cor-
that can be parameterized by two pose vari-
MDS effectively see just the Euclidean struc-
responding to the brightness of one pixel in
ables plus an azimuthal lighting angle. Our
ture; thus, they fail to detect the intrinsic two-
the image or the firing rate of one retinal
goal is to discover, given only the unordered
dimensionality (Fig. 2B).
ganglion cell. Although the input dimension-
high-dimensional inputs, low-dimensional
Our approach builds on classical MDS but
representations such as Fig. 1A with coordi-
seeks to preserve the intrinsic geometry of the
nates that capture the intrinsic degrees of
data, as captured in the geodesic manifold
1Department of Psychology and 2Department of
freedom of a data set. This problem is of
distances between all pairs of data points. The
Mathematics, Stanford University, Stanford, CA
central importance not only in studies of vi-
crux is estimating the geodesic distance be-
94305, USA. 3Department of Computer Science, Car-
negie Mellon University, Pittsburgh, PA 15217, USA.
sion (15), but also in speech (6, 7 ), motor
tween faraway points, given only input-space
control (8, 9), and a range of other physical
distances. For neighboring points, input-
*To whom correspondence should be addressed. E-
mail: jbt@psych.stanford.edu
and biological sciences (1012).
space distance provides a good approxima-
www.sciencemag.org SCIENCE VOL 290 22 DECEMBER 2000
2319

R E P O R T S
tion to geodesic distance. For faraway points,
X. Two simple methods are to connect each
The final step applies classical MDS to
geodesic distance can be approximated by
point to all points within some fixed radius ,
the matrix of graph distances D
{d (i, j)},
G
G
adding up a sequence of “short hops” be-
or to all of its K nearest neighbors (15). These
constructing an embedding of the data in a
tween neighboring points. These approxima-
neighborhood relations are represented as a
d-dimensional Euclidean space Y that best
tions are computed efficiently by finding
weighted graph G over the data points, with
preserves the manifold’s estimated intrinsic
shortest paths in a graph with edges connect-
edges of weight d (i, j) between neighboring
geometry (Fig. 3C). The coordinate vectors y
X
i
ing neighboring data points.
points (Fig. 3B).
for points in Y are chosen to minimize the
The complete isometric feature mapping,
In its second step, Isomap estimates the
cost function
or Isomap, algorithm has three steps, which
geodesic distances d (i, j) between all pairs
M
E
D
are detailed in Table 1. The first step deter-
of points on the manifold M by computing
G
DY L2
(1)
mines which points are neighbors on the
their shortest path distances d (i, j) in the
where D
denotes the matrix of Euclidean
G
Y
manifold M, based on the distances d (i, j)
graph G. One simple algorithm (16 ) for find-
distances {d (i, j)
y
X
Y
i
yj } and A L2
between pairs of points i, j in the input space
ing shortest paths is given in Table 1.
the L2 matrix norm
A2 . The
operator
i, j
i j
Fig. 1. (A) A canonical dimensionality reduction
problem from visual perception. The input consists
of a sequence of 4096-dimensional vectors, rep-
resenting the brightness values of 64 pixel by 64
pixel images of a face rendered with different
poses and lighting directions. Applied to N
698
raw images, Isomap (K
6) learns a three-dimen-
sional embedding of the data’s intrinsic geometric
structure. A two-dimensional projection is shown,
with a sample of the original input images (red
circles) superimposed on all the data points (blue)
and horizontal sliders (under the images) repre-
senting the third dimension. Each coordinate axis
of the embedding correlates highly with one de-
gree of freedom underlying the original data: left-
right pose (x axis, R
0.99), up-down pose ( y
axis, R
0.90), and lighting direction (slider posi-
tion, R
0.92). The input-space distances d (i,j)
X
given to Isomap were Euclidean distances be-
tween the 4096-dimensional image vectors. (B)
Isomap applied to N
1000 handwritten “2”s
from the MNIST database (40). The two most
significant dimensions in the Isomap embedding,
shown here, articulate the major features of the
“2”: bottom loop (x axis) and top arch ( y axis).
Input-space distances d (i,j) were measured by
X
tangent distance, a metric designed to capture the
invariances relevant in handwriting recognition
(41). Here we used -Isomap (with
4.2) be-
cause we did not expect a constant dimensionality
to hold over the whole data set; consistent with
this, Isomap finds several tendrils projecting from
the higher dimensional mass of data and repre-
senting successive exaggerations of an extra
stroke or ornament in the digit.
2320
22 DECEMBER 2000 VOL 290 SCIENCE www.sciencemag.org

R E P O R T S
converts distances to inner products (17 ),
whose intrinsic geometry is that of a convex
density of points. To the extent that a data set
which uniquely characterize the geometry of
region of Euclidean space, but whose ambi-
presents extreme values of these parameters
the data in a form that supports efficient
ent geometry in the high-dimensional input
or deviates from a uniform density, asymp-
optimization. The global minimum of Eq. 1 is
space may be highly folded, twisted, or
totic convergence still holds in general, but
achieved by setting the coordinates y to the
curved. For non-Euclidean manifolds, such as
the sample size required to estimate geodes-
i
top d eigenvectors of the matrix (D ) (13).
a hemisphere or the surface of a doughnut,
ic distance accurately may be impractically
G
As with PCA or MDS, the true dimen-
Isomap still produces a globally optimal low-
large.
sionality of the data can be estimated from
dimensional Euclidean representation, as
Isomap’s global coordinates provide a
the decrease in error as the dimensionality of
measured by Eq. 1.
simple way to analyze and manipulate high-
Y is increased. For the Swiss roll, where
These guarantees of asymptotic conver-
dimensional observations in terms of their
classical methods fail, the residual variance
gence rest on a proof that as the number of
intrinsic nonlinear degrees of freedom. For a
of Isomap correctly bottoms out at d
2
data points increases, the graph distances
set of synthetic face images, known to have
(Fig. 2B).
d (i, j) provide increasingly better approxi-
three degrees of freedom, Isomap correctly
G
Just as PCA and MDS are guaranteed,
mations to the intrinsic geodesic distances
detects the dimensionality (Fig. 2A) and sep-
given sufficient data, to recover the true
d (i, j), becoming arbitrarily accurate in the
arates out the true underlying factors (Fig.
M
structure of linear manifolds, Isomap is guar-
limit of infinite data (18, 19). How quickly
1A). The algorithm also recovers the known
anteed asymptotically to recover the true di-
d (i, j) converges to d (i, j) depends on cer-
low-dimensional structure of a set of noisy
G
M
mensionality and geometric structure of a
tain parameters of the manifold as it lies
real images, generated by a human hand vary-
strictly larger class of nonlinear manifolds.
within the high-dimensional space (radius of
ing in finger extension and wrist rotation
Like the Swiss roll, these are manifolds
curvature and branch separation) and on the
(Fig. 2C) (20). Given a more complex data
set of handwritten digits, which does not have
a clear manifold geometry, Isomap still finds
Fig. 2. The residual
globally meaningful coordinates (Fig. 1B)
variance of PCA (open
and nonlinear structure that PCA or MDS do
triangles), MDS [open
not detect (Fig. 2D). For all three data sets,
triangles in (A) through
(C); open circles in (D)],
the natural appearance of linear interpolations
and Isomap (filled cir-
between distant points in the low-dimension-
cles) on four data sets
al coordinate space confirms that Isomap has
(42). (A) Face images
captured the data’s perceptually relevant
varying in pose and il-
structure (Fig. 4).
lumination (Fig. 1A).
Previous attempts to extend PCA and
(B) Swiss roll data (Fig.
3). (C) Hand images
MDS to nonlinear data sets fall into two
varying in finger exten-
broad classes, each of which suffers from
sion and wrist rotation
limitations overcome by our approach. Local
(20). (D) Handwritten
linear techniques (2123) are not designed to
“2”s (Fig. 1B). In all cas-
represent the global structure of a data set
es, residual variance de-
within a single coordinate system, as we do in
creases as the dimen-
sionality d is increased.
Fig. 1. Nonlinear techniques based on greedy
The intrinsic dimen-
optimization procedures (2430) attempt to
sionality of the data
discover global structure, but lack the crucial
can be estimated by
algorithmic features that Isomap inherits
looking for the “elbow”
from PCA and MDS: a noniterative, polyno-
at which this curve ceases to decrease significantly with added dimensions. Arrows mark the true or
mial time procedure with a guarantee of glob-
approximate dimensionality, when known. Note the tendency of PCA and MDS to overestimate the
dimensionality, in contrast to Isomap.
al optimality; for intrinsically Euclidean man-
Fig. 3. The “Swiss roll” data set, illustrating how Isomap exploits geodesic
1000 data points) allows an approximation (red segments) to the true
paths for nonlinear dimensionality reduction. (A) For two arbitrary points
geodesic path to be computed efficiently in step two, as the shortest
(circled) on a nonlinear manifold, their Euclidean distance in the high-
path in G. (C) The two-dimensional embedding recovered by Isomap in
dimensional input space (length of dashed line) may not accurately
step three, which best preserves the shortest path distances in the
reflect their intrinsic similarity, as measured by geodesic distance along
neighborhood graph (overlaid). Straight lines in the embedding (blue)
the low-dimensional manifold (length of solid curve). (B) The neighbor-
now represent simpler and cleaner approximations to the true geodesic
hood graph G constructed in step one of Isomap (with K
7 and N
paths than do the corresponding graph paths (red).
www.sciencemag.org SCIENCE VOL 290 22 DECEMBER 2000
2321

R E P O R T S
Table 1. The Isomap algorithm takes as input the distances d (i,j ) between all pairs i,j from N data points
local dimensionality varies across the data set. When
X
in the high-dimensional input space X, measured either in the standard Euclidean metric (as in Fig. 1A)
available, additional constraints such as the temporal
ordering of observations may also help to determine
or in some domain-specific metric (as in Fig. 1B). The algorithm outputs coordinate vectors y in a
i
neighbors. In earlier work (36) we explored a more
d-dimensional Euclidean space Y that (according to Eq. 1) best represent the intrinsic geometry of the
complex method (37), which required an order of
data. The only free parameter ( or K ) appears in Step 1.
magnitude more data and did not support the theo-
retical performance guarantees we provide here for
Step
- and K-Isomap.
16. This procedure, known as Floyd’s algorithm, requires
O(N3) operations. More efficient algorithms exploit-
1
Construct neighborhood graph
Define the graph G over all data points by connecting
ing the sparse structure of the neighborhood graph
points i and j if [as measured by d (i, j )] they are
X
can be found in (38).
closer than ( -Isomap), or if i is one of the K
17. The operator is defined by (D)
HSH/2, where S
nearest neighbors of j (K-Isomap). Set edge lengths
is the matrix of squared distances {S
D2}, and H is
ij
i j
equal to d (i,j).
X
the “centering matrix” {H
1/N} (13).
ij
ij
2
Compute shortest paths
Initialize d (i,j)
d (i,j) if i,j are linked by an edge;
18. Our proof works by showing that for a sufficiently
G
X
d (i,j)
otherwise. Then for each value of k
high density ( ) of data points, we can always choose
G
a neighborhood size ( or K) large enough that the
1, 2, . . ., N in turn, replace all entries d (i,j) by
G
graph will (with high probability) have a path not
min{d (i,j), d (i,k)
d (k,j)}. The matrix of final
G
G
G
much longer than the true geodesic, but small
values D
{d (i,j)} will contain the shortest path
G
G
enough to prevent edges that “short circuit” the true
distances between all pairs of points in G (16, 19).
geometry of the manifold. More precisely, given ar-
3
Construct d-dimensional embedding
Let
be the p-th eigenvalue (in decreasing order) of
bitrarily small values of
,
, and , we can guar-
p
1
2
the matrix (D ) (17), and v i be the i-th
antee that with probability at least 1
, estimates
G
p
of the form
component of the p-th eigenvector. Then set the
p-th component of the d-dimensional coordinate
1
1 dM i, j
dG i, j
1
2 dM i, j
vector y equal to
vi .
i
p p
will hold uniformly over all pairs of data points i,j. For
-Isomap, we require
2/ r0 24 1,
s0,
ifolds, a guarantee of asymptotic conver-
log V/
d
2 /16 d
/ d 2 /8 d
gence to the true structure; and the ability to
where r is the minimal radius of curvature of the
0
discover manifolds of arbitrary dimensional-
manifold M as embedded in the input space X, s is
0
ity, rather than requiring a fixed d initialized
the minimal branch separation of M in X, V is the
(d-dimensional) volume of M, and (ignoring boundary
from the beginning or computational resourc-
effects)
is the volume of the unit ball in Euclidean
d
es that increase exponentially in d.
d-space. For K-Isomap, we let be as above and fix
Here we have demonstrated Isomap’s per-
the ratio (K
1)/
( /2)d/2. We then require
d
formance on data sets chosen for their visu-
e K 1 /4
d
/4 d/4V,
ally compelling structures, but the technique
e/4 K 1 / 2
d
/8 d/16V,
may be applied wherever nonlinear geometry
4 log 8V/
d
2 /32
d
/ d 2 /16 d
complicates the use of PCA or MDS. Isomap
The exact content of these conditions—but not their
complements, and may be combined with,
general form—depends on the particular technical
linear extensions of PCA based on higher
assumptions we adopt. For details and extensions to
nonuniform densities, intrinsic curvature, and bound-
order statistics, such as independent compo-
ary effects, see http://isomap.stanford.edu.
nent analysis (31, 32). It may also lead to a
19. In practice, for finite data sets, d (i,j) may fail to
G
better understanding of how the brain comes
approximate d (i,j) for a small fraction of points that
M
are disconnected from the giant component of the
to represent the dynamic appearance of ob-
neighborhood graph G. These outliers are easily de-
jects, where psychophysical studies of appar-
tected as having infinite graph distances from the
ent motion (33, 34 ) suggest a central role for
majority of other points and can be deleted from
further analysis.
geodesic transformations on nonlinear mani-
20. The Isomap embedding of the hand images is avail-
folds (35) much like those studied here.
able at Science Online at www.sciencemag.org/cgi/
content/full/290/5500/2319/DC1. For additional
References and Notes
material and computer code, see http://isomap.
1. M. P. Young, S. Yamane, Science 256, 1327 (1992).
stanford.edu.
2. R. N. Shepard, Science 210, 390 (1980).
21. R. Basri, D. Roth, D. Jacobs, Proceedings of the IEEE
3. M. Turk, A. Pentland, J. Cogn. Neurosci. 3, 71 (1991).
Conference on Computer Vision and Pattern Recog-
Fig. 4. Interpolations along straight lines in
4. H. Murase, S. K. Nayar, Int. J. Comp. Vision 14, 5
nition (1998), pp. 414–420.
the Isomap coordinate space (analogous to
(1995).
22. C. Bregler, S. M. Omohundro, Adv. Neural Info. Proc.
the blue line in Fig. 3C) implement perceptu-
5. J. W. McClurkin, L. M. Optican, B. J. Richmond, T. J.
Syst. 7, 973 (1995).
ally natural but highly nonlinear “morphs” of
Gawne, Science 253, 675 (1991).
23. G. E. Hinton, M. Revow, P. Dayan, Adv. Neural Info.
6. J. L. Elman, D. Zipser, J. Acoust. Soc. Am. 83, 1615
Proc. Syst. 7, 1015 (1995).
the corresponding high-dimensional observa-
(1988).
24. R. Durbin, D. Willshaw, Nature 326, 689 (1987).
tions (43) by transforming them approxi-
7. W. Klein, R. Plomp, L. C. W. Pols, J. Acoust. Soc. Am.
25. T. Kohonen, Self-Organisation and Associative Mem-
mately along geodesic paths (analogous to
48, 999 (1970).
ory (Springer-Verlag, Berlin, ed. 2, 1988), pp. 119–
the solid curve in Fig. 3A). (A) Interpolations
8. E. Bizzi, F. A. Mussa-Ivaldi, S. Giszter, Science 253, 287
157.
in a three-dimensional embedding of face
(1991).
26. T. Hastie, W. Stuetzle, J. Am. Stat. Assoc. 84, 502
images (Fig. 1A). (B) Interpolations in a four-
9. T. D. Sanger, Adv. Neural Info. Proc. Syst. 7, 1023
(1989).
dimensional embedding of hand images (20)
(1995).
27. M. A. Kramer, AIChE J. 37, 233 (1991).
appear as natural hand movements when
10. J. W. Hurrell, Science 269, 676 (1995).
28. D. DeMers, G. Cottrell, Adv. Neural Info. Proc. Syst. 5,
viewed in quick succession, even though no
11. C. A. L. Bailer-Jones, M. Irwin, T. von Hippel, Mon.
580 (1993).
Not. R. Astron. Soc. 298, 361 (1997).
29. R. Hecht-Nielsen, Science 269, 1860 (1995).
such motions occurred in the observed data. (C)
12. P. Menozzi, A. Piazza, L. Cavalli-Sforza, Science 201,
30. C. M. Bishop, M. Svense´n, C. K. I. Williams, Neural
Interpolations in a six-dimensional embedding of
786 (1978).
Comp. 10, 215 (1998).
handwritten “2”s (Fig. 1B) preserve continuity not
13. K. V. Mardia, J. T. Kent, J. M. Bibby, Multivariate
31. P. Comon, Signal Proc. 36, 287 (1994).
only in the visual features of loop and arch artic-
Analysis, (Academic Press, London, 1979).
32. A. J. Bell, T. J. Sejnowski, Neural Comp. 7, 1129
ulation, but also in the implied pen trajectories,
14. A. H. Monahan, J. Clim., in press.
(1995).
which are the true degrees of freedom underlying
15. The scale-invariant K parameter is typically easier to
33. R. N. Shepard, S. A. Judd, Science 191, 952 (1976).
those appearances.
set than , but may yield misleading results when the
34. M. Shiffrar, J. J. Freyd, Psychol. Science 1, 257 (1990).
2322
22 DECEMBER 2000 VOL 290 SCIENCE www.sciencemag.org

R E P O R T S
35. R. N. Shepard, Psychon. Bull. Rev. 1, 2 (1994).
1 – R2(, D ). D is the matrix of Euclidean distanc-
ing the coordinates of corresponding points {x , y } in
M
Y
Y
i
i
36. J. B. Tenenbaum, Adv. Neural Info. Proc. Syst. 10, 682
es in the low-dimensional embedding recovered by
both spaces provided by Isomap together with stan-
(1998).
each algorithm. is each algorithm’s best estimate
dard supervised learning techniques (39).
M
37. T. Martinetz, K. Schulten, Neural Netw. 7, 507 (1994).
of the intrinsic manifold distances: for Isomap, this is
44. Supported by the Mitsubishi Electric Research Labo-
38. V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduc-
the graph distance matrix D ; for PCA and MDS, it is
G
ratories, the Schlumberger Foundation, the NSF
tion to Parallel Computing: Design and Analysis of
the Euclidean input-space distance matrix D (except
X
(DBS-9021648), and the DARPA Human ID program.
Algorithms (Benjamin/Cummings, Redwood City, CA,
with the handwritten “2”s, where MDS uses the
We thank Y. LeCun for making available the MNIST
1994), pp. 257–297.
tangent distance). R is the standard linear correlation
database and S. Roweis and L. Saul for sharing related
39. D. Beymer, T. Poggio, Science 272, 1905 (1996).
coefficient, taken over all entries of and D .
M
Y
unpublished work. For many helpful discussions, we
40. Available at www.research.att.com/ yann/ocr/mnist.
43. In each sequence shown, the three intermediate im-
thank G. Carlsson, H. Farid, W. Freeman, T. Griffiths,
41. P. Y. Simard, Y. LeCun, J. Denker, Adv. Neural Info.
ages are those closest to the points 1/4, 1/2, and 3/4
R. Lehrer, S. Mahajan, D. Reich, W. Richards, J. M.
Proc. Syst. 5, 50 (1993).
of the way between the given endpoints. We can also
Tenenbaum, Y. Weiss, and especially M. Bernstein.
42. In order to evaluate the fits of PCA, MDS, and Isomap
synthesize an explicit mapping from input space X to
on comparable grounds, we use the residual variance
the low-dimensional embedding Y, or vice versa, us-
10 August 2000; accepted 21 November 2000
Nonlinear Dimensionality
along shortest paths confined to the manifold of
observed inputs. Here, we take a different ap-
Reduction by
proach, called locally linear embedding (LLE),
that eliminates the need to estimate pairwise
Locally Linear Embedding
distances between widely separated data points.
Unlike previous methods, LLE recovers global
nonlinear structure from locally linear fits.
Sam T. Roweis1 and Lawrence K. Saul2
The LLE algorithm, summarized in Fig.
2, is based on simple geometric intuitions.
Many areas of science depend on exploratory data analysis and visualization.
Suppose the data consist of N real-valued
The need to analyze large amounts of multivariate data raises the fundamental
vectors X , each of dimensionality D, sam-
i
problem of dimensionality reduction: how to discover compact representations
pled from some underlying manifold. Pro-
of high-dimensional data. Here, we introduce locally linear embedding (LLE), an
vided there is sufficient data (such that the
unsupervised learning algorithm that computes low-dimensional, neighbor-
manifold is well-sampled), we expect each
hood-preserving embeddings of high-dimensional inputs. Unlike clustering
data point and its neighbors to lie on or
methods for local dimensionality reduction, LLE maps its inputs into a single
close to a locally linear patch of the mani-
global coordinate system of lower dimensionality, and its optimizations do not
fold. We characterize the local geometry of
involve local minima. By exploiting the local symmetries of linear reconstruc-
these patches by linear coefficients that
tions, LLE is able to learn the global structure of nonlinear manifolds, such as
reconstruct each data point from its neigh-
those generated by images of faces or documents of text.
bors. Reconstruction errors are measured
by the cost function
How do we judge similarity? Our mental
coordinates as observed modes of variability.
2
representations of the world are formed by
Previous approaches to this problem, based on
ε W
Xi
jWij Xj
(1)
processing large numbers of sensory in-
multidimensional scaling (MDS) (2), have
i
puts—including, for example, the pixel in-
computed embeddings that attempt to preserve
which adds up the squared distances between
tensities of images, the power spectra of
pairwise distances [or generalized disparities
all the data points and their reconstructions. The
sounds, and the joint angles of articulated
(3)] between data points; these distances are
weights W summarize the contribution of the
ij
bodies. While complex stimuli of this form can
measured along straight lines or, in more so-
jth data point to the ith reconstruction. To com-
be represented by points in a high-dimensional
phisticated usages of MDS such as Isomap (4),
pute the weights W , we minimize the cost
ij
vector space, they typically have a much more
compact description. Coherent structure in the
world leads to strong correlations between in-
puts (such as between neighboring pixels in
images), generating observations that lie on or
close to a smooth low-dimensional manifold.
To compare and classify such observations—in
effect, to reason about the world— depends
crucially on modeling the nonlinear geometry
of these low-dimensional manifolds.
Scientists interested in exploratory analysis
or visualization of multivariate data (1) face a
similar problem in dimensionality reduction.
The problem, as illustrated in Fig. 1, involves
Fig. 1. The problem of nonlinear dimensionality reduction, as illustrated (10) for three-dimensional
mapping high-dimensional inputs into a low-
data (B) sampled from a two-dimensional manifold (A). An unsupervised learning algorithm must
dimensional “description” space with as many
discover the global internal coordinates of the manifold without signals that explicitly indicate how
the data should be embedded in two dimensions. The color coding illustrates the neighborhood-
preserving mapping discovered by LLE; black outlines in (B) and (C) show the neighborhood of a
1Gatsby Computational Neuroscience Unit, Universi-
single point. Unlike LLE, projections of the data by principal component analysis (PCA) (28) or
ty College London, 17 Queen Square, London WC1N
classical MDS (2) map faraway data points to nearby points in the plane, failing to identify the
3AR, UK. 2AT&T Lab—Research, 180 Park Avenue,
underlying structure of the manifold. Note that mixture models for local dimensionality reduction
Florham Park, NJ 07932, USA.
(29), which cluster the data and perform PCA within each cluster, do not address the problem
E-mail: roweis@gatsby.ucl.ac.uk (S.T.R.); lsaul@research.
considered here: namely, how to map high-dimensional data into a single global coordinate system
att.com (L.K.S.)
of lower dimensionality.
www.sciencemag.org SCIENCE VOL 290 22 DECEMBER 2000
2323