
NESUG 17
Analysis
REAL NEURAL NETWORKING:
USING SAS TO BUILD AND DELIVER BEHAVIOR-GENETIC-NERVOUS
SPATIAL AND FUNCTIONAL MAPS OF BIOLOGICAL SYSTEMS
Haftan Eckholdt, Albert Einstein College of Medicine, Bronx, NY

ABSTRACT
This talk describes the status of a long-term project to build a digital model of genes, behavior, and neuroanatomy in a biological system.
Caenorhabditis elegans was the first species to be mapped genetically, the first species to be mapped neurologically, and the first species
to provide evidence of specific gene-behavior relationships. SAS [DATA STEP, PROC GPLOT, %MACRO] is being used in the first phase to
build a spatially and functionally accurate digital model of the nervous system, including information about location and branching, 3D maps of
neurons, 3D animations of neurons, and 3D VRML outputs. SAS has been used to build a database of the entire genetic structure [Computer
Cluster management and computation: SUGI2001]. Once the behavioral database has been developed, SAS can be used to serve Web-enabled
modeling systems accessible through neurons, genes, or behavior. With such a system, scientists could model complex genetic and
neurological hypotheses, including space-function-behavior models of outcome.
INTRODUCTION
On the technical front, new boundaries in computing capacity provide an environment for the simulation of complex biological processes. On the
scientific front, advances in genetics (bioinformatics) and neuroscience (neuroinformatics) provide an opportunity to consider modeling whole
biological systems (here a nematode worm, C. elegans). What follows are descriptions of progress in these realms.
PARALLEL WORLDS
Research in disciplines with high-density data has been limited by the state of hardware and software in supercomputing. Examples include
simulations, model building, and graphics in weather, finance, and genetics. The barriers to supercomputing have traditionally been the costs
associated with hardware and programming. Using a cluster of workstations (COW) to simulate a parallel processor puts the price of a
teraflop of hardware under $1,000,000. More important, perhaps, is the scalability of a COW: 50 gigaflops for $50,000 is well within
most research budgets.

The growing history of parallel processing can best be accessed through the web site of the Computer Society of the Institute of Electrical and Electronics
Engineers (IEEE) [http://www.computer.org/parascope/], which also sponsors the International Symposium on Parallel Architectures,
Algorithms and Networks. The level of activity is suggested by the return of 66,500 URLs from the www.google.com search for
“cluster of workstations” as of September 2000. Readers should refer to the brief bibliography for recent textbooks on the topic.

The real criteria for using COWs are the nature and eligibility of the problem under study. Approaches to parallel processing, in
bioinformatics as elsewhere, are classified by the instruction stream (single versus multiple) and the data stream (single versus multiple) to yield four types of parallel
processors: SISD, SIMD, MISD, and MIMD (Buyya, 1999b; Foster, 1994; Wilkinson and Allen, 1999).

In general, questions that lend themselves to parallel processing on a COW are those that can be broken down into parts or jobs.
“Granularity” is the term COW programmers use to describe just how small, how similar, and how many jobs can be parallelized from the overall
problem. Coding is much easier if jobs do not depend on each other. Most importantly, there must be at least one job for each
processor / workstation, no matter how large or small the individual job may be. The problems that benefit most from parallel processing
on COWs are those that have either (1) a large number of symmetric jobs, or (2) jobs that benefit from each other’s intermediate solutions.
Type (1) problems can be distributed rapidly and evenly across the COW, so that no processor is idle and each
job solution is posted back to the server. They consist of what might be considered a large set of non-dimensional
jobs: the whole list must be completed for the problem to be solved, but there is no logical or necessary order in which the jobs must be
completed. Here a COW outruns a SoW (single workstation) because the throughputs (access to critical devices: hard drive,
RAM, CPU) have been radically expanded and repeated in parallel. In some cases, the resulting throughput of a COW exceeds that of a
supercomputer of similar capacity. Type (2) problems benefit from the “folding of time” that can occur in a parallel process.
For instance, some solutions might be reached faster with access to solutions from jobs later in the process, something that would never occur
in traditional computing environments (SoW or supercomputer).
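
A type (1) problem can be sketched in a few lines. The example below is written in Python rather than SAS, purely as an illustration. A pool of threads stands in for the workstations of a COW, and the job (`gc_fraction`) and its inputs are hypothetical names chosen for the example, not part of the project described here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_symmetric_jobs(job, inputs, n_workers=4):
    """Run job(x) for every x in inputs on a pool of workers.

    The jobs are assumed to be independent (type 1): none of them reads
    another job's result, so they can be handed out in any order.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() gives each job to whichever worker is free and returns the
        # results in input order, much like posting solutions to a server.
        return list(pool.map(job, inputs))


def gc_fraction(sequence):
    """Hypothetical per-job work: fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


results = run_symmetric_jobs(gc_fraction, ["GGCC", "ATAT", "ATGC"])
print(results)
```

Because no job waits on another, adding workers shortens the run until there are fewer jobs than workers, which is the "at least one job for each processor" condition stated above.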

While supercomputing in the form of parallel processing addresses the computational barriers to research, there are many other barriers
preventing researchers from pursuing their ideas.

The challenges of data-intensive projects include the production, acquisition, management, and analysis of data (files) distributed across
many disks and many points in time. Writing, then finding, reading, and reconciling these files often involves data loss rates beyond tolerance
for tracking and analysis in longitudinal / repeated measurement projects. SAS MACRO® and the SAS X® command can be used like a robot
to search drives and directories across networks in order to seek and map file structures for information regarding location, file characteristics,
and variables. The product of file mapping can then be used to read eligible files for analysis.
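
The idea of file mapping can be illustrated with a minimal sketch. It is written in Python rather than with SAS MACRO and the X command, and the function name, the record fields, and the list of eligible extensions are all assumptions made for the example.

```python
import os


def map_files(root, extensions=(".dat", ".txt")):
    """Walk every directory under root and return one record per eligible file."""
    wanted = tuple(ext.lower() for ext in extensions)
    records = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if name.lower().endswith(wanted):
                path = os.path.join(folder, name)
                info = os.stat(path)
                records.append({
                    "path": path,                # location
                    "bytes": info.st_size,       # file characteristic
                    "modified": info.st_mtime,   # last-modified timestamp
                })
    # Sort by path so the map is stable from run to run.
    return sorted(records, key=lambda record: record["path"])
```

The resulting list plays the role of the file map: a later step can loop over it and read only the files that qualify for analysis.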

Researchers today are challenged with managing rapidly growing multi-site data structures, a trend that is likely to accelerate with internet and
intranet evolution and usage. It is not unusual for analysts to manage and manipulate hundreds of thousands of files of varying formats
residing in many known and unknown places.

A typical experiment or project often involves thousands of files, each with thousands of data points. It is not unusual for laboratories to collect
a gigabyte of relevant data points on each subject or event. Nor is it unusual for the drive, directory (sub-sub-sub), and file naming schemes to
have no symmetric logic, and no documentation. Some researchers have no idea where data have been stored and what they are named.

This reality is not a sign of bad science, just bad technical management. And bad technical management means that files cannot be found for
analysis. Lost data begin to look like bad science as many great hypotheses fall between the cracks of time and space along with their data.
To date, there are no self-organizing software systems that can take the place of good technical management. So scientists are left to their
own (de)vices.

MODEL CANDIDATES AND CANDIDATE MODELS
Caenorhabditis elegans was the first animal to have its genome sequenced, and it shares many genes with humans, including those
implicated in cellular differentiation, cell development, and the development and function of the nervous system.

Nematodes such as C. elegans are smooth-skinned, unsegmented
worms with a long body and tapered ends. C. elegans is about 1 mm long and lives mostly in soil in many parts of the world. Many scientists study
C. elegans because of its similarities to humans:


• conceived as a single cell
• undergoes embryonic cleavage, morphogenesis, and growth to the adult
• nervous system with a 'brain'
• capable of rudimentary learning
• produces sperm and eggs, mates and reproduces
• gradually ages, loses vigor and finally dies (2-3 weeks)

For computational purposes, C. elegans presents itself as an ideal candidate. C. elegans has about 300 neurons that have been mapped in two
dimensions and described in terms of development and function. There are approximately 81 muscles that have also been thoroughly
mapped. Its approximately 17,000 genes on 6 chromosomes have been sequenced and are well mapped in terms of function. More importantly, this little
worm has many qualities that make it ideal for computational modeling:
BIOINFORMATICS
Chromosomes (6 in C. elegans, 23 in Homo sapiens) carry genes (about 17,000 in C. elegans, about 30,000 in Homo sapiens), which
are written in nucleic acid bases (4: A, C, T, G); triplets of bases code for the amino acids (about 20) that make up proteins. Most of the sequence is “garbage” (noncoding)! The useful parts are
the codes for proteins, which organisms use to do things (metabolize) and build things (other cells). Of course, the codes for proteins are scattered all over
the genes, are not usually distinguishable from background noise, and sometimes overlap one another. What is more, proteins are
usually functional only under specific circumstances, such as at a specific moment in the life span, only in the presence of some other protein, or only
when “turned on”. Furthermore, while DNA is the beautiful, regular structure that Watson and Crick made so famous, proteins are like Brillo pads:
among the most complex architectures in nature!

The study of proteins falls into three groups, from my perspective: (1) the “biologists” who study function; (2) the “engineers” who study
structure; and (3) the “cryptographers” who study sequence.

Most geneticists, and therefore most sequence search engines, confine themselves to known, or common, or relevant subsequences that are
well established in terms of structure and function. This works fine until you are in a new protein, or a new species, or looking across species
where you don’t know what is relevant. By the way, protein structure is a function of sequence, and protein function is a function of sequence.
Advances in computation and robotics have allowed attention to grow from sequencing to functional (gene chips) and structural analyses.

Figure 1. Sequential dynamics (modified gap analysis) of target fly and worm proteins likely to be of the SOP family.


The chromosomes of C. elegans can be downloaded as a set of text files under 20 MB, enabling analysts to assess within-species and between-species
hypotheses:

• Cartesian comparison
    Replicates for normal variation
    Broader mutant contrast
• Developmental processes
    Eligible site similarity
• Evolutionary processes
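
As a small illustration of working with sequence text files, the Python sketch below tallies base composition line by line. It assumes the files hold plain sequence letters, possibly preceded by FASTA-style header lines starting with ">"; the actual layout of the downloaded chromosome files is not described here, so treat the format handling as hypothetical.

```python
from collections import Counter


def base_counts(lines):
    """Count A, C, G and T across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            # Assumed FASTA-style header line: carries no sequence data.
            continue
        # Ignore whitespace and any symbol that is not one of the four bases.
        counts.update(ch for ch in line.strip().upper() if ch in "ACGT")
    return counts


example = [">hypothetical header", "acgt", "GGNN"]
print(base_counts(example))
```

An open file object can be passed in place of the list, so a chromosome is read one line at a time rather than loaded whole.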


NEUROINFORMATICS
C. elegans has about 300 neurons that are thought to make about 7,000 connections (including 5,000 chemical synapses and 600 gap junctions). The
developmental history of each worm neuron is known. These “circuits” receive inputs from sensory cells, and they provide outputs to motor
cells and muscles.

Earlier reconstructions of C. elegans relied on hand tracing of some 40,000 electron micrographs. By capturing these data digitally, reconstructions
can be assembled with SAS. Nematode neuron processes are very small (0.1-0.2 µm in diameter).

The process begins with electron micrographs of the specimen, with a focus on the nerve bundles. Figure 2 shows such an image, with a red dot
marking a typical nerve cell.

Figure 2. Electron micrograph of C. elegans.


Keep in mind that this species has between 100 and 200 nerves in total. The next logical step is to array these images with their objects to
derive X, Y coordinates for each nerve at each Z plane. Figures 3, 4, 5, and 6 show the logic of this approach, whereby the locations of each
relevant nerve are digitized and stored in a database.
Figure 3. Electron micrograph series of selected nerve.



The real fun starts with SAS in trying to derive probable nerve paths or trajectories for scientists, utilizing the formula at the bottom of Figure 10.
Put simply, a nerve at (x,y) in plane 1 (z) is represented as a target, a circumference around its centroid, into which one wants to merge the
centroid of what is thought to be the same nerve in the next plane.

Figure 4. Nerve trajectory revealed across micrographs.

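This involves several preliminary steps to minimally align the data in terms of micrograph scan (one left and one right scan per original print;
both are combined in the scan of the negative). Each coordinate is first rescaled to a 0-100 range using the minimum and maximum values (XMIN, XMAX, YMIN, YMAX):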

XF = ((X-XMIN)/(XMAX-XMIN))*100;
YF = ((Y-YMIN)/(YMAX-YMIN))*100;
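
The same rescaling can be written as a small Python sketch. It assumes that the minimum and maximum are taken over the values supplied; whether the original computed them per scan or per image is not stated. The sample coordinates are hypothetical.

```python
def rescale_0_100(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula divides by (max - min), so a zero range has no answer.
        raise ValueError("all values are equal; the range is zero")
    return [(v - lo) / (hi - lo) * 100 for v in values]


# Hypothetical digitized X coordinates from one scan.
x = [20.0, 35.0, 50.0]
xf = rescale_0_100(x)
print(xf)
```

Rescaling puts scans of different physical size on a common 0-100 footing before they are compared.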


Another procedure is used to align the data by a common architecture.

XDELTA = XF - XBAR;
YDELTA = YF - YBAR;
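
A minimal Python sketch of this centering step follows. It assumes that XBAR and YBAR are the means of the rescaled coordinates within one image, which the text does not state explicitly, so the choice of reference point should be read as an assumption.

```python
def center(values):
    """Express each value as an offset from the mean of the supplied values."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical rescaled X coordinates (0-100) from one image.
xf = [0.0, 50.0, 100.0]
xdelta = center(xf)
print(xdelta)
```

After centering, every image is expressed relative to its own reference point, so a uniform shift between scans no longer contributes to point-to-point distances.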

The final connections are made simply by minimizing the between-image distances from point to point.

DISTN=SQRT(((NEWXZ-Xg)**2) + ((NEWYZ-Yg)**2));
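
The linking step can be sketched in Python as a nearest-centroid search between two adjacent planes, using the same Euclidean distance as DISTN. The function name, the dictionary layout, and the labels are hypothetical. The sketch is greedy: it does not prevent two points in one plane from choosing the same point in the next, which a fuller implementation would need to handle.

```python
import math


def link_planes(plane_a, plane_b):
    """For each labelled centroid in plane_a, find the nearest one in plane_b.

    plane_a, plane_b: dicts mapping a label to an (x, y) centroid.
    Returns {label_a: (label_b, distance)}.
    """
    links = {}
    for label_a, (xa, ya) in plane_a.items():
        # Pick the plane_b entry with the smallest Euclidean distance.
        label_b, (xb, yb) = min(
            plane_b.items(),
            key=lambda item: math.hypot(item[1][0] - xa, item[1][1] - ya),
        )
        links[label_a] = (label_b, math.hypot(xb - xa, yb - ya))
    return links


# Hypothetical aligned centroids in two consecutive Z planes.
plane_z1 = {"n1": (10.0, 10.0), "n2": (40.0, 12.0)}
plane_z2 = {"p": (41.0, 13.0), "q": (11.0, 9.0)}
links = link_planes(plane_z1, plane_z2)
print(links)
```

Repeating the search for each consecutive pair of planes chains the links into a trajectory for each nerve.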

Figure 5. Making neuronal connections from slice to slice.



NEURAL NETWORKING
The neural network is determined by (1) the architecture of the network, and (2) the communication rules. While attempts have been made to
simulate communication in C. elegans using the proper rules of communication (based on known synapse locations and types), no model
has included spatially accurate network architecture in three dimensions. Figure 6 shows the spatial dimensions of a single neuron.

While this might seem trivial, the lack of spatially accurate network architecture has been a primary barrier to biological modeling. Without
spatially accurate information, no hypotheses of development or mutation can be tested. For instance, removing a “node” from a network
might be testable only as a static process, as if an adult worm simply had a neuron turned off or removed.

Disease modeling, especially of developmental diseases but also of pathological processes that involve any time course (including trauma
recovery), depends upon spatial modeling. A neuron, once turned off, damaged, or removed, affects not only the neurons to which it is
“synapsed”, but also neurons that it is near. Neighboring neurons often depend upon each other during growth, whether early in life or early
in recovery from damage.
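
The difference between a purely connectional model and a spatial one can be sketched in Python. Everything in the example is hypothetical: the neuron names, the 3D positions, the synapse list, and the idea of a fixed neighborhood radius are assumptions made for illustration, not measurements or rules from this project.

```python
import math


def affected_by_removal(removed, positions, synapses, radius):
    """Return neurons affected when one neuron is removed.

    positions: dict mapping a name to an (x, y, z) location.
    synapses:  iterable of (name, name) pairs.
    radius:    assumed distance within which neighbors count as "near".
    """
    # Neurons directly connected to the removed one.
    partners = {b if a == removed else a
                for a, b in synapses if removed in (a, b)}
    partners.discard(removed)
    # Neurons that lie close by, whether or not they are connected.
    here = positions[removed]
    nearby = {name for name, place in positions.items()
              if name != removed and math.dist(here, place) <= radius}
    return {"synaptic": partners, "spatial_only": nearby - partners}


positions = {"A": (0, 0, 0), "B": (1, 0, 0), "C": (10, 0, 0), "D": (0, 2, 0)}
synapses = [("A", "C"), ("B", "D")]
print(affected_by_removal("A", positions, synapses, radius=2.5))
```

A model built from connections alone would report only the "synaptic" set; the "spatial_only" set is what becomes visible once locations are part of the database.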
BIOLOGICAL MODELING
Bringing these databases together will allow scientists to assess the relationships between genes, neurons, and behavior.



Figure 6. Spatial mapping of one neuron.



REFERENCES
Bargmann CI (1998) "Neurobiology of the Caenorhabditis elegans genome" Science 282(5396):2028-33
Bargmann CI, Kaplan JM (1998) "Signal transduction in the Caenorhabditis elegans nervous system." Annu Rev Neurosci. 21:279-308
Bertin, J. (1981). Graphics and graphic information processing (W.J. Berg & P. Scott, Trans.). New York: de Gruyter. (Original work published
1977).
Bertin, J. (1983). Semiology of Graphics. (W.J. Berg, Trans.). Madison: University of Wisconsin Press. (Original work published 1973).
Chalfie M, Jorgensen EM (1998) "C. elegans neuroscience: genetics to genome." Trends Genet 14(12):506-12
Chambers, J.M., Cleveland, W.S., Kleiner, B., and Tukey, P.A. (1983). Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole:
Pacific Grove, California.
Cleveland, W.S. (1985). The Elements of Graphing Data. Wadsworth Advanced Books and Software: Monterey, California.
Cleveland, W.S., & McGill, R. (1987a). Graphical perception: The visual decoding of quantitative information on graphical displays of data.
Journal of the Royal Statistical Society, 150(3), 192-229.
Cleveland, W.S., & McGill, R. (1987b). Dynamic graphics for statistics. Monterey, CA: Wadsworth.
De Prospero, A., & Cohen, S. (1979). Inconsistent visual analyses of intrasubject data. Journal of Applied Behavior Analysis, 12, 573-579.
Eckholdt, H. (1999). Visual Analysis on the WEB: Animating High Density Multidimensional Data. In M. Zdeb (Ed.) Proceedings of the North
East SAS Users Group 1999. Washington, D.C., 381-189.
Eckholdt, H. (2000a). The Use of Spatial Dynamics in Ontogenetic Sequence Analysis: c. elegans, homo sapiens, and Random examples.
Invited Oral Presentation to the Department of Molecular Genetics, March, 2000.
Eckholdt, H. (in press). Seeking, Mapping, and Fuzzy Merging Data Structures in Networked Environments. In E. Westerlund (Ed.)
Proceedings of the North East SAS Users Group 2000. Philadelphia, PA.
Eckholdt, H., Brown, L., Smith, D. and Feldman, S. (1998). Revealing Structure-Function patterns in the Basal Ganglia: Animating
autoradiographic maps. Paper presented at the Society for Neuroscience. November 1998: Los Angeles, California.
Gibson, G., & Ottenbacher, K. (1988). Characteristics influencing the visual analysis of single-subject data: An empirical analysis. Journal of
Applied Behavior Analysis, 15, 415-421.
Kaiser, J.F. (1960). Directional statistical decisions. Psychological Review, 67(3): 160-167.
Kenyon, C (1988) "The nematode Caenorhabditis elegans" Science 240, 1448-53
Keselman, H.J., and Murray, R. (1974). Tukey tests for pair-wise contrasts following the analysis of variance: Is there a Type IV error?
Psychological Bulletin, 81(9): 608-609.
Kimball, A.W. (1957). Errors of the third kind in statistical consulting. Journal of the American Statistical Association, 52(278): 133-142.
Mazziotta, J.C. (1996). Time and Space. In A.W. Toga and J.C. Mazziotta (Eds.) Brain Mapping: The Methods. Academic Press: San Diego,
California. Chap. 15.
Mazziotta, J.C. and Toga, A.W. (1996). Speculations about the Future. In A.W. Toga and J.C. Mazziotta (Eds.) Brain Mapping: The Methods.
Academic Press: San Diego, California. Chap. 18.
Palovcik, R.A., Ried, S.A., Principe, J.C., and Albuquerque, A. Jr. (1992). 3-D computer animation of electrophysiological responses. Journal
of Neuroscience Methods, 41(1): 1-9.
Park, H., Marascuilo, L., and Gaylord-Ross, R. (1990). Visual inspection and statistical analysis in single-case designs. Journal of
Experimental Education, 58(4): 311-320.
Riddle, R. et al. (eds.) (1997) "C.elegans II" Cold Spring Harbor Press
SAS. (1998). SAS for Windows version 6.12. [computer software] Cary, North Carolina: SAS Institute, Inc. inquiries (919) 677-8008;
http://www.sas.com/ .
Scherg, M. and Berg, P. (1996). New concepts of brain source imaging and localization. Electroencephalography & Clinical Neurophysiology -
Supplement. 46:127-37.
Toga, A.W. (1990). Three-Dimensional Neuroimaging. Raven Press: New York, New York.
Toga, A.W. and Mazziotta, J.C. (1996). Introduction to Cartography of the Brain. In A.W. Toga and J.C. Mazziotta (Eds.) Brain Mapping: The
Methods. Academic Press: San Diego, California. Chap. 1.
Toga, A.W., and Payne, B.A. (1991). Animating the 3D structure and function of brain. Computerized Medical Imaging and Graphics, 15(5):
285-291.
Tukey, J.W. (1977) Exploratory Data Analysis. Addison-Wesley Publishing Company: Reading, Massachusetts.
Tufte, E.R. (1983). The Visual Display of Quantitative Information. Graphics Press: Cheshire, Connecticut.
Tufte, E.R. (1990). Envisioning Information: Narrative of space and time. Graphics Press: Cheshire, Connecticut.
Tufte, E.R. (1997). Visual Explanations: Images and quantities, evidence and narrative. Graphics Press: Cheshire, Connecticut.
Wampold, B.E., & Furlong, M.J. (1981). The heuristics of visual inference. Behavioral Assessment, 3, 71-92.
Wood, W. (ed.) (1986) "C.elegans" Cold Spring Harbor Press
White, J.G. (1985). "Neuronal connectivity in Caenorhabditis elegans" Trends in Neuroscience: 277-283.
Acknowledgements: The research on neuroinformatics was supported by a grant from the National Institute of Mental Health (R21 MH63223-01),
in collaboration with Dave Hall (Neuroscience) and Scott Emmons (Molecular Genetics).
CONTACT INFORMATION
Haftan Eckholdt
Albert Einstein College of Medicine
1410 Pelham Parkway South
Kennedy Center 923
Bronx, New York 10461
Work Phone: (718) 430-2427
Email: eckholdt@aecom.yu.edu