SAS
SAS stands for Statistical Analysis System. It is widely used for data analysis in
industry, government, banking, and academia.
There are many books from which one can learn how to
write SAS programs, and there is an online SAS manual where one can find
syntax, definitions, and examples. Basic programming skills can be quickly
learned from any of these books, and in-depth knowledge can be
acquired with a bit of extra effort by studying the SAS manual.
So what is the purpose of this book? Compared to
typical introductory books on the subject, this one has much more content, not
only in scope but in depth as well. This can be seen from the list of
topics in the CONTENTS section as well as from the size of the book. In many
places throughout the book, the reader is told not only what will happen but
also why it happens. That is why the book is titled UNDERSTANDING SAS. After
reading this book, a reader will have a considerable understanding of BASE SAS
at the intermediate level and will be able to write SAS programs with great
flexibility.
Although the SAS manual provides plenty of detail,
it does not include everything, and this book can be used as a supplement to
the SAS manual. Furthermore, this book has been written more like a textbook,
and it emphasizes a number of points.
In writing this book, I read the SAS manual
and many other sources, including published books, some of which are listed in
the REFERENCES section; papers posted on the Web, particularly from
conferences such as SUGI and NESUG; and material from discussion forums.
Moreover, much of the content in this book comes from my own research. I tried
to keep the book at the "state of the art" level for BASE SAS, and the reader
will especially appreciate the inclusion of the ODS procedure.
The book covers primarily BASE SAS software, and it
does so comprehensively.
However,
coverage of statistical procedures has been avoided here, for the inclusion of
them at an intermediate level would have significantly increased the size of
the book.
As for the style of the book, it combines simplicity
and thoroughness. Some topics are discussed briefly and some in full detail.
For instance, regarding the statistical functions SUM and MEAN, if the usage of
the first one of these is discussed, then readers will obviously know the usage
of the second one, for the argument types of both functions are exactly the
same; the only thing readers would need to do in this case is to replace the
function name. However, the book discusses in detail the allowable types of
arguments, and there are some tricky aspects of which readers may not be aware.
Similarly, in the case of the ODS procedure in which there are many different
attributes, once an example is given for one of them, readers should simply
play the "replace-and-run" game to learn them well.
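To make the SUM and MEAN remark concrete, here is a minimal sketch of the "replace the function name" idea. The variables x1-x3 and their values are made up for illustration and are not taken from the book's own examples.

```sas
DATA _NULL_;
  x1 = 2; x2 = 4; x3 = .;      /* x3 is deliberately missing */
  total = SUM(OF x1-x3);       /* missing values are ignored by SUM */
  avg   = MEAN(OF x1-x3);      /* same argument list, only the name changes */
  PUT total= avg=;             /* writes total=6 avg=3 to the log */
RUN;
```

Both functions skip the missing value, so MEAN divides by the two nonmissing arguments rather than by three.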
Comparison. Comparison. Comparison. Comparison is
the soul of this book. There are more than eighty tables in the book, most of
which I designed; they are the culmination of all my endeavors, and
they constitute the main characteristic of the book. Tables have
many advantages: they make it easy to check, compare, and select content.
However, they also have a shortcoming: a
cell cannot contain too much detailed information.
Software
programs differ from books in the sense that bugs and inconsistencies in
software are sometimes incurable, while an error in a book can be corrected in
the next printing. A software developer has
to be extremely careful because the consistency between different versions of a
product is of paramount importance.
Let's look at a famous example. As we know, 1900 is not a
leap year, which means that February 1900 has 28 days. If you run the following
program, you will have a problem.
DATA _NULL_;
a='29Feb1900'd;
RUN;
However, in Excel it is OK to enter
2/29/1900. In other words, Excel treats 1900 as a leap year. Actually, this is
not a "bug" in Excel; it is by design. Excel works this way
because it truly was a bug in Lotus 1-2-3. When Excel was introduced, Lotus
1-2-3 had nearly the entire market for spreadsheet software, and Microsoft
decided to reproduce Lotus's bug in order to be fully compatible.
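On the SAS side, the same calendar rule can be checked without a date constant. The following is only a sketch; the variable names d1900 and d2000 are chosen here for illustration. A century year is a leap year only if it is divisible by 400, so 2000 qualifies and 1900 does not.

```sas
DATA _NULL_;
  d1900 = MDY(2, 29, 1900);    /* not a valid date: MDY returns a missing value */
  d2000 = MDY(2, 29, 2000);    /* 2000 is a leap year, so this date is valid */
  PUT d1900= d2000= DATE9.;    /* d1900 prints as missing in the log */
RUN;
```

Unlike the date constant, the invalid MDY call does not stop the step; SAS writes a note about an invalid argument to the log and continues.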
This is why consistency between different
versions of a product matters so much to a software developer. For example, the
main differences between an "IF" statement and a "WHERE"
statement should stay about the same from version to version. Even though
every SAS book talks about these differences, this book provides you with the
most detailed discussion.
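One well-known difference of this kind can be sketched as follows. The data set names (s, t_where, t_if), the variable c, and the data values are assumptions made for this illustration only; the sketch is not a substitute for the full comparison in the book.

```sas
DATA s;
  INPUT a b;
  CARDS;
1 5
2 7
3 9
;
DATA t_where;
  SET s;
  WHERE a > 1;     /* applied as observations are read from s */
  c = a + b;
RUN;
DATA t_if;
  SET s;
  c = a + b;
  IF c > 8;        /* subsetting IF can test a variable computed in this step */
RUN;
```

Writing WHERE c > 8 in the second step would fail, because WHERE can only refer to variables that already exist in the input data set.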
It is the author's hope that the book will be
helpful in pointing out inconsistencies and that it can be used as a
comprehensive reference book.
While writing, I ran into many problems, and I want to
share some of them with you.
Problem 1. What is the printout of the following
program?
DATA s;
INPUT a @@;
CARDS;
-3 0 . 2
;
DATA t;
SET s;
WHERE +a>0;
PROC PRINT;RUN;
Problem 2. What is the output of the following
program, and how do you explain it?
PROC FORMAT;
INVALUE ss '10'-'6'=1 5-12=2;
INVALUE sp 1-5=3;
DATA s;
INPUT a ss. b sp.;
CARDS;
11
5
20
7
60
23
;
PROC PRINT;RUN;
Problem 3. What is the printout of the following program? If you feel that
something is wrong, how do you fix it?
DATA s;
INPUT a $ b p;
CARDS;
1 1 1
2 2 2
4 4 3
;
PROC REPORT NOWD ;
COLUMN p a,N a,PCTN b,N b,PCTN;
DEFINE p/GROUP;
RUN;
/*
PROC FORMAT;
INVALUE s 20-HIGH=4;
INVALUE t '20'-HIGH=8;
INVALUE $s 20-HIGH=4;
INVALUE $t '20'-HIGH=8;
VALUE $s 20-HIGH=4;
VALUE $t '20'-HIGH=8;
DATA p;
INPUT @1 a $s. @1 b $t. @1 a1 s. @1 b1 t. @1 c $ @1 d $;
p=INPUT(6,s.);
q=INPUT(6,t.);
r=INPUT('135',t.);
FORMAT c $s. d $t.;
CARDS;
6
135
;
PROC PRINT;RUN;*/
Problem 4. In the following function calls, all
variables and arrays are defined. Which function calls are OK, and which ones
are not?
a=SUM(OF x1-x6 p,x1+x2,LOG(p), OF q y1-y5, OF z(*));
a=MEAN(OF x1-x6 test: y-a);
a=MIN(OF _NUMERIC_);
a=MEAN(OF x1-x6 p, OF q y1-y5, OF z1 z2);
A=MAX(OF x1-x6 y1+y2);
A=MEAN(x1 x2 x3);
A=MEAN(OF x1-x5, y2 y3);
A=MEAN(OF z(1) z(3));
Problem 5. What data set is created by the
following program? What happens if the keyword IF is changed to the keyword
WHERE?
DATA s;
INPUT a b c;
CARDS;
3
5 2
-3
5 2
2
5 2
-2
5 2
;
DATA t;
SET s;
IF a=-3 MAX b MIN c;
RUN;
PROC PRINT;RUN;
Problem 6. What data set is created by the
following program, and how do you explain the error messages?
DATA student;
INPUT name $ gender $
height weight test
dob MMDDYY8. phone;
CARDS;
George M 56 111
89
01/04/60
7345678765
Joe M 57 115 87
01/05/60
8763459875
;
RUN;
Problem 7. Suppose we have a txt file with the
following contents:
12
12 1/1/60
12
11
13 11/11/60
11
14
15 12/3/76
20
Run the following program. What data set is created
and how do you explain it?
DATA ss;
INFILE 'c:\sas\test2.txt';
INPUT a b c MMDDYY8. d;
RUN;
PROC PRINT;
FORMAT c MMDDYY8.;
RUN;
Problem 8. The following program creates two
reports. One is a detailed report and the other is a summary report. Give the
necessary and sufficient conditions for PROC SQL to create a summary report.
DATA s;
INPUT a b;
CARDS;
1
2
3
4
;
PROC SQL;
SELECT * FROM s;
SELECT SUM(a),N(b) FROM s;
QUIT;
Problem 9. If you think the following program is not
OK, how would you explain the result?
%MACRO ss(company);
%IF &company=ge %THEN %PUT company is ge;
%MEND;
%ss(ge)
/*Problem 9. What is
the printout of the following program and how do you explain it?
DATA s;
a0=1;
PROC REPORT NOWD;
COLUMN a0 y z ;
COMPUTE y;
y=b+1;b=b+6;
ENDCOMP;
COMPUTE z;
z=c+1;
ENDCOMP;
RUN;*/
Problem 10. Write a program to create the following table (rtf file):

Problem 11. The following table was produced by using
PROC FREQ on a data set.
                                Cumulative    Cumulative
branch    Frequency    Percent   Frequency       Percent
A                 1      33.33           1         33.33
B                 1      33.33           2         66.67
C                 1      33.33           3        100.00
After I copied it to a "doc" file and
changed its font to Courier New, it became
                                Cumulative    Cumulative
branch    Frequency    Percent   Frequency       Percent
A                 1      33.33           1         33.33
B                 1      33.33           2         66.67
C                 1      33.33           3        100.00
Do you know what happened to the solid line?
The book uses the following typographical conventions:
Times New Roman: basic type style used for most text
UPPER CASE TIMES NEW ROMAN: SAS keywords
Italic Times New Roman: definitions
Courier New: SAS program code
SAS Monospace: SAS printouts
There is no need to have prior knowledge of SAS although some statistical
background is helpful for Chapter 6. The following mathematical notation is
used in this book, mainly in Chapters 3 and 8: Let A={1,2,3} and B={1,2,4} be
two sets.
A∪B={1,2,3,4} union of A and B
A∩B={1,2} intersection of A and B
A\B={3} difference of A and B
A+B={1,2,3,1,2,4} repeated union of A and B
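As a rough illustration of this notation, the four operations can be mimicked with the PROC SQL set operators. This mapping is an assumption made here for illustration, not necessarily how Chapters 3 and 8 apply the notation; the data set names a and b and the variable x are invented.

```sas
DATA a; INPUT x @@; CARDS;
1 2 3
;
DATA b; INPUT x @@; CARDS;
1 2 4
;
PROC SQL;
  SELECT x FROM a UNION     SELECT x FROM b;   /* union: duplicates removed */
  SELECT x FROM a INTERSECT SELECT x FROM b;   /* intersection */
  SELECT x FROM a EXCEPT    SELECT x FROM b;   /* difference of a and b */
  SELECT x FROM a UNION ALL SELECT x FROM b;   /* repeated union: duplicates kept */
QUIT;
```

The ALL keyword is what distinguishes the repeated union from the ordinary union: without it, PROC SQL removes duplicate rows from the result.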
You can find SAS documentation online. The
URL is
http://support.sas.com/onlinedoc/913/docMainpage.jsp
SAS GLOBAL FORUM (originally SUGI) provides many
professional and advanced topics on SAS. The following is its URL:
http://support.sas.com/events/sasglobalforum/
There are also online discussion forums where you can post
your questions. The URLs are the following:
http://www.listserv.uga.edu/archives/sas-l.html
http://groups.google.com/group/comp.soft-sys.sas
You can also get help from SAS Institute. The
URL is
http://support.sas.com/ctx/supportform/index.jsp
Finally, any suggestions, comments, and criticisms
are very welcome. Please send them to:
sasbook@beauthor.com
BACK COVER
SAS Certified Base Programmer and SAS
Certified Advanced Programmer.
M.S. in Statistics, Rutgers University.
Ph.D. in Operations Research, Rutgers
University.
Co-author of the book Linear Programming (in
Chinese, with Dr. Jianzhong Zhang) published by Science Publishing House,
Beijing, China 1990.