HOW PROGRAMMING LANGUAGES DIFFER: A CASE STUDY OF SPL, PASCAL, AND C

by Eugene Volokh, VESOFT

Presented at 1987 SCRUG Conference, Pasadena, CA
Presented at 1987 INTEREX Conference, Las Vegas, NV, USA
Published by The HP CHRONICLE, May 1987-May 1988.

ABSTRACT: The HP3000's wunderkind sets out to study Pascal, C and SPL for the HP mini in a set of articles, using real-life examples and plenty of tips on how to code for optimum efficiency in each language. First in the series: ground rules for the comparison and a look at control structures. (The HP CHRONICLE, May 1987)

INTRODUCTION

Programmers get passionate about programming languages. We spend most of our time hacking code, exploiting the language's features, being bitten by its silly restrictions. There are dozens of languages, and each one has its fanatical adherents and its ardent detractors. Some like APL, some like FORTH, LISP, C, PASCAL; some might even like COBOL or FORTRAN, perish the thought.

In particular, a lot of fuss has recently arisen about SPL, PASCAL, and C. All three of them are considered good "system programming" (whatever that is) languages, and naturally people argue about which one is the best. HP's Spectrum project has come out in favor of PASCAL -- all new MPE/XL code will be written in PASCAL, and HP won't even provide a native mode SPL compiler. On the other hand, HP's also getting more and more into UNIX, which is coded entirely in C. Especially between C and PASCAL adherents there seems to be something like a "holy war"; it becomes not just a matter of advantages and disadvantages, but of Good and Evil, Right and Wrong. Strict type checking is Good, some say -- loose type checking is Evil; pointers are Wrong -- array indexing is Right. The battle-lines are drawn and the knights are sharpening their swords.

But, some ask -- what's the big deal? After all, it's an axiom of computer science that all you need is an IF and a GOTO, and you can code anything you like.
Theoretically speaking, C, SPL, and PASCAL are all equivalent; practically, is there that much of a difference? In other words, is it just esthetics or prejudice that animates the ardent fans of C, PASCAL, or SPL, or are there real, substantive differences between the languages -- cases in which using one language rather than another will make your life substantially easier? Are the main differences between, say, C and PASCAL that PASCAL uses BEGIN and END and C uses "{" and "}"? That C's assignment operator is "=" and PASCAL's is ":="?

The goal of this paper is to answer just this question. I will try to analyze each of the main areas where SPL, C, and PASCAL differ, and point out those differences using actual programming examples. I'll try not to emphasize vague, general statements, like "PASCAL does strict type checking", or subjective opinions, like "C is too hard to read"; rather, I want to use SPECIFIC EXAMPLES which can help make clear the exact influence of strict or loose type checking on your programming tasks.

RULES OF EVIDENCE

Saying that I'll "compare SPL, PASCAL, and C" isn't really saying a whole lot. How will I compare them? What criteria will I use to compare them? Will I compare how easy it is to read them or write them? Will I compare what programming habits they instill in their users? Which versions of these languages will I compare? To do this, and to do this in as useful a fashion as possible, I set myself some rules:

* I resolved to try to show the differences by use of examples, preferably as real-life as possible. The emphasis here is on CONCRETE SPECIFICS, not on general statements such as "C is less readable" or "PASCAL is more restrictive".

* I decided not to go into questions of efficiency. Compiling a certain construct using one implementation of a compiler may generate fast code, whereas a different implementation may generate slow code.
Sure, the FOR loop in PASCAL/3000 may be less efficient than in SPL or in CCS's C/3000, but who knows how fast it'll be under PASCAL/XL? For this reason, I don't wax too poetic about the efficiency advantages of features such as C's "X++" (which increments X by 1) -- a modern optimizing compiler is quite likely to generate equally fast code for "X:=X+1", automatically seeing that it's a simple increment-by-one (even the 15-year-old SPL/3000 compiler does this). The only times I'll mention efficiency are when some feature is INHERENTLY more or less efficient than another (at least on a conventional machine architecture); for instance, passing a large array BY VALUE will almost certainly be slower than passing it BY REFERENCE, since by-value passing would require copying all the array data. Even in these cases, I try to play down performance considerations; if you're concerned about speed (as well you should be), do your own performance measurements for the features and compiler implementations that you know you care about.

* I resolved -- for space reasons if for no other -- not to be a textbook for SPL, PASCAL, or C. Some of the things I say apply equally well to almost all programming languages, and I hope that they will be understandable even to people who've never seen SPL, PASCAL, or C. For other things, I rely on the relative readability of the languages and their similarity to one another. I hope that if you know any one of SPL, PASCAL, or C, you should be able to understand the examples written in the other languages. However, it may be wise for you to have manuals for these three languages -- either their HP3000 implementations or general standards -- at hand, in case some of the examples should prove too arcane.

* As you can tell by the size of this paper, I also decided to be as thorough as practical in my comparisons, and ESPECIALLY in the evidence backing up my comparisons.
One of the main reasons I wrote this paper is that I hadn't seen much OBJECTIVE discussion comparing C and PASCAL; I wanted not just to present my conclusions -- which might as easily be based on prejudice as on fact -- but also the reasons why I arrived at them, so that you could decide for yourself. So as not to burden you with having to read all 200-odd pages, though, I've summarized my conclusions in the "SUMMARY" chapter. You might want to have a look there first, and then perhaps go back to the main body of the paper to see the supporting evidence of the points I made.

WHAT ARE C AND PASCAL, ANYWAY?

If you think about it, SPL is a very unusual language indeed. To the best of my knowledge, there is exactly one SPL compiler available anywhere, on any computer (eventually, the independent SPLash! may be available on Spectrum, but that is another story). I can say "SPL supports this" or "SPL can't do that" and, excepting differences between one chronological version of SPL and the next, be absolutely precise and objectively verifiable. SPL can be said to "support" something only because there is only one SPL compiler that we're talking about.

To say "PASCAL can do X" is a chancy proposition indeed. ANSI Standard PASCAL doesn't support variable-length strings, but most modern PASCAL implementations, including HP PASCAL, have some sort of string mechanism. What about HP's new PASCAL/XL, reputed to be even more powerful still? Similarly, with C, there are the old "Kernighan & Ritchie" C, the proposed new ANSI standard C, whatever it is that HP uses on the Spectrum, AND whatever you use on the 3000, which might be CCS's C compiler or Tymlabs' C.

On the one hand, I contemplated comparing standard C and standard PASCAL. This is easier for me, and it also makes sense from a portability point of view (if you want it to be portable, you're better off using the standard, anyway).
On the other hand, portability is fine and dandy, but most people aren't going to be porting their software any further than from an MPE/XL machine to an MPE/V machine and back. As long as you stick to HP3000s, you have the full power of so-called "HP PASCAL", an extended superset of PASCAL that's supported on 3000s, 1000s, 9000s, and the rest; it's hardly fair (or practical) to ignore this in a comparison. Finally, what about PASCAL/XL? It'll have even more useful features, but they may not be ported back to the MPE/V machines, at least for a while. Should I then compare PASCAL/XL and C/XL? That would be a representative contest for the XL machines, but not necessarily for MPE/V machines, and certainly not if you really want to port your software onto other machines. This is all, incidentally, aggravated by the fact that HP's extensions to PASCAL are more substantial than its extensions to C; thus, comparing the "standards" is likely to put PASCAL in a relatively worse light than comparing "supersets" (not to say that PASCAL is worse than C in either case).

Faced with all this, I've decided to compare everything with everything else. There are actually 7 different compilers I discuss at one time or another:

* SPL. There's only one, thank God.

* Standard PASCAL. This is the original ANSI Standard, on which all other PASCALs are based. This is also very similar to Level 0 ISO Standard PASCAL (see next item).

* Level 1 ISO Standard PASCAL. This standard, put out in the early 1980's, supports so-called CONFORMANT ARRAY parameters (see the DATA STRUCTURES chapter). The same standard document defined "Level 0 ISO Standard PASCAL" to be much like classic "Standard PASCAL", i.e. without conformant arrays. Compiler writers were given the choice of which one to implement, and it isn't obvious how popular Level 1 ISO Standard will be. When I say "Standard PASCAL", I mean the original standard, which is almost identical to the ISO Level 0 Standard.

* PASCAL/3000. This is HP's implementation of PASCAL on the pre-Spectrum HP3000. Although the Spectrum machines will also be called 3000's, when I say PASCAL/3000 I mean the pre-Spectrum version. PASCAL/3000 is itself a superset of HP Pascal, which is also implemented by HP on HP 1000s and HP 9000s. PASCAL/3000 is a superset of the original Standard PASCAL, not the ISO Level 1 Standard.

* PASCAL/XL. This is HP's implementation of PASCAL on the Spectrum. It's essentially a superset of both PASCAL/3000 and the ISO Level 1 Standard.

* Kernighan & Ritchie (K&R) C. This is the C described by Brian Kernighan and Dennis Ritchie in their now-classic book "The C Programming Language" (which, in fact, is usually called "Kernighan and Ritchie"). Although never an official standard, it is quite representative of most modern C's. In fact, for practical purposes, it can be said that a program written in K&R C is portable to virtually any C implementation (assuming you avoid those things that K&R itself describes as implementation-dependent).

* Draft ANSI Standard C. ANSI is now working on codifying a standard of C, which will have some (but not very many) improvements over K&R. My reference for this was Harbison & Steele's book "C: A Reference Manual", which also discusses various other implementations of C. Although Draft ANSI Standard C is Standard, it is also Draft. Some of the features described in it are implemented virtually nowhere, and it's not clear how much of them C/XL will include.

Matters are further complicated, of course, by the lack of an HP-provided C compiler on the pre-Spectrum HP3000. The compiler I used to research this paper is CCS Inc.'s C/3000 compiler, which is a super-set of K&R C and a subset of Draft ANSI Standard C. The most conspicuous Draft Standard feature that CCS C/3000 lacks is Function Prototypes -- an understandable lack since virtually all other C compilers don't have them, either.
Whenever any difference exists between any of the PASCAL or C versions, I try to point it out. Which versions you compare is up to you:

* You can compare Standard PASCAL and K&R C. If it isn't in these general standards that everybody implements, you're unlikely to get much portability.

* You can compare PASCAL/XL and Draft ANSI Standard C. These are the compilers that will most likely be available on the Spectrum.

* You can compare PASCAL/3000 and Draft ANSI Standard or K&R C. Even though you might not usually care about porting to, say, an IBM or a VAX, you may very seriously care about porting from the pre-Spectrum to the Spectrum and vice versa. HP hasn't promised to port PASCAL/XL back to the pre-Spectrums, so PASCAL/3000 is probably the lowest common denominator.

SPL is nice. At least until SPLash!'s promised Native Mode SPL compiler comes out, there's only one SPL compiler to compare with. This makes me very happy.

ARE C, PASCAL, AND SPL FUNDAMENTALLY DIFFERENT OR FUNDAMENTALLY ALIKE?

In my opinion, they are definitely FUNDAMENTALLY ALIKE. In the rest of the paper, I'll tell you all about their differences, but those are EXCEPTIONS in their fundamental similarity. Why do I think so? Well, virtually every important construct in any of the three languages has an almost exact parallel in the other two (the only exception being, perhaps, record structures, which SPL doesn't have).

* All three languages emphasize writing your program as a set of re-usable, parameterized procedures or functions (which, for instance, COBOL 74 and most BASICs do not);

* All three languages share virtually the same rich set of control structures (which neither FORTRAN/IV nor BASIC/3000 possesses).

* The languages may on the surface LOOK somewhat different (PASCAL and C certainly do), but remember that the ESSENCE is virtually identical -- PASCAL may say "BEGIN" and "END" where C says "{" and "}", but that's hardly a SUBSTANTIVE difference.
Despite all the differences which I'll spend all these pages describing -- and I think many of the differences are indeed very important ones -- I still think that SPL, PASCAL, and C are about as close to each other as languages get.

SO, WHICH IS BETTER -- C, PASCAL, OR SPL?

You think I'm going to answer that? With all my pretensions to objectivity, and dozens of angry language fanatics ready to berate me for choosing the "wrong one"? The main purpose of this paper is to show you all the differences and let you decide for yourselves; after all, there are so many parameters (how portable do you want the code to be? how much do you care about error checking?) that are involved in this sort of decision. The closest I come to actually saying which is better is in the "SUMMARY" chapter (at the very end of the paper); there I explain what I think the major drawbacks and advantages of each language are. Look there, but remember -- only you can decide which language is best for your purposes.

TECHNICAL NOTE ABOUT C EXAMPLES

In case you didn't know, C differentiates between upper- and lower-case. The variables "file" and "FILE" are quite different, as are "file", "File", and "fILE". (In SPL and PASCAL, of course, case differences are irrelevant; all of the just-given names would refer to the same variable.) In fact, in C programs the majority of all objects -- reserved words, procedure names, variables, etc. -- are lower-case. The reserved words ("if", "while", "for", "int", etc.) are required to be lower-case by the standard; theoretically, you can name all your variables and procedures in upper-case, but most C programmers use lower-case for them, too (although they can sometimes use upper-case variable names as well, perhaps to indicate their own defined types or #define macros). This is why all the examples of C programs in this paper are written in lower-case.
The one exception to this is when I refer to a C object -- a variable, a procedure, or a reserved word -- within the text of a paragraph. Then, I'll often capitalize it to set it off from the rest of the paper, to wit:

   proc (i, j)
   int i, j;
   {
      if (i == j)
         ...
   }

The procedure PROC takes two parameters, I and J. The IF statement makes sure that they're equal, .... The fact that I refer to them in upper-case in the text doesn't mean that you should actually use upper-case names. I just do it to make the text more readable. Another example of how a little lie can help reveal the greater truth...

ACKNOWLEDGMENTS

I'd like to thank the following people for their great help in the writing of this paper:

* CCS, Inc., authors of CCS C/3000, a C compiler for pre-Spectrum HP3000s. All the research and testing of the C examples given in this paper was done using their excellent compiler. In particular, I'd also like to thank Tim Chase, who gave me a great deal of help on some of the details of the C language.

* Steve Hoogheem of the HP Migration Center, who served as liaison between me and the PASCAL/XL lab in answering my questions about PASCAL/XL.

* Mr. Tom Plum (of Plum Hall, Cardiff, NJ), a recognized C expert and member of the Draft ANSI Standard C committee, who was kind enough to answer many of the questions that I had about the Draft Standard.

* Dennis Mitrzyk, of Hewlett-Packard, who helped me obtain much of my PASCAL/XL information, and who was also kind enough to review this paper.

* Joseph Brothers, David Greer (of Robelle), Dave Lange and Roger Morsch (of State Farm Insurance), and Mark Wallace (of Robinson, Wallace, and Company), all of whom reviewed the paper and provided a lot of useful input and corrections.

CONTROL STRUCTURES

GOTOs, some say, are Considered Harmful. Perhaps they are and perhaps they are not.
But the major reason for the control structures that PASCAL and C provide (as opposed to, say, FORTRAN IV, which doesn't) is not that they replace GOTOs, but rather that they replace them with something more convenient. If given the choice between saying

   IF FNUM = 0 THEN
      PRINTERROR
   ELSE
      BEGIN
      READFILE;
      FCLOSE (FNUM, 0, 0);
      END;

and

   IF FNUM <> 0 THEN GOTO 10;
   PRINTERROR;
   GOTO 20;
   10:
   READFILE;
   FCLOSE (FNUM, 0, 0);
   20:

then I would choose the former. IF-THEN-ELSE is a common construct in all of the algorithms we write, and it's easier for both the writer and the reader to have a language construct that directly corresponds to it.

C and PASCAL share some of the fundamental control structures. Both have

* IF-THEN-ELSEs. They look slightly different:

   IF FNUM=0 THEN         { PASCAL }
      PRINTERROR
   ELSE
      BEGIN
      READFILE;
      FCLOSE (FNUM, 0, 0);
      END;

and

   if (fnum==0)           /* C */
      printerror;         /* note the semicolon */
   else
      {
      readfile;
      fclose (fnum, 0, 0);
      }

but I hardly think the difference very substantial. There'll be some who forever curse C for using lower-case or PASCAL for using such L-O-N-G reserved words, like "BEGIN" and "END"; I can live with either.

* WHILE-DOs, although again there are some minor differences

   WHILE GETREC (FNUM, RECORD) DO
      PRINTREC (RECORD);

vs.

   while (getrec (fnum, record))
      printrec (record);

* DO-UNTILs:

   REPEAT
      GETREC (FNUM, RECORD);
      PRINTREC (RECORD);
   UNTIL NOMORERECS (FNUM);

and

   do {
      getrec (fnum, record);
      printrec (record);
   } while (!nomorerecs (fnum));   /* "!" means "NOT" */

Note that PASCAL has a DO-UNTIL and C has a DO-WHILE. Big difference.

* And, finally, C's and PASCAL's procedure support is comparable, as well.

The interesting things, of course, are the points at which C and PASCAL differ. There are some, and for those of us who thought that IF-THEN-ELSE and WHILE-DO are all the control structures we'll ever need, the differences can be quite surprising.
THE "WHILE" LOOP AND ITS LIMITATIONS; THE "FOR" LOOP

It is, indeed, true that all iterative constructs can be emulated with the WHILE-DO loop. On the other hand, why do the work if someone else can do it for you? The PASCAL FOR loop -- a child of FORTRAN's DO -- is actually not that hard to emulate:

   FOR I:=1 TO 9 DO
      WRITELN (I);

is identical, of course, to

   I:=1;
   WHILE I<=9 DO
      BEGIN
      WRITELN (I);
      I:=I+1;
      END;

Not such a vast savings, but, still, the FOR loop definitely looks nicer.

Unfortunately, for all the savings that the FOR loop gives you, I've found that it's not as useful as one might, at first glance, believe. This is because it ALWAYS loops through all the values from the start to the limit. How often do you need to do that, rather than loop until EITHER a limit is reached OR another condition is found? String searching, for instance -- you want to loop until the index is at the end of the string OR you've found what you're searching for. Always looping until the end is wasteful and inconvenient.

Looking through my MPEX source code, incidentally, I find 53 WHILE loops and 8 FOR loops. In my RL, the numbers are 170 WHILEs and 38 FORs (at least 6 of these FORs should have been WHILEs if I weren't so lazy). (How's that for an argument -- I don't use it, ERGO it is useless. I'm rather proud of it.) In any case, though, my experience has been that

* THE PURE "FOR" LOOP -- A LOOP THAT ALWAYS GOES ON UNTIL THE LIMIT HAS BEEN REACHED -- IS NOT AS COMMON AS ONE MIGHT THINK IN BUSINESS AND SYSTEM PROGRAMS (although scientific and engineering applications, which often handle matrices and such, use pure FOR loops more often). MORE OFTEN YOU WANT TO ALSO SPECIFY AN "UNTIL" CONDITION WHICH WILL ALSO TERMINATE THE LOOP.
What I wanted, then, was simple -- a loop that looked like

   FOR I:=START TO END UNTIL CONDITION DO

For instance,

   FOR I:=1 TO STRLEN(S) UNTIL S[I]=C DO;

or

   FOR I:=1 TO STRLEN(S) WHILE S[I]=' ' DO;

What I got -- and I'm not sure if I'm sorry I asked or not -- is the C FOR loop:

   for (i=1; i<=strlen(s) && s[i]!=c; i=i+1)
      ;

The C FOR loop -- like most things in C, accomplished with a minimum of letters and a maximum of special characters -- looks like this:

   for (initcode; testcode; inccode)
      statement;

It is functionally identical to

   initcode;
   while (testcode)
      {
      statement;
      inccode;
      }

In other words, this is a sort of "build-your-own" FOR loop -- YOU specify the initialization, the termination test, and the "STEP". This is actually quite useful for loops that don't involve simple incrementing, such as stepping through a linked list:

   for (ptr=listhead; ptr!=NULL; ptr=ptr->next)
      fondle (ptr);

The above loop, of course, fondles every element of the linked list, something quite analogous to what an ordinary PASCAL FOR loop would do, but with a different kind of "stepping" action. The standard PASCAL loop, of course, can easily be emulated --

   for (i=start; i<=limit; i=i+1)
      statements;

I'm sure it would be fair to conclude that C's FOR loop is clearly more powerful than PASCAL's. On the other hand, a WHILE loop is more powerful than a FOR loop, too; and, a GOTO is the most powerful of them all (heresy!). The reason a PASCAL FOR loop -- or for that matter, a C FOR loop -- is good is because simply by looking at it, you can clearly see that it is a WHILE loop of a particular kind, with clearly evident starting, terminating, and stepping operations.

The major argument that may be made against C's for loop is simply one of clarity. Possible reasons include:

* The loop variable has to be repeated four (or three, if you use "i++" instead of "i=i+1") times.
* The semicolons, adequate to delimit the three clauses for the compiler, may not sufficiently delimit them to a human reader -- it may not be instantly obvious where one clause starts and another ends.

* Also, the very use of semicolons instead of control keywords (like "TO") may be irritating; in a way, it's like having to write

   FOR I,1,100

instead of

   FOR I:=1 TO 100

If you think the first version isn't any worse than the second, you shouldn't mind C; some, however, find "FOR I,1,100" slightly less clear than "FOR I:=1 TO 100".

   for (i=1; i<=10; i=i+1)          FOR I:=1 TO 10 DO

or, alternatively

   for (i=1; i<=10; i++)            FOR I:=1 TO 10 DO

Which do you prefer? Frankly, for me, the PASCAL version is somewhat clearer, although I'm not prepared to say that the clarity is worth the cost in power. On the other hand, many a C programmer doesn't see any advantage in the PASCAL style, and perhaps there isn't any. Some of the C/PASCAL differences, I'm afraid, boil down to simply this.

THE WHILE LOOP AND ITS LIMITATIONS -- AN INTERESTING PROBLEM

Consider the following simple task -- you want to read a file until you get a record whose first character is a "*"; for each record you read, you want to execute some statements. Your PASCAL program might look like this:

   READLN (F, REC);
   WHILE REC[1]<>'*' DO
      BEGIN
      PROCESS_RECORD_A (REC);
      PROCESS_RECORD_B (REC);
      PROCESS_RECORD_C (REC);
      READLN (F, REC);
      END;

All well and good? But, wait a minute -- we had to repeat the READLN statement a second time at the end of the WHILE loop. "Lazy bum," you might reply. "Can't handle typing an extra line." Well, what if, in order to get the record, we had to do more than just a READLN? We might need to, say, call FCONTROL before doing the READLN, and perhaps have a more complicated loop test.
Our program might end up looking like:

   FCONTROL (FNUM(F), EXTENDED_READ, DUMMY);
   FCONTROL (FNUM(F), SET_TIMEOUT, TIMEOUT);
   READLN (F, REC);
   GETFIELD (REC, 3, FIELD3);
   WHILE FIELD3<>'*' DO
      BEGIN
      PROCESS_RECORD_A (REC);
      PROCESS_RECORD_B (REC);
      PROCESS_RECORD_C (REC);
      FCONTROL (FNUM(F), EXTENDED_READ, DUMMY);
      FCONTROL (FNUM(F), SET_TIMEOUT, TIMEOUT);
      READLN (F, REC);
      GETFIELD (REC, 3, FIELD3);
      END;

This is not a happy-looking program. We had to duplicate a good chunk of code, with all the resultant perils of such a duplication; the code was harder to write, it's now harder to read, and when we maintain it, we're liable to change one of the occurrences of the code and not the other.

Workarounds, of course, exist. We can say

   REPEAT
      FCONTROL (FNUM(F), EXTENDED_READ, DUMMY);
      FCONTROL (FNUM(F), SET_TIMEOUT, TIMEOUT);
      READLN (F, REC);
      GETFIELD (REC, 3, FIELD3);
      IF FIELD3 <> '*' THEN
         BEGIN
         PROCESS_RECORD_A (REC);
         PROCESS_RECORD_B (REC);
         PROCESS_RECORD_C (REC);
         END;
   UNTIL FIELD3 = '*';

although this is also rather messy -- we've had to repeat the loop termination condition, and the resulting code is really a WHILE-DO loop masquerading as a REPEAT-UNTIL. Some might reply that what we ought to do is to move the FCONTROLs, READLN, and GETFIELD into a separate function that returns just the value of FIELD3, or perhaps even the loop test (FIELD3 <> '*'). Then, the loop would look like:

   WHILE FCONTROLS_READLN_AND_GETFIELD_CHECK_STAR (FNUM, REC) DO
      BEGIN
      PROCESS_RECORD_A (REC);
      PROCESS_RECORD_B (REC);
      PROCESS_RECORD_C (REC);
      END;

This, indeed, does look nice -- but are we to be expected to create a new procedure every time a control structure doesn't work like we want it to? I like procedures just as much as the next man; in fact, I'm a lot more prone to pull code out into procedures than others are (I like my procedures to be twenty lines or shorter). On the other hand, what if someone said that you couldn't use BEGIN ..
END in IF/THEN/ELSE statements -- if you want to do more than one thing in the THEN clause, you have to write a procedure?

C has some advantage here. With C's "comma" operator, you can combine any number of statements (with some restrictions) into a single expression, whose result is the last value. Thus, what you can do is something like this:

   while ((fcontrol (fnum(f), extended_read, dummy),
           fcontrol (fnum(f), set_timeout, timeout),
           gets (f, rec, 80),
           getfield (rec, 3, field3),
           field3 != '*'))
      {
      process_record_a (rec);
      process_record_b (rec);
      process_record_c (rec);
      }

Whether this is better or not, you decide. The "comma" construct can be very confusing. In "while ((...))", the outside parentheses are the WHILE's; the inner pair is the comma construct's; and all others belong to internal expressions and function calls. Additionally, you have to keep track of which commas belong to the function calls and which delimit the comma construct's elements.

What is that slithering underfoot? Could it be the serpent? He proposes this:

   WHILE TRUE DO
      BEGIN
      FCONTROL (FNUM(F), EXTENDED_READ, DUMMY);
      FCONTROL (FNUM(F), SET_TIMEOUT, TIMEOUT);
      READLN (F, REC);
      GETFIELD (REC, 3, FIELD3);
      IF FIELD3='*' THEN GOTO 99;
      PROCESS_RECORD_A (REC);
      PROCESS_RECORD_B (REC);
      PROCESS_RECORD_C (REC);
      END;
   99:

"Sssimple and ssstraightforward, madam. Won't you have a bite?" Shame on you! Still, it's not obvious that the old faithful "GOTO" isn't, relatively speaking, a reasonable solution. C has its own variant that lets us get away without using the "forbidden word":

   while (TRUE)
      {
      fcontrol (fnum(f), extended_read, dummy);
      fcontrol (fnum(f), set_timeout, timeout);
      gets (f, rec, 80);
      getfield (rec, 3, field3);
      if (field3 == '*')
         break;
      process_record_a (rec);
      process_record_b (rec);
      process_record_c (rec);
      }

C's "BREAK" construct gets you out of the construct that you're immediately in, be it a WHILE loop (as in this case), a SWITCH statement (in which it is vital), a FOR, or a DO.
If you believe in the evil of GOTOs, you probably won't much like BREAKs; again, though, I ask -- is the above example any less muddled than the other ones I showed?

Incidentally, the best approach that I've seen so far comes from a certain awful, barbarian language called FORTH (OK, all you FORTHies -- meet me in the alley after the talk and we can have it out). Translated into civilized terms, the loop looked something like this:

   DO
      FCONTROL (FNUM(F), EXTENDED_READ, DUMMY);
      FCONTROL (FNUM(F), SET_TIMEOUT, TIMEOUT);
      READLN (F, REC);
      GETFIELD (REC, 3, FIELD3);
   WHILE FIELD3<>'*'
      PROCESS_RECORD_A (REC);
      PROCESS_RECORD_B (REC);
      PROCESS_RECORD_C (REC);
   ENDDO;

This so-called "loop-and-a-half" solves what I think is the key problem, present in so many WHILE loops -- that the condition often takes more than a single expression to calculate. Well, in any case, neither SPL, PASCAL, nor C has such a construct, so that's that.

BREAK, CONTINUE, AND RETURN -- PERFECTION OR PERVERSION?

As I mentioned briefly in the last section, C has three control structures that PASCAL does not, and some say should not. These structures, Comrade, are Ideologically Suspect. A Dangerous Heresy. Still, they're there, and ought to be briefly discussed.

* BREAK -- exits the immediately enclosing loop (WHILE, DO, or FOR) or a SWITCH statement. Essentially a GOTO to the statement immediately following the loop.

* CONTINUE -- goes to the "next iteration" of the immediately enclosing loop (WHILE, DO, or FOR).

* RETURN -- exits the current procedure. "RETURN <expression>" exits the current procedure, returning the value of <expression> as the procedure's result.

* Of course, GOTO, the old faithful.

Now, as you may or may not recall, a while ago there was much argument made against GOTOs. Instead of GOTOs, it was said, you ought to use only IF-THEN-ELSEs and WHILE-DOs.
CASEs, FORs, and REPEAT-UNTILs, being just variants of the other control structures, were all right; but GOTOs were condemned, on several very good grounds:

* First of all, with GOTOs, the "shape" of a procedure stops being evident. If you don't use GOTOs, each procedure and block of code will have only one ENTRY and only one EXIT. This means that you can always assume that control will always flow from the beginning to the end, with iterations and departures that are always clearly defined and the conditions for which are always evident.

* If you avoid GOTOs, then for any statement, you can tell under what conditions it will be executed just by looking at the control structures within which it is enclosed.

These concerns, I would say, may apply equally well to BREAKs, CONTINUEs, and RETURNs. Personally, I must confess, I don't use GOTOs. I don't know if it is the appeal of reason, the lesson of experience, or fear for my immortal soul. About five years ago I resolved to stop using them; except for "long jumps" (which I'll talk about more later), I use GOTOs in 1 procedure of MPEX's 40 procedures, and in 2 procedures of my RL's 350 (both of the uses of "GOTO" are as "RETURN" statements). However, I must say that in many cases the temptation does seem great.

Consider, for a moment, the following case. We need to write a procedure that opens a file, reads some records, writes some records, and closes the file. In case any of the file operations fails, we should immediately close the file and not do anything else.
The "GOTO-less" solution:

   munch_file (f)
   char f[40];
   {
      int fnum;
      fnum = fopen (f, 1);
      if (error == 0)   /* let's say ERROR is an error code */
         {
         freaddir (fnum, buffer, 128, rec_a);
         if (error == 0)
            {
            munch_record_one_way (buffer);
            fwritedir (fnum, buffer, 128, rec_a);
            if (error == 0)
               {
               freaddir (fnum, buffer, 128, rec_b);
               if (error == 0)
                  {
                  munch_record_another_way (buffer);
                  fwritedir (fnum, buffer, 128, rec_b);
                  if (error == 0)
                     some_more_stuff;
                  }
               }
            }
         }
      fclose (fnum, 0, 0);
   }

Or, using "GOTO":

   munch_file (f)
   char f[40];
   {
      int fnum;
   #define check_error if (error != 0) goto done
      fnum = fopen (f, 1);
      if (error == 0)
         {
         freaddir (fnum, buffer, 128, rec_a);
         check_error;
         munch_record_one_way (buffer);
         fwritedir (fnum, buffer, 128, rec_a);
         check_error;
         freaddir (fnum, buffer, 128, rec_b);
         check_error;
         munch_record_another_way (buffer);
         fwritedir (fnum, buffer, 128, rec_b);
         check_error;
         some_more_stuff;
         }
   done:
      fclose (fnum, 0, 0);
   }

Is the latter way really worse? I'm not so sure. Also, I can't see any way in which I can rewrite this example without GOTOs without making it as cumbersome as the first case. Similar examples can be found for BREAK and RETURN. If, for instance, I weren't required to close the file, I'd just do a RETURN instead of doing the "GOTO DONE"; if I had to loop through the file, my code might look something like:

   framastatify (f)
   char f[40];
   {
      int fnum;
      fnum = fopen (f, 1);
      if (error == 0)
         {
         while (TRUE)
            {
            fread (fnum, rec1, 128);
            if (error != 0)
               break;
            if (frob_a (rec1) == failed)
               break;
            fupdate (fnum, rec1, 128);
            if (error != 0)
               break;
            freadlabel (fnum, rec1, 128, 0);
            if (error != 0)
               break;
            if (twiddle_label (rec1) == failed)
               break;
            fwritelabel (fnum, rec1, 128, 0);
            if (error != 0)
               break;
            fspace (fnum, 20);
            if (error != 0)
               break;
            }
         fclose (fnum, 0, 0);
         }
   }

Just IMAGINE all those IFs you'd need to nest if you avoided BREAK! CONTINUE, on the other hand, is a vile heresy. Everybody who uses CONTINUE should be burned at the stake.
To summarize, "C Notes, A Guide to the C Programming Language" by C.T. Zahn (Yourdon 1979) says: "In practice, BREAK is needed rarely, CONTINUE never, and GOTO even less often than that... It also is good style to minimize the number of RETURN statements; exactly one at the end of the function is best of all for readability."

On the other hand, I say "If this be treason, make the most of it!" Especially if your procedures are short enough and otherwise well-written enough, I think that you can well make the judgment that even with the introduction of GOTOs, the control flow will still be clear enough. Just don't tell anyone I told you to do it.

LONG JUMPS -- PROBLEM AND SOLUTION

Modern structured programming encourages FACTORING. Your algorithm, it says, should be broken up into small procedures, small enough that each one can be easily understood and digested by anybody reading it. I'm quite fond of factoring myself, and you'll find most of my procedures to be about 20-odd lines long or shorter. I try to make each procedure a "black box", with a well-defined, atomic function and no unobvious side effects.

Naturally, with procedures this small, I often end up going several levels of procedure calls deep to do a relatively simple task. For instance, I might have a procedure called ALTFILE that takes a file name and a string of keywords indicating how the file is to be altered:

* ALTFILE calls PARSE_KEYWORDS to parse the keyword string;

* PARSE_KEYWORDS separates the string into individual keywords, calling PROCESS_KEYWORD for each one;

* PROCESS_KEYWORD figures out what keyword is being referenced, and calls a parsing routine -- PARSE_INTEGER, PARSE_DATE, PARSE_INT_ARRAY, etc. -- depending on the type of the value the user specified;

* PARSE_INT_ARRAY takes a list of integer values delimited by, say, ":"s, and calls PARSE_INTEGER for each one;

* PARSE_INTEGER converts the text string containing an integer value into a number and returns the numeric value.
Not a far-fetched example, you must agree; in fact, many of my programs (e.g. MPEX's %ALTFILE parser) nest even deeper. Now, the question arises -- what if PARSE_INTEGER realizes that the value the user specified isn't a valid number after all?

The solution seems clear -- PARSE_INTEGER, in addition to returning the integer's value, also returns a true/false flag indicating whether or not the value was actually valid. PARSE_INTEGER returns this to PARSE_INT_ARRAY; now, PARSE_INT_ARRAY realizes that its parameter isn't a valid integer array -- it must also return a success/failure flag to PROCESS_KEYWORD; PROCESS_KEYWORD must pass it back up to PARSE_KEYWORDS; PARSE_KEYWORDS should return it to ALTFILE; finally, ALTFILE informs its caller that the operation failed.

Let's look at a particular specimen of one of these procedures; say, the portion that handles the keyword FOOBAR, which the user should specify in conjunction with an integer array, a string, and two dates:

   ...
   IF KEYWORD="FOOBAR" THEN
      BEGIN
      GET_SUBPARM (0, PARM_STRING);
      IF PARSE_INT_ARRAY (PARM_STRING, SP0_VALUE) = FALSE THEN
         PARSE_KEYWORD:=FALSE
      ELSE
         BEGIN
         GET_SUBPARM (1, PARM_STRING);
         IF PARSE_STRING (PARM_STRING, SP1_VALUE) = FALSE THEN
            PARSE_KEYWORD:=FALSE
         ELSE
            BEGIN
            GET_SUBPARM (2, PARM_STRING);
            IF PARSE_DATE (PARM_STRING, SP2_VALUE) = FALSE THEN
               PARSE_KEYWORD:=FALSE
            ELSE
               BEGIN
               GET_SUBPARM (3, PARM_STRING);
               PARSE_KEYWORD:=PARSE_DATE (PARM_STRING, SP3_VALUE);
               END;
            END;
         END;
      END;
   ...

Of course, the same sort of thing has to be repeated in every procedure in the calling sequence; the moment an error return is detected from one of the called procedures, the other calls have to be skipped, and the error condition should be passed back up to the caller.

Error handling, of course, is important business, and it would hardly be appropriate to crash and burn just because the user inputs a bad value (users input bad values all the time). Still, all this work just to catch the error condition?
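The flag-propagation chore looks no better in C. Here is a minimal, runnable sketch of the bottom two levels of the chain; the function names are mine, modeled on the article's PARSE_INTEGER and PARSE_INT_ARRAY, and the point is that every caller must test the flag and hand it upward.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical leaf parser: returns 1 (success) or 0 (failure),
   delivering the parsed value through *value. */
static int parse_integer (const char *str, int *value)
{
   if (*str == '\0')
      return 0;
   for (const char *p = str; *p; p++)
      if (!isdigit ((unsigned char) *p))
         return 0;
   *value = atoi (str);
   return 1;
}

/* Each level of the calling sequence has to test the flag and pass it on. */
static int parse_int_array (char *str, int values[], int *count)
{
   *count = 0;
   for (char *tok = strtok (str, ":"); tok; tok = strtok (NULL, ":"))
      {
      if (!parse_integer (tok, &values[*count]))
         return 0;             /* propagate the failure upward */
      (*count)++;
      }
   return 1;
}
```

Multiply this by every procedure between PARSE_INTEGER and ALTFILE and you have the bookkeeping the article is complaining about.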
What we really want to do in this case is to

* HAVE WHOEVER DETECTS THE ERROR CONDITION AUTOMATICALLY RETURN ALL THE WAY TO THE TOP OF THE CALLING SEQUENCE.

In other words, the error finder might have code that looks like:

   NUM:=BINARY (STR, LEN);
   IF CCODE<>CCE THEN     { an error detected? }
      SIGNAL_ERROR;       { return to the top! }

The procedure we want to return to would indicate its desire to catch these errors by saying something like:

   ON ERROR DO
      { the code to be activated when the error is detected };
   RESULT:=ALTFILE (FILE, KEYWORDS);

Finally, the intermediate procedures can now be the soul of simplicity:

   ...
   IF KEYWORD="FOOBAR" THEN
      BEGIN
      GET_SUBPARM (0, PARM_STRING);
      PARSE_INT_ARRAY (PARM_STRING, SP0_VALUE);
      GET_SUBPARM (1, PARM_STRING);
      PARSE_STRING (PARM_STRING, SP1_VALUE);
      GET_SUBPARM (2, PARM_STRING);
      PARSE_DATE (PARM_STRING, SP2_VALUE);
      GET_SUBPARM (3, PARM_STRING);
      PARSE_DATE (PARM_STRING, SP3_VALUE);
      END;
   ...

Thus, the three components of this scheme:

* The code that finds the error -- it "SIGNALS THE ERROR";

* The code that should be branched to in case of error is somehow indicated, at compile time or run time (but before the error is actually signaled);

* Finally, the intermediate code knows nothing about the possible error condition. It's automatically exited by the error signaling mechanism.

For want of a better name, I'll call this concept a "Long Jump". It's also been called a "non-local GOTO", a "throw", a "signal raise", and other unsavory things, but "Long Jump" -- which happens to be the C name for it -- sounds more romantic.

LONG JUMPS, CONTINUED -- SOLUTIONS AND PROBLEMS

I've indicated the need -- or at least, what I think is a need -- and a possible prototype solution. There are several implementations of this already extant, each with its own little quirks and problems.

PASCAL -- STANDARD AND /3000

The only mechanism Standard PASCAL and PASCAL/3000 give you to solve our problem is the GOTO.
In PASCAL, you're allowed to GOTO out of a procedure or function; however, you can only branch into the main body of the program, or from a nested procedure into the procedure that contains it. In other words, if you have

   PROCEDURE P;
      PROCEDURE INSIDE_P;   { nested in P }
      BEGIN
      ...
      END;
   BEGIN
   ...
   END;

   PROCEDURE Q;
   BEGIN
   ...
   P;
   ...
   END;

then you can branch from INSIDE_P into P, but you can't branch from P into Q, even though Q calls P.

Even if this restriction weren't present, the GOTO to a fixed label still wouldn't be the right answer -- what if our PARSE_KEYWORDS procedure is called from two places? Surely we wouldn't want an error condition to cause a branch to the same location in both cases! Besides, if we want to compile PARSE_KEYWORDS separately from its caller, we'd have to allow "global label variables". In reality, PASCAL can't do these "long jumps".

SPL

SPL has a different and rather better facility. In SPL, you can't branch from one procedure into another; however, you CAN pass a label as a parameter to a procedure. Thus, you could write:

   PROCEDURE PARSE'INT'ARRAY (PARM, RESULT, ERR'LABEL);
   BYTE ARRAY PARM;
   INTEGER ARRAY RESULT;
   LABEL ERR'LABEL;
   BEGIN
   ...
   IF << test for error condition >> THEN
      GOTO ERR'LABEL;
   ...
   END;

Then, you might call this from within PROCESS'KEYWORD by saying

   PROCEDURE PROCESS'KEYWORD (KEYWORD'AND'PARM, ERR'LABEL);
   BYTE ARRAY KEYWORD'AND'PARM;
   LABEL ERR'LABEL;
   BEGIN
   ...
   IF KEYWORD="FOOBAR" THEN
      BEGIN
      GET'SUBPARM (0, PARM'STRING);
      PARSE'INT'ARRAY (PARM'STRING, SP0'VALUE, ERR'LABEL);
      ...
      END;
   ...
   END;

When you call PARSE'INT'ARRAY, you pass it the label to which it should return in case of error -- in this case, also called ERR'LABEL, which was itself passed to this procedure. Finally, the topmost caller -- the one that calls ALTFILE -- might say:

   RESULT:=ALTFILE (FILENAME, KEYWORDS, GOT'ERROR);
   ...
   GOT'ERROR:
   << report to the user that an error occurred >>

The key point here is that each procedure doesn't really return directly to the top; rather, it returns to the error label that it was passed by its caller. Since that may well be the label passed by the caller's caller, and so on, you get a sort of "daisy chain" effect by which you can easily exit ten levels of procedures in one GOTO statement.

At this point, I think it's quite important to mention a SEVERE PROBLEM of these "long jumps" that I think any implementation mechanism has to be able to address:

* THE VERY ESSENCE OF A LONG JUMP IS THAT IT BYPASSES SEVERAL OF THE PROCEDURES IN THE CALLING SEQUENCE. A PROCEDURE (say, our PROCESS_KEYWORD) CALLS ANOTHER PROCEDURE, EXPECTING THE CALLEE TO RETURN, BUT THE CALLEE NEVER DOES!

Imagine for a moment that PROCESS_KEYWORD opened a file, intending to close it at the end of the operation; after the long jump branches out of it, the file will remain open. Any other kind of cleanup -- resetting temporarily changed global variables, releasing acquired resources -- that a procedure expects to do at the end might remain undone because the procedure will be branched out of.

Similarly, what if a procedure EXPECTS another procedure that it calls to detect an error condition? What is a fatal error under some circumstances may be quite normal under others; for instance, say you have a procedure that reads data from a file and signals an error if the file couldn't be opened -- in some cases, you may expect the file to be unopenable, and have a set of defaults you want to use instead.

By using the convenience of long jumps, you lose the certainty that every procedure has complete control over its execution, and can be sure that any procedure it calls will always return. The advantage of SPL's approach is that you could call a procedure passing to it any error label you want to.
For instance, PROCESS'KEYWORD might look like:

   PROCEDURE PROCESS'KEYWORD (KEYWORD'AND'PARM, ERR'LABEL);
   BYTE ARRAY KEYWORD'AND'PARM;
   LABEL ERR'LABEL;
   BEGIN
   INTEGER FNUM;
   FNUM:=FOPEN (KEY'INFO'FILE, 1);
   ...
   IF KEYWORD="FOOBAR" THEN
      BEGIN
      GET'SUBPARM (0, PARM'STRING);
      PARSE'INT'ARRAY (PARM'STRING, SP0'VALUE, CLOSE'FILE);
      ...
      END;
   ...
   RETURN;   << if we finished normally, just return >>

   CLOSE'FILE:   << branch here in case of error >>
   FCLOSE (FNUM, 0, 0);
   GOTO ERR'LABEL;
   END;

Because you have complete control over each branch, you don't HAVE to pass the procedure you call the same error label that you were passed; if you want to do some cleanup, you can just pass the label that does the cleanup, and THEN returns to your own error label.

Thus, with SPL's label parameter system, you get the best of both worlds:

* If you pass an "error label" to a procedure, the procedure may choose to return normally or to return to the error label.

* Since you can pass the same label to a procedure you call as the one that you yourself were passed, a single GOTO to that label can conceivably exit any number of levels of procedures.

* On the other hand, if you want to do some cleanup in case of an error, you can just pass a different label, one that points to the cleanup code.

* Finally -- if you want to -- you can actually pass several labels to a procedure, allowing it to return to a different one depending on what error condition it finds. A bit extravagant for my blood, but maybe I'm just too stodgy.

The only problems that this system has are:

* You have to pass the label to any procedure that might conceivably want to participate in a long jump -- either the procedure that initially detects the error or any one that wants to pass it on. This may often mean that virtually every one of your procedures will have to have this error label parameter. Not a very unpleasant problem, but a bit of a bother nonetheless.
* Similarly, there are some procedures whose parameters you can't dictate; for instance, control-Y trap procedures (ones in which a long jump to the control-Y handling code may often be just the thing you want to do). Other trap procedures (arithmetic, library, and system) are just like this, too, as are procedures which are themselves passed as "procedure parameters" to other procedures and whose parameters are dictated by those other procedures (got that?).

Besides these minor problems, though, SPL's long jump support is quite reasonably done.

PROPOSED ANSI STANDARD C

C's "GOTO" doesn't allow any branch from one function to another; neither does C provide label parameters like SPL does. Long jumps in C are accomplished with a different mechanism, involving the SETJMP and LONGJMP built-in procedures.

SETJMP is a procedure to which you pass a record structure (of the predefined type "jmp_buf"). When you first call it, it saves all the vital statistics of the program -- the current instruction pointer, the current top-of-stack address, etc. -- in this record structure. Then, when the same record structure is passed to LONGJMP, LONGJMP uses this information to restore the instruction pointer and stack pointer to be exactly what they were at SETJMP time. Thus, control is passed back to the SETJMP location, wherever it may be. A typical application of this might be:

   jmp_buf error_trapper;

   proc()
   {
      ...
      if (setjmp (error_trapper) != 0)
         /* do error processing */;
      else
         {
         result = altfile (filename, keywords);
         ...
         }
      ...
   }

   ...

   int parse_integer (str)
   char str[];
   {
      ...
      if (bad_value)
         longjmp (error_trapper, 1);
      ...
   }

One thing I didn't mention at first, as you see, was the "IF (SETJMP(ERROR_TRAPPER) != 0)". Well, since the LONGJMP jumps DIRECTLY to the instruction following the SETJMP, we have to have some way of distinguishing the first time it is executed (after a legitimate SETJMP) from the next time (after the LONGJMP which transferred control back to it).
The initial SETJMP, you see, returns a 0; a LONGJMP takes its second parameter (in this case, a 1), and returns it as the "result" of SETJMP. Thus, when the IF statement is first executed, the value of the "(SETJMP ... != 0)" will be FALSE, and the ALTFILE will be done; when the IF is executed a second time, the value will be TRUE, and the error processing will be performed.

Note the distinctive features of the SETJMP/LONGJMP construct:

* The "jump buffer" -- set by SETJMP and used by LONGJMP -- need not be passed as a parameter to each procedure that needs it (although it could be). Typically, it's stored as a global variable (which the SPL error label parameter couldn't be).

* You still have control over procedures you call; if you want to trap their jump yourself (either to do some cleanup or treat it as a normal condition), you can just do your own SETJMP using the same buffer that they'll LONGJMP with.

* On the other hand, if you want to do some cleanup and then continue the LONGJMP process -- propagate it back up to the original error trapper, in this case PROC -- you have to do more work. You must save the original jump buffer in a temporary variable before doing the SETJMP, and restore it before continuing the LONGJMP (or simply returning from the procedure).

For instance, PROCESS_KEYWORD might look like this (since "jmp_buf" is an array type, the save and restore have to be done with memcpy rather than with plain assignment):

   process_keyword (keyword_and_parm)
   char keyword_and_parm[];
   {
      jmp_buf save_error_trapper;   /* declare our temporary save buffer */
      int fnum;
      fnum = fopen (key_info_file, 1);
      memcpy (save_error_trapper, error_trapper, sizeof (jmp_buf));
      if (setjmp (error_trapper) != 0)
         {
         /* Must be an error condition */
         fclose (fnum, 0, 0);
         memcpy (error_trapper, save_error_trapper, sizeof (jmp_buf));
         longjmp (error_trapper, 1);
         }
      ...
      if (strcmp (keyword, "foobar") == 0)
         {
         get_subparm (0, parm_string);
         parse_int_array (parm_string, sp0_value);
         ...
         }
      ...
      fclose (fnum, 0, 0);
      memcpy (error_trapper, save_error_trapper, sizeof (jmp_buf));
                                    /* restore for future use */
   }

Frankly speaking, if you ask me -- and even if you don't -- this doesn't look very clean.
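The SETJMP return-value convention is easy to verify with a minimal, self-contained sketch (the names here are mine, not the article's; the second argument of longjmp comes back as setjmp's "result"):

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf error_trapper;

static void failing_callee (void)
{
   longjmp (error_trapper, 5);   /* the 5 becomes setjmp's "result" */
}

/* Returns 0 if the callee completed, or the longjmp code if it didn't. */
int run_with_trap (void)
{
   int code;
   code = setjmp (error_trapper);   /* returns 0 on the initial call */
   if (code != 0)
      return code;                  /* we got here via the longjmp */
   failing_callee ();
   return 0;                        /* never reached in this demo */
}
```

Calling run_with_trap() takes the longjmp path, so it returns 5 rather than 0.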
I'd like to see some way of automatically "stacking" SETJMPs so that the system would do the saving of the old jump buffer for you; also, I'd prefer not to have to type that ugly "IF (SETJMP ... != 0)" kludge. On the other hand, this can be made quite palatable-looking with a few macros, and it's better than nothing (or is it?).

PASCAL/XL AND THE TRY..RECOVER

The authors of PASCAL/XL -- perhaps because they were faced with the non-trivial task of building a language that MPE/XL could be profitably written in -- must have given this subject a great deal of thought. And, fortunately, they've come up with what I think to be a very powerful construct:

   TRY
      statement1;
      statement2;
      ...
      statementN;
   RECOVER
      recoverycode;

The behavior here is:

* EXECUTE statement1 THROUGH statementN. IF ANY PASCAL ERROR (e.g. giving a bad numeric value to a READLN) OR A CALL TO THE BUILT-IN "ESCAPE" PROCEDURE OCCURS WITHIN THESE STATEMENTS, CONTROL IS TRANSFERRED TO recoverycode, AND AFTER THAT TO THE STATEMENT FOLLOWING TRY..RECOVER.

This, as you see, allows you to put a TRY..RECOVER into the top-level procedure (in our case, PROC or ALTFILE) and an ESCAPE call in any of the called procedures (e.g. PARSE_INTEGER) that detects a fatal error. The best part, though, is that any procedure that wants to establish some sort of "cleanup" code can do this trivially! For instance, our PROCESS_KEYWORD might say:

   PROCEDURE PROCESS_KEYWORD (VAR KEYWORD_AND_PARM: STRING);
   VAR FNUM: INTEGER;
       SAVE_ESCAPECODE: INTEGER;
   BEGIN
   FNUM:=FOPEN (KEY_INFO_FILE, 1);
   TRY
      ...
      IF KEYWORD="FOOBAR" THEN
         BEGIN
         GET_SUBPARM (0, PARM_STRING);
         PARSE_INT_ARRAY (PARM_STRING, SP0_VALUE);
         END;
      ...
      FCLOSE (FNUM, 0, 0);
   RECOVER
      BEGIN
      SAVE_ESCAPECODE:=ESCAPECODE;
      FCLOSE (FNUM, 0, 0);
      ESCAPE (SAVE_ESCAPECODE);
      END;
   END;

If any error occurs in the code between TRY and RECOVER, the BEGIN/END in the RECOVER part is triggered.
This is now free to close the file, or do whatever else it needs to, and then "pass the error down" by calling ESCAPE again. This ESCAPE -- since it's no longer between this TRY and RECOVER -- will activate the previously established TRY/RECOVER block (say, in the PARSE_KEYWORDS procedure), which might do more cleanup and then call ESCAPE again. Eventually, the error will percolate to the top-most TRY/RECOVER, which will just do some work and not call ESCAPE any more, continuing with the rest of the program.

In other words, "TRY .. RECOVER"s can be nested. In the following piece of code

   TRY
      A;
      TRY
         B;
         TRY
            C;
         RECOVER
            R1;
         D;
      RECOVER
         R2;
      E;
   RECOVER
      R3;

* An error or ESCAPE in C will cause a branch to R1.

* An error/ESCAPE in B or D will, of course, branch to R2 (since B and D are outside the innermost TRY .. RECOVER R1). However, an error/ESCAPE in R1 will also cause a branch to R2! That's because R1 is also out of the area of effect of the innermost TRY .. RECOVER. In other words, the "recovery handler" R1 is only "established" between the innermost TRY and the innermost RECOVER; when it's actually "triggered", it's disestablished, and the recovery handler that was previously in effect is re-established.

* By the same token, an error/ESCAPE in A, E, or R2 will branch to R3.

* And, finally, an error in R3 -- or anywhere else outside of the TRY .. RECOVER -- will actually abort the program with an error message.

As you see, then, all is for the best in this best of all possible worlds. We can do long jumps "up the stack" to the RECOVER code, but each intervening procedure can also easily set up "cleanup code" that needs to be executed before the long jump can continue.

Several notes:

* First of all, remember that the RECOVER statement is executed ONLY in case of an error or an ESCAPE. If the statements between TRY and RECOVER finish normally, any "cleanup" code you may have inside the RECOVER will NOT be executed.
That's why our sample program has two FCLOSEs -- one for the normal case and one for the cleanup case.

* Note also that the ESCAPE call can take a parameter (just like C's LONGJMP). This parameter is then available as the variable ESCAPECODE in the RECOVER handler, and is used to indicate what kind of error or ESCAPE happened. A RECOVER handler might, for instance, be used to avoid an abort caused by an expected error condition (e.g. a file I/O error); however, if it sees that ESCAPECODE indicates some other, unexpected, error condition, it might terminate or call ESCAPE again, hoping that some "higher-level" RECOVER block can handle the error.

* Finally, if a RECOVER block wants to continue the long jump after doing its cleanup work, it often needs to pass the ESCAPECODE up as well (unless, of course, the higher-level RECOVER handler won't use the ESCAPECODE). Unfortunately, the PASCAL/XL manual explicitly tells us:

  "It is wise to assign the result of the ESCAPECODE function to a local variable immediately upon entering the RECOVER part of a TRY-RECOVER construct, because the system can change that value later in the RECOVER part."

This is too bad; it would have been nice to have TRY .. RECOVER do this saving for you automatically, saving you the burden of having to declare and set an extra local variable. Still, we oughtn't look a gift horse in the mouth.

Note, incidentally, how C's #define macro facility can come to our aid if we want to implement this same construct in C. All we need is three #defines:

   int escapecode;
   int jump_stack_ptr = -1;
   jmp_buf jump_stack[100];   /* the stack used to do nesting */

   #define TRY if (setjmp(jump_stack[++jump_stack_ptr])==0) {

   #define RECOVER jump_stack_ptr--; } else

   #define ESCAPE(parm) \
      { \
      escapecode = parm; \
      longjmp (jump_stack[jump_stack_ptr--], 1); \
      }

This would allow us to say:

   TRY
      code;
   RECOVER
      errorhandler;

and

   ESCAPE (value);

just like we could in PASCAL/XL!
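To convince yourself that the three #defines really work, here is a runnable miniature; the macros are repeated verbatim so the example is self-contained, while the two demo procedures are hypothetical stand-ins for ALTFILE and PARSE_INTEGER, not code from the article.

```c
#include <assert.h>
#include <setjmp.h>

int escapecode;
int jump_stack_ptr = -1;
jmp_buf jump_stack[100];               /* the stack used to do nesting */

#define TRY if (setjmp(jump_stack[++jump_stack_ptr])==0) {
#define RECOVER jump_stack_ptr--; } else
#define ESCAPE(parm) \
   { \
   escapecode = parm; \
   longjmp (jump_stack[jump_stack_ptr--], 1); \
   }

/* Stand-in for PARSE_INTEGER: long-jumps to the nearest RECOVER on error. */
static void parse_integer_demo (int bad_value)
{
   if (bad_value)
      ESCAPE (42);
}

/* Stand-in for ALTFILE: returns -1 on success,
   or the escape code caught by its RECOVER part. */
int altfile_demo (int bad_value)
{
   int caught = -1;
   TRY
      parse_integer_demo (bad_value);
   RECOVER
      caught = escapecode;
   return caught;
}
```

On the good-value path the TRY body completes and pops the jump stack; on the bad-value path ESCAPE pops it on the way out -- either way jump_stack_ptr ends up back at -1, which is what makes the nesting work.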
Note how we've added this entirely new control structure without any changes to the compiler -- nothing more complicated than a few #defines! (Many thanks to Tim Chase of CCS for showing me how to do this!)

NESTED PROCEDURES

An interesting feature of PASCAL is its ability to have procedures nested within other procedures. In other words, I could say:

   PROCEDURE PARSE_THING (VAR THING: STRING);
   VAR CURR_PTR, CURR_DELIMS: INTEGER;
       QUOTED: BOOLEAN;
       ...
      PROCEDURE PARSE_CURR_ELEMENT (...);
      BEGIN
      ...
      END;
   BEGIN
   ...
   PARSE_CURR_ELEMENT (...);
   ...
   END;

PARSE_CURR_ELEMENT here is just like a local variable of PARSE_THING -- it's a local procedure. It's callable only from within PARSE_THING and not from any other procedure in the program. More importantly,

* THE NESTED PROCEDURE (PARSE_CURR_ELEMENT) CAN ACCESS ALL OF PARSE_THING'S LOCAL VARIABLES.

This is a significant consideration. If PARSE_CURR_ELEMENT didn't need to access PARSE_THING's local variables, not only could it be a separate (non-nested) procedure, but it probably should be. When a procedure is entirely self-contained, it's usually a good idea to make it accessible to as many potential callers as possible.

On the other hand, what if PARSE_CURR_ELEMENT needs to interrogate CURR_PTR to find out where we are in parsing the thing; or look at or modify CURR_DELIMS or QUOTED or whatever other local variables are relevant to the operation? We don't want to have to pass all these values as parameters -- there could be dozens of them. We don't want to make them global variables, since they're really only relevant to PARSE_THING -- why make them accessible to other procedures that have no business messing with them? (Incidentally, making the variables global would also prevent PARSE_THING from calling itself recursively.)
But, on the other hand, we certainly DO want to have PARSE_CURR_ELEMENT be a procedure -- after all, we might need to call it many times from within PARSE_THING; surely we don't want to repeat the code every time!

Thus, the main advantage of nested procedures is not just that, like local variables, they can only be accessed by the "nester". Rather, the advantage is that they can share the nester's local variables, which are often quite relevant to what the nested procedure is supposed to do.

Another substantial benefit comes when you pass procedures as parameters to other procedures. A good example of this might be a report writer procedure:

   TYPE LINE_TYPE = PACKED ARRAY [1..256] OF CHAR;

   PROCEDURE PRINT_LINE (VAR LINE: LINE_TYPE;
                         LINE_LEN: INTEGER;
                         PROCEDURE PAGE_HEADER (PAGENUM: INTEGER);
                         PROCEDURE PAGE_FOOTER (PAGENUM: INTEGER));

This procedure takes the line to be output and its length, but also takes two procedures -- one that will be called in case a page header should be printed and one in case a page footer should be printed. The utility of this is obvious -- it gives the user the power to define his own header and footer format. Now, let's say we have the following procedure:

   PROCEDURE PRINT_CUST_REPORT (VAR CATEGORY: INTEGER);
   VAR CURRENT_COUNTRY: PACKED ARRAY [1..40] OF CHAR;
       ...
   BEGIN
   ...
   PRINT_LINE (OUT_LINE, OUT_LINE_LEN,
               MY_PAGE_HEAD_PROC, MY_PAGE_FOOT_PROC);
   ...
   END;

PRINT_LINE will output OUT_LINE and, in some cases, call MY_PAGE_HEAD_PROC or MY_PAGE_FOOT_PROC. Now, it makes sense for you to want these procedures to print, say, the current value of CATEGORY and, perhaps, CURRENT_COUNTRY. In C and SPL, which have no nested procedures, both MY_PAGE_HEAD_PROC and MY_PAGE_FOOT_PROC would have to be separate procedures with no access to PRINT_CUST_REPORT's local variables.
The variables would either have to be global (which is quite undesirable) or would somehow have to be passed to PRINT_LINE, which in turn would pass them to the MY_PAGE_xxx_PROC procedures. This would be quite cumbersome, since in PRINT_CUST_REPORT the header and footer procedures need to be passed an integer and a PACKED ARRAY OF CHAR, whereas in some other application of PRINT_LINE they would have to be passed, say, three floats and a record structure.

In PASCAL, on the other hand, both MY_PAGE_HEAD_PROC and MY_PAGE_FOOT_PROC can be nested within PRINT_CUST_REPORT and thus have access to CATEGORY and CURRENT_COUNTRY (and all the other local variables of the PRINT_CUST_REPORT procedure). Another useful application for nested procedures.

C, as I mentioned, has no nested procedure support at all. On the other hand, it does have #DEFINEs, which allow you to define text substitutions that can often do the job (see the section on DEFINES) of a nested procedure, especially if it's a small one. For instance, you can say:

   #define foo(x,y) \
      { \
      int a, b;        /* variables local to THIS DEFINE */ \
      a = x + parm1;   /* access a variable local to the */ \
      b = y * parm2;   /* procedure (the nesting procedure) */ \
      x = a + b; \
      y = a * b; \
      }

As you can see, C's support for "block-local" variables -- local variables that are local not just to the procedure, but rather to the "{"/"}" block in which they're defined -- allows you to have #DEFINEs that are almost as powerful as real procedures.

SPL allows you to have "SUBROUTINE"s nested within procedures, but subject to some rather stringent restrictions:

* The subroutines can have no local variables of their own. This is a pretty severe problem, since it means that all your local variables have to be declared in the nesting procedure, which increases the likelihood of errors and also prohibits you from calling the subroutine recursively (which you would otherwise be able to do).
* The subroutines cannot be passed as procedure parameters to other procedures (only procedures can be -- try parsing that!).

* Furthermore, this nesting capability goes only one level deep; you can nest SUBROUTINEs in PROCEDUREs, but you can't nest anything within SUBROUTINEs. In PASCAL, procedures can be nested within each other to an arbitrary number of levels. Frankly speaking, I'm hard put to think of an application for triply-nested procedures.

Practically, you'll have to decide for yourself whether PASCAL's nested procedure support -- and C's lack of it -- is important to you. I brought this issue up to a C partisan, and she replied that she's simply never run into a case where nested procedures were all that important. Upon thinking about this, I found myself forced to agree, at least partially:

* #DEFINEs can do much of the job that nested procedures are needed for;

* Most procedures should often NOT be nested, but rather be made self-contained and made available to the world at large (rather than just to a particular procedure).

* If the reason you don't want to declare your variables as global is that you want to "hide" them from other procedures, you can do this in C by making them "static". This will make them available only to the procedures in the file in which they're defined. This allows you to share data between procedures (which you might otherwise have wanted to nest within each other) without making the data readable and modifiable by everybody.

* On the other hand, there's no denying that there are cases in which PASCAL's nested procedures are quite a bit superior to any C or SPL alternative. For instance, a recursive procedure might well not be able to use the "static global variable" approach I just mentioned.

DATA TYPES

The difference most often cited between PASCAL and C is the way that they treat data types. PASCAL is often considered a "strict type checking" language and C a "loose type checking" language, and that's true enough.
However, the effects of this philosophical difference are subtler and more pervasive than they appear at first glance.

What are data types? Data types can be seen in the earliest of languages, from FORTRAN and COBOL onwards. When you declare a variable to be a certain data type, you give certain information to the compiler -- information that the compiler must have to produce correct code. Historically, this information has included:

* What the various operators of the language MEAN when applied to the variable. "+", for instance, isn't just "addition" -- when you add two integers, it's integer addition, and when you add two reals, it's real addition. Two entirely different operations, with entirely different machine language opcodes and (possibly) different effects on the system state. Similarly, a COBOL "DISPLAY X" means:

  - If X is a string, print it verbatim;
  - If X is an integer, print its ASCII representation;
  - If X is a real, print its ASCII representation, but in a different format and with a different conversion mechanism.

* How much SPACE is to be allocated for the variable. "Array of 20 integers" is a type, too, one from which the compiler can exactly deduce how much memory (20 words) needs to be allocated to fit this data.

If you look at SPL (and, incidentally, FORTRAN and other languages), you'll find that all of its type declarations essentially aim at serving these two functions. However, in recent times, a few other functions have been ascribed to type declarations:

* Using type declarations, the compiler can DETECT ERRORS that you may make. The compiler can't, of course, figure out if your program does "the right thing", since it doesn't know what the right thing is; however, it can see if there are any internal inconsistencies in your program.
For instance, if you're multiplying two strings, the compiler can tag that as an obvious error; similarly, if you pass a string parameter to a procedure that expects an integer (or vice versa), a good compiler will find this and save you a lot of run-time debugging. The more elaborate and precise the type specifications you give, the more error checking the compiler can do.

Error checking can also be provided at run time, where code that knows what size arrays are, for instance, can make sure that you don't inadvertently index outside them. PASCAL's "subrange types" do this sort of thing, too, allowing you to declare what values (e.g. "0 to 100") a variable may take and triggering an error when you try assigning it an invalid value.

* Furthermore, with a type declaration, the compiler can SAVE WORK for you by automatically defining special tools for the given type. The classic example of this is the record structure -- by declaring the structure, you're automatically defining a set of "operators" (one for each field of the structure) that allow you to easily access the structure. Similarly, enumerated types can save you the burden of having to manually allocate distinct values for each of the elements in the enumeration (admittedly, not a very large burden). Some fancy compilers can even automatically define "print" operations for each record structure, so that you can easily dump it in a legible format to the terminal without having to print each element individually.

* Good type handling provisions can INSULATE YOUR PROGRAMS FROM CHANGES IN YOUR DATA'S INTERNAL REPRESENTATION. For instance, if the compiler allows you to refer to a field of a record as, say, "CUSTREC.NAME" instead of "CUSTREC(20)", then you can easily reformat the insides of the record (adding new fields, changing field sizes, etc.) without having to change all the places that reference this record.
Similarly, if your language allows functions to return records and arrays as well as scalars, you can easily change the type of your, say, disc addresses from a 2-word double integer to a 10-word array of integers. In SPL, for instance, such a change would require rewriting all procedures that want to return such objects or to take them as "by-value" parameters. Even changing a value from an "integer" to a "double integer" in SPL will require you to change a great deal of code. The reason I've given this list is that SPL, PASCAL, and C place different weights on each of these points, and this makes for rather substantial differences in the way you use these languages. Now, away from the generalities and on to concrete examples. RECORD STRUCTURES Consider for a moment an entry in your "employee" data set. It could be a file label; it could be a Process Control Block entry; it could be any chunk of memory that contains various fields of various data types. A typical layout of this employee entry (or employee "record") might be: Words 0 through 14 - The employee name (a 30-character string); Words 15-19 - Social security number (10-character string); Words 20-21 - Birthday (a double integer, just to be interesting); Words 22-23 - Monthly salary (a real number). A simple record. It's 24 words long, but it's not really an "array of 24 words"; logically speaking, to you and me, it's a collection of four objects, each of a different type, each starting at a different (but constant) offset within the record. How do we declare a variable to hold this record? In FORTRAN and SPL, it's easy: INTEGER ARRAY EMPREC(0:23); or INTEGER EMPREC(24) Short and sweet. The compiler's happy -- it knows that it's an array of integers, which means you can extract an arbitrary element from it, and pass it to a procedure (like DBGET), which will receive its address as an integer pointer. 
This defines to the compiler the MEANING of the "indexing" and "pass to procedure" operations that can be done on EMPREC. Also, the compiler knows that 24 words must be allocated for this array, as a global or local variable. The compiler is happy, but are you? First of all, how are you going to access the various elements of this record structure? Are you going to say EMPREC(20) when you mean the employee's birthday (actually, since it's a double integer, you couldn't even do that)? What about error checking? Since all the compiler knows about this is that it's an integer array, it'll be happy as punch to allow you to put it anywhere an integer array can go. Would you like to pass it as the database name to DBGET instead of as the buffer variable? Fine. Would you like to view it as a 4 by 5 matrix and multiply it by, say, the department record? The computer will gladly oblige. Finally, consider the burden this places on you whenever you want to change the layout of EMPREC -- say, to increase the name from 30 characters to 40. You'll have to change all your "EMPREC(20)"s to "EMPREC(25)", all your "INTEGER ARRAY EMPREC(0:23)" to "INTEGER ARRAY EMPREC (0:28)". And, of course, if you forget one or the other -- why, the compiler will be happy to extract the 4th word of the social security number and treat it as the employee's birthday! Of course, you're not going to do this. You will certainly not refer to all the elements of the record structure by their numeric array indices (although it so happens that most of HP's MPE code does exactly this). Rather, you'll say (of course, in SPL, you can also do the same thing with DEFINEs): EQUATE SIZE'EMPREC = 24; BYTE ARRAY EMP'NAME (*) = EMPREC(0); BYTE ARRAY EMP'SSN (*) = EMPREC(15); DOUBLE ARRAY EMP'BIRTHDATE (*) = EMPREC(20); REAL ARRAY EMP'SALARY (*) = EMPREC(22); [Note: The fact that we define, say, EMP'BIRTHDATE and EMP'SALARY as arrays isn't a problem. 
If we say EMP'SALARY with no subscript, it'll refer to the 0th element of this "array", which is exactly what we want it to do.] FORTRAN is similar (you'd use an EQUIVALENCE); COBOL is a bit simpler, allowing you to say (remembering that COBOL doesn't have REALs): 01 EMPREC. 05 NAME PIC X(30). 05 SSN PIC X(10). 05 BIRTHDATE PIC S9(9) COMP. 05 SALARY PIC S9(5)V9(2) COMP-3. As you see, COBOL at least has the advantage that it automatically calculates the indexes of each subfield for you. This is nice, especially when you change the structure, reshuffling, inserting, deleting, or resizing fields. On the other hand, I wouldn't call this a very substantial feature, especially since sometimes you WANT to manually specify the field offsets (whenever the record structure is not under your control, like, say, an MPE file label). To summarize, this "EQUIVALENCE"ing approach that's available in SPL, FORTRAN, and COBOL saves you from the very substantial bother of having to hardcode the offsets of all the subfields into your program. This is certainly a good thing; however, PASCAL and C go substantially beyond this. The most serious problem with what I'll call the "EQUIVALENCE"ing approach is a rather subtle one, one that I didn't realize until I'd used it for some time. The definitions we saw above -- in SPL, FORTRAN, or COBOL -- defined several variables as subfields of another variable. EMP'NAME and EMP'SSN are subfields of EMPREC. What if we need to declare this EMPREC twice -- say, in two different procedures? Clearly we don't want to have to repeat the EQUIVALENCEs in each procedure. Yet what choice do we have? 
We might, for instance, set up each of the subfields as a DEFINE instead of an equivalence, making the DEFINEs available in all the procedures that reference EMPREC: DEFINE EMP'NAME = EMPREC(0) #; DEFINE EMP'SSN = EMPREC(15) #; DEFINE EMP'BIRTHDATE = EMPREC(20) #; DEFINE EMP'SALARY = EMPREC(22) #; but then, since DEFINEs are merely text substitutions and EMPREC is an integer array, each EMP'xxx will also be an integer array. We'd have to say BYTE ARRAY EMPREC'B(*)=EMPREC; DOUBLE ARRAY EMPREC'D(*)=EMPREC; REAL ARRAY EMPREC'R(*)=EMPREC; in each procedure that defines an EMPREC array, and a DEFINE EMP'NAME = EMPREC'B(0) #; DEFINE EMP'SSN = EMPREC'B(15) #; DEFINE EMP'BIRTHDATE = EMPREC'D(20) #; DEFINE EMP'SALARY = EMPREC'R(22) #; at the beginning of the program. Still, we'd have had to have the defines of the BYTE ARRAY, DOUBLE ARRAY, and REAL ARRAY repeated once for each declaration of EMPREC; and, what if we want to call the record something else, like have two records called EMPREC1 and EMPREC2? * THE PROBLEM WITH DEFINING SUBFIELDS OF A RECORD STRUCTURE USING THE "EQUIVALENCING" APPROACH IS THAT IT DEFINES THE SUBFIELDS OF ONLY ONE RECORD STRUCTURE VARIABLE. WHAT WE WANT IS TO DEFINE A GENERALIZED "TEMPLATE" ONCE AND THEN APPLY THIS TEMPLATE TO EACH RECORD STRUCTURE VARIABLE WE USE. 
In other words, we want to be able to say DEFINE'TYPE EMPLOYEE'REC (SIZE 24) BEGIN BYTE ARRAY NAME (*) = RECORD(0); BYTE ARRAY SSN (*) = RECORD(15); DOUBLE ARRAY BIRTHDATE (*) = RECORD(20); REAL ARRAY SALARY (*) = RECORD(22); END; and then declare any particular employee record buffer by saying: EMPLOYEE'REC EMPREC1; EMPLOYEE'REC EMPREC2; Then, we could extract each individual subfield of the record like this: NEW'SALARY := EMPREC1.SALARY * 1.1; The point here is that * IN ADDITION TO NOT HAVING TO EXPLICITLY SPECIFY THE OFFSET OF THE SUBFIELD OF THE RECORD (like having to say RECORD(22), an awful thing to do), WE CAN NOW DEFINE THE LAYOUT OF THE RECORD STRUCTURE ONCE, REGARDLESS OF HOW MANY VARIABLES WITH THAT STRUCTURE WE WANT TO DECLARE. Do you see how nicely this dovetails with the "INSULATING YOUR PROGRAM FROM CHANGING INTERNAL REPRESENTATION" principle we gave above? The record structure layout is defined in EXACTLY ONE PLACE in the program file. We can have a hundred different variables of this type -- none of them will have to specify the physical size of the buffer or the offsets of the subfields. Each one will merely refer back to the type declaration. Also, we've now announced EMPREC1 to the compiler as being of the special "EMPLOYEE'REC" type. It's no longer a simple INTEGER ARRAY, just like any other integer array. Conceivably, if we declare a procedure to be PROCEDURE PUT'EMPLOYEE (DBNAME, EMPREC, FRAMASTAT); INTEGER ARRAY DBNAME; EMPLOYEE'REC EMPREC; INTEGER FRAMASTAT; ... the compiler can warn us that EMPLOYEE'REC EMPREC; INTEGER ARRAY DBNAME; INTEGER FOOBAR; ... PUT'EMPLOYEE (EMPREC, DBNAME, FOOBAR); is an invalid call -- it sees that an object of type EMPLOYEE'REC is being passed in place of an INTEGER ARRAY, and an INTEGER ARRAY is being passed in place of an EMPLOYEE'REC. Without this error checking, you'd have to find this problem yourself at run-time, a distinctly more difficult task. 
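The hypothetical DEFINE'TYPE above is, of course, just a record structure, which is exactly what the next section introduces. As a foretaste, here is a minimal C sketch of the employee record (the field names are my own; `offsetof`, from Draft Standard C's <stddef.h>, asks the compiler where it placed a field):

```c
#include <stddef.h>   /* for offsetof() -- Draft Standard C */

/* A sketch of the employee record as a C structure.  The compiler
   computes every field offset itself: widening NAME from 30 to 40
   characters would move SSN, BIRTHDATE, and SALARY automatically,
   with no hand-edited indices anywhere in the program. */
typedef struct {
    char  name[30];     /* employee name                 */
    char  ssn[10];      /* social security number        */
    long  birthdate;    /* the "double integer" birthday */
    float salary;       /* monthly salary                */
} emp_record;
```

(The compiler may insert alignment padding, so these offsets need not match the word-by-word layout given earlier; the point is that nobody but the compiler has to know them.)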
RECORD STRUCTURES IN PASCAL AND C What I just gave is the rationale for record structures, mostly for the benefit of SPL programmers who haven't used PASCAL and C before. Of course, the only reason I gave it is that PASCAL and C do have record structure support, remarkably similar support at that. Here's the way you declare a structure data type in PASCAL: { "PACKED ARRAY OF CHAR"s are PASCAL strings } TYPE EMP_RECORD = RECORD NAME: PACKED ARRAY [1..30] OF CHAR; SSN: PACKED ARRAY [1..10] OF CHAR; BIRTHDATE: INTEGER; { really a double integer } SALARY: REAL; END; ... VAR EMPREC: EMP_RECORD; { declare a variable called "EMPREC" } And in C: typedef struct {char name[30]; char ssn[10]; long int birthdate; float salary; } emp_record; ... emp_record emprec; /* declare a variable called "emprec" */ You can see the minor differences -- the type names are different ("float" instead of "REAL", "long int" to mean double integer); the type name comes at the end of the "typedef"; the newly defined type is declared in a "statement" all its own rather than as part of a VAR statement; and, of course, everything's written in those CUTE lower-case characters. In essence, of course, the constructs are absolutely identical. The use is identical, as well: NEW_SALARY := EMPREC.SALARY * 1.1; new_salary = emprec.salary * 1.1; Incidentally, if we didn't want to define a new type, but rather just wanted to define one variable of a given structure, we could have said: VAR EMPREC: RECORD NAME: PACKED ARRAY [1..30] OF CHAR; SSN: PACKED ARRAY [1..10] OF CHAR; BIRTHDATE: INTEGER; { really a double integer } SALARY: REAL; END; struct {char name[30]; char ssn[10]; long int birthdate; float salary; } emprec; Note how the type declaration is very much like the original variable declaration. So, declaring and using record structures is identical in PASCAL and C. However, there's a VERY BIG DIFFERENCE between PASCAL and C. * In PASCAL, strict type checking is more than just a good idea, it's the LAW. 
If a function parameter is declared as type EMPLOYEE_REC, any function call to it must pass an object of that type. Even if it passes a record structure that's defined with exactly the same fields but with a different type name (admittedly a rare occurrence), the compiler will cough. Any structure parameter must be of EXACTLY THE RIGHT TYPE. * Many C programmers view strict type checking much as you or I might view, say, the Gestapo or the KGB. Kernighan & Ritchie C compilers DO NOT do type checking. In fact, in Kernighan & Ritchie C, you can pass a string where a real number is expected, and the compiler won't say a word! (On the other hand, your program is unlikely to work right.) I could fault C for this, treating C's lack of type checking much as I do, say, SPL's lack of an easy I/O facility. The trouble is that C programmers don't think that lack of type checking is a bug; they think it's a feature. The problem is philosophical -- what are the benefits of type checking and do they outweigh the drawbacks? TYPE CHECKING -- ORIGINAL STANDARD PASCAL AND PASCAL/3000 Earlier in the paper I brought up a certain point. Compilers that know the type of variables can, I said, check your code to make sure that you're not using types inconsistently. For instance, if you use a character when you should be using a real number, that's an "obvious error" and the compiler can do you a favor by complaining at compile-time. Similarly, if you pass an employee record to a procedure that expects a database name, that's also an error, and should also be reported. Now, this principle is in many ways at the heart of the PASCAL language. And, certainly, everyone will agree that it would be good for the compiler to find errors in your program rather than making you do it yourself. The question is -- IS A COMPILER WISE ENOUGH TO DETERMINE WHAT IS AN ERROR AND WHAT IS NOT? For instance, say you write VAR CH: INTEGER; IF ('a'<=CH) AND (CH<='z') THEN CH:=CH-'a'+'A'; Utterly awful! 
We have here what PASCAL sees as at least four type inconsistencies: we're comparing an integer against a character two times, and then we're adding and subtracting characters and integers! Obviously an error. Actually, of course, this code takes CH, which it assumes is a character's ASCII code, and upshifts it. If it finds that CH is a lower case character, it shifts it into the upper case character set by subtracting 'a' and adding 'A'. Some might complain that this code is not portable (it won't, for instance, work on EBCDIC machines), but that's not relevant. The programmer has a perfect right to assume that the code will run on an ASCII machine; you mustn't ram portability down his throat. Sometimes, it's very useful to be able to, say, treat characters as integers and vice versa. Now, before anybody accuses me of slandering PASCAL, I must point out that the solution is readily available. PASCAL can convert a character to an integer using the "ORD" function, and an integer to a character using "CHR"; our code could easily be re-written: VAR CH: INTEGER; IF (ORD('a')<=CH) AND (CH<=ORD('z')) THEN CH:=CH-ORD('a')+ORD('A'); The important point here is not whether or not you can upshift characters; the important fact is that: * SOMETIMES A PROGRAMMER MAY CONSCIOUSLY WANT TO DO THINGS THAT MIGHT USUALLY BE VIEWED AS TYPE INCOMPATIBILITIES. Consider, for a moment, the following application: * You want to write a procedure that adds a record to the database. Unlike DBPUT, this one should just take the database name, the dataset name, and the buffer, and do all the error checking itself. Sounds simple, no? You write: TYPE TDATABASE = PACKED ARRAY [1..30] OF CHAR; TDATASET = PACKED ARRAY [1..16] OF CHAR; TRECORD = ???; ... PROCEDURE PUT_REC (VAR DB: TDATABASE; S: TDATASET; VAR REC: TRECORD); BEGIN ... END; BUT HOW DO YOU DEFINE "TRECORD"? Remember why I said that type checking is such a wonderful thing. 
After all, if a procedure expects a "customer record" and you pass it an "employee record", you want the compiler to complain. But what if the procedure expects ANY kind of record? What if it'll be perfectly HAPPY to take an employee record, a sales record, a database name, or a 10 x 10 real matrix? How should the compiler react then? Unfortunately, PASCAL, with all its sophisticated type checking, falls flat on its face (this is true of both Standard PASCAL and PASCAL/3000). At this point, in the interest of fairness (and for the practical use of those who HAVE to do this sort of thing in PASCAL), I must point out that PASCAL does have a mechanism for supporting record structures of different types. The trick is to use a degenerate variation of the record structure called the "tagless variant" or "union" structure. It's quite similar to EQUIVALENCE in FORTRAN, but even uglier. To put it briefly, you have to say the following: TYPE TANY_RECORD = RECORD CASE 1..5 OF 1: (EMP_CASE: TEMPLOYEE_RECORD); 2: (CUST_CASE: TCUSTOMER_RECORD); 3: (VENDOR_CASE: TVENDOR_RECORD); 4: (INV_CASE: TINVOICE_RECORD); 5: (DEPT_CASE: TDEPARTMENT_RECORD); END; This defines the type "TANY_RECORD" to be a record structure which can be looked at in one of FIVE different ways: * As having one field called "EMP_CASE" which is of type "TEMPLOYEE_RECORD". * As having one field called "CUST_CASE" which is of type "TCUSTOMER_RECORD". * Or, as having one field called "VENDOR_CASE", "INV_CASE", or "DEPT_CASE", which is of type "TVENDOR_RECORD", "TINVOICE_RECORD", or "TDEPARTMENT_RECORD", respectively. You get the idea. If you declare a variable of type "TANY_RECORD", it'll be allocated with enough room for the largest of the component datatypes. Then, you can make the variable "look" like any one of these records by using the appropriate subfield: VAR R: TANY_RECORD; ... 
WRITELN (R.EMP_CASE.NAME); { views R as an employee record } WRITELN (R.DEPT_CASE.DEPTHEAD); { views R as a dept record } WRITELN (R.INV_CASE.AMOUNT); { views R as an invoice record } In other words, an object of type TANY_RECORD is actually five different record structures "equivalenced" together; which one you get depends on which ".xxx_CASE" subfield you use. Got all that? Now, here's how you define and call the PUT_REC procedure: PROCEDURE PUT_REC (VAR DB: TDATABASE; S: TDATASET; VAR REC: TANY_RECORD); BEGIN ... END; ... { now, all dataset records you need to pass must be declared to } { be of type TANY_RECORD. } READLN (R.EMP_CASE.NAME, R.EMP_CASE.SSN); R.EMP_CASE.BIRTHDATE := 022968; R.EMP_CASE.SALARY := MINIMUM_WAGE - 1.00; PUT_REC (MY_DB, EMP_DATASET, R); You must declare ALL YOUR DATASET RECORDS to be of type TANY_RECORD (wasting space if, say, TDEPARTMENT_RECORD is 10 bytes long and TINVOICE_RECORD is 200 bytes long); you must refer to them with the appropriate ".xxx_CASE" subfield; then, you must pass the TANY_RECORD to PUT_REC. (Alternately, you may have one "working area" record of type TANY_RECORD and move the record you want into the appropriate subfield of this "working area" record before calling PUT_REC.) As you may have guessed, I think this is a very poor workaround indeed: * You need to specify in the TANY_RECORD declaration every possible type that you'll ever want to pass to PUT_REC; * You have to declare any record you want to pass to PUT_REC to be of type TANY_RECORD, even if it wastes space. * If you don't want to use a "working area" record, you have to refer to all your records as "R.EMP_CASE" or "R.DEPT_CASE" rather than just defining R as the appropriate type and referring to it just as "R". * If you do use a "working area" record, to wit: VAR WORK_RECORD: TANY_RECORD; EMP_REC: TEMPLOYEE_RECORD; ... 
READLN (EMP_REC.NAME, EMP_REC.SSN); EMP_REC.BIRTHDATE := 022968; EMP_REC.SALARY := MINIMUM_WAGE - 1.00; WORK_RECORD.EMP_CASE := EMP_REC; PUT_REC (MY_DB, EMP_DATASET, WORK_RECORD); then you have to move your data into it before every PUT_REC call, which is both ugly and inefficient. And why? All because PASCAL isn't flexible enough to allow you to declare a parameter to be of "any type". A couple more examples of cases where strict type checking is utterly lethal may be in order: * Say that you want to write a procedure that compares two PACKED ARRAY OF CHARs (in Standard PASCAL, these are the only way of representing strings). You must define the types of your parameters, INCLUDING THE PARAMETER LENGTHS! In other words, TYPE TPAC = PACKED ARRAY [1..256] OF CHAR; VAR P1: PACKED ARRAY [1..80] OF CHAR; P2: PACKED ARRAY [1..80] OF CHAR; ... FUNCTION STRCOMPARE (VAR X1: TPAC; VAR X2: TPAC): BOOLEAN; BEGIN ... END; ... IF STRCOMPARE (P1, P2) THEN ... is ILLEGAL. P1, you see, is an 80-character string, which is not compatible with the function parameter, which is a 256-character string. * Say that you want to write a procedure like WRITELN, which will format data of various types. WRITELN may not be sufficient for your needs -- you might need to be able to output numbers zerofilled or in octal, you might want to provide for page breaks and line wraparound, etc. Surely you should be allowed to do this! Well, first of all, you can't have a variable number of parameters. But, even if you're willing to have a maximum of, say, 10 parameters and pad the list with 0s, your parameters must all be of fixed types! Thus, even if your design calls for some kind of "format string" that'll tell your WRITELN-replacement what the actual type of each parameter is, you can't do anything. 
You must either have a procedure for each possible type combination (one to output two integers and a string, one to output a real, an integer, and three strings, etc.), or have the procedure only output one entity at a time. This way, you'll have to write: PRINTS ('THE RESULT WAS '); PRINTI (ACTUAL); PRINTS (' OUT OF A MAXIMUM '); PRINTI (MAXIMUM); PRINTS (', WHICH WAS '); PRINTR (ACTUAL/MAXIMUM*100); PRINTS ('%'); PRINTLN; instead of PRINTF ('THE RESULT WAS %d OUT OF A MAXIMUM %d, WHICH WAS %f', ACTUAL, MAXIMUM, ACTUAL/MAXIMUM*100); * Finally -- although it should be obvious by now -- you can't write, say, a matrix inversion function that takes any kind of matrix. You could write a 2x2 inverter, a 3x3 inverter, a 4x4 inverter, and so on. You could also write a matrix multiplier that multiplies 2x2s by 2x2s, another that does 2x2s by 2x3s, another 2x2s by 2x4s, another 3x2s by 2x2s, .... Just think of the job security you'll have! For fairness's sake, I must admit that this problem is SLIGHTLY mitigated in PASCAL/3000. PASCAL/3000 has a "STRING" data type, which is a variable-length string (as opposed to PACKED ARRAY OF CHAR, which is a fixed-length string). In other words, PASCAL/3000 STRINGs are essentially (internally) record structures, containing an integer -- the current string length -- and a PACKED ARRAY OF CHAR -- the string data. When HP implemented this, they were good enough to make all STRINGs -- regardless of their maximum sizes -- "assignment-compatible" with each other. This means that you can say: VAR STR1: STRING[80]; STR2: STRING[256]; ... STR1:=STR2; and also TYPE TSTR256 = STRING[256]; VAR S: STRING[80]; ... FUNCTION FIRST_NON_BLANK (PARM: TSTR256): INTEGER; BEGIN ... END; ... I := FIRST_NON_BLANK (S); Since STRING[80]s (strings with maximum length 80) and STRING[256]s (strings with maximum length 256) are assignment-compatible, you may both directly assign them (STR1:=STR2) and pass one by value in place of another (PROC(S)). 
Although "assignment compatibility" allows by-value passing, a variable passed by reference still has to be of exactly the same type as the formal parameter specified in the procedure's header. Thus, TYPE TSTR256 = STRING[256]; VAR S: STRING[80]; ... FUNCTION FIRST_NON_BLANK (VAR PARM: TSTR256): INTEGER; BEGIN ... END; ... I := FIRST_NON_BLANK (S); is still illegal, since STRING[80]s can't be passed to by-reference (VAR) parameters of type STRING[256]. Fortunately, PASCAL/3000 also lets you say: FUNCTION FIRST_NON_BLANK (VAR PARM: STRING): INTEGER; Specifying a type of "STRING" rather than "STRING[maxlength]" allows you to pass any string in place of the parameter. This only works for STRING parameters. It doesn't work for PACKED ARRAYs OF CHAR; it doesn't work for other array structures; it isn't supported by Standard PASCAL. However, for the specific case of string manipulation, you can get around some of PASCAL's onerous parameter type checking restrictions. Remember also that this is strictly an HP PASCAL feature (PASCAL/3000 and PASCAL/XL), and cannot be relied on in any other PASCAL compiler. TYPE CHECKING -- KERNIGHAN & RITCHIE C Where PASCAL insists on checking all parameters for an exact type match, original -- Kernighan & Ritchie -- C takes the diametrically opposite view. Classic C checks NOTHING. It does not check parameter types; it does not even check the number of parameters. All data in C is passed "by value", which means that the value of the expression you specify is pushed onto the parameter stack for the called procedure to use; if you want to pass a variable "by reference" -- pushing its pointer onto the stack -- you have to use the "&" operator to get the variable's address, to wit: myproc (&result, parm1, parm2); If you omit the "&", or specify it when you shouldn't -- well, C doesn't check for this, either. 
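A minimal sketch of this by-reference convention (the procedure and variable names are invented for illustration):

```c
/* C passes everything by value, so a procedure that must change its
   caller's variable receives the variable's ADDRESS instead, and
   stores through the pointer. */
void myproc(int *result, int parm1, int parm2)
{
    *result = parm1 + parm2;   /* modifies the caller's variable */
}
```

The caller writes `myproc(&result, 2, 3);` -- and, as noted above, if the "&" is forgotten, a K&R compiler will silently pass the current VALUE of `result` where a pointer was expected.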
Much can be said about the philosophical reasons that C is this way; many labels, from "flexibility" to "cussedness" can be attached to it. The fact of the matter, though, is that K&R C -- which means many, if not most, of today's C compilers -- doesn't do any type checking. The effects of this, of course, are the opposite of the effects of PASCAL's strong type checking: * You have almost complete flexibility in what types you pass to a procedure. In two different calls, the same parameter can be one of two entirely different record structures; one of two character or integer arrays of entirely different lengths (C doesn't do run-time bounds checking, anyway); a real in one call, an integer in another, and a pointer in a third. Practically, virtually all of the examples I showed in the PASCAL chapter can thus be implemented in C. For instance, int strcompare(s1,s2,len) char *s1, *s2; int len; { int i; i = 0; while ((i < len) && (s1[i] == s2[i])) i = i+1; return (i == len); /* true if all LEN characters matched */ } will merrily compare two character arrays, no questions asked. You can pass arrays of any size, and it'll do the job. You can pass integers, reals, integer arrays, whatever; of course, the code isn't likely to work, but, hey, it's a free country -- nobody'll stop you. * In most implementations of K&R C, you're even allowed to pass a different number of parameters than the function was declared with. Though this is not guaranteed portable, most C compilers make sure that if, say, your procedure's formal parameters are "a", "b", and "c" (all integers) and you actually pass the values "1" and "2", then A will be set to 1, B to 2, and C will contain garbage (that's "C" the variable, not "C" the language). This is good because it allows you to write procedures that take a variable number of parameters; as long as you have a way of finding out how many parameters were actually passed (e.g. the PRINTF format string), your procedure can handle them accordingly. 
* On the other hand, say you make a mistake in a procedure call -- you pass a real instead of an integer, a value instead of a pointer, or perhaps even omit a parameter. The compiler won't check this; the only way you'll find the error is by running the program, and even then the erroneous results may first appear far away from the real error. Some C compilers (especially on UNIX) come with a program called LINT that can check for this error and others, but that's often not enough. First of all, your programmers have to run LINT as well as C for each program, which slows down the compilation pass; more importantly, since LINT is in no way part of standard C, many C compilers don't have it. VAX/VMS C, for instance, doesn't come with LINT; neither does the CCS C that's available on the HP3000. * Similarly, even things that seem like they ought to work -- passing an integer in place of a real and expecting it to be reasonably converted -- will fail badly. Thus, sqrt(100) won't work if SQRT expects a real; C won't realize that an integer-to-real conversion is required, and will thus pass 100 as an integer, which is a different thing altogether. A similar problem occurs on computers (like the HP3000) that represent byte pointers (type "char *") and integer pointers (type "int *" and other pointer types) differently. Since C doesn't know which type of pointer a procedure expects, it'll never do conversions. If you call a procedure like FGETINFO that expects byte pointers and pass it an integer pointer, you'll be in trouble (unless you manually cast the pointer yourself). Incidentally, for ease of using real numbers, C will automatically convert all "single-precision real" (called "float" in C) arguments to "double-precision real" ("double") in function calls. This makes sure that if SQRT expects a "double", passing it a "float" won't confuse it. 
* On the other hand (how many hands am I up to now?), C's conversion woes -- requirements of passing "float"s instead of "int"s, "char *"s instead of "int *"s, etc. -- are easier to solve than in PASCAL. Since C allows you to easily convert a value from one datatype to another (using the so-called "type casts"), you could say my_proc ((float) 100, (char *) &int_value); and thus pass a "float" and a "char *" to "my_proc". In PASCAL you couldn't do things this easily. The compiler might automatically translate an integer to a float for you; but, if it expects a character value and all you've got is an integer, there's no easy way for you to tell it "just pass this integer as a byte address, I know what I'm doing." Thus, K&R C is flexible enough to do all that Standard PASCAL can not. If this is necessary to you -- and I can easily understand why it would be; Standard PASCAL's restrictions are very substantial -- then you'll have to live with C's lack of error checking. On the other hand, if flexibility is of less than critical value, you have to ask yourself whether or not you want the extra level of compiler error checking that PASCAL can provide you. My personal experience, incidentally, has been that compiler error checking of parameters is very nice, but not absolutely necessary. I'd love to have the compiler find my bugs for me, but I can muddle through without it. PASCAL's restrictions, though, are substantially more grave. More than inconveniences, they can make certain problems almost impossible to solve. DRAFT ANSI STANDARD C Time, it is said, heals all wounds; perhaps it can also heal wounded computer languages. God knows, FORTRAN 77 isn't the greatest, but it sure is better than FORTRAN IV. The framers of the new Draft ANSI Standard C have apparently thought about some of the problems that C has, especially the ones with function call parameter checking and conversion. 
The solution seems to be quite good, letting you impose either strict or loose type checking -- whichever you prefer -- for each procedure or procedure parameter. Remember, though, the standard is still only Draft, so it's not unlikely that any given C compiler you might want to use won't have it. In Draft Standard C, you can do one of two things: * You can call a procedure the same old way that you'd do in K & R C. No type checking, no automatic conversion, no nothin'. You might declare its result type, to wit: extern float sqrt(); (Remember, you'd have to do that anyway in K&R C; otherwise, the compiler will treat SQRT's result as an integer.) But no other declarations are required, and no checking will be done. * Alternatively, you can declare a FUNCTION PROTOTYPE. This can be done either for an external function or for one you're defining -- the prototype is very much like PASCAL's procedure header declaration. A sample might be: extern int ASCII (int val, int base, char *buffer); or simply extern int ASCII (int, int, char *); [Note that the parameter NAMES, as opposed to TYPES, are not necessary in a prototype for an EXTERNAL function. For a function that you're actually defining, the names are necessary; the declarations in the prototype are used in place of the type declarations that you'd normally specify for the function parameters.] This function prototype tells the compiler enough about the function parameters for it to be able to do appropriate type checking and conversion. One of the reasons K&R C couldn't do that is precisely because of the lack of this information. Consider the cases where this would come in handy. We might declare SQRT as extern float sqrt (float); and then a call like sqrt (100) would automatically be taken to mean "sqrt ((float) 100)", i.e. "sqrt (100.0)". Similarly, sqrt (100, 200) or sqrt () would cause a compiler error or warning, since now the compiler KNOWS that SQRT takes exactly one parameter. 
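To see the prototype-driven conversion at work, here is a small sketch (it assumes an ANSI-style compiler and the standard <math.h>, which supplies the prototype "double sqrt(double)"; the function name is invented):

```c
#include <math.h>   /* prototype in scope: double sqrt(double) */

/* Because the prototype is visible, the integer argument 100 is
   quietly converted to 100.0 before the call; under K&R rules the
   raw bits of the int would have been pushed instead. */
double root_of_hundred(void)
{
    return sqrt(100);   /* no explicit (double) cast needed */
}
```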
In general, say that you have a function declared as extern int f(formaltype); /* or non-extern, for that matter */ This simply means that "f" is a function that returns an "int" and takes one parameter of type "formaltype". Now, say that your code looks like: actualtype x; ... i = f(x); Is this kind of call valid or not? Of course, it depends on what "formaltype" and "actualtype" are: * If both FORMALTYPE and ACTUALTYPE are numbers -- integers or floats, short, long, or whatever -- then X is converted to FORMALTYPE before the call. This is what lets us say sqrt(100) when "sqrt" is declared to take a parameter of type "float". (The same goes the other way -- if "mod" is declared to take two "int"s, then "mod(10.5,3.2)" would be converted to "mod(10,3)", although the compiler might print a warning message to caution you that a truncation is taking place.) * If FORMALTYPE is a pointer -- which is the case for all "by-reference" parameters, since that's how we pass things by reference in C -- then ACTUALTYPE must be EXACTLY the same type. In other words, if we say: int copystring (char *src, char *dest) then in the call char x; int y; ... copystring (x, &y); BOTH parameters will cause an error message. The first parameter will be a "CHAR" passed where a "CHAR *" is expected, which is illegal -- a good way of checking for attempts to pass parameters by value where by-reference was expected. The second parameter will be an "INT *" passed where a "CHAR *" is expected, which is also illegal, since although both are pointers, they don't point to the same type of thing. * If ACTUALTYPE is a pointer, then FORMALTYPE must also be a pointer of EXACTLY the same type. Again, this is useful for catching attempts to pass "by-reference" calls to procedures that expect "by-value" parameters, and also attempts to pass a pointer to the wrong type of object. * If either ACTUALTYPE or FORMALTYPE is a pointer of the special type "void *", then the other one may be any type of pointer. 
This is very useful when we want a parameter to be a BY-REFERENCE parameter of some arbitrary type (similar to PASCAL/XL's ANYVAR, for which see below). Thus, if we want to write our "put_rec" procedure that'll put any type of record structure into a database, we'd say:

   put_rec (char *dbname, char *dbset, void *rec)

Then, we could say:

   typedef struct {...} sales_rec_type;
   typedef struct {...} emp_rec_type;
   ...
   sales_rec_type srec;
   emp_rec_type erec;
   ...
   put_rec (mydb, sales_set, &srec);
   ...
   put_rec (mydb, emp_set, &erec);

Both of the PUT_REC calls are valid since both "&srec" and "&erec" (and, for that matter, any other pointer) can be passed in place of a "void *" parameter. If we'd declared "put_rec" as:

   put_rec (char *dbname, char *dbset, sales_rec_type *rec)

then the "put_rec (mydb, emp_set, &erec)" call would NOT be legal, since "&erec" is NOT compatible with "sales_rec_type *". Note that on some machines -- including the HP3000 -- integer pointers and character pointers are NOT represented the same way. However, it's always safe to pass either a "char *" or an "int *" in place of a parameter that's declared as a "void *". The C compiler will always do the appropriate conversion; thus, if we declare the ASCII intrinsic as

   extern int ASCII (int, int, void *);

then both of the calls below:

   char *cptr;
   int *iptr;
   ...
   i = ASCII (num, 10, cptr);
   ...
   i = ASCII (num, 10, iptr);

will be valid (assuming that a "void *" is actually represented as a byte pointer, which is what the ASCII intrinsic wants). You can thus think of "void *" as the "most general type"; any pointer can be successfully passed to a "void *".

* Note that although you CAN'T pass, say, a "char *" to a parameter of type "int *", C will ignore the SIZE of the array the pointer to which is being passed. In other words, a function such as

   extern strlen (char *s);

may be passed a pointer to a string of any size -- both of the following calls:

   char s1[80], s2[256];
   ...
   i = strlen (s1);
   i = strlen (s2);

are valid. Remember that C makes no distinction between a "pointer to an 80-byte array" and a "pointer to a 256-byte array"; similarly, it makes no distinction between an array like "s1" and a "pointer to a character" (see below).

* An interesting exception to the above rules is that the integer constant 0 can be passed to ANY pointer parameter. This is because a pointer with value 0 is conventionally used to mean a "null pointer". This is quite useful in some applications, but can often prevent the compiler from detecting some errors. If I say:

   extern PRINT (int *buffer, int len, int cctl);
   ...
   PRINT (0, -10, current_cctl);

this won't, of course, print a "0"; rather, it'll pass PRINT the integer pointer "0", which will point to God knows what in your stack. Not a very serious problem, but something you ought to keep in mind.

* Unlike Standard PASCAL, not only can you entirely waive parameter checking for a procedure (just omit the prototype!), but you can also explicitly CAST an actual parameter whenever you want it to match the type of a formal parameter. In other words, say that you declare two structure types:

   typedef struct {...} rec_a;
   typedef struct {...} rec_b;
   rec_a ra;   /* declare a variable of type "rec_a" */
   rec_b rb;   /* declare a variable of type "rec_b" */

and then write a function

   process_record_a (int x, int y, rec_a *r)
   { ... }

If you then say

   process_record_a (10, 20, &rb);

then the compiler will (quite properly) print an error message, since you were trying to pass a "pointer to rec_b" instead of a "pointer to rec_a". If you really want to do this, though, all you need to do is say:

   process_record_a (10, 20, (rec_a *) &rb);

manually CASTING the pointer "&rb" to be of type "rec_a *", and the compiler won't mind.

* Finally, let me also point out that, like everywhere in C, an "array of T" and a "pointer to T" are mutually interchangeable.
In other words, if you say:

   extern int string_compare (char *s1, char *s2);

and then call it as:

   char str1[80], str2[256];
   ...
   if (string_compare (str1, str2)) ...

the compiler won't mind. To it a "char *" and a "char []" are really one and the same type. Somewhat (but not exactly) similarly -- perhaps I should say, similarly but differently -- the NAME OF A FUNCTION can be passed to a parameter that is expecting a POINTER TO A FUNCTION. In other words, if you write a procedure

   int do_function_on_array_elems (int (*f)(), int *a, int len);

(which takes a pointer to a function, a pointer to an integer, and an integer), and then call it as:

   do_function_on_array_elems (myfunc, xarray, num_xs);

the compiler won't complain (assuming, of course, that MYFUNC is really a function and not, say, an integer or a pointer). To summarize, then, Draft Proposed ANSI Standard C lets you check function parameters almost as precisely as Standard PASCAL. The differences are:

* You can ENTIRELY INHIBIT PARAMETER CHECKING for all function parameters by just omitting the function prototype.

* You can declare a parameter to BE A BY-REFERENCE PARAMETER OF AN ARBITRARY TYPE by declaring it to be of type "void *". You can do this while still enforcing tight type checking for all the other parameters.

* In addition to overriding type checking on a PROCEDURE BASIS or PROCEDURE PARAMETER basis, you can also override type checking on a particular call by simply casting the actual parameter to the formal parameter's datatype.

* Unlike PASCAL, C will never check the SIZE of an array parameter; only its TYPE.

STANDARD "LEVEL 1" PASCAL TYPE CHECKING -- CONFORMANT ARRAYS

If you recall, one of the PASCAL features I most complained about was the inability to pass arrays of different sizes to different procedures.
This essentially prevents you from writing any sort of general array handling routine, including:

* For PACKED ARRAYs OF CHAR -- the way that Standard PASCAL represents strings -- you can't write things like blank trimming routines, string searches, or anything that's intended to take PACKED ARRAYs OF CHAR of different sizes.

* For other arrays, the problem is exactly the same -- you can't write matrix handling routines that work with arbitrary sizes of arrays, e.g. matrix addition, multiplication, division, etc.

This wasn't the only type checking problem (others included the inability to pass various record types to database I/O routines, etc.), but it was a major one. The ISO Pascal Standard, released in the early 80's, addresses this problem. A new feature called "conformant arrays" has been defined; PASCAL compilers are encouraged, but not required, to implement it. A compiler is said to

* "Comply at level 0" if it does not support conformant arrays;

* "Comply at level 1" if it does support them.

You see the problem -- who knows just how many new PASCAL compilers will include this feature? It is a fact that most compilers written before the ISO Standard do NOT include it. PASCAL/3000, for instance, does not have it; PASCAL/XL, on the other hand, does. What are "conformant arrays"? To put it simply, they are FUNCTION PARAMETERS that are defined to be ARRAYS OF ELEMENTS OF A GIVEN TYPE, but whose bounds are NOT defined. Instead, the compiler makes sure that the ACTUAL BOUNDS of whatever array parameter is ACTUALLY passed are made known to the procedure. An example:

   FUNCTION FIRST_NON_BLANK
      (VAR STR: PACKED ARRAY [LOWBOUND..HIGHBOUND: INTEGER] OF CHAR): INTEGER;
   VAR I: INTEGER;
   BEGIN
   I:=LOWBOUND;
   WHILE (I<HIGHBOUND) AND (STR[I]=' ') DO
      I:=I+1;
   FIRST_NON_BLANK:=I;
   END;

This procedure is intended to find the index of the first non-blank character of STR.
Note how it declares STR: Instead of specifying a constant lower and upper bound in the PACKED ARRAY [x..y] OF CHAR declaration, it specifies TWO VARIABLES. When the procedure is entered, the variable LOWBOUND is automatically set to the lower bound of whatever array the user actually passed, and HIGHBOUND is set to the upper bound of the array. In other words, if we say:

   VAR MYSTR: PACKED ARRAY [1..80] OF CHAR;
   ...
   I:=FIRST_NON_BLANK (MYSTR);

then, in FIRST_NON_BLANK, the variable LOWBOUND will be set to 1 and HIGHBOUND will be set to 80. Instead of just passing the MYSTR parameter, PASCAL actually passes "behind your back" 1 and 80 as well. The way I see it, this is a very good solution, even better in some ways than C's (in which you can always pass an array of any arbitrary size):

* You're no longer restricted (like you are in Standard PASCAL) to a fixed size for your array parameters.

* When you pass an array to a conformant array parameter, you don't have to manually specify the size of the array; the array bounds are automatically passed for you. If I were to write the same procedure in C, I'd have to say

   int first_non_blank (str, maxlen)
   char str[];
   int maxlen;
   ...

and then manually pass it both the string and the size that it was allocated with; otherwise, the procedure won't know when to stop searching (assuming you don't use the convention that a string is terminated by a null or some such terminator).

* Since the compiler itself knows what the conformant array parameter's bounds are (it doesn't know the actual values, but it does know what variables contain the values), it can emit appropriate run-time bounds checking code. This can automatically catch some errors at run-time, which is good if you like heavy compiler-generated error checking.

* Conformant arrays are even better for two-dimensional arrays.
To index into a two-dimensional array the compiler must, of course, know the number of columns in the array (assuming it's stored in row-major order, as C and PASCAL 2-D arrays are). In C, you must either declare the number of columns as a constant, e.g.

   matrix_invert (m)
   float m[][100];

or declare the parameter as a 1-D array, pass the number of columns as a parameter, and then do your own 2-D indexing, to wit:

   matrix_invert (m, numcols)
   float m[];
   int numcols;
   ...
   element = m[row*numcols+col];   /* instead of M[ROW,COL] */
   ...

In ISO Level 1 PASCAL, you just declare the procedure as:

   PROCEDURE MATRIX_INVERT
      (M: ARRAY [MINROW..MAXROW: INTEGER; MINCOL..MAXCOL: INTEGER] OF REAL);

Then you automatically know the bounds of the array AND can also do normal array indexing (M[ROW,COL]), since the compiler knows the number of columns, too. This, it seems, is how original Standard PASCAL should have worked, and I'm glad that the standards people have established it. The only problems are:

* This is, of course, somewhat less efficient than not passing the bounds or just passing, say, the upper bound (like you would in C).

* Remember that this only fixes the case where we want to pass differently sized arrays to a procedure. If we want to pass different TYPES (like in our PUT_REC procedure that should accept one of several database record types), conformant arrays won't help us.

* Most importantly, MANY PASCAL COMPILERS MIGHT NOT SUPPORT THIS WONDERFUL FEATURE. In particular, PASCAL/3000 DOES NOT SUPPORT CONFORMANT ARRAYS.

PASCAL/XL TYPE CHECKING

PASCAL/XL obeys all of PASCAL's type checking rules, but gives you a number of ways to work around them:

* PASCAL/XL supports the CONFORMANT ARRAYS that I just talked about.

* PASCAL/XL allows you to specify a variable as "ANYVAR", e.g.

   PROCEDURE PUT_REC (VAR DB: TDATABASE;
                      S: TDATASET;
                      ANYVAR REC: TDBRECORD);

What this means to PASCAL is that, when PUT_REC is called, the third parameter (REC) will NOT be checked.
Inside PUT_REC, you'll be able to refer to this parameter as REC, and to PUT_REC it'll have the type TDBRECORD; however, the CALLER need not declare it as TDBRECORD. For instance,

   VAR SALES_REC: TSALES_REC;
       EMP_REC: TEMP_REC;
   ...
   PUT_REC (MY_DB, SALES_DATASET, SALES_REC);
   ...
   PUT_REC (MY_DB, EMP_DATASET, EMP_REC);

will do EXACTLY what we want it to -- it'll pass SALES_REC and EMP_REC to our PUT_REC procedure without complaining about their data types. As I said, PUT_REC itself will view the REC parameter as an object of type TDBRECORD. However, PUT_REC can say SIZEOF(REC) and determine the TRUE size of the actual parameter that was passed in place of REC. This can be very useful if PUT_REC needs to do an FWRITE or some such operation that needs to know the size of the thing being manipulated. The way this is done, of course, is by PASCAL/XL's passing the size of the actual parameter as well as the parameter's address. Incidentally, you can turn this off for efficiency's sake if you're not going to use this SIZEOF construct.

* PASCAL/XL allows you to do TYPE COERCION -- you can take an object of an arbitrary type and view it as any other type. For instance, you can take a generic "ARRAY OF INTEGER" and view it as a record type, or take an INTEGER parameter and view it as a REAL. A possible application might be:

   TYPE COMPLEX = RECORD
           REAL_PART, IMAG_PART: REAL;
        END;
        INT_ARRAY = ARRAY [1..2] OF INTEGER;
   ...
   PROCEDURE WRITE_VALUE (T: INTEGER; ANYVAR V: INT_ARRAY);
   BEGIN
   IF T=1 THEN WRITELN (V[1])
   ELSE IF T=2 THEN WRITELN (REAL(V))
   ELSE IF T=3 THEN WRITELN (BOOLEAN(V))
   ELSE IF T=4 THEN WRITELN (COMPLEX(V).REAL_PART, COMPLEX(V).IMAG_PART);
   END;

As you see, this procedure takes a type indicator (T) and a variable of any type V. Then, depending on the value of T, it VIEWS V as an integer, a real, a boolean, or a record structure of type COMPLEX.
All we need to do is say typename(value) and it returns an object with EXACTLY THE SAME DATA as "value", but viewed by the compiler as being of type "typename". Note that this means that "REAL(10)" won't return 10.0 (which is what a C "(float) 10" type cast would do); rather, it'll return the floating point number the MACHINE REPRESENTATION of which is 10. Some other example applications for this very useful construct are:

- You can now have a pointer variable that can be set to point to an object of an arbitrary type; this allows you to write things like generic linked list handling procedures that work regardless of what type of object the linked list contains. More about this under ANYPTR below.

- You may write a generic bit extract procedure that can be used for extracting bits from characters, integers, reals, etc. You'd declare it as:

   FUNCTION GETBITS (VAL, STARTBIT, LEN: INTEGER): INTEGER;
   ...

and call it using

   I:=GETBITS (INTEGER(3.0*EXP(X)), 10, 6);

or

   I:=GETBITS (INTEGER(MYSTRING[I]), 5, 1);

or whatever. Note that you couldn't do this with ANYVAR parameters since ANYVAR parameters are by-reference, and thus can't be passed constants or expressions.

* PASCAL/XL -- just like PASCAL/3000 -- makes STRING parameters of any size compatible with each other. Thus, you can pass a STRING[20] to a procedure that's defined to take a STRING[256]; or, if you're passing the string by REFERENCE, you can just declare the formal parameter as "STRING", which will be compatible with any string type.

* PASCAL/XL has a new type called "ANYPTR"; declaring a variable to be an ANYPTR makes it "assignment-compatible" with any other pointer type, which means that that variable can be easily made to point to objects of different types. This, coupled with the "type coercion" operation mentioned above, makes manipulating, say, linked lists of different data structures much easier.
Needless to say, use of any of these constructs can get you into trouble precisely because of the additional freedom they give you. Converting a chunk of data from one record data type to another only makes sense if you know exactly what you're doing; if you don't, you're likely to end up with garbage. However, often there are cases where you NEED this additional freedom, and in those cases, PASCAL/XL really comes through. As a rule, its type checking is as stringent and thorough as Standard PASCAL's, but it allows you to relatively easily waive the type checking whenever you need to.

ENUMERATED DATA TYPES

If you recall, before I started talking about type checking, I was describing RECORD STRUCTURES, a new data type that PASCAL and C support. My mind, you see, works like a stack -- sometimes I'll interrupt what I'm doing and go off on a digression (sometimes relevant, sometimes not); then, I'll just POP the stack, and I'm back where I was before. So, I'm popping the stack and continuing with the discussion of "new" data types -- data types that C and/or PASCAL support, but SPL does not. Say you want to call the FCLOSE intrinsic. You pass to it the file number of the file to be closed and you also pass the file's DISPOSITION. This disposition is a numeric constant, indicating what the system is to do with the file being closed:

   FCLOSE (FNUM, 0, 0);  means just close the file;
   FCLOSE (FNUM, 1, 0);  means SAVE the file as a permanent file;
   FCLOSE (FNUM, 2, 0);  means save the file as a TEMPORARY file;
   FCLOSE (FNUM, 3, 0);  means save the file as a temporary file, but if
                         it's a tape file, DON'T REWIND;
   FCLOSE (FNUM, 4, 0);  means DELETE the file being closed;

[we'll ignore for now the "squeeze" disposition and the fairly useless third parameter.] Now, naturally, today's enlightened programmer doesn't want to specify the disposition as a numeric constant -- how many people will understand what's going on if they see a

   FCLOSE (FNUM, 4, 0);

in the middle of the program?
Instead, we'd define some constants --

   EQUATE DISP'NONE     = 0,
          DISP'SAVE     = 1,
          DISP'TEMP     = 2,
          DISP'NOREWIND = 3,
          DISP'PURGE    = 4;

Now, we can say

   FCLOSE (FNUM, DISP'PURGE, 0);

Don't you like this better? I knew you would. As you see, in this case, an integer is being used not as a QUANTITATIVE measure (how large a file is, how many seconds an operation took, etc.), but rather as a sort of FLAG. This flag has no mnemonic relationship to its numeric value; the numeric value is just a way of encoding the operation we're talking about (save, purge, etc.). This sort of application actually occurs very frequently. Some examples might include:

* FFILEINFO item codes, which indicate what information is to be retrieved (4 = record size, 8 = filecode, 18 = creator id, etc.).

* CREATEPROCESS item numbers, which indicate what parameter is being passed (6 = maxdata, 8 = $STDIN, 11 = INFO=, etc.).

* FOPEN foptions bits -- 1 = old permanent, 2 = old temporary, 4 = ASCII file, 64 = variable record length file, 256 = CCTL file, etc.; same for aoptions bits.

* And many other cases; each system table you look at, for instance, will contain at least two or three of these sorts of encodings.

As I mentioned, SPL's solution to this sort of problem is just declaring constants (using EQUATE). Similarly, in PASCAL you could easily say:

   CONST DISP_NONE     = 0;
         DISP_SAVE     = 1;
         DISP_TEMP     = 2;
         DISP_NOREWIND = 3;
         DISP_PURGE    = 4;

and in C, you could code:

   #define disp_none     0
   #define disp_save     1
   #define disp_temp     2
   #define disp_norewind 3
   #define disp_purge    4

Nice and readable; the constant declaration creates the link between the symbolic name and the real numeric value -- after this, you can use the symbolic name wherever you need to. Enumerated data types are just like this, only different.
In PASCAL, you could say

   TYPE FCLOSE_DISP_TYPE = (DISP_NONE, DISP_SAVE, DISP_TEMP,
                            DISP_NOREWIND, DISP_PURGE);

Instead of just defining five constants with values 0, 1, 2, 3, and 4, this declaration defines a new DATA TYPE called FCLOSE_DISP_TYPE and five OBJECTS of this type -- DISP_NONE, DISP_SAVE, etc. If you declare the FCLOSE intrinsic as

   PROCEDURE FCLOSE (FNUM: INTEGER;
                     DISP: FCLOSE_DISP_TYPE;
                     SEC: INTEGER); EXTERNAL;

you'll now be able to say

   FCLOSE (FNUM, DISP_PURGE, 0);

The key difference between an "ENUMERATED TYPE" declaration and the ordinary constant definitions is that the objects of the data type can't be used as integers. For instance, saying this:

   VAR DISP: FCLOSE_DISP_TYPE;
   ...
   DISP:=1;

is an error, and you certainly can't say:

   DISP:=DISP_SAVE*DISP_PURGE;

In fact, if you've declared FCLOSE as was shown above, then PASCAL will even check the DISP parameter to make sure you're really passing a DISP_xxx; if you accidentally pass something else, the compiler will catch this and complain. As you see, the advantage of enumerated types is type checking (a field which PASCAL, in general, is rather compulsive about). The disadvantage is this:

* How are you certain that when you declared the enumerated type, DISP_SAVE actually corresponded to 1 and DISP_PURGE to 4? In other words, when you pass a disposition to FCLOSE, PASCAL must pass it as some integer value -- if you had declared it as a constant, you'd KNOW the value; with an enumerated type, how are you sure?

Well, although Standard PASCAL doesn't define what the "ACTUAL" value of an enumerated type object is, most PASCALs -- including PASCAL/3000 and PASCAL/XL -- number the objects from 0 up, in the order given in the enumerated type declaration. This is what lets our FCLOSE_DISP_TYPE type work; the way that PASCAL allocates the numeric values of the DISP_xxx objects is exactly the way we want it to. On the other hand, say that I want to define file system error numbers (which FCHECK might return).
These are also special numeric codes that we'd like to access using symbolic names, but they are NOT sequentially ordered. For instance, you might want to declare

   CONST FERR_EOF      = 0;
         FERR_NO_FILE  = 52;
         FERR_SECURITY = 93;
         FERR_DUP_FILE = 100;

How can you declare this as an enumerated data type? Well, you can't, unless you're willing to declare 51 "dummy items" between FERR_EOF and FERR_NO_FILE so that FERR_NO_FILE will fall on 52. In general, wherever there are "holes" in the sequence, enumerated types can't really be used. Now, this is not a complaint against enumerated types per se. Enumerated types are great as long as YOU DON'T CARE WHAT THE VALUES OF THE ENUMERATED TYPE OBJECTS ARE; if the type is used solely within your programs, you won't have any problems. The trouble comes in when you try to use enumerated types for objects whose values are dictated externally. To summarize,

* Enumerated types are very similar to constant declarations.

* Enumerated types' big advantages are:

  - The compiler does type checking for them, making sure that you don't accidentally use, say, an FOPEN foptions mode where you ought to use an FCLOSE disposition.

  - You don't have to manually assign a numeric value to each enumerated type object.

* Enumerated types are great if they're defined and used solely within your program, where you don't care what values the compiler assigns to each object.

* If you're using enumerated types to represent objects whose actual value is important -- say FCLOSE dispositions, FFILEINFO item numbers, file system errors -- you may have troubles. If the actual values are numbered sequentially starting with 0, you can use an enumerated type to represent these values; if they don't start with 0 or are not sequential, you can't really use an enumerated type.

* In general, even if the values ARE numbered sequentially from 0 (like FCLOSE dispositions, FFILEINFO item numbers, CREATEPROCESS item numbers, etc.)
you might want to use constants instead of enumerated types. This is because the numeric assignments aren't easily visible in enumerated type declarations; if you accidentally omit a possible value (e.g. declare the type as (NONE, SAVE, TEMP, PURGE), omitting NOREWIND), it won't be at all obvious that PURGE now has the wrong value.

ENUMERATED DATA TYPES IN C

Classic K&R C did not support enumerated types; as we saw, this probably wasn't such a big disadvantage, since enumerated types are just a fancy form of constant declarations. Draft ANSI Standard C -- and, in fact, most modern Cs -- supports enumerated types; you can say

   typedef enum
     { disp_none, disp_save, disp_temp, disp_norewind, disp_purge }
     fclose_disp_type;

which will define the type FCLOSE_DISP_TYPE just like PASCAL's enumerated type declaration will. In fact, the numeric values of DISP_NONE, DISP_SAVE, etc. will even be assigned the same way as they would be with PASCAL. The trouble is this: what was the major advantage of PASCAL enumerated types over constants? Well, once the PASCAL compiler knew that a variable was of an enumerated type, it could do appropriate type checking. But C isn't a strong type checking language! To C, any object of any enumerated type is viewed exactly as an integer would be viewed.
Thus, the above declaration is EXACTLY the same as:

   #define disp_none     0
   #define disp_save     1
   #define disp_temp     2
   #define disp_norewind 3
   #define disp_purge    4

If you say

   fclose_disp_type disp;

(thus declaring DISP to be an object of FCLOSE_DISP_TYPE), you can now code

   disp = disp_save;

but you could also (if you wanted to) say

   disp = 1;

or

   disp = (i+123)/7;

One advantage that C has, though, is that (unlike PASCAL) it allows you to explicitly specify the numeric values of each element in the enumeration, to wit:

   typedef enum
     { ferr_eof = 0, ferr_no_file = 52, ferr_security = 93,
       ferr_dup_file = 100 }
     file_error_type;

The DEFAULT sequence, you see, is from 0 counting up by 1; however, you can override it with any initializations you want. In other words, in C, enumerated type declarations are truly just another way of defining integer constants. The above declaration is in fact identical to

   #define ferr_eof      0
   #define ferr_no_file  52
   #define ferr_security 93
   #define ferr_dup_file 100

To summarize:

* Enumerated data types in PASCAL = Just like ordinary constants + Type checking.

* Enumerated data types in Draft ANSI Standard C = Enumerated data types in PASCAL - Type checking.

* Ergo, Enumerated data types in Draft ANSI Standard C = Just like ordinary constants!

See how easy things become if you use a little mathematics?

SUBRANGE TYPES IN PASCAL

Another new category of data type that PASCAL has is the so-called subrange type. It is in some ways the quintessential PASCAL feature because it really performs NO NEW FUNCTION except for allowing additional compiler type checking. In PASCAL, you can declare a variable thus:

   VAR V: 1..100;

This means that V is defined to always be between 1 and 100. It is an error for it to be outside of these bounds, and the PASCAL compiler may generate code to check for this (PASCAL/3000 certainly does). Now, fortunately, the type checking on this isn't quite as stringent as on other types.
In other words, if you declare:

   TYPE RANGE1 = 1..10000;
        SMALL_RANGE = 100..199;
   VAR SM: SMALL_RANGE;
   ...
   PROCEDURE P (NUM: RANGE1);

then you can still call

   P (SM);

even though SMALL_RANGE and RANGE1 are not the same type. On the other hand, if NUM is a BY-REFERENCE (VAR) parameter, i.e.

   PROCEDURE P (VAR NUM: RANGE1);

then saying

   P (SM);

will be an error! Any by-reference parameter MUST be an IDENTICAL type (i.e. either the same type or one defined as identical, i.e. "TYPE NEWTYPE = OLDTYPE"). Different subranges of the same type (even two differently-named and separately-defined types whose definitions are identical!) are FORBIDDEN. If the full implications of this haven't sunk in yet, consider this procedure:

   TYPE TPAC256 = PACKED ARRAY [1..256] OF CHAR;
   PROCEDURE COUNT_CHARS (VAR S: TPAC256;
                          VAR NUM_BLANKS: INTEGER;
                          VAR NUM_ALPHA: INTEGER;
                          VAR NUM_NUMERIC: INTEGER;
                          VAR NUM_SPECIALS: INTEGER);

This one's simple -- it goes through a string S and counts the number of blanks, alphabetic characters, numeric characters, and other "special" characters; all the counts are returned as integer VAR parameters. The variables that we pass as NUM_BLANKS, NUM_ALPHA, NUM_NUMERIC, and NUM_SPECIALS can NOT be declared as subrange types! If we say:

   VAR NBLANKS, NALPHA, NNUMERIC, NSPECIALS: 1..256;
   ...
   COUNT_CHARS (S, NBLANKS, NALPHA, NNUMERIC, NSPECIALS);

the compiler WON'T just check the variables after the COUNT_CHARS call to make sure that COUNT_CHARS didn't set them to the wrong values; rather, THE COMPILER WILL PRINT AN ERROR MESSAGE! If you still insist on using subrange types for this sort of thing, you get into the ridiculous circumstance in which YOU NEED A SEPARATE COUNT_CHARS PROCEDURE FOR EACH POSSIBLE TYPE COMBINATION OF THE NUM_BLANKS, NUM_ALPHA, NUM_NUMERIC, AND NUM_SPECIALS PARAMETERS! This is why I'm skeptical of the utility of subrange variables.
It's great for the compiler to be able to do run-time error checking and warn me of any errors in my program; however, I can never really pass "by reference" subrange variables to any general-purpose routine! On the one hand, we are told that it's great to have lots of general-purpose utility procedures that can be called by a number of other procedures in a number of possible circumstances; on the other hand, we're prevented from doing this by too-stringent type checking! Thus, to summarize:

* Subrange types are theoretically useful as a way of giving the compiler more information with which to do run-time checking.

* However, their utility is SERIOUSLY COMPROMISED by the fact that you can't, for instance, pass a subrange type by reference to a procedure that expects an INTEGER (or vice versa). This is especially damaging if you like to (and you should like to) write general-purpose procedures -- your only serious alternative there is to declare any by-reference parameters' types to be INTEGER and make sure that all the variables you'd ever want to pass to such a procedure are type INTEGER too.

SPL and C, not being very strict type-checkers, don't support subranges. In light of all I've said, this doesn't seem to be such an awful lack. Finally, one more important comment about subrange types. As I mention in the "BIT OPERATORS" section of this paper, subrange types (in PACKED RECORDs and PACKED ARRAYs) are PASCAL/3000's and PASCAL/XL's mechanism for accessing bit fields. This is NOT endorsed or supported by the PASCAL Standard, but it turns out to be one of the most useful applications of subrange types in PASCAL/3000 and PASCAL/XL.

DATA ABSTRACTION

When I was converting our MPEX/3000 product to run on both the pre-Spectrum and Spectrum machines, I had to overcome several problems. One was, of course, that some (although not all) of the privileged procedures and operations that I did had to be done somewhat differently on MPE/XL.
My conversion here was helped by the concept of "code isolation" -- rather than putting various calls to, say, DIRECSCAN or FLABIO all over my program, I isolated them in individual procedures. Then, all I had to do was replace those "wrapping" procedures, and all of the programs that called them didn't have to be changed. Another problem was that some of the tables (like the file label, directory entries, ODD, JIT, etc.), though similar in principle and containing much the same fields, had different offsets for those fields. Here I was helped by the fact that I never explicitly referenced, say, the filecode field of the file label by saying "FLAB(26)". Instead, I had an $INCLUDE file that DEFINEd the token "FLAB'FCODE" to be "FLAB(26)" -- all I had to do was change the $INCLUDE file and again the rest of my programs didn't need to be changed. One area, though, that gave me more trouble than I would have expected was the changing size of some fields. Not the changing meaning -- the file label still contained a record size field and a block size field, and the directory still contained a file label address -- but rather the changing SIZE. The record size and the block size were now 2 words rather than 1; the file label address was 10 words instead of 2. Consider a few of my procedures:

   DOUBLE PROCEDURE ADDRESS'FNUM (FNUM);
   VALUE FNUM;
   INTEGER FNUM;
   << Given a file number, returns its file label's disc address. >>

   DOUBLE PROCEDURE ADDRESS'NAME (FILENAME);
   BYTE ARRAY FILENAME;
   << Given a filename, returns its file label's disc address. >>

   PROCEDURE FLABREAD (ADDR, FLABEL);
   VALUE ADDR;
   DOUBLE ADDR;
   ARRAY FLABEL;
   << Given a file label's disc address, reads the file label. >>

The plan here is that FLABREAD is the master file label read procedure, to which we pass a disc address.
We can either say

   FLABREAD (ODD'FLAB'ADDR, FLABEL);   << if we have the address >>

or

   FLABREAD (ADDRESS'FNUM(IN'FNUM), FLABEL);

or

   FLABREAD (ADDRESS'NAME(PROG'FILENAME), FLABEL);

Convenient, readable, general. What's wrong with it? This mechanism was quite acceptable for MPE/III, MPE/IV, and MPE/V because then the disc address was a double integer. In MPE/XL it changed to a 10-word array. Any places that explicitly refer to it as a DOUBLE must be changed to call it an INTEGER ARRAY. "Data abstraction" refers to exactly this concern. Don't call a disc address "a double integer". Rather call it "an object of type DISC_ADDRESS". In PASCAL terms, don't say:

   PROCEDURE FLABREAD (ADDR: INTEGER; VAR F: FLABEL);

Say

   TYPE DISC_ADDRESS = INTEGER;   { double integer }
   ...
   PROCEDURE FLABREAD (ADDR: DISC_ADDRESS; VAR F: FLABEL);

Then, when you need to change to MPE/XL, all you need to do is change the TYPE declaration to

   TYPE DISC_ADDRESS = ARRAY [1..10] OF SHORTINT;   { 10 words }

and you're home free. Of course, you'll doubtless have to change the IMPLEMENTATION of FLABREAD (if the disc address format has changed, probably the way of accessing it has, too); however, you won't have to touch any of the CALLERS of FLABREAD. So that's the first component of data abstraction -- the responsibility of the programmer for declaring objects not with the type they happen to have -- say, INTEGER -- but rather with some "abstract type" (DISC_ADDRESS) that is defined elsewhere as INTEGER. The second component of data abstraction, though, is much less obvious. Say that you said

   { in PASCAL }
   TYPE DISC_ADDRESS = ARRAY [1..10] OF SHORTINT;
   ...
   FUNCTION ADDRESS_FNUM (FNUM: INTEGER): DISC_ADDRESS;

   { in C }
   typedef int disc_addr[10];
   ...
   disc_addr address_fnum (fnum);
   int fnum;

   { in SPL }
   DEFINE DISC'ADDRESS = INTEGER ARRAY #;
   DISC'ADDRESS ADDRESS'FNUM (FNUM);
   VALUE FNUM;
   INTEGER FNUM;

All of these declarations would make sense -- instead of returning an integer, ADDRESS'FNUM is to return an object of type DISC_ADDRESS, which happens to be an integer array. The trouble here is that in neither Standard PASCAL nor C nor SPL can a procedure return an integer array! Thus, "hiding" the type of an object from most of the object's users is very nice, but ONLY IF THE COMPILER PERMITS IT TO REMAIN HIDDEN.

In another example, in SPL, saying

   FOR I:=1 UNTIL RECSIZE DO

is only legal if RECSIZE is an integer. If RECSIZE is a double integer, all the data abstraction in the world will do us no good because the SPL compiler itself will reject the above FOR loop. To be truly able to have "data abstraction" -- to be able to not care about an object's underlying representation type -- the compiler must treat all the possible types as equally as possible.

Considering again the case of the disc address, there's no way we can have an SPL procedure return anything that can represent a 10-word value. We'd have to write ADDRESS'FNUM as

   PROCEDURE ADDRESS'FNUM (FNUM, RETURN'VALUE);
   VALUE FNUM;
   INTEGER FNUM;
   INTEGER ARRAY RETURN'VALUE;

and then call it as:

   INTEGER ARRAY TEMP'DISC'ADDR(0:9);
   ...
   ADDRESS'FNUM (FNUM, TEMP'DISC'ADDR);
   FLABREAD (TEMP'DISC'ADDR, FLAB);

instead of simply

   FLABREAD (ADDRESS'FNUM(FNUM), FLAB);

This is, of course, less convenient, which is why I kept the address as a double integer instead of an integer array -- and got stuck when I converted to Spectrum.

In Standard PASCAL, as I said, I couldn't have a function returning an integer array. PASCAL/3000, though, lifts this restriction -- you can now say:

   TYPE DISC_ADDRESS = ARRAY [1..10] OF SHORTINT;
   ...
   FUNCTION ADDRESS_FNUM (FNUM: INTEGER): DISC_ADDRESS;
   ...
   FLABREAD (ADDRESS_FNUM(FNUM), FLAB);

Since the function is allowed to return an integer array, we can keep the same interface regardless of whether DISC_ADDRESS is a double integer or an array. Of course, the efficiency of the code won't be quite the same; similarly, the internals of ADDRESS_FNUM would doubtless be somewhat different. However, the callers of ADDRESS_FNUM wouldn't have to change a whit despite the change in the underlying definition of the DISC_ADDRESS type.

In C (K&R or Draft Standard), functions can't return arrays, either. However, they can return structures, and a structure might well contain only one element -- an array. Thus, we could say

   typedef struct {int x[10];} disc_address;
   ...
   disc_address address_fnum(fnum)
   int fnum;
   ...
   flabread (address_fnum(fnum), flab);

Of course, it isn't quite as convenient to manipulate an object of type DISC_ADDRESS as it would be if it were a simple array (instead of "discaddr[3]=ldev", we have to say "discaddr.x[3]=ldev"), but this is a reasonable alternative. Again, note how we can easily switch the underlying representation of DISC_ADDRESS to, say, a double integer, or a long float, or whatever, without changing the fundamental structure of the procedures that use DISC_ADDRESSes.

Similarly, compare the SPL treatment of INTEGERs and DOUBLEs against the PASCAL treatment of SHORTINTs (1-word integers) vs. INTEGERs (2-word integers) or the C treatment of "short int"s (1-word) vs. "int"s (2-word). In SPL, INTEGERs and DOUBLEs are mutually INCOMPATIBLE -- you can't say:

   DOUBLE D;
   INTEGER I;
   D:=I+D;

In PASCAL, though,

   TYPE SHORTINT = -32768..32767;  { this is built into PASCAL/XL }
   VAR S: SHORTINT;
       I: INTEGER;
   I:=I+S;

will work, as will

   short int s;
   long int i;
   i = i + s;

in C. A similar thing, incidentally, can be said about SPL, PASCAL, and C's handling of real numbers.
SPL's REAL and LONG (double precision) types are incompatible; in PASCAL and C dialects where two floating-point types are provided (remember, neither language is OBLIGATED to provide more than one floating-point type), the floating-point types are always mutually compatible. What this means, of course, is that it's quite easy in PASCAL or C to change the type of a variable from "short integer" or "short real" to "long integer" or "long real", or vice versa; in SPL, it's quite difficult, since we'll have to put in a lot of manual type conversions to make sure everything stays consistent.

To summarize, then, the differences in the way the various languages handle data types: [Note: "STD PAS" refers to both Standard PASCAL and the ISO Level 1 Standard.]

                                          STD   PAS/  PAS/  K&R   STD
                                    SPL   PAS   3000  XL    C     C
   CAN A FUNCTION RETURN ANY
   OBJECT?
     CAN IT RETURN AN ARRAY?        NO    NO    YES   YES   NO    NO
     CAN IT RETURN A RECORD?        NO    NO    YES   YES   NO    YES
   CAN A FUNCTION OR A PROCEDURE
   HAVE ANY OBJECT AS A
   "BY-VALUE" PARM?
     CAN IT HAVE A BY-VALUE
     ARRAY?                         NO    YES   YES   YES   NO    NO
     CAN IT HAVE A BY-VALUE
     RECORD?                        NO    YES   YES   YES   YES   YES
   DOES AN ASSIGNMENT STATEMENT
   COPY ANY TYPE OF OBJECT?
     CAN IT COPY AN ARRAY?          NO    YES   YES   YES   NO    NO
     CAN IT COPY A RECORD?          NO    YES   YES   YES   NO    YES
   CAN YOU MIX, SAY, "INTEGER"S
   AND "DOUBLE"S IN AN EXPRESSION?  NO    YES   YES   YES   YES   YES
   CAN YOU MIX, SAY, "REAL"S AND
   "LONGREAL"S IN AN EXPRESSION?    NO    YES   YES   YES   YES   YES

The more similar the treatment of various types, the easier it is to achieve data abstraction -- and thus to insulate a program from the underlying representation that a particular type might have.

I/O IN PASCAL AND C

You can't write a program without I/O -- that's obvious enough. Even minimally sophisticated programs, especially system programs, need to be able to do many I/O-related things.
This doesn't just mean reading and writing; it means direct I/O (by record number rather than serial), building new files, opening old files, deleting files, checking to see if files exist, and so on. Of course, here we run into the classic problem of portability vs. functionality. Nowhere do operating systems vary more than in their file systems and the modes of I/O that they support; implementing I/O in a portable programming language can be a nightmare for the language designers.

PASCAL and C I/O are substantially different in many respects. Standard PASCAL and PASCAL/3000 I/O are different too, and PASCAL/XL adds a couple more interesting quirks. And, of course, Kernighan & Ritchie C and Draft Standard C have their differences as well -- what fun!

Before I go further, some ground rules have to be established. There are two ways to talk about I/O (or any other feature of a language):

* We can discuss the BUILT-IN I/O mechanisms; in PASCAL's case, this includes WRITE, READ, WRITELN, READLN, GET, PUT, and the like -- in C's it includes "fopen", "fclose", "getc", "putc", "printf", "scanf", etc.

* We can discuss how EXTENSIBLE the I/O mechanism is. Since I/O systems differ on all machines, no standard portable language can include all the features that are available on all computers. Thus, the question arises -- how easily can we use additional, machine-related features, together with the standard I/O facility? In other words, do we have to choose "all standard" vs. "all native mode" or can we, say, open a file using our particular computer's I/O system and then read it using the language's facility?

This, I believe, is an important distinction. It's true that PASCAL and C are "extensible" languages -- as long as a hook is available to the machine-specific system procedures (e.g. INTRINSICs in PASCAL/3000), we can use the host's I/O system (e.g. FOPEN, FREAD, FWRITE, FCLOSE). But what's the point of re-inventing the wheel?
We'd like the built-in I/O system to satisfy most of our needs, both for portability's sake and convenience's sake. On the other hand, we know that some machine-dependent features won't be included in either the standard or even the particular machine implementation. How do you expect, for instance, to have PASCAL/3000 know about RIO files? You have to have some means of accessing the native I/O procedures (e.g. HP's FOPEN, FCLOSE, etc.), but more than that -- you have to be able to use a maximum of the language's I/O mechanism combined with the necessary minimum of the host's non-portable I/O system.

In other words, you shouldn't be forced to either use RESET, READLN, and WRITELN or FOPEN, FREAD, FWRITE, and FCLOSE, but not both. For instance, you ought to be able to call FOPEN to open a file in a special mode but then use READLN and WRITELN against it; or, conversely, open the file using RESET or REWRITE and then be able to call built-in procedures like FGETINFO or FREADLABEL against it. This will be both easier to write and more portable -- when you port the program, you'll only have to change the small system-dependent part.

STANDARD PASCAL

I have a theory about SPL. I believe that the main reason why SPL/3000 isn't more popular in the HP3000 community is not that it has, say, an ASSEMBLE statement or a TOS construct. Nobody HAS to use ASSEMBLEs or TOSes. Rather, the problem was that you CAN'T DO SIMPLE I/O IN SPL. You want to write a program that adds two numbers? The addition statement is simple:

   INTEGER NUM1, NUM2, RESULT;
   RESULT:=NUM1+NUM2;

Ah, but the I/O!

   INTRINSIC ASCII, BINARY, PRINT, READX;
   INTEGER ARRAY BUFFER'L(0:39);
   BYTE ARRAY BUFFER(*)=BUFFER'L;
   INTEGER LEN;
   LEN:=READX (BUFFER'L, -80);
   NUM1:=BINARY (BUFFER, LEN);
   LEN:=READX (BUFFER'L, -80);
   NUM2:=BINARY (BUFFER, LEN);
   ...
   LEN:=ASCII (RESULT, 10, BUFFER);
   PRINT (BUFFER'L, -LEN, 0);

And this is without prompting the user, or printing any string constants at all!
How can a beginning programmer get anything DONE this way? For that matter, think of the trouble that even an EXPERT has to go to to do anything useful! Note that in SPL, you have complete FLEXIBILITY -- you can call any intrinsic, open a file in any mode, do I/O with any carriage control. But, since you have no built-in I/O interface to make all these features easy to use, you have to go through a lot of effort to do what you need to do. Like life itself -- everything is possible but nothing is easy.

PASCAL -- having originally been designed as a teaching language -- naturally placed a premium on quick "start-up" time. Terminal I/O, for instance, of either strings or numbers, isn't difficult; READ, READLN, WRITE, and WRITELN can do appropriate formatting. File I/O, however, is quite a bit less flexible, and even terminal I/O lacks some rather valuable features. Consider the set of Standard PASCAL I/O operators:

* READ and READLN can be used to read data from a file.

* WRITE and WRITELN can be used to write data to a file.

* PAGE can be used to trigger a form feed.

* RESET and REWRITE "open" files for reading or writing, respectively; there is no standard way to specify the system filename that corresponds to a particular PASCAL file.

* GET, PUT, and file buffer variables allow you to work with the file in a slightly different way than READ and WRITE; we won't discuss these much in this section, since for the most part they're quite similar to READ and WRITE.

* EOLN allows you to detect an end-of-line condition in text input.

* EOF allows you to determine whether or not the NEXT read against a file will get an end-of-file. This is very nice, since it allows you to say:

     WHILE NOT EOF(F) DO
       BEGIN
       READLN (F, X, Y, Z);
       ...
       END;

  as opposed to, say, the SPL solution, with which you have to repeat the read twice:

     FREAD (F, REC, 128);
     WHILE = DO
       BEGIN
       ...
       FREAD (F, REC, 128);
       END;

This is what PASCAL has -- what doesn't it have?
* There is no standard way of telling PASCAL to output a "prompt" -- a string not followed by a carriage return/line feed. A vital operation, I'm sure you'll agree, and surely any computer can support it -- why doesn't Standard PASCAL include it?

* There is no standard way of accessing a file using "direct-access" -- reading or writing by record number instead of serially (like FREADDIR and FWRITEDIR do). Even FORTRAN IV supports this (READ (10@RECNUM))!

* There is no standard way of indicating exactly what file you want to open. Most PASCALs associate some default system filename with each file declared in the program (e.g. the PASCAL file "EMPFILE" may be associated with the MPE filename "EMPFILE"); but what if you don't know the filename at compile time? Portability, incidentally, isn't a concern here. There are plenty of very portable programs that require this feature -- say, a simple file copier, a text editor, etc.

* Standard PASCAL allows you to open a file for read access or for write access. You can't open a file for APPEND access or INPUT/OUTPUT access, both very common requirements. Again, why not? Almost every operating system supports these access modes!

* Of course, no provision is made for other, equally portable and equally important features like closing a file, deleting files, creating files, checking if a file exists, not to mention, say, renaming a file.

* No standard mechanism exists for detecting errors in file operations. If, say, a file open (RESET or REWRITE) fails, the program is typically aborted by the PASCAL run-time library. What about graceful recovery? How would you like, say, a command-driven file-manipulation program that aborted with a compiler library error message whenever you gave it a bad filename?

* The lack of error handling is particularly grave in READs from text files. It's great that PASCAL will parse the numeric input for you, but what if the user enters an invalid number?
Surely you don't just want the entire program to abort!

* WRITELN and READLN are rather simple-minded. No provision exists for left- vs. right-justification, octal or hex output of numbers, mandatory sign specification (i.e. print a "+" if the number is positive, rather than printing no sign at all), and a number of other useful things.

I find this to be a rather substantial set of inadequacies, especially if we want to use PASCAL as a system programming language. Now, all those problems are a property of Standard PASCAL. I'll be the first to admit that virtually all PASCAL implementations work around at least some of these things (after all, if they didn't, the language wouldn't be usable). However, remember the advantages of STANDARDS. Some PASCAL compilers might call the prompt function PROMPT and others might just use WRITE; some might have an APPEND procedure to open a file for append access and others might have this as an option to a general OPEN procedure. A general language like PASCAL is great for writing portable code, and surely there's nothing non-portable about wanting to prompt the user or append to a file! But, the more implementation-dependent features we have to use, the more portability we'll lose.

PASCAL/3000

The designers of PASCAL/3000 knew about Standard PASCAL's I/O deficiencies, and they introduced a number of features to correct them:

* PROMPT has been added -- this is just like WRITELN, but prints its stuff without a carriage return/line feed.

* READDIR, WRITEDIR, and SEEK do direct I/O; they are equivalent to FREADDIR, FWRITEDIR, and FPOINT.

* RESET and REWRITE allow you to specify the filename of the file to be opened, for input or output access, respectively.

* OPEN allows you to open a file for input/output access; APPEND lets you open for appending.

* CLOSE lets you close a file; procedures like LINEPOS, MAXPOS, and POSITION let you find out various information about an open file (a very small subset of FGETINFO).
  CLOSE has an option that lets you purge the open file or save it as temporary.

* FNUM allows you to get the file number of any open PASCAL file, thus letting you call any file system intrinsic (like FGETINFO, FRENAME, etc.) on a PASCAL file. This is a MAJOR and VITAL flexibility feature, because otherwise you would have to do your I/O on a particular file using either ONLY the PASCAL I/O system or ONLY the MPE I/O system, but never both.

* Finally, a very intricate and hard-to-use mechanism (XLIBTRAP) has been implemented to catch either I/O errors or string-to-number conversion errors. To use it, you have to use XLIBTRAP, the low-level WADDRESS procedure, and a global variable; look at the example in the HP Pascal manual under TRAPPING RUN-TIME ERRORS (pages 10-21 through 10-23 in the OCT 83 issue of the manual) to convince yourself that there's GOT to be a better way.

This of course makes things a lot more bearable. Still, some things remain unresolved:

* PASCAL/3000 allows you to open a file for input, output, input/output, and append access. It also allows you to indicate CCTL vs. NOCCTL and SHARED vs. EXCLUSIVE. This is very nice, but, of course, MPE allows ten times this many options -- how about opening temporary files, opening files for OUTKEEP access, specifying record size, file limit, ASCII vs. BINARY, etc.? You can use FNUM to go from a PASCAL file variable to an MPE file number and thus use MPE intrinsics on a PASCAL-opened file. You ought to be able to do the converse of this -- open a file using FOPEN and then use PASCAL features (like READLN and WRITELN) on this file. You can't -- if you need to open, say, a temporary file, you'd have to FOPEN it and then use your own FREADs and FWRITEs. (Actually, you could use a :FILE equate issued using the COMMAND intrinsic; this, however, is much more cumbersome, doesn't support all the FOPEN intrinsic options, and prevents you from allowing the user to issue his own file equations for the file.)
* Error trapping, as I said, is still very hard to do.

The first two, I think, are the most serious problems. Since so much systems programming involves file handling, flexible and resilient file system operations are, I believe, a MUST for any system programming language. PASCAL/3000 is a lot better than Standard PASCAL, but it still has some flaws.

PASCAL/XL

PASCAL/XL's I/O is even better than PASCAL/3000's. PASCAL/XL adds a couple of features that make its I/O capability almost complete:

* The ASSOCIATE built-in procedure:

     ASSOCIATE (pascalfile, fnum);

  Very simple. Makes the given PASCAL FILE variable point to the file indicated by FNUM. Thus, you can call FOPEN with whatever options your heart desires, and then use all of PASCAL's I/O facilities against that file. Such a deal!

* TRY .. RECOVER. This construct is described in more detail in the "CONTROL STRUCTURES" chapter of this paper -- and a very powerful construct it is -- but for file I/O it lets you do this:

     REPEAT
       ERROR:=FALSE;
       PROMPT ('Enter filename: ');
       READLN (FILENAME);
       TRY
         OPEN (FILENAME);
       RECOVER
         ERROR:=TRUE;  { will branch here in case of error }
     UNTIL NOT ERROR;

  You just wrap a "TRY" and a "RECOVER" around the file operation that might get an error, and the statement after "RECOVER" will get branched into in case of error (instead of having the program abort). Similarly, you can say:

     REPEAT
       ERROR:=FALSE;
       PROMPT ('Enter an integer: ');
       TRY
         READLN (I);
       RECOVER
         ERROR:=TRUE;
     UNTIL NOT ERROR;

  Still not QUITE as easy as I'd like it to be, but a lot better than before.

The only trouble is that -- at least for you and me and the rest of the HP3000 user community -- PASCAL/XL is still a "future" language; we can't really tell how good it is until we've hacked at it for some time and have seen all the implications of the various new features. Still, the PASCAL/XL I/O system seems to be an eminently reasonable and usable creature.
KERNIGHAN & RITCHIE C

The original "Kernighan & Ritchie" book, which for practical purposes was the original C "standard", has a chapter describing the C I/O library. Its first sentence was "input and output facilities are not part of the C language", which, while technically true, proved practically incorrect. By the very act of inclusion of the I/O chapter into the K & R C book, this I/O library became as "standard" as the rest of the C language described therein -- which is to say, not entirely standard, but nonetheless surprisingly compatible on virtually all modern machines. Note that the same can not be said of the next chapter, "THE UNIX SYSTEM INTERFACE", and I won't consider the features listed there as part of standard C.

The list of standard C I/O features differs from standard PASCAL I/O features:

* The C I/O facility emphasizes what are known in PASCAL as TEXT files -- files that are viewed as streams of characters. In PASCAL you can declare a "FILE OF RECORD" or "FILE OF ARRAY [0..127] OF INTEGER" and read an entire record or 128-element array at a time. In C you'd have to read this array character-by-character. Note that from a performance point of view, this may not be a problem, since virtually all C implementations buffer their I/O in rather big chunks -- 256 single-byte reads in C shouldn't be much slower than a single 128-word read. Still, this kind of character-read loop is more cumbersome than one would like.

* C provides the GETC and PUTC primitives to read and write a character at a time. C also provides an UNGETC primitive that "ungets" the last character you've read, effectively moving the file pointer back by one byte and assuring that the next character you'll read will be the same one you've just read. This is surprisingly useful, especially for parsing.

* C provides FSCANF and FPRINTF to do formatted I/O. These are rather more powerful than PASCAL's READLN and WRITELN -- see the FORMATTED I/O: C vs. PASCAL section below.
* FGETS and FPUTS read an entire line at a time. Nothing much -- just like PASCAL's READLN and WRITELN of strings.

* FOPEN (not to be confused with the MPE intrinsic of the same name!) lets you open an arbitrary file for read, write, or append access. Unlike Standard PASCAL, C allows you to specify the name of the file you want to open. FCLOSE closes an open file.

* End of file is indicated by a special return condition from GETC and PUTC. In PASCAL, of course, the special EOF procedure returns you the end of file indication, and all attempts to read at an end of file cause a program abort. Each method has its advantages.

* Records in a file are delimited not by a special "line delimiter" as in PASCAL, but rather by the ordinary ASCII character "NEWLINE". The exact ASCII value of this character is left up to the compiler's discretion -- it's usually a LINEFEED (10), but sometimes a CARRIAGE RETURN (13); however, this character is always available in C as '\n', so you can say something like:

     if (getc(stdin) == '\n')
       /* do end of line processing */

* If you want to skip to the next line in a file (or on the terminal), you have to output a newline character. Thus, instead of

     WRITELN ('HELLO, WORLD!')

  you'd say

     fprintf (stdout, "hi, wld!\n");
       /* no C programmer would actually SPELL OUT "hello" */
       /* or "world". */

  This implies that just by omitting the "\n", you can prevent the skip-to-the-next-line. Thus,

     printf ("name? ");   /* same as 'fprintf (stdout, "name? ");' */
     scanf ("%s", &name);

  will presumably prompt the user with "name? " and request input on the same line. In most PASCALs (including PASCAL/3000), if you do a WRITE followed by a READ, the prompt won't actually come out until a subsequent WRITELN -- the WRITE output will be "buffered" until "flushed" by a WRITELN.
As best I can tell, no C standard would actually prevent this behavior in a C compiler; however, most C compilers do the right thing and flush any pending terminal output before doing terminal input.

* Error handling is different from PASCAL's:

  - If FOPEN can't open a file, it returns a special value to the program (unlike PASCAL, which aborts the program). The program can then check for this condition and handle it appropriately. I like this, even if there's no standard way to find out exactly what kind of error occurred.

  - FSCANF behaves differently from PASCAL READLN. If you use FSCANF to read an integer it won't print an error if it sees a letter or some special character; rather, it'll just consider that that character has delimited the read of the integer. 0 is returned as the value of the integer, and the file pointer points to the non-numeric character that stopped the read. Then, you can use GETC to make sure that the character was really a newline or a blank or whatever it is that you expected; or, you can check the result of FSCANF (which will be set to the number of items actually read) to see if all the items that you were asking for were really given. This is a lot better than PASCAL's approach of just aborting and giving the program no chance to recover gracefully.

  - Error conditions for the other functions (except for end of file on GETC) are not defined by K&R.

Seeing how I tore apart Standard PASCAL's I/O facility earlier, you might expect the following complaints from me about C's I/O:

* As I mentioned, K&R C can't gracefully handle reads of, say, records or large arrays from files. It emphasizes flexible-format text files rather than fixed-format binary files.

* You can't read or write a record at a particular record number (direct access); you can only access the file serially.

* You can't open a file for input/output access.

* No delete/create/check-if-file-exists support. How sad!
DRAFT ANSI STANDARD C

Draft ANSI Standard C has expanded the standard C I/O library quite dramatically. A number of useful (and often confusing) features now exist:

* Input/output file opens are supported.

* I/O in units of more than one character is allowed; FREAD and FWRITE (again, no relationship to the MPE intrinsics) let you easily read or write structures and arrays from/to files.

* Direct I/O is provided using the FSEEK procedure, which positions the file pointer in a file. Then you can use any of the read/write mechanisms (GETC, PUTC, FSCANF, FPRINTF, FREAD, FWRITE, etc.) to do the I/O at the new location.

* Error handling is more concrete. Presumably, none of the new services are allowed to abort in case of error; the FERROR procedure returns to you the error status (combination of a has-error-occurred flag and some kind of error number) of an operation.

* REMOVE and RENAME, which delete and rename files, are provided.

* Various other new features of various utility and arcaneness have been added.

These all look quite nice, and seem to satisfy me as thoroughly as -- or more so than -- PASCAL/XL. Note, however, that the only two languages that I'm happy with are ones that barely exist and in which I've done virtually no serious programming. This may say something about my character; it also says something about the pitfalls of comparing "new-improved-we'll-have-them-for-you-Real-Soon-Now" languages. Both PASCAL/XL and Draft Standard C SEEM nice, but who knows how and whether they'll actually work?

One other thing that I ought to point out: as you recall, in the discussion of PASCAL/3000 and PASCAL/XL I sang the praises of FNUM, which returns the system file number of a PASCAL file variable, and ASSOCIATE, which initializes a file variable to point to a given system file number. The reason for this was to allow you to mix PASCAL and native file system I/O.
Naturally, Draft Standard C, being a portable standard, doesn't discuss these features; however, I wouldn't like to use any particular implementation of C that doesn't support FNUM- and ASSOCIATE-like operations. I hope that HP's C/XL provides them; I know that CCS's C/3000 provides both.

FORMATTED I/O: C vs. PASCAL

In the previous discussion, we talked about the I/O operations that C and PASCAL allow. Two of the most useful ones, of course, are formatted write and formatted read -- this is what allows you to input and output numbers (so hard to do using SPL). Standard PASCAL lets you input (READ, READLN) and output (WRITE, WRITELN) characters, strings, integers, reals, and booleans. A sample call might look like:

   WRITELN (STRING_VALUE, INT_VALUE:10, REAL_VALUE:10:2);
     { write a string, an integer right-justified in a 10-character
       field, and a real number in a 10-character field with 2
       characters after the decimal }

or

   READLN (STR, INT, REALNUM);

These procedures allow you to

* Read entities delimited by blanks.

* Write values right-justified in a fixed-format field.

* Write values "free-format", i.e. in as many characters as they need (this is done by setting the "width" parameter in WRITELN to a size smaller than that needed to fit the entire number; e.g., WRITELN (I:0)).

* Write real numbers in either exponential format (comparable to FORTRAN's Ew.d) or fraction format (Fw.d).

* Output booleans as the strings "TRUE" or "FALSE"; PASCAL/3000 expands this to allow WRITEing any variable belonging to an enumerated type as its symbolic equivalent. Thus, if COLOR is of type (MAUVE, PUCE, AQUA) and has value PUCE, it'll be output as "PUCE", rather than, say, 1, which might happen to be PUCE's integer representation.

This is quite a nice set of functions, but quite obviously there are some important features missing:

* The ability to output data to a program string variable, rather than a file. PASCAL/3000 has this feature (STRREAD and STRWRITE).
* Output in hex or octal, vital for a system programming language.

* Left-justified as well as right-justified output.

* Money format ("123,456,789").

* As mentioned before, some way of reading numbers without having the program abort in case the number is invalid (AS I SAID, THIS IS VERY IMPORTANT!).

Less important but desirable features include:

* Padding with zeroes (e.g. printing 123 as "00123"; especially important in octal and hex).

* Always printing the sign, even if the number is positive.

The most important failing of PASCAL's READ, WRITE, et al. is, in fact, one of the less obvious ones:

* If you're dissatisfied with the way READ and WRITE work, IT'S VERY DIFFICULT FOR YOU TO WRITE YOUR OWN.

Think about it -- say you wanted to add a "money format" output facility. You'd like to write a procedure called MYWRITELN that's just like WRITELN, but allows its caller to somehow specify that a particular type REAL parameter is to be output in money format. What could you do? Remember:

* In Standard PASCAL and PASCAL/3000 you can't have your functions have a variable number of parameters.

* In Standard PASCAL and PASCAL/3000 you can't have your functions take parameters of flexible data types.

* Almost incidentally to all this, READ, WRITE, READLN, and WRITELN are the only "procedures" that allow you to specify auxiliary parameters like field width and fraction digits using a ":".

You see, PASCAL documentation calls READ, READLN, WRITE, and WRITELN "procedures", but they're not like the procedures that we mere mortals can write. If we want to write our "money-format output" procedure, we have to give it exactly one data parameter of type REAL and a couple of parameters indicating the field width and fraction digits.
A typical call to this might look like:

   WRITE ('COMMISSIONS ARE ');
   WRITEMONEY (COMMISSIONS, 15, 2);
   WRITE (' OUT OF A TOTAL OF ');
   WRITEMONEY (TOTAL, 15, 2);
   WRITELN;

Instead of being able to stick this all into one WRITELN, we have to have a special procedure that takes exactly one value to be output, making us write five lines rather than one. [In PASCAL/3000, we can avoid this by having a function, FMTMONEY, that returns a string instead of outputting it, and then write the call as "WRITELN ('COMMISSIONS ARE ', FMTMONEY(COMMISSIONS, 15, 2), ...)"; however, this is both fairly inefficient and quite non-portable, since Standard PASCAL doesn't allow functions to return string results.]

Note that this all applies only to Standard PASCAL and PASCAL/3000. PASCAL/XL's winning new features might very well extinguish this particular problem.

So much for PASCAL. How about C? Standard C's WRITELN is called PRINTF (or FPRINTF, if you want to print to a file rather than the terminal); READLN is called SCANF (or FSCANF, to read from a file). Examples might be:

   printf ("%s %10d %10.2f", string_value, int_value, real_value);

or

   scanf ("%s %d %f", &string_value, &int_value, &real_value);
     /* The "&"s are needed to indicate that the address of the
        variable is to be passed, not the actual value. */

Note how both PRINTF's and SCANF's first parameters are "control strings" that indicate the format of the input or output. Incidentally, they also tell PRINTF and SCANF how many parameters they are to expect and what the type of each parameter will be. If you make an error in the control strings, beware! You'll get VERY interesting results. In any event, PRINTF's and SCANF's features include:

* Output of integers, in decimal, octal, or hex.

* Output of reals, in exponential or fractional format. You can also output a real number using so-called "general" format (similar to FORTRAN's Gw.d), which uses exponential or fractional format, whichever is shorter.
* Free-format output, left-justification, and right-justification. In other words, 10 can be output as "10", " 10", or "10 ". All three of these formats are useful in different applications.

* Zero-padding (e.g. outputting 10 as "00010").

* Bad input data (strings where numbers are expected, etc.) does not generate an error; in fact, the only way of detecting it is to read the next character after the SCANF is done (say, using GETC) and see if it's the terminator you expected (e.g. a blank or a newline) or some other character that might have been viewed as a numeric terminator. This is somewhat cumbersome, but in the long run more flexible; it is certainly much better than having your entire program abort whenever the user types bad data.

* Standard C allows you to use SSCANF to read from a string and SPRINTF to output to a string.

This is, overall, a richer feature set than PASCAL's. Note, however, some problems:

* Still no monetary output facility.

* Still no "always print a sign character even if the number's positive" feature (again, the Draft Standard corrects this).

* Unlike PASCAL, printing a boolean value will just print a 0 or 1 (since C doesn't have a separate boolean type). What's more, even an enumerated type value will just be printed as its numeric equivalent (since C views variables of enumerated types as simple integer constants).

In my opinion, these things are all pretty bearable; the important thing is that in C you CAN define your own PRINTF- and SCANF-like procedures. You can have parameters of varying types; even variable numbers of parameters are supported by the Draft Standard (most non-Standard compilers give you some such feature, too). Thus, you can write your "myprintf" procedure, and call it using

myprintf ("comms are %15.2m of tot %15.2m", commissions, total);
   /* assuming you've defined "%X.Ym" to be your
      "money-format" format specifier.
*/

Of course, nobody says it'll be easy to write this MYPRINTF procedure, especially if you want to emulate the standard PRINTF directives (which can be done by just calling SPRINTF); the important thing is that you CAN write a procedure like MYPRINTF, whereas in Standard PASCAL and PASCAL/3000, you can't.

SUMMARY OF I/O FACILITIES

[Since SPL relies solely on the HP System Intrinsics to do I/O, numeric formatting, etc., I don't include it in this comparison. Believe me -- with SPL I/O, everything is possible but nothing is easy.]

                                         STD  PAS/ PAS/ K&R  STD
                                         PAS  3000 XL   C    C

OPEN ARBITRARY FILE GIVEN NAME           NO   YES  YES  YES  YES
OPEN FILE FOR APPEND ACCESS              NO   YES  YES  YES  YES
OPEN FILE FOR INPUT/OUTPUT ACCESS        NO   YES  YES  NO   YES
CLOSE A FILE                             NO   YES  YES  YES  YES
READ, WRITE FILES SERIALLY               YES  YES  YES  YES  YES
READ, WRITE FILES BY RECORD NUMBER       NO   YES  YES  NO   YES
DETECT AND HANDLE FILE ERRORS            NO   NO+  YES  YES  YES
FORMAT NUMBERS FOR OUTPUT                YES  YES  YES  YES+ YES+
INPUT NUMBERS                            YES  YES  YES  YES  YES
INPUT ERROR DETECTION?                   NO   NO+  YES  YES  YES
OUTPUT A STRING WITH NO NEW-LINE         NO   YES  YES  YES  YES
USE A PASCAL-OPENED FILE FOR NATIVE      N/A  YES  YES  N/A  N/A
  FILE OPERATIONS
USE A "NATIVELY" OPENED FILE FOR         N/A  NO   YES  N/A  N/A
  PASCAL FILE OPERATIONS
WRITE YOUR OWN WRITELN/PRINTF-LIKE       NO   NO   YES- YES- YES
  FUNCTION (WITH VARIOUS PARAMETER
  TYPES AND NUMBERS OF PARAMETERS)

LEGEND: YES  = Implemented.
        YES+ = Implemented in a really nice and useful way.
        YES- = Implemented, but there's some ugliness involved.
        NO   = Not implemented.
        NO+  = I can't fairly say that it's simply "not implemented",
               but believe me, it's soooo ugly...
        N/A  = Not applicable.

STRINGS IN STANDARD PASCAL AND SPL

Much, if not most, of the data we keep on computers is character data -- filenames, user names, application data, text files. Of course, it's imperative that any programming language we use can represent and manipulate this sort of data. Standard PASCAL's mechanism for storing strings is the "PACKED ARRAY OF CHAR".
PACKED here is simply a convention used to indicate that there should be one character stored per byte, not one per word. I've never seen anyone use an unpacked ARRAY OF CHAR. If you think about it, SPL and C use PACKED ARRAY OF CHARs, too. All a PACKED ARRAY [1..100] OF CHAR means is "an array of 100 bytes, each of which is individually addressable". Practically speaking,

VAR X: PACKED ARRAY [1..256] OF CHAR;   { PASCAL }

BYTE ARRAY X(0:255);                    << SPL >>

char x[256];                            /* C */

are all identical -- and fairly reasonable -- ways of storing a string that's between 0 and 256 characters long.

Still, despite this identity of representation, I claim that Standard PASCAL has severe problems with string processing. Support for strings involves much more than just having a way of representing them. The important thing for strings -- as for any data type -- is the OPERATORS THAT ARE DEFINED to manipulate them. What's the use of having strings if you can't extract a substring? Concatenate them? Find a character within a string? It's by the level of this sort of support that a language's string facility is measured.

SPL, for instance, has several useful operators that help in string manipulation:

* You can say

     MOVE STR1(OFFSET1):=STR2(OFFSET2),(LENGTH);

  to move one substring of a string into another. PASCAL can only move the entire thing in one shot (STR1:=STR2), or examine/set one character at a time (STR1[I]:=STR2[I]).

* You can say

     IF STR1(OFFSET1)=STR2(OFFSET2),(LENGTH) THEN ...

  to compare two substrings. You can also compare for <, >, <=, >=, and <>, as well as compare against constants (e.g. IF STR1(X)="FOO").

* You can say

     MOVE STR1(OFFSET1):=STR2(OFFSET2) WHILE ANS;

  which will copy substrings WHILE the character being copied is Alphabetic or Numeric (upShifting in the process). You can copy only WHILE AN (no upshifting), WHILE AS (while alphabetic, upshifting), or WHILE N (while numeric). You can also find out how many characters were so copied (i.e.
at what point the copy stopped).

* You can say

     I:= SCAN STR(OFFSET) UNTIL "xy";

  assigning to I the index of the first character in the substring which is equal either to "x" or to "y"; you can also say

     I:= SCAN STR(OFFSET) WHILE "b";

  which will assign to I the index of the first character in the substring that is NOT equal to "b" (for more details, see the SPL manual). Note that you can NOT say "SCAN until you either find this character OR you've gone through 80 characters", which would be very desirable if you know that the maximum length of your string is, say, 80.

* You can say

     P (STR(OFFSET));

  calling the procedure P and passing to it all of STR starting with offset OFFSET. In PASCAL, you can only pass the entire string, not this sort of substring. Note, however, that in SPL you can't pass BYTE ARRAYs by value -- only by reference.

* On the other hand, reading or writing strings is rather more difficult than one would like. Since the PRINT and READX intrinsics take "logical arrays" rather than "byte arrays", you in general have to say:

     LOGICAL ARRAY BUFFER'L(0:127);
     BYTE ARRAY BUFFER(*)=BUFFER'L;
     INTEGER IN'LEN;
     ...
     IN'LEN:=READX (BUFFER'L, -256);
     MOVE STR(OFFSET):=BUFFER,(IN'LEN);
     ...
     MOVE BUFFER:=STR(OFFSET),(LEN);
     PRINT (BUFFER'L, -LEN, 0);

These features are a part of the SPL language; you can use them without writing any procedures of your own. Furthermore, if you want to, say, write a procedure that finds the first occurrence of STR1 in STR2, you can just say:

     INTEGER PROCEDURE FIND'STRING (STR1, LEN1, STR2, LEN2);
     VALUE LEN1, LEN2;
     BYTE ARRAY STR1, STR2;
     INTEGER LEN1, LEN2;
     ...

and implement it yourself. It may not be easy (actually, it is), but it's certainly possible. These are the string-handling features that SPL supports, and you may consider them sufficient or not.
PASCAL supports a different, and somewhat smaller, set of features:

* You can copy entire strings, or examine and set single characters:

     STR1:=STR2;   or   STR1[I]:=STR2[J];

  You can't copy substrings without writing your own FOR loop, or having a special temporary array and calling the little-known PACK and UNPACK procedures.

* You can input and output a string using READLN and WRITELN. You can output the first N characters of a string by saying

     WRITELN (STR:N);

  but you again can't output an arbitrary substring of STR.

* You can pass a string to a procedure; you can't pass a substring. On the other hand, you can pass a string by value as well as by reference.

As you see, this set of operators is in some respects richer than SPL's (I/O) and in others poorer (substrings, comparisons, SCANs, etc.). But the WORST problem with PASCAL's string handling is:

* YOU CAN'T WRITE YOUR OWN GENERAL STRING HANDLERS!

Strange for a language that emphasizes breaking things down into little, general-purpose procedures, eh? But if you've read the "DATA STRUCTURES, TYPE CHECKING" chapter of this paper, you'll know why:

* YOU CAN'T DECLARE A PROCEDURE TO TAKE A GENERAL STRING!

You can write

     TYPE PAC256 = PACKED ARRAY [1..256] OF CHAR;
     ...
     FUNCTION STRCOMPARE (S1, S2: PAC256): INTEGER;

but THE ONLY STRINGS YOU CAN PASS TO THIS FUNCTION ARE THOSE OF TYPE PAC256! What if you have a string that's at most 8 bytes long, a PACKED ARRAY [1..8] OF CHAR? No dice! You have to either declare it with a maximum length of 256 bytes (thus wasting 97% of the space!) OR write one STRCOMPARE procedure for every possible combination of S1 and S2 maximum lengths.

This means that not only do you start with a somewhat poor set of string-handling primitives, but you'll have a very hard time implementing your own, unless you're willing to have all your strings be of the same maximum length! In my opinion, this is a very, very unpleasant circumstance.
STRING HANDLING IN PASCAL/3000 AND PASCAL/XL

A better string handling system is one of the conspicuous improvements that HP put into PASCAL/3000. The first new feature that you'd notice in PASCAL/3000 strings is that A PASCAL/3000 STRING CONTAINS MORE THAN JUST CHARACTERS. When you say

     VAR S: STRING[256];

you're allocating more than just a PACKED ARRAY [1..256] OF CHAR. You're essentially creating a record structure:

     VAR S: RECORD
              LEN: -32768..32767;  { 2 bytes in /3000; 4 bytes in /XL }
              DATA: PACKED ARRAY [1..256] OF CHAR;
            END;

Now S isn't REALLY a PASCAL RECORD -- you can't just access its subfields using "S.LEN" and "S.DATA". But internally, it is a record structure, in that it contains both of these pieces of data independently. When you say

     S:='FOOBAR';

then PASCAL will not only move "FOOBAR" to the data portion, but also set the length portion to 6. The LEN subfield contains the actual current length of the data; there may be room for up to 256 characters, but in this case it indicates that the actual length is only 6 characters.

A brief aside: Obviously, it's quite important to somehow keep track of the current string length. For a fixed-length thing like the 8-character account name, we may not need it, but if we're doing, say, text editing, we want to know the actual length of the line. Let me point out, though, that keeping a separate length field is not imperative for this; C uses a NULL character as a string terminator, and many SPL programmers do similar things. In other words, just because PASCAL/3000 represents strings this way, don't think that that's the only way of doing it...

Back to the PASCAL approach. The representational change is the most obvious difference in PASCAL/3000; but, as we saw earlier, it's the OPERATORS rather than the REPRESENTATION that really make a data type.
PASCAL/3000 provides a pretty rich set, including especially:

* You can extract and manipulate arbitrary substrings using STR: STR(X,10,7) returns a string containing the 7 characters starting from the 10th character. The result of the STR function can be used anywhere a "real string" could be used; however, it cannot be assigned to or passed as a by-reference parameter.

* You can concatenate two strings using "+":

     S:='PURGE '+STR(FILENAME,1,10)+',TEMP';

* You can find out a string's length using the STRLEN function. This is somewhat more convenient than in SPL; SPL strings don't have a separate "length" field, so most SPL programmers end up terminating their string data with some distinctive character (often a carriage return, %15). Thus, an SPL programmer would have to say

     I:= SCAN STR UNTIL [8/%15,8/%15];

  to scan through the string looking for a carriage return (a relatively, though not very, slow process). The PASCAL programmer would say

     I:= STRLEN(STR);

* You can copy substrings using STRMOVE. (STRMOVE also works for PACKED ARRAY OF CHARs.)

* You can easily edit a string using STRDELETE and STRINSERT to delete/insert characters anywhere in the string.

* You can find the first occurrence of one string in another using STRPOS.

* You can strip leading and trailing blanks using STRLTRIM and STRRTRIM. Stripping trailing blanks is a particularly useful operation.

* You can also do READLNs and WRITELNs into strings using STRREAD and STRWRITE; this means that you can easily convert a string to an integer and vice versa.

* You can have functions that return strings (many of the above, including STR and STRRTRIM, are examples). Standard PASCAL doesn't allow functions to return structured types, including arrays.

More importantly, you can now write a procedure

     PROCEDURE STRREPLACE (VAR STR, FROMSTR, TOSTR: STRING);
     { Changes all occurrences of FROMSTR to TOSTR in STR.
     }

that will work for a string of ANY maximum length, because you declared the parameters to be of type STRING, rather than STRING[256] or some such fixed length. (Note, however, that only BY-REFERENCE string parameters can be declared to be of type STRING.)

One fairly serious problem, though, that still afflicts PASCAL/3000 strings (PASCAL/XL fixes this) is the inability to dynamically allocate a string of a size that is not known until run-time. For more discussion of this, look at the "POINTERS" chapter of this paper.

STRING HANDLING IN C

The designers of C, of course, faced the same sorts of problems as the designers of PASCAL/3000, but at least in the area of strings, they attacked them in a somewhat different way. As a matter of convention, C strings -- kept as simple arrays of characters -- are terminated by a NULL ('\0', ascii 0) character. You can, of course, have a

     char x[10];

array, none of whose characters is a null, but all that means is that you'll get screwy results if you pass it to a string manipulation procedure. When you say

     strcpy (x, "testing");

the C compiler will pass to STRCPY the address of the array X (remember that in C saying "arrayname" gets you the address of the array) and the address of the 8-character string which contains "t", "e", "s", "t", "i", "n", "g", and NULL. Then STRCPY -- just because of the way it's written, and because this is the useful thing to do -- will copy all the characters from the second string ("testing") into the first string (X), up to and including the terminating NULL.

So here you see a fundamental representational difference between PASCAL/3000 (and PASCAL/XL) strings and C strings:

* PASCAL/3000 keeps the current length of the string as a separate field.

* C keeps it implicitly, having a null character terminate the string's actual data.

The PASCAL/3000 approach clearly has some advantages:

* Determining the string length is much faster -- you need only extract the first 2 bytes of the string array, and you've got it.
In C, you'd need to scan through each character until you find a null.

* PASCAL/3000 strings can contain any character. C strings may not contain a null, since that would be viewed as a terminator.

In practice, though, the second issue (strings that need to contain nulls) doesn't arise, and the first issue isn't as important as one would think. In fact, there are some compensating advantages to C's approach, but I'll discuss them a bit later.

Given what we know about the different representational formats, what about the defined operations? Kernighan & Ritchie is rather cavalier about this vital question, and merely alludes to the "standard I/O library", which is said to contain various string manipulation functions. Thus, it's not unlikely that there'll be some non-trivial differences between various C implementations in this area (although there'll also be a good deal of similarity). Therefore, I'll have to compare PASCAL/3000 and Draft ANSI Standard C; keep in mind that the C functions might not be available on all compilers.

Let's consider a (possibly) typical application. We need to write two procedures:

* One that takes a file name (MPEX), group name (PUB), and account name (VESOFT), and makes them into a fully-qualified filename (MPEX.PUB.VESOFT).

* Another that does the opposite -- takes a fully-qualified filename and splits it into its file part, group part, and account part.

Here's what they'd look like, in PASCAL:

PROGRAM PROG (INPUT, OUTPUT);
TYPE TSTR256 = STRING[256];
     TSTR8 = STRING[8];
VAR FILENAME, GROUP, ACCT: STRING[8];

FUNCTION FNAME_FORMAT (FILENAME, GROUP, ACCT: TSTR8): TSTR256;
BEGIN
FNAME_FORMAT := STRRTRIM(FILENAME) + '.' + STRRTRIM(GROUP) + '.'
                + STRRTRIM(ACCT);
END;

PROCEDURE FNAME_PARSE (QUALIFIED: TSTR256;
                       VAR FILENAME, GROUP, ACCT: STRING);
VAR START_GROUP, START_ACCT: INTEGER;
BEGIN
START_GROUP := STRPOS (QUALIFIED, '.') + 1;
START_ACCT := STRPOS (STR (QUALIFIED, START_GROUP,
                           STRLEN(QUALIFIED)-START_GROUP-1), '.') +
              START_GROUP;
FILENAME := STR (QUALIFIED, 1, START_GROUP - 2);
GROUP := STR (QUALIFIED, START_GROUP, START_ACCT-START_GROUP-1);
ACCT := STR (QUALIFIED, START_ACCT,
             STRLEN (QUALIFIED) - START_ACCT + 1);
END;

BEGIN
WRITELN (FNAME_FORMAT ('MPEX ', 'PUB ', 'VESOFT '));
FNAME_PARSE ('MPEX.PUB.VESOFT', FILENAME, GROUP, ACCT);
WRITELN (FILENAME, ',', GROUP, ',', ACCT, ';');
END.

and in C:

#include <stdio.h>
#include <string.h>

char *strrtrim (s)
char s[];
{
   /* Strips trailing blanks from S; also returns S as the result. */
   int i;
   for (i = strlen(s); (i>0) && (s[i-1]==' '); i = i-1)
      ;
   s[i] = '\0';
   return s;
}

char *fname_format (filename, group, acct, qual)
char filename[], group[], acct[], qual[];
{
   qual[0] = '\0';
   strcat (qual, strrtrim (filename));
   strcat (qual, ".");
   strcat (qual, strrtrim (group));
   strcat (qual, ".");
   strcat (qual, strrtrim (acct));
   return qual;
}

fname_parse (qual, filename, group, acct)
char qual[], filename[], group[], acct[];
{
   char *start_group, *start_acct;
   start_group = strchr (qual, '.') + 1;
   start_acct = strchr (start_group, '.') + 1;
   strncpy (filename, qual, start_group - qual - 1);
   filename[start_group - qual - 1] = '\0';
   strncpy (group, start_group, start_acct - start_group - 1);
   group[start_acct - start_group - 1] = '\0';
   strcpy (acct, start_acct);
}

main ()
{
   char qual[256], filename[8], group[8], acct[8];
   printf ("%s\n", fname_format ("sl ", "pub ", "sys ", qual));
   fname_parse ("mpex.pub.vesoft", filename, group, acct);
   printf ("%s,%s,%s;\n", filename, group, acct);
}

What are the differences between these two, besides the fact that one is upper-case and one is lower-case? Let's examine these programs a piece at a time.
The FNAME_FORMAT procedure, which merges the three "file parts" into a fully-qualified filename, looks like this in PASCAL:

FUNCTION FNAME_FORMAT (FILENAME, GROUP, ACCT: TSTR8): TSTR256;
BEGIN
FNAME_FORMAT := STRRTRIM(FILENAME) + '.' + STRRTRIM(GROUP) + '.' +
                STRRTRIM(ACCT);
END;

As you see, we're taking full advantage here of the fact that PASCAL/3000 lets us say "A + B" to concatenate two strings. In C, this is quite a bit more difficult:

char *fname_format (filename, group, acct, qual)
char filename[], group[], acct[], qual[];
{
   qual[0] = '\0';
   strcat (qual, strrtrim (filename));
   strcat (qual, ".");
   strcat (qual, strrtrim (group));
   strcat (qual, ".");
   strcat (qual, strrtrim (acct));
   return qual;
}

Instead of just saying "A + B", we must say "STRCAT (A, B)", which MODIFIES THE STRING A (appending the string B to it) rather than just returning a newly-constructed string. In fact, STRCAT does return the address of the modified A, so we could conceivably say:

strcat (strcat (strcat (strcat (strcat (qual,
   strrtrim (filename)), "."), strrtrim (group)), "."),
   strrtrim (acct));

but for obvious reasons we don't. This is one advantage of having an operator like "+" instead of a function like "strcat" -- it makes the program look quite a bit cleaner, especially if we have to nest it.

Note also that the PASCAL string manipulators are quite willing to "create" a new string, as + and STR do. The C string package, on the other hand, can only modify parameters that are passed to it (as "strcat" does to its first parameter). This is an artifact of the fact that C functions can't return arrays (like Standard PASCAL functions, but unlike PASCAL/3000 functions). [Actually, some C compilers, including Draft ANSI Standard, allow functions to return structures, so we could have a structure that "contains" only one subfield, which is an array -- but this is not usually done.]
While we were writing FNAME_FORMAT, we needed some way of stripping trailing blanks from the file name, group name, and account name strings. In PASCAL, this was accomplished by calling the STRRTRIM function; in C, no such function exists, so we had to write our own:

char *strrtrim (s)
char s[];
{
   /* Strips trailing blanks from S; also returns S as the result. */
   int i;
   for (i = strlen(s); (i>0) && (s[i-1]==' '); i = i-1)
      ;
   s[i] = '\0';
   return s;
}

What we do is quite simple -- we find the end of the string using STRLEN and then step back until we find a non-blank; then, we set the first of the trailing blanks to a '\0', which is the string terminator. What this means, among other things, is that even if you aren't satisfied with C's string library, or if you're using a C compiler that doesn't come with a string library, you can write all your string handling primitives quite easily. I'd guess that all of the Draft ANSI Standard C string-handling routines (except perhaps the numeric formatting/parsing ones, like "sprintf" and "sscanf") could be implemented in 200 lines or less. On the other hand, of course, you'd rather not have to write even that much yourself.

Continuing through our sample programs, we get to FNAME_PARSE. It's a more complicated procedure -- we have to find the locations of the two dots and then extract the three substrings that lie before, between, and after the dots.
In PASCAL, this would be:

PROCEDURE FNAME_PARSE (QUALIFIED: TSTR256;
                       VAR FILENAME, GROUP, ACCT: STRING);
VAR START_GROUP, START_ACCT: INTEGER;
BEGIN
START_GROUP := STRPOS (QUALIFIED, '.') + 1;
START_ACCT := STRPOS (STR (QUALIFIED, START_GROUP,
                           STRLEN(QUALIFIED)-START_GROUP-1), '.') +
              START_GROUP;
FILENAME := STR (QUALIFIED, 1, START_GROUP - 2);
GROUP := STR (QUALIFIED, START_GROUP, START_ACCT-START_GROUP-1);
ACCT := STR (QUALIFIED, START_ACCT,
             STRLEN (QUALIFIED) - START_ACCT + 1);
END;

and in C:

fname_parse (qual, filename, group, acct)
char qual[], filename[], group[], acct[];
{
   char *start_group, *start_acct;
   start_group = strchr (qual, '.') + 1;
   start_acct = strchr (start_group, '.') + 1;
   strncpy (filename, qual, start_group - qual - 1);
   filename[start_group - qual - 1] = '\0';
   strncpy (group, start_group, start_acct - start_group - 1);
   group[start_acct - start_group - 1] = '\0';
   strcpy (acct, start_acct);
}

Again, there are both similarities and differences:

* Both PASCAL and Draft Standard C have functions that find a character inside a string (STRPOS in PASCAL, "strchr" in C).

* PASCAL's STRPOS returns the INDEX of the character in the string, but C's "strchr" returns a POINTER to the character.

* PASCAL has a STR function that returns a substring, the room for which is allocated on the stack. In C, on the other hand, one would usually use a pointer and address directly into the original string. That's why we say:

     start_group = strchr (qual, '.') + 1;
     start_acct = strchr (start_group, '.') + 1;

  instead of

     START_GROUP := STRPOS (QUALIFIED, '.') + 1;
     START_ACCT := STRPOS (STR (QUALIFIED, START_GROUP,
                                STRLEN(QUALIFIED)-START_GROUP-1), '.') +
                   START_GROUP;

  As you see, we just passed START_GROUP (which is a pointer into the string QUAL) directly to STRCHR, rather than having to specially extract a substring, which would probably have been inefficient as well as being somewhat more cumbersome.
  Note, though, that when we pass START_GROUP, we don't pass a true substring in the sense of "L characters starting at offset S"; rather, STRCHR sees all of QUAL starting at the location pointed to by START_GROUP.

* Although the C routine manipulates pointers more than it does offsets, we can say

     p + 1

  to refer to "a pointer that points 1 character after P", or

     p - q

  to refer to "the number of characters between the pointers P and Q".

Finally, the calling sequences of the two procedures are quite similar:

WRITELN (FNAME_FORMAT ('MPEX ', 'PUB ', 'VESOFT '));
FNAME_PARSE ('MPEX.PUB.VESOFT', FILENAME, GROUP, ACCT);
WRITELN (FILENAME, ',', GROUP, ',', ACCT, ';');

printf ("%s\n", fname_format ("sl ", "pub ", "sys ", qual));
fname_parse ("mpex.pub.vesoft", filename, group, acct);
printf ("%s,%s,%s;\n", filename, group, acct);

Note that in both PASCAL and C, you can pass constant strings (e.g. "MPEX.PUB.VESOFT") to procedures -- a major improvement over SPL, which can't do this.

I've already mentioned some of the built-in string handling functions that Draft Standard C provides; here's a full list:

* STRCPY and STRNCPY copy one string into another (one copies the entire string, the other copies either the entire string or a given number of characters, whichever is smaller). They're comparable to PASCAL/3000's string assignment and STRMOVE functions. The "N" procedures -- STRNCPY, STRNCAT, STRNCMP -- coupled with C's ability to extract "instant substrings" ("&x[3]" is the address of the substring of X starting with character 3) are intended to compensate for the lack of a substring function like PASCAL/3000's STR.

* STRCAT and STRNCAT concatenate one string to another; again, the "N" version (STRNCAT) will append not an entire string but rather up to some number of characters of it. These functions are most similar to PASCAL/3000's STRAPPEND, but can do the job of "+" with a bit of extra difficulty (as we saw above).
* STRCMP and STRNCMP compare two strings, much like PASCAL/3000's ordinary relational operators (<, >, <=, >=, =, <>) applied to strings.

* STRLEN returns the length of a string (just like PASCAL/3000's STRLEN).

* STRCHR and STRRCHR search for the first and last occurrence, respectively, of a character in a string. STRSTR finds the first occurrence of one string within another. STRSTR is the direct equivalent of PASCAL/3000's STRPOS (which therefore can do what STRCHR does, too). STRRCHR has no direct PASCAL/3000 equivalent.

* STRCSPN searches for the first occurrence of ONE OF A SET OF CHARACTERS within a string. 'strcspn (x, "abc")' will find the first occurrence of an "a", "b", OR "c" in the string X. STRSPN searches for the first character that is NOT ONE OF A SET OF CHARACTERS -- 'strspn (x, " 0.")' will skip all leading spaces, zeroes, and dots in X, and return the index of the first character that is neither a space, a zero, nor a dot. These functions have no real parallels in PASCAL/3000.

* STRTOK is a pretty complicated-looking routine that allows you to break up a string into "tokens" separated by delimiters. It has no parallel in PASCAL/3000.

* SPRINTF and SSCANF are the equivalents of PASCAL/3000's STRWRITE and STRREAD -- they let you do formatted I/O using a string instead of a file.

* PASCAL/3000 procedures that don't have a direct equivalent in C include: STRDELETE and STRINSERT (which can be emulated with some difficulty using STRNCPY); STRLTRIM and STRRTRIM, which trim leading/trailing blanks; and STRRPT, which returns a string containing a given number of repetitions of some other string.

I intentionally gave this list AFTER the example comparing FNAME_PARSE and FNAME_FORMAT in PASCAL and C. As we saw with STRRTRIM, PASCAL/3000's standard string handling routines can be easily implemented in C, and I'm sure that C's string handling routines can be easily implemented in PASCAL/3000.
The important thing, I believe, is not the exact set of built-in string-handling procedures but rather the ease of extensibility (which is good in both PASCAL/3000 and C, but very bad in Standard PASCAL) and the general "style" of string-handling programming (which, as you can see, is somewhat different in PASCAL/3000 and C). If you prefer C's pointers and null-terminated strings -- or, conversely, if you prefer PASCAL/3000's "+" operator and the ability of functions to return string results -- I'm sure that you'll have no problems implementing whatever primitives you need in either language.

SEPARATE COMPILATION -- STANDARD PASCAL

As I'm sure you can imagine, the MPE/XL source code is NOT stored in one source file. Neither, for that matter, is my MPEX/3000 or SECURITY/3000, or virtually any serious program. Not only do I, for instance, heavily use $INCLUDE files, but I often compile various portions of my programs into various USLs, RLs, and SLs, and then eventually link them together at compile and :PREP time. Obviously, this sort of thing is imperative when your programs get into thousands or tens of thousands of lines. Now, what you may not be aware of is that

* STANDARD PASCAL DEMANDS THAT YOUR ENTIRE PROGRAM BE COMPILED FROM A SINGLE SOURCE FILE.

That's right. In the orthodox Standard, you can't have your program call SL procedures; you can't have it call RL procedures; you can't have it call any procedures OTHER THAN THE ONES THAT WERE DEFINED IN THE SAME SOURCE FILE. Believe it or not, this is true -- and it's one of the major problems with Standard PASCAL.

Let's say that you want to keep all of your utility procedures -- ones that might be useful to many different programs -- in an RL (or SL). This way, you won't have to copy them all into each program, which would be a maintenance headache, and would slow down the compiles substantially (my utility RL is 13,000 lines long; MPEX is 3,000 lines). Standard PASCAL has a problem with this.
Say that it encounters a statement such as:

I:=MIN(J,80);

If it had previously seen a function definition such as:

FUNCTION MIN (X, Y: INTEGER): INTEGER;
BEGIN
IF X<Y THEN MIN:=X
ELSE MIN:=Y;
END;

then it would realize that MIN is a function -- a function that takes two by-value integer parameters and returns an integer -- and would generate code accordingly. But what if MIN isn't in the same source file? How does PASCAL know what to do?

Now, PASCAL might conceivably be able to decide that MIN is a function -- after all, it couldn't be anything else. Still, what are the function's parameters? Is J, for instance, a by-value or a by-reference (VAR) parameter? PASCAL must know, because it would have to generate different code in these two cases. Are its parameters and return value 32-bit integers or 16-bit integers? Again, PASCAL must know. Does the function really have two integer parameters, or is the programmer making a mistake? PASCAL wants to do type checking, but it has no information to check against. Essentially, what we have here is a "knowledge crisis":

* WHEN YOU TRY TO CALL A PROCEDURE THAT ISN'T DEFINED IN THE SAME SOURCE FILE, PASCAL DOESN'T HAVE ENOUGH KNOWLEDGE ABOUT THE PROCEDURE TO GENERATE CORRECT CODE FOR THE CALL. FURTHERMORE, PASCAL'S TYPE-CHECKING DESIRES ARE FRUSTRATED BY THIS VERY SAME THING.

HOW HP PASCAL, PASCAL/XL, AND SPL HANDLE SEPARATE COMPILATION

Now this, of course, is by no means a new problem. Other languages have to call separately compiled procedures, too, and they've managed to work out some solutions. In HP's COBOL/68, for instance, if you say

CALL "MIN" USING X, 80.

the compiler will assume that the procedure has two parameters, X and 80, and that both of them are passed by reference, as word addresses. If the parameters are passed by value, or as byte addresses, or if the procedure returns a value -- why, that's just your tough luck. You can't call this procedure from COBOL/68.
COBOL/68 compatibility, incidentally, is the reason why all the IMAGE intrinsics have by-reference, word-addressed parameters.

FORTRAN/3000 adopted a similar but somewhat more flexible approach. In FORTRAN, saying

CALL MIN (X, 80)

will also make FORTRAN assume that MIN has two parameters, each of them passed by reference. However, if X is of a character type, FORTRAN will pass it as a byte address, not a word address; if X is an integer or any other non-character object, FORTRAN will pass it as a word address. This gives the user more flexibility. Furthermore, FORTRAN/3000 allows you to say:

CALL MIN (\X\, 80)

to tell the compiler that a particular parameter -- in this case X -- should be passed BY VALUE rather than by reference. Furthermore, if you want MIN to be a function, you can say

I = MIN (\X\, \80\)

from which FORTRAN will deduce that MIN returns a result. The type of the result, incidentally, is assumed to be integer by FORTRAN's default type conventions (anything starting with an M, or any other letter between I and N, is an integer). If you want to declare MIN to be real, you can simply say:

REAL MIN
...
I = MIN (\X\, \80\)

Thus, we can see four possible components in a compiler's decision about how a procedure is to be called:

* STANDARD ASSUMPTIONS. Both COBOL68/3000 and FORTRAN/3000, for instance, ASSUME that all parameters are passed by reference.

* ASSUMPTIONS DERIVED FROM A NORMAL CALLING SEQUENCE. How many parameters does a procedure have? Both COBOL68 and FORTRAN guess this from the number of parameters the user specified. Similarly, FORTRAN determines whether or not a procedure returns a result, and also the word-/byte-addressing of the parameters, from the details of the particular call.

* CALLING SEQUENCE MECHANISMS BY WHICH A USER CAN OVERRIDE THE COMPILER'S ASSUMPTIONS. In FORTRAN, the backslashes (\) around by-value parameters allow a user to override the assumption that the parameters are to be passed by reference.
Similarly, in HP's COBOL/74 (COBOLII/3000), you can say CALL "MIN" USING @X, \10\. which indicates that X is to be passed as a byte address and 10 is to be passed by value. Of course, if MIN expects X to be by-value, too, this call won't give you the right result -- it's your responsibility to specify the correct calling sequence. * ONE-TIME DECLARATIONS THAT THE USER CAN SPECIFY. If a user says REAL MIN and then uses MIN as a function, the compiler will automatically know that MIN returns a real result, regardless of the context in which it is used. Different compilers, as you see, use different ones of the above methods, and use them in different cases. COBOL/68, as I said, only uses default assumptions and information that it can derive from the standard calling sequence; FORTRAN/3000 uses all four of the above methods to determine different things about how a procedure is to be called. HP PASCAL, PASCAL/XL, and SPL all take exactly the same approach. They * REQUIRE A USER TO DECLARE EVERYTHING THAT THE COMPILER NEEDS TO KNOW ABOUT THE PROCEDURE CALLING SEQUENCE. Unlike COBOL/3000 or FORTRAN/3000, they don't make any "educated guesses"; but, on the other hand, they let you specify the calling sequence in exact detail, thus allowing you to call procedures that wouldn't be easily callable from either COBOL or FORTRAN. In fact, SPL, HP PASCAL, and PASCAL/XL demand that you copy into your program the PROCEDURE HEADER of every separately-compiled (also known as "external") procedure that you call. For instance, if you declared your SPL procedure MIN as INTEGER PROCEDURE MIN (X, Y); VALUE X, Y; INTEGER X, Y; BEGIN IF X<Y THEN MIN:=X ELSE MIN:=Y; END; then an SPL program that wants to call MIN as an external would have to say INTEGER PROCEDURE MIN (X, Y); VALUE X, Y; INTEGER X, Y; OPTION EXTERNAL; The "OPTION EXTERNAL;" indicates that the compiler shouldn't expect the actual body of MIN to go here; rather, the procedure itself will be linked into the program later on. 
Similarly, if you want to call the PASCAL/3000 procedure

     FUNCTION MIN (X, Y: INTEGER): INTEGER;
     BEGIN
     IF X<Y THEN MIN:=X ELSE MIN:=Y;
     END;

from another PASCAL/3000 program, you'd have to say:

     FUNCTION MIN (X, Y: INTEGER): INTEGER;
     EXTERNAL;

Here, just the word "EXTERNAL;" tells PASCAL that this is only a declaration of the calling sequence; but, armed with this calling sequence, PASCAL can both

* Generate the correct code, and

* Check the parameters you specify to make sure that they're really what the procedure expects.

In other words, armed with these declarations, both SPL and PASCAL can make sure that you specify the right number of parameters, and (in PASCAL more than in SPL) that they are of the right types. For instance, if MIN was declared with by-reference rather than by-value parameters, the compiler would BOTH be sure to pass the address rather than the value AND would make sure that you're really passing a variable and not a constant or expression. Finally, since the external declaration is an exact copy of the actual procedure's header, you're sure that you can call ANY PASCAL procedure from another program, even if it has procedural/functional parameters and other arcane stuff.

SEPARATE COMPILATION IN K&R C

Where PASCAL/3000, PASCAL/XL, and SPL all follow the strict "declare everything, assume nothing" approach, Kernighan & Ritchie C does almost the exact opposite. Its solution is actually much like FORTRAN's, only more general and more demanding on the programmer. C will:

* Pass ALL parameters by value -- this isn't an assumption, it's a requirement.

* Deduce the number of parameters the called procedure has from the procedure call.

* Deduce the type of each parameter -- integer, float, or structure -- from the procedure call.

* Allow you to declare the procedure's result type, but usually assume it to be "integer" (some compilers make this assumption, while others signal an error).
What's more, these are K&R C's assumptions EVEN IF THE PROCEDURE IS DECLARED IN THE SAME FILE (i.e. not separately compiled). If you declare MIN as

     int min(x, y)
     int x, y;
     {
     if (x < y) return (x);
     else return (y);
     }

and then call it as

     r = min(17.0, 34.0);

then the C compiler will merrily pass 17.0 and 34.0 as floating-point numbers, although it "knows" that MIN expects integers. In other words, the C compiler will neither print an error message nor automatically convert the reals into integers; it'll pass them as reals, leaving MIN to treat their binary representations as binary representations of integers. In fact, C won't even "KNOW" that MIN has two parameters; if you pass three, it'll try to do it and let you suffer the consequences. The only thing that C remembers about MIN is the same thing that it would allow you to declare about MIN if MIN were an external procedure -- that MIN's result type is "int" (or whatever else you might declare it as). Since external function call characteristics are thus pretty much the same in K&R C as internal call characteristics, I describe them elsewhere (primarily in the "C TYPE CHECKING" section of the "DATA STRUCTURES" chapter). However, I'll mention a few of the most important points here:

* To refer to an external procedure, all you need to do is say:

     extern <type> <proc>();

The EXTERN indicates that <proc> is defined elsewhere; the "()" indicates that <proc> is a procedure; <type> indicates the type of the object returned by <proc>. An example might be:

     extern float sqrt();

which declares SQRT to be an external procedure that returns a FLOAT.

* If you want to pass a parameter by reference, you actually pass its address by value. In other words, you'd declare your procedure to be

     int swap_ints (x, y)
     int *x, *y;
     {
     int temp;
     temp = *x;
     *x = *y;
     *y = temp;
     }

-- a procedure that takes two BY-VALUE parameters, each of which happens to be a pointer.
Then, to call it, you'd say swap_ints (&foo, &bar); passing as parameters not FOO and BAR, but rather the expressions "&FOO" and "&BAR", which are the addresses of FOO and BAR. If you omit an "&" (i.e. say "FOO" instead of "&FOO"), the compiler will neither warn you nor do what seems to be "the right thing" (extract the address automatically); rather, it'll happily pass the value of FOO, which SWAP_INTS will treat as an address (boom!). * Similarly, you must be meticulously careful with the number of parameters you try to pass and the type of each parameter; as I said before, if you say SQRT (100) and "sqrt" expects a floating point number, C won't automatically do the conversion for you (because it doesn't know just what it is that SQRT expects). The only exceptions are the various "short" types that are automatically converted to "int"s and "float"s which are automatically converted to "double"s. To summarize, K&R C saves you having to include the procedure header of every external procedure you want to call; however, it does require you to specify the procedure's result type. On the flip side, it can't check for what you don't specify, so it will neither check for your errors nor automatically do the kinds of conversions (e.g. automatically take the address of a by-reference parameter) that SPL and PASCAL programmers take for granted. DRAFT ANSI STANDARD C AND SEPARATE COMPILATION Draft ANSI Standard C allows you to say extern float sqrt(); or extern int min(); just like you would in standard Kernighan & Ritchie C. One new feature that it provides, though, is the ability to declare a "function prototype" which specifies the types of the parameters that the called procedure expects, thus letting the C compiler do some type checking and automatic conversion. I discuss this facility in some detail in the "DATA STRUCTURES -- TYPE CHECKING" sections of this manual, but I'll talk about it briefly here. 
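To preview what a prototype buys, here is the running MIN example written with modern prototype syntax (a sketch of mine, compilable as shown):

```c
#include <assert.h>

/* With the parameter types declared in the function's own definition
   (prototype style), the compiler checks the argument count and converts
   each argument to the declared type before the call. */
int min(int x, int y)
{
    return (x < y) ? x : y;
}
```

Called as min(17.9, 34.0), the two doubles are quietly converted to the ints 17 and 34 before MIN ever sees them -- precisely the conversion that K&R C, having no parameter declarations to go by, could not perform.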
In Draft ANSI Standard C, you can say extern float sqrt(float); or extern int min(int,int); thus declaring the number and types of the procedures' parameters. This imposes some type checking that's almost as stringent as PASCAL's; the major differences between it and PASCAL's type checking are: * The sizes of array parameters are not checked. This -- as I mention in the DATA STRUCTURES chapter -- is actually a good thing. * You can entirely prevent parameter checking for a given procedure by simply using the old ("extern float sqrt()") standard K&R C mechanism instead of the new method. * You can declare a parameter to be a "generic by-reference object" by declaring its type to be "void *", to wit extern int put_rec (char *, char *, void *); where the third parameter is declared to be a "void *" and can thus have any type of array or record structure (or any other by-reference object) passed in its place. * Finally, you can use C's "type cast" mechanism to force an object to the expected data type if for some reason the object isn't declared the same way. I happen to like this new Draft Standard C approach; it allows you to implement strict type checking, but waive it whenever appropriate. MORE ABOUT SEPARATE COMPILATION -- GLOBAL VARIABLES What we talked about above is how the compiler can know about procedures that are declared in a different source file. What about variables? What if we want some procedures in the RL to share some global variables with procedures in our main program? In recent times, global variables have fallen somewhat out of favor, and for good reason. A procedure is a lot clearer if its only inputs and outputs are its parameters; you needn't be afraid that calling FOO (A, B); will actually change some global variable I or J that isn't even mentioned in the procedure call. I myself often ran into cases where an apparently innocent procedure call did something I didn't expect because it modified a global variable. 
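The hidden-global hazard is easy to demonstrate in a few lines of C (the names here are mine, purely for illustration):

```c
#include <assert.h>

/* The call add(2, 3) looks self-contained, but it also modifies a global
   variable that appears nowhere at the call site -- exactly the kind of
   surprise side effect described above. */
int calls_so_far = 0;     /* hidden state, invisible to the caller */

int add(int a, int b)
{
    calls_so_far++;       /* invisible side effect */
    return a + b;
}
```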
As a rule, it's much better to pass whatever information the called procedure needs to look at or set as a procedure parameter. For this reason, many of today's programming textbooks -- especially PASCAL and C textbooks -- counsel us to have as few global variables as possible. Unfortunately, though global variables are usually "bad programming style", they are often necessary. For instance, I have several global variables in my MPEX/3000 product and the supporting routines I keep in my RL: * CONY'HIT, a global variable that my control-Y trap procedure sets whenever control-Y is hit. Since its "caller" is actually the system, the only way it can communicate with the rest of my program is by using a global variable. * DEBUGGING. If this variable is TRUE, many of my procedures print useful debugging information (e.g. information on every file I open, parsing info, etc.). If I were to pass DEBUGGING as a procedure parameter, virtually every one of my procedures would have to have this extra parameter, either to use itself or to pass to other procedures that it calls. * VERSION, an 8-byte array that's set to the current version number of my program. Whenever one of my low-level routines detects some kind of logic error within my program (e.g. I'm trying to read from a non-existent data segment), it prints an error message, prints the contents of VERSION, and aborts. That way, if a user gets an abort like this and sends a PSCREEN to us, we'll see what version of the program he was running. Again, it would be a great burden to pass this variable as a parameter to all my RL procedures. All I'm saying is that there are cases where global variables are necessary and desirable, and it's important that a programming language -- especially a systems programming language -- support them. Now, how do PASCAL and C do them? 
GLOBAL VARIABLES IN PASCAL Since Standard PASCAL can't handle separate compilation anyway, it certainly has no provisions for "cross-source file" global variables, i.e. global variables shared by separately compiled procedures. You can declare normal global variables, to wit PROGRAM GLOBBER; { The variables between a PROGRAM statement and the first } { PROCEDURE statement are global. } VAR DEBUGGING: BOOLEAN; VERSION: PACKED ARRAY [1..8] OF CHAR; ... PROCEDURE FOO; VAR X: INTEGER; { this variable is local } ... but these variables will only be known within this source file; even if you can somehow call an external, separately-compiled procedure, there's no way for it to know about these global variables. PASCAL/3000 and PASCAL/XL, of course, had to face this problem just like they had to face the problem of calling external procedures. Their solution was this: * One of the source files (the MAIN BODY) should have a $GLOBAL$ control card at the very beginning. Then, ALL of its global variables become "knowable" by any other source files that are compiled separately. * Any of the other source files that wants to access a global variable must declare the variable as global within it; it must also have a $EXTERNAL$ control card at the very beginning. * All the global variables defined in any of the $EXTERNAL$ source files must also be defined in the $GLOBAL$ source file. In other words, our main body might say $GLOBAL$ PROGRAM GLOBBER; { These global variables are now accessible by all the } { separately compiled procedures. } VAR DEBUGGING: BOOLEAN; VERSION: PACKED ARRAY [1..8] OF CHAR; ... Two other separately-compiled files might read: $EXTERNAL$ $SUBPROGRAM$ { only subroutines here, no main body } PROGRAM PROCS_1; VAR DEBUGGING: BOOLEAN; { want to access this global var } ... $EXTERNAL$ PROGRAM PROCS_2; { want to access this global var } VAR VERSION: PACKED ARRAY [1..8] OF CHAR; ... 
So, you see, the main body file DECLARES all the global variables that are to be shared among all of the separately-compiled entities; all other files can essentially IMPORT none, some, or all of the global variables that were so declared. If, however, some procedures in PROCS_1 and PROCS_2 wanted to share some global variables, or even some procedures in a single file (e.g. PROCS_1) wanted to share a global variable between them, this variable would also have to be declared in the main body. Also note that, unlike the external procedure declaration, which is rather similarly implemented in many versions of PASCAL, this $GLOBAL$/$EXTERNAL$ is a distinctly unusual construct that is unlikely to be compatible with virtually any other PASCAL. GLOBAL VARIABLES IN SPL Global variable declaration and use is much the same in SPL as in PASCAL. The "main body" program -- the one that'll become the outer block of the program file -- should declare all the global variables used anywhere within the program. This would look something like: BEGIN GLOBAL LOGICAL DEBUGGING:=FALSE; GLOBAL BYTE ARRAY VERSION(0:7):="0.1 "; ... END; [Note that unlike PASCAL, you can initialize the variables when you declare them; this isn't overwhelmingly important, but does save a bit of typing.] Then, each procedure in a separately-compiled file that wants to "import" any of these global variables would say: ... PROCEDURE ABORT'PROG; BEGIN EXTERNAL BYTE ARRAY VERSION(*); ... END; ... Note one difference between SPL and PASCAL -- in PASCAL, all the imported global variables would be listed at the top of the source file; in SPL, they'd be given inside each referencing procedure. On the one hand, the SPL method means more typing; on the other, it "localizes" the importations to only those procedures that explicitly request them, thus making it clearer who is accessing global variables and who isn't. In any case, this isn't much of a difference. 
Note once again an interesting feature, present also in PASCAL: any global variables used anywhere in the various separately-compiled sources have to be declared in the main body (an exception in SPL is OWN variables; more about them later). This can be quite a burden; say you have an RL file whose procedures use a bunch of global variables, often just to communicate with each other (for instance, a bunch of resource handling routines might need to share a "table of currently locked resources" data structure). Then, any program that calls these procedures -- whether it needs to access the global variables or not -- would have to declare the variables as GLOBAL. Not a good thing, especially if you like to view your RL procedures as "black boxes" whose internal details should not concern their callers.

GLOBAL VARIABLES IN C

Both K&R C and Draft Standard C were designed with separate compilation in mind; thus, they have standard provisions for global variable declaration. In general, you're required to do two things:

* In one of your separately compiled source files, you must declare the variable as a simple global variable (i.e. outside of a procedure declaration), e.g.

     int debugging = 0;
     char version[8] = "0.1 ";

Note that unlike PASCAL and SPL, C doesn't require that these declarations occur in any particular source file, or even in the same source file. You could, for instance, put these declarations in your RL procedure source file -- then, if your main program doesn't want to change these variables, it need never know that they exist.
* Any source file that wants to access a global variable that it didn't itself declare must define the variable as "extern":

     extern int debugging;
     extern char version[];

The EXTERN declaration may occur either at the top level (outside a procedure), in which case the variable will be visible to all procedures that are subsequently defined; alternatively, it may occur inside a procedure, in which case the variable will be known only to that procedure. As I mentioned, the advantage of this sort of mechanism is that there is no central place that is obligated to declare all the global variables that are used in any of the separately compiled source files. Rather, each variable may be declared and "extern"ed only where it needs to be used. This apparently unimportant feature can be useful in many cases. Say you have two procedures, "alloc_dl_space" and "dealloc_dl_space", that allocate space in the DL area of the stack. They need to keep track of certain data, for instance, a list of all the free chunks of memory. You can't just say:

     alloc_dl_space()
     {
     int *first_free_chunk;
     ...
     }

Not only will FIRST_FREE_CHUNK not be visible to DEALLOC_DL_SPACE, but even a subsequent call to ALLOC_DL_SPACE won't "know" this variable's value, since any procedure-local variable is thrown away when the procedure exits. In C, however, you can say:

     int *first_free_chunk = 0;

     alloc_dl_space()
     {
     ...
     }

     dealloc_dl_space()
     {
     ...
     }

Now, ALLOC_DL_SPACE and DEALLOC_DL_SPACE can both see and modify the value of FIRST_FREE_CHUNK; also, since it is no longer a procedure-local variable, the value will be preserved between calls to these two procedures. Equally importantly,

* NOBODY WHO MIGHT CALL THESE TWO PROCEDURES WILL HAVE TO KNOW THAT THEY USE THIS GLOBAL VARIABLE.

Contrast with the PASCAL and SPL approach, where each global variable (and its type and size) would have to be known to the main program.
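A compilable sketch of the alloc_dl_space/dealloc_dl_space idea (the real free-chunk list is faked here with a simple counter, and the names are illustrative -- the point is where the variable lives, not what it holds):

```c
#include <assert.h>

/* File-scope variable: shared by both procedures, preserved between
   calls, yet no caller in another source file ever has to declare it. */
int chunks_allocated = 0;     /* stand-in for the first_free_chunk list */

void alloc_dl_space(void)
{
    chunks_allocated++;       /* would really carve a chunk out of DL */
}

void dealloc_dl_space(void)
{
    chunks_allocated--;       /* would really return the chunk */
}
```

In PASCAL/3000 or SPL, by contrast, this bookkeeping variable would have to be declared among the main program's globals.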
This both puts a burden on the programmer and adds the risk that this variable -- really the property of the ALLOC_DL_SPACE and DEALLOC_DL_SPACE procedures -- will somehow be changed by the main program that has to declare it. Note that this is the one case where the ability to initialize variables (which C has, SPL has to some extent, and PASCAL doesn't have) becomes really necessary. Since the variable is not known to the main body, who's going to initialize it? The initialization clause ("int *first_free_chunk = 0") will do it. Incidentally, SPL has a similar feature in its ability to have OWN variables. These are also "static" variables -- i.e. ones that don't go away when the procedure is exited -- but are only known within one procedure. Thus, this is somewhat less useful than C's approach, but better than PASCAL's, where all variables are either global (and thus have to be declared in the main body) or non-static (i.e. get thrown away whenever the procedure is exited). PASCAL/XL'S MODULES -- A NEW (IMPROVED?) SEPARATE COMPILATION METHOD PASCAL/3000, PASCAL/XL, SPL, and both species of C provide for separate compilation. Essentially, any file may declare a procedure or variable as "external", tell the compiler some things about it, and be able to reference this procedure or variable. This sort of solution is certainly necessary, and we can certainly live with it. But it is not without its problems. I keep about 300 procedures in my Relocatable Library (RL) file; these are general-purpose routines that I call from all my various programs. Say that I write a program in SPL or PASCAL. In order for my program to be able to call any RL procedure, the program must include an EXTERNAL declaration for the procedure, complete with declarations of all the parameters. Even in C, I'd have to declare at least the procedure's result type. Now, none of my programs actually calls all 300 of the RL procedures; most call only about 10 or 20, though some can call hundreds. 
Even if the program only calls 20 RL procedures, having to type all these EXTERNAL declarations can impose a substantial burden. Twenty full procedure headers, each one of which must be exactly correct, or else the program may very well fail in very strange ways at run-time. My solution to this problem was to create one $INCLUDE file -- C, SPL, and PASCAL/3000 all have $INCLUDE-like commands -- that contains the external declarations of each procedure in the RL. Then, every one of my programs $INCLUDEs this file, thus declaring as external all of the RL procedures, and making any one of them callable from the program. Thus, if my source file looks like:

     BEGIN
     INTEGER PROCEDURE MIN (I, J);
     VALUE I, J;
     INTEGER I, J;
     BEGIN
     IF I<J THEN MIN:=I ELSE MIN:=J;
     END;
     INTEGER PROCEDURE MAX (I, J);
     VALUE I, J;
     INTEGER I, J;
     BEGIN
     IF I>J THEN MAX:=I ELSE MAX:=J;
     END;
     ...
     END;

then my $INCLUDE file would look like:

     INTEGER PROCEDURE MIN (I, J);
     VALUE I, J;
     INTEGER I, J;
     OPTION EXTERNAL;
     INTEGER PROCEDURE MAX (I, J);
     VALUE I, J;
     INTEGER I, J;
     OPTION EXTERNAL;
     ...

As you see, one OPTION EXTERNAL declaration for each procedure. Now, instead of manually declaring each such procedure in every file that calls it, I can just $INCLUDE the entire "external declaration" file, and have access to ALL the procedures in my RL. Similarly, I may have a separate $INCLUDE file for various global constants, DEFINEs, and variable declarations. Still, the problem is obvious -- the procedure headers have to be written at least twice, once where the procedure is actually defined and at least once where the procedure is declared EXTERNAL. These two definitions then have to be kept in sync, and any discrepancy may have unpleasant consequences. The same goes for global variables, too. Do you feel sorry for me yet? Well, if you don't, consider this. If I write my procedures in PASCAL, then doubtless many of the parameters will be of specially-defined types -- records, enumerations, subranges, etc.
My EXTERNAL declarations must have EXACTLY THE SAME TYPES! This means that not only does each external procedure need to be defined in the external declarations, but so must any type that is used in any procedure header. The same goes for any constant used in defining the type (e.g. MAX_NAME_CHARS in the type 1..MAX_NAME_CHARS). In other words, all of the "externally visible entities" of the file containing the utility procedures must be duplicated in each caller of the procedures (either directly or using a $INCLUDE). This includes the externally-visible procedures, variables, types, and constants. They have to be maintained together with the procedure definitions themselves; any change in the external appearance of the file must be reflected in the copy. Thus, to summarize, if I have this source file:

     CONST MAX_NAME_CHARS = 8;
     TYPE FLAB_REC = RECORD ... END;
          NAME = PACKED ARRAY [1..MAX_NAME_CHARS] OF CHAR;
     ...
     PROCEDURE FLAB_READ (N: NAME; VAR F: FLAB_REC);
     BEGIN ... END;
     ...
     PROCEDURE FLAB_WRITE (N: NAME; VAR F: FLAB_REC);
     BEGIN ... END;
     ...
     FUNCTION FLAB_CALC_SECTORS (F: FLAB_REC): INTEGER;
     BEGIN ... END;

then at the very least I have to have an $INCLUDE file that looks like this:

     CONST MAX_NAME_CHARS = 8;
     TYPE FLAB_REC = RECORD ... END;
          NAME = PACKED ARRAY [1..MAX_NAME_CHARS] OF CHAR;
     PROCEDURE FLAB_READ (N: NAME; VAR F: FLAB_REC); EXTERNAL;
     PROCEDURE FLAB_WRITE (N: NAME; VAR F: FLAB_REC); EXTERNAL;
     FUNCTION FLAB_CALC_SECTORS (F: FLAB_REC): INTEGER; EXTERNAL;

So much for the tale of woe. I bet you're thinking: "Now Eugene's going to tell us that PASCAL/XL solves all these problems." Well, I wish I could, but I can't.
PASCAL/XL's MODULEs look something like the following:

     $MLIBRARY 'FLABLIB'$
     MODULE FLAB_HANDLING;
     $SEARCH 'PRIMLIB, STRLIB'$
     IMPORT PRIMITIVES, STRING_FUNCTIONS;
     EXPORT
       CONST FSERR_EOF = 0;
             FSERR_NO_FILE = 52;
             FSERR_DUP_FILE = 100;
       TYPE FLAB_TYPE = RECORD
              FILE: PACKED ARRAY [1..8] OF CHAR;
              GROUP: PACKED ARRAY [1..8] OF CHAR;
              ...
            END;
            DISC_ADDR_TYPE = ARRAY [1..10] OF INTEGER;
       FUNCTION FLABREAD (DISC_ADDR: DISC_ADDR_TYPE): FLAB_TYPE;
       PROCEDURE FLABWRITE (DISC_ADDR: DISC_ADDR_TYPE; F: FLAB_TYPE);
     IMPLEMENT
       CONST FSERR_EOF = 0;
             FSERR_NO_FILE = 52;
             FSERR_DUP_FILE = 100;
             MAX_FILES_PER_NODE = 16;   { not exported }
       TYPE FLAB_TYPE = RECORD
              FILE: PACKED ARRAY [1..8] OF CHAR;
              GROUP: PACKED ARRAY [1..8] OF CHAR;
              ...
            END;
            DISC_ADDR_TYPE = ARRAY [1..10] OF INTEGER;
            DISC_ID = 1..256;   { not exported }
       FUNCTION ADDRCHECK (DISC_ADDR: DISC_ADDR_TYPE): BOOLEAN;
       BEGIN
       { the implementation -- ADDRCHECK is not exported }
       END;
       FUNCTION FLABREAD (DISC_ADDR: DISC_ADDR_TYPE): FLAB_TYPE;
       BEGIN
       { the actual implementation of the function... }
       END;
       PROCEDURE FLABWRITE (DISC_ADDR: DISC_ADDR_TYPE; F: FLAB_TYPE);
       BEGIN
       { the actual implementation of the function... }
       END;
     END;

OK, let's look at this one piece at a time:

* First, we say "$MLIBRARY 'FLABLIB'$" followed by "MODULE FLAB_HANDLING". These tell the compiler that this file DEFINES A MODULE CALLED "FLAB_HANDLING" INSIDE THE SPECIALLY-FORMATTED MODULE LIBRARY "FLABLIB". All of the information about the "external interface" of this module -- i.e. everything that is specified in the EXPORT section -- will be stored into this module library.

* Then we say "$SEARCH 'PRIMLIB, STRLIB'$" followed by "IMPORT PRIMITIVES, STRING_FUNCTIONS;". This essentially "brings in" the external interface of the modules PRIMITIVES and STRING_FUNCTIONS that are stored in the module library files PRIMLIB and STRLIB.
Practically speaking, this is exactly the same as if ALL THE TEXT SPECIFIED IN THE "EXPORT" SECTION OF THE "PRIMITIVES" AND "STRING_FUNCTIONS" MODULES WAS INCLUDED DIRECTLY INTO THE FILE, WITH "EXTERNAL;" KEYWORDS PLACED RIGHT AFTER THE FUNCTION DEFINITIONS. Now the compiler "knows" about all the TYPEs, CONSTs, VARs, PROCEDUREs, and FUNCTIONs that are defined by the PRIMITIVES and STRING_FUNCTIONS modules, and will let you use them from within the module that's being currently defined (FLAB_HANDLING). * The EXPORT section defines the "external interface" of this module. We've already explained what this really means -- whenever you IMPORT a module, the result is exactly the same as if you had copied in the EXPORT section of the module (except that "EXTERNAL;" keywords are automatically put after all the functions). Any CONSTants, VARiables, TYPEs, PROCEDUREs, or FUNCTIONs that you define in this module but want other modules to use should be declared in the EXPORT section. * Finally, the IMPLEMENT section is the actual source code of your file. It has to include all the declarations and procedure/function definitions (EVEN THE ONES ALREADY MENTIONED IN THE "EXPORT" SECTION!). In fact, it's just like an ordinary PROGRAM, except that it's missing the PROGRAM statement. Thus, if you think about it, you could have -- instead of defining a MODULE -- * Written the IMPLEMENT section as an ordinary program; * Put the EXPORT information into a separate file (we'll call it the "external declarations $INCLUDE file"); * And, instead of the IMPORTs, just used the $INCLUDE$ compiler command to include the "external declarations $INCLUDE files" of all of the IMPORTed modules. Really, this is ALL THERE IS TO A MODULE. Its only advantages are: * You can keep the EXPORT declarations and IMPLEMENT part in the same file, so that when you change the definition of one of the external objects, you can easily change the EXPORT declaration in the same file. 
* The compiler will check to make sure that the EXPORT declarations are exactly the same as the actual implementation declarations (i.e. that you didn't define the procedure one way in an EXPORT and another way in the IMPLEMENT section). * Finally, for honesty's sake, I ought to point out that you'd actually need two $INCLUDE files per module if you wanted to use them instead of MODULEs. You'd need one $INCLUDE file for the TYPE/CONST/VAR declarations and one file for the PROCEDURE/FUNCTION EXTERNAL declarations -- this is because ALL of the TYPE/CONST/VAR declarations for all $INCLUDE$d modules would have to go before ALL the EXTERNAL declarations. The major flaw -- not compared with $INCLUDE$s but rather compared to what they could have so easily done! -- is obvious: * WHY FORCE US TO DUPLICATE ALL THE EXTERNAL OBJECT DECLARATIONS IN THE "EXPORT" AND "IMPLEMENT" SECTIONS? Why should I define the hundred-odd subfields of a file label twice? Why should I specify the parameter list of all my external procedures twice? To me it seems simple -- just have all the declarations go into the IMPLEMENT section, and then let me say: EXPORT FSERR_EOF, FSERR_NO_FILE, FSERR_DUP_FILE, FLAB_TYPE, DISC_ADDR_TYPE, FLABREAD, FLABWRITE; How simple! Think of all the effort and possibility for error that we'd avoid. Even better -- I'd like to be able to say either EXPORT_ALL; to indicate that ALL the things I define should be exported, or perhaps specify a $EXPORT$ flag near every definition that I want to export. That way, if I have a file with 20 constants, 5 variables, 10 data types, and 30 procedures, I wouldn't have to enumerate them all in an EXPORT statement at the top. This way, not only do I have to enumerate all of them, but I have to DUPLICATE ALL THEIR DEFINITIONS! Why? Another quirk that you may or may not have noticed: obviously, since I specify a module name AND a module library filename, there can be more than one module per library. 
And yet, I define a different library file for each of the modules FLAB_HANDLING, PRIMITIVES, and STRING_FUNCTIONS. Why would I do a silly thing like that? Well, a little paragraph hidden in my PASCAL/XL Programmer's Guide says: "A module can not import another module from its own target library; that is, the compiler options MLIBRARY and SEARCH can not specify the same library." Seems innocent, eh? But that means that any time a module must import another module, the imported module must be in a different library file! The only modules that CAN be stored in the same module library file are ones that do NOT import one another. Well, it makes perfect sense for FLAB_HANDLING to want to use PRIMITIVES and STRING_FUNCTIONS -- after all, why do I define modules if not to be able to IMPORT them into as many places as possible? Similarly, STRING_FUNCTIONS will probably want to use PRIMITIVES. What we get is a rather paradoxical situation in which THE ONLY MODULES THAT CAN BE PUT INTO THE *SAME* MODULE LIBRARY FILES ARE ONES THAT HAVE *NOTHING TO DO WITH EACH OTHER*! Of course, that's a bit of hyperbole, but you get my point -- you make modules so you can IMPORT them into one another, but the more you use IMPORT the less able you are to have several related modules in the same library.

STANDARD LANGUAGE OPERATORS

Everybody knows operators; all languages have them. +, -, *, / -- you pass them parameters and they return to you results. Imagine what would happen if you couldn't multiply two numbers! PASCAL's set of operators is, by standards of most computer languages (like BASIC, FORTRAN, and COBOL), about average. It includes:

     monadic +   monadic -
     +   -   *   /   DIV   MOD
     NOT   AND   OR
     <   <=   =   <>   >   >=
     IN   (the set operator)
     ^   (pointer dereferencing)
     [   (array subscripting)

These include:

* The "monadic" operators, which take one parameter.

* The "dyadic" operators, which take two parameters.

* The arithmetic operators +, -, *, /, DIV, and MOD. These operate on integer or real parameters.
* The logical operators NOT, AND, and OR. They work on booleans.

* The relational operators <, <=, =, <>, >, and >=. They take numbers and return booleans.

* The SET operators IN, +, -, and *. They work on sets; note that +, -, and * mean quite different (though conceptually similar) things for sets than they do for integers -- + is set union, - is set difference, and * is set intersection.

* Some other things that you may not think of as operators but which most assuredly are. ^ and [, it seems to me, are just as much operators as anything else.

Now, it takes no feats of analysis to construct this kind of list; I just looked in the PASCAL manual. The C manual reveals to me that the C operator set is -- at least in terms of number of operators -- quite a bit richer:

   C operator                PASCAL equivalent
   ----------                -----------------
   monadic +, monadic -      Same
   +, -, *, /                Same
   %                         MOD
   <, >, <=, >=, ==, !=      Same; == is "equal", != is "not equal"
   !, &&, ||                 NOT, AND, OR (respectively)
   ~, &, |, ^                BITWISE NOT, AND, OR, and EXCLUSIVE OR
   <<, >>                    BITWISE LOGICAL SHIFTS (LEFT and RIGHT)
   monadic *, e.g. *X        pointer^
   monadic &, e.g. &X        None -- returns address of a variable
   ++X, --X                  None -- increments or decrements X by 1
                             (usually) and also returns the new value
                             (simultaneously changing X)
   X++, X--                  None -- increments or decrements X by 1
                             (usually) but returns the value that X had
                             BEFORE the increment/decrement!
   X=Y                       := assignment, but can be used in an
                             expression (e.g. "X = 2+(Y=3)+Z")
   X+=Y, X-=Y, X*=Y, X/=Y,   None -- X+=Y means the same as "X=X+Y",
   X%=Y, X&=Y, X|=Y, X^=Y,   and so on for the other ops
   X<<=Y, X>>=Y
   X?Y:Z                     None -- an IF/THEN/ELSE that can be used
                             within an expression. "(X<Y)?X:Y" returns
                             the minimum of X and Y
                             (IF X<Y THEN X ELSE Y).
   (X,Y,Z)                   None -- executes the expressions X, Y, and
                             Z, but only returns the result of Z!
   sizeof X                  Returns the size, in bytes, of its operand
                             expression.
This is the rich operator set that many C programmers are so proud of, and deride PASCAL and other languages for not possessing. Indeed, several new categories of operators do exist. But are they really that useful? Or can they be easily (and, perhaps, more readably) emulated with conventional, more familiar, operators?

BIT MANIPULATION

PASCAL prides itself on being a High Level language. That's not just high level as in "high-level" -- it's High Level, with a capital H and a capital L. In PASCAL, much care is taken to insulate the programmer from the underlying physical structure of the data. You don't need to know how many bytes there are in a word, or how many words there are in your data structure; you don't even need to know that there is such a thing as a "bit" and that it is what is manipulated deep down inside the computer. Unfortunately, we do not live in a High Level world. If I want to write a PASCAL program that can access, say, file labels, or PCB entries, or log file records, I need to be able to access individual bit fields. This may mean that the system was badly designed in the first place; perhaps too much attention was paid to saving space; perhaps the operating system should have been written in PASCAL, too, so I could just use the same record structure definitions to access the system tables that the system itself uses. For better or worse, there are plenty of cases where we need to manipulate bit fields:

* In FOPEN, FGETINFO, and WHO calls;

* In accessing data in system tables and system log files;

* In writing one's own packed decimal manipulation routines;

* In compressing one's own data structures to save space, by no means an unworthy goal;

* And many other cases, but primarily systems programming rather than application programming.

Thus, although bit fields are admittedly not something you usually use quite as often as, say, strings or integers, they must be supported by any system programming language.
How do PASCAL, SPL, and C support bit fields? TYPICAL OPERATIONS ON BITS There are really three main classes of operations that you'd want to perform with bits: * BITWISE OPERATIONS ON ENTIRE NUMBERS. These view a quantity -- an 8-bit, a 16-bit, a 32-bit, or whatever -- as just a sequence of bits. When you do a "bitwise not" of a number, each of the bits in it is negated; a "bitwise and" of two numbers ands together each bit in the two numbers -- bit 0 of the result becomes the "logical and" of bit 0 of the first number and bit 0 of the second; bit 1 of the result becomes the and of the two bits 1; and so on. * BIT SHIFTS. Bit shifts also view a number as a bit sequence. A bit shift just takes all of the bits and "moves" them some number to the left or to the right. For instance, if you shift 101 left by 2 bits, this takes the bit pattern 00001100101 and makes it into 00110010100 which is 404. * BIT EXTRACTS/DEPOSITS. Bit extracts take a particular sequence of bits -- say "3 bits starting from bit #10" -- and extract the value stored at those bits; in other words, they view a bit string as an integer. Bit deposits allow you to set the 3 bits starting at bit #10 to some value. In SPL, for instance, RECFORMAT:=FOPTIONS.(8:2) and FOPTIONS.(8:2):=RECFORMAT extract and deposit bit fields. Since the computer's lowest-level data type is the bit, bit operations can usually be performed quite easily and efficiently. Virtually all computers have the BITWISE OPERATIONS and SHIFTS built in as instructions, and some (like the HP3000, but not, say, the VAX or the Motorola 68000) have BIT EXTRACT/DEPOSIT instructions as well. An interesting fact is that although bitwise operations and shifts are relatively hard and inefficient to emulate in software, bit extracts and deposits are relatively easy. 
For instance, X.(10:3) -- extract 3 bits starting from bit #10 (assuming 16-bit words, least significant bit=#15) is the same as X shifted left by 10 bits and shifted right by 13 bits or X shifted right by 3 bits and ANDed with 7. In general, I.(START:COUNT) = I shifted left by START and shifted right by (#bits/word) - COUNT. So, just because these three types of operations are available in certain languages, it does not follow that all of them are necessary. Some can be easily (or not so easily) emulated using the others, but, more importantly, it may well be the case that some just aren't very frequently useful. Bit extracts, for instance, are something that I've frequently found myself doing; bitwise operations and especially shifts are (at least for me) far rarer things. SPL SPL supports all three types of bit operations, and supports them all in a big way. * Single-word unsigned quantities (LOGICALs) can be bit-wise negated (NOT), ANDed (LAND), inclusive ORed (LOR), and exclusive ORed (XOR). * Single-word quantities can have arbitrary CONSTANT bit substrings extracted and deposited. In other words, you can say "FOPTIONS.(10:3)", but you CAN'T say "CAPMASK.(CAPNUM:1)". * Single-, double-, triple-, and quadruple-word quantities can be shifted in a number of ways -- see the SPL Reference Manual for details. * Some other, less important features, like "Bit concatenation" are also supported; see the SPL Reference Manual if you're curious. Bit extractions are, naturally, vital in SPL. All sorts of things -- FOPEN foptions and aoptions, WHO capability masks, system table fields, etc. -- contain various bit fields, most of which do not cross word boundaries (hence it's sufficient to have bit fields of single words only) and have constant offsets and lengths (hence it's only necessary to have constant bit field parameters). Other operations are less frequent. 
Checking my 13,000-line RL file -- which, of course, is completely representative of any SPL program that anybody's ever written -- I find that I use shift operations in a very distinct set of cases: * Converting byte to word addresses and vice versa. * Extracting variable bit fields (e.g. if I have a WHO-format capability mask and a variable capability bit number). * Extracting bit fields of doubles (from WHO masks, disc addresses, and other double-word entities). * Rarely, quick multiplies and divides by powers of two. * Constructing integers and doubles out of bytes and integers. Note that the majority of these cases are actually "work-arounds" caused by compiler problems. If SPL had a C-like "cast" facility with which I could convert a byte address to a word address and vice versa, I wouldn't need to do an ugly and unreadable shift; if SPL's bit extracts were more powerful, I could use them for variable fields and doubles; if SPL's optimizer were better, I could always do multiplies and divides and count on SPL to do the work. Shifts, then, in my opinion are a classic case of something that might not be needed in a "perfect" system; however, as we see, they can come in handy to avoid the imperfections that are bound to exist in any language. Similarly, I find that I almost don't use bitwise operations at all; the only cases I do use them are those where I need to implement double-word bit fields and variable bit fields. C C's approach to bit operations is rather like that of SPL. Since most computers have bit manipulation instructions, C has bit operators built in to the language. Actually, C does not have bit extracts and deposits, but it does have shifts and bitwise operators, which (as I mentioned) can be used to emulate bit extractions. However, C has another mechanism to handle bit fields, which is both more and less usable than SPL's. In C, structures can have fields that are explicitly declared to be a certain number of bits long. 
For instance, consider the following definition:

   typedef struct {unsigned:2;                 /* Bits .(0:2); unused */
                   unsigned file_type:2;       /* .(2:2)  */
                   unsigned ksam:1;            /* .(4:1)  */
                   unsigned disallow_fileeq:1; /* .(5:1)  */
                   unsigned labelled_tape:1;   /* .(6:1)  */
                   unsigned cctl:1;            /* .(7:1)  */
                   unsigned record_format:2;   /* .(8:2)  */
                   unsigned default_desig:3;   /* .(10:3) */
                   unsigned ascii:1;           /* .(13:1) */
                   unsigned domain:2;          /* .(14:2) */
                  } foptions_type;

This defines a data type called "foptions_type" as a structure with several subfields. However, each subfield occupies a certain number of bits, and because of certain guarantees made by the C compiler (which differ, by the way, among different Cs on different machines) we know which bits they are. Thus, if we say:

   foptions_type foptions;
   ...
   fgetinfo (fnum,,foptions);
   if (foptions.cctl==1)
     ...

A lot clearer, you'll agree, than saying "FOPTIONS.(7:1)". And the effect, of course, is exactly the same. On the other hand, there are certain restrictions to this bit field extraction mechanism:

* Just like in SPL, you can't extract bit fields whose offset and length are not constant. To do this, you have to use the shift operators or bitwise operations. For instance, instead of writing CAPMASK.(BIT:2) (which you couldn't do in SPL anyway), you'd say:

   (capmask << bit) >> 14

or, perhaps,

   (capmask >> (14-bit)) & 3      (3 = a binary 11)

You have to do similar (but uglier) stuff to set variable bit fields.

* Another difference -- in which C falls short of SPL -- is that the "structure subfield" approach of bit extraction only works for getting bits from variables that were declared to be a certain type. Say you want to extract bits (10:2) of something that was declared as an "int", or, perhaps, an expression (like "foptions | 4", which ORs foptions with 4, thus setting the ASCII bit). You can't say

   (foptions | 4).record_format

or even

   ((foptions_type) (foptions | 4)).record_format

(since you can't cast something to a structure type).
Granted, it's very rare that you'd want to extract a subfield of an expression, and if you're trying to extract a subfield of a type "int" variable, this might mean that you declared the variable wrong. Still, the point here is that SPL's bit extract mechanism is more flexible (though much less readable).

* Finally, a philosophical issue. Yes, you can use C record structures to extract bit fields; but what if you want to use a particular subfield only once? Do you need to declare a special record structure just for extracting the field? Wouldn't it be easier to do like you can in SPL, specifying the bit offset and length directly, without encumbering yourself with a new datatype? This becomes an even more serious problem with PASCAL, in which you can't even do bit shifts to emulate impromptu bit extraction. The key point here is that for QUICK AND DIRTY, one-shot operations, declaring a structure in order to be able to use bit subfields may be more cumbersome than you'd like. Again, the old trade-off of "Good Programming Style" versus ease of writing.

Thus, C supports bitwise operations and shifts (although not as many shift operators as SPL does; on the other hand, many of SPL's shift operators are of doubtful utility). It can also emulate bit field manipulations using shifts, but, more importantly, can make bit manipulations a lot easier and more readable using record structure bit subfields. On the other hand, SPL's ".(X:Y)" operator has the advantage of being usable on an ad-hoc basis, without needing to declare a special record structure.

PASCAL

As I mentioned before, PASCAL, perhaps more for philosophical reasons than anything else, does not explicitly support bits. "PASCAL: An Introduction to Methodical Programming" (W. Findlay & D. A. Watt) -- the book from which I learned PASCAL, and one that describes all the Standard PASCAL features -- doesn't even mention "bits" in its index.
Of course, the need for bit fields was recognized quite early, and a fairly common consensus developed. PACKED RECORDs are defined in Standard PASCAL as structures that the compiler may -- at its option -- make use less space but slower to access. Many PASCAL compilers use PACKED RECORDs as vehicles for implementing bit subfields, much like C does. For instance, say that you want to declare an "foptions" type much like the one I showed for C. In PASCAL, it would be:

   TYPE FOPTIONS_TYPE = PACKED RECORD
          DUMMY: 0..3;            { .(0:2)  }
          FILE_TYPE: 0..3;        { .(2:2)  }
          KSAM: 0..1;             { .(4:1)  }
          DISALLOW_FILEEQ: 0..1;  { .(5:1)  }
          LABELLED_TAPE: 0..1;    { .(6:1)  }
          CCTL: 0..1;             { .(7:1)  }
          RECORD_FORMAT: 0..3;    { .(8:2)  }
          DEFAULT_DESIG: 0..7;    { .(10:3) }
          ASCII: 0..1;            { .(13:1) }
          DOMAIN: 0..3;           { .(14:2) }
        END;

Note the most obvious feature here (which you may consider either ingenious or utterly laughable, depending on your prejudices). Instead of specifying the number of bits you're using explicitly, you specify a range from 0 to

   2 ^ NUMBITS - 1

The compiler then decides that the smallest number of bits it could use to represent this is NUMBITS, and allocates that many. Remember that PASCAL is very reluctant to let the programmer "see" any aspect of the internal representation of its variables; therefore, it prefers that bit fields be thus declared implicitly rather than explicitly. Now, you can access the bit fields of an FOPTIONS_TYPE variable just like you would in C:

   VAR FOPTIONS: FOPTIONS_TYPE;
   ...
   FGETINFO (FNUM, , FOPTIONS);
   IF FOPTIONS.CCTL=1 THEN
     ...

If you don't like having to declare the bit fields with all those powers of 2 (quick -- how many bits is "0..8191"?), you can just issue the following declarations (usually in an $INCLUDE$ file):

   TYPE BITS_1 = 0..1;
        BITS_2 = 0..3;
        BITS_3 = 0..7;
        ...
        BITS_13 = 0..8191;
        BITS_14 = 0..16383;
        BITS_15 = 0..32767;

and then declare each subfield of, say, FOPTIONS_TYPE, as being of type BITS_1 or BITS_3 or whatever.
So, the high-level means of accessing bit subfields exists in PASCAL just as it does in C. What about the low-level means? What if you want to do a shift or a bitwise AND? Or, more concretely, what if you want to extract, say, a bit field that starts at a variable offset? Fortunately, HP PASCAL (unlike SPL and C) provides a nice built-in mechanism for handling variable-offset bit fields. Consider our classic example, a 32-bit "capability mask" (of the type that WHO returns). We want to be able to retrieve, say, the BITNUMth bit of the capability mask. In PASCAL, we say:

   TYPE CAPABILITY_MASK_TYPE = PACKED ARRAY [0..31] OF 0..1;
   VAR CAPABILITY_MASK: CAPABILITY_MASK_TYPE;
   ...
   I:=CAPABILITY_MASK[BITNUM];

Simple! Because "PACKED" in PASCAL can apply to any kind of structure, HP PASCAL (and many other PASCAL compilers) allow PACKED ARRAYs of subranges that can fit in less than 1 byte (e.g. 0..1, 0..3, etc.) to become arrays of bit fields. Thus, "CAPABILITY_MASK[BITNUM]" extracts the BITNUMth bit of CAPABILITY_MASK simply because, to HP PASCAL, the BITNUMth bit is the BITNUMth element of an array of 32 bits. We could do the same, incidentally, to an array of 48 bits, 64 bits, etc. On the other hand, if we want to retrieve more than one bit, we can see the limitations of the PASCAL approach:

* There is no simple way, for instance, to retrieve a variable number of bits from an integer. Since we have neither a bit extract nor a bit shift operator, we can't use them; our only alternatives are using division and modulo by powers of 2 (quite complicated and very inefficient) or using the PACK and UNPACK built-in procedures in a rather esoteric way (equally complicated and maybe more inefficient).

* We can't even, in general, extract, say, 2 bits from a variable bit offset! We can only do this if we know that the bit offset will be a multiple of 2 -- in that case, we can use a PACKED ARRAY OF 0..3.
Similar problems, of course, happen with extracting 3-bit fields, 4-bit fields, etc. from variable boundaries (although we can, for instance, extract nybbles from a packed decimal number, because they always start on a 4-bit boundary). In light of all this, it should be obvious that, say, shifts or bitwise operations are well-nigh impossible to do efficiently or conveniently in PASCAL (although in HP PASCAL, bitwise operations can be craftily and kludgily emulated using sets). What we see here is, I believe, a common PASCAL syndrome. Those things that the language supports -- to wit, fixed-offset bit fields and arrays of single-bit fields -- it often supports quite well; you can use these features very readably and efficiently. On the other hand, anything that the language designers didn't think to give you -- like shifts and bitwise operators -- you have NO WAY of accessing. If the compiler is too dumb to implement "X*8" as the much more efficient "X left shifted by 3 bits", too bad; in SPL and C you can do this yourself, but in PASCAL you can't. If the compiler doesn't have variable-offset bit field support, SPL and C let you do it using shifts; in PASCAL, it would be very difficult and very inefficient. Thus, if you find PASCAL's bit handling mechanisms sufficient -- as is quite probable, since the major features are there -- then you'll have no problems. On the other hand, there's a very distinct limit on what you can do in PASCAL, and PASCAL doesn't have the flexibility to let you work around it easily. PASCAL/XL Just a brief comment about PASCAL/XL -- in PASCAL/XL, instead of using PACKED RECORD and PACKED ARRAY for bit field support, you must use CRUNCHED RECORDs and CRUNCHED ARRAYs. Remember this when you try to write code that'll run both in PASCAL/XL and normal HP PASCAL. Remember this and weep. INCREMENT AND DECREMENT IN C Another feature of C worth mentioning is the variable increment/ decrement set of operators. 
This is what allows C programmers to say

   char a[80], b[80];
   char *pa, *pb;
   pa = &a[0];
   pb = &b[0];
   while (*pa != '\0')
     *(pb++) = *(pa++);

This, of course, is either one of the most elegant pieces of code ever written, or one of the most unreadable. Or both. There are four operators like this in C:

* ++ PREFIX, i.e. "++X". This increments X by 1 (or sometimes 2 or 4, if it's a pointer -- more about this later) and returns X's new value. In other words, if you say

   int x, y;
   x = 10;
   y = 1 + (++x) + 2;

then X will be set to 11 and Y will be 14 (1 + 11 + 2).

* ++ POSTFIX, i.e. "X++". This increments X by 1 (or 2 or 4) and returns X's OLD value, the one it had before the increment. Thus, if you say

   int x, y;
   x = 10;
   y = 1 + (x++) + 2;

then X will be set to 11, but Y will be 13 (1 + 10 + 2). In the calculation of Y, the old value of X (10) was used.

* -- PREFIX ("--X"), just like ++, but decrements.

* -- POSTFIX ("X--"), just like X++, but decrements.

Note that ++ and -- don't always increment/decrement by 1. If you pass them a POINTER, the pointer will be incremented or decremented by the SIZE OF THE OBJECT BEING POINTED TO. Thus, if "int" is a 4-byte integer,

   int a[10];
   int *ap;
   ap = &a[0];
   ++ap;

will increment AP by 4 bytes (or 2 words, depending on how the pointer is represented internally). In other words, "++" and "--" of pointers actually increment or decrement by one element; if AP used to point to element 0 of A, now it points to element 1. Now, one reason why I mention these operators is that they are a non-trivial difference between C and PASCAL, and I have to say something about them just so you'll think I'm thorough. Another reason, though, is that beyond the seemingly simple (and, in the case of the postfix operators, counterintuitive) definition lurks a fairly powerful construct that can be quite useful in many cases. On the other hand, some say -- and not without reason -- that using these kinds of operators makes code much more difficult to read and understand.
One of the original reasons that these operators were introduced was that some of the computers that C was first implemented on supported these operations in hardware. Modern computers, like the VAX and the Motorola 68000, for instance, have special "addressing modes" on each instruction that allow you to store something into the location pointed to by a register and then increment the register (like postfix ++) or decrement the register and then store (like prefix --). A reasonable compiler, though, can know enough to translate, say, X:=X-1; into "decrement X"; even the 15-year-old SPL compiler can do this. Today's reason for the increment/decrement operators -- besides saving poor programmers' weary fingers -- is that in many cases they can very directly represent what you're trying to do. A classic case, for instance, is stack processing. Say you want to implement your own stack data structure. The primary operations you need are to PUSH a value onto the stack and to POP a value. In SPL, you might have a pointer PTR that points to the top cell, and define two procedures,

   PROCEDURE STACK'PUSH (V);
   VALUE V;
   INTEGER V;
   BEGIN
   @PTR:=@PTR+1;
   PTR:=V;
   END;

   INTEGER PROCEDURE STACK'POP;
   BEGIN
   STACK'POP:=PTR;
   @PTR:=@PTR-1;
   END;

In C, you can have PTR point to one cell AFTER the top cell, and say:

   *(ptr++) = v;    /* to push V onto the stack */
   v = *(--ptr);    /* to pop a value from the stack */

For stacks (and queues and other data structures), post-increment and pre-decrement are EXACTLY what you need. Of course, a full-function stack package would have to have many more features, but many of them can profitably use post-increment/pre-decrement and other nice C features. Other applications are, for instance, the case I showed as an example earlier:

   char a[80], b[80];
   char *pa, *pb;
   pa = &a[0];
   pb = &b[0];
   while (*pa != '\0')
     *(pb++) = *(pa++);

What this actually does (isn't it obvious?) is copy the string stored in the array A to the array B.
PA is a pointer to the current character in A; PB is a pointer to the current character in B. Since all C strings are terminated by a null character ('\0'), the loop goes through A, incrementing the pointers and copying characters at the same time! Similarly, you can say...

   while (*(pa++) == ' ');

which will increment PA until it gets past a non-blank (strictly speaking, PA ends up pointing one character AFTER the first non-blank); or,

   while (*(pa++) == *(pb++));

which will increment PA and PB while the characters they point to are equal -- very useful for a string comparison routine. In a way, these features of C are rather like FOR loops that never execute when the starting value is greater than the ending value. In PASCAL, for instance, you can say

   FOR INDEX:=CURRCOLUMN+1 TO ENDCOLUMN DO
     ...

and know that if CURRCOLUMN = ENDCOLUMN, the loop won't be executed at all (which happens to be exactly what you want). In classic FORTRAN, though,

   DO 10 INDEX=CURRCOLUMN+1,ENDCOLUMN
   ...

will always execute the loop at least once, even if CURRCOLUMN is equal to ENDCOLUMN; if you don't want this, you have to have an IF ... GOTO around the loop. The point here is that the "post/pre-increment/decrement" features are one of those things that "just happen to come in handy" in a surprising number of cases. Just by looking at them, you wouldn't think that they're so useful, but there are a lot of applications where they are just the thing. Fine, you've heard the "pro". ++ and -- let you write a lot of elegant one-liners for handling stacks, strings, queues and the like. Now, the con:

   while (*pa != '\0')
     {
     *pb = *pa;
     pb = pb + 1;
     pa = pa + 1;
     }

   while (*pa == ' ')
     pa = pa + 1;

   while (*pa == *pb)
     {
     pa = pa + 1;
     pb = pb + 1;
     }

What are these? Well, these are the C loops that do essentially what the above post/pre-increment examples do, but using conventional operators. Are they more or less readable than the ++ mechanisms we saw?
Let us for the moment ignore performance; even if the compiler doesn't optimize all these cases (and, to be fair, many compilers won't), performance isn't everything. DO ++ AND -- CONSTRUCTS MAKE YOUR CODE MORE OR LESS READABLE? Now, I don't have any opinions on this matter; I just tell you the two sides of the issue and let you decide. I am completely objective (if you believe that, I've got some waterfront property in Kansas you could have real cheap...). Readability isn't a black-and-white sort of thing; everybody has his own standards. What do you think? Are these "two-in-one" programming constructs elegant or ugly? ?: AND (,) Let's say that you want to call the READX intrinsic, reading either LEN words or 128 words, whichever is less. You know that your buffer only has room for 128 words, and you don't want to overflow it if LEN is too large; normally, though, if LEN<128, you want to read only LEN words. In PASCAL, you'd have to write this: IF LEN<128 THEN ACTUAL_LEN := READX (BUFFER, LEN) ELSE ACTUAL_LEN := READX (BUFFER, 128); In C, however, you can instead say: actual_len = readx (buffer, (len<128)?len:128); (Think of all the keystrokes you save!) What this actually means is that the second parameter to READX is the expression (LEN<128) ? LEN : 128 This is a "ternary" operator -- an operator with three parameters: * the first parameter (before the "?") is a boolean expression, called the "test" -- in this case "LEN<128"; * the second parameter (between the "?" and ":") is an expression called the "then clause", in this case "LEN"; * the third parameter (after the ":") is the "else clause", in this case "128". The behavior is quite simple -- if the "test" is TRUE, this operator returns the value of the "then clause"; if the test is FALSE, the operator returns the value of the "else" clause. Just like an IF/THEN/ELSE statement except that it returns a value of an expression instead of just executing some statements. The advantage should be clear. 
There are many cases where you need to do one of two things, almost exactly identical except for one key parameter. If the two tasks need a different statement in one case or another (e.g. a call to READ instead of READX), you'd use an IF/THEN/ELSE; if they need a different expression as a parameter inside a statement, you'd use a ?: construct. The trouble is, again, one of readability. Consider this example (taken as an example of the "right way" of using ?/: from "C, a reference manual", by Harbison & Steele): return (x > 0) ? 1 : (x < 0) ? -1 : 0; OK, quick, what does this do? Why, it determines the "signum" of a number, of course! +1 if the number is positive, -1 if it's negative, 0 if it's zero. Which is more readable -- the above or if (x > 0) return 1; else if (x < 0) return -1; else return 0; Again, up to you to decide -- some would say that ?: is better, others would side with the IF/THEN/ELSE. On the other hand, in this case, I think there is a substantive thing to be said against "? :": * ?: IS MORE THAN JUST AN OPERATOR; IT'S A CONTROL STRUCTURE IN THAT IT INFLUENCES THE FLOW OF THE PROGRAM. ESPECIALLY WHEN ?:S ARE NESTED (e.g. if you're testing two conditions and do one of four things based on the result), THE FACT THAT THIS CONTROL STRUCTURE IS DELIMITED BY TWO SPECIAL CHARACTERS (rather than, say, IF, THEN, or ELSE) CAN MAKE THE PROGRAM DIFFICULT TO READ. In other words, since "?" and ":" are just two special characters, like many of the other special characters that occur in C statements, you can often have a hard time finding out where the test starts and where it ends, where one THEN or ELSE clause ends, and so on. This is especially the case when you write code like a = (x>0) ? ((y>0)?(x*y):(-x*y)) : ((y>0)?(-x+y):(error_trap())); In which ?:s are nested within each other. Of course, you might say that this code is badly written; perhaps it should be: a = (x>0) ? ((y>0) ? (x*y) : (-x*y)) : ((y>0) ? 
(-x+y) : (error_trap()));

But then, why not just write it as

   if (x>0 && y>0)
     a = x*y;
   else if (x>0)
     a = -x*y;
   else if (y>0)
     a = -x+y;
   else
     a = error_trap();

In any case, this is mostly a matter of personal preferences. I'm in favor of using ?: in #define's (where it is necessary -- see the chapter on them, and on "(,)" below), such as

   #define min(a,b) (((a)<(b)) ? (a) : (b))
   #define abs(a)   (((a)<0) ? -(a) : (a))

On the other hand, I try to avoid ?:s in normal code in almost all cases. I prefer IF/THEN/ELSE statements instead.

(,)

Just like ?: is equivalent to an IF/THEN/ELSE, (x,y,z) is essentially equivalent to

   {  /* begin */
   x;
   y;
   z;
   }  /* end */

The difference, of course, is that "(x,y,z)" returns the value of Z. For instance, say that we're looping through a file. We want to read a bunch of records until we get an EOF (which, presumably, is returned by a call to the "get_ccode" procedure). We might write:

   while ((len=fread(fnum,data,128), get_ccode()==2))
     ...

Instead of a simple expression, our loop test consists of two parts, which are both evaluated (in the given order!) to determine the value -- every time we do the loop test, we first call FREAD and then check the result of GET_CCODE. This, of course, is identical to

   len = fread(fnum,data,128);
   while (get_ccode()==2)
     {
     ...
     len = fread(fnum,data,128);
     }

but using the "(x,y)" construct, we don't have to write the FREAD call twice. A more common use of this is in FOR loops. The three parts of the FOR loop -- the loop counter initialization, the loop termination test, and the loop counter increment -- must all be single expressions. Using the "," operator, you can fit several operations into one expression, to wit:

   for (pa=&a[0], pb=&b[0]; *pa!='\0'; pa++, pb++)
     *pb=*pa;

This, of course, copies the string A (without the trailing zero) to the string B.
The "," operator isn't used here for its value (which would be the value of "&b[0]" in the initialization portion and the new value of "pb" in the increment portion); it's just used for combining several expressions in a context where only one is allowed. The major power of both the ?: operator and the "," operator is manifested in #define's. Statements can only appear at the "top level", separated by semicolons; expressions, however, can appear either inside statements or in place of statements. Thus,

   #define min(a,b) if (a<b) a; else b;

won't work, because it will make x=min(y,z) expand into

   x=if (y<z) y; else z;

which is quite illegal. On the other hand,

   #define min(a,b) (((a)<(b)) ? (a) : (b))

will translate x=min(y,z) into

   x=((y)<(z)) ? (y) : (z);

which will do the right thing. Similarly, say you have a record structure

   typedef struct {real re; real im;} complex;

Then, you can have a #define

   #define cbuild(z,rpart,ipart) (z.re=(rpart), z.im=(ipart), z)

When you say "CBUILD(Z,1.0,3.0)", this sets the variable Z's RE subfield to 1.0, its IM subfield to 3.0, and returns the value Z. You can use this in cases like

   c = csqrt (cbuild(z,1.0,3.0));

If you didn't have the "," operator, you couldn't write a #define like this. You could say:

   #define cbuild(z,rpart,ipart) {z.re=rpart; z.im=ipart;}

but then it wouldn't be usable in an expression because C statement blocks can't be parts of expressions. Again, my personal attitude towards the "," operator is similar to my opinion about ?:. It is necessary for #DEFINEs but best avoided in normal code, the one exception being FOR loops, where it's used more as a separator than to return a result. That's one reporter's opinion.

THE "COMPOUND ASSIGNMENTS" (+=, -=, ETC.); ALSO, SOME MORE GENERAL COMMENTS ON EFFICIENCY AND READABILITY

Finally, C has one other set of interesting operators. These are the "compound assignments", which perform an operation and do an assignment at the same time.
A possible example might be: x[index(7)+3] += inc; which is, of course, identical to x[index(7)+3] = x[index(7)+3] + inc; Similarly, (*foo).frob |= 0x100; means the same thing as: (*foo).frob = (*foo).frob | 0x100; (and happens to set bit 8 -- the ninth least-significant bit -- of "(*foo).frob"). [Note: The above examples aren't actually EXACTLY the same because of considerations pertaining to "double evaluation" of the expression being assigned to; however, this isn't usually very relevant, and I won't discuss it here.] Now if I wanted to, I could stop here. Obviously "a x= b" means the same as "a = a x b" (where "x" is pretty much any dyadic operator) -- now you know it and can make up your own minds about it. But, what the hell -- I'm a naturally garrulous kind of guy. I could run on for pages about these operators, and their psychoscientific motivations! In fact, I think I might do just that, because I think that there's something of deeper significance to them than just a few saved keystrokes. To put it simply, there are several "statements" that the presence of these operators -- "+=", "-=", "++", "--", etc. -- makes. Whether you take the "pro" side or the "con" on them will influence your opinion on the utility of these operators: * EFFICIENCY. Saying "X += Y" or "X++" will let the compiler generate more efficient code than just "X = X+Y" or "X = X+1". - PRO: The compiler "knows" that we're just incrementing a variable (rather than doing an arbitrary add) and can generate the more efficient instructions that most computers have for this special case. - CON: All -- or almost all -- modern compilers can easily deduce this information even from a "X = X+Y" or "X = X+1". Even SPL/3000, which is 15 years old, will generate an "INCREMENT BY ONE" instruction if you say "X := X+1". - MORE PRO (COUNTER-CON?): It's true that most compilers will automatically generate fast code for increments, bit extracts, etc. 
However, every compiler will have SOME flaw somewhere -- perhaps it won't recognize one particular case and will generate inefficient code. Special operators that the compiler ALWAYS translates efficiently can allow you to write efficient code even if you're stuck with a silly compiler implementation. * READABILITY. Saying "X += Y" is more readable than "X = X+Y". - PRO: Consider one of the examples above: x[index(7)+3] = x[index(7)+3] + inc; Here, we're incrementing "x[index(7)+3]" -- but how does the person reading the program know that? He has to look at the fairly complex expressions on both sides of the assignment, and make sure that they're identical! Similarly, when he's writing the program, it's quite easy for him to make a mistake -- say "x[index(3)+7]" instead of "x[index(7)+3]" on one side of the assignment, and probably never see it because he "knows" that it's just a simple increment. Saying x[index(7)+3] += inc; is actually MORE readable, since you don't have to duplicate any code and thus introduce additional opportunity for error. - CON: "x[index(7)+3] += inc". Can you read that? I can't read that! The more special characters and operators a language has, the harder it is to read. Everybody's USED to simple ":=" assignments, present in ALGOL, FORTRAN, PASCAL, SPL, ADA, etc.; when we introduce a whole new bevy of operators, people are likely to misunderstand them, or at least have to take extra time and effort while reading the program. * FLEXIBILITY. OK, so you don't like these operators -- don't use them! - PRO: Hey, this is a free country! Look at the entire rich operator set and use only those that you find pleasant; at least in C, you have the choice. - CON: I might not be forced to WRITE programs with these operators in them, but I may well be forced to READ them; 70% of a program's lifetime is spent in maintenance, and I don't want my programmers to write in a language that ENCOURAGES them to write unreadably! 
A language should be restrictive as well as flexible -- it should prevent wayward programmers from writing unreadable constructs like: x += (x++) + f(x,y) + (y++); /* can you understand this? */ - PRO again: Authoritarian fascist! - CON again: Undisciplined hippie! OK, break it up, boys. I think that the above issues are particularly involved in evaluating C's rich (but perhaps "undisciplined") operator set, and, to some extent, the differences between PASCAL and C in general. I won't pretend to tell you which attitude is correct -- I don't know myself. I just want to lay more of the cards out on the table. PASSING VARIABLE NUMBERS OF PARAMETERS TO PROCEDURES -- SPL One feature that SPL has is so-called "OPTION VARIABLE" procedures. This is a procedure that looks like this: PROCEDURE V (A, B, C); VALUE A; INTEGER A; BYTE ARRAY B, C; OPTION VARIABLE; BEGIN ... INTEGER VAR'MASK = Q-4; ... IF VAR'MASK.(14:1)=1 THEN << was parameter B specified? >> ... END; What does this mean? This means that when we say: V (1); or V (,,BUFF); or even simply V; the SPL compiler will NOT complain that we didn't specify the three parameters that V expects. Rather, it will pass those parameters you've specified, pass GARBAGE in place of the parameters you've omitted, and will set the "Q-4" location in your stack to a bit mask indicating exactly which parameters were specified and which were not. As you see, we've declared the variable VAR'MASK to reside at this Q-4 location, and can now say IF VAR'MASK.(x:1)=1 THEN to check whether or not the parameter indicated by "x" was actually specified. "x" has to be the bit number associated with the parameter, counting from 15 (the last parameter) down. 
Thus, to check if C (the last parameter) was specified, we'd say IF VAR'MASK.(15:1)=1 THEN To check for B (the second-to-last), we'd say IF VAR'MASK.(14:1)=1 THEN To check for A (the third-to-last), we'd say IF VAR'MASK.(13:1)=1 THEN Note the twin advantages of being able to omit parameters: * It can make the procedure call a lot easier to write or read; the FOPEN intrinsic has 13 parameters, all of them necessary for one thing or another. Do you want to have to say: MOVE DISC'DEV:="DISC "; FNUM:=FOPEN (FILENAME, 1, %420, 128, DISC'DEV, DUMMY, 0, 2, 1, 1023D, 8, 1, 0); or wouldn't you rather just type FNUM:=FOPEN (FILENAME, 1, %420); and let all the other parameters automatically default to the right values? It's easier to write AND gives less opportunity for error (did you notice that I accidentally specified blocking factor 2 and 1 buffer instead of the default, which is the other way around?). * Furthermore, the very act of omitting or specifying a parameter carries INFORMATION. For instance, FOPENing a file with DEV=LP and the forms message parameter OMITTED is quite different than passing any forms message. The very fact that the forms message wasn't specified tells the file system something. Similarly, omitting the blocking factor in an FOPEN makes the file system calculate an "optimal" (actually it isn't) blocking factor for the file. Many examples -- FOPEN, FGETINFO, FCHECK, etc. -- can be given where not having OPTION VARIABLE would make calling the procedure a substantial burden. While we talk about the advantages of OPTION VARIABLE procedures, let's note also some of the problems with the way they're implemented in SPL: * If you declare a procedure to be OPTION VARIABLE, then SPL will let its caller omit ANY parameter. Usually, some parameters are optional, while others (e.g. the file number in an FGETINFO call) are required, and you'd like the compiler to enforce this requirement. 
Otherwise, you'd either have to rely on the user (always a bad idea), or check the presence of each of the required parameters yourself at run-time (possible but somewhat cumbersome and inefficient). * As you saw, checking to see whether a parameter was actually passed is not an easy job. Instead of saying IF HAVE(FILENAME) THEN you have to say INTEGER VAR'MASK = Q-4; ... IF VAR'MASK.(3:1)=1 THEN knowing (as of course you do) that FILENAME is the 13th-to-last procedure parameter and is thus indicated by VAR'MASK.(3:1). * Often, a user's omission of a parameter simply means that some default value should be assumed. Why not have the compiler take care of this case for you instead of making you do it yourself? For instance, if you were writing FOPEN, wouldn't you rather say: INTEGER PROCEDURE FOPEN (FILE, FOPT, AOPT, RECSZ, DEV, ...); VALUE FOPT, AOPT, RECSZ, ...; BYTE ARRAY FILE (DEFAULT ""), DEV (DEFAULT "DISC "); INTEGER FOPT (DEFAULT 0), AOPT (DEFAULT 0), RECSZ (DEFAULT 128); ... OPTION VARIABLE; instead of INTEGER PROCEDURE FOPEN (FILE, FOPT, AOPT, RECSZ, DEV, ...); VALUE FOPT, AOPT, RECSZ, ...; BYTE ARRAY FILE, DEV; INTEGER FOPT, AOPT, RECSZ; ... OPTION VARIABLE; BEGIN INTEGER VAR'MASK=Q-4; ... IF VAR'MASK.(3:1)=0 THEN @FILE:=@DEFAULT'FILE; IF VAR'MASK.(4:1)=0 THEN FOPT:=0; IF VAR'MASK.(5:1)=0 THEN AOPT:=0; IF VAR'MASK.(6:1)=0 THEN RECSZ:=128; IF VAR'MASK.(7:1)=0 THEN @DEV:=@DISC'DEVICE; ... END; Not only is that easier on the author of FOPEN, but it could also be more efficient at run-time -- instead of having a whole bunch of run-time bit extracts and checks, the code generated by a call such as: FNUM:=FOPEN (TMPFILE); might actually have all the default values built in to it (just as if the user had explicitly specified them), saving a non-trivial amount of time. * Finally, another interesting concern. Say that I want to write a procedure that's "plug-compatible" with the FOPEN intrinsic. 
In my MPEX/3000, for instance, I have a SUPER'FOPEN procedure that checks a global "debugging" flag, prints all of its parameters if the flag is true, and then calls FOPEN. SUPER'FOPEN also calls the ZSIZE intrinsic to make sure that FOPEN has as much stack space as possible to work with; it might also detect and specially handle certain error conditions, and so on. In other words, what I want to have is an OPTION VARIABLE procedure that does some things and then passes all of its parameters to another OPTION VARIABLE procedure: INTEGER PROCEDURE SUPER'FOPEN (FILE, FOPT, AOPT, RECSZ, ...); ... OPTION VARIABLE; BEGIN ... SUPER'FOPEN := FOPEN (FILE, FOPT, AOPT, RECSZ, ...); ... END; The trouble here is that in my FOPEN call I want to OMIT ALL THE PARAMETERS THAT WERE OMITTED IN THE SUPER'FOPEN CALL and SPECIFY ONLY THOSE PARAMETERS THAT WERE SPECIFIED IN THE SUPER'FOPEN CALL. In other words, in this case I DON'T KNOW WHICH PARAMETERS I WANT TO OMIT UNTIL RUN-TIME. If I just say: SUPER'FOPEN := FOPEN (FILE, FOPT, AOPT, RECSZ, ...); passing all thirteen parameters, FOPEN will think that they're all the ones I want, whereas many of them are garbage. I want to say INTEGER VAR'MASK = Q-4; ... SUPER'FOPEN := VARCALL FOPEN, VAR'MASK (FILE, FOPT, AOPT, RECSZ, ...); somehow telling the compiler: "this isn't an ordinary call, where you should figure out which parameters are specified and which aren't; rather, pass to FOPEN the very same VAR'MASK parameter that I myself was given". PASSING VARIABLE NUMBERS OF PARAMETERS TO PROCEDURES -- STANDARD PASCAL, PASCAL/3000, ISO LEVEL 1 STANDARD PASCAL Standard PASCAL, PASCAL/3000, and ISO Level 1 Standard PASCAL do not allow you to pass variable numbers of parameters to procedures. Enough said? Well, maybe not. As I've mentioned before, the mere fact that language X has a feature that language Y does not doesn't mean that language X is better than language Y. 
This isn't a basketball game where you get 2 points for each feature, and 3 for each one that's really far out. Maybe PASCAL has a point -- do you really need procedures with variable numbers of parameters? Well, the first thing you notice about, say, the CREATE, FOPEN, and FGETINFO intrinsics -- conspicuous users of the OPTION VARIABLE features -- is that they aren't very extensible. Sure, FGETINFO has 20 parameters, and you can specify any and omit any (except the file number); but what if a new file parameter is introduced? Since there are all these thousands of programs that use the old FGETINFO, we can't just add a 21st parameter, since that would make them all incompatible. This, in fact, is why the FFILEINFO intrinsic was created -- FFILEINFO takes a file number and five pairs of "item numbers" and "item buffers". Each item number is a code indicating what piece of information ought to be returned about a file. Thus, up to five different pieces of information can be returned by a single FFILEINFO call. If you need more than five (which is unlikely), you can call FFILEINFO twice or however many times is necessary. A typical call can thus look like: FFILEINFO (FNUM, 8 << item number for "filecode" >>, CODE, 18 << item number for "creator id" >>, CREATOR); instead of the FGETINFO call, which would be: FGETINFO (FNUM,,,,,,,,CODE,,,,,,,,,,CREATOR); Note another advantage of the FFILEINFO approach -- you no longer have to "count commas" to make sure that your parameter is in the right place; the item number (which you've presumably declared as a symbolic constant) indicates what the item you want to get is. So, instead of the 20-parameter OPTION VARIABLE FGETINFO intrinsic, we have FFILEINFO. But FFILEINFO is still OPTION VARIABLE! Remember, FFILEINFO takes up to five item number/item value pairs; in this case we entered only two. 
Of course, we could have said: FFILEINFO (FNUM, 8 << item number for "filecode" >>, CODE, 18 << item number for "creator id" >>, CREATOR, 0, DUMMY, 0, DUMMY, 0, DUMMY); but who'd want to? Similarly, FFILEINFO might have been defined to return only one piece of data at a time (and thus always have exactly three parameters), but again that's not very good. Every FFILEINFO call has some fixed overhead to it (for instance, finding the File Control Block from the file number FNUM); why repeat it more often than you have to? Another example arises in the CREATEPROCESS intrinsic. The CREATEPROCESS intrinsic was introduced when some new parameters -- ;STDLIST, ;STDIN, and ;INFO -- had to be added to the CREATE intrinsic. The CREATE intrinsic, although OPTION VARIABLE, was initially defined to have 10 parameters. This means that any compiled program that uses the CREATE intrinsic expects it to have 10 parameters; if you added three parameters to the CREATE intrinsic in the system SL, all the old programs would stop working. An additional problem with CREATEPROCESS is that it wasn't just a "get me some information" intrinsic like FFILEINFO -- it actually starts a new process. We can't just say "pass five process-creation parameters at a time; if you need to pass more, just call it twice" (like we did for FFILEINFO). All the parameters need to be known to the CREATEPROCESS intrinsic at once. The CREATEPROCESS intrinsic, although OPTION VARIABLE, doesn't really need to be. You can just view it as a five-parameter procedure: CREATEPROCESS (error, pin, program, itemnumbers, items); The itemnumbers array contains the item numbers of all the process-creation parameters; the items array contains the parameters themselves (either the values or the addresses). 
Thus, to do the equivalent of an old CREATE (PROGRAM, ENTRY'NAME, PIN, PARM, 1, , , MAXDATA); (which would create a process with entry ENTRY'NAME, ;PARM= PARM, ;MAXDATA= MAXDATA, and "load flags" 1), we'd say INTEGER ARRAY ITEM'NUMS(0:4); INTEGER ARRAY ITEMS(0:4); ... << Item 1 = entry name, 2 = parm, 3 = load flags, 6 = maxdata; >> << 0 terminates the list. >> << We probably want to have EQUATEs for these "magic numbers". >> MOVE ITEM'NUMS:=(1, 2, 3, 6, 0); ITEMS(0):=@ENTRY'NAME; << the address of the entry name >> ITEMS(1):=PARM; << ;PARM= >> ITEMS(2):=1; << load flags >> ITEMS(3):=MAXDATA; << ;MAXDATA= >> CREATEPROCESS (ERR, PIN, PROGRAM, ITEM'NUMS, ITEMS); As you see, the non-OPTION VARIABLE approach may be more extensible, but it certainly isn't easier to write or read. Finally, let me point out that OPTION VARIABLE procedures, though not easily extensible when you have COMPILED CODE that calls them, are quite easy to extend when you have SOURCE CODE. If you have your own OPTION VARIABLE procedure MYPROC, then adding a new parameter to it is a piece of cake; in fact, it's easier than adding a new parameter to a non-OPTION VARIABLE procedure (for which you'd have to change all the calls to pass an extra dummy parameter). All you need to do to extend an OPTION VARIABLE procedure is to recompile both it and all its callers, so that the newly-compiled code will appropriately reflect the new parameters of the called procedure. PASSING VARIABLE NUMBERS OF PARAMETERS TO PROCEDURES -- KERNIGHAN & RITCHIE C One thing you may have noticed about C is that two of the most important functions in C -- "printf", which outputs data in a formatted manner, and "scanf", which inputs data -- have variable numbers of parameters. 
An example of a call to "printf" might be: printf ("Max = %d, min = %d, avg = %d\n", max, min, average); This call happens to take 4 parameters -- the format string (in which the "%d"s indicate where the rest of the parameters are to be put in) and three integers (max, min, and average). Other calls might take only one parameter (a constant formatted string, e.g. 'printf ("Hi there!\n")') or two or twenty. Now, PASCAL's input/output "procedures" (READ, READLN, WRITE, and WRITELN) also take a variable number of parameters. Unfortunately, PASCAL isn't being quite honest when it just calls them "procedures"; they can get away with things that ordinary procedures can't, such as taking parameters of varying types, taking a variable number of parameters, and even having special parameters prefixed with ":"s (e.g. "WRITELN(X:10, Y:7:4)"). C, however, is serious when it calls "printf" and "scanf" procedures. Their source is kept somewhere in some C library source files; if you don't like them, you can rewrite them yourself, or write your own procedures that take variable numbers of parameters. Let's say that we want to do just that. Let's say that on our way to work, we fall down and knock ourselves on the head. When we wake up, we find that we've inexplicably fallen in love with FORTRAN and want to make our "printf" format strings look exactly like FORTRAN FORMAT statements. (For a slightly more plausible example, say that we want to add some new directives, such as "%m" for outputting data in monetary format, with ","s between each thousand -- standard C "printf" doesn't allow this.) Well, what we really want to do is write a "writef" procedure: writef (fmtstring, ???) char fmtstring[]; ???; { ??? } Now the good news is that -- unlike PASCAL -- C won't get upset when we call this procedure as: writef ("I5,X,F7.2", inum, fnum); on one line, and as writef ("I4,X,S,X,I3", i1, s, i2); on the next; C never checks the number or types of parameters anyway. 
(Note that C allows us to omit parameters at the END of the parameter list; unlike SPL, it doesn't let us omit them from the MIDDLE of the parameter list.) The question is -- how do we write the "writef" procedure? The caller might be able to specify a variable number of parameters, but how will "writef" itself be able to access these parameters? This is where the trouble with having "OPTION VARIABLE"-type C routines comes in. There's no universal, system-independent way for "writef" and any such procedure to find out how many parameters were actually passed, or access those parameters that were passed. Different compilers have different conventions for this sort of thing. Many compilers assure you that if the user passes, say, 3 parameters to a procedure that expects 10 parameters, then the first 3 procedure parameters will have the right values -- it's just that the remaining 7 will be set to garbage. In this case, we could write "writef" as: writef (fmtstring, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10) char fmtstring[]; int p1, p2, p3, p4, p5, p6, p7, p8, p9, p10; Then if we call "writef" using: writef ("I5,X,F7.2", inum, fnum); the procedure can look at the format string -- which it knows will be passed as "fmtstring" -- determine the number of parameters that the format string expects (in this case, 2, one for the I5 and one for the F7.2), and then look only at "p1" and "p2", not at "p3" through "p10", which are known to be garbage. Some other compiler might always assure you that the number of parameters passed to a procedure would be kept in some register, which could then be accessed using an assembly routine. On the other hand, it might say that if 10 parameters were expected but only 3 were passed, the actually passed parameters would be accessible as P8, P9, and P10 instead of P1, P2, and P3. 
Then, WRITEF would have to call the assembly routine to determine the number of passed parameters and would then have to realize that since only 3 parameters were passed, their data is stored in P8, P9, and P10. As you see, there are two components here: * Knowing the number of parameters passed (here determined by looking at FMTSTRING). * Being able to determine the value of each parameter that was passed (here assured by knowing that any parameters that are passed will become the first, second, etc. parameters of the procedure). Somehow -- by some compiler guarantee, or by a calling convention (e.g. the number of parameters is indicated in FMTSTRING, or the parameter list is terminated by -1), or by some assembly routine -- we need to be able to do both of the above things. Finally, let me point out one other factor. When we declare a procedure as writef (fmtstring, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10) we really don't want to access the last 10 parameters as P1 through P10; we want to be able to view them all as elements of one big array, so we could say something like: for (i = 0; i < 10; i = i + 1) process_parm (p[i]); instead of having to say process_parm (p1); process_parm (p2); process_parm (p3); process_parm (p4); process_parm (p5); process_parm (p6); process_parm (p7); process_parm (p8); process_parm (p9); process_parm (p10); The same, incidentally, arises with SPL -- we'd like to be able to access SPL OPTION VARIABLE parameters as elements of one big "parameters array", too. In SPL, it turns out, we can do that by saying: PROCEDURE FOO (P0, P1, P2, P3, P4, P5, P6, P7, P8, P9); VALUE P0, P1, P2, P3, P4, P5, P6, P7, P8, P9; INTEGER P0, P1, P2, P3, P4, P5, P6, P7, P8, P9; OPTION VARIABLE; BEGIN INTEGER ARRAY PARMS(*)=P0; ... END; The PARMS array here is defined to start at the location occupied by the by-value parameter P0; it so happens that the way parameters are allocated on the HP3000, PARMS(3) would be equal to P3, and PARMS(7) would be equal to P7. 
Similarly, in C you can say: foo (p0, p1, p2, p3, p4, p5, p6, p7, p8, p9) int p0, p1, p2, p3, p4, p5, p6, p7, p8, p9; { int *parms; parms = &p0; ... x = parms[i]; /* meaning parameter #I */ ... } and this will work on those C compilers which allocate the parameters the appropriate way on the stack. As you see, the conclusion here -- just like in the general question of writing OPTION VARIABLE-like procedures -- is: * It's probably doable on any particular C implementation, but it's certainly not portable. Thus, to summarize, C's support for procedures with variable numbers of parameters is: * Unlike PASCAL, C syntax allows you to specify a different number of parameters in a call than the procedure actually has: you can call a 10-parameter procedure using "P (1, 2, 3)". * Unlike SPL, you can't omit any parameters in the middle of a call -- "P (1,, 3,,, 6)" is illegal. * Although the CALL is legal and portable, there's no way to portably write a procedure that EXPECTS a variable number of parameters. * On the other hand, on most C compilers, there will be SOME way of writing an OPTION VARIABLE-type procedure, although as I said it's likely to be rather different from compiler to compiler. * Finally -- something that I haven't mentioned yet but that is of much relevance -- parameters of different types may occupy a different amount of space on the call stack. If you pass a "long float" to a procedure that's expecting "int" parameters, the "long float" will end up occupying two parameters. This means that the procedure must know when its parameters are "long float"s (like "writef" can know by looking at the FMTSTRING parameter), and kludge accordingly. PASSING VARIABLE NUMBERS OF PARAMETERS TO PROCEDURES -- DRAFT ANSI STANDARD C Draft ANSI Standard C has a provision for passing variable numbers of parameters. Like many good things, it's at the same time useful and confusing. Let's have a look at it. 
Calling an OPTION VARIABLE-type procedure in Draft Standard C is quite similar to the way you'd do it in K&R C: writef ("I5,X,F7.2", inum, fnum); The one difference is that the compiler might (or might not) DEMAND that you establish a function prototype (see "DATA STRUCTURES -- TYPE CHECKING") to declare that this function takes a variable number of parameters. The prototype for WRITEF would probably be: extern int writef (char *, ...); The "char *" says that there is one REQUIRED parameter, a character array; the "..." -- literally, three "."s, one after the other -- means that there is a variable number of parameters after this. Defining a procedure that can take a variable number of parameters is trickier. Here's an example: writef (char *fmtstring, ...) { va_list arg_ptr; va_start (arg_ptr, fmtstring); ... while (!done) { ... if (current_fmtstring_descriptor_is_I) output_integer (va_arg (arg_ptr, int)); else if (current_fmtstring_descriptor_is_F) output_float (va_arg (arg_ptr, float)); else if (current_fmtstring_descriptor_is_S) output_string (va_arg (arg_ptr, char *)); ... } ... va_end (arg_ptr); } Consider the components of this declaration one at a time: * The "..." in the header indicates that besides the one required parameter, this procedure takes an unknown number of optional parameters. * The "va_list arg_ptr" declares a variable called "arg_ptr", of type "va_list" (which is defined in a special #INCLUDE file that comes with the C compiler). * The "va_start (arg_ptr, fmtstring)" call initializes "arg_ptr" to point to the first variable parameter -- the one immediately after "fmtstring". "va_start" must be passed the last required parameter (in this case, "fmtstring"); among other things, this means that every procedure must take AT LEAST ONE fixed parameter -- you can't have all the parameters be optional. * The procedure then (presumably) goes through FMTSTRING and finds out what the types of the parameters are expected to be. 
As it determines that the current format descriptor is, say, "I", or "F", or "S", it "picks up" the next parameter. It does this by saying va_arg (arg_ptr, <type>) The "arg_ptr" is the same variable that was declared using "va_list arg_ptr"; the <type> indicates which type of object we want to get (in our case, this may be an "int", a "float", or a "char *", depending on which format descriptor we're on). * Finally, at the end, we call va_end (arg_ptr); to do whatever stack cleanup needs to be done. This method is guaranteed (heh, heh) to be portable across all implementations of the Draft ANSI Standard. (Again, note that since the Standard is only Draft, many existing implementations might have no such facility or a slightly different one.) Note its advantages and disadvantages: * You can now portably access the optional parameters, and even easily access them as elements of an array by making a WHILE loop that calls VA_ARG and sticks the results into a local array. * You can specify that some of the procedure parameters are required, thus letting the compiler check that every call to the procedure contains at least those parameters. * On the other hand, there's still no way of figuring out exactly how many parameters were passed to you -- you have to rely on the user's telling you this, either as an explicit parameter or implicitly (such as using a format string from which the number of parameters can be deduced). * You can't have a procedure where all of the parameters are optional. * You still can't have a procedure where a parameter in the MIDDLE of a parameter list can be omitted (e.g. "P (1,,3,,,6)"). * Accessing parameters that are simply optional is somewhat harder than in SPL, since you have to get them using VA_ARG rather than referring to them by name, to wit: create (char *progname, ...) 
{ va_list ap; char *entryname; int *pin, param, flags, stack, dl, maxdata, pri, rank; va_start (ap, progname); entryname = va_arg (ap, char *); pin = va_arg (ap, int *); param = va_arg (ap, int); flags = va_arg (ap, int); stack = va_arg (ap, int); dl = va_arg (ap, int); maxdata = va_arg (ap, int); pri = va_arg (ap, int); rank = va_arg (ap, int); ... } As you see, you have to specially extract each optional parameter, rather than just being able to access it directly like you can in SPL. PASSING VARIABLE NUMBERS OF PARAMETERS TO PROCEDURES -- PASCAL/XL PASCAL/XL's support for procedures with optional parameters seems to be really nice. One mechanism that PASCAL/XL provides is "OPTION DEFAULT_PARMS". PROCEDURE P (A, B, C, D, E, F: INTEGER) OPTION DEFAULT_PARMS (A:=11, C:=22, E:=55, F:=66); BEGIN ... END; This tells PASCAL/XL several things: * In any call to P, the first (A), third (C), fifth (E), or sixth (F) parameters may (or may not) be omitted. Thus, we can say: P (, 22, ,44); { omitting A, C, E, and F } P (11, 22, 33, 44); { omitting only E and F } P (, 22, 33, 44, , 66); { omitting A and E } or any such combination. Only the parameters without a DEFAULT_PARMS declaration -- B and D -- must be specified. * If any of A, C, E, and F are omitted, then when P tries to access them, it will get their default values instead. To the procedure, the parameter will look exactly as if it was specified as the default value. * HOWEVER, the procedure can (if it wants to) determine if a parameter was ACTUALLY passed by saying something like IF HAVEOPTVARPARM(C) THEN { C was actually passed } ELSE { we're using C's default value }; "HAVEOPTVARPARM(X)" simply returns TRUE if parameter X was actually passed to the procedure, and FALSE if parameter X was not passed and X's value was simply defaulted. Thus, we get the best of both worlds: * You can specify a default value, so if the procedure wants to, it can just see the parameter value as the default. 
* If you need to, you can still find out if the parameter was REALLY specified. * Since the compiler knows which parameters are optional and which are required, it can make sure that the required ones are really specified (unlike SPL, in which any parameters of an OPTION VARIABLE procedure may be omitted without an error). Now, interestingly enough, PASCAL/XL also has a different mechanism to achieve a similar goal. You can also say PROCEDURE P (A, B, C, D, E) OPTION EXTENSIBLE 3; What this means is that all parameters after the first 3 -- in this case, D and E -- are optional. You can actually combine this with DEFAULT_PARMS to set default values for these "extension" parameters, or even set default values for the "non-extension" parameters, thus making them optional, too. Practically speaking, saying PROCEDURE P (A, B, C, D, E: INTEGER) OPTION DEFAULT_PARMS (D:=NIL, E:=NIL); will achieve pretty much the same goal (making both D and E extensible). The advantage of EXTENSIBLE parameters is that their implementation allows you to add new parameters to an OPTION EXTENSIBLE procedure WITHOUT HAVING TO RE-COMPILE ANY OF ITS CALLERS! Thus, if HP had written the CREATE intrinsic in PASCAL/XL, it could have said: PROCEDURE CREATE (VAR PROGRAM: STRING; VAR ENTRY: STRING; VAR PIN: INTEGER; PARM, FLAGS, STACK, DL, MAXDATA, PRI, RANK: INTEGER) OPTION EXTENSIBLE 3 DEFAULT_PARMS (ENTRY:=""); This would have made PROGRAM and PIN required and all the other parameters optional -- ENTRY because of the DEFAULT_PARMS and the rest because of the OPTION EXTENSIBLE. Then, if HP wanted to add STDIN, STDLIST, and INFO parameters, it could have just changed the definition of CREATE to: PROCEDURE CREATE (VAR PROGRAM: STRING; VAR ENTRY: STRING; VAR PIN: INTEGER; PARM, FLAGS, STACK, DL, MAXDATA, PRI, RANK: INTEGER; VAR STDIN, STDLIST, INFO: STRING) OPTION EXTENSIBLE 3 DEFAULT_PARMS (ENTRY:=""); Then, ALL OF THE PROGRAMS THAT WERE COMPILED REFERRING TO THE OLD CREATE WOULD STILL WORK! 
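C has neither OPTION EXTENSIBLE nor DEFAULT_PARMS, but the item-number technique described earlier for FFILEINFO and CREATEPROCESS buys a C procedure much the same defaulting and extensibility. Here is a minimal sketch -- every name in it (parse_opts, opt_pair, the OPT_ item codes, the default values) is hypothetical, not any real intrinsic:

```c
#include <assert.h>

/* Hypothetical item codes. New codes can be added later without
   breaking previously compiled callers, because the pair list is
   self-describing and terminated by OPT_END. */
enum { OPT_END = 0, OPT_RECSIZE = 1, OPT_BLOCKF = 2 };

struct opt_pair  { int item; int value; };
struct file_opts { int recsize; int blockf; };

/* Apply the caller's item/value pairs over built-in defaults,
   playing the role of DEFAULT_PARMS. */
struct file_opts parse_opts(const struct opt_pair *p)
{
    struct file_opts o;
    o.recsize = 128;   /* default, as if DEFAULT_PARMS (RECSZ:=128) */
    o.blockf  = 1;     /* default blocking factor */
    for (; p->item != OPT_END; p++) {
        if (p->item == OPT_RECSIZE)
            o.recsize = p->value;
        else if (p->item == OPT_BLOCKF)
            o.blockf = p->value;
        /* item codes this version doesn't know are simply skipped */
    }
    return o;
}
```

A caller passes only the items it cares about, in any order, and the procedure can tell exactly which ones were "really specified" -- the two things the PASCAL/XL mechanisms give you, at the cost of a clumsier call.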
You can add new parameters to an OPTION EXTENSIBLE procedure without causing any incompatibility with previously compiled callers.

SUMMARY

Thus, to summarize the differences in the ways the various compilers handle optional parameters and procedures with variable numbers of parameters: ["STD PAS" includes Standard PASCAL, ISO Level 1 Standard, and PASCAL/3000]

                                      SPL   STD   PAS   K&R   STD
                                            PAS   /XL   C     C

   CAN YOU HAVE A PROCEDURE           YES   NO    YES   YES   YES
   WITH OPTIONAL PARAMETERS?

   IS SUCH A PROCEDURE                N/A         N/A   NO    YES
   DEFINITION PORTABLE?

   CAN YOU OMIT PARAMETERS IN         YES         YES   NO    NO
   THE MIDDLE OF A CALL?

   CAN THE FIRST PARAMETER OF A       YES         YES   YES   NO
   FUNCTION BE OPTIONAL?

   CAN YOU DETECT IF A PARAMETER      YES         YES   NO    NO
   WAS REALLY PASSED OR NOT?

   CAN YOU SPECIFY SOME               NO          YES   NO    YES
   PARAMETERS AS REQUIRED?

   CAN YOU SPECIFY DEFAULT VALUES     NO          YES   NO    NO
   FOR OPTIONAL PARAMETERS?

   CAN YOU ADD NEW PARAMETERS         NO          YES   NO    NO
   WITHOUT RE-COMPILING ALL CALLERS?

   CAN YOU ACCESS PARAMETERS "BY      YES         NO    YES   YES
   NUMBER" AS WELL AS "BY NAME"?

[The STD PAS column is blank after the first row because, since Standard PASCAL has no optional parameters at all, the remaining questions don't apply to it. The last question refers to the example we showed where we wanted to reference, say, the last 10 parameters as elements of an array rather than as P1, P2, P3, ..., P10]

PROCEDURE AND FUNCTION VARIABLES

Say that you write a B-Tree handling package. B-Trees, as you know, are the kind of data structure that KSAM is built on; they allow you to easily find records either by key or in sequential order. Thus, if you store your data in a B-Tree, you can, for instance, find a record whose key starts with "JON", even though you don't know the exact key value. Now, you're a sophisticated programmer, and you know how to deal with this sort of thing. You define a record structure type called, say, BTREE_HEADER_TYPE, that contains the various pointers that your B-Tree handling procedures need, and then write the following procedures:

PROCEDURE BTREE_CREATE (VAR B: BTREE_HEADER_TYPE);
...
PROCEDURE BTREE_ADD (VAR B: BTREE_HEADER_TYPE;
                     K: KEY_TYPE; REC: RECORD_TYPE);
...
FUNCTION BTREE_FIND (VAR B: BTREE_HEADER_TYPE;
                     K: KEY_TYPE): RECORD_TYPE;
...

You get the drift -- you have all these routines, to which you pass the appropriate data, and between them, they process the data structure. No problem. Now, we said that the B-tree allows you to retrieve records in "sorted order". Sorted how? If the key is a string, you'd want it sorted alphabetically; however, what if the key is an integer? Or a floating point number? Comparing two strings is a different operation from comparing integers or floating point numbers. Now, you can write a different set of routines for B-Trees with string keys, B-Trees with integer keys, B-Trees with packed decimal keys, etc. You can, but of course you wouldn't want to duplicate the code. Assume for a moment that you can get around PASCAL's type checking so that you can pass an arbitrary-type key to the BTREE_ADD and BTREE_FIND routines; how do you make sure that the procedures do the appropriate comparisons for the various types? Well, one possibility is this:

* Have a field in the BTREE_HEADER_TYPE data structure called "COMPARISON_TYPE".

* Have BTREE_CREATE take a parameter indicating the comparison type (string, integer, float, packed, etc.) needed; it can then put this type into the COMPARISON_TYPE field.

* Each BTREE_ADD and BTREE_FIND call will interrogate this COMPARISON_TYPE field, and do the appropriate comparison; for instance,

PROCEDURE BTREE_ADD (VAR B: BTREE_HEADER_TYPE;
                     K: KEY_TYPE; REC: RECORD_TYPE);
BEGIN
...
IF B.COMPARISON_TYPE=STRING_COMPARE THEN
  COMP_RESULT:=STRCOMPARE (K, CURRENT_KEY)
ELSE IF B.COMPARISON_TYPE=INT_COMPARE THEN
  COMP_RESULT:=INTCOMPARE (K, CURRENT_KEY)
ELSE IF B.COMPARISON_TYPE=FLOAT_COMPARE THEN
  COMP_RESULT:=FLOATCOMPARE (K, CURRENT_KEY)
ELSE IF B.COMPARISON_TYPE=PACKED_COMPARE THEN
  COMP_RESULT:=PACKEDCOMPARE (K, CURRENT_KEY);
...
END;

Depending on the COMPARISON_TYPE field value, BTREE_ADD can do the right thing. The trouble with this approach, though, is quite obvious.
What if we (like KSAM) support more than just these four types? What if we need to add zoned decimal support -- will we have to change BTREE_ADD (and BTREE_FIND and whatever other procedures do this)? What if we need to add an EBCDIC collating sequence? We want to allow the B-Tree package's USER to define his own comparison types without having to change the source code of the package itself. In other words, we don't just want to let the user pass us a value, like a record structure or an integer. We want to let a user pass a PIECE OF CODE, in this case the code that would do the comparison. Then, instead of having a big IF (or CASE) statement, our BTREE_ADD and BTREE_FIND procedures can simply call the code that was passed to them to do the comparison. PASCAL, of course, has a facility for doing this (as do SPL and C). PASCAL lets you declare a parameter to be of type PROCEDURE (or FUNCTION), and then call that parameter. A good example of this might be the following procedure:

FUNCTION NUMERICAL_INTEGRATION (FUNCTION F(PARM:REAL): REAL;
                                START, FINISH, INCREMENT: REAL): REAL;
VAR X, TOTAL: REAL;
BEGIN
X:=START;
TOTAL:=0.0;
WHILE X<FINISH DO
  BEGIN
  TOTAL:=TOTAL+F(X)*INCREMENT;  { add the area of one slice }
  X:=X+INCREMENT;
  END;
NUMERICAL_INTEGRATION:=TOTAL;
END;

(And you thought you'd never have to see this sort of thing again once you finished college!) This procedure takes a function as a parameter (a function that itself takes one parameter), and then calls that function several times. The NUMERICAL_INTEGRATION procedure itself might be called thus:

X:=NUMERICAL_INTEGRATION (SQRT, 0.0, 10.0, 0.01);

This, as you see, passes it the procedure "SQRT" as a parameter. The same sort of thing, incidentally, can easily be done in SPL:

REAL PROCEDURE NUMERICAL'INTEGRATION (F, START, FINISH, INC);
VALUE START, FINISH, INC;
REAL PROCEDURE F;
REAL START, FINISH, INC;
...

or in C:

float numerical_integration (f, start, finish, inc)
float (*f)();
float start, finish, inc;
...
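Filled out so that it compiles, the C version might look like the sketch below (double rather than float, so the arithmetic is easy to check; the left-endpoint rectangle rule, every bit as naive as the PASCAL original):

```c
/* f is a pointer to a function taking and returning double;
   numerical_integration calls it once per slice. */
double numerical_integration(double (*f)(double),
                             double start, double finish, double increment)
{
    double x, total = 0.0;
    for (x = start; x < finish; x += increment)
        total += f(x) * increment;    /* area of one rectangle */
    return total;
}

/* a sample integrand to pass in, the way SQRT was passed above */
double identity(double x) { return x; }
```

numerical_integration(identity, 0.0, 10.0, 0.01) comes out at roughly 50, the integral of x from 0 to 10.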
And, of course, this is the very sort of thing that we'd do to implement our BTREE_ADD and BTREE_FIND: PROCEDURE BTREE_ADD (VAR B: BTREE_HEADER_TYPE; K: KEY_TYPE; REC: RECORD_TYPE; FUNCTION COMP_ROUTINE (K1, K2: KEY_TYPE): BOOLEAN); ... FUNCTION BTREE_FIND (VAR B: BTREE_HEADER_TYPE; K: KEY_TYPE; FUNCTION COMP_ROUTINE (K1, K2: KEY_TYPE): BOOLEAN): RECORD_TYPE; ... These declarations, as you see, indicate that both of these procedures expect a parameter that is itself a function (which takes two keys and returns a boolean). A typical call might thus be: BTREE_ADD (BHEADER, K1, R1, STRCOMPARE); or R:=BTREE_FIND (BHEADER, K, MY_OWN_EBCDIC_COMPARE_ROUTINE); or whatever else the user wants to do. Now, this paper purports to be a comparison between PASCAL, C, and SPL, but so far we've only discussed a feature that exists -- and is virtually identical -- in all three languages. What's the difference? Well, note that we demanded that the user pass the comparison routine in every call to one of the BTREE_ADD or BTREE_FIND procedures. In turn, if a procedure called by BTREE_ADD or BTREE_FIND needs to call the comparison routine, then BTREE_ADD or BTREE_FIND must pass that procedure the comparison routine, too. This is cumbersome and also error-prone -- what if the user passes one procedure for BTREE_ADD and another for BTREE_FIND? The logical solution seems to be to pass the procedure once to BTREE_CREATE, i.e. BTREE_CREATE (BHEADER, STRCOMPARE); and then have the address of the procedure stored somewhere in the BHEADER record structure. Then, when BTREE_ADD needed to do a comparison, it would say something like: COMP_RESULT:=BHEADER.COMPARE_ROUTINE (K, CURRENT_KEY); This makes more sense. After all, even KSAM only requires you to specify the key comparison sequence at file creation time, not on every intrinsic call. Examples of this sort of thing are plentiful: * Trap routines (just like in MPE). 
MPE's XCONTRAP intrinsic, for instance, expects you to pass it a procedure (actually, a procedure's address); it saves this address in a special location, and then when control-Y is hit, calls this procedure. Similarly, let's say you're writing a package for packed decimal arithmetic. You want to have a procedure called PROCEDURE SET_PACKED_TRAP (PROCEDURE TRAP_ROUTINE); which will set some global variable to the value passed as TRAP_ROUTINE. Then, whenever your procedure detects some kind of packed decimal arithmetic error, it'll call whatever routine was set up as the trap routine. That way, the user will be able to do what he pleases; he can set the trap routine to abort the program, to print an error message, to return a default result -- whatever. * Say that you have various procedures that do certain things -- build temporary files, set up locks, buffer I/O, etc. -- that require special processing when the program is terminated. A classic example of this is buffering your output to a file in order to do fewer physical I/Os. If the program somehow dies, you want to be able to flush all the buffered data to the file rather than having it get lost. What you'd like to do is have a procedure called ONEXIT, which would take a single procedure parameter. Then, if your process dies, the system would know enough to call this procedure, letting it do whatever cleanup you want. For instance, you might say ONEXIT (FLUSH_BUFFERS); to tell the system to call FLUSH_BUFFERS if the program terminates for any reason; you might also want to say ONEXIT (RELEASE_SIRS); so that the system will release any SIRs (System Internal Resources) that you may have acquired. 
In fact, you want ONEXIT to keep what is essentially an array of procedures:

VAR ONEXIT_PROCS: ARRAY [1..100] OF PROCEDURE;

Then, the system terminate routine will say:

FOR I:=1 TO NUM_ONEXIT_PROCS DO
  ONEXIT_PROCS[I]();  { call the Ith procedure }

* The file system, for instance, has to keep track of a number of different file types -- standard files, message files, KSAM files, remote files, and so on. Although they all look like files to the user, the various routines that read them, write them, close them, etc. are quite different. A possible design for, say, the file control block might be:

RECORD
  FILENAME: PACKED ARRAY [1..36] OF CHAR;
  FILE_READ_ROUTINE: PROCEDURE (...);
  FILE_WRITE_ROUTINE: PROCEDURE (...);
  FILE_CLOSE_ROUTINE: PROCEDURE (...);
  ...
END;

Then, the FREAD intrinsic might simply say

FCB.FILE_READ_ROUTINE (FNUM, BUFFER, LENGTH, ...);

thus calling the file read routine pointed to by the file control block (this might be one of FREAD_STANDARD, FREAD_MSG, FREAD_KSAM, etc.) -- this field would have been set by the FOPEN call. These are just examples to convince you that it makes sense not just to be able to pass procedures and functions as parameters, but also to have variables that "contain" procedures and functions, or rather pointers to them. This is where the three languages differ. Standard PASCAL and PASCAL/3000 have no such feature. There simply is no way of either

* declaring a variable to be of type "pointer to a function or procedure";

* setting a variable to point to a particular function/procedure;

* or calling a procedure/function pointed to by a variable.

None of the three above examples can be implemented in Standard or HP PASCAL. C, on the other hand, does support this feature. In C we might say:

typedef struct {
  ...
  int (*comp_proc)();
  ...
} btree_header_type;

thus declaring comp_proc to be a field pointing to a procedure that returns an integer.
Then, BTREE_OPEN might read like:

btree_open (b, proc)
btree_header_type *b;
int (*proc)();
{
...
b->comp_proc = proc;
...
}

and BTREE_ADD might say

btree_add (b, k, rec)
btree_header_type *b;
int k[], rec[];
{
...
comp_result = (*b->comp_proc) (k, current_key);
...
}

(Note that B is passed as a pointer to the header record; if the record itself were passed by value, BTREE_OPEN would store COMP_PROC into its own private copy and the assignment would be lost.) As you see, we simply use "(*b->comp_proc)" -- "the thing pointed to by the COMP_PROC subfield of the record B points to" -- in place of a procedure name; C will then call this procedure, passing it the parameters K and CURRENT_KEY (of course, doing no parameter checking). Similarly, our ONEXIT routine (which, by the way, I think is a singularly useful sort of procedure, and one that Draft Standard C has defined in its Standard Library) might look like this:

int (*exit_routines[100])();
int num_exit_routines = 0;

onexit (proc)
int (*proc)();
{
exit_routines[num_exit_routines] = proc;
num_exit_routines = num_exit_routines + 1;
}

terminate ()
{
int i;
...
for (i = 0; i < num_exit_routines; i++)
   (*exit_routines[i]) ();   /* call the ith exit routine */
...
}

Clean and simple. SPL's solution to this problem is somewhat dirtier. SPL can do it, because SPL -- with its TOSes and ASSEMBLEs -- can do anything; but, it can't do it very cleanly. In SPL, what you'd do is save the procedure's address (actually, its plabel, but for our purposes that's the same thing) in an integer variable. Then, you'd use an ASSEMBLE statement to call the procedure.

INTEGER ARRAY EXIT'ROUTINES(0:99);
INTEGER NUM'EXIT'ROUTINES:=0;

PROCEDURE ONEXIT (PROC);
PROCEDURE PROC;
BEGIN
EXIT'ROUTINES(NUM'EXIT'ROUTINES):=@PROC;
NUM'EXIT'ROUTINES:=NUM'EXIT'ROUTINES+1;
END;

PROCEDURE TERMINATE;
BEGIN
INTEGER I;
...
FOR I:=0 UNTIL NUM'EXIT'ROUTINES-1 DO
  BEGIN
  TOS:=EXIT'ROUTINES(I);
  ASSEMBLE (PCAL 0);   << call the routine whose addr is on TOS >>
  END;
...
END;

If you had to pass and/or receive parameters from this procedure, the code would be even uglier.
To pass, for instance, the integer arrays K and CURRENT'KEY and to receive a result to be put into COMP'RESULT, you'd have to say:

TOS:=0;                      << room for the result >>
TOS:=@K;
TOS:=@CURRENT'KEY;
TOS:=COMP'ROUTINE'PLABEL;    << the plabel of the routine to call >>
ASSEMBLE (PCAL 0);
COMP'RESULT:=TOS;            << get the return value >>

Ugly, but possible -- more than can be said for Standard PASCAL or PASCAL/3000. PASCAL/XL, on the other hand, has a solution rather comparable to C's -- better, if you generally prefer PASCAL to C. In PASCAL/XL, you could define a "procedural" or "functional" data type, to wit:

TYPE EXIT_PROC = PROCEDURE;  { no parms, no result }
     COMPARE_PROC = FUNCTION (K1, K2: KEY_TYPE): BOOLEAN;

The declaration is much like what you'd put into a procedure header if you want a parameter to be a procedure or function; however, this kind of type allows your variables to be procedure/function pointers, too. Thus, you'd write ONEXIT as:

VAR EXIT_ROUTINES: ARRAY [1..100] OF EXIT_PROC;
    NUM_EXIT_ROUTINES: 0..100;

PROCEDURE ONEXIT (P: EXIT_PROC);
BEGIN
NUM_EXIT_ROUTINES:=NUM_EXIT_ROUTINES+1;
EXIT_ROUTINES[NUM_EXIT_ROUTINES]:=P;
END;

PROCEDURE TERMINATE;
VAR I: 0..100;
BEGIN
...
FOR I:=1 TO NUM_EXIT_ROUTINES DO
  CALL (EXIT_ROUTINES[I]);   { CALL is a special construct }
...
END;

Similarly, our BTREE_ADD procedure would be:

PROCEDURE BTREE_ADD (VAR B: BTREE_HEADER_TYPE;
                     K: KEY_TYPE; REC: RECORD_TYPE);
...
BEGIN
...
COMP_RESULT:=FCALL (B.COMP_ROUTINE, K, CURRENT_KEY);
...
END;

As you see, "CALL (proc, parm1, parm2, ...)" calls the procedure pointed to by the variable "proc", passing to it the given parameters. Similarly, "FCALL (proc, parm1, parm2, ...)" calls a function. To summarize, then:

* Procedure and function variables -- though apparently rather obscure -- can actually be very useful.

* Standard PASCAL and PASCAL/3000 support procedures and functions as parameters, but not as variables; this is rather inadequate.
* SPL supports procedure/function variables, but in a rather "dirty" fashion, which is clumsy and uses many TOSes and ASSEMBLEs. * C and PASCAL/XL have very clean support for this nifty feature. C #DEFINEs One thing in which C may be quite a bit superior to PASCAL is the #define construct. It can have serious performance advantages, and also avoid duplication of code in cases where ordinary procedures just don't do the job. The #define is a simple macro facility. References to it get expanded into C code that is compiled in place of its invocation. In other words, saying #define square(x) ((x)*(x)) ... printf ("%d %d\n", a, square(a)); is identical to printf ("%d %d\n", a, ((a)*(a))); Other useful defines may include #define min(x,y) (((x)<(y)) ? (x) : (y)) #define max(x,y) (((x)>(y)) ? (x) : (y)) #define push(val,stackptr) *(stackptr++) = (val) and so on. Note that this is in the same spirit as SPL DEFINEs -- in fact, if you don't have any parameters, C #define's and SPL DEFINEs are one and the same -- but allows parameterization, which increases the power immeasurably. One question that instantly comes to mind is: how are #define's better than FUNCTIONs? * First of all, on any computer, there is an overhead in PROCEDURE and FUNCTION calls. For instance, an HP PASCAL program running on a Mighty Mouse took about 140 milliseconds to execute 10,000 calls to a parameter-less PROCEDURE, and about 250 milliseconds to do 10,000 calls to a PROCEDURE with 3 parameters. Of course, this isn't a very large amount of time, and certainly isn't bad enough to convince me to stop writing procedures and repeat portions of code several times in my program. Still, it is enough to give one pause in cases where performance is critical; procedure calls are frequent enough that the total overhead piles up. #DEFINEs allow us to avoid code repetition without any of the overhead of procedure calls. For small, very frequently used procedures, they can be a very good solution. 
* A #DEFINE can replace anything, including declarations, control structures, etc. For instance, say that you think the C "for" loop is too complicated, and you'd like to be able to do a PASCAL-like "FOR". You could say

#define loop(var,start,limit) \
   for ((var) = (start); (var) <= (limit); (var)=(var)+1)

Then,

loop (x, 1, 100)
   printf ("%d %d %d\n", x, x*x, x*x*x);

would mean the same thing as

for (x = 1; x <= 100; x=x+1)
   printf ("%d %d %d\n", x, x*x, x*x*x);

This use of #DEFINE, however, is more than just a sop to people who are dissatisfied with C terminology and want to make it look like PASCAL. The fact that #DEFINEs directly expand into source code rather than just calling functions can be used to:

- Have operations that work with arbitrary datatypes. For instance, our "min" define will work equally well for "int"s, for "float"s, for "long"s, etc., since it expands into a "<" comparison, which works for all those types.

- Define objects that can be stored into as well as fetched from. For instance, if for some reason you keep all your arrays in 1-dimensional format, you can say

#define element(array,rowsize,row,col) \
   (array)[(rowsize)*(row)+(col)]
...
x = element (a, numcols, rnum, cnum);
...
element (a, numcols, rnum, cnum) = x;

Since you can't assign anything to a function call, you couldn't do this if ELEMENT were a function; but, since it's a macro, this ends up being a simple assignment to an array element.

- Have #define's that define procedures. Consider the following mysterious creature:

#define defvectorop(funcname,type,op) \
   funcname(vect1,vect2,rvect,len) \
   type vect1[], vect2[], rvect[]; \
   int len; \
   { \
   int i; \
   for (i = 0; i<len; i++) \
      rvect[i] = vect1[i] op vect2[i]; \
   }

This #define allows you to easily define procedures of a certain format -- to wit, those that operate element-wise on two arrays (of a given type) to generate a third array.
For instance,

defvectorop(intmult,int,*)
defvectorop(intadd,int,+)
defvectorop(floatadd,float,+)

will define three functions that, respectively, multiply vectors of integers, add vectors of integers, and add vectors of floats. A call to

intmult(x1,x2,y,10);

will set elements y[0] through y[9] to x1[0]*x2[0] through x1[9]*x2[9].

- Finally, using some even weirder constructs, you can deal with "families" of variables which are identified by their similar names. For instance, say your convention is that if your "queue" data structure is stored in an array called "x", then the header pointer is stored in a variable called "x_head" and the tail pointer is stored in a variable called "x_tail". A typical macro might look like

#define queueprint(queuename) \
   printf ("Head = %d, Tail = %d\n", \
           queuename ## _head, queuename ## _tail); \
   print_array_data (queuename);

Note that with the special "##" macro operator, we can have

queueprint (myqueue)

expand into

printf ("Head = %d, Tail = %d\n", myqueue_head, myqueue_tail);
print_array_data (myqueue);

thus deriving the head and tail variable names from the queue variable name -- clearly something we can't do with a procedure. To summarize, the primary advantages of #define's are performance and the additional flexibility that comes with direct text substitution.

PASCAL/XL INLINE PROCEDURES

SPL, of course, has DEFINEs, but they don't support parameters and thus are severely limited. Standard PASCAL and PASCAL/3000 have nothing like DEFINEs or #define's, either for performance's or functionality's sake. PASCAL/XL, however, has a rather interesting feature called "INLINE". In PASCAL/XL, you can write a procedure like this:

FUNCTION MIN (X, Y: INTEGER): INTEGER
   OPTION INLINE;
BEGIN
IF X<Y THEN MIN:=X ELSE MIN:=Y;
END;

What does the "OPTION INLINE" keyword do? It commands the compiler: whenever a MIN is seen, physically INCLUDE the code of the procedure instead of simply compiling a procedure call instruction.
This is, of course, done for performance's sake -- to save the time it would take to do that procedure call. This is thus somewhat comparable to C's #define's. It isn't as flexible -- MIN is still a procedure, with fixed parameter types, and so on -- but can be as fast (or almost as fast). Actually, I wouldn't be surprised if INLINE procedure calls were still somewhat slower than #define's (although they don't have to be, if the compiler is really smart). On the other hand, INLINE procedures have some advantages:

* Since the compiler treats them like true procedures, it'll make sure that MIN(A,F(B)) won't evaluate F(B) twice (like our C #define would).

* Since these are real procedures, we're no longer restricted by the rules about what can and can't go into an expression. For instance, the procedure

FUNCTION FIND_NON_SPACE (X: STRING): INTEGER
   OPTION INLINE;
VAR I: INTEGER;
BEGIN
I:=1;
WHILE (I<=STRLEN(X)) AND (X[I]=' ') DO
   I:=I+1;
FIND_NON_SPACE:=I;
END;

can't be written as a C #define, since C does not allow "while" loops inside expressions. Similarly, we can declare local variables and so forth. In short, while PASCAL/XL INLINE procedures are in some respects not quite as flexible as C #define's, they might be as efficient or almost as efficient, and even more flexible in their own way. Whether or not they really work depends on how good a job PASCAL/XL does of optimization. If, for instance, it expands A:=MIN(B,C) into

TEMPPARM1:=B;
TEMPPARM2:=C;
IF TEMPPARM1<TEMPPARM2 THEN RESULT:=TEMPPARM1
                       ELSE RESULT:=TEMPPARM2;
A:=RESULT;

then this won't be a big savings. If, on the other hand, it's smart enough to generate

IF B<C THEN A:=B ELSE A:=C;

then, of course, it'll be every bit as efficient as the C #define. Remember, though:

* INLINE procedures can only be used in the same kind of context in which an ordinary procedure is used; i.e., you can't define a new type of looping construct, declaration, etc.
* INLINE procedures are available only in PASCAL/XL, not in Standard PASCAL or even PASCAL/3000.

* You rely (like you always do) on the compiler's intelligence in generating efficient code. When you have MIN defined as a #define, you KNOW that the compiler will generate EXACTLY the same code for

min(x,y)

and

(x<y) ? x : y

This will probably be one test, two branches, and some stack pushes. On the other hand, a call to an INLINE MIN procedure might do exactly the same thing -- or, it might also build a stack frame, allocate local variables, etc., taking almost as much time as an ordinary non-INLINE call.

POINTERS: WHAT AND WHY

One major feature that C emphasizes more than PASCAL is support for POINTERS. These creatures -- available in SPL as well as C -- are often very powerful mechanisms, but they have also been accused of making programs very difficult to read. I can't really objectively comment on the readability aspect, but some discussion of pointers, their advantages, and their disadvantages is in order.

APPLICATION #1: DYNAMICALLY ALLOCATED STORAGE

If you declare some variable in SPL, C, or PASCAL, what you're really declaring is a chunk of storage. If you declare a global variable, the storage is allocated when you run the program and deallocated when the program is done; if you declare a local variable, the storage is allocated when you enter a procedure and deallocated when the procedure is exited. What if you want to declare storage that is allocated and deallocated in some other way? For instance, MPEX (or, for that matter, MPE) needs to read all your UDC files and keep a table indicating which UDC commands are defined in which file and at which record number. This table can be any size from 0 bytes (no UDCs) to thousands of bytes. How do we allocate it?
The trouble is that we don't know how large the UDC dictionary will be, so we can't really declare it either as a local or global variable (in SPL, local arrays can be of variable size, but the UDC dictionary has to be global anyway). What we need to be able to do is DYNAMICALLY ALLOCATE IT in the READ_UDC_FILES procedure -- somehow tell the computer "I need X (a variable) bytes of storage now, and I want to view it as an object of type so-and-so (say, an array of records)". Now, there are two issues involved here: * WE NEED A MECHANISM FOR DYNAMICALLY ALLOCATING STORAGE. * WE NEED A WAY OF REFERRING TO THIS STORAGE ONCE IT'S ALLOCATED. The need for a dynamic allocation procedure (e.g. PASCAL's NEW, SPL's DLSIZE, or C's CALLOC) is obvious; but, the need for a reference mechanism is equally important! After all, we can't very well declare our UDC dictionary as VAR UDC_DICT: ARRAY [1..x] OF RECORD ...; Our very point is that we don't know the size of the array, and we DON'T WANT THE COMPILER TO ALLOCATE IT FOR US, which is what the compiler has to do if it sees an array declaration. What we have to do is to declare UDC_DICT as a POINTER. A pointer is an object that can be accessed in one of two modes: * In one mode, it looks EXACTLY like an object of a given type (say, an array, a record, a string, etc.). It can be assigned to, it can have its fields extracted, etc. If UDC_DICT is a pointer to an ARRAY of RECORDs, we could say UDC_DICT^[UDCNUM].NAME:=UDC_NAME_STRING; and assign something to the NAME subfield of the UDCNUMth element of this ARRAY of RECORDs. * In another mode, it is essentially an ADDRESS, which can be changed to make the pointer point to (theoretically) an arbitrary location in memory. 
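In C, the two modes can be watched side by side. A minimal sketch, assuming a made-up udc_rec record type standing in for a UDC dictionary entry:

```c
#include <stdlib.h>

/* Hypothetical stand-in for one UDC dictionary entry; the field
   names are invented to match the discussion. */
typedef struct {
    int file_num;    /* which UDC file the command was found in */
    int rec_num;     /* which record of that file it starts at  */
} udc_rec;

/* Mode 2: the pointer as an ADDRESS.  We set it at run time to a
   freshly allocated block of n records -- a size the compiler never
   had to know at compile time. */
udc_rec *alloc_udc_dict(int n)
{
    return (udc_rec *) calloc((size_t) n, sizeof(udc_rec));
}
```

Mode 1 is then ordinary data access: after d = alloc_udc_dict(200); the expression d[udcnum].rec_num behaves exactly like an element of a declared array of records -- PASCAL's UDC_DICT^[UDCNUM] with the ^ made implicit by C's [] on a pointer.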
When we say NEW (UDC_DICT); we don't really pass an ARRAY of RECORDs to NEW (remember, no such array has been allocated yet); rather, we pass a variable that will be set by NEW to a MEMORY ADDRESS, the address of a newly-allocated array of records that can later be accessed using "UDC_DICT^". This two-fold nature is the key aspect of pointers -- they can be viewed as ordinary pieces of data, OR they can be viewed as the addresses of data, and thus changed to "point" to arbitrary locations in memory. The reason why I gave this definition in the context of a discussion on "dynamic memory allocation" is that with dynamic memory allocation, pointers are NECESSARY. If all your data is kept in global and local variables, you might never need to use pointers, since all the data can be accessed by directly referring to the variable name. On the other hand, if you use things like NEW or CALLOC, you must refer to the dynamically allocated data using pointers. Let's take another example: We're building a Multi-Processing Executive system. This'll be an eXtra Large variety, by the way, so we might call it MPE/XL for short. This system will have lots of PROCESSES; each process has to have a lot of data kept about it. It makes a lot of sense for us to declare a special type of record: TYPE PROCESS_INFO_REC = RECORD PROGRAM_NAME: PACKED ARRAY [1..80] OF CHAR; CURRENT_PRIORITY: INTEGER; TOTAL_MEMORY_USED: INTEGER; FATHER_PROCESS: ???; SON_PROCESSES: ARRAY [1..100] OF ???; END; { Now, declare a pointer to this type } PROCESS_PTR = ^PROCESS_INFO_REC; [Let's not talk about whether or not this is a good design.] Now, we can have a procedure to create a new process: FUNCTION CREATE_PROCESS (PROGNAME: PROGNAME_TYPE): PROCESS_PTR; VAR NEW_PROC: PROCESS_PTR; BEGIN NEW (NEW_PROC); NEW_PROC^.PROGRAM_NAME:=PROGNAME; NEW_PROC^.TOTAL_MEMORY_USED:=1234; ... 
CREATE_PROCESS:=NEW_PROC; END; Note what this procedure returns -- it returns a POINTER to the newly allocated PROCESS INFORMATION RECORD (remember, NEW_PROC is the pointer, NEW_PROC^ is the record). Why does it return a pointer instead of the record itself? Remember that each process information record has to indicate who the process's father pointer is and who the process's sons are. To do this, we have to have some kind of "unique process identifier" -- well, what better identifier than the POINTER TO THE PROCESS INFORMATION RECORD? Thus, our record really looks like this: TYPE PROCESS_INFO_REC = RECORD PROGRAM_NAME: PACKED ARRAY [1..80] OF CHAR; CURRENT_PRIORITY: INTEGER; TOTAL_MEMORY_USED: INTEGER; FATHER_PROCESS: ^PROCESS_INFO_REC; SON_PROCESSES: ARRAY [1..100] OF ^PROCESS_INFO_REC; END; When we create a new process, we can just say: NEW_PROC_INFO_PTR^.FATHER_PROCESS:=CURR_PROC_INFO_PTR; All of our dynamically allocated process information records are thus DIRECTLY LINKED to each other using pointers. To find out a process's grandfather, for instance, we can just say FUNCTION GRANDFATHER (PROC_INFO_PTR: PROCESS_PTR): PROCESS_PTR; BEGIN GRANDFATHER:=PROC_INFO_PTR^.FATHER_PROCESS^.FATHER_PROCESS; END; Now of course, pointers aren't the only way to "point" to data. If, for instance, all our Process Information Records were not allocated dynamically, but rather taken out of some global array: VAR PROCESS_INFO_REC_POOL: ARRAY [1..256] OF PROCESS_INFO_REC; then we could just use indices into this pool as unique process identifiers (in fact, that's what PINs in MPE/V are -- indices into the PCB, an array of records that's kept in a system data segment). But for true dynamically allocated data (allocated using PASCAL's NEW, C's CALLOC, or SPL's DLSIZE), pointers are the way to go. POINTERS BEYOND DYNAMIC ALLOCATION Another reason why I first introduced pointers in the context of dynamic allocation and NEW is that in PASCAL, that's all you can really use pointers for. 
In other words, NEW makes a pointer point to a newly-allocated chunk of data; but there's NO WAY TO MAKE A POINTER POINT TO AN EXISTING GLOBAL OR LOCAL VARIABLE (OR ARRAY ELEMENT). PASCAL's theory was that any global or local variables can and should be accessed without pointers (since presumably we know where they are at compile-time). Following with the UDC dictionary example I talked about earlier, let me explain a bit about the workings of the UDC parser and executor that I have in MPEX and SECURITY. Both MPEX/3000 and SECURITY/3000's VEMENU need to be able to execute UDCs. Unfortunately, MPE's COMMAND intrinsic can't execute UDCs on my behalf (it can't even execute PREP or RUN!). Thus, I had to do my own UDC handling. * The first step in handling UDCs is finding out what UDC files the user has set up. To do this I look in the directory and COMMAND.PUB.SYS, which indicate all of the user's UDC files, and then I FOPEN each one of these files. * Next, I have to find out what UDCs these files contain and where they contain them. I read the files and generate a record for each UDC I find; this record contains the UDC's name, the file number of the file where I found it, and the record number at which I found it. (This is where the dynamic allocation I discussed earlier fits in -- I'd like to be able to allocate the UDC dictionary dynamically rather than just keep it around as a fixed-size global variable.) * Finally, when the time actually comes to execute a UDC, - I parse the UDC invocation (e.g. 
"COBOL85 AP010S,AP010U"); - After finding the UDC name (COBOL85), I look it up in my UDC dictionary to find out where and in which UDC file it's defined; - I read the header of the UDC from the UDC file -- it looks something like "COBOL85 SRC,OBJ=$NEWPASS,LIST=$STDLIST"; - I determine the values of all the UDC parameters from the header and the invocation -- SRC is AP010S, OBJ is AP010U, and LIST is $STDLIST (the default); - I then read the UDC, substituting all the UDC parameters in each line and then executing it. Not a trivial process, but a necessary one. The reason I bring it up is that one aspect of the processing -- determining the UDC parameter values and then substituting them into the UDC commands -- is quite well-tailored to the use of POINTERS. In order to determine the values of all the UDC parameters, I have to parse the UDC invocation ("COBOL85 AP010S,AP010U") and the UDC header ("COBOL85 SRC,OBJ=$NEWPASS,LIST=$STDLIST"). From the UDC invocation I determine the values of the specified parameters (AP010S and AP010U); from the UDC header I get the parameter names (SRC, OBJ, and LIST), and the default values ($NEWPASS and $STDLIST). (For the purposes of this discussion, I'm ignoring keyworded UDC invocation -- if you don't know what I mean by this, that's good; it's not relevant here.) Thus, my parsing has essentially generated three tables: Parameter Names Values Given Default Values SRC AP010S none OBJ AP010U $NEWPASS LIST none $STDLIST Two of these tables -- the Values Given and Default Values -- I have to merge into one table, the Actual Values table. If a value was given, it becomes the Actual Value; if it wasn't, the default value is used. 
Let's look at the kind of data structure we'd use to do this sort of thing:

   TYPE STRING_TYPE = PACKED ARRAY [1..80] OF CHAR;
   VAR PARM_NAMES:     ARRAY [1..MAX_PARMS] OF STRING_TYPE;
       VALUES_GIVEN:   ARRAY [1..MAX_PARMS] OF STRING_TYPE;
       DEFAULT_VALUES: ARRAY [1..MAX_PARMS] OF STRING_TYPE;
       ACTUAL_VALUES:  ARRAY [1..MAX_PARMS] OF STRING_TYPE;

As you see, we've declared four arrays of strings. Each one contains up to MAX_PARMS (say, 16) strings, one for each UDC parameter. Of course, if we do it this way, the first three arrays alone will be using 3*16*80 = 3840 bytes. Since actually each parameter could be up to 256 bytes long, we'd actually need more like 12,000 bytes to fit all our data! What a waste, especially since all of these values were derived from two strings -- the UDC invocation and UDC header -- each of which was at most 256 bytes. In other words,

* All the elements of VALUES_GIVEN are simply substrings of the UDC_INVOCATION array;

* All the elements of PARM_NAMES and DEFAULT_VALUES are substrings of the UDC_HEADER array.

Why actually copy these substrings out? It takes a lot of space and more than a little time -- instead, let's just keep the INDICES and LENGTHS of the substrings in their original arrays:

   VAR PARM_NAME_INDICES:     ARRAY [1..MAX_PARMS] OF INTEGER;
       PARM_NAME_LENGTHS:     ARRAY [1..MAX_PARMS] OF INTEGER;
       GIVEN_VALUE_INDICES:   ARRAY [1..MAX_PARMS] OF INTEGER;
       GIVEN_VALUE_LENGTHS:   ARRAY [1..MAX_PARMS] OF INTEGER;
       DEFAULT_VALUE_INDICES: ARRAY [1..MAX_PARMS] OF INTEGER;
       DEFAULT_VALUE_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;

Note that, so far, this is a classical PASCAL solution; if you want to "point" to data that's in your program's variables (rather than dynamically allocated using NEW), you just keep indices instead of the actual data.
Then, you can say

   SUBSTR(UDC_HEADER,PARM_NAME_INDICES[PNUM],PARM_NAME_LENGTHS[PNUM])

and thus refer to the PNUMth PARM_NAME (assuming your PASCAL has a SUBSTR function); similarly, you can use

   SUBSTR(UDC_HEADER,DEFAULT_VALUE_INDICES[PNUM],
          DEFAULT_VALUE_LENGTHS[PNUM])

and

   SUBSTR(UDC_INVOCATION,GIVEN_VALUE_INDICES[PNUM],
          GIVEN_VALUE_LENGTHS[PNUM])

Remember, you KNOW where the substrings came from anyway, so with the indices and the lengths you can always "reconstitute" them whenever you like instead of having to keep them around in separate arrays. However, think about the ACTUAL_VALUE table. This contains the ACTUAL VALUES of the UDC parameters, which might have come either from the DEFAULT VALUES on the UDC header or the GIVEN VALUES on the UDC invocation. How can you represent the actual values without having to copy each one out into a separate string? You see, you can't just keep the index of the actual value around, since in this case, you're not sure WHAT string this is an index into. You'd have to have a special array of flags:

   VAR ACTUAL_VALUE_FROM:    ARRAY [1..MAX_PARMS] OF (HEADER,INVOCATION);
   VAR ACTUAL_VALUE_INDICES: ARRAY [1..MAX_PARMS] OF INTEGER;
   VAR ACTUAL_VALUE_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;

and then use it like this:

   PROCEDURE PRINT_ACTUAL_VALUES (NUM_PARMS: INTEGER);
   VAR PNUM: 1..MAX_PARMS;
   BEGIN
   FOR PNUM:=1 TO NUM_PARMS DO
     BEGIN
     WRITE ('PARAMETER NUMBER ', PNUM:3, ' IS: ');
     IF ACTUAL_VALUE_FROM[PNUM]=HEADER THEN
       WRITELN (SUBSTR(UDC_HEADER,ACTUAL_VALUE_INDICES[PNUM],
                       ACTUAL_VALUE_LENGTHS[PNUM]))
     ELSE
       WRITELN (SUBSTR(UDC_INVOCATION,ACTUAL_VALUE_INDICES[PNUM],
                       ACTUAL_VALUE_LENGTHS[PNUM]));
     END;
   END;

The point I'm trying to make here is that:

* THERE ARE CASES IN WHICH YOU WANT TO HAVE A VARIABLE "POINTING" INTO ONE OF SEVERAL ARRAYS -- OR, IN GENERAL, ONE OF A NUMBER OF POSSIBLE LOCATIONS. TO DO THIS IN PASCAL, YOU HAVE TO KEEP TRACK OF *BOTH* WHICH LOCATION IT'S POINTING TO AND WHAT ITS INDEX INTO THAT LOCATION IS.
THEN, TO REFERENCE IT, YOU'LL NEED AN "IF" OR A "CASE".

Imagine, though, that in PASCAL you were able to set a pointer to the address of a global or procedure local variable (and, presumably, have some way of using that pointer as a string). Then, you could have

   VAR PARM_NAMES:            ARRAY [1..MAX_PARMS] OF ^STRING;
       PARM_NAME_LENGTHS:     ARRAY [1..MAX_PARMS] OF INTEGER;
       GIVEN_VALUES:          ARRAY [1..MAX_PARMS] OF ^STRING;
       GIVEN_VALUE_LENGTHS:   ARRAY [1..MAX_PARMS] OF INTEGER;
       DEFAULT_VALUES:        ARRAY [1..MAX_PARMS] OF ^STRING;
       DEFAULT_VALUE_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;
       ACTUAL_VALUES:         ARRAY [1..MAX_PARMS] OF ^STRING;
       ACTUAL_VALUE_LENGTHS:  ARRAY [1..MAX_PARMS] OF INTEGER;

Not only will you be able to say

   SUBSTR(PARM_NAMES[PNUM]^,1,PARM_NAME_LENGTHS[PNUM])

instead of

   SUBSTR(UDC_HEADER,PARM_NAME_INDICES[PNUM],PARM_NAME_LENGTHS[PNUM])

thus being able to forget where the PARM_NAMES, GIVEN_VALUES, and DEFAULT_VALUES arrays happened to be derived from, but you could also have ACTUAL_VALUES point into EITHER the header or the invocation, thus reducing our "PRINT ACTUAL VALUES" procedure to:

   PROCEDURE PRINT_ACTUAL_VALUES (NUM_PARMS: INTEGER);
   VAR PNUM: 1..MAX_PARMS;
   BEGIN
   FOR PNUM:=1 TO NUM_PARMS DO
     BEGIN
     WRITE ('PARAMETER NUMBER ', PNUM:3, ' IS: ');
     WRITELN (SUBSTR(ACTUAL_VALUES[PNUM]^,1,
                     ACTUAL_VALUE_LENGTHS[PNUM]));
     END;
   END;

Unfortunately, in PASCAL you can't do this, because there's no way to fill the various arrays of pointers (ACTUAL_VALUES et al.) with data -- there's no way of determining the pointer to, say, UDC_INVOCATION[33] or UDC_HEADER[2]. What I've been trying to convince you of is that this is a non-trivial lack: there are cases in which it's desirable to be able to set a pointer to point to any object in your data space, and then work from that pointer rather than, say, indexing into an array.

OTHER POINTER APPLICATIONS

There are other cases in which many C and SPL users use pointers.
In these cases, a PASCAL user can quite as readily use an index into a string or an array; it's hard to tell which solution is better. For instance, consider these three procedures: [Note: I've intentionally avoided using certain language features like PASCAL sets, SPL three-way <=, some automatic type conversions, etc. to make the examples as similar as possible]

   TYPE PAC256 = PACKED ARRAY [0..255] OF CHAR;
   PROCEDURE UPSHIFT_WORD (VAR S: PAC256);
   { Upshifts all the letters in S until a non-alphabetic     }
   { character is reached; expects there to be at least one   }
   { special character somewhere in S to act as a terminator. }
   VAR I: INTEGER;
   BEGIN
   I:=0;
   WHILE 'A'<=S[I] AND S[I]<='Z' OR 'a'<=S[I] AND S[I]<='z' DO
     BEGIN
     IF 'a'<=S[I] AND S[I]<='z' THEN
       S[I]:=CHR(ORD(S[I])-32);  { upshift character }
     I:=I+1;
     END;
   END;

   PROCEDURE UPSHIFT'WORD (S);
   BYTE ARRAY S;
   << Upshifts all the letters in S until a non-alphabetic     >>
   << character is reached; expects there to be at least one   >>
   << special character somewhere in S to act as a terminator. >>
   BEGIN
   BYTE POINTER P;
   @P:=@S;
   WHILE "A"<=P AND P<="Z" OR "a"<=P AND P<="z" DO
     BEGIN
     IF "a"<=P AND P<="z" THEN
       P:=BYTE(INTEGER(P)-32);  << upshift character >>
     @P:=@P(1);
     END;
   END;

   upshift_word (s)
   char s[];
   {
   char *p;
   p = &s[0];
   while ('A'<=*p && *p<='Z' || 'a'<=*p && *p<='z')
     {
     if ('a'<=*p && *p<='z')
       *p = (char) (*p - 32);  /* upshift character */
     p = &p[1];
     }
   }

The first example is PASCAL, using indices into an array of characters; the next two are SPL and C, using character pointers. Which is better?

* In PASCAL, since you're using indices into an array whose size is known, the compiler can do run-time bounds checking to make sure that the index is valid; also, every "S[I]" reference makes it clear where you're getting the data from.

* In SPL and C, instead of "S[I]" you just say "P" (in SPL) or "*P" (in C).
This is, incidentally, probably faster than PASCAL, since it would typically require just an indirect memory reference instead of an indirect reference with indexing (unless you have a very smart compiler).

* Some people think that "S[I]" is more readable; others don't like to always repeat the index (especially if it's something like "MY_STRING[MY_INDEX]") and prefer "P" or "*P". This is where it gets quite subjective; you've got to decide for yourself.

MORE ON DYNAMICALLY ALLOCATING MEMORY

As you recall, we started our discussion of pointers with an example involving dynamic memory allocation. This is really good for me, since I have things to say about dynamic memory allocation, and I'd have a hard time sneaking them into any other chapter. Thus, with this tenuous connection established, let's talk some more about dynamic memory allocation. I use "dynamic memory allocation" to mean allocating memory other than what's automatically allocated for you in the form of GLOBAL or PROCEDURE LOCAL variables. You typically use this mechanism when you don't know at compile-time how much memory you'll need. I've already given some examples of uses of dynamic memory allocation:

* Allocating a "UDC command dictionary" that could be 0 bytes or 10,000 bytes.

* Allocating "process information records", which might themselves be rather small, but of which there might be either none or a thousand.

* Implementing commands like MPE's :SETJCW, with which you can define any number of objects at the user's command.

Let's consider the latter example -- you're writing a command-driven program, in which the user might use a "SETVARIABLE" command to define a new variable and give it a value (say, an integer).
Naturally, you have some top-level prompt-and-input routine, which then sends the user-supplied command to the parser (called, say, PARSE'COMMAND). The parser identifies this as a SETVARIABLE command, and calls this procedure:

   PROCEDURE SETVARIABLE (VAR'NAME, VAR'LEN, INIT'VALUE);
   VALUE VAR'LEN;
   VALUE INIT'VALUE;
   BYTE ARRAY VAR'NAME;
   INTEGER VAR'LEN;
   INTEGER INIT'VALUE;
   BEGIN
   ...
   END;

Now SETVARIABLE has all the data already at its disposal in its parameters; however, it has to SAVE all this information somewhere, so that it'll stay around long after the SETVARIABLE procedure and even the PARSE'COMMAND procedure is exited. Presumably there is a FINDVARIABLE procedure somewhere that, given the variable name, will extract the value that was put into it using SETVARIABLE. Where should SETVARIABLE put the data -- the variable name and initial value? Well, we could have a global array:

   BYTE ARRAY VARIABLES(0:4095);

This gives us up to 4096 bytes of room for our "user-defined variables", names, data, and all. Clearly, though, this solution is both inefficient and inflexible. What if the user doesn't define any variables? We've wasted 4K bytes. What if the user defines too many variables? He won't be able to. What we want to do, thus, is to have SETVARIABLE request from the system a chunk of memory containing VAR'LEN+3 bytes -- VAR'LEN for the name, 1 for the name length, and 2 for the variable value. Then, we can keep the addresses of all these chunks somewhere (perhaps in a linked list), and FINDVARIABLE can then just go through these chunks to find the variable it's looking for.

* In SPL, the only easy way of dynamically allocating memory is by calling DLSIZE. DLSIZE will get space from the "DL-DB" area; unfortunately, it'll only get it in 128-word chunks (if you ask for a 2-word chunk, it'll give you 128 words). Also, if we ever see a DELETEVAR command, there's no easy way of "returning space" to the system (always a difficult task).
* In PASCAL, we'd use the NEW procedure. You pass to NEW a pointer to a given data type, and it will set that pointer to point to a newly allocated variable of that data type. Thus, you'd say:

   TYPE NAME_TYPE = PACKED ARRAY [1..80] OF CHAR;
        VAR_REC = RECORD
                  CURRENT_VALUE: INTEGER;
                  NAME_LEN: INTEGER;
                  NAME: NAME_TYPE;
                  END;
        VAR_REC_PTR = ^VAR_REC;
   ...
   FUNCTION SETVARIABLE (VAR NAME: NAME_TYPE;
                         NAME_LEN: INTEGER;
                         INIT_VALUE: INTEGER): VAR_REC_PTR;
   VAR RESULT: VAR_REC_PTR;
   BEGIN
   NEW (RESULT);
   RESULT^.CURRENT_VALUE:=INIT_VALUE;
   RESULT^.NAME_LEN:=NAME_LEN;
   RESULT^.NAME:=NAME;
   SETVARIABLE:=RESULT;
   END;

Simple, eh?

* In C, you'd do almost the same thing. I say almost because the only real difference is that C's equivalent of NEW, called CALLOC, takes the number of elements to allocate (in our case 1, since this is just a record and not an array of records) and the size of each element.

   typedef struct {int current_value;
                   int name_len;
                   char name[0];} var_rec;
   typedef var_rec *var_rec_ptr;
   ...
   var_rec_ptr setvariable (name, name_len, init_value)
   char name[];
   int name_len, init_value;
   {
   var_rec_ptr result;
   /* Allocate an object, cast it to type "VAR_REC". */
   result = (var_rec_ptr) calloc (1, sizeof(var_rec) + name_len + 1);
   result->current_value = init_value;
   result->name_len = name_len;
   strcpy (result->name, name);
   return result;
   }

Compare PASCAL and C; ignore the small differences like the STRCPY call (it just copies one string into another). The important difference is in the CALLOC and NEW calls, and it's an important one indeed!

* IN PASCAL, A CALL TO "NEW" ALLOCATES A NEW OBJECT OF A GIVEN DATATYPE. THE SIZE OF THE OBJECT IS UNIQUELY DEFINED BY THE DATATYPE.

WHAT ABOUT STRINGS???

NEW is just great for allocating fixed-size objects, like the Process Information Records we talked about earlier. But what about variable-length things? When we defined the VAR_REC data type, we defined NAME to be a PACKED ARRAY [1..80] OF CHAR. This, however, isn't quite precise. What this means to us is that NAME can be UP TO 80 characters long.
But when NEW allocates new objects of type VAR_REC, it will ALWAYS allocate them with room for 80 characters in NAME! Never mind that we know how long NAME should REALLY be -- we have no way of telling this to NEW. C's CALLOC, on the other hand, allows us to specify the number of bytes we need to allocate. The disadvantage of this is that we need to figure out this number; this, however, isn't hard -- we just say

   sizeof (<datatype>)

e.g.

   sizeof (var_rec)

[Note that VAR_REC was cunningly defined to have the NAME field be 0 characters long -- since C never does bounds checking anyway, this won't cause any problems, but will make SIZEOF return the size of only the fixed-length portion of VAR_REC.] The great advantage of CALLOC is that for variable-length objects we can EXACTLY indicate how much space is to be allocated. Since space savings is one of the major reasons we do dynamic memory allocation, this advantage of CALLOC -- or, more properly, disadvantage of NEW -- becomes very serious indeed. Not only does it waste space, but it also impairs flexibility, since in trying to save space we restrict the maximum variable name length to 80 bytes, when we should really make it virtually unlimited. Thus, to summarize, C and PASCAL both have relatively easy-to-use dynamic memory allocation mechanisms (as well as deallocation mechanisms, called DISPOSE in PASCAL and CFREE in C). They both work very well for allocating fixed-length objects (or, parenthetically, so-called "variant records" which are really variable-length in that they can have one of several distinct formats). However, if you want to allocate variable-length objects -- e.g. strings or records containing strings -- PASCAL CAN'T DO IT WITHOUT WASTING INORDINATE AMOUNTS OF MEMORY!

DYNAMIC MEMORY ALLOCATION -- PASCAL/3000 AND PASCAL/XL

Naturally, I'm not the first one to notice this kind of problem.
PASCAL/XL has a rather nice solution to it (I only wish the Standard PASCAL authors thought of it):

   P_GETHEAP (PTR, NUM_BYTES, ALIGNMENT, OK);

Using this built-in procedure, you can set PTR (which can be a pointer of any type) to point to a newly-allocated chunk of memory NUM_BYTES long. ALIGNMENT indicates how to physically align this chunk (on a byte, half-word, word, double-word, or page boundary), and OK indicates whether or not this request succeeded (another thing that Standard PASCAL's NEW doesn't give you). The counterpart to DISPOSE here is

   P_RTNHEAP (PTR, NUM_BYTES, ALIGNMENT, OK);

These procedures seem every bit as good as CALLOC and CFREE -- they let you allocate EXACTLY as much space as you need. PASCAL/3000 does not have the P_GETHEAP and P_RTNHEAP procedures; however, hidden away in Appendix F of the PASCAL/3000 manual there is a subsection called "PASCAL Support Library" (did you see this section when you read the manual?). Here, with the strong implication that these procedures are to be used from OTHER languages rather than PASCAL, are documented two procedures:

   GETHEAP (PTR, NUM_BYTES, OK);

and

   RTNHEAP (PTR, NUM_BYTES, OK);

It appears that these two procedures do pretty much the same thing as P_GETHEAP and P_RTNHEAP -- they allocate an arbitrary amount of space, and return a pointer to it in the variable PTR, which can be of any type. Again, I'm not sure whether they were even INTENDED to be called from PASCAL or only from other languages; however, it appears that they ought to work from PASCAL, too.

PASCAL/XL AND POINTERS

As I mentioned earlier, Standard PASCAL allows pointers in one case and one case alone -- pointers to dynamically allocated (NEWed) data. There's no way to make a pointer point to a global or procedure-local variable. PASCAL/3000 shares this lack; it lets you get the address of an arbitrary variable (by calling WADDRESS, e.g.
"WADDRESS(X)"), but it doesn't allow you to do the inverse -- go from the address (which is an integer) back to the value. SPL and C, of course, allow you to do both. In SPL, you can say INTEGER POINTER IP; INTEGER I, J; @IP:=@I; << set IP to the address of I >> J:=IP+1; << get the value pointed to by IP >> In C, you'd write int *ip; int i, j; ip = &i; << Set IP to the address of I >> j = *ip + 1; Note that C provides you two operators -- "&" to get the address of a variable, and "*" to get the value stored at a particular address. PASCAL/XL, in essence, allows you to do much the same thing. Its ADDR operator determines the address of an arbitrary variable and returns it as a pointer. Thus, you can say: VAR IP: ^INTEGER; I, J: INTEGER; IP:=ADDR(I); J:=IP^+1; A very simple addition, but it allows you to do virtually all of the pointer manipulation described earlier in the section -- you can have pointers that point to one of several local arrays, pointers that step through a string, etc. To revive an example from before, PASCAL/XL lets you write: TYPE PAC256 = PACKED ARRAY [0..255] OF CHAR; PROCEDURE UPSHIFT_WORD (S: PAC256); { Upshifts all the letters in S until a non-alphabetic } { character is reached; expects there to be at least one } { special character somewhere in S to act as a terminator. } VAR P: ^CHAR; BEGIN P:=ADDR(S); WHILE 'A'<=P^ AND P^<='Z' OR 'a'<=P^ AND P^<='z' DO BEGIN IF 'a'<=P^ AND P^<='z' THEN P^:=CHR(ORD(P^)-32); { upshift character } P:=ADDTOPOINTER(P,SIZEOF(CHAR)); END; END; Compare this with the corresponding C code: upshift_word (s) char s[]; { char *p; p = &s; while ('A'<=*p && *p<='Z' || 'a'<=*p && *p<='z') { if ('a'<=*p && *p<='z') p = (char) ((int)*p - 32); p = &p[1]; } } As I mentioned before, one can legitimately say that you should be indexing into the string (e.g. S[I]) rather than using a pointer -- in fact, since you can't use pointers to local variables in Standard PASCAL, you'd have to use indexing. 
On the other hand, many people prefer using pointers and, as you see, PASCAL/XL allows you to do this as easily as in C.

SPL AND ITS LOW-LEVEL ACCESS MECHANISMS

SPL, being a language designed explicitly for the HP3000 and for nitty-gritty systems programming, has a lot of "low-level" access mechanisms. These include:

* The ability to execute any arbitrary machine instruction (using ASSEMBLE).

* The ability to push things onto and pop things off the stack (using TOS).

* The ability to examine (PUSH), set (SET), and reference data relative to various system registers (Q, DL, DB, S, X, etc.).

Standard C and PASCAL, naturally, do not have such mechanisms; neither does PASCAL/XL. CCS, Inc.'s C/3000 does have "ASSEMBLE"- and "TOS"-like constructs (although their ASSEMBLE is more difficult to use than SPL's); however, it's by no means certain that C/XL will have them. Now, if we were simply counting features, things would be simple. We'd credit SPL with 3 new statements (ASSEMBLE, PUSH, and SET) and 5 new addressing modes (TOS, X register, DB-relative, Q-relative, and S-relative), and that'd be that. Score: SPL 37, PASCAL 22, C 31. Of course, not every feature is worth as much as any other feature. Many people complain that it's BAD for SPL to have these features; that whatever performance advantages you can get aren't worth the additional complexity and opacity; that, in general, PASCAL and C are better off without them. Now this may end up being a moot point, especially if C/XL ends up not having ASSEMBLE and similar constructs. On the other hand, it might be nice to consider any cases there may be where such constructs are really necessary -- if only for old times' sake.

THE ARGUMENTS AGAINST ASSEMBLE, TOS, AND FRIENDS

Before we go further, let me outline the arguments -- most of them perfectly valid -- that have been made against SPL's (and other languages') low-level constructs:

* If you're using low-level constructs for performance's sake, you're wasting your time.
Most compilers these days are good enough that they generate very efficient code, and you can't do much better using assembly. On the other hand, assembly is much, much harder to write, read, and maintain than high-level code. It's just not worth it.

* If you're using low-level constructs for functionality, all that means is that the system isn't providing you with enough fundamental primitives that you could use in place of assembly. For instance, the old trick of going through the stack markers, Q register, etc. to get your ;PARM= and ;INFO= -- there should have been an intrinsic to do that in the first place.

* If the language has low-level programming constructs, people will use them out of thoughtlessness or a misguided sense of efficiency, and thus produce awful, impossible-to-maintain code. Languages with ASSEMBLE, TOS, and the like are like a loaded gun, an open invitation for anybody to shoot himself in the foot (or worse).

* Finally, the more sophisticated (and efficient) the compiler, the more likely that it CAN'T let you do low-level stuff. How can you use register-manipulation code if you don't know what registers the compiler uses for itself (and it can use different ones in different cases)? How can you get things off the stack if you don't know whether the compiler is keeping them on the stack or in registers? How can you trace back the stack markers if the compiler may do special call instructions or place code inline at its own discretion?

The first of these arguments, in my opinion, is on the whole very sound. Very rarely do I find it desirable to use low-level constructs for efficiency's sake. Compared to programmer and maintainer time, computer time is cheap. On the other hand, when efficiency is really very important -- and it's often the case that 5% of the code uses 95% of the CPU time -- using low-level constructs for performance's sake may be quite necessary.
The fourth argument -- that a smart modern compiler can't assure you about the state of the world and thus can't let you muck around with it -- is very potent as well. In SPL, you can always say

   TOS:=123;
   << now, execute an instruction that expects something on TOS >>

What if on Spectrum you have an instruction that expects a value in register 10? You can't just say

   R10:=123;
   << execute the instruction >>

What if the compiler stored one of your local variables in R10? How does it know that it has to save it before your operation? Will saving it damage the machine state (e.g. condition codes) enough that the operation won't work anyway? The classic example of this in SPL is condition codes -- saying

   FNUMS'ARRAY(I):=FOPEN (...);
   IF <> THEN ...

The very act of saving the result of FOPEN in the Ith element of FNUMS'ARRAY resets the condition code, thus making the IF <> THEN do something entirely unexpected! Now, an SPL user can know which operations change the condition code, and which don't (maybe), and thus avoid this sort of error -- but what about a PASCAL/XL user? Will HP be obligated to tell all the users which registers and condition codes each operation modifies? The third of the arguments has some merit, too. In fact, all you need to do is to look at a certain operating system provided by the manufacturer of a certain large business minicomputer, to see how dangerous TOSes and ASSEMBLEs are. I've seen pieces of code that are utterly impossible to understand, where something is pushed onto the stack only to be popped 60 lines and 10 GOTOs later -- ugh! There are some language constructs that are just plain DANGEROUS. It's the second argument -- that all the cases where you need to use low-level code shouldn't have existed in the first place -- that I don't quite buy. What SHOULD be and what IS are two different things.
I personally wish that every case where I needed to use low-level code was already implemented for me by a nice, readable, easy-to-use HP intrinsic. Unfortunately, that's not always the case, and I shudder to think what would have happened if I DIDN'T have a way of doing all the dirty assembler stuff myself. Thus, the point here is: every system SHOULD provide you all you want, and it SHOULD optimize your code well (perhaps even better than you could do it yourself using assembler). Unfortunately, it often DOESN'T, and you need some way of getting around it to do things yourself.

EXAMPLES OF THINGS YOU NEED LOW-LEVEL OPERATIONS FOR

SETTING CONDITION CODES

For better or worse, HP decided that its intrinsics indicate some part of their return result as the "condition code". This value, actually 2 bits in the STATUS register, can be set to the so-called "less than", "greater than", and "equal" values; to see what the current condition code value is, you can say in SPL:

   IF <> THEN  << or <, >, <=, >=, or = >>

or, in FORTRAN:

   IF (.CC.) 10, 20, 30  << go to 10 on <, 20 on =, 30 on > >>

(A similar mechanism exists in COBOL II.) Now, have you ever wondered how HP's intrinsics actually SET this condition code? You can't just say:

   CONDITION'CODE:=<;

to set it to "less than". What can you do? Now, one can say -- and quite correctly -- that it's not a good thing for a procedure to return condition codes. Condition codes are volatile things; they're changed by almost every machine instruction; for instance, as I mentioned before,

   FNUMS(I):=FOPEN (...);
   IF <> THEN ...

won't do what you expect, since the instructions that index into the FNUMS array reset the condition code. Thus, if you have the choice, you ought to return data as, say, the procedure's return value, or a by-reference parameter. Still, sometimes it's necessary to return a condition code.
Say, for instance, that you have a program that's been written to use FREAD calls, and you decide to change it to call your own procedure called, say, MYREAD. MYREAD might, for instance, do MR NOBUF I/O to speed things up, or whatever -- the important thing is that you want it to be "plug compatible" with FREAD. You just want to be able to say

   /CHANGE "FREAD","MYREAD",ALL

and not worry about changing anything else. Well, in C or PASCAL, you'd be USC (that's Up Some Creek). In SPL, though, you can do it. You have to know that the condition code is kept in the STATUS register, a copy of which is in turn kept in your procedure's STACK MARKER at location Q-1. When you do a return from your procedure to the caller, the EXIT instruction sets the status register to the value stored in the stack marker. Thus, you just need to set the condition code bits in Q-1 (something you can't do in any language besides SPL) before returning from the procedure:

   INTEGER PROCEDURE MYREAD (FNUM, BUFFER, LEN);
   VALUE FNUM, LEN;
   INTEGER FNUM, LEN;
   ARRAY BUFFER;
   BEGIN
   INTEGER STATUS'WORD = Q-1;
   DEFINE CONDITION'CODE = STATUS'WORD.(6:2) #;
   EQUATE CCG=0, CCL=1, CCE=2;  << possible CONDITION'CODE values >>
   ...
   CONDITION'CODE:=CCE;
   END;

Relatively clean, as you see, but not doable without "low-level" access (in this case, Q-relative addressing). Incidentally, don't think this is just a speculative example that doesn't happen in real life. I usually avoid using condition codes in my RL (most of my procedures return a logical value indicating success or failure), but I have several just like this -- FREAD plug-compatible replacements. Also, I've sometimes had to write SL procedures that exactly mimic intrinsics like READX, FREAD, FOPEN, etc. so that I can patch a program file to call this procedure instead of the HP intrinsic. (VESOFT's hook mechanism, which implements RUN, UDCs, super REDO, and MPEX commands from within programs like EDITOR, QUERY, etc. works like this.)
One can say that HP should have provided a SET'CCODE intrinsic in the first place to do this; my only response is that it didn't, and I have to somehow get my job done in spite of it.

SYSTEM TABLE ACCESS

System table access, another thing that I like to do in a high-level fashion, with as few ASSEMBLEs, TOSes, EXCHANGEDBs, etc. as possible, sometimes needs low-level access. The classic example is accessing system data segments, e.g.

   TOS:=@BUFFER;
   TOS:=DSEG'NUMBER;
   TOS:=OFFSET;
   TOS:=COUNT;
   ASSEMBLE (MFDS 4);  << Move From Data Segment >>

Originally, in SPL, this was the ONLY way to access a system data segment (like the PCB, JMAT, your JIT, etc.). I didn't have an OPTION -- do it this way or some other, somewhat slower way; it was either use TOS and ASSEMBLE (which I was scared to death of) or not do it at all. Now, of course, SPL has the MOVEX statement, with which I can say

   MOVEX (BUFFER):=(DSEG'NUMBER,OFFSET),(COUNT);

to do exactly the same thing without any unsightly ASSEMBLEs. Remember, though, that this construct is a recent addition to SPL; when I started hacking on the HP3000 in 1979, it wasn't there, so I had to do without it. Another example of system table access is access to the PXGLOB, a table that lives in the DL-negative area of your stack. Most languages can't access this area, and even SPL doesn't give you a direct way of getting to it; but, with the PUSH statement, it can be done:

   INTEGER ARRAY PXGLOB(0:11);
   INTEGER POINTER DL'PTR;
   ...
   PUSH (DL);
   @DL'PTR:=TOS;
   MOVE PXGLOB:=DL'PTR(-DL'PTR(-1)),(12);

We use SPL's ability to set a pointer to point to any location in the stack (in this case, the location pointed to by the DL register), and then index from there. Again, there's no way of doing this without using PUSH and TOS.

OTHER APPLICATIONS OF LOW-LEVEL CONSTRUCTS

Some other cases where ASSEMBLE, TOS, etc. are necessary to do things:

* If you look in the "CONTROL STRUCTURES" section of this paper, you'll see a PASCAL/XL construct called TRY ..
RECOVER. For reasons that I explain in that section, I think that it's a very useful construct, and I've implemented it in SPL for the benefit of my SPL programs. Note that in any other language, I couldn't do this; only in SPL, with its register access and especially the ability to do Q-relative addressing to access stack markers, could I implement this entirely new control structure.

* Whenever you write an SPL OPTION VARIABLE procedure, you need to be able to access its "option variable bit mask", which indicates which parameters were passed and which were omitted. This information is stored at Q-4 (and also sometimes at Q-5); with SPL's Q-relative addressing, you can access it. Again, maybe SPL should have a built-in construct that lets you find out the presence/absence of an OPTION VARIABLE parameter; however, it does not.

* To determine your run-time ;PARM= or ;INFO= value, you need to look at your "Qinitial"-relative locations -4, -5, and -6. Qinitial refers to the initial value of the Q register; to get to it, you have to go through all your stack markers, which requires Q-relative addressing. HP's new GETINFO intrinsic does this for you; it was released in 1987, whereas the HP3000 was first put out in 1972.

* HP's LOADPROC intrinsic dynamically loads a procedure from the system SL and returns you the procedure's plabel. How do you call the loaded procedure? You have to push all the parameters onto the stack and then do an ASSEMBLE, to wit:

   TOS:=0;       << room for the return value >>
   TOS:=@BUFF;   << parameter #1 >>
   TOS:=I+7;     << parameter #2 >>
   TOS:=PLABEL;  << the plabel returned by LOADPROC >>
   ASSEMBLE (PCAL 0);
   RESULT:=TOS;  << collect the return value >>

* Say that, in the middle of executing a procedure, you need to allocate X words of space. You can try allocating it in your DL area, but then you'll have a hard time deallocating it (unless you want to write your own free space management package).
If you need this space only until the end of the procedure, you can simply say:

     INTEGER S0 = S-0;   << S0 now refers to the top of stack >>
     INTEGER POINTER NEWLY'ALLOCATED;
     ...
     @NEWLY'ALLOCATED:=@S0+1;
     TOS:=X;             << the amount of space to allocate >>
     ASSEMBLE (ADDS 0);  << allocate the space on the stack >>

NEWLY'ALLOCATED now points to the X words of newly allocated stack space. Exiting the procedure will deallocate the space, as will saying

     TOS:=@NEWLY'ALLOCATED-1;
     SET (S);

* XCONTRAP sets up a procedure as a control-Y trap procedure; when the user hits control-Y, the procedure is called. However, if the control-Y is hit at certain times (say, in the middle of an intrinsic call), there'll be some junk left on the stack that the trap procedure will have to pop. The way the SPL manual suggests you do this is by saying:

     PROCEDURE TRAP'PROC;
     BEGIN
     INTEGER SDEC=Q+1;   << indicates the amount of junk to pop >>
     ...
     TOS:=%31400+SDEC;   << build an EXIT instruction! >>
     ASSEMBLE (XEQ 0);   << execute the value in TOS! >>
     END;

This is, of course, incredibly ugly code -- you build an instruction on top of the stack and then execute it! -- and HP should certainly have designed its control-Y trap mechanism some other way. On the other hand, I can't do anything about it; I'm stuck with it, and I have to have some way of dealing with it.

CONCLUSION: HOW BAD IS LOW-LEVEL?

As you see, for all the bad things that have been said about ASSEMBLEs and TOSes, sometimes they are necessary to get things done. Almost by definition, every case in which they are necessary indicates something wrong with the operating system. In every case shown above, I SHOULDN'T have to stoop to ASSEMBLEs et al.; there SHOULD be HP intrinsics to set the condition code, get the ;PARM= and ;INFO=, move things to/from data segments, access the DL-negative area, allocate things on the stack, do TRY .. RECOVER, etc.
And, as you see, many of the problems I discussed above HAVE been fixed -- in new versions of MPE, of SPL, of PASCAL/XL. The root of the problem, though, remains the same:

* HP WILL NEVER THINK OF EVERYTHING. The users' needs will always outstrip HP's clairvoyance.

The big advantage of SPL was that it gave you the tools to satisfy almost any need you had (at a substantial cost in blood, sweat, toil, and tears). I only hope that on Spectrum, HP has some mechanism -- an ASSEMBLE construct in PASCAL/XL or C/XL, or, perhaps, a separate assembler that is accessible to the users -- with which Spectrum users can attack problems that HP hasn't thought of.

THE STANDARD LIBRARIES IN DRAFT STANDARD C

One of C's features that its fans are justifiably proud of is the tendency of many C compilers to provide lots of useful built-in "library functions", which do things like I/O, string handling, mathematical functions (exp, log, etc.), and more. In addition to getting "the language" itself, C proponents say, you also get a lot of nice functionality that you COULD have implemented yourself, but would rather not have to.

Now, Standard PASCAL has some such functions (mostly in the field of I/O and mathematics); PASCAL/3000 and PASCAL/XL add more (mostly strings and more I/O); SPL, being fixed to the HP3000, relies on the HP3000 System Intrinsics. Kernighan & Ritchie C, to be honest, is actually INFERIOR to Standard PASCAL insofar as built-in functions go -- although it defines a standard set of I/O functions, it doesn't define any standard mathematical functions, nor does it define standard string handling functions (which Standard PASCAL doesn't, either). However, many C compilers quickly evolved their own sets of supported library functions, and the Draft Proposed C Standard enumerates and standardizes them all.
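To give a taste of what this buys you, here is a small sketch using a few functions from the Draft Standard's string and math libraries; the demo routine itself is invented for illustration:

```c
#include <string.h>   /* strcpy, strcmp, strlen -- string handling */
#include <math.h>     /* sqrt -- mathematical functions */
#include <stdio.h>    /* sprintf -- formatted output into a string */

/* Build a short report line; every library function used here comes
   "free" with the compiler rather than being hand-written. */
int demo (char *out)
{
    char name[20];

    strcpy (name, "HP3000");             /* copy a string */
    sprintf (out, "%s: %d chars, sqrt(16)=%.1f",
             name, (int) strlen (name), sqrt (16.0));
    return strcmp (name, "HP3000");      /* 0 -- the strings match */
}
```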
Remember, though, the considerations involved in relying on the Draft Proposed C Standard:

* On the one hand, the Standard is new and not even finalized yet, so most existing compilers are likely to differ from it in quite a few respects. In fact, it'll probably be years before most C compilers fully conform to the new Standard.

* On the other hand, the Draft Standard was not created from scratch. All or most of the functionality that it sets down has already been implemented in one or more of the existing C compilers. In particular, at least string handling packages and mathematical functions are available in virtually all C compilers (although not necessarily entirely standardized).

The question of these sorts of built-in support functions is not an earth-shaking one; almost by definition of C, any function described here can be implemented by any C programmer, and most of these functions are probably ordinary C-written functions that just happened to have been provided by the compiler writer. However, I think that it is somewhat important to mention these functions simply because although you CAN write them, you'd rather not write anything you don't have to. Any time the standard is thoughtful enough to provide date and time support (how many thousands of various personal implementations of date handlers are there? and how many of them actually work?) or built-in binary search or sort mechanisms, that's something to be thankful for.

INTERESTING FUNCTIONS PROVIDED BY DRAFT STANDARD C

* RAND, a random number generator. Quite simple to implement yourself.

* ATEXIT, which allows you to specify one or more functions that are to be called when the program terminates normally. These may release various resources, flush buffers to disc, etc. Actually, this is a very useful construct, one that I think should be available in any language on any operating system.
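A sketch of how ATEXIT might be used -- the function names here are made up for illustration:

```c
#include <stdio.h>    /* fprintf */
#include <stdlib.h>   /* atexit */

/* Hypothetical cleanup routine -- once registered, it is called
   automatically at normal termination, however many exit paths
   the program has. */
static void release_resources (void)
{
    fprintf (stderr, "flushing buffers, releasing resources...\n");
}

/* Register the cleanup; atexit returns zero on success. */
int setup_cleanup (void)
{
    return atexit (release_resources);
}
```

Once registered, release_resources fires on any normal exit, which is precisely the attraction.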
The major problem here is that you'd like the ATEXIT functions to be called whenever the program terminates, whether normally or abnormally.

* BSEARCH, which does a binary search of a sorted in-memory array. A nice thing, especially since it often isn't provided by the underlying operating system (for instance, there's no intrinsic to do this on the HP3000). Note, however, that this is quite limited in application, since you usually want to search files or databases, not just simple arrays.

* QSORT, which sorts an in-memory array. Again, rather nice, but limited because it only works on in-memory arrays and not on files or databases.

* MEMCPY and MEMMOVE, which can copy arrays very fast (presumably faster than a normal FOR loop). This is comparable with, but different from, PASCAL/XL's MOVE_FAST, MOVE_L_TO_R, and MOVE_R_TO_L. Note that this is somewhat different in spirit from the string handling functions like STRCPY and STRNCPY -- this is intended for arbitrary arrays, and doesn't care about, say, '\0' string terminators.

* MEMCHR finds a character in an array; MEMSET sets all elements of an array to a given character. Again, note the emphasis here on speed (if the computer supports special fast search/fast set instructions, as the HP3000 does, these functions ought to use them) and on working with ARRAYS rather than STRINGS (neither function cares about '\0' string terminators).

* Built-in DATE and TIME handling functions:

  - Return the current date and time.
  - Convert the internal date/time representation to a structure containing year, month, day, hour, minute, and second; convert backwards, too.
  - Compute the difference between two dates and times.
  - Convert an internal time into a text string of an arbitrary user-specified format.
You can, for instance, say

     strftime (s, 80, "%A, %d %B %Y; %I:%M:%S %p", &time);

and the string S (whose maximum length was given as 80) would contain a representation of "time" as, for instance:

     Friday, 29 February 1968; 04:50:33 PM

The third parameter to STRFTIME is a format string; "%A" stands for the full weekday name, "%d" for the day of the month, "%B" for the full month name, etc. As you can see, this is a non-trivial feature, one that many operating systems (e.g. MPE) don't provide, and one that you'd rather not have to implement yourself.

* Character handling functions, such as "isalpha" (is a character alphabetic or not?), "isdigit" (is it a digit?), "toupper" (convert a character to uppercase), etc.

* If you care about these sorts of things, Draft Standard C provides for "native language" support (called "localization" in the C standard). This means that "isalpha", string comparisons, the time-handling functions, etc. are defined to return whatever is appropriate for the local language and character set, be it English, Dutch, Czech, or Swahili (well, maybe not Swahili).

SUMMARY

If you have not yet guessed, I am by nature a loquacious man. For every issue I've raised, I've spent pages providing examples, giving arguments, discussing various points of view. This was all intentional; rather than just presenting my own opinions, I wanted to give as many of the facts as possible and let you come to your own conclusions. However, this resulted in a paper that was 200-odd pages long -- not, I would conjecture, the most exciting and titillating 200 pages that were ever written.

In this section I want to present a summary of what I think the various merits and demerits of SPL, PASCAL, and C are. All of the things I mention are discussed in more detail elsewhere in the paper, so if you want clarification or evidence, you'll be able to find it. I hope, though, that these lists themselves might put all the various arguments and alternatives in better perspective.
Remember, however, as you read this -- if this all sounds opinionated and subjective, all the evidence is elsewhere in the paper, if you want to read it!

THE TROUBLE WITH SPL

[This section includes all those things that make SPL hard to work in. This isn't just "features that exist in other languages but not in SPL" -- these are what might be considered drawbacks (serious or not), things that you're likely to run into and regret while programming in SPL.]

* SPL IS COMPLETELY NON-PORTABLE. There is no HP-supplied Native Mode SPL on Spectrum, and certainly not on any other machine. (Note: Software Research Northwest intends to have a Native Mode SPL compiler released by MAY 1987.)

* SPL's I/O FACILITY FRANKLY STINKS. Outputting and inputting either strings or numbers is a very difficult proposition -- I think this is the major reason why more HP3000 programmers haven't learned SPL.

* SPL HAS NO RECORD STRUCTURES. This is a severe problem, but not fatal -- there are workarounds, though none of them is very clean. See the "DATA STRUCTURES" section for more details.

THE TROUBLE WITH STANDARD PASCAL

* STANDARD PASCAL's PROCEDURE PARAMETER TYPE CHECKING IS MURDEROUS:

  - YOU CAN'T WRITE A GENERIC STRING PACKAGE OR A GENERIC MATRIX-HANDLING PACKAGE because the same procedure can't take parameters of varying sizes! That's right -- you either have to have all your strings be 256-byte arrays (or some such fixed size), or have a different procedure for each size of string! Try writing a general matrix multiplication routine; it's even more fun.

  - YOU CAN'T WRITE A GENERIC ROUTINE THAT HANDLES DIFFERENT TYPES OF RECORD STRUCTURES OR ARRAYS AS ARGUMENTS. Say you want to write a procedure that does a DBPUT and aborts nicely if you get an error; or does a DBGET; or does anything that might cause it to want to take a parameter that's "AN ARRAY OR A RECORD OF ANY TYPE". You can't do it! You must have a different procedure for each type!
- YOU CAN'T WRITE A WRITELN-LIKE PROCEDURE THAT TAKES INTEGERS, STRINGS, OR FLOATS (perhaps to format them all in some interesting way).

* IN STANDARD PASCAL, YOUR PROGRAM AND ALL THE PROCEDURES IT CALLS MUST BE IN THE SAME FILE! That's right -- if your program is 20,000 lines, it must all be in one file, and all of it must be compiled together.

* STANDARD PASCAL HAS NO BUILT-IN STRING HANDLING FACILITIES, AND NO MECHANISM FOR YOU TO IMPLEMENT THEM. Not only are simple things like string comparison, copying, etc. (which are built into SPL) missing; you can't write generic string handling routines of your own unless all your strings have the same length and occupy the same amount of space (see above)!

* STANDARD PASCAL's I/O FACILITY IS ABYSMAL.

  - YOU CAN'T WRITE A STRING WITHOUT CAUSING THE OUTPUT DEVICE TO GO TO A NEW LINE (i.e. you can't just "prompt the user for input" and have the cursor remain on the same line).

  - YOU CAN'T OPEN A FILE FOR APPEND OR INPUT/OUTPUT ACCESS.

  - YOU CAN'T OPEN A FILE WITH A GIVEN NAME. So you want to prompt the user for a filename and open that file? Tough cookies -- Standard PASCAL has no way of letting you do that.

  - IF YOU PROMPT THE USER FOR NUMERIC INPUT, THERE'S NO WAY FOR YOUR PROGRAM TO CHECK IF HE TYPED CORRECT DATA. Say you ask him for a number and he types "FOO"; what happens? The program aborts! It doesn't return an error condition to let you print an error and recover gracefully -- it just aborts!

  - SIMILARLY, IF YOU TRY TO OPEN A FILE AND IT DOESN'T EXIST (or some such file system error occurs on any file system operation), YOU DON'T GET AN ERROR CODE BACK -- YOU GET ABORTED! What a loss!

  - YOU CAN'T DO "DIRECT ACCESS" -- READ OR WRITE A PARTICULAR RECORD GIVEN ITS RECORD NUMBER. Think about it -- how can you write any kind of disc-based data structure (like a KSAM- or IMAGE-like file) without some direct access facility? Even FORTRAN IV has it!
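For contrast with this lack of direct access, here is a sketch of record-at-a-time access in Draft Standard C using fseek; the record layout and routine are invented for illustration:

```c
#include <stdio.h>   /* tmpfile, fwrite, fseek, fread */

#define RECSIZE 16   /* hypothetical fixed record size, in bytes */

/* Write three records to a scratch file, then read back record #1
   (the second one) directly, without reading the records before it. */
int read_record_one (char *out)
{
    FILE *f = tmpfile ();
    if (f == NULL) return -1;
    fwrite ("record zero....\0", 1, RECSIZE, f);
    fwrite ("record one.....\0", 1, RECSIZE, f);
    fwrite ("record two.....\0", 1, RECSIZE, f);
    fseek (f, 1L * RECSIZE, SEEK_SET);   /* seek straight to record #1 */
    fread (out, 1, RECSIZE, f);
    fclose (f);
    return 0;
}
```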
* OTHER, LESS PAINFUL, BUT STILL SIGNIFICANT LIMITATIONS INCLUDE: (These are things which you can certainly live without, unlike some of the problems above, which can be extremely grave. However, although you can live without them, they are still desirable, and in Standard PASCAL -- partly because of its restrictive type checking -- you CAN'T emulate them with any degree of ease. Their lack, incidentally, is felt particularly in writing large system programming applications.)

  - STANDARD PASCAL DOESN'T ALLOW YOU TO DYNAMICALLY ALLOCATE A STRING OF A GIVEN SIZE (where the size is not known at compile-time). PASCAL talks much about its NEW and DISPOSE functions, which dynamically allocate space at run-time. These functions are certainly very useful, and are in fact essential to many systems programming applications. However, say you want to allocate an array of X elements, where X is not known at compile-time -- YOU CAN'T! You can allocate an array of, say, 1024 elements, forbidding X to be greater than 1024 and wasting space if X is less than 1024; you CAN'T simply say "give me X bytes (or words) of memory".

  - STANDARD PASCAL HAS NO REASONABLE MECHANISM FOR DIRECTLY EXITING OUT OF SEVERAL LAYERS OF PROCEDURE CALLS. Say that your lowest-level parsing routine detects a syntax error in the user's input and wants to return control directly to the command input loop (the larger and more complicated your application, the more common it is that you want to do something like this). "Un-structured" as this may seem, it can be quite essential (see the "CONTROL STRUCTURES -- LONG JUMPS" chapter), and Standard PASCAL provides only very shabby facilities for doing this.

  - STANDARD PASCAL HAS NO WAY OF HAVING VARIABLES THAT POINT TO FUNCTIONS AND PROCEDURES. Strange as this may seem, variables that point to procedures/functions can be VERY useful -- see the chapter on "PROCEDURE AND FUNCTION VARIABLES" for full details.
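In C, by contrast, a variable that points to a function is a one-liner; a sketch, with made-up function names:

```c
/* Two interchangeable "handlers" with the same signature. */
static int add (int a, int b) { return a + b; }
static int mul (int a, int b) { return a * b; }

/* A variable that points to a function -- the very thing Standard
   PASCAL permits only as a procedure PARAMETER, never as a variable. */
int apply (int which, int a, int b)
{
    int (*op) (int, int);            /* declare the function variable */
    op = (which == 0) ? add : mul;   /* assign it at run-time */
    return op (a, b);                /* call through the variable */
}
```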
Interestingly, even Standard PASCAL recognizes their utility by implementing PARAMETERS that point to procedures and functions, but it doesn't go all the way and let arbitrary VARIABLES do it.

If you respond that many PASCALs fix many of these drawbacks, I'll agree -- BUT WHAT HAPPENS TO PORTABILITY? If you use PASCAL/3000's string handling package (a pretty nice one, too), how are you going to port your program to, say, a PC implementation that has a different string handling package? What's more, some implementations -- like PASCAL/3000 itself -- don't solve many of the most important problems listed above!

THE TROUBLE WITH KERNIGHAN & RITCHIE C

* WHERE STANDARD PASCAL's PROCEDURE PARAMETER TYPE CHECKING IS TOO RESTRICTIVE, K&R C's IS NON-EXISTENT! If you write a procedure

     p (x1, x2, x3)
     int x1;
     char *x2;
     int *x3;

and call it by saying

     p (13.0, i+j, 77, "foo")

the compiler won't utter a peep. Not only won't it automatically convert 13.0 to an integer -- it won't print an error message about that, OR that "I+J" is not a character array, OR that 77 is not an integer pointer (which probably means that P expects X3 to be a by-reference parameter and you passed a by-value parameter), OR EVEN THAT YOU PASSED THE WRONG NUMBER OF PARAMETERS!

* WHILE NOT AS BAD AS PASCAL's, K&R C's I/O FACILITIES STILL HAVE SOME MAJOR LACKS. Most serious are:

  - NO DIRECT ACCESS (read record #X).
  - NO INPUT/OUTPUT ACCESS.

* K&R C, THOUGH FAIRLY STANDARD, HAS NO STANDARD STRING PACKAGE. Unlike in Standard PASCAL, though, it's fairly easy to write, since you CAN write a C procedure that takes, say, a string of arbitrary length.

* C IS UGLY AS SIN. At least that's what some PASCAL programmers say; C programmers obviously disagree. People complain that C is just plain UGLY and thus (subjectively) difficult to read; they talk about everything from the "{" and "}" that C uses instead of "BEGIN" and "END" to C's somewhat arcane operators, like "+=" and "--".
I'm not saying that this is either TRUE or FALSE; unfortunately, it's much too subjective to discuss in this paper. However, don't be surprised if you decide that, on all the merits, C is superior but you can't stand writing with all these funny special characters; or, that PASCAL is the best, but it's much too verbose for you!

IS ISO LEVEL 1 STANDARD PASCAL ANY BETTER THAN THE ANSI STANDARD?

* THE ONLY DIFFERENCE BETWEEN ISO LEVEL 1 STANDARD PASCAL AND THE ANSI STANDARD (what I call simply "Standard PASCAL") IS THAT IT ALLOWS YOU TO WRITE PROCEDURES THAT TAKE ARRAYS OF VARIABLE SIZES. This eliminates one of the worst problems in Standard PASCAL -- that you can't write a generic string handling package, or a matrix multiplication routine, etc.; however, all the other problems (lack of separate compilation, bad I/O, etc.) still remain.

* NOTE THAT IT'S NOT CLEAR HOW MANY NEW PASCAL COMPILERS WILL FOLLOW THE ISO LEVEL 1 STANDARD. The ISO Standard document makes it clear that implementing this feature is optional (without it, a compiler conforms only to the "ISO Level 0 Standard"); PASCAL/3000 doesn't implement it, but PASCAL/XL does.

IS PASCAL/3000 ANY BETTER THAN ANSI STANDARD PASCAL?

* PASCAL/3000 supports:

  - A PRETTY GOOD STRING PACKAGE.
  - IMPROVED, though still somewhat difficult, I/O.
  - THE ABILITY TO COMPILE A PROGRAM IN SEVERAL PIECES.
  - THE ABILITY TO WRITE A PROCEDURE OR FUNCTION THAT TAKES A STRING (BUT NOT ANY OTHER KIND OF ARRAY) OF VARIABLE SIZE.

* The remaining problems still include:

  - PARAMETER TYPE CHECKING STILL WAY TOO TIGHT. You still can't write a procedure that takes an integer array of arbitrary size or an arbitrary array/record (say, to do DBGETs or DBPUTs with); you still can't write, say, a matrix multiplication routine (just as an example).

  - I/O STILL HAS PROBLEMS:

    * IT'S STILL VERY DIFFICULT (not impossible, but still very painful) TO TRAP ERROR CONDITIONS, SUCH AS FILE SYSTEM ERRORS OR INCORRECT NUMERIC INPUT.
    * YOU CAN'T OPEN A FILE USING "FOPEN" AND THEN USE THE PASCAL I/O SYSTEM WITH IT. This means that any time you need to specify a feature that PASCAL's OPEN doesn't have (such as "open temporary file", "build a new file with given parameters", "open a file on the line printer", etc.), you can't just call FOPEN and then use PASCAL's I/O facilities. You either have to do all FOPEN/FWRITE/FCLOSEs, or you have to issue a :FILE equation, which is difficult and still doesn't give you all the features you want.

  - THE "LESS IMPORTANT BUT STILL SUBSTANTIAL" LIMITATIONS STILL EXIST -- it's hard to allocate variable-size strings, you can't immediately exit several levels of nesting, and you can't have variables that point to functions or procedures.

* REMEMBER -- YOU CAN'T COUNT ON "A PARTICULAR IMPLEMENTATION" TO SAVE YOU HERE! If you could live with Standard PASCAL's restrictions by knowing that, say, string handling or a good I/O facility would surely be implemented by any particular implementation, remember: PASCAL/3000 is a particular implementation! If you run into a restriction with PASCAL/3000, that's it; you either have to work around it or use a different language.

IS PASCAL/XL ANY BETTER THAN THE ANSI STANDARD?

Surprisingly, yes. ALL OF THE MAJOR PROBLEMS I POINTED OUT IN STANDARD PASCAL SEEM TO HAVE BEEN FIXED IN PASCAL/XL! The only words of caution are:

* IT MAY BE GREAT, BUT IT'S NOT PORTABLE -- NOT EVEN TO PRE-SPECTRUMS! HP still hasn't announced when (if ever) it'll implement all of PASCAL/XL's great features on the pre-Spectrum machines. As long as it doesn't, you'll have to either avoid using all of PASCAL/XL's wonderful improvements, or be stuck with code that won't run on pre-Spectrum 3000's!

* BE SKEPTICAL. "New implementations" always look great, precisely because we haven't had the chance to really use them.
For all we know, the compiler may be riddled with bugs, or it might be excruciatingly slow in compiling your program, or it might generate awfully slow code! Even more likely, there may be serious design flaws that make programming difficult -- it's just that we won't notice them until we've programmed in it for several months! As I said, BE SKEPTICAL.

IS DRAFT ANSI STANDARD C BETTER THAN KERNIGHAN & RITCHIE C?

Again, it seems it might be! It's standardized the I/O and string handling facilities (and they're pretty good ones at that), AND it's implemented some nice-looking parameter checking. Still, beware:

* BEING A "DRAFT STANDARD", IT MIGHT BE YEARS (OR DECADES) BEFORE ALL OR MOST C COMPILERS HAVE ALL OF ITS FEATURES. Note, however, that most modern C compilers already include some of the Draft Standard's new features, except for the strengthened parameter checking, which is still relatively rare.

* IF YOU THOUGHT KERNIGHAN & RITCHIE C WAS UGLY, YOU'LL STILL THINK THIS ABOUT DRAFT STANDARD C. I don't want to imply that K&R C IS ugly -- it's just that many old SPL, PASCAL, and ALGOL programmers think so. It may not be objectively demonstrable, or even objectively discussible; however, that's the reaction I've seen in some (more than a few!) people. All I can say is this -- if you suffer from it, the Draft Standard still won't help you.

NICE FEATURES THAT SOME LANGUAGES DON'T HAVE AND OTHERS DO

The "PROBLEMS WITH" sections discussed things that could make programming in SPL, PASCAL, or C a miserable experience. They emphasized some things that were show-stoppers and others that simply frayed the nerves; one thing they conspicuously EXCLUDED was the good features that you could live without, but would rather have. The following is a summary of all these, plus some of the things we've already mentioned above.
[Legend: "STD PAS" = Standard PASCAL or ISO Level 1 Standard; "STD C" = Draft Proposed ANSI Standard; "YES" = good implementation of this feature; "YES+" = excellent or particularly nice implementation; "YES-" = OK, so they've got it, but it's rather ugly; "NO" = no; "HNO" = Hell, no!; "---" = Major loss! No support of REALLY IMPORTANT feature]

                                     STD   PAS/  PAS/  K&R   STD
                                     PAS   3000  XL    C     C     SPL

RECORD STRUCTURES                    YES   YES   YES   YES   YES   NO

STRINGS                              ---   YES+  YES+  YES-  YES+  YES

ENUMERATED DATA TYPES                YES   YES   YES   NO    YES-  NO
  (see "DATA STRUCTURES")

SUBRANGE TYPES                       YES   YES   YES   NO    NO    NO
  (see "DATA STRUCTURES"; may not be all that useful)

OPTIONAL PARAMETER/VARIABLE NUMBER   NO    NO    YES+  YES-  YES   YES
OF PARAMETERS SUPPORT
  (like SPL "OPTION VARIABLE")

NUMERIC FORMATTING/INPUT             YES-  YES-  YES   YES+  YES+  YES-

FILE I/O                             YES-  YES   YES+  YES-  YES   YES
  (see "FILE I/O" chapter for more)

BIT ACCESS                           NO    YES   YES   YES   YES   YES+
  (see "OPERATORS")

POINTER SUPPORT                      NO    NO    YES   YES   YES   YES

THE ABILITY TO WRITE PROCEDURE-LIKE  NO    NO    YES   YES   YES   NO
CONSTRUCTS THAT ARE COMPILED
"IN-LINE", FOR MAXIMUM EFFICIENCY
PLUS MAXIMUM MAINTAINABILITY

LOW-LEVEL ACCESS                     HNO   HNO   HNO   NO    NO    YES
  (ASSEMBLEs, TOS, registers -- often useless, sometimes vital!)

REALLY NICE FEATURES TO PAY ATTENTION TO

Just some interesting things, mostly implemented in only one of the three languages. I just wanted to draw your attention to them, because they can be quite nice:

* PASCAL/XL'S TRY/RECOVER CONSTRUCT. A really nifty contraption -- see the "CONTROL STRUCTURES" chapter for more info.

* C's "FOR" LOOP. You might think it's ugly, but it's quite a bit more powerful -- in some very useful ways -- than SPL's or PASCAL's looping constructs.

* C's "#define" MACRO FACILITY. I wish that PASCAL and SPL had it too; it lets you do procedure-like things without the overhead of a procedure call AND without the maintainability problems of writing the code in-line.
ALSO, it lets you add interesting new constructs to the language (like defining your own looping constructs, etc.).

* SPL's LOW-LEVEL SYSTEM ACCESS. Although you'd rather not have to worry about registers, TOSs, ASSEMBLEs, etc., sometimes you need to be able to manipulate them -- SPL lets you do it.
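To illustrate the kind of thing the #define facility mentioned above makes possible -- a procedure-like macro and a home-made looping construct, both invented here for illustration:

```c
/* A procedure-like macro: compiled in-line, with no call overhead,
   yet maintained in one place like a procedure. Note the parentheses
   -- without them, SQUARE(i+1) would expand incorrectly. */
#define SQUARE(x) ((x) * (x))

/* A new looping construct: count i from 0 up to (but not including) n. */
#define COUNT_UP(i, n) for ((i) = 0; (i) < (n); (i)++)

int sum_of_squares (int n)
{
    int i, total = 0;
    COUNT_UP (i, n)
        total += SQUARE (i);
    return total;
}
```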