WHAT TO DO WHEN YOUR PROGRAM ABORTS

by Eugene Volokh, VESOFT
1135 S. Beverly Dr.
Los Angeles, CA 90035 USA
(213) 282-0420


INTRODUCTION

Your program aborts.

   ABORT :FOOBAR.TEST.PROD.%1.%7331
   PROGRAM ERROR #24 :BOUNDS VIOLATION

What happened? Where did the abort take place? Your program is three thousand lines long -- which one of them has the bug?

You might put in DISPLAY statements to figure out exactly where the abort happened, but that could take a whole lot of DISPLAYs, and requires at least one recompilation. (Sometimes, though not often, the very act of putting in a debugging display might make the problem go away -- it will then reappear as soon as you take the debugging display out!)

How do you figure out where your program is aborting WITHOUT putting in debugging displays, WITHOUT recompiling, and WITHOUT having to learn DEBUG, breakpoints, Q registers, indirect addressing, and all that nonsense?


WHAT THIS PAPER WILL TELL YOU

Debugging a program usually takes up to an order of magnitude more time than actually writing it. As any programmer can tell you, it is often a harrowingly frustrating experience.

You'd like to have lots of tools (such as symbolic debuggers or program analyzers) that will help you debug your programs, but unfortunately few such tools are available on the HP 3000. In particular, standard DEBUG does NOT allow you to look at and modify variables symbolically (by name), easily put breakpoints at given procedures or statements, do symbolic stack tracebacks, or anything like that.

This paper will NOT tell you how to use DEBUG to watch what your program is doing or examine the state of your variables or anything like that.
Experience has shown that DEBUG is often too complicated even for highly skilled programmers -- not because it somehow requires a lot of smarts to understand, but just because it is so cumbersome (to look at your variables, you have to know their machine addresses; to set a breakpoint at a given statement, you have to know both the starting address of a procedure and the code address of the statement inside the procedure).

What this paper will explain is how to solve a specific problem:

   * WHEN A PROGRAM ABORTS (with a bounds violation, stack overflow, integer overflow, QUIT, library procedure error, or whatever), HOW CAN ONE FIGURE OUT WHERE -- in what procedure, and perhaps at what statement -- THE ABORT OCCURRED?

This turns out to be a not so very complicated problem after all, and once you master a few key concepts, you'll be able to unerringly pinpoint the locations of your aborts with very little difficulty.


THE FPMAP CAN BE YOUR FRIEND

Let's look at that abort message again:

   ABORT :FOOBAR.TEST.PROD.%1.%7331
   PROGRAM ERROR #24 :BOUNDS VIOLATION

It includes, of course, the name of the program and the type of error that occurred (BOUNDS VIOLATION). It also includes two numbers -- %1 and %7331 -- which are the keys to solving our problem.

The two numbers are simply the segment (%1) and location (%7331) at which the error happened. They are, however, quite unlikely to be very informative to you. What you really want is the NAME of the procedure in which this location (%1.%7331) resides (actually, what you'd really like is the actual source code line number, but that's much harder to get).

Once upon a time (about T MIT, if my memory serves me right), a certain new keyword was added to the MPE :PREP command. This keyword is called ;FPMAP, and what it really means is that the program file created by the :PREP command will have inside it -- stored in a specially-formatted table -- the NAMES and STARTING ADDRESSES of all the procedures in the file.
Therefore, if your procedure MUNGARRAY is stored in locations %7105 through %7472 of segment %1 in your FOOBAR program, the program file will contain this information.

The purpose of this ;FPMAP parameter was, in fact, ease of debugging -- given a segment number and segment-relative address of an instruction in a program file, you could now get the name of the procedure which contains this instruction (assuming the program file was :PREPped with ;FPMAP).

Of course, it would make sense for the MPE code that prints the BOUNDS VIOLATION abort message to look at the FPMAP and automatically decode the segment and location into something sensible; but, unfortunately, this was not to be. However, if you want to, you can do this decoding yourself.

Remember, the first thing we must do is :PREP the program with ;FPMAP:

   :PREP FOOUSL,FOOBAR;FPMAP;RL=MYRL;MAXDATA=27000

Now, we run the program and get the abort message, which tells us that the abort occurred at segment %1, location %7331. If we look at our MPE System Intrinsics Manual (provided, of course, that we have a recent enough edition), we find a procedure called FINDPMAPADDR:

                 IV    IV   IV   IA      IV    IR
   FINDPMAPADDR (FNUM, SEG, LOC, RECORD, SIZE, STATUS);

   FNUM   = the file number of the program file (which must have been :PREPped with ;FPMAP and FOPENed with MR NOBUF)
   SEG    = the segment number of the instruction to find
   LOC    = the instruction's segment-relative location
   RECORD = the specially-formatted FPMAP record that FINDPMAPADDR returns -- it describes the segment and procedure that contain the location given by SEG and LOC
   SIZE   = the amount of room you've allocated in your stack for RECORD; should be 36 words
   STATUS = the value returned by FINDPMAPADDR that indicates how things went; 0 if all OK, non-zero in case some kind of error occurred

Thus, if you FOPEN your program file MR NOBUF and then call FINDPMAPADDR, you'll be able to translate the segment and location into the appropriate segment name and procedure name.
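To make the idea behind the FPMAP concrete, here is a minimal Python sketch of the kind of lookup that a table of procedure names and starting addresses makes possible. It is not the FINDPMAPADDR intrinsic and does not read a real program file; the table layout, the function name find_procedure, and every entry except MUNGARRAY's starting address (%7105 in segment %1) are assumptions made up for illustration.

```python
from bisect import bisect_right

# Hypothetical procedure map: segment number -> sorted (start, name) pairs.
# Only MUNGARRAY's start (%7105 in segment %1) comes from the text; the
# other two entries are invented so the table has neighbours to search.
PMAP = {
    0o1: [(0o0000, "FIRSTPROC"), (0o7105, "MUNGARRAY"), (0o7473, "NEXTPROC")],
}

def find_procedure(pmap, seg, loc):
    """Return (name, offset) of the procedure containing seg.loc, or None."""
    procs = pmap.get(seg)
    if not procs:
        return None
    starts = [start for start, _ in procs]
    # Index of the last procedure whose start address is <= loc.
    i = bisect_right(starts, loc) - 1
    if i < 0:
        return None
    start, name = procs[i]
    return name, loc - start

name, offset = find_procedure(PMAP, 0o1, 0o7331)
print(name, "%" + format(offset, "o"))   # prints: MUNGARRAY %224
```

The offset is simply the abort location minus the procedure's starting address: %7331 - %7105 = %224.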
You can easily write a program that uses FINDPMAPADDR to decode a segment and location; I already have -- MPEX has a command called PROGINST:

   %PROGINST FOOBAR, %1.%7331
   Segment UTILSEG, procedure MUNGARRAY, offset from start of code %224

(this command may, of course, be executed either from MPEX directly or from EDITOR, QUERY, etc. using the MPEX hook)

Now we know that the bounds violation occurred in the MUNGARRAY procedure. Of course, we'd really like to know the location in more detail (for instance, down to the line number in the procedure), but this should give us a pretty good idea of what's happening.


TRACING BACK THE ENTIRE CALLING SEQUENCE

So far, we've taken the segment and location number that are printed in the abort message and decoded them into a segment name and a procedure name. Let's say, though, that MUNGARRAY is a very commonly used utility procedure. It might have been called from any one of a dozen other procedures, which in turn might have been called from one of a number of places.

What we'd really like to do is to determine not just the name of the procedure in which the abort occurred, but also the procedure that called it, the procedure that called the procedure that called it, and so on, all the way up to the main body of the program.

Now, although the abort message normally only displays the segment and location at which the abort actually took place, there is a way to make aborts trace back the entire calling sequence of the program. This is done using the little-known :SETDUMP command:

   :SETDUMP

That's all there is to it -- no options, no parameters. (There are some parameters that you can specify if you'd like; however, they display information that can only be interpreted using the variable map and DEBUG.)
Now watch what happens when we run FOOBAR and get it to abort:

   :SETDUMP
   :RUN FOOBAR.TEST.PROD

   ABORT :FOOBAR.TEST.PROD.%1.%7331
   PROGRAM ERROR #24 :BOUNDS VIOLATION

   *** ABORT STACK ANALYSIS ***

   S=005522  DL=176650  Z=012262
   Q=005526  P=007331  LCST=  001  STAT=U,1,1,R,0,0,CCE  X=176652
   Q=005515  P=004672  LCST=  003  STAT=U,1,1,L,0,0,CCG  X=176652
   Q=005012  P=004366  LCST=  000  STAT=U,1,1,L,0,1,CCG  X=176652
   Q=004010  P=002640  LCST=  001  STAT=U,1,1,L,0,1,CCG  X=176706
   Q=003517  P=001670  LCST=  001  STAT=U,1,1,L,0,0,CCL  X=176662
   Q=003265  P=000307  LCST=  002  STAT=U,1,1,L,0,0,CCL  X=000001
   Q=003222  P=000004  LCST=  002  STAT=U,1,1,L,0,0,CCG  X=000000
   Q=003210  P=177777  LCST= S024  STAT=P,1,0,L,0,0,CCG  X=000000

   *DEBUG* 1.7332
   ?E@

   PROGRAM TERMINATED IN AN ERROR STATE. (CIERR 976)

Immediately after the abort message, MPE prints an "ABORT STACK ANALYSIS", which indicates (in its own rather cryptic fashion) the state of all the "ancestors" of the currently-executing procedure.

Each of the lines that starts with "Q=" represents one calling procedure. The important numbers in each line are the LCST value (which stands for Logical Code Segment Table index), which is really a segment number, and the P value, which is a segment-relative address (both values are in octal).

As you see, the first line is LCST=001 and P=007331, which is of course our %1.%7331. The next line -- which is the procedure that called the procedure that caused the abort -- is LCST=003, P=004672 (%3.%4672); the line after that (the procedure that called the %3.%4672 procedure) is %0.%4366, and so on.

The bottom-most line, you'll find, has LCST=S024, which means "System SL segment %24". This is the address of the system procedure that first called your program's outer block and thus began the execution of your program.
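If you capture the dump in a file, the LCST and P values described above can be pulled out mechanically. The following Python sketch shows one way to do that; it is only an illustration, not part of MPE or MPEX. The function name parse_traceback and the regular expression are assumptions, and the sample text is a three-line excerpt of the dump shown above.

```python
import re

# Excerpt of the ABORT STACK ANALYSIS quoted in the text.
DUMP = """\
S=005522 DL=176650 Z=012262
Q=005526 P=007331 LCST= 001 STAT=U,1,1,R,0,0,CCE X=176652
Q=005515 P=004672 LCST= 003 STAT=U,1,1,L,0,0,CCG X=176652
Q=003210 P=177777 LCST= S024 STAT=P,1,0,L,0,0,CCG X=000000
"""

# One stack marker: Q value, P value, then LCST with an optional "S"
# prefix for System SL segments. All numbers are octal.
FRAME = re.compile(r"Q=[0-7]+\s+P=([0-7]+)\s+LCST=\s*(S?)([0-7]+)")

def parse_traceback(text):
    """Return a list of (is_system_segment, segment, location) tuples."""
    frames = []
    for match in FRAME.finditer(text):
        p, system, lcst = match.groups()
        frames.append((bool(system), int(lcst, 8), int(p, 8)))
    return frames

for is_system, seg, loc in parse_traceback(DUMP):
    prefix = "S" if is_system else ""
    print(f"%{prefix}{seg:o}.%{loc:o}")
```

For the excerpt above this prints %1.%7331, %3.%4672 and %S24.%177777, one per line. The first two are in the form that PROGINST expects; the last is the System SL frame, which does not belong to your program file.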
Now that we have the raw data, we can use MPEX's %PROGINST (which in turn uses the FINDPMAPADDR intrinsic) to interpret it:

   %PROGINST FOOBAR.TEST.PROD, %1.%7331
   Segment UTILSEG, procedure MUNGARRAY, offset from start of code %224
   %PROGINST FOOBAR.TEST.PROD, %3.%4672
   Segment PROCESSDATA, procedure SEARCHFILE, offset from start of code %32
   %PROGINST FOOBAR.TEST.PROD, %1.%2640
   Segment UTILSEG, procedure NEXTPARM, offset from start of code %251
   %PROGINST FOOBAR.TEST.PROD, %1.%1670
   Segment UTILSEG, procedure PARSECOM, offset from start of code %1023
   %PROGINST FOOBAR.TEST.PROD, %2.%307
   Segment MAINSEG, procedure MAINLOOP, offset from start of code %102
   %PROGINST FOOBAR.TEST.PROD, %2.%4
   Segment MAINSEG, procedure OB', offset from start of code %4

As you see, :SETDUMP is a rather powerful tool for this sort of thing. I always do a :SETDUMP in my OPTION LOGON UDC -- when everything goes well, the :SETDUMP doesn't interfere, but when a program aborts I automatically get the ABORT STACK ANALYSIS, which I want.

Watch out for one thing, though: after the procedure call traceback is printed, you may be (if you have the right capabilities) dropped into DEBUG. This is useful if you know DEBUG and want to use it (perhaps to display the parameters of one of the called procedures); if you don't want to do anything in DEBUG, just type

   ?E@

and the program will terminate as usual. Do NOT just type "?E", since that will resume executing the program (despite the abort), and will most probably result in another abort shortly afterwards.


FINDING OUT AT WHICH STATEMENT IN THE PROCEDURE AN ABORT OCCURRED

So far, we've managed to figure out the name of the procedure in which the abort occurred and the names of all its ancestors (which might actually be more useful for finding out exactly where the bug is). However, a procedure can be many lines long -- how can we find out exactly where in the procedure the abort happened?

The answer is: "with great difficulty".
Finding out the procedure names was simple since MPE keeps track of which procedure starts where, and even makes it easy for us to get this information (using the FINDPMAPADDR procedure). Unfortunately, MPE does NOT keep track of which location each statement starts at.

If you're willing to keep all your source code listings (either on paper or online), you can refer to them to find out exactly where each statement in a procedure starts. SPL outputs this information as a matter of course:

    3     00000 1    INTEGER PROCEDURE MIN (I, J);
    4     00000 1    VALUE I, J;
    5     00000 1    INTEGER I, J;
    6     00000 1    BEGIN
    7     00000 2    IF I<J THEN
    8     00003 2    MIN:=I
    9     00003 2    ELSE
   10     00006 2    MIN:=J;
   11     00010 2    END;

The number in the second column is the starting address of this line's code, relative to the start of the procedure. Thus, the code for line 7 (IF I<J THEN) starts at location 0 in the procedure; the code for line 8 (MIN:=I) starts at location 3, the code for line 10 (MIN:=J) starts at location 6, and the code for the END statement (procedure return) starts at location 10 (all numbers are in octal). If our abort message indicates that, say, a bounds violation occurred at location 5 in procedure MIN (not very likely considering the code involved), we'd know that the error is in the statement "MIN:=I".

Similar outputs are provided by FORTRAN's $CONTROL LOCATION control card, COBOL's $CONTROL VERBS, and PASCAL's $CODE_OFFSETS ON$.

Well, this is all well and good -- IF you keep all your source code listings! Remember, you need to either print out the entire listing every time you recompile the program (a listing that's even a little bit out of date is likely to have incompatible statement starting addresses) or keep the entire listing online for every program.
You may consider one of the above alternatives worthwhile (keeping the entire listing online seems preferable); if you do, then all you need to do to exactly locate an abort location is:

   * Use PROGINST to convert the segment number and location into a procedure name and a procedure-relative address.

   * Find the procedure in your source listing.

   * Find the line in your source listing whose code offset is LESS than or equal to the procedure-relative address returned by PROGINST and whose following line has a code offset GREATER than that address.

   * The instruction at which the abort occurred is contained in the line you've just found.

I personally do not keep all my source listings, either online or on paper. One reason is disc space; another is that all my procedures are quite small in any case, a programming style that I find has many advantages. When the location of the abort in a procedure is not immediately obvious, I use the following trick:

   * I use a decompiler to disassemble the code around the abort location. Although this requires some knowledge of HP assembly code, it's not as hard as it seems -- often, there are some procedure calls (for which a good disassembler will display the names of the procedures being called) right near the abort location; by referring back to the source code, we can find the lines on which these procedures are being called and thus, by inference, find the approximate location of the abort. Similarly, if I see an ADD 15 instruction right next to the abort location and there's an "I+15" in my procedure source code, I can assume that the source line in which the abort occurred is right next to the "I+15".
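For readers who do keep their listings online, the listing lookup described in the steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an existing tool: the function name candidate_lines is made up, and the (line number, code offset) pairs are typed in by hand from the SPL listing of MIN rather than parsed from a real listing file. Because several listing lines can share one code offset, the sketch returns every line at the matching offset instead of guessing which one holds the statement.

```python
# (line number, octal code offset) pairs copied from the SPL listing of MIN.
LISTING = [
    (3, 0o0), (4, 0o0), (5, 0o0), (6, 0o0), (7, 0o0),
    (8, 0o3), (9, 0o3), (10, 0o6), (11, 0o10),
]

def candidate_lines(listing, addr):
    """Return the listing lines whose code offset is the largest one <= addr."""
    offsets = [off for _, off in listing if off <= addr]
    if not offsets:
        return []
    best = max(offsets)
    # Every line that starts at that offset is a candidate.
    return [line for line, off in listing if off == best]

print(candidate_lines(LISTING, 0o5))   # prints: [8, 9]
```

For location 5 in MIN the candidates are lines 8 and 9 (MIN:=I and ELSE), which agrees with the conclusion that the abort is in the statement "MIN:=I".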