WHAT TO DO WHEN YOUR PROGRAM ABORTS
                       by Eugene Volokh, VESOFT
                         1135 S. Beverly Dr.
                      Los Angeles, CA 90035 USA
                            (213) 282-0420



INTRODUCTION

   Your program aborts.

   ABORT :FOOBAR.TEST.PROD.%1.%7331
   PROGRAM ERROR #24 :BOUNDS VIOLATION

What  happened? Where did the abort  take place? Your program is three
thousand lines long -- which one of them has the bug? You might put in
DISPLAY statements to figure out exactly where the abort happened, but
that  could  take  a whole lot of DISPLAYs,  and requires at least one
recompilation.  (Sometimes, though not often,  the very act of putting
in  a debugging display might make the problem go away -- it will then
reappear as soon as you take the debugging display out!)

   How  do  you  figure  out  where  your program  is aborting WITHOUT
putting in debugging displays, WITHOUT recompiling, and WITHOUT having
to learn DEBUG, breakpoints, Q registers, indirect addressing, and all
that nonsense?


WHAT THIS PAPER WILL TELL YOU

   Debugging  a program usually takes up to an order of magnitude more
time  than actually writing it. As any  programmer can tell you, it is
often a harrowingly frustrating experience. You'd like to have lots of
tools (such as symbolic debuggers or program analyzers) that will help
you  debug  your  programs,  but  unfortunately  few  such  tools  are
available on the HP 3000. In particular, standard DEBUG does NOT allow
you to look at and modify variables symbolically (by name), easily put
breakpoints  at  given  procedures  or  statements, do  symbolic stack
tracebacks, or anything like that.

   This  paper  will NOT tell you how to  use DEBUG to watch what your
program  is  doing or examine the state  of your variables or anything
like  that.  Experience has shown that  DEBUG is often too complicated
even for highly skilled programmers -- not because it somehow requires
a  lot  of smarts to understand, but  just because it is so cumbersome
(to  look at your variables, you have to know their machine addresses;
to  set  a breakpoint at a given statement,  you have to know both the
starting  address of a procedure and the code address of the statement
inside the procedure).

   What this paper will explain is how to solve a specific problem:

   * WHEN A PROGRAM ABORTS  (with  a bounds violation, stack overflow,
     integer  overflow,  QUIT, library procedure  error, or whatever),
     HOW CAN ONE FIGURE OUT WHERE -- in what procedure, and perhaps at
     what statement -- THE ABORT OCCURRED?

This turns out not to be a very complicated problem after all, and
once  you  master  a  few  key concepts, you'll  be able to unerringly
pinpoint the locations of your aborts with very little difficulty.


THE FPMAP CAN BE YOUR FRIEND

   Let's look at that abort message again:

   ABORT :FOOBAR.TEST.PROD.%1.%7331
   PROGRAM ERROR #24 :BOUNDS VIOLATION

It  includes, of course, the name of the program and the type of error
that  occurred (BOUNDS VIOLATION). It also  includes two numbers -- %1
and %7331 -- which are the keys to solving our problem.

   The two numbers are simply the segment (%1) and location (%7331) at
which the error happened. They are, however, quite unlikely to be very
informative  to you. What you really want is the NAME of the procedure
in which this location (%1.%7331) resides (actually, what you'd really
like  is the actual source code line number, but that's much harder to
get).
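
   As a rough illustration of how the two numbers are read, here is a
minimal Python sketch that splits the first line of an abort message
into the program name, the segment, and the location. The function
name parse_abort and the regular expression are assumptions made for
this example; they are not part of MPE or MPEX.

```python
import re

# Hypothetical helper, not an MPE facility: matches the first line of an
# abort message such as "ABORT :FOOBAR.TEST.PROD.%1.%7331".
ABORT_RE = re.compile(
    r"ABORT\s*:(?P<program>\S+)\.%(?P<seg>[0-7]+)\.%(?P<loc>[0-7]+)\s*$"
)

def parse_abort(line):
    """Return (program name, segment, location); both numbers are octal."""
    m = ABORT_RE.search(line)
    if m is None:
        raise ValueError("not an abort message: %r" % line)
    return m.group("program"), int(m.group("seg"), 8), int(m.group("loc"), 8)

program, seg, loc = parse_abort("ABORT :FOOBAR.TEST.PROD.%1.%7331")
# Print the address back in the %segment.%location form used in this paper.
print(program, "%%%o.%%%o" % (seg, loc))
```

The "%" prefix marks an octal number, so the sketch converts both
fields with base 8 before doing anything else with them.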

   Once  upon  a  time (about T MIT, if  my memory serves me right), a
certain  new keyword was added to  the MPE :PREP command. This keyword
is  called  ;FPMAP, and what it really  means is that the program file
created  by  the  :PREP  command  will  have inside it  -- stored in a
specially-formatted  table -- the NAMES  and STARTING ADDRESSES of all
the  procedures in the file. Therefore, if your procedure MUNGARRAY is
stored  in  location %7105 through %7472 of  segment %1 in your FOOBAR
program, the program file will contain this information.

   The  purpose  of  this  ;FPMAP  parameter  was,  in  fact,  ease of
debugging -- given a segment number and segment-relative address of an
instruction  in  a  program  file,  you could now get  the name of the
procedure  which contains this instruction  (assuming the program file
was  :PREPped with ;FPMAP). Of course, it would make sense for the MPE
code  that  prints  the BOUNDS VIOLATION abort  message to look at the
FPMAP and automatically decode the segment and location into something
sensible; but, unfortunately, this was not to be. However, if you want
to, you can do this decoding yourself.
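
   The decoding itself amounts to a search in a table of procedure
starting addresses. The Python sketch below shows the idea only; it
does not read a real program file, and it does not reproduce the
actual FPMAP record layout. The MUNGARRAY entry uses the addresses
quoted above; the PROCA and PROCB entries and the name find_procedure
are hypothetical, added just so the table has neighbors to search
among.

```python
import bisect

# (segment, starting address, procedure name), sorted by segment and address.
# Only MUNGARRAY (segment %1, %7105 through %7472) comes from the text;
# PROCA and PROCB are made-up neighbors for illustration.
PROC_TABLE = [
    (1, 0o0000, "PROCA"),
    (1, 0o7105, "MUNGARRAY"),
    (1, 0o7473, "PROCB"),
]

def find_procedure(table, seg, loc):
    """Return (procedure name, offset from procedure start) for seg.loc."""
    keys = [(s, start) for s, start, _ in table]
    # Last entry whose (segment, start) is <= (seg, loc).
    i = bisect.bisect_right(keys, (seg, loc)) - 1
    if i < 0 or table[i][0] != seg:
        raise LookupError("no procedure found for %%%o.%%%o" % (seg, loc))
    _, start, name = table[i]
    return name, loc - start

name, offset = find_procedure(PROC_TABLE, 1, 0o7331)
print("procedure %s, offset %%%o" % (name, offset))
```

With MUNGARRAY starting at %7105, location %7331 lies %224 words into
the procedure (%7331 minus %7105, in octal).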


   Remember,  the  first  thing  we must do is  :PREP the program with
;FPMAP:

   :PREP FOOUSL,FOOBAR;FPMAP;RL=MYRL;MAXDATA=27000

Now, we run the program and get the abort message, which tells us that
the abort occurred at segment %1, location %7331.

   If  we  look  at  our  MPE  System Intrinsics  Manual (provided, of
course,  that  we  have a recent enough  edition), we find a procedure
called FINDPMAPADDR:


                 IV    IV   IV   IA      IV    IR
   FINDPMAPADDR (FNUM, SEG, LOC, RECORD, SIZE, STATUS);

     FNUM    = the  file number  of the  program file (which must have
               been :PREPped with ;FPMAP and FOPENed with MR NOBUF)

     SEG     = the segment number of the instruction to find

     LOC     = the instruction's segment-relative location

     RECORD  = the specially-formatted  FPMAP record that FINDPMAPADDR
               returns  -- it describes the segment and procedure that
               contain the location given by SEG and LOC

     SIZE    = the amount  of room you've allocated  in your stack for
               RECORD; should be 36 words

     STATUS  = the value returned by  FINDPMAPADDR  that indicates how
               things went; 0 if all OK, non-zero in case some kind of
               error occurred


Thus,  if  you  FOPEN  your  program  file  MR  NOBUF  and  then  call
FINDPMAPADDR,  you'll  be  able to translate  the segment and location
into the appropriate segment name and procedure name.

   You  can easily write a program  that uses FINDPMAPADDR to decode a
segment  and location;  I already have  --  MPEX has a  command called
PROGINST:


   %PROGINST FOOBAR, %1.%7331
     (this command may, of course, be executed either from MPEX
      directly or from EDITOR, QUERY, etc. using the MPEX hook)
   Segment UTILSEG, procedure MUNGARRAY, offset from start of code %224


   Now  we  know  that the bounds violation  occurred in the MUNGARRAY
procedure.  Of  course, we'd really like to  know the location in more
detail  (for instance, down to the  line number in the procedure), but
this should give us a pretty good idea of what's happening.


TRACING BACK THE ENTIRE CALLING SEQUENCE

   So  far,  we've  taken  the  segment  and location  number that are
printed  in the abort message and decoded them into a segment name and
a procedure name.

   Let's  say, though, that MUNGARRAY is  a very commonly used utility
procedure.  It  might  have been called from any  one of a dozen other
procedures,  which in turn might have been called from one of a number
of  places.  What we'd really like to do  is to determine not just the
name  of  the  procedure  in  which  the abort occurred,  but also the
procedure that called it, the procedure that called the procedure that
called it, and so on, all the way up to the main body of the program.

   Now,  although the abort message normally only displays the segment
and location at which the abort actually took place, there is a way to
make aborts trace back the entire calling sequence of the program.
This is done using the little-known :SETDUMP command:

   :SETDUMP

That's  all  there  is to it -- no  options, no parameters. (There are
some  parameters  that  you  can specify if  you'd like; however, they
display  information  that can only be  interpreted using the variable
map  and DEBUG.) Now watch what happens  when we run FOOBAR and get it
to abort:

   :SETDUMP
   :RUN FOOBAR.TEST.PROD
   ABORT :FOOBAR.TEST.PROD.%1.%7331
   PROGRAM ERROR #24 :BOUNDS VIOLATION
   *** ABORT STACK ANALYSIS ***
   S=005522    DL=176650    Z=012262
   Q=005526 P=007331  LCST=  001  STAT=U,1,1,R,0,0,CCE   X=176652
   Q=005515 P=004672  LCST=  003  STAT=U,1,1,L,0,0,CCG   X=176652
   Q=005012 P=004366  LCST=  000  STAT=U,1,1,L,0,1,CCG   X=176652
   Q=004010 P=002640  LCST=  001  STAT=U,1,1,L,0,1,CCG   X=176706
   Q=003517 P=001670  LCST=  001  STAT=U,1,1,L,0,0,CCL   X=176662
   Q=003265 P=000307  LCST=  002  STAT=U,1,1,L,0,0,CCL   X=000001
   Q=003222 P=000004  LCST=  002  STAT=U,1,1,L,0,0,CCG   X=000000
   Q=003210 P=177777  LCST= S024  STAT=P,1,0,L,0,0,CCG   X=000000
   *DEBUG* 1.7332
   ?E@
   PROGRAM TERMINATED IN AN ERROR STATE.  (CIERR 976)

Immediately  after  the  abort  message,  MPE  prints an  "ABORT STACK
ANALYSIS",  which  indicates  (in its own  rather cryptic fashion) the
state  of  all  the "ancestors" of  the currently-executing procedure.
Each  of  the  lines  that  start  with  "Q="  represents  one calling
procedure.

   The important numbers in each line are the LCST value (which stands
for  Logical  Code  Segment  Table  index), which is  really a segment
number,  and  the  P value, which is  a segment-relative address (both
values  are  in  octal).  As  you see, the first  line is LCST=001 and
P=007331,  which is of course our %1.%7331.  The next line -- which is
the  procedure  that called the procedure that  caused the abort -- is
LCST=003, P=004672 (%3.%4672); the line after that (the procedure that
called the %3.%4672 procedure) is %0.%4366, and so on. The bottom-most
line, you'll find, has LCST=S024, which means "System SL segment %24".
This  is  the  address of the system  procedure that first called your
program's outer block and thus began the execution of your program.
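
   Picking the LCST and P values out of the dump by eye gets tedious,
so here is a minimal Python sketch that extracts them. It assumes the
"Q=" lines have exactly the layout shown above; the regular expression
and the name parse_frames are inventions for this example, not part of
MPE or MPEX.

```python
import re

# The eight "Q=" lines of the stack analysis shown above, copied verbatim.
DUMP = """\
Q=005526 P=007331  LCST=  001  STAT=U,1,1,R,0,0,CCE   X=176652
Q=005515 P=004672  LCST=  003  STAT=U,1,1,L,0,0,CCG   X=176652
Q=005012 P=004366  LCST=  000  STAT=U,1,1,L,0,1,CCG   X=176652
Q=004010 P=002640  LCST=  001  STAT=U,1,1,L,0,1,CCG   X=176706
Q=003517 P=001670  LCST=  001  STAT=U,1,1,L,0,0,CCL   X=176662
Q=003265 P=000307  LCST=  002  STAT=U,1,1,L,0,0,CCL   X=000001
Q=003222 P=000004  LCST=  002  STAT=U,1,1,L,0,0,CCG   X=000000
Q=003210 P=177777  LCST= S024  STAT=P,1,0,L,0,0,CCG   X=000000
"""

# Assumed layout: Q=<octal> P=<octal> LCST=<optional S><octal>.
FRAME_RE = re.compile(r"Q=[0-7]+\s+P=([0-7]+)\s+LCST=\s*(S?)([0-7]+)")

def parse_frames(text):
    """Return a list of (is_system_sl, segment, location), innermost first."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue  # skip the S=/DL=/Z= line and anything else
        p, s_flag, lcst = m.groups()
        frames.append((s_flag == "S", int(lcst, 8), int(p, 8)))
    return frames

for is_system, seg, loc in parse_frames(DUMP):
    where = "system SL" if is_system else "program"
    print("%-9s %%%o.%%%o" % (where, seg, loc))
```

Each program address it prints can then be looked up one at a time.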

   Now that we have the raw data,  we can use  MPEX's %PROGINST (which
in turn uses the FINDPMAPADDR intrinsic) to interpret it:

   %PROGINST FOOBAR.TEST.PROD, %1.%7331
   Segment UTILSEG, procedure MUNGARRAY, offset from start of code %224

   %PROGINST FOOBAR.TEST.PROD, %3.%4672
   Segment PROCESSDATA, procedure SEARCHFILE, offset from start of code %32

   %PROGINST FOOBAR.TEST.PROD, %1.%2640
   Segment UTILSEG, procedure NEXTPARM, offset from start of code %251

   %PROGINST FOOBAR.TEST.PROD, %1.%1670
   Segment UTILSEG, procedure PARSECOM, offset from start of code %1023

   %PROGINST FOOBAR.TEST.PROD, %2.%307
   Segment MAINSEG, procedure MAINLOOP, offset from start of code %102

   %PROGINST FOOBAR.TEST.PROD, %2.%4
   Segment MAINSEG, procedure OB', offset from start of code %4

   As  you  see,  :SETDUMP is a rather powerful  tool for this sort of
thing.  I  always  do  a  :SETDUMP  in  my  OPTION  LOGON UDC  -- when
everything  goes  well,  the  :SETDUMP  doesn't interfere,  but when a
program  aborts I automatically get the  ABORT STACK ANALYSIS, which I
want.  Watch  out  for  one  thing,  though: after  the procedure call
traceback  is printed, you may be (if you have the right capabilities)
dropped  into DEBUG. This is useful if  you know DEBUG and want to use
it   (perhaps   to  display  the  parameters  of  one  of  the  called
procedures); if you don't want to do anything in DEBUG, just type

   ?E@

and  the program will terminate as usual. Do NOT just type "?E", since
that  will resume executing the program  (despite the abort), and will
most probably result in another abort shortly afterwards.


FINDING OUT AT WHICH STATEMENT IN THE PROCEDURE AN ABORT OCCURRED

   So  far,  we've managed to figure out  the name of the procedure in
which  the  abort  occurred and the names  of all its ancestors (which
might  actually  be more useful for finding  out exactly where the bug
is).  However,  a procedure can be many lines  long -- how can we find
out exactly where in the procedure the abort happened?

   The  answer is: "with great  difficulty". Finding out the procedure
names  was  simple  since  MPE  keeps track of  which procedure starts
where,  and  even makes it easy for  us to get this information (using
the FINDPMAPADDR procedure). Unfortunately, MPE does NOT keep track of
which location each statement starts at.

   If  you're willing to keep all your source code listings (either on
paper or online), you can refer to them to find out exactly where each
statement  in  a  procedure starts. SPL outputs  this information as a
matter of course:

    3     00000 1   INTEGER PROCEDURE MIN (I, J);
    4     00000 1   VALUE I, J;
    5     00000 1   INTEGER I, J;
    6     00000 1   BEGIN
    7     00000 2   IF I<J THEN
    8     00003 2     MIN:=I
    9     00003 2   ELSE
   10     00006 2     MIN:=J;
   11     00010 2   END;

The number in the second column is the starting address of this line's
code,  relative to the start of the procedure. Thus, the code for line
7  (IF  I<J THEN) starts at location 0  in the procedure; the code for
line  8  (MIN:=I) starts at location 3,  the code for line 10 (MIN:=J)
starts  at  location 6, and the code  for the END statement (procedure
return) starts at location 10 (all numbers are in octal).

   If  our  abort  message  indicates  that,  say, a  bounds violation
occurred  at location 5 in procedure  MIN (not very likely considering
the  code  involved),  we'd  know  that the error  is in the statement
"MIN:=I".

   Similar outputs are provided by FORTRAN's $CONTROL LOCATION control
card, COBOL's $CONTROL VERBS, and PASCAL's $CODE_OFFSETS ON$.

   Well, this is all well and good -- IF you keep all your source code
listings!  Remember,  you need to either  print out the entire listing
every  time you recompile the program  (a listing that's even a little
bit  out  of  date  is likely to  have incompatible statement starting
addresses) or keep the entire listing online for every program.

   You  may consider one of the above alternatives worthwhile (keeping
the  entire listing online seems preferable);  if you do, then all you
need to do to exactly locate an abort location is:

   * Use  PROGINST  to convert the segment number and  location into a
     procedure name and a procedure-relative address.

   * Find the procedure in your source listing.

   * Find the last line in your source listing whose code offset is
     LESS than or equal to the procedure-relative address returned by
     PROGINST; the code offset of the line after it will be GREATER
     than that address.

   * The instruction at which the  abort occurred  is contained in the
     line you've just found.
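
   The lookup in the steps above can be sketched in a few lines of
Python. The table holds the code offsets that the MIN listing shows
for lines 7, 8, 10 and 11 (the lines identified earlier as the ones
where code starts); the name find_line is an assumption for this
example, and the table would in practice have to be typed in or
extracted from your own listing.

```python
import bisect

# (source line number, code offset) for the lines of MIN that start code,
# taken from the SPL listing shown earlier; offsets are octal.
MIN_LISTING = [
    (7, 0o0),    # IF I<J THEN
    (8, 0o3),    # MIN:=I
    (10, 0o6),   # MIN:=J
    (11, 0o10),  # END (procedure return)
]

def find_line(listing, offset):
    """Return the source line whose code contains the given offset."""
    offsets = [off for _, off in listing]
    # Last entry whose offset is <= the procedure-relative address.
    i = bisect.bisect_right(offsets, offset) - 1
    if i < 0:
        raise LookupError("offset %%%o is before the first line" % offset)
    return listing[i][0]

print(find_line(MIN_LISTING, 0o5))
```

For offset 5 the sketch picks line 8, the "MIN:=I" statement, which
agrees with the example given earlier. Note that the table must come
from the same compilation as the program file, or the offsets will not
match.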

   I  personally do not keep all  my source listings, either online or
on paper.  One reason is disc space; another is that all my procedures
are  quite small in any case, a programming style that I find has many
advantages.  When  the  location  of  the abort in  a procedure is not
immediately obvious, I use the following trick:

   * I use a  decompiler to  disassemble  the  code  around  the abort
     location.  Although this  requires some  knowledge of HP assembly
     code,  it's not  as hard as  it seems  --  often,  there are some
     procedure calls  (for which a  good disassembler will display the
     names  of the  procedures  being  called)  right  near  the abort
     location;  by referring back  to the source code, we can find the
     lines  on which  these procedures  are being called  and thus, by
     inference, find the approximate location of the abort.

     Similarly, if I see an ADD 15 instruction right next to the abort
     location  and there's a "I+15" in my procedure source code, I can
     assume  that the source line in which the abort occurred is right
     next to the "I+15".
