RELATIONAL DATABASES VS. IMAGE: WHAT THE FUSS IS ALL ABOUT
                       by Eugene Volokh, VESOFT
       Presented at 1986 INTEREX Conference, Detroit, MI, USA
   Published in "Thoughts & Discourses on HP3000 Software", 3rd ed.


ABSTRACT

    What  are  "relational  databases" anyway? Are  they more powerful
than  IMAGE? Less powerful? Faster?  Slower? Slogans abound, but facts
are  hard  to  come  by.  It  seems like HP will  finally have its own
relational  system  out  for Spectrum (or whatever  they call it these
days). I hope that this paper will clear up some of the confusion that
surrounds  relational  databases,  and will point  out the substantive
advantages  and  disadvantages  that  relational  databases  have over
network systems like IMAGE.


WHAT IS A RELATIONAL DATABASE?


   Let's think for a while about a database design problem.

   We  want to build a parts requisition system. We have many possible
suppliers, and many different parts. Each supplier can sell us several
kinds  of  parts,  and  each  part  can be bought  from one of several
suppliers.

   Easy,  right? We just have a supplier master, a parts master, and a
supplier/parts cross-reference detail:

   ---------------                      ---------------
    \ SUPPLIERS /                        \   PARTS   /
     \         /                          \         /
      \  (M)  /                            \  (M)  /
       \     /                              \     /
        \   /                                \   /
         \_/                                  \_/
          .                                   .
           ...                             ...
              ...                       ...
                 .                     .
                  ---------------------
                   \                 /
                    \ SUPPLIER-XREF /
                     \             /
                      \    (D)    /
                       \_________/

   Every supplier has a record in the SUPPLIERS master, every part has
a  record in the PARTS master, and each (supplier, part-supplied) pair
has a record in the SUPPLIER-XREF dataset.

   Now,  why  did  we  set  things  up  this  way? We  could have, for
instance,  made  the  SUPPLIER-XREF  dataset  a master, with  a key of
SUPPLIERS#+PART#.   Or,   we   could  have  made  all  three  datasets
stand-alone  details,  with  no masters at all.  The point is that the
proof  of  a  database  is  in the using. The  design we showed -- two
masters and a detail -- allows us to very efficiently do the following
things:

    * Look up supplier information by the unique supplier #.

    * Look up parts information by the unique part #.

    *  For  each part, look up all  its suppliers (by using the cross-
      reference detail dataset).

    *  For each supplier, look up all the parts it sells (by using the
      cross-reference detail dataset).

    This  is what IMAGE is good at  -- allowing quick retrieval from a
master using the master's unique key and allowing quick retrieval from
a detail chain using one of the detail's search items.


   However,  lets take a closer look at the parts dataset. It actually
looks kind of like this:

   PART#        <-- unique key item
   DESCRIPTION
   SHAPE
   COLOR
   ...

What  if  we  want  to  find  all  the  suppliers  that can  sell us a
"framastat"?  A  "framastat", you see, is not  a part number -- it's a
part  description.  We  want  to be able to look  up parts not only by
their  part  number,  but  also  by their  descriptions. The functions
supported by our design are:

   * Look up PART by PART#.

   * Look up SUPPLIERS by SUPPLIERS#.

   * Look up PARTs by SUPPLIERS#.

   * Look up SUPPLIERs by PART#.

What we want is the ability to

   * Look up PART by DESCRIPTION.


   The  sad thing is that the PARTS  dataset is a master, and a master
dataset  supports  lookup  by ONLY ONE FIELD  (the key). We can't make
DESCRIPTION  the key item, since we want  PART# to be the key item; we
can't  make DESCRIPTION a search item,  since PARTS isn't a detail. By
making  PARTS a master, we got fast lookup by PART# (on the order of 1
or  2 I/Os to do the DBGET), but we forfeited any power to look things
up quickly by any other item.

   And  so, dispirited and dejected, we get  drunk and go to bed. And,
deep  in  the  night,  a  dream  comes. "Make it  a detail!" the voice
shouts.  "Make it a detail, and then you can have as many paths as you
want to."

   We awaken elated! This is it! Make PARTS a detail dataset, and then
have  two  search  items, PART# and DESCRIPTION.  Each search item can
have an automatic master dataset hanging off of it, to wit:

   ---------------            --------------      --------------
    \ SUPPLIERS /              \  PART#S  /        \ DESCRIP- /
     \         /                \        /          \  TIONS /
      \  (M)  /                  \ (A)  /            \  (A) /
       \     /                    \    /              \    /
        \   /                      \  /                \  /
         \_/                        \/                  \/
          .                         .                   .
            ..                    ..  ..              ..
              ..                ..      ..           ..
                ..           ..           ..       ..
             ---------------------      --------------------
              \                 /        \                /
               \ SUPPLIER-XREF /          \    PARTS     /
                \     (D)     /            \    (D)     /
                 \           /              \          /
                  -----------                ----------


   What's  more,  if  we  ever,  say, want to find  all the parts of a
certain  color  or  shape, we can easily add  a new search item to the
PARTS  dataset. Sure, it may be a bit slower (to get a part we need to
first  find  it  in PART#S and then follow  the chain to PARTS, 2 I/Os
instead of 1), and also the uniqueness of part numbers isn't enforced;
still, the flexibility advantages are pretty nice.

   So,  now we can put any number of search items in PARTS. What about
SUPPLIERS? What if we want to find a supplier by his name, or city, or
any  other field? Again, if we  use master datasets, we're locked into
having only one key item per dataset. Just like we restructured PARTS,
we can restructure SUPPLIES, and come up with:

            ----------------       --------------          -------------
             \ SUPPLIER#S /         \  PART#S  /            \ DESCRIP- /
              \   (A)    /           \  (A)   /              \ TIONS  /
               \        /             \      /                \ (A)  /
                \      /               \    /                  \    /
                 \    /                 \  /                    \  /
                  \--/                   \/                      \/
                   ..                    ..                      .
                 ..  ..                ..  ...                 ..
               ..      ..             ..      ...             ..
             ..          ..          ..          ...         ..
           ..              ..       ..              ..      ..
 -------------------      ---------------------     -------------------
  \               /        \                 /       \               /
   \  SUPPLIERS  /          \ SUPPLIER-XREF /         \    PARTS    /
    \    (D)    /            \    (D)      /           \    (D)    /
     \         /              \           /             \         /
      ---------                -----------               ---------




   Note  what we have done in our  quest for flexibility. All the real
data  has  been put into detail datasets;  every data item which we're
likely to retrieve on has an automatic master attached to it.

   Believe it or not, this is a relational database.





IF THIS IS A RELATIONAL DATABASE, I'M A HOTTENTOT


   Surely,  you say, there is more  to a relational database than just
an IMAGE database without any master datasets. Isn't there? Of course,
there  is.  But  all  the  wonderful things you've  been hearing about
relational  databases  may  have  more  to  do with the  features of a
specific system that happens to be relational than with the virtues of
relational as a whole.

   Consider  for a moment NETWORK databases.  IMAGE is one example, in
fact  an  example  of  a  rather  restricted kind  of network database
(having only two levels, master and detail). Let's look at some of the
major features of IMAGE:

   * IMAGE supports unique-key MASTERS and non-unique-key DETAILS.

   * IMAGE does HASHING on master dataset records.

   * IMAGE has QUERY, an interactive query language.

   Which  of these features are actually NETWORK DATABASE features? In
other  words, which features would be present in any network database,
and  which  are  specific  to  the IMAGE implementation?  Of the three
listed  above, only the first --  masters and details -- must actually
be present in all databases that want to call themselves "network". On
the other hand, a network database might very well use B-trees or ISAM
as  its access method instead of hashing;  or, it might not provide an
interactive query language. It would still be a network database -- it
just wouldn't be IMAGE.

   Why  is  all  this  relevant?  Well,  let's say  that somebody said
"Network  databases  are  bad  because  they  use  hashing  instead of
B-trees."  This statement is wrong  because the network database model
is  silent on the question of B-trees  vs. hashing. It is incorrect to
generalize  from  the  fact  that IMAGE happens to  use hashing to the
theory  that  all  network  databases use hashing. If  we get into the
habit  of  making  such  generalizations,  we  are liable  to get very
inaccurate  ideas about network databases  in general or other network
implementations in particular.

   The  same  goes  for relational databases. The  reason that so many
people are so keen on relational databases isn't because they have any
particularly  novel  form of data  representation (actually, it's much
like   a   bunch   of  old-fashioned  KSAM/ISAM-like  files  with  the
possibility  of  multiple  keys); nor is it  because of some fancy new
access  methods  (hashing,  B-trees, and ISAM  are all that relational
databases  support). Rather, it's because the designers of many of the
modern  relational  databases did a good  job in providing people with
lots  of  useful features (ones that might  have been just as handy in
network databases).

WHAT ARE RELATIONAL DATABASES -- FUNCTIONALITY


   The  major  reason  for many of  the differences between relational
databases  and network databases is simple: age. Remember the good old
days  when  people  hacked  FORTRAN  code,  spending days  or weeks on
optimizing  out an instruction or two,  or saving 1000 bytes of memory
(they  had only 8K back then) ? Well, those are the days in which many
of  today's network databases were  first designed; maximum effort was
placed on making slow hardware run as fast as possible and getting the
most out of every byte of disk.

   Relational  databases, children of the late '70s and early '80s had
the  benefit  of perspective. Their designers  saw that much desirable
functionality  and  flexibility was missing in  the older systems, and
they  were  willing  to include it in  relational databases even if it
meant  some wasted storage and performance  slow-down. The bad part of
this  is  that, to some extent,  modern relational databases are still
hurting from slightly decreased performance; however, this seems to be
at  most  a  temporary problem, and  the functionality and flexibility
advantages are quite great.



THE USER INTERFACE -- THE RELATIONAL QUERY LANGUAGE


   If  you look at the  theoretical definition of relational databases
--  the one given by Codd in  his original paper that first introduced
the  subject -- you'll find that nowhere does it talk about B-trees or
hashing  or  internal  dataset format or anything  like that. In fact,
what  defines  a relational database is  the format of each "relation"
(or  dataset)  and  the  Relational  Query  Language that  lets a user
retrieve  from and update these  datasets. Since IMAGE detail datasets
very  closely  fit  the requirements of  relational dataset format, we
won't  talk  much  about  this;  the  concept  of  a  Relational Query
Language, however is another story.

   Everyone understands the "Query Language" part -- it's an interface
that  allows interactive access to  the database, just like QUERY/3000
(note that "query" means any database retrieval, insertion, update, or
deletion). What makes the query language RELATIONAL?

   Imagine  for  a moment the following  QUERY/3000 statement -- let's
call it "SELECT":

   *  It has the ability of a  >LIST command to select certain records
     based  on  selection  conditions  (like PRICE <  1000 AND COLOR =
     "RED").

   *  Just like a >LIST, it can output only those fields you want (not
     just the entire record).

   * Instead of just saying "list the fields PRICE and COLOR", you can
     also  specify expressions such as PRICE  * NUM-ITEMS, so that the
     statement might look like:

       SELECT PRICE, COLOR, PRICE * NUM-ITEMS, NUM-ITEMS + 100

   * Finally -- the most important difference of all, you can RETRIEVE
     ITEMS  FROM SEVERAL DATASETS AT ONCE!  This is what QUERY's >JOIN
     and >MULTIFIND commands let you do, but it's a lot less clumsy.

   The  capability  of retrieving stuff  from several datasets (called
"joining", from which QUERY's >JOIN command gets its name) is the most
important  difference between Relational  Query Languages and orthodox
query  languages  (such as QUERY-A, which didn't  have even the > JOIN
command).  If there's one thing  the advocates of relational databases
can  be proud of, it's their  efforts for widespread implementation of
joins in various query languages.

   With the >SELECT statement, we could say something like:

   SELECT SUPPLIER.NAME, SUPPLIER.NUMBER,
          PART.NAME, PART.NUMBER
   FROM   SUPPLIER, PART, SUPPLIER-XREF
   WHERE  SUPPLIER.NUMBER = SUPPLIER-XREF.SUPPLIER# AND
          PART.NUMBER = SUPPLIER-XREF.PART# AND
          PART.COLOR = "RED" AND
          SUPPLIER.STATE = "CA"

This  finds every SUPPLIER in the state of CAlifornia who supplies red
parts,  and prints the name and  number of that supplier together with
the name and number of every red part that he supplies.

   The first thing you should notice about the >SELECT command is that
THERE'S  NOTHING  IN  IT  THAT  QUERY/3000 DOESN'T  HAVE. Functionally
speaking,  it's  merely  a  combination  of QUERY's  >LIST, >JOIN, and
>MULTIFIND.  The  big  thing  is  that  you  don't have  to specify an
explicit  >JOIN command that tells QUERY  how to navigate the pathways
between  SUPPLIER,  SUPPLIER-XREF,  and PART. It  figures all this out
automatically.

   For  instance, the >SELECT command shown  above figures out that it
should find all the CAlifornian SUPPLIERs (maybe there's even a search
item on STATE that'll make the search faster), for each of those finds
the  appropriate  items  in SUPPLIER-XREF, and  for each SUPPLIER-XREF
item  it  checks to see if the  cross-reference record refers to a red
part.  Alternatively,  it  might look up all  the red parts first, and
then for each of those parts go through SUPPLIER-XREF and find all the
CAlifornian SUPPLIERs.

   It's  a great convenience to have the query language figure out the
way  the  join should be done, but  functionally this doesn't give you
any  capability  you  don't  already  have with QUERY/3000.  In a way,
therefore   (with   a  few  technical  exceptions),  QUERY/3000  IS  a
relational  query  language,  although  not  easy to use  as many such
languages that relational database systems support.



A BRIEF, BUT IMPORTANT DIGRESSION


   RELATION = TABLE = DATASET.

   ATTRIBUTE = COLUMN = FIELD (or ITEM -- it's close enough).

   TUPLE = ROW = RECORD.

The   relational   database   community  is  fond  of  calling  things
"relations",   "attributes",  and  "tuples".  This  is  just  needless
confusion. A "relation" is a dataset, and "attribute" is a field/item,
and  a "tuple" is a record. Table,  column, and row are more names for
the same thing.

   I try to always use the words dataset, item, and record. If anybody
throws  any of the relational buzzwords at you, just do a quick mental
translation.



FLEXIBILITY -- BUILDING AUTOMATIC MASTERS ON THE FLY


   If  you look carefully at  the theoretical definition of relational
databases,  you'll find that it doesn't actually mention any analog of
an  automatic master dataset. It describes "relations", which are just
detail  datasets,  and  some broad parameters  of the Relational Query
Language,  but  it  says  nothing about "indexes",  which is what most
relational  systems  use  in  the  same way that  IMAGE uses automatic
masters.

   Thus,  a database system may well have all its selections and joins
implemented  using  serial  reads and still  be relational! Of course,
such a system wouldn't sell very well, so all relational systems allow
you  to  build  "indexes"  just like IMAGE lets  you set up automation
masters.  The  thing  that the architects of  these systems did right,
however,  was that they let you  build indexes whenever you liked, not
just  when  you  were initially building  the database. Similarly, you
could  delete them whenever you found  that the overhead of constantly
updating  the  index exceeded the benefit that  the index gave you for
searching.

   In  some  ways, this feature --  although, as I said, theoretically
not  a "relational" feature -- is one of the most important advantages
of relational systems. It's this that gave rise to the often-heard but
quite  misleading statement that "in  relational systems, you can find
things  by  any  item,  not  just  a  search item". In  both IMAGE and
relational databases, you can find records using non-search items, but
it'll require a serial search; in both IMAGE and relational databases,
you  can set up any item as a search  item, but then you have to pay a
penalty  every  time you add or delete  a record. The big advantage of
relational  is that you can make ordinary items into search items (and
vice versa) very easily -- even easier than with ADAGER.



MORE FLEXIBILITY -- CHANGING DATABASE STRUCTURE ON THE FLY


   The ability that a relational database user has to easily build and
destroy  indexes  is  actually just a special  case of the ability the
user  has to easily create and destroy any dataset in the database. He
can add datasets, delete datasets, restructure them, and so on.

   This can be done not just after the system is built, but even while
other  parts  of the database are in  use. Again, any ADAGER user will
understand  how  useful  this  kind of ability  is (although ADAGER is
actually   rather   more  powerful  than  most  relational  databases'
restructuring tools).

   Again,  this  is  not  part  of  the  "theoretical"  definition  of
databases,  since  the  theoretical  definition  doesn't  worry  about
"unimportant"  things like performance or database maintenance; still,
it's pretty universal among database systems.



MORE USEFUL FUNCTIONALITY


   Relational Query Languages and Ease of Restructuring are two of the
three most major features that relational databases provide (the third
one  I'll  get  to later). What I've taken  pains to point out is that
these  features aren't unique with relational  systems, and to a large
extent  are present in IMAGE, especially  if you also throw in ADAGER.
However,  relational systems did pioneer these features, and should be
given  much credit for the spread  of these features to non-relational
systems like IMAGE.

   Some  other  useful  features  -- also not  part of the theoretical
definition of relational databases, but implemented in most relational
systems -- are:

   *  B-tree  and  ISAM  indexes to the addition  (or sometimes to the
     exclusion)  of hashing indexes. With  these, you can do KSAM-like
     accesses  by  which you can quickly  find all the customers whose
     names  start with "SMI" or retrieve all the parts in sorted order
     by part number.

   *  Indexes  on multiple items, just as if  you were able to build a
     single  IMAGE  automatic master on not  just the VENDOR# field of
     the  INVOICES  dataset,  but  both  the VENDOR#  and the INVOICE#
     fields.

   *  Views, which look to the user just like a dataset, but which are
     actually  the  results of FINDs (or  MULTIFINDs) on other dataset
     (s).  In  other  words, a user (or  a database administrator) can
     define  a view called RED-CALIFORNIAN-PARTS, which corresponds to
     the

       SELECT SUPPLIER.SNAME, SUPPLIER.SNUMBER,
              PART.PNAME, PART.PNUMBER, PART.PSHAPE
       FROM   SUPPLIER, PART, SUPPLIER-XREF
       WHERE  SUPPLIER.NUMBER = SUPPLIER-XREF.SUPPLIER# AND
              PART.NUMBER = SUPPLIER-XREF.PART# AND
              PART.COLOR = "RED" AND
              SUPPLIER.STATE = "CA"

     command  that we discussed above. This view will look just like a
     dataset  which contains the data retrieved by this >SELECT -- the
     only  difference  is that the dataset  won't actually take up any
     disc  space,  but  rather  any  command  that  refers to  it will
     automatically  be  translated  into  one  that  extracts  all the
     requested  data  from  the  SUPPLIER,  SUPPLIER-XREF,  and  PARTS
     datasets.

     Thus,   if   we  define  the  RED-CALIFORNIAN-PARTS  view  to  be
     equivalent to a SELECT like the one above, then

       SELECT RED-CALIFORNIAN-PARTS.SNAME,
              RED-CALIFORNIAN-PARTS.PNAME
       WHERE  RED-CALIFORNIAN-PARTS.PSHAPE="CUBE"

     will get all the red parts that are made by CAlifornian SUPPLIERs
     AND are cubes (the parts, not the suppliers).

   *  Integrity constraints, by which you  can say something like "all
     employee  salaries must be positive"  or "the sum of HOURS-WORKED
     and  HOURS-VACATION may not be more than  48". Any time a user --
     through  a program or the  interactive query environment -- tries
     to  add  a  record  to  the database that  violates the integrity
     constraints,  he'll get an error. On the other hand, the built-in
     integrity  constraints  of IMAGE, namely  that all detail records
     must  have  corresponding appropriate manual  master records, are
     not supported.

   *   Virtually   all   relational  database  systems  have  built-in
     "transaction-level  recovery". Without going  to deeply into this
     -- users of IMAGE roll-forward or roll-back recovery already know
     what  this  is  --  this  is  a good way of  making sure that the
     database  is kept safe and sound (even  in case of disk errors or
     such). It also allows you to automatically prevent half-completed
     transactions -- for instance, the addition of half the line items
     of an invoice - from being left in the database in case of system
     crash.



FUNCTIONAL ADVANTAGES OF IMAGE OVER RELATIONAL


   As  a  rule, most relational systems  are more powerful than IMAGE.
Still,  the  very  fact  that IMAGE is a  network system gives it some
advantages over relational systems.

   *  The most major one is the fact that in IMAGE, all detail records
     must   have  corresponding  records  in  any  appropriate  manual
     masters.  This  is  a kind of  "integrity constraint" (see above)
     that  most  relational  systems  don't support.  It's often quite
     useful  for  the  database itself to make  sure that a user can't
     enter an invoice, say, belonging to a non-existent vendor.

   *  Another advantage of IMAGE (or, to be precise, of QUERY) is that
     unlike  many (but not all) relational database systems, QUERY can
     handle several databases open at once.



BUT WHAT ABOUT RELATIONAL DATABASE PERFORMANCE?


   One thing that's often been said about relational databases is that
they're  slow. The reason why this rumor  is so prevalent is that it's
often  been true. Relational databases are young creatures, not yet as
well-optimized  as older, more mature systems. Still, the recent (past
3 years) 'coup' of relational systems has shown great improvement, and
there's  no reason why they can't - now  or in the near future - match
and exceed the performance of older systems like IMAGE/3000.


   What  exactly  was  it that has made  many older relational systems
slow? It wasn't the data structures - they can use hashing and B-trees
just  as well as IMAGE or KSAM (the  fact of the index and the dataset
being  stored  in  two  different  places isn't really  a problem). It
wasn't the recovery logging - a smart relational system might actually
do  less  I/Os  to support database reliability  than IMAGE does (with
each  one of its writes having to  be posted immediately to disk). The
greatest  speed  problem of relational systems  actually happens to be
their greatest performance advantage - embedded query languages.


   As  I  mentioned before, relational  databases were invented almost
backwards  (some  of  their  proponents claim  that the non-relational
databases were the ones invented backwards). Instead of describing the
physical structure first - hashed masters, details, search items, sort
items, double-word pointers, etc. - and then adding the query language
as  an afterthought, relational query capabilities were designed first
and  then  some  reasonable physical structure  was built around them.
That's why relational query languages are so nifty.

   The problem the relational architects got was this:

   In  a  relational  query system, to print out  the names of all the
employees  who  made  more  than $50,000 per year,  the names of their
departments,   and   the  addresses  of  their  buildings,  sorted  by
department name, we could say

   SELECT EMPLOYEE.NAME, DEPARTMENT.NAME, BUILDING.ADDRESS,
          BUILDING.CITY, BUILDING.STATE
   WHERE  EMPLOYEE.SALARY > 50000   AND
          EMPLOYEE.DEPT# = DEPARTMENT.DEPT#   AND
          EMPLOYEE.BLDG# = BUILDING.BLDG#
   SORT BY DEPARTMENT.NAME

But, this is all in QUERY - how can we do this from our program? Do we
now  have  to  go  back  to the old way of  doing all those DBGETs and
DBFINDs?

   What the designers of several relational systems decided was that

   * YOU CAN EMBED RELATIONAL QUERY COMMANDS INTO YOUR OWN PROGRAMS!

   Think about it for a moment. Your program might look like this:

   ACCEPT "WHAT SALARY THRESHOLD DO YOU WANT?", SALARY
#     SELECT  :ENAME = EMPLOYEE.NAME, :DNAME = DEPARTMENT.NAME,
#             :ADDRESS = BUILDING.ADDRESS, :CITY = BUILDING.CITY,
#             :STATE = BUILDING.STATE
#     WHERE   EMPLOYEE.SALARY > :SALARY   AND
#             EMPLOYEE.DEPT# = DEPARTMENT.DEPT#   AND
#             EMPLOYEE.BLDG# = BUILDING.BLDG#
#     SORT BY DEPARTMENT.NAME
#     DO
C   This code will be executed once for each employee-department-
C   building triplet found, with the ENAME, DNAME, ADDRESS, CITY,
C   and STATE variables set to the proper values.
#     DOEND

   The  query  you  want  to  execute  is just embedded  right in your
program  (with each line preceded by, say, a "#" to indicate that it's
part  of a query). Any variables - like the threshold SALARY or ENAME,
DNAME,  ADDRESS, CITY, or STATE - can be passed to and from the query.
Finally,  in the case of SELECT  queries (which only retrieve data), a
bunch  of  code  will  be  executed  for  each thing  retrieved; other
queries,  like  APPEND,  DELETE,  or  UPDATE can  also be conveniently
embeded.

   What  an  idea!  This  is  almost  like having  a fourth-generation
language - all the power of a procedural language with all the ease of
use  of the relational query  facility. However, truly heroic measures
have  to be undertaken to be sure that this isn't the slowest database
system known to man.

   The  simplest  way  of  allowing  embedded  queries is  to have the
compiler  (or, more commonly, a preprocessor program that converts the
embedded  code into something readable  to the compiler) compile calls
to the relational query language. Think of it like QUERY/3000 wasn't a
standalone program, but rather a procedure, and you'd say

   CALL QUERY ("FIND EMPLOYEE.SALARY > 10000")

This'll  work, but imagine the overhead  QUERY will have to go through
to  parse  and  interpretively  execute  this  operation!  It  has  to
recognize  the  FIND,  find  the dataset EMPLOYEE,  find the data item
SALARY,  figure  out what indexes there are  on the EMPLOYEE dataset -
you'll  be  lucky  if it's done by Christmas.  And, believe it or not,
this is how some early relational database systems worked.


   Now,  there  is  a  more  intelligent approach, but  it's harder to
implement. What it involves is that the preprocessor should do much of
the query parsing when it preprocesses the user's program. It sees our
SELECT,  for instance, and looks up all the datasets and items that it
refers  to. Then, the pointers to all these things are kept around, so
at  run  time, no parsing or dataset/item  lookup needs to be done. In
the  best  possible case, the preprocessor  might actually compile the
SELECT  into  the appropriate DBGETs, much  like the interactive QUERY
facility would translate the SELECT into a bunch of DBGETs that it has
to do.

   It  is  on  the success of  this "preprocessor-time precompilation"
that the performance of a commercial relational system - including, in
particular, HP's new relational offering - rests.

   HP  might  decide to forbid embedded  queries altogether and have a
DBGET/DBPUT/etc.-like  interface.  This won't be all  that bad, but it
won't be very nice (believe me, embedded queries are VERY useful). Or,
it  could  give  us  slow  embedded  queries  and  a  fast  DBxxx-like
interface,  telling us to "make our choice" - believe me, this will be
no  choice at all, considering  just how slow non-precompiled embedded
queries are.

The  worst  thing  HP  can  do  is  to give  us slow (non-precompiled)
embedded  queries  WITHOUT a procedure-level  interface to the system.
This  means  that  their  entire  relational product will  be a total,
unmitigated disaster.

  HP  might  do  some  simple  precompilation  -  say, parsing  of the
embedded  query  into individual tokens  ("SELECT", "EMPLOYEE", etc.).
WRONG. All it saves is just a little text scanning at run-time.

   HP  might have the preprocessor parse out the entire query and also
look  up  all  the  various  datasets and items  mentioned in it, thus
saving  the  extra  overhead at run-time. This  may make for a viable,
tolerably fast system.

   Finally,  if  HP  is feeling really audacious -  and smart - it can
compile  the  embedded  query into some kind  of pseudo-code that's as
close to the actual DBGET/DBPUT/etc. call level as possible. That way,
all that will be needed at run-time is for the program to step through
this  pseudo-code and do a DBxxx call  or something like that for each
pseudo-instruction.  If this is done, there's  no reason I can see why
the relational database can't be as fast or faster than IMAGE.


CONCLUSION

   My conclusions are simple:

   *  Relational databases are nothing revolutionarily new. Their main
     advantage lies not in any radically different data representation
     structure,     but     rather     in     a    lot    of    useful
     functionality/flexibility features.


   * All those functionality/flexibility features are really GREAT. If
     HP  doesn't degrade performance too badly,  you ought to like the
     relational system a lot better than you do IMAGE.

   * The performance of the system depends primarily on how good a job
     HP does of optimizing the preprocessing of embedded queries.

   *  You'll love embedded queries, if  they're fast. If they're slow,
     you'll hate them.





WHAT I HAVEN'T SAID AND WHY I HAVEN'T SAID IT


   The  purpose  of this paper is to  clarify for the average user the
differences  between IMAGE/3000 and common relational database systems
-  like  HP's  imminent relational offering. To  do this as tersely as
possible,  I've omitted many details  (some of them pretty substantial
ones)  that I think are insignificant  to the broad picture I'm trying
to present.

   Still,  I  think  that  it's appropriate for me  to point out these
details and explain why I didn't think it necessary to discuss them in
greater detail.

   *  THE  MATHEMATICAL  THEORY  OF RELATIONAL  DATABASES - RELATIONAL
     CALCULUS  AND  RELATIONAL  ALGEBRA.  I  believe  that  these  are
     completely  irrelevant  to practical uses  of relational database
     systems.  I  get  particularly  irritated  when  people  say that
     "relational  databases are Mathematically Sound" - something I've
     heard  many  a  time  -  this implies that  network databases are
     Mathematically  Unsound.  I've worked on them  for six years, and
     believe me, they're sound enough.

   * OPTIMIZING LARGE, COMPLICATED JOINS AND QUERIES. In my discussion
     of performance, I emphasized the importance of good preprocessing
     and  precompilation. This is a big deal for small, simple queries
     that  take  only a small amount of  processing time, and thus any
     parsing  overhead  really slows them down.  However, a query that
     does,  say,  a  join  of  three 100,000-record  datasets won't be
     slowed  down  much  by parsing - what will  speed it up most is a
     good  Join  Optimization  algorithm, one which  knows, say, which
     dataset to read through first, and so on. The reason why I didn't
     talk  much about it is that the big three-way 100,000-tuple joins
     aren't  all that frequent, and we  can live with them being slow.
     The small queries have to run like greased lightning.

   *  FIRST,  SECOND,  THIRD,  BOYCE-CODD,  AND  OTHER  NORMAL  FORMS.
     Interesting  for a paper on how  to design a database to minimize
     programming  problems, but irrelevant to  this discussion. In any
     case, anything you can represent in IMAGE is normalized enough to
     be  directly  translated  to  relational, although  some database
     designs may be prettified by further normalization.

Go to Adager's index of technical papers