RELATIONAL DATABASES VS. IMAGE: WHAT THE FUSS IS ALL ABOUT
by Eugene Volokh, VESOFT
Presented at 1986 INTEREX Conference, Detroit, MI, USA
Published in "Thoughts & Discourses on HP3000 Software", 3rd ed.
ABSTRACT
What are "relational databases" anyway? Are they more powerful
than IMAGE? Less powerful? Faster? Slower? Slogans abound, but facts
are hard to come by. It seems like HP will finally have its own
relational system out for Spectrum (or whatever they call it these
days). I hope that this paper will clear up some of the confusion that
surrounds relational databases, and will point out the substantive
advantages and disadvantages that relational databases have over
network systems like IMAGE.
WHAT IS A RELATIONAL DATABASE?
Let's think for a while about a database design problem.
We want to build a parts requisition system. We have many possible
suppliers, and many different parts. Each supplier can sell us several
kinds of parts, and each part can be bought from one of several
suppliers.
Easy, right? We just have a supplier master, a parts master, and a
supplier/parts cross-reference detail:
--------------- ---------------
\ SUPPLIERS / \ PARTS /
\ / \ /
\ (M) / \ (M) /
\ / \ /
\ / \ /
\_/ \_/
. .
... ...
... ...
. .
---------------------
\ /
\ SUPPLIER-XREF /
\ /
\ (D) /
\_________/
Every supplier has a record in the SUPPLIERS master, every part has
a record in the PARTS master, and each (supplier, part-supplied) pair
has a record in the SUPPLIER-XREF dataset.
Now, why did we set things up this way? We could have, for
instance, made the SUPPLIER-XREF dataset a master, with a key of
SUPPLIERS#+PART#. Or, we could have made all three datasets
stand-alone details, with no masters at all. The point is that the
proof of a database is in the using. The design we showed -- two
masters and a detail -- allows us to very efficiently do the following
things:
* Look up supplier information by the unique supplier #.
* Look up parts information by the unique part #.
* For each part, look up all its suppliers (by using the cross-
reference detail dataset).
* For each supplier, look up all the parts it sells (by using the
cross-reference detail dataset).
This is what IMAGE is good at -- allowing quick retrieval from a
master using the master's unique key and allowing quick retrieval from
a detail chain using one of the detail's search items.
However, lets take a closer look at the parts dataset. It actually
looks kind of like this:
PART# <-- unique key item
DESCRIPTION
SHAPE
COLOR
...
What if we want to find all the suppliers that can sell us a
"framastat"? A "framastat", you see, is not a part number -- it's a
part description. We want to be able to look up parts not only by
their part number, but also by their descriptions. The functions
supported by our design are:
* Look up PART by PART#.
* Look up SUPPLIERS by SUPPLIERS#.
* Look up PARTs by SUPPLIERS#.
* Look up SUPPLIERs by PART#.
What we want is the ability to
* Look up PART by DESCRIPTION.
The sad thing is that the PARTS dataset is a master, and a master
dataset supports lookup by ONLY ONE FIELD (the key). We can't make
DESCRIPTION the key item, since we want PART# to be the key item; we
can't make DESCRIPTION a search item, since PARTS isn't a detail. By
making PARTS a master, we got fast lookup by PART# (on the order of 1
or 2 I/Os to do the DBGET), but we forfeited any power to look things
up quickly by any other item.
And so, dispirited and dejected, we get drunk and go to bed. And,
deep in the night, a dream comes. "Make it a detail!" the voice
shouts. "Make it a detail, and then you can have as many paths as you
want to."
We awaken elated! This is it! Make PARTS a detail dataset, and then
have two search items, PART# and DESCRIPTION. Each search item can
have an automatic master dataset hanging off of it, to wit:
--------------- -------------- --------------
\ SUPPLIERS / \ PART#S / \ DESCRIP- /
\ / \ / \ TIONS /
\ (M) / \ (A) / \ (A) /
\ / \ / \ /
\ / \ / \ /
\_/ \/ \/
. . .
.. .. .. ..
.. .. .. ..
.. .. .. ..
--------------------- --------------------
\ / \ /
\ SUPPLIER-XREF / \ PARTS /
\ (D) / \ (D) /
\ / \ /
----------- ----------
What's more, if we ever, say, want to find all the parts of a
certain color or shape, we can easily add a new search item to the
PARTS dataset. Sure, it may be a bit slower (to get a part we need to
first find it in PART#S and then follow the chain to PARTS, 2 I/Os
instead of 1), and also the uniqueness of part numbers isn't enforced;
still, the flexibility advantages are pretty nice.
So, now we can put any number of search items in PARTS. What about
SUPPLIERS? What if we want to find a supplier by his name, or city, or
any other field? Again, if we use master datasets, we're locked into
having only one key item per dataset. Just like we restructured PARTS,
we can restructure SUPPLIES, and come up with:
---------------- -------------- -------------
\ SUPPLIER#S / \ PART#S / \ DESCRIP- /
\ (A) / \ (A) / \ TIONS /
\ / \ / \ (A) /
\ / \ / \ /
\ / \ / \ /
\--/ \/ \/
.. .. .
.. .. .. ... ..
.. .. .. ... ..
.. .. .. ... ..
.. .. .. .. ..
------------------- --------------------- -------------------
\ / \ / \ /
\ SUPPLIERS / \ SUPPLIER-XREF / \ PARTS /
\ (D) / \ (D) / \ (D) /
\ / \ / \ /
--------- ----------- ---------
Note what we have done in our quest for flexibility. All the real
data has been put into detail datasets; every data item which we're
likely to retrieve on has an automatic master attached to it.
Believe it or not, this is a relational database.
IF THIS IS A RELATIONAL DATABASE, I'M A HOTTENTOT
Surely, you say, there is more to a relational database than just
an IMAGE database without any master datasets. Isn't there? Of course,
there is. But all the wonderful things you've been hearing about
relational databases may have more to do with the features of a
specific system that happens to be relational than with the virtues of
relational as a whole.
Consider for a moment NETWORK databases. IMAGE is one example, in
fact an example of a rather restricted kind of network database
(having only two levels, master and detail). Let's look at some of the
major features of IMAGE:
* IMAGE supports unique-key MASTERS and non-unique-key DETAILS.
* IMAGE does HASHING on master dataset records.
* IMAGE has QUERY, an interactive query language.
Which of these features are actually NETWORK DATABASE features? In
other words, which features would be present in any network database,
and which are specific to the IMAGE implementation? Of the three
listed above, only the first -- masters and details -- must actually
be present in all databases that want to call themselves "network". On
the other hand, a network database might very well use B-trees or ISAM
as its access method instead of hashing; or, it might not provide an
interactive query language. It would still be a network database -- it
just wouldn't be IMAGE.
Why is all this relevant? Well, let's say that somebody said
"Network databases are bad because they use hashing instead of
B-trees." This statement is wrong because the network database model
is silent on the question of B-trees vs. hashing. It is incorrect to
generalize from the fact that IMAGE happens to use hashing to the
theory that all network databases use hashing. If we get into the
habit of making such generalizations, we are liable to get very
inaccurate ideas about network databases in general or other network
implementations in particular.
The same goes for relational databases. The reason that so many
people are so keen on relational databases isn't because they have any
particularly novel form of data representation (actually, it's much
like a bunch of old-fashioned KSAM/ISAM-like files with the
possibility of multiple keys); nor is it because of some fancy new
access methods (hashing, B-trees, and ISAM are all that relational
databases support). Rather, it's because the designers of many of the
modern relational databases did a good job in providing people with
lots of useful features (ones that might have been just as handy in
network databases).
WHAT ARE RELATIONAL DATABASES -- FUNCTIONALITY
The major reason for many of the differences between relational
databases and network databases is simple: age. Remember the good old
days when people hacked FORTRAN code, spending days or weeks on
optimizing out an instruction or two, or saving 1000 bytes of memory
(they had only 8K back then) ? Well, those are the days in which many
of today's network databases were first designed; maximum effort was
placed on making slow hardware run as fast as possible and getting the
most out of every byte of disk.
Relational databases, children of the late '70s and early '80s had
the benefit of perspective. Their designers saw that much desirable
functionality and flexibility was missing in the older systems, and
they were willing to include it in relational databases even if it
meant some wasted storage and performance slow-down. The bad part of
this is that, to some extent, modern relational databases are still
hurting from slightly decreased performance; however, this seems to be
at most a temporary problem, and the functionality and flexibility
advantages are quite great.
THE USER INTERFACE -- THE RELATIONAL QUERY LANGUAGE
If you look at the theoretical definition of relational databases
-- the one given by Codd in his original paper that first introduced
the subject -- you'll find that nowhere does it talk about B-trees or
hashing or internal dataset format or anything like that. In fact,
what defines a relational database is the format of each "relation"
(or dataset) and the Relational Query Language that lets a user
retrieve from and update these datasets. Since IMAGE detail datasets
very closely fit the requirements of relational dataset format, we
won't talk much about this; the concept of a Relational Query
Language, however is another story.
Everyone understands the "Query Language" part -- it's an interface
that allows interactive access to the database, just like QUERY/3000
(note that "query" means any database retrieval, insertion, update, or
deletion). What makes the query language RELATIONAL?
Imagine for a moment the following QUERY/3000 statement -- let's
call it "SELECT":
* It has the ability of a >LIST command to select certain records
based on selection conditions (like PRICE < 1000 AND COLOR =
"RED").
* Just like a >LIST, it can output only those fields you want (not
just the entire record).
* Instead of just saying "list the fields PRICE and COLOR", you can
also specify expressions such as PRICE * NUM-ITEMS, so that the
statement might look like:
SELECT PRICE, COLOR, PRICE * NUM-ITEMS, NUM-ITEMS + 100
* Finally -- the most important difference of all, you can RETRIEVE
ITEMS FROM SEVERAL DATASETS AT ONCE! This is what QUERY's >JOIN
and >MULTIFIND commands let you do, but it's a lot less clumsy.
The capability of retrieving stuff from several datasets (called
"joining", from which QUERY's >JOIN command gets its name) is the most
important difference between Relational Query Languages and orthodox
query languages (such as QUERY-A, which didn't have even the > JOIN
command). If there's one thing the advocates of relational databases
can be proud of, it's their efforts for widespread implementation of
joins in various query languages.
With the >SELECT statement, we could say something like:
SELECT SUPPLIER.NAME, SUPPLIER.NUMBER,
PART.NAME, PART.NUMBER
FROM SUPPLIER, PART, SUPPLIER-XREF
WHERE SUPPLIER.NUMBER = SUPPLIER-XREF.SUPPLIER# AND
PART.NUMBER = SUPPLIER-XREF.PART# AND
PART.COLOR = "RED" AND
SUPPLIER.STATE = "CA"
This finds every SUPPLIER in the state of CAlifornia who supplies red
parts, and prints the name and number of that supplier together with
the name and number of every red part that he supplies.
The first thing you should notice about the >SELECT command is that
THERE'S NOTHING IN IT THAT QUERY/3000 DOESN'T HAVE. Functionally
speaking, it's merely a combination of QUERY's >LIST, >JOIN, and
>MULTIFIND. The big thing is that you don't have to specify an
explicit >JOIN command that tells QUERY how to navigate the pathways
between SUPPLIER, SUPPLIER-XREF, and PART. It figures all this out
automatically.
For instance, the >SELECT command shown above figures out that it
should find all the CAlifornian SUPPLIERs (maybe there's even a search
item on STATE that'll make the search faster), for each of those finds
the appropriate items in SUPPLIER-XREF, and for each SUPPLIER-XREF
item it checks to see if the cross-reference record refers to a red
part. Alternatively, it might look up all the red parts first, and
then for each of those parts go through SUPPLIER-XREF and find all the
CAlifornian SUPPLIERs.
It's a great convenience to have the query language figure out the
way the join should be done, but functionally this doesn't give you
any capability you don't already have with QUERY/3000. In a way,
therefore (with a few technical exceptions), QUERY/3000 IS a
relational query language, although not easy to use as many such
languages that relational database systems support.
A BRIEF, BUT IMPORTANT DIGRESSION
RELATION = TABLE = DATASET.
ATTRIBUTE = COLUMN = FIELD (or ITEM -- it's close enough).
TUPLE = ROW = RECORD.
The relational database community is fond of calling things
"relations", "attributes", and "tuples". This is just needless
confusion. A "relation" is a dataset, and "attribute" is a field/item,
and a "tuple" is a record. Table, column, and row are more names for
the same thing.
I try to always use the words dataset, item, and record. If anybody
throws any of the relational buzzwords at you, just do a quick mental
translation.
FLEXIBILITY -- BUILDING AUTOMATIC MASTERS ON THE FLY
If you look carefully at the theoretical definition of relational
databases, you'll find that it doesn't actually mention any analog of
an automatic master dataset. It describes "relations", which are just
detail datasets, and some broad parameters of the Relational Query
Language, but it says nothing about "indexes", which is what most
relational systems use in the same way that IMAGE uses automatic
masters.
Thus, a database system may well have all its selections and joins
implemented using serial reads and still be relational! Of course,
such a system wouldn't sell very well, so all relational systems allow
you to build "indexes" just like IMAGE lets you set up automation
masters. The thing that the architects of these systems did right,
however, was that they let you build indexes whenever you liked, not
just when you were initially building the database. Similarly, you
could delete them whenever you found that the overhead of constantly
updating the index exceeded the benefit that the index gave you for
searching.
In some ways, this feature -- although, as I said, theoretically
not a "relational" feature -- is one of the most important advantages
of relational systems. It's this that gave rise to the often-heard but
quite misleading statement that "in relational systems, you can find
things by any item, not just a search item". In both IMAGE and
relational databases, you can find records using non-search items, but
it'll require a serial search; in both IMAGE and relational databases,
you can set up any item as a search item, but then you have to pay a
penalty every time you add or delete a record. The big advantage of
relational is that you can make ordinary items into search items (and
vice versa) very easily -- even easier than with ADAGER.
MORE FLEXIBILITY -- CHANGING DATABASE STRUCTURE ON THE FLY
The ability that a relational database user has to easily build and
destroy indexes is actually just a special case of the ability the
user has to easily create and destroy any dataset in the database. He
can add datasets, delete datasets, restructure them, and so on.
This can be done not just after the system is built, but even while
other parts of the database are in use. Again, any ADAGER user will
understand how useful this kind of ability is (although ADAGER is
actually rather more powerful than most relational databases'
restructuring tools).
Again, this is not part of the "theoretical" definition of
databases, since the theoretical definition doesn't worry about
"unimportant" things like performance or database maintenance; still,
it's pretty universal among database systems.
MORE USEFUL FUNCTIONALITY
Relational Query Languages and Ease of Restructuring are two of the
three most major features that relational databases provide (the third
one I'll get to later). What I've taken pains to point out is that
these features aren't unique with relational systems, and to a large
extent are present in IMAGE, especially if you also throw in ADAGER.
However, relational systems did pioneer these features, and should be
given much credit for the spread of these features to non-relational
systems like IMAGE.
Some other useful features -- also not part of the theoretical
definition of relational databases, but implemented in most relational
systems -- are:
* B-tree and ISAM indexes to the addition (or sometimes to the
exclusion) of hashing indexes. With these, you can do KSAM-like
accesses by which you can quickly find all the customers whose
names start with "SMI" or retrieve all the parts in sorted order
by part number.
* Indexes on multiple items, just as if you were able to build a
single IMAGE automatic master on not just the VENDOR# field of
the INVOICES dataset, but both the VENDOR# and the INVOICE#
fields.
* Views, which look to the user just like a dataset, but which are
actually the results of FINDs (or MULTIFINDs) on other dataset
(s). In other words, a user (or a database administrator) can
define a view called RED-CALIFORNIAN-PARTS, which corresponds to
the
SELECT SUPPLIER.SNAME, SUPPLIER.SNUMBER,
PART.PNAME, PART.PNUMBER, PART.PSHAPE
FROM SUPPLIER, PART, SUPPLIER-XREF
WHERE SUPPLIER.NUMBER = SUPPLIER-XREF.SUPPLIER# AND
PART.NUMBER = SUPPLIER-XREF.PART# AND
PART.COLOR = "RED" AND
SUPPLIER.STATE = "CA"
command that we discussed above. This view will look just like a
dataset which contains the data retrieved by this >SELECT -- the
only difference is that the dataset won't actually take up any
disc space, but rather any command that refers to it will
automatically be translated into one that extracts all the
requested data from the SUPPLIER, SUPPLIER-XREF, and PARTS
datasets.
Thus, if we define the RED-CALIFORNIAN-PARTS view to be
equivalent to a SELECT like the one above, then
SELECT RED-CALIFORNIAN-PARTS.SNAME,
RED-CALIFORNIAN-PARTS.PNAME
WHERE RED-CALIFORNIAN-PARTS.PSHAPE="CUBE"
will get all the red parts that are made by CAlifornian SUPPLIERs
AND are cubes (the parts, not the suppliers).
* Integrity constraints, by which you can say something like "all
employee salaries must be positive" or "the sum of HOURS-WORKED
and HOURS-VACATION may not be more than 48". Any time a user --
through a program or the interactive query environment -- tries
to add a record to the database that violates the integrity
constraints, he'll get an error. On the other hand, the built-in
integrity constraints of IMAGE, namely that all detail records
must have corresponding appropriate manual master records, are
not supported.
* Virtually all relational database systems have built-in
"transaction-level recovery". Without going to deeply into this
-- users of IMAGE roll-forward or roll-back recovery already know
what this is -- this is a good way of making sure that the
database is kept safe and sound (even in case of disk errors or
such). It also allows you to automatically prevent half-completed
transactions -- for instance, the addition of half the line items
of an invoice - from being left in the database in case of system
crash.
FUNCTIONAL ADVANTAGES OF IMAGE OVER RELATIONAL
As a rule, most relational systems are more powerful than IMAGE.
Still, the very fact that IMAGE is a network system gives it some
advantages over relational systems.
* The most major one is the fact that in IMAGE, all detail records
must have corresponding records in any appropriate manual
masters. This is a kind of "integrity constraint" (see above)
that most relational systems don't support. It's often quite
useful for the database itself to make sure that a user can't
enter an invoice, say, belonging to a non-existent vendor.
* Another advantage of IMAGE (or, to be precise, of QUERY) is that
unlike many (but not all) relational database systems, QUERY can
handle several databases open at once.
BUT WHAT ABOUT RELATIONAL DATABASE PERFORMANCE?
One thing that's often been said about relational databases is that
they're slow. The reason why this rumor is so prevalent is that it's
often been true. Relational databases are young creatures, not yet as
well-optimized as older, more mature systems. Still, the recent (past
3 years) 'coup' of relational systems has shown great improvement, and
there's no reason why they can't - now or in the near future - match
and exceed the performance of older systems like IMAGE/3000.
What exactly was it that has made many older relational systems
slow? It wasn't the data structures - they can use hashing and B-trees
just as well as IMAGE or KSAM (the fact of the index and the dataset
being stored in two different places isn't really a problem). It
wasn't the recovery logging - a smart relational system might actually
do less I/Os to support database reliability than IMAGE does (with
each one of its writes having to be posted immediately to disk). The
greatest speed problem of relational systems actually happens to be
their greatest performance advantage - embedded query languages.
As I mentioned before, relational databases were invented almost
backwards (some of their proponents claim that the non-relational
databases were the ones invented backwards). Instead of describing the
physical structure first - hashed masters, details, search items, sort
items, double-word pointers, etc. - and then adding the query language
as an afterthought, relational query capabilities were designed first
and then some reasonable physical structure was built around them.
That's why relational query languages are so nifty.
The problem the relational architects got was this:
In a relational query system, to print out the names of all the
employees who made more than $50,000 per year, the names of their
departments, and the addresses of their buildings, sorted by
department name, we could say
SELECT EMPLOYEE.NAME, DEPARTMENT.NAME, BUILDING.ADDRESS,
BUILDING.CITY, BUILDING.STATE
WHERE EMPLOYEE.SALARY > 50000 AND
EMPLOYEE.DEPT# = DEPARTMENT.DEPT# AND
EMPLOYEE.BLDG# = BUILDING.BLDG#
SORT BY DEPARTMENT.NAME
But, this is all in QUERY - how can we do this from our program? Do we
now have to go back to the old way of doing all those DBGETs and
DBFINDs?
What the designers of several relational systems decided was that
* YOU CAN EMBED RELATIONAL QUERY COMMANDS INTO YOUR OWN PROGRAMS!
Think about it for a moment. Your program might look like this:
ACCEPT "WHAT SALARY THRESHOLD DO YOU WANT?", SALARY
# SELECT :ENAME = EMPLOYEE.NAME, :DNAME = DEPARTMENT.NAME,
# :ADDRESS = BUILDING.ADDRESS, :CITY = BUILDING.CITY,
# :STATE = BUILDING.STATE
# WHERE EMPLOYEE.SALARY > :SALARY AND
# EMPLOYEE.DEPT# = DEPARTMENT.DEPT# AND
# EMPLOYEE.BLDG# = BUILDING.BLDG#
# SORT BY DEPARTMENT.NAME
# DO
C This code will be executed once for each employee-department-
C building triplet found, with the ENAME, DNAME, ADDRESS, CITY,
C and STATE variables set to the proper values.
# DOEND
The query you want to execute is just embedded right in your
program (with each line preceded by, say, a "#" to indicate that it's
part of a query). Any variables - like the threshold SALARY or ENAME,
DNAME, ADDRESS, CITY, or STATE - can be passed to and from the query.
Finally, in the case of SELECT queries (which only retrieve data), a
bunch of code will be executed for each thing retrieved; other
queries, like APPEND, DELETE, or UPDATE can also be conveniently
embeded.
What an idea! This is almost like having a fourth-generation
language - all the power of a procedural language with all the ease of
use of the relational query facility. However, truly heroic measures
have to be undertaken to be sure that this isn't the slowest database
system known to man.
The simplest way of allowing embedded queries is to have the
compiler (or, more commonly, a preprocessor program that converts the
embedded code into something readable to the compiler) compile calls
to the relational query language. Think of it like QUERY/3000 wasn't a
standalone program, but rather a procedure, and you'd say
CALL QUERY ("FIND EMPLOYEE.SALARY > 10000")
This'll work, but imagine the overhead QUERY will have to go through
to parse and interpretively execute this operation! It has to
recognize the FIND, find the dataset EMPLOYEE, find the data item
SALARY, figure out what indexes there are on the EMPLOYEE dataset -
you'll be lucky if it's done by Christmas. And, believe it or not,
this is how some early relational database systems worked.
Now, there is a more intelligent approach, but it's harder to
implement. What it involves is that the preprocessor should do much of
the query parsing when it preprocesses the user's program. It sees our
SELECT, for instance, and looks up all the datasets and items that it
refers to. Then, the pointers to all these things are kept around, so
at run time, no parsing or dataset/item lookup needs to be done. In
the best possible case, the preprocessor might actually compile the
SELECT into the appropriate DBGETs, much like the interactive QUERY
facility would translate the SELECT into a bunch of DBGETs that it has
to do.
It is on the success of this "preprocessor-time precompilation"
that the performance of a commercial relational system - including, in
particular, HP's new relational offering - rests.
HP might decide to forbid embedded queries altogether and have a
DBGET/DBPUT/etc.-like interface. This won't be all that bad, but it
won't be very nice (believe me, embedded queries are VERY useful). Or,
it could give us slow embedded queries and a fast DBxxx-like
interface, telling us to "make our choice" - believe me, this will be
no choice at all, considering just how slow non-precompiled embedded
queries are.
The worst thing HP can do is to give us slow (non-precompiled)
embedded queries WITHOUT a procedure-level interface to the system.
This means that their entire relational product will be a total,
unmitigated disaster.
HP might do some simple precompilation - say, parsing of the
embedded query into individual tokens ("SELECT", "EMPLOYEE", etc.).
WRONG. All it saves is just a little text scanning at run-time.
HP might have the preprocessor parse out the entire query and also
look up all the various datasets and items mentioned in it, thus
saving the extra overhead at run-time. This may make for a viable,
tolerably fast system.
Finally, if HP is feeling really audacious - and smart - it can
compile the embedded query into some kind of pseudo-code that's as
close to the actual DBGET/DBPUT/etc. call level as possible. That way,
all that will be needed at run-time is for the program to step through
this pseudo-code and do a DBxxx call or something like that for each
pseudo-instruction. If this is done, there's no reason I can see why
the relational database can't be as fast or faster than IMAGE.
CONCLUSION
My conclusions are simple:
* Relational databases are nothing revolutionarily new. Their main
advantage lies not in any radically different data representation
structure, but rather in a lot of useful
functionality/flexibility features.
* All those functionality/flexibility features are really GREAT. If
HP doesn't degrade performance too badly, you ought to like the
relational system a lot better than you do IMAGE.
* The performance of the system depends primarily on how good a job
HP does of optimizing the preprocessing of embedded queries.
* You'll love embedded queries, if they're fast. If they're slow,
you'll hate them.
WHAT I HAVEN'T SAID AND WHY I HAVEN'T SAID IT
The purpose of this paper is to clarify for the average user the
differences between IMAGE/3000 and common relational database systems
- like HP's imminent relational offering. To do this as tersely as
possible, I've omitted many details (some of them pretty substantial
ones) that I think are insignificant to the broad picture I'm trying
to present.
Still, I think that it's appropriate for me to point out these
details and explain why I didn't think it necessary to discuss them in
greater detail.
* THE MATHEMATICAL THEORY OF RELATIONAL DATABASES - RELATIONAL
CALCULUS AND RELATIONAL ALGEBRA. I believe that these are
completely irrelevant to practical uses of relational database
systems. I get particularly irritated when people say that
"relational databases are Mathematically Sound" - something I've
heard many a time - this implies that network databases are
Mathematically Unsound. I've worked on them for six years, and
believe me, they're sound enough.
* OPTIMIZING LARGE, COMPLICATED JOINS AND QUERIES. In my discussion
of performance, I emphasized the importance of good preprocessing
and precompilation. This is a big deal for small, simple queries
that take only a small amount of processing time, and thus any
parsing overhead really slows them down. However, a query that
does, say, a join of three 100,000-record datasets won't be
slowed down much by parsing - what will speed it up most is a
good Join Optimization algorithm, one which knows, say, which
dataset to read through first, and so on. The reason why I didn't
talk much about it is that the big three-way 100,000-tuple joins
aren't all that frequent, and we can live with them being slow.
The small queries have to run like greased lightning.
* FIRST, SECOND, THIRD, BOYCE-CODD, AND OTHER NORMAL FORMS.
Interesting for a paper on how to design a database to minimize
programming problems, but irrelevant to this discussion. In any
case, anything you can represent in IMAGE is normalized enough to
be directly translated to relational, although some database
designs may be prettified by further normalization.