RELATIONAL DATABASES VS. IMAGE: WHAT THE FUSS IS ALL ABOUT by Eugene Volokh, VESOFT Presented at 1986 INTEREX Conference, Detroit, MI, USA Published in "Thoughts & Discourses on HP3000 Software", 3rd ed. ABSTRACT What are "relational databases" anyway? Are they more powerful than IMAGE? Less powerful? Faster? Slower? Slogans abound, but facts are hard to come by. It seems like HP will finally have its own relational system out for Spectrum (or whatever they call it these days). I hope that this paper will clear up some of the confusion that surrounds relational databases, and will point out the substantive advantages and disadvantages that relational databases have over network systems like IMAGE. WHAT IS A RELATIONAL DATABASE? Let's think for a while about a database design problem. We want to build a parts requisition system. We have many possible suppliers, and many different parts. Each supplier can sell us several kinds of parts, and each part can be bought from one of several suppliers. Easy, right? We just have a supplier master, a parts master, and a supplier/parts cross-reference detail: --------------- --------------- \ SUPPLIERS / \ PARTS / \ / \ / \ (M) / \ (M) / \ / \ / \ / \ / \_/ \_/ . . ... ... ... ... . . --------------------- \ / \ SUPPLIER-XREF / \ / \ (D) / \_________/ Every supplier has a record in the SUPPLIERS master, every part has a record in the PARTS master, and each (supplier, part-supplied) pair has a record in the SUPPLIER-XREF dataset. Now, why did we set things up this way? We could have, for instance, made the SUPPLIER-XREF dataset a master, with a key of SUPPLIERS#+PART#. Or, we could have made all three datasets stand-alone details, with no masters at all. The point is that the proof of a database is in the using. The design we showed -- two masters and a detail -- allows us to very efficiently do the following things: * Look up supplier information by the unique supplier #. * Look up parts information by the unique part #. * For each part, look up all its suppliers (by using the cross- reference detail dataset). * For each supplier, look up all the parts it sells (by using the cross-reference detail dataset). This is what IMAGE is good at -- allowing quick retrieval from a master using the master's unique key and allowing quick retrieval from a detail chain using one of the detail's search items. However, lets take a closer look at the parts dataset. It actually looks kind of like this: PART# <-- unique key item DESCRIPTION SHAPE COLOR ... What if we want to find all the suppliers that can sell us a "framastat"? A "framastat", you see, is not a part number -- it's a part description. We want to be able to look up parts not only by their part number, but also by their descriptions. The functions supported by our design are: * Look up PART by PART#. * Look up SUPPLIERS by SUPPLIERS#. * Look up PARTs by SUPPLIERS#. * Look up SUPPLIERs by PART#. What we want is the ability to * Look up PART by DESCRIPTION. The sad thing is that the PARTS dataset is a master, and a master dataset supports lookup by ONLY ONE FIELD (the key). We can't make DESCRIPTION the key item, since we want PART# to be the key item; we can't make DESCRIPTION a search item, since PARTS isn't a detail. By making PARTS a master, we got fast lookup by PART# (on the order of 1 or 2 I/Os to do the DBGET), but we forfeited any power to look things up quickly by any other item. And so, dispirited and dejected, we get drunk and go to bed. And, deep in the night, a dream comes. "Make it a detail!" the voice shouts. "Make it a detail, and then you can have as many paths as you want to." We awaken elated! This is it! Make PARTS a detail dataset, and then have two search items, PART# and DESCRIPTION. Each search item can have an automatic master dataset hanging off of it, to wit: --------------- -------------- -------------- \ SUPPLIERS / \ PART#S / \ DESCRIP- / \ / \ / \ TIONS / \ (M) / \ (A) / \ (A) / \ / \ / \ / \ / \ / \ / \_/ \/ \/ . . . .. .. .. .. .. .. .. .. .. .. .. .. --------------------- -------------------- \ / \ / \ SUPPLIER-XREF / \ PARTS / \ (D) / \ (D) / \ / \ / ----------- ---------- What's more, if we ever, say, want to find all the parts of a certain color or shape, we can easily add a new search item to the PARTS dataset. Sure, it may be a bit slower (to get a part we need to first find it in PART#S and then follow the chain to PARTS, 2 I/Os instead of 1), and also the uniqueness of part numbers isn't enforced; still, the flexibility advantages are pretty nice. So, now we can put any number of search items in PARTS. What about SUPPLIERS? What if we want to find a supplier by his name, or city, or any other field? Again, if we use master datasets, we're locked into having only one key item per dataset. Just like we restructured PARTS, we can restructure SUPPLIES, and come up with: ---------------- -------------- ------------- \ SUPPLIER#S / \ PART#S / \ DESCRIP- / \ (A) / \ (A) / \ TIONS / \ / \ / \ (A) / \ / \ / \ / \ / \ / \ / \--/ \/ \/ .. .. . .. .. .. ... .. .. .. .. ... .. .. .. .. ... .. .. .. .. .. .. ------------------- --------------------- ------------------- \ / \ / \ / \ SUPPLIERS / \ SUPPLIER-XREF / \ PARTS / \ (D) / \ (D) / \ (D) / \ / \ / \ / --------- ----------- --------- Note what we have done in our quest for flexibility. All the real data has been put into detail datasets; every data item which we're likely to retrieve on has an automatic master attached to it. Believe it or not, this is a relational database. IF THIS IS A RELATIONAL DATABASE, I'M A HOTTENTOT Surely, you say, there is more to a relational database than just an IMAGE database without any master datasets. Isn't there? Of course, there is. But all the wonderful things you've been hearing about relational databases may have more to do with the features of a specific system that happens to be relational than with the virtues of relational as a whole. Consider for a moment NETWORK databases. IMAGE is one example, in fact an example of a rather restricted kind of network database (having only two levels, master and detail). Let's look at some of the major features of IMAGE: * IMAGE supports unique-key MASTERS and non-unique-key DETAILS. * IMAGE does HASHING on master dataset records. * IMAGE has QUERY, an interactive query language. Which of these features are actually NETWORK DATABASE features? In other words, which features would be present in any network database, and which are specific to the IMAGE implementation? Of the three listed above, only the first -- masters and details -- must actually be present in all databases that want to call themselves "network". On the other hand, a network database might very well use B-trees or ISAM as its access method instead of hashing; or, it might not provide an interactive query language. It would still be a network database -- it just wouldn't be IMAGE. Why is all this relevant? Well, let's say that somebody said "Network databases are bad because they use hashing instead of B-trees." This statement is wrong because the network database model is silent on the question of B-trees vs. hashing. It is incorrect to generalize from the fact that IMAGE happens to use hashing to the theory that all network databases use hashing. If we get into the habit of making such generalizations, we are liable to get very inaccurate ideas about network databases in general or other network implementations in particular. The same goes for relational databases. The reason that so many people are so keen on relational databases isn't because they have any particularly novel form of data representation (actually, it's much like a bunch of old-fashioned KSAM/ISAM-like files with the possibility of multiple keys); nor is it because of some fancy new access methods (hashing, B-trees, and ISAM are all that relational databases support). Rather, it's because the designers of many of the modern relational databases did a good job in providing people with lots of useful features (ones that might have been just as handy in network databases). WHAT ARE RELATIONAL DATABASES -- FUNCTIONALITY The major reason for many of the differences between relational databases and network databases is simple: age. Remember the good old days when people hacked FORTRAN code, spending days or weeks on optimizing out an instruction or two, or saving 1000 bytes of memory (they had only 8K back then) ? Well, those are the days in which many of today's network databases were first designed; maximum effort was placed on making slow hardware run as fast as possible and getting the most out of every byte of disk. Relational databases, children of the late '70s and early '80s had the benefit of perspective. Their designers saw that much desirable functionality and flexibility was missing in the older systems, and they were willing to include it in relational databases even if it meant some wasted storage and performance slow-down. The bad part of this is that, to some extent, modern relational databases are still hurting from slightly decreased performance; however, this seems to be at most a temporary problem, and the functionality and flexibility advantages are quite great. THE USER INTERFACE -- THE RELATIONAL QUERY LANGUAGE If you look at the theoretical definition of relational databases -- the one given by Codd in his original paper that first introduced the subject -- you'll find that nowhere does it talk about B-trees or hashing or internal dataset format or anything like that. In fact, what defines a relational database is the format of each "relation" (or dataset) and the Relational Query Language that lets a user retrieve from and update these datasets. Since IMAGE detail datasets very closely fit the requirements of relational dataset format, we won't talk much about this; the concept of a Relational Query Language, however is another story. Everyone understands the "Query Language" part -- it's an interface that allows interactive access to the database, just like QUERY/3000 (note that "query" means any database retrieval, insertion, update, or deletion). What makes the query language RELATIONAL? Imagine for a moment the following QUERY/3000 statement -- let's call it "SELECT": * It has the ability of a >LIST command to select certain records based on selection conditions (like PRICE < 1000 AND COLOR = "RED"). * Just like a >LIST, it can output only those fields you want (not just the entire record). * Instead of just saying "list the fields PRICE and COLOR", you can also specify expressions such as PRICE * NUM-ITEMS, so that the statement might look like: SELECT PRICE, COLOR, PRICE * NUM-ITEMS, NUM-ITEMS + 100 * Finally -- the most important difference of all, you can RETRIEVE ITEMS FROM SEVERAL DATASETS AT ONCE! This is what QUERY's >JOIN and >MULTIFIND commands let you do, but it's a lot less clumsy. The capability of retrieving stuff from several datasets (called "joining", from which QUERY's >JOIN command gets its name) is the most important difference between Relational Query Languages and orthodox query languages (such as QUERY-A, which didn't have even the > JOIN command). If there's one thing the advocates of relational databases can be proud of, it's their efforts for widespread implementation of joins in various query languages. With the >SELECT statement, we could say something like: SELECT SUPPLIER.NAME, SUPPLIER.NUMBER, PART.NAME, PART.NUMBER FROM SUPPLIER, PART, SUPPLIER-XREF WHERE SUPPLIER.NUMBER = SUPPLIER-XREF.SUPPLIER# AND PART.NUMBER = SUPPLIER-XREF.PART# AND PART.COLOR = "RED" AND SUPPLIER.STATE = "CA" This finds every SUPPLIER in the state of CAlifornia who supplies red parts, and prints the name and number of that supplier together with the name and number of every red part that he supplies. The first thing you should notice about the >SELECT command is that THERE'S NOTHING IN IT THAT QUERY/3000 DOESN'T HAVE. Functionally speaking, it's merely a combination of QUERY's >LIST, >JOIN, and >MULTIFIND. The big thing is that you don't have to specify an explicit >JOIN command that tells QUERY how to navigate the pathways between SUPPLIER, SUPPLIER-XREF, and PART. It figures all this out automatically. For instance, the >SELECT command shown above figures out that it should find all the CAlifornian SUPPLIERs (maybe there's even a search item on STATE that'll make the search faster), for each of those finds the appropriate items in SUPPLIER-XREF, and for each SUPPLIER-XREF item it checks to see if the cross-reference record refers to a red part. Alternatively, it might look up all the red parts first, and then for each of those parts go through SUPPLIER-XREF and find all the CAlifornian SUPPLIERs. It's a great convenience to have the query language figure out the way the join should be done, but functionally this doesn't give you any capability you don't already have with QUERY/3000. In a way, therefore (with a few technical exceptions), QUERY/3000 IS a relational query language, although not easy to use as many such languages that relational database systems support. A BRIEF, BUT IMPORTANT DIGRESSION RELATION = TABLE = DATASET. ATTRIBUTE = COLUMN = FIELD (or ITEM -- it's close enough). TUPLE = ROW = RECORD. The relational database community is fond of calling things "relations", "attributes", and "tuples". This is just needless confusion. A "relation" is a dataset, and "attribute" is a field/item, and a "tuple" is a record. Table, column, and row are more names for the same thing. I try to always use the words dataset, item, and record. If anybody throws any of the relational buzzwords at you, just do a quick mental translation. FLEXIBILITY -- BUILDING AUTOMATIC MASTERS ON THE FLY If you look carefully at the theoretical definition of relational databases, you'll find that it doesn't actually mention any analog of an automatic master dataset. It describes "relations", which are just detail datasets, and some broad parameters of the Relational Query Language, but it says nothing about "indexes", which is what most relational systems use in the same way that IMAGE uses automatic masters. Thus, a database system may well have all its selections and joins implemented using serial reads and still be relational! Of course, such a system wouldn't sell very well, so all relational systems allow you to build "indexes" just like IMAGE lets you set up automation masters. The thing that the architects of these systems did right, however, was that they let you build indexes whenever you liked, not just when you were initially building the database. Similarly, you could delete them whenever you found that the overhead of constantly updating the index exceeded the benefit that the index gave you for searching. In some ways, this feature -- although, as I said, theoretically not a "relational" feature -- is one of the most important advantages of relational systems. It's this that gave rise to the often-heard but quite misleading statement that "in relational systems, you can find things by any item, not just a search item". In both IMAGE and relational databases, you can find records using non-search items, but it'll require a serial search; in both IMAGE and relational databases, you can set up any item as a search item, but then you have to pay a penalty every time you add or delete a record. The big advantage of relational is that you can make ordinary items into search items (and vice versa) very easily -- even easier than with ADAGER. MORE FLEXIBILITY -- CHANGING DATABASE STRUCTURE ON THE FLY The ability that a relational database user has to easily build and destroy indexes is actually just a special case of the ability the user has to easily create and destroy any dataset in the database. He can add datasets, delete datasets, restructure them, and so on. This can be done not just after the system is built, but even while other parts of the database are in use. Again, any ADAGER user will understand how useful this kind of ability is (although ADAGER is actually rather more powerful than most relational databases' restructuring tools). Again, this is not part of the "theoretical" definition of databases, since the theoretical definition doesn't worry about "unimportant" things like performance or database maintenance; still, it's pretty universal among database systems. MORE USEFUL FUNCTIONALITY Relational Query Languages and Ease of Restructuring are two of the three most major features that relational databases provide (the third one I'll get to later). What I've taken pains to point out is that these features aren't unique with relational systems, and to a large extent are present in IMAGE, especially if you also throw in ADAGER. However, relational systems did pioneer these features, and should be given much credit for the spread of these features to non-relational systems like IMAGE. Some other useful features -- also not part of the theoretical definition of relational databases, but implemented in most relational systems -- are: * B-tree and ISAM indexes to the addition (or sometimes to the exclusion) of hashing indexes. With these, you can do KSAM-like accesses by which you can quickly find all the customers whose names start with "SMI" or retrieve all the parts in sorted order by part number. * Indexes on multiple items, just as if you were able to build a single IMAGE automatic master on not just the VENDOR# field of the INVOICES dataset, but both the VENDOR# and the INVOICE# fields. * Views, which look to the user just like a dataset, but which are actually the results of FINDs (or MULTIFINDs) on other dataset (s). In other words, a user (or a database administrator) can define a view called RED-CALIFORNIAN-PARTS, which corresponds to the SELECT SUPPLIER.SNAME, SUPPLIER.SNUMBER, PART.PNAME, PART.PNUMBER, PART.PSHAPE FROM SUPPLIER, PART, SUPPLIER-XREF WHERE SUPPLIER.NUMBER = SUPPLIER-XREF.SUPPLIER# AND PART.NUMBER = SUPPLIER-XREF.PART# AND PART.COLOR = "RED" AND SUPPLIER.STATE = "CA" command that we discussed above. This view will look just like a dataset which contains the data retrieved by this >SELECT -- the only difference is that the dataset won't actually take up any disc space, but rather any command that refers to it will automatically be translated into one that extracts all the requested data from the SUPPLIER, SUPPLIER-XREF, and PARTS datasets. Thus, if we define the RED-CALIFORNIAN-PARTS view to be equivalent to a SELECT like the one above, then SELECT RED-CALIFORNIAN-PARTS.SNAME, RED-CALIFORNIAN-PARTS.PNAME WHERE RED-CALIFORNIAN-PARTS.PSHAPE="CUBE" will get all the red parts that are made by CAlifornian SUPPLIERs AND are cubes (the parts, not the suppliers). * Integrity constraints, by which you can say something like "all employee salaries must be positive" or "the sum of HOURS-WORKED and HOURS-VACATION may not be more than 48". Any time a user -- through a program or the interactive query environment -- tries to add a record to the database that violates the integrity constraints, he'll get an error. On the other hand, the built-in integrity constraints of IMAGE, namely that all detail records must have corresponding appropriate manual master records, are not supported. * Virtually all relational database systems have built-in "transaction-level recovery". Without going to deeply into this -- users of IMAGE roll-forward or roll-back recovery already know what this is -- this is a good way of making sure that the database is kept safe and sound (even in case of disk errors or such). It also allows you to automatically prevent half-completed transactions -- for instance, the addition of half the line items of an invoice - from being left in the database in case of system crash. FUNCTIONAL ADVANTAGES OF IMAGE OVER RELATIONAL As a rule, most relational systems are more powerful than IMAGE. Still, the very fact that IMAGE is a network system gives it some advantages over relational systems. * The most major one is the fact that in IMAGE, all detail records must have corresponding records in any appropriate manual masters. This is a kind of "integrity constraint" (see above) that most relational systems don't support. It's often quite useful for the database itself to make sure that a user can't enter an invoice, say, belonging to a non-existent vendor. * Another advantage of IMAGE (or, to be precise, of QUERY) is that unlike many (but not all) relational database systems, QUERY can handle several databases open at once. BUT WHAT ABOUT RELATIONAL DATABASE PERFORMANCE? One thing that's often been said about relational databases is that they're slow. The reason why this rumor is so prevalent is that it's often been true. Relational databases are young creatures, not yet as well-optimized as older, more mature systems. Still, the recent (past 3 years) 'coup' of relational systems has shown great improvement, and there's no reason why they can't - now or in the near future - match and exceed the performance of older systems like IMAGE/3000. What exactly was it that has made many older relational systems slow? It wasn't the data structures - they can use hashing and B-trees just as well as IMAGE or KSAM (the fact of the index and the dataset being stored in two different places isn't really a problem). It wasn't the recovery logging - a smart relational system might actually do less I/Os to support database reliability than IMAGE does (with each one of its writes having to be posted immediately to disk). The greatest speed problem of relational systems actually happens to be their greatest performance advantage - embedded query languages. As I mentioned before, relational databases were invented almost backwards (some of their proponents claim that the non-relational databases were the ones invented backwards). Instead of describing the physical structure first - hashed masters, details, search items, sort items, double-word pointers, etc. - and then adding the query language as an afterthought, relational query capabilities were designed first and then some reasonable physical structure was built around them. That's why relational query languages are so nifty. The problem the relational architects got was this: In a relational query system, to print out the names of all the employees who made more than $50,000 per year, the names of their departments, and the addresses of their buildings, sorted by department name, we could say SELECT EMPLOYEE.NAME, DEPARTMENT.NAME, BUILDING.ADDRESS, BUILDING.CITY, BUILDING.STATE WHERE EMPLOYEE.SALARY > 50000 AND EMPLOYEE.DEPT# = DEPARTMENT.DEPT# AND EMPLOYEE.BLDG# = BUILDING.BLDG# SORT BY DEPARTMENT.NAME But, this is all in QUERY - how can we do this from our program? Do we now have to go back to the old way of doing all those DBGETs and DBFINDs? What the designers of several relational systems decided was that * YOU CAN EMBED RELATIONAL QUERY COMMANDS INTO YOUR OWN PROGRAMS! Think about it for a moment. Your program might look like this: ACCEPT "WHAT SALARY THRESHOLD DO YOU WANT?", SALARY # SELECT :ENAME = EMPLOYEE.NAME, :DNAME = DEPARTMENT.NAME, # :ADDRESS = BUILDING.ADDRESS, :CITY = BUILDING.CITY, # :STATE = BUILDING.STATE # WHERE EMPLOYEE.SALARY > :SALARY AND # EMPLOYEE.DEPT# = DEPARTMENT.DEPT# AND # EMPLOYEE.BLDG# = BUILDING.BLDG# # SORT BY DEPARTMENT.NAME # DO C This code will be executed once for each employee-department- C building triplet found, with the ENAME, DNAME, ADDRESS, CITY, C and STATE variables set to the proper values. # DOEND The query you want to execute is just embedded right in your program (with each line preceded by, say, a "#" to indicate that it's part of a query). Any variables - like the threshold SALARY or ENAME, DNAME, ADDRESS, CITY, or STATE - can be passed to and from the query. Finally, in the case of SELECT queries (which only retrieve data), a bunch of code will be executed for each thing retrieved; other queries, like APPEND, DELETE, or UPDATE can also be conveniently embeded. What an idea! This is almost like having a fourth-generation language - all the power of a procedural language with all the ease of use of the relational query facility. However, truly heroic measures have to be undertaken to be sure that this isn't the slowest database system known to man. The simplest way of allowing embedded queries is to have the compiler (or, more commonly, a preprocessor program that converts the embedded code into something readable to the compiler) compile calls to the relational query language. Think of it like QUERY/3000 wasn't a standalone program, but rather a procedure, and you'd say CALL QUERY ("FIND EMPLOYEE.SALARY > 10000") This'll work, but imagine the overhead QUERY will have to go through to parse and interpretively execute this operation! It has to recognize the FIND, find the dataset EMPLOYEE, find the data item SALARY, figure out what indexes there are on the EMPLOYEE dataset - you'll be lucky if it's done by Christmas. And, believe it or not, this is how some early relational database systems worked. Now, there is a more intelligent approach, but it's harder to implement. What it involves is that the preprocessor should do much of the query parsing when it preprocesses the user's program. It sees our SELECT, for instance, and looks up all the datasets and items that it refers to. Then, the pointers to all these things are kept around, so at run time, no parsing or dataset/item lookup needs to be done. In the best possible case, the preprocessor might actually compile the SELECT into the appropriate DBGETs, much like the interactive QUERY facility would translate the SELECT into a bunch of DBGETs that it has to do. It is on the success of this "preprocessor-time precompilation" that the performance of a commercial relational system - including, in particular, HP's new relational offering - rests. HP might decide to forbid embedded queries altogether and have a DBGET/DBPUT/etc.-like interface. This won't be all that bad, but it won't be very nice (believe me, embedded queries are VERY useful). Or, it could give us slow embedded queries and a fast DBxxx-like interface, telling us to "make our choice" - believe me, this will be no choice at all, considering just how slow non-precompiled embedded queries are. The worst thing HP can do is to give us slow (non-precompiled) embedded queries WITHOUT a procedure-level interface to the system. This means that their entire relational product will be a total, unmitigated disaster. HP might do some simple precompilation - say, parsing of the embedded query into individual tokens ("SELECT", "EMPLOYEE", etc.). WRONG. All it saves is just a little text scanning at run-time. HP might have the preprocessor parse out the entire query and also look up all the various datasets and items mentioned in it, thus saving the extra overhead at run-time. This may make for a viable, tolerably fast system. Finally, if HP is feeling really audacious - and smart - it can compile the embedded query into some kind of pseudo-code that's as close to the actual DBGET/DBPUT/etc. call level as possible. That way, all that will be needed at run-time is for the program to step through this pseudo-code and do a DBxxx call or something like that for each pseudo-instruction. If this is done, there's no reason I can see why the relational database can't be as fast or faster than IMAGE. CONCLUSION My conclusions are simple: * Relational databases are nothing revolutionarily new. Their main advantage lies not in any radically different data representation structure, but rather in a lot of useful functionality/flexibility features. * All those functionality/flexibility features are really GREAT. If HP doesn't degrade performance too badly, you ought to like the relational system a lot better than you do IMAGE. * The performance of the system depends primarily on how good a job HP does of optimizing the preprocessing of embedded queries. * You'll love embedded queries, if they're fast. If they're slow, you'll hate them. WHAT I HAVEN'T SAID AND WHY I HAVEN'T SAID IT The purpose of this paper is to clarify for the average user the differences between IMAGE/3000 and common relational database systems - like HP's imminent relational offering. To do this as tersely as possible, I've omitted many details (some of them pretty substantial ones) that I think are insignificant to the broad picture I'm trying to present. Still, I think that it's appropriate for me to point out these details and explain why I didn't think it necessary to discuss them in greater detail. * THE MATHEMATICAL THEORY OF RELATIONAL DATABASES - RELATIONAL CALCULUS AND RELATIONAL ALGEBRA. I believe that these are completely irrelevant to practical uses of relational database systems. I get particularly irritated when people say that "relational databases are Mathematically Sound" - something I've heard many a time - this implies that network databases are Mathematically Unsound. I've worked on them for six years, and believe me, they're sound enough. * OPTIMIZING LARGE, COMPLICATED JOINS AND QUERIES. In my discussion of performance, I emphasized the importance of good preprocessing and precompilation. This is a big deal for small, simple queries that take only a small amount of processing time, and thus any parsing overhead really slows them down. However, a query that does, say, a join of three 100,000-record datasets won't be slowed down much by parsing - what will speed it up most is a good Join Optimization algorithm, one which knows, say, which dataset to read through first, and so on. The reason why I didn't talk much about it is that the big three-way 100,000-tuple joins aren't all that frequent, and we can live with them being slow. The small queries have to run like greased lightning. * FIRST, SECOND, THIRD, BOYCE-CODD, AND OTHER NORMAL FORMS. Interesting for a paper on how to design a database to minimize programming problems, but irrelevant to this discussion. In any case, anything you can represent in IMAGE is normalized enough to be directly translated to relational, although some database designs may be prettified by further normalization.