Integer Keys: The Final Chapter
Fred White Adager Corporation Sun Valley, Idaho 83353-2358 · USA
adager.com
Introduction
History
After considering various options, we decided on the following:
IMAGE keys of types I, J, K and R would not be hashed.
These keys could be of any length acceptable to IMAGE.
Only the rightmost 31 bits (the "determinant") would be used to calculate the primary address. (For one-word keys, their 16 bits are padded on the left with zeroes.)
The determinant is then divided by the dataset capacity yielding a remainder which becomes the primary address unless it is zero, in which case the capacity is assigned as the primary address.
Non-hashing Key Performance Pitfalls
Example 1: The Synonym Pitfall
The bottom line is: if your R2 or R4 values don't fit this pattern, avoid using them as keys.
Example 2: The Primary Clustering Pitfall
Claim numbers Record Numbers 7100001 through 7130000 100001 through 130000 7200001 through 7230000 200001 through 230000 7300001 through 7330000 300001 through 330000 7400001 through 7430000 50001 through 80000 7500001 through 7530000 150001 through 180000 7600001 through 7630000 250001 through 280000 7700001 through 7730000 1 through 30000
Look Mom, No Synonyms
To any mathematician worth his salt, the answer to this question is, "Hell, yes".
We refer to this algorithm as "Method 3".
Claim numbers Record Numbers 7100001 through 7130000 210002 through 240001 7200001 through 7230000 45002 through 75001 7300001 through 7330000 145002 through 175001 7400001 through 7430000 245002 through 10001 7500001 through 7530000 80002 through 110001 7600001 through 7630000 180002 through 210001 7700001 through 7730000 15002 through 45001 7800001 through 7830000 115002 through 145001
Claim numbers Record Numbers 7100001 through 7130000 320002 through 350001 7200001 through 7230000 420002 through 450001 7300001 through 7330000 520002 through 550001 7400001 through 7430000 55002 through 85001 7500001 through 7530000 155002 through 185001 7600001 through 7630000 255002 through 285001 7700001 through 7730000 355002 through 385001 7800001 through 7830000 455002 through 485001 7900001 through 7930000 555002 through 20001 8000001 through 8030000 90002 through 120001 8100001 through 8130000 190002 through 220001 8200001 through 8230000 290002 through 320001 8300001 through 8330000 390002 through 420001 8400001 through 8430000 490002 through 520001 8500001 through 8530000 25002 through 55001
Let's look at some other examples.
Applying Method 2, we achieve a capacity of 1231-101+1 = 1131 and a master which is 32.36% full.
Applying Method 3 yields a capacity of 431 and a master which is 84.91% full with no synonyms:
Key Values Record Numbers 0101 through 0131 101 through 131 0201 through 0231 201 through 231 0301 through 0331 301 through 331 0401 through 0431 401 through 431 0501 through 0531 70 through 100 0601 through 0631 170 through 200 0701 through 0731 270 through 300 0801 through 0831 370 through 400 0901 through 0931 39 through 69 1001 through 1031 139 through 169 1101 through 1131 239 through 269 1201 through 1231 339 through 369
Method 1 leads to a capacity of 96366 and a dataset 3.79% full.
Method 2 yields a capacity of 96366-87001+1 = 9366 and a dataset 39.07% full.
Key Values Record Numbers 87001 through 87366 1145 through 1510 88001 through 88366 2145 through 2510 89001 through 89366 3145 through 3510 90001 through 90366 4145 through 4510 91001 through 91366 5145 through 144 92001 through 92366 779 through 1144 93001 through 93366 1779 through 2144 94001 through 94366 2779 through 3144 95001 through 95366 3779 through 4144 96001 through 96366 4779 through 5144
Method 1 results in a capacity of 2005366 and a dataset 0.36% full. Wow!
Method 2 yields a capacity of 2005366-1986001+1 = 20000 and a dataset 36.6% full.
Method 3 yields a capacity of 10366 and a dataset 70.61% full with no synonyms:
Key Values Record Numbers 1986001 through 1986366 6095 through 6460 1987001 through 1987366 7095 through 7460 1988001 through 1988366 8095 through 8460 1989001 through 1989366 9095 through 9460 1990001 through 1990366 10095 through 94 1991001 through 1991366 729 through 1094 1992001 through 1992366 729 through 2094 1993001 through 1993366 2729 through 3094 1994001 through 1994366 3729 through 4094 1995001 through 1995366 4729 through 5094 1996001 through 1996366 5729 through 6094 1997001 through 1997366 6729 through 7094 1998001 through 1998366 7729 through 8094 1999001 through 1999366 8729 through 9094 2000001 through 2000366 9729 through 10094 2001001 through 2001366 363 through 728 2002001 through 2002366 1363 through 1728 2003001 through 2003366 2363 through 2728 2004001 through 2004366 3363 through 3728 2005001 through 2005366 4363 through 4728
Method 1 requires a capacity of 931231 for a dataset 0.19% full.
Method 2 requires a capacity of 931231-890101+1 = 41131 for a dataset 4.52% full.
Method 3 yields a capacity of 2241 for a dataset 82.99% full with no synonyms.
Key Values Record Numbers 7100001 through 7125000 350002 through 1 7200001 through 7225000 75002 through 100001 7300001 through 7325000 175002 through 200001 7400001 through 7425000 275002 through 300001 7500001 through 7525000 2 through 25001 7600001 through 7625000 100002 through 125001 7700001 through 7725000 200002 through 225001 7800001 through 7825000 300002 through 325001 7900001 through 7925000 25002 through 50001 8000001 through 8025000 125002 through 150001 8100001 through 8125000 225002 through 250001 8200001 through 8225000 325002 through 350001 8300001 through 8325000 50002 through 75001 8400001 through 8425000 150002 through 175001 8500001 through 8525000 250002 through 275001
Summary
"If you use integer keys, choose a capacity at least as large as the maximum key value."
"If you use integer keys, IMAGE hashes them to determine the primary address."
False. Keys of types I, J, K and R are NOT hashed.
"Clustering is bad."
False. When used properly, clustering is the virtue of integer keys.
"If you use integer keys, always choose a prime number for the capacity."
False. None of our three capacity selection methods care about the "primeness" of numbers.
"If you use keys of type R4, IMAGE uses the leftmost 32 bits to calculate primary addresses."
False. IMAGE uses the rightmost 31 bits.
"If DBPUT has to search for a free spot in which to add a new entry (or to move an existing secondary), it inspects the next higher block and then the next lower and then the second higher and then the second lower and continues this ping-pong style of searching until it succeeds."
"Don't let master datasets become more than 80 to 85% full."
What do your worldwide HP e3000 colleagues think of Adager? See a sample of comments from real people who use Adager in the real world, where performance and reliability really count.
Back to Adager