Ricochet is the best place on the internet to discuss the issues of the day, either through commenting on posts or writing your own for our active and dynamic community in a fully moderated environment. In addition, the Ricochet Audio Network offers over 50 original podcasts with new episodes released every day.
The Texas Two-Step
In a recent installment of @hankrhody ‘s excellent (and delicious) series on building a computer, he wrote about how to do a binary search. In the comments, I made an oblique reference to a better way to do that kind of matching, referring to something I called the Texas Two-Step.
Now don’t get me wrong; in many situations Hank’s solution is an excellent choice, particularly if you want to do a single, real-time lookup. For a single or only a few lookups, Hank’s way is hard to beat. But what if you want to shuffle the whole deck?
First let’s review what Hank described:
Still going to take a while, even for a computer. That half second while it’s thinking is enough for me to refresh my Ricochet alerts and get distracte. But hey, phone books are alphabetical by last name. Let’s try again, using that information:
- Open the book to the middle. Check any name on that page.
- If it’s Hank, you’re done (lucky you!)
- If it isn’t, check whether it’s alphabetically before or after it. Probably before; R is towards the end of the alphabet.
- Stick your thumb in the book at that page. Open the phone book halfway between your thumb and the correct end of the book.
- Repeat.
Each time you’re halving the amount of names you’d have to look through. Even if you’re doing it stupidly you’re going to get to my name after something like thirty checks. It’s a little more complicated than check each name in order but it’s much quicker.
In Hank’s example, the list of phone numbers was already sorted by last name. The first component of using the Texas Two-Step is that both datasets must be sorted, by the same value, that value being the one used for matching. If you are matching by Social Security Number, then sort them both by Social Security Number. Obviously, the value must appear in both datasets in order to match in the first place. There is just the additional requirement that both sets be sorted, instead of just the one being searched.
Once the data is sorted, the processing is ridiculously simple, so simple that it barely deserves to be called an algorithm. Just step through each dataset sequentially, moving through them in a synchronized manner, capturing the matches as you go. The steps in each table are done such that you are continuously trying to bring the key values together. That means if the key values for one fall behind, you step forward in that set to try to catch up to the values from the other.
To illustrate this, I’ll show a pseudocode example. This won’t be valid syntax for any particular language but will hopefully be understandable even if you won’t know any languages. For simplicity’s sake, this example assumes that the two datasets are loaded entirely into memory as arrays of records, using a counter (an iterator) as a zero-based index for the specific record within each dataset. Thus, dataset[0], with an index of zero, would refer to the first record in that dataset. Obviously, for large datasets this wouldn’t be practical because of memory limitation, but the same technique can be applied to data being streamed in a little at a time. As long as you can step one record at a time, in the preferred sort order, it will work just as well
I’m also assuming in this example that the keys can be compared with simple operators. If the match values happen to be strings, then standard string comparison functions can be used.
To find all the one-to-one matches between two datasets, called dataset_a and dataset_b, the following processing can be used.
—————————————————————————————————-
// define two indices (iterators)
Int a_iter = 0, b_iter = 0;
// define working variables for the key of each dataset
Value a_key, b_key;
// keep processing until you run out of records in one of the sets
Do while a_iter < dataset_a.count AND b_iter < dataset_b.count
— a_key = dataset_a[a_iter].key; // get the key for the current dataset_a record
— b_key = dataset_b[b_iter].key; // get the key for the current dataset_b record
— if a_key < b_key then // if the dataset_a key is less than the dataset_b key
— — a_iter = a_iter + 1; // increment the iterator for dataset_a
— else if a_key > b_key then // if the dataset_a key is greater than the dataset_b key
— — b_iter = b_iter + 1; // increment the iterator for dataset_b
— else // if it’s not less than and not greater than, then it must be equal
— — // do match processing
— — a_iter = a_iter + 1; // increment the iterator for dataset_a
— — b_iter = b_iter + 1; // increment the iterator for dataset_b
— End if
End do
——————————————————————————————-
As mentioned, this version handles one-to-one relationships. One-to-many, and many-to-many relationships can also be processed, by adding additional handling at the point where the match processing is performed. A single internal loop there will bring it up to one-to-many handling, while an internal loop within a loop will handle many-to-many. All of this code is quite simple to implement.
As mentioned before, this is not always the best technique to use, but when it is, it’s hard to beat. It all depends on the ratio between the two datasets. In Hank’s example, that ratio is 1:10,000,000, a lousy time to try this out. With thirty reads required to find a match in a dataset that has 10 million elements, then for anything less than 333k in the lookup set, Hank’s way will be faster. But if you’re ConEd, and you have 6 million customers, then this style of processing will leave Hank’s method gasping in the dust.
And in many cases, particularly when it comes to merging data for reporting, the ratios will quite naturally be much closer together. This can also be used to match in a wide variety of situations. I’ve used this to do collation and merging, processing for Boolean logical operators like And, OR and NOT, parameter processing to apply filters, and a wide range of other handling.
In the cases where the ratios make this technique advantageous, it’s ridiculously fast. If you’re doing database accesses for each read in Hank’s method, it’s even faster, because this method will be limited to two read operations, bringing back all the necessary records in one database operation for each dataset.
NB: Forgive the dash marks at the start of some lines of code. It was the only way I could find to indent for readability.
Published in General
Okay, a genuinely fascinating post comes in. It’s the middle of the night. I’ll re-read this after I wake up later today.
This is the kind of post that’s like the brain tester and trainer created by the incredibly advanced (and, alas, extinct) Krell beings in “Forbidden Planet” (1956). If a brave human uses it successfully, it’ll raise your IQ by 200 points. But if it overloads your circuitry in a vain and foolish attempt to match wits with the creator, in this case Judge Mental, it could leave you in a stupefied coma, unable to combat deadly invincible monsters from the Id. Nobody wants that.
I finished my glass. Too bad that marvelous Kel gal tapped out early. I like her drinking spirit.
Thanks for a clear comparison of sorting methods I’d never have thought of, and wished I had. Your ability to explain this to sci/cyber/eng laymen is, like Hank’s, a gift to the membership that by even the coldest Scrooge-y measure worth a nice chunk of the monthly Rico fee.
We can’t even classify it as most people would, as technology in the strictest sense, if by that you mean transistors and chips and voltages; the Judge is talking about levels of abstract logic that’s theoretically independent of the particular technology used to implement it. In theory, you could build a gigantic processor of this sort protocol that would operate through water pressure and gravity, running over water wheels and through sluice gates–but it would take a lot of damn wheels, ramps and gates.
It wasn’t water powered, but the Babbage Engine did it all with little clicky metal pieces.
During WWII, presidential scientific advisor Dr. Vannevar Bush, one of the men who convinced FDR to build the A-bomb, conceived of something called the “Memex”, what we’d call a hyperlink database able to sort microfilmed text, drawings and statistics via a series of souped-up phone relays.
Harvard and IBM would indeed lavish effort on this type of electro-mechanical computer system, the Mark I. For rapid calculation of things like artillery firing angles, Bush recommended the interim use of analog mechanical systems somewhat similar to Babbage’s Difference Engine as listed above, roughly 90 years before.
When I was taught this technique it was called “match-merge.” Key requirement, duly noted, is that both data sets are sorted on the key being matched.
You were taught this? I had to ‘invent’ it myself.
BTW, I like my name better. ;-)
Yes, if I remember correctly, it was the fifth program we wrote during training at Price Waterhouse.
Yes, your name is better, if less descriptive.
I do get the feeling I’ll be referring back to this one. One of the joys of making things up as you’re going along; the realization that there are better things out there.
Yeah, but when you think you’ve gotten something new and great, and then find out it’s been around for twenty years it’s a downer.
Thanks for a thought-provoking post that doesn’t mention anyone currently in the news. ;)
Sorry about that. For what it’s worth, “twenty years” is probably an understatement. I’ve long since misplaced my collection of Knuth, but I suspect that his Sorting and Searching, volume three of his famous The Art of Computer Programming series, probably touched on this in the section on minimum-comparison selection . . . in 1973.
As for the formatting of your post, let’s agitate Ricochet for a pseudo-code formatting button. I’d love to see more geek stuff on here (though Hank is doing a fantastic job all by himself).
I don’t see the issue.
It would have been nice to have some more flexibility. Text color would be nice; I would have done the comments in green. And those tab dashes in white text on a white background, so that it just appeared to be tabbed. I would have astonished some folks around here who’ve tried to do formatting.
Those of us who have spent most of our lives in C, C++, and C# find that double-dash evocative of the prefixed decrement operator; throw in a lot of SQL experience and they look like comments.
I think we should get Max right on this. And yes, syntax color highlighting would be a terrific addition.
I’m sure at least six of us here on Ricochet would enjoy it.
It’s no accident that when I briefly dropped into code in my post that it was done in a screenshot.
I thought about that, but I really don’t have anything better than Wordpad to use. I bought this machine after I retired and I haven’t set up a single thing to deal with code.
Which is why G-d invented vi.
You’re one of those?
I thought that was Satan.
I fought against it for a decade or more. I was an emacs guy, then commercial products (big MultiEdit fan for years). But I do a lot of remote support of automation systems, and vi is such a good least-common-denominator that I can drop on whichever system I’m supporting, whether Windows or Linux, that I finally gave up and bit the bullet.
I still don’t know how to edit the local configuration file. I live in Visual Studio and Eclipse most of the time, but twenty years of vi have made it second nature for minor editing, viewing, even multi-file stuff.
I once thought as you do. Now I am of the body.
Bram Molenaar (sp?) ported it to Windows, and called it Vim.
At some point maybe I’ll write another post on how this technique can be used as the basis for some far more complicated merging. I call that one the Time Warp. I call it the Time Warp because… it’s just a jump to the left. And then a step to the right.
Whether you choose to do the pelvic thrust, on the other hand, is entirely up to you.
Again, I don’t see the problem.
Most BBS or BBS-like platforms have a [code]lots-of-your-favorite-gibberish[/code] markup that lets you place literal mono-spacing texts. Would be surprised if there isn’t one lurking somewhere underneath Rico, just not exposed.
If so, it would be nice to know what escapement magic might create a portal to that lower, closer-to-the-HTML layer.
And if it gave us access to blinking fonts, whoa! Stand back!
How about the system beep?
^G ^G ^G
;)
Shift-Alt-h gives the keyboard shortcuts.
Shift-Alt-X goes to Code.
<html>
<p>Let's see what this does.</p>
<blink>Farm out, Henry!</blink>
</html>
It still removes spaces, though.