Linux luminary Eric S. Raymond is taking the fight with The SCO Group right back to the basics: he has developed a utility known as a comparator that looks for common code segments in large source trees and which, on an Athlon 1.8 GHz box, has an effective comparison rate of over 55,000 lines per second.
Raymond, the president of the Open Source Initiative, declined for legal reasons to say whether he had developed the comparator specifically to compare older Unix code to Linux so as to be able to refute SCOs claims that the 2.4 kernel and beyond contain proprietary Unix code.
But he did admit that “I am grinning a grin that should frighten the thieves and liars at SCO out of a weeks sleep.”
His comparator, the code for which can be downloaded here, uses a variant of an algorithm called “shred,” which bears a resemblance to some techniques used for DNA sequencing.
The source trees get sliced into overlapping three-line shreds. The shreds then get turned into a list of 32-byte signatures by a process called MD5 hashing; each signature keeps information about its file and line number range.
“If the MD5 signatures are different, then the shreds that they were made from are different. When they match, it is almost certain than the two shreds they were made from are the same, to within odds of eighteen quadrillion to one. MD5 is normally used for making unforgeable digital signatures, but the side effect Im exploiting is that it gives you a fast way to compare texts for equality,” Raymond told eWEEK on Monday.
So, once all the signatures from all the code trees have been included in the comparator, all the “unique” signatures are then thrown out, leaving a list of shreds with duplicate signatures or common code segments. From there it is just report generation, he said.
“The shred technique has two advantages: one, its amazingly fast; and two, if you have the hash list for a given source code tree, you can do overlap reports with other trees without having the original code.
“My version fixes a flaw in the original shred algorithm and uses an implementation trick that gives enormous speedups on machines with enough RAM to hold the signature lists without swapping. I didnt invent the shred technique, but I may have perfected it,” Raymond said.
When asked what the next step was for the comparator, Raymond said that “various persons will apply it in useful ways. Yes, Im being deliberately vague and tantalizing.”
While there are two similar tools in the Linux community at the moment, one developed by Eric Kidd and another by Marius Giuglea, Raymond has been in contact with both authors and be-lieved his is faster than either of their programs and has significant functional improvements.
Raymond also says that while the comparator is a very small, elegant program, barely 1500 lines of code and with a simple interface, “I expect to have no difficulty maintaining it solo, but I also expect the odd patch and minor bug fix will come in.”
Discuss this in the eWEEK forum.