Will Linux Luminary Shred SCO's Unix Claims?

By Peter Galli  |  Posted 2003-09-08

Linux luminary Eric S. Raymond, the president of the Open Source Initiative, has developed a tool that looks for common code segments in large source trees and which, on an Athlon 1.8 GHz box, has an effective comparison rate of over 55,000 lines per second.

Linux luminary Eric S. Raymond is taking the fight with The SCO Group right back to the basics: he has developed a utility known as a comparator that looks for common code segments in large source trees and which, on an Athlon 1.8 GHz box, has an effective comparison rate of over 55,000 lines per second. Raymond, the president of the Open Source Initiative, declined for legal reasons to say whether he had developed the comparator specifically to compare older Unix code to Linux so as to be able to refute SCO's claims that the 2.4 kernel and beyond contain proprietary Unix code. But he did admit that "I am grinning a grin that should frighten the thieves and liars at SCO out of a week's sleep."
His comparator, the code for which is available for download, uses a variant of an algorithm called "shred," which bears a resemblance to some techniques used for DNA sequencing. The source trees get sliced into overlapping three-line shreds. The shreds then get turned into a list of 32-byte signatures by a process called MD5 hashing; each signature keeps information about its file and line number range.
"If the MD5 signatures are different, then the shreds that they were made from are different. When they match, it is almost certain that the two shreds they were made from are the same, to within odds of eighteen quadrillion to one. MD5 is normally used for making unforgeable digital signatures, but the side effect I'm exploiting is that it gives you a fast way to compare texts for equality," Raymond told eWEEK on Monday.
So, once all the signatures from all the code trees have been included in the comparator, all the "unique" signatures are then thrown out, leaving a list of shreds with duplicate signatures or common code segments. From there it is just report generation, he said.
"The shred technique has two advantages: one, it's amazingly fast; and two, if you have the hash list for a given source code tree, you can do overlap reports with other trees without having the original code.
"My version fixes a flaw in the original shred algorithm and uses an implementation trick that gives enormous speedups on machines with enough RAM to hold the signature lists without swapping. I didn't invent the shred technique, but I may have perfected it," Raymond said.
When asked what the next step was for the comparator, Raymond said that "various persons will apply it in useful ways. Yes, I'm being deliberately vague and tantalizing."
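The shred steps described above (slice into overlapping three-line shreds, hash each shred with MD5, keep file and line range, discard signatures that are not shared) can be illustrated with a minimal Python sketch. This is not Raymond's code: the function names `shred_signatures` and `common_shreds`, the sample inputs, and the choice to treat a signature as "unique" when it appears in only one named source are all assumptions made for illustration. It also does no whitespace or comment normalization. Note that Python's `hashlib` returns an MD5 digest as 16 raw bytes, or 32 characters when written in hexadecimal, which is the form used here.

```python
import hashlib
from collections import defaultdict

SHRED_SIZE = 3  # the article describes overlapping three-line shreds


def shred_signatures(name, text, shred_size=SHRED_SIZE):
    """Return (md5_hex, name, first_line, last_line) for each overlapping shred.

    `name` is a hypothetical label for the source (for example a file path).
    Line numbers are 1-based and inclusive.
    """
    lines = text.splitlines()
    signatures = []
    for i in range(len(lines) - shred_size + 1):
        shred = "\n".join(lines[i:i + shred_size])
        digest = hashlib.md5(shred.encode("utf-8")).hexdigest()
        signatures.append((digest, name, i + 1, i + shred_size))
    return signatures


def common_shreds(*signature_lists):
    """Keep only signatures that occur in more than one named source.

    Works on signature lists alone, so the original text is not needed here.
    """
    by_digest = defaultdict(list)
    for signatures in signature_lists:
        for digest, name, first, last in signatures:
            by_digest[digest].append((name, first, last))
    return {
        digest: locations
        for digest, locations in by_digest.items()
        if len({name for name, _, _ in locations}) > 1
    }


# Two tiny made-up "source trees" that share one three-line run.
tree_a = "int a;\nint b;\nint c;\nreturn a;\n"
tree_b = "void f();\nint a;\nint b;\nint c;\n"

sigs_a = shred_signatures("a.c", tree_a)
sigs_b = shred_signatures("b.c", tree_b)

for digest, locations in common_shreds(sigs_a, sigs_b).items():
    print(digest, locations)
```

Because `common_shreds` takes only the signature lists, the sketch also reflects the second advantage Raymond mentions: an overlap report can be produced from a hash list without access to the code it was computed from.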
While there are two similar tools in the Linux community at the moment, one developed by Eric Kidd and another by Marius Giuglea, Raymond has been in contact with both authors and believes his is faster than either of their programs and has significant functional improvements. Raymond also says the comparator is a very small, elegant program, barely 1,500 lines of code with a simple interface. "I expect to have no difficulty maintaining it solo, but I also expect the odd patch and minor bug fix will come in," he said.
Peter Galli has been a financial/technology reporter for 12 years at leading publications in South Africa, the UK and the US. He has been Investment Editor of South Africa's Business Day Newspaper, the sister publication of the Financial Times of London.

He was also Group Financial Communications Manager for First National Bank, the second largest banking group in South Africa, before moving on to become Executive News Editor of Business Report, the largest daily financial newspaper in South Africa, owned by the global Independent Newspapers group.

He was responsible for a national reporting team of 20 based in four bureaus. He also edited and contributed to its weekly technology page, and launched a financial and technology radio service supplying daily news bulletins to the national broadcaster, the South African Broadcasting Corporation, which were then distributed to some 50 radio stations across the country.

He was then transferred to San Francisco as Business Report's U.S. Correspondent to cover Silicon Valley, trade and finance between the US, Europe and emerging markets like South Africa. After serving in that role for more than two years, he joined eWeek in August 2000 as a Senior Editor covering software platforms.

He has comprehensively covered Microsoft and its Windows and .Net platforms, as well as the many legal challenges it has faced. He has also focused on Sun Microsystems and its Solaris operating environment, Java and Unix offerings. He covers developments in the open source community, particularly around the Linux kernel and the effects it will have on the enterprise.

He has written extensively about new products for the Linux and Unix platforms and the development of open standards, and has looked critically at Linux's potential as an alternative operating system and platform to Windows, .Net and Unix-based solutions like Solaris.

His interviews with senior industry executives include Microsoft CEO Steve Ballmer; Linus Torvalds, the original developer of the Linux operating system; Sun CEO Scott McNealy; and Bill Zeitler, a senior vice president at IBM.

For numerous examples of his writing you can search under his name at the eWEEK Website at www.eweek.com.

