Continued performance improvement on mass-market microprocessors depends on software developers' efficient exploitation of multiple threads of execution. Those concurrent threads may be implemented using Intel's hyper-threading technology on a single processor core, on a multicore CPU, or on a multi-CPU machine. In any case, the programming model is complex and prone to types of error that many programmers may not have learned to detect in their code.
On Aug. 28, Intel, of Santa Clara, Calif., introduced new development tools aimed at helping application developers make the most of multithreaded computing resources. Technology Editor Peter Coffee spoke with Intel Software Products Director James Reinders about the aspects of multithreaded programming that are addressed by Intel's Thread Checker 3.0 and Thread Profiler 3.0, available for free evaluation download. They also discussed Intel's new C++ threading library, Threading Building Blocks, which likewise debuted Aug. 28.
Thread interaction is a difficult phenomenon. What have you found to be the human factors in helping developers deal with it, and how does Intel's Thread Profiler help?
It has a similar philosophy to VTune; before this release, it used VTune so thoroughly that you had to have VTune on your system. That's one of the changes we've made.
It visualizes what is actually happening on the system for you. That turns out to be the key: getting into the hardware, which you may think you understand or you may not want to understand, but having a tool that can visualize that for you.
Thread Profiler does it more on a lock basis. Threads, at a very simplistic level, are either doing useful work or waiting for another thread, waiting for a lock. The visualization that Thread Profiler gives you is along those lines: which threads and activities are going on [and] how much of their time is being spent serially. It can show you a summary over time so you can look for the threads that seem to be waiting the most. You can also get a visualization over time itself, with areas where the thread is busy or waiting. It tries to recognize spin locks, where the code is actually executing but it's checking the lock over and over again. That usually is not good; it's usually wasted time.
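As an illustration of the spin-wait pattern Reinders describes, here is a minimal sketch using POSIX threads (our example, not Intel's code): the thread looks busy to the processor, but every cycle goes into re-checking the lock rather than useful work, which is exactly what Thread Profiler flags.

    #include <pthread.h>

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void spin_until_acquired()
    {
        // pthread_mutex_trylock returns 0 on success; any other value
        // means the lock is still held, so the loop tries again at once.
        while (pthread_mutex_trylock(&lock) != 0) {
            // Busy-waiting: the core executes this loop flat out,
            // so a profiler sees CPU time with no forward progress.
        }
        // ... critical section ...
        pthread_mutex_unlock(&lock);
    }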
In the new world of multiple cores and multiple threads, is it better to use a messaging kind of architecture rather than checking again and again to see if a resource is available?
If you're going to be waiting very long, it's useful to yield to other threads and let the operating system go do something else. It depends: If you have a tightly coupled application where multiple threads of execution are running on different processors, it may not be as efficient to yield and let another application come in and kick your data out of cache.
Long waits should definitely be yielded, but, with multiple cores, generally you have a shared cache among at least some of the cores.
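One common way to act on that advice (again, a sketch under our own assumptions, not Intel's code; the SPIN_LIMIT threshold is an illustrative tuning value) is to spin briefly, so a tightly coupled thread keeps its data in cache, and then fall back to yielding:

    #include <pthread.h>
    #include <sched.h>

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void acquire_with_backoff()
    {
        const int SPIN_LIMIT = 1000;  // illustrative threshold, not a recommendation
        int spins = 0;
        while (pthread_mutex_trylock(&lock) != 0) {
            if (++spins < SPIN_LIMIT)
                continue;         // short wait: keep spinning, keep cache warm
            sched_yield();        // long wait: let the OS run something else
        }
        // ... critical section ...
        pthread_mutex_unlock(&lock);
    }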
How should people understand the difference between Intel's Thread Checker and Thread Profiler tools?
The distinction I make is that Thread Profiler helps you tune the code. It's not looking for anything that is per se an error, but it's helping show you opportunities to be more efficient. Thread Checker is looking for errors. They may be errors that are causing the program to fail, but even more valuable, perhaps, is that it can find potential errors.
When I say “potential,” they're actually real errors in the code, but they aren't causing it to fail at that moment. Parallel programming with locks offers opportunities to have deadlocks or race conditions. Those can be intermittent; they can rear their ugly heads just by running the program multiple times even with the same data, or by running it on different machines. That can be a nightmare.
If you're putting together a threaded application and you don't know that you've got every lock done perfectly, when you ship your application, it may just freeze up occasionally or get wrong results because of these types of errors. Thread Checker can find those directly and point out where the program seems to be missing synchronization.
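For concreteness, here is a minimal sketch (our example, not one of Intel's) of the kind of latent defect a race detector like Thread Checker reports: two threads update a shared counter with no lock, and the program may produce the right answer on many runs while still being wrong.

    #include <pthread.h>

    long counter = 0;  // shared, unprotected

    void* worker(void*)
    {
        for (int i = 0; i < 1000000; ++i)
            ++counter;  // read-modify-write race: increments can be lost
        return 0;
    }

    int main()
    {
        pthread_t a, b;
        pthread_create(&a, 0, worker, 0);
        pthread_create(&b, 0, worker, 0);
        pthread_join(a, 0);
        pthread_join(b, 0);
        // The expected total is 2,000,000, but without a mutex around
        // ++counter the result varies from run to run; a race detector
        // flags the missing synchronization even on runs that pass.
        return 0;
    }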
Is Linux support substantially broadened in the release of these tools?
A few years ago, we added Linux support in our compilers; more recently, about a year ago, we got VTune native on Linux, and this is the first time that Thread Checker is on Linux. We have gotten more of our technology, our analysis software, moved there.
Linux is fairly different from Windows when you get down into the guts of the hardware: the interface with the OS. That did take us some extra effort. Our libraries and compilers have been on Linux for a long time now.
Are there some big bullet points in how Linux and Windows differ in that regard?
There are two things. We access some hardware registers that aren't normally managed by the operating system: the event registers, for instance, that the processor has and that VTune leans on. We write device drivers to interact with and manage those registers, and device drivers are different enough between Windows and Linux that it isn't just a port.
The other thing is that we try very hard with these tools to interact with the whole system. DLLs get loaded dynamically, applications come in and out of memory, and we need to interact with the operating system to understand whats currently in core.
When things are taken in and out of core, we want to know that, [and] we want to be able to understand the virtual memory addressing that the operating system is using, because when we get an event or a trap, we get handed a real address or a virtual address, and we need to understand both. We need to understand which space we're in.
We try to track individual threads, Windows threads or POSIX threads on Linux. We're understanding the operating system's memory map and interacting with that.
From the viewpoint of these tools, is there any difference between hyper-threading, multicore CPUs and multiple physical CPUs on a machine?
From the tools' perspective, the simple answer is “no.” There are differences in cache sharing and resource sharing, but the programming model is the same. You're using a model that's thread-oriented.
The hardware, we try to abstract as just hardware that can run a lot of threads. In the future, you might be running a quad-core processor, each core with hyper-threading, and there might be two of them in the machine. What we would end up showing you is a 16-threaded machine.
Are there any hard-coded limits on the number of threads you might be using in these tools?
Thread Checker uses algorithms that take more time to analyze the more threads you have, but Thread Profiler and VTune don't have that issue. They're all designed to go to as many threads as you throw at them. VTune can handle 4,000; tools like Thread Profiler and Thread Checker use some of the same technology, so we haven't hard-coded any small limits into them, that's for sure.
Threading Building Blocks is a brand-new product, is that right?
Yes, that's new. It extends the C++ language using templates, a standard feature of C++. We've added the common features that someone would need. The most important thing about using a package like Threading Building Blocks is that you don't spend any time doing explicit thread management. You don't create a loop to create one thread for each processor, [and] you don't go computing bounds on the problem you want to solve to put those into arrays. You can spend a lot of code doing that sort of thing.
If we tell programmers that the best way to program threads is using explicit threads and writing all that code themselves, one of two things will happen: They'll either say “No,” or they'll go do it, and then they'll be frustrated over time. There are a lot of things you want to consider about scalability and so forth; you'll be back revisiting that code over and over again as you learn more and more about the effects of going to larger and larger numbers of cores.
Using a package like Threading Building Blocks, or using OpenMP and our compilers, or using our libraries, we've thought those things through; we've got them right now. If there are enhancements to take advantage of future hardware, we'll revise the products in future years, and you'll still just use them the way you do today. You'll get the benefits automatically.
So with the Threading Building Blocks, do I essentially just instantiate the class “Parallel Task” and tell it what I want that instance to do? And all the things that involve starting up and monitoring threads are taken care of for me as a result?
Yes, exactly. You explain the parallel task, [and] you may use a construct like a “For” or a “Reduce” or a “While” operation. Tell it to execute that many times, and, behind the scenes, we do all the management of thread creation [and] all the management of how many threads should be instantiated.
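In the functor style of Threading Building Blocks, that looks roughly like the following sketch (ApplyScale, the array and the grain size are our illustrative choices; parallel_for, blocked_range and task_scheduler_init are the library's own constructs):

    #include "tbb/task_scheduler_init.h"
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    // The "parallel task": what to do with one chunk of the index range.
    class ApplyScale {
        float* const data;
    public:
        ApplyScale(float* d) : data(d) {}
        void operator()(const tbb::blocked_range<size_t>& r) const {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0f;  // the per-element work
        }
    };

    int main()
    {
        tbb::task_scheduler_init init;  // library decides how many threads
        const size_t N = 1000000;
        float* data = new float[N];
        for (size_t i = 0; i < N; ++i)
            data[i] = float(i);

        // No explicit thread creation and no manual bounds arithmetic:
        // the library splits [0, N) into chunks and schedules them.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, N, 10000),
                          ApplyScale(data));

        delete [] data;
        return 0;
    }

The 10,000 here is an assumed grain size, the smallest chunk worth handing to a thread; everything else, from thread creation to load balancing, is left to the library.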
So, run-time issues, such as looking at available resources and deciding how many threads to spin up, are abstracted in those classes as well?
Yes, absolutely. You write your program and just say, “Use as many threads as you can.” On a single-threaded processor, the code will work, [and] on a 16-threaded system, it will figure out how to map those. We've learned from past work to offer abstractions that we can lean on in the run-time.