Every organization creates data in the course of its operations, often in such large quantity and variety that it can be challenging to store it all, let alone make it available for analysis.
Data warehouses offer organizations a means of storing and exposing their data for mining, but setup and maintenance costs, along with the planning and user training tasks required to make data warehouse projects successful, can dissuade companies from deploying these systems.
Enter iLuminate, which sets out to address these data warehousing limitations with iLuminate 4.0. Where most data warehouses are built on relational DBMSes-be they row- or column-oriented-iLuminate describes its product as a correlation database.
Rather than store data in tables, iLuminate 4.0 organizes information in value pools based on data type, with an auto-generated indexing system that keeps track of the values’ context. This fully indexed, value-based storage approach can yield significant performance benefits, but I was most impressed by iLuminate’s knack for making data available for analysis with very few planning or design requirements.
iLuminate Version 4.0, released June 3, boasts a new 64-bit architecture, which, along with new multithreading enhancements, enables the product to accommodate more data and more concurrent connections. In addition, this version includes Java, .NET and C++ APIs to allow custom analytic applications to access the iLuminate engine.
In my tests of the product, I was able to pour a data set that spanned about 30 million records into the engine and, without any other organization or optimization, begin drilling arbitrarily through the data using Illuminate’s analysis tool, iCorrelate. What’s more, I could access any of my tables or stored queries through an external application (in my case, Microsoft Excel) via ODBC.
I found the user interfaces for iLuminate and its associated tools rough in places, but I was impressed by their capabilities nonetheless. According to the company, an overhaul of these tool interfaces is a focus for an upcoming version.
iLuminate 4.0 is priced starting at $41,900 for databases of up to 35 million records, with concurrent connection fees that start at $2,750 for up to four connections. The iCorrelate exploration and analysis tool is priced at $8,000 per seat, and the company’s quick data profiling and loading tool, iLook&Load, is priced at $10,000.
For organizations in search of better ways to store and analyze their data, iLuminate 4.0 is well worth further investigation, such as through one of the free proof-of-concept pilots that iLuminate details here.
iLuminate in the Lab
I tested iLuminate on a Sun Microsystems Sun Fire x4200 server with a pair of dual-core 2.39GHz AMD processors and 8GB of RAM, running the 64-bit version of Windows Server 2008 Enterprise. (iLuminate has a Linux version of its engine in the works for later in 2009.)
For my test data, I used campaign finance data from the Center for Responsive Politics’ OpenSecrets.org site. The data, which was stored in a series of CSV (comma-separated values)-formatted text files, detailed campaign contributions, candidates, political action committees and expenditures spanning the 1990 to 2008 election cycles.
The data set, which added up to about 30 million records, encompassed more than 545 million values, about 49 million of which were unique. Because of iLuminate’s data pool plus index approach, in which each value is stored only once, there’s a sort of deduplication at work here, which saved on database size. My test database occupied around 10GB of disk space.
iLuminate stresses the speedy “time to analytics” that its product enables, as well as its support for ad hoc queries-the freedom to begin digging through data before you necessarily know exactly what you’re looking for. I could appreciate this early along in the testing process, as I began digging arbitrarily through the campaign data.
For instance, one of the first tables I imported contained individual campaign contributions for the 1990 to 2008 election cycles. Once I’d loaded the data, I opened the Metadata Explorer within iCorrelate (the primary front-end tool for iLuminate), expanded the list of fields in the table and looked at the statistics item under the Amount field. There, I could see that I had 15.6 million contribution entries with an average amount of $943.62.
While iLuminate stores data in indexed pools, rather than in table structures, the table and column organization from my raw data persisted as logical structures, and I could use the Metadata Explorer to drill down into a particular column and view the data within, as well as check out auto-generated statistics for that column. For example, I found in a few clicks that 12 percent of the individual contribution records in my data set were from California contributors.
To dig a bit deeper into my initial table of data, I turned to iCorrelate’s query tool, where I could perform text searches for particular names or words, either in individual columns or across the whole table. For example, I searched for instances of the word “songwriter” in the occupation field of the individual contributions table and turned up 410 records out of about 15 million in 0.8 seconds.
I could drill down into this set of records in the same way that I could for whole columns and tables, quickly pulling out statistical information from iCorrelate. I found that while my data set covered 1990 to 2008, my search turned up results only from 2002 to 2008, which led me to wonder whether the field I’d searched on had been used in the same way over the entire span of my data set.
To find out whether this was the case, I searched for “songwriter” across the whole table, and found 2,229 instances. This broader search took 4.5 minutes to complete. I then narrowed my search to records in the cycles between 1990 and 2000 and, by glancing at the 715 records my follow-up search revealed, learned that the occupational information field I initially chose was less consistently used than another field available in the table, a fact about the data that I noted for my future analysis.
By loading more tables into the repository, and by doing a bit of setup work, I was able to make my campaign data exploration easier and more fruitful. Specifically, I returned to the tool’s Metadata Explorer and set out to create some relationships among the individual contributions, candidates and committees tables. The individual contribution table includes a field that identifies the committee or candidate that received each contribution.
I used iCorrelate’s expressions feature to create two copies of the recipient ID columns-one to link to the candidates table and the other to link to the committees table. With these links in place, I returned to my songwriter query and used the tool’s Get Relations option, which returned all the committee and candidate records that my campaign-contributing songwriter entries pointed to.
Again, I could drill down into the results and view the list of candidates and committees that received contributions from songwriters, as well as view statistics on those results. For example, 86 percent of the candidates that received the songwriter contributions were Democrats.
The Get Relations option has a sibling, Find Relations, which I tacked onto my list of songwriter-supported candidates to find all the records that pointed to those candidate entries. This leg of my search returned 3.9 million records in 5.3 seconds, and, as before, I could pull up assorted statistics about my new set of records.
For instance, the gender breakdown among all contributions to the candidates who were popular among contributing songwriters was 59 percent male to 33 percent female, compared with a 67 percent male to 26 percent female breakdown across the entire set of contribution records.
At each step at which I could view data and statistical information, I could generate typical chart types within iCorrelate, as well as export my data sets to CSV files, create reports or make the data available over ODBC as queries. I could also interact with the data from external applications using SQL, although I did not test this option.
Loading and Administration
Of course, before I could begin exploring my campaign data, I had to load it into the iLuminate engine, a task for which Illuminate offers two different tool options: the iLuminate Importer and iLook&Load.
With each tool, I identified my data source, the sort of delimiter that marked off my records, the data types of each column, the presence or absence of a header row, and the table into which my data should be loaded. I could also apply rules to modify the data as it was loaded into the repository.
The Importer tool worked more quickly, and iLook&Load tool provided more information about my data as it entered the system, such as the totals and averages of integer-typed columns, as well as the maximum and minimum values of individual columns. iLook&Load also accepts a broader range of input types, including ODBC sources, text and XML files, Access databases, and Excel spreadsheets. The stock importer accepts text files and ODBC sources.
I loaded my largest test table-the 15-million-record, CSV-formatted individual contributions table-with each tool. The Importer took about 1.5 hours to complete the load, and iLook&Load required about 5.5 hours to do the job.
Before I was able to load the data into my repository at all, I had to do some cleanup, such as dealing with nested quotation marks within some of my fields. Contributors who took pains to include their nicknames (“Tex”) within their name entries gave me fits, for instance.
As with any other data warehouse project, organizations that select iLuminate will need to use an ETL (extraction, transformation and loading) tool to first clean up their data. For the purposes of my tests, I focused primarily on getting all of my records to load, but actual deployments will demand more attention to data cleansing.
iLuminate runs as a pair of Windows services-one for the core engine and another for its ODBC support. I could start, stop, create and switch among database files using a service monitor tool that would minimize to my system tray. The monitor also provided performance details, as well as information about open queries, active connections and storage characteristics.
Also from the service monitor, I was able to enable remote access to the engine for other iCorrelate clients or ODBC consumers on my network.
Executive Editor Jason Brooks can be reached at [email protected]