When I reviewed the iLuminate
4.0 data warehousing product in June, I spent a healthy share of my testing
time getting the data I'd chosen for testing—a set of campaign finance records
from the OpenSecrets project—cleaned up and ready for loading into the
iLuminate product. This involved writing shell scripts to target small import
snags in the data set, as well as divvying up some of my comma-separated values
files to duck Excel's million-row-per-spreadsheet limit (a limit that, like the
apocryphal 640K of PC memory, I never thought I'd hit).
The next time I have to deal with large data sets during testing, I may well
turn to Talend Open Studio, an open-source ETL (extraction, transformation and
loading) product that makes it easy to round up data, tweak it en masse, and
load it into target systems such as databases and enterprise applications.
For eWEEK Labs' images of Talend Open Studio in action, click here.
Talend Open Studio is built on top of the Eclipse platform, which made the
product's interface somewhat familiar to me right from the start.
Licensed under the GPL and available for free download at www.talend.com,
Talend Open Studio is a powerful tool in its own right. But for larger data
integration projects, organizations can tap Talend Integration Suite, a
subscription-based version of the product that adds technical support and
additional capabilities intended to support large teams and deployments.
Pricing for Talend Integration Suite starts at $4,000 per developer seat per
year. Talend offers other editions that target improved parallel processing and
real-time performance. (For the full lineup of Talend data integration
products, go to here.)
Scratching the Surface
In my tests of Talend Open Studio, I only scratched the surface of what the
tool can do. The product ships with an impressive range of components for
accessing data sources and targets, as well as components for manipulating and
integrating data in a variety of ways. Also, the product, which hews to a
code-generation strategy involving Java or Perl code, is quite extensible.
I tested Talend Open Studio Version 3.1 on a machine running 64-bit Ubuntu
Linux 9.04. TOS is also available in versions for PowerPC and x86 versions of
Linux, for Solaris, for 32- and 64-bit versions of Windows, and for Apple
machines running OS X.
I began my tests by firing up TOS and creating a new project. It was at this
point that I could choose between creating a Java- or Perl-based project. For
my tests, I stuck with Java-based projects. I set out to import the bulk
campaign finance data into a MySQL database by first creating metadata elements
for the comma-separated value text files in which the candidate and individual
contributor data I wanted to work with were stored.
TOS presented me with a very straightforward wizard that stepped me through
singling out my data file and identifying the proper field and row separators
and escape characters required to properly parse my file. As with other import
tools I've used, the Talend tool included a preview window that made it easy to
see that my file would be parsed as expected.
I also used this wizard to populate my column definitions with names and
data types. For the date information in my data file, I specified the correct
month-day-year format. Once I saved the metadata definition that the wizard
helped me create, I could apply that definition to other, similarly formatted
data files. So, for example, while I based my metadata definition on campaign
data from the 2010 election cycle, I could use the same definition for
subsequent imports of previous election cycles.
I created a separate metadata element for my MySQL database. This step
involved filling out a field with the connection information for my database,
just as I would configure a database query tool connection. I could choose from
31 different types of database connections, which included a full range of
database products as well as generic ODBC and JDBC connections.
With my source and target elements ready to go, I dragged each element onto
the design canvas for the data integration job I'd created. I indicated that my
delimited file would be an input element and that my database would accept
output. From here, I modified a few settings for the database output element,
indicating, for instance, that TOS should create a new table to receive the
data.
I ran my new job and watched as Talend's rows, rows-per-second and
elapsed-time indicators marked the flow of my test data into my MySQL database.
With this data in place, I built a new job—this time with my
already-configured database element as a source to provide input, as well as a
new element, for an instance of SugarCRM that I've been testing, to accept the
output. Configuring the SugarCRM element was similar to setting up my database
connection: I provided the SugarCRM Web services URL and my authentication
information, and selected which table I wanted to use from the Sugar system.
I also added a third element to my job design canvas—a tMap element, which
enabled me to map particular columns from my MySQL source to my chosen SugarCRM
table, as well as to transform the column values en route between the two
stores. I used Talend's expression builder, for instance, to extract the last
names from a full name column in my source table using a function provided with
the product.
Executive Editor Jason Brooks can be reached at jbrooks@eweek.com.