REVIEW: Talend Open Studio Makes Quick ETL Work of Large Data Sets
When I reviewed the iLuminate 4.0 data warehousing product in June, I spent a healthy share of my testing time getting the data I'd chosen for testing-a set of campaign finance records from the OpenSecrets project-cleaned up and ready for loading into the iLuminate product. This involved writing shell scripts to target small import snags in the data set, as well as divvying up some of my comma-separated values files to duck Excel's million-row-per-spreadsheet limit (a limit that, like the apocryphal 640K of PC memory, I never thought I'd hit).
The next time I have to deal with large data sets during testing, I may well turn to Talend Open Studio, an open-source ETL (extraction, transformation and loading) product that makes it easy to round up data, tweak it en masse, and load it into target systems such as databases and enterprise applications.
Talend Open Studio is built on top of the Eclipse platform, which made the product's interface somewhat familiar to me right from the start.
Licensed under the GPL and available for free download at www.talend.com, Talend Open Studio is a powerful tool in its own right. But for larger data integration projects, organizations can tap Talend Integration Suite, a subscription-based version of the product that adds technical support and additional capabilities intended to support large teams and deployments.
Pricing for Talend Integration Suite starts at $4,000 per developer seat per year. Talend offers other editions that target improved parallel processing and real-time performance. (For the full lineup of Talend data integration products, go to here.)
Scratching the Surface
In my tests of Talend Open Studio, I only scratched the surface of what the tool can do. The product ships with an impressive range of components for accessing data sources and targets, as well as components for manipulating and integrating data in a variety of ways. Also, the product, which hews to a code-generation strategy involving Java or Perl code, is quite extensible.
I tested Talend Open Studio Version 3.1 on a machine running 64-bit Ubuntu Linux 9.04. TOS is also available in versions for PowerPC and x86 versions of Linux, for Solaris, for 32- and 64-bit versions of Windows, and for Apple machines running OS X.
I began my tests by firing up TOS and creating a new project. It was at this point that I could choose between creating a Java- or Perl-based project. For my tests, I stuck with Java-based projects. I set out to import the bulk campaign finance data into a MySQL database by first creating metadata elements for the comma-separated value text files in which the candidate and individual contributor data I wanted to work with were stored.
TOS presented me with a very straightforward wizard that stepped me through singling out my data file and identifying the proper field and row separators and escape characters required to properly parse my file. As with other import tools I've used, the Talend tool included a preview window that made it easy to see that my file would be parsed as expected.
I also used this wizard to populate my column definitions with names and data types. For the date information in my data file, I specified the correct month-day-year format. Once I saved the metadata definition that the wizard helped me create, I could apply that definition to other, similarly formatted data files. So, for example, while I based my metadata definition on campaign data from the 2010 election cycle, I could use the same definition for subsequent imports of previous election cycles.
I created a separate metadata element for my MySQL database. This step involved filling out a field with the connection information for my database, just as I would configure a database query tool connection. I could choose from 31 different types of database connections, which included a full range of database products as well as generic ODBC and JDBC connections.
With my source and target elements ready to go, I dragged each element onto the design canvas for the data integration job I'd created. I indicated that my delimited file would be an input element and that my database would accept output. From here, I modified a few settings for the database output element, indicating, for instance, that TOS should create a new table to receive the data.
I ran my new job and watched as Talend's rows, rows-per-second and elapsed-time indicators marked the flow of my test data into my MySQL database.
With this data in place, I built a new job-this time with my already-configured database element as a source to provide input, as well as a new element, for an instance of SugarCRM that I've been testing, to accept the output. Configuring the SugarCRM element was similar to setting up my database connection: I provided the SugarCRM Web services URL and my authentication information, and selected which table I wanted to use from the Sugar system.
I also added a third element to my job design canvas-a tMap element, which enabled me to map particular columns from my MySQL source to my chosen SugarCRM table, as well as to transform the column values en route between the two stores. I used Talend's expression builder, for instance, to extract the last names from a full name column in my source table using a function provided with the product.
Executive Editor Jason Brooks can be reached at firstname.lastname@example.org.