Syncsort's Hadoop ETL Solutions Provide Simplified Data Integration

By Darryl K. Taft  |  Posted 2013-05-24


Syncsort, a provider of big data integration and protection solutions, recently announced the availability of its Spring '13 release, including two brand-new Hadoop products and enhancements to its DMX technology that turn Hadoop into an easy-to-use extract, transform and load (ETL) solution.

Big data is prompting organizations to look at Hadoop to process more data in less time and for less money, but Hadoop is not yet a complete ETL solution. Syncsort's two new offerings for Hadoop, DMX-h ETL Edition and DMX-h Sort Edition, are designed to strengthen Hadoop by providing the full functionality required to deliver enterprise ETL capabilities. They provide greater ease of use and maximize node performance compared with non-native, code-generating ETL tools. In addition, performance and connectivity enhancements to DMX expand usage by end users and partners.

"Analyzing big data is critical to our customers' ability to sustain competitiveness, but the avalanche of information is breaking traditional data integration architectures—many of the tools are too code- and resource-intensive and ultimately drive costs too high," said Josh Rogers, senior vice president of the data integration business at Syncsort, in a statement. "With our new DMX editions, we are strengthening Hadoop by providing seamless and powerful ETL and sort capabilities and at the same time, reinvigorating the value proposition of ETL by leveraging the power of Hadoop to scale core processing of big data."

"Based on the evidence I have gathered talking with customers and in-the-weeds big data consultants, claims that Hadoop, and some non-Hadoop big data solutions, eliminate the need for ETL are patently false," wrote analyst Evan Quinn in a post on the Enterprise Strategy Group (ESG) blog. "Nothing solves data prep and understanding challenges like ETL. ETL forces the data analyst to dig into the details of all the raw data, and conceptualize what a perfect data set for analytics would look like—and this exercise also helps the data analyst determine the analytical possibilities. … Thus, it should also come as no surprise that ETL has thus far proven to be one of the most popular applications of Hadoop, and, if anything, ESG sees Hadoop-based ETL continuing to grow its fan base."

Moreover, Quinn added, "Syncsort DMX-h ETL Edition will help Hadoopists take a big data step forward in terms of ETL ease of development and performance."

"Cloudera sees ETL as one of the top use cases for Hadoop—it is essential to our mission of maximizing the value of big data," Amr Awadallah, chief technology officer at Cloudera, said in a statement. "We see Syncsort's new DMX-h offerings enabling our mutual customers with critical data integration and ETL capabilities which simplify ETL deployments while efficiently processing data natively on Hadoop. The CDH 4.2 release includes Syncsort's contribution to Apache Hadoop making the sort phase pluggable, enabling DMX-h, and broadening use cases on Hadoop."

The new DMX-h solutions take advantage of Syncsort's recent contribution to Apache Hadoop, which provides a unique level of native integration to deliver best-in-class data integration capabilities and Sort acceleration for Apache Hadoop distributions.

Highlights of DMX-h ETL Edition include an ETL engine that runs natively within MapReduce, maximizing node performance. It also provides Hadoop ETL without coding: developers can work in an easy-to-use Windows GUI and deploy seamlessly into Hadoop. In addition, it provides "use case accelerators," essentially a library of pre-built templates that help developers fast-track Hadoop ETL implementations, and it extends access to and delivery of all data, including data from the mainframe.

Recent Syncsort benchmarks show significant Hadoop performance and resource efficiency improvements when using DMX-h. The results show predictable and sustainable throughput even as data volumes grow. Using the TeraSort benchmark, DMX-h Sort Edition achieved a sustainable throughput of more than 100MB per second per node, more than twice the 45MB per second per node achieved by Hadoop's native sort.



Similarly, DMX-h ETL Edition achieved sustainable throughput in excess of 255MB per second per node, for up to 2.5 times faster performance than Pig, when aggregating 2TB of Web log data. In both cases, tests were run on data volumes ranging from 500GB to 2TB. While alternatives such as Hadoop's native sort and Pig reach a saturation point, where throughput starts to decline, at around 500GB of data, DMX-h delivered sustainable and predictable performance from 500GB to 2TB, Syncsort said. This has major implications for organizations, which can more efficiently size their Hadoop infrastructure, minimize uncertainty and achieve a more predictable cost structure as their big data becomes even bigger.
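To illustrate what per-node throughput means for cluster sizing, here is a back-of-the-envelope sketch using the TeraSort figures Syncsort reported (45MB versus 100MB per second per node). The one-hour batch window, the decimal unit convention, the assumption of perfectly linear scaling across nodes and the `nodes_needed` helper are all hypothetical choices made for illustration; none of them comes from Syncsort's benchmark.

```python
import math

MB = 10**6   # decimal megabyte (assumed; the unit convention is not stated)
TB = 10**12  # decimal terabyte (assumed)


def nodes_needed(data_bytes, window_seconds, mb_per_sec_per_node):
    """Nodes required to process data_bytes within window_seconds.

    Assumes throughput scales linearly with node count, which real
    clusters only approximate.
    """
    bytes_per_node = mb_per_sec_per_node * MB * window_seconds
    return math.ceil(data_bytes / bytes_per_node)


data = 2 * TB   # largest volume in the reported tests
window = 3600   # hypothetical one-hour batch window

native_nodes = nodes_needed(data, window, 45)   # Hadoop native sort figure
dmx_nodes = nodes_needed(data, window, 100)     # DMX-h Sort Edition lower bound
speedup = 100 / 45                              # ratio of the reported figures

print(f"native sort: {native_nodes} nodes, DMX-h: {dmx_nodes} nodes, "
      f"ratio: {speedup:.2f}x")
```

Under these assumptions, the same 2TB job would need roughly half as many nodes at the higher throughput. Because Syncsort reported "more than" 100MB per second, the DMX-h node count here is an upper estimate.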

"Hadoop is lowering the cost structure of processing data at scale, but deploying Hadoop at the enterprise level is not free, and significant hardware and IT productivity costs can damage ROI," ESG's Quinn said in a statement. "Syncsort's Spring '13 release provides unique capabilities in Hadoop to help maximize savings, delivering best-in-class ETL technology at a price point that is highly disruptive for the data integration market, and more consistent with the cost structure of open-source solutions."

Meanwhile, TagMan offers a marketing data platform, delivered as software as a service (SaaS), that helps e-commerce ventures track their marketing campaigns and get the full picture of their marketing effectiveness. The company improves visibility and reduces maintenance by managing vendor marketing tags through a single container tag that gathers all the information the advertiser wants to track, and by tying the collected marketing data together with other data to provide insights into the effectiveness of different campaigns along the full path to conversion.

The TagMan data management production environment is currently a hybrid of an in-house developed data collection system and a traditional SQL database reporting system. However, TagMan is looking to Hadoop, which it has implemented in parallel, for horizontal scalability (the ability to add nodes) and for the flexibility to add new data points it wants to analyze.

Ultimately, Hadoop allows TagMan to handle more data more easily and efficiently when collecting massive amounts of data in real time. This lets the company create actionable intelligence from growing volumes of big data and lets its clients make minute-by-minute marketing decisions, such as real-time bidding and search optimization. TagMan sees Syncsort's DMX-h ETL Edition as a fit with its Hadoop plans because the toolset makes it easier to anticipate and handle the required MapReduce processing, namely data collection and distribution. Company officials said that when they know how they want to slice the data, Syncsort can make it easy to do.

"In tag management, we facilitate a huge number of interactions between marketers and their vendors, and as a result, we are able to see the complex journey a consumer takes prior to making a purchase," said Ave Wrigley, CTO of TagMan, in a statement. "This involves a huge amount of data processing. To be competitive, we must convert the high volume of 'path-to-purchase' data captured by our platform into actionable intelligence that drives decisions by both marketers and their vendors. What's compelling about Syncsort's latest DMX product deliveries is the unique approach to replacing older code-driven approaches with a streamlined, GUI-driven way to collect, cleanse and distribute information inside and outside Hadoop, saving time and resources and giving us maximum flexibility in preparing big data for business analytics and data visualization."

Users looking to try DMX-h ETL Edition can download a free test drive that contains everything they need, without having to set up their own Hadoop cluster. It includes a Linux virtual machine with Cloudera CDH 4.2 and DMX-h ETL Edition preinstalled, along with use case accelerators and sample data.

