Microsoft Spark Connector for Azure DocumentDB Supports Data Science

Microsoft introduces a new Apache Spark connector for Azure DocumentDB at the Strata + Hadoop World conference, along with several updates to the company's big data ecosystem.


Microsoft is serious about becoming the go-to provider of big data cloud services for its enterprise Azure cloud-computing customers.

The company introduced a new Apache Spark connector for Azure DocumentDB, among several additions to its big data solutions portfolio, during the Strata + Hadoop World big data conference in San Jose, Calif., today.

Taken all together, the new additions are intended to help businesses piece together flexible, high-performance big data processing and analytics systems in the cloud.

Apache Spark is the developer-friendly, open-source data processing engine with a knack for making short work of big data workloads and enabling sophisticated analytics. Combined with Azure DocumentDB, Microsoft's NoSQL document service, the technology now enables Azure customers to perform data science and glean insights in real time, according to Dharma Shukla, distinguished engineer and general manager of open-source software analytics and NoSQL at Microsoft.

"Connecting Apache Spark to Azure DocumentDB accelerates our customer's ability to solve fast-moving data sciences problems where data can be quickly persisted and retrieved using DocumentDB," Shukla said in a March 15 statement.

"The Spark to DocumentDB connector efficiently exploits the native DocumentDB managed indexes and enables updateable columns when performing analytics, push-down predicate filtering, and advanced analytics to data sciences against fast-changing globally-distributed data, ranging from IoT, data science, and analytics scenarios," Shukla's statement said.

The connector, which uses the Azure DocumentDB Java SDK, is available now on GitHub.

Microsoft also announced the general availability of new MongoDB APIs (application programming interfaces) for DocumentDB. Backed by enterprise-grade service-level agreements, the APIs enable applications built on MongoDB NoSQL databases to "seamlessly target" DocumentDB data using their existing client drivers and toolsets, Shukla said.

Several new enhancements to HDInsight, the company's cloud-based distribution of Hadoop, were also unveiled.

In a security-enhancing move, Microsoft has extended HDInsight's native security capabilities for Hadoop workloads, which include authentication and encryption, to other workloads, including Apache Spark and Interactive Hive, also known as Live Long and Process (LLAP). Interactive Hive is a new HDInsight cluster type that "allows in memory caching that makes Hive queries much more interactive and faster," according to an online support document from Microsoft.

With this new feature, the software giant describes HDInsight as a "flexible, and open Big Data solution on the cloud with in-memory caches (using Hive and Spark) and advanced analytics through deep integration with R Services."

Now that HDInsight supports Apache Hive 2.1.1, customers can use the solution for data warehouse scenarios that deliver sub-second query performance and don't require time- and resource-consuming data movement tasks, Shukla said.

Finally, SQL Server Community Technology Preview (CTP) 1.4 for both Windows and Linux will be available to download in the coming days, Microsoft announced today. In addition to some Linux-specific tweaks, it will contain index-rebuilding features that give busy database administrators (DBAs) more flexibility in scheduling index maintenance and recovery tasks.

Pedro Hernandez

Pedro Hernandez is a contributor to eWEEK and the IT Business Edge Network, the network for technology professionals. Previously, he served as a managing editor for the network of...