Microsoft today announced a new Azure capability enabled by a Script Action feature that allows customers to tailor their HDInsight implementations.
“With this new feature, you can now experiment and deploy Hadoop projects to Azure HDInsight that were not possible before,” wrote Oliver Chiu, product marketing manager for Microsoft Hadoop/big data and data warehousing, in a Nov. 17 Azure Blog post. “This is enabled through the Script Action feature that can modify Hadoop clusters in arbitrary ways using custom scripts.”
All types of HDInsight clusters can be customized, he added, including HBase, which provides random, real-time read/write access to big data, and Storm, a real-time, fault-tolerant big data processing platform. Microsoft announced a preview of HDInsight Storm last month in a bid to prepare customers for next-generation, data-intensive applications.
“Available in preview today, we are supporting Apache Storm in HDInsight, allowing our customers to process millions of items of Hadoop data from their Internet of things devices in near-real time using a fully managed Hadoop service,” said T.K. Rengarajan, corporate vice president of Microsoft’s Data Platform, in an Oct. 15 announcement.
Now, organizations can tweak their HDInsight clusters to better fit their big data processing needs.
To help get Azure customers started, Chiu’s team has documented the process of installing Spark and R using scripts. Spark, as he described, “has been gaining popularity for its ability to handle both batch and stream processing as well as supporting in-memory and conventional disk processing.”
The free R programming language was “developed for statistical computing and machine learning,” he continued. “R’s popularity amongst statisticians and data miners has increased substantially in recent years.”
R’s inclusion allows data scientists “to deploy the powerful MapReduce/YARN programming framework for processing large amounts of data on Hadoop clusters deployed in HDInsight,” Microsoft said in a related support document. Spark, on the other hand, “improves over traditional MapReduce framework by avoiding writes to disk in the intermediate stages.”
Customizing HDInsight requires the latest version of Azure PowerShell, Microsoft’s cloud-enabled answer to its Windows-based task automation and scripting software. Custom scripts must also follow some ground rules.
“This script needs to be written with requirements of the managed cloud environment where Azure patches nodes with OS updates, does security patches and can replace a misbehaving node at times,” explained Chiu. In addition, it “needs to be able to run and apply the customization at any time after the node was updated.”
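The requirement Chiu describes amounts to making the customization safe to re-run: if a node is patched or replaced, the script may execute again and should leave the node in the same state. The sketch below illustrates that idea in Python. It is not an HDInsight Script Action; the function name, marker file and payload are hypothetical stand-ins for whatever a real script would install.

```python
import tempfile
from pathlib import Path


def apply_customization(install_dir: Path, version: str, payload: str) -> bool:
    """Apply a customization only if the node does not already have it.

    Returns True if work was done, False if the node was already in the
    desired state. All names here are hypothetical, for illustration only.
    """
    marker = install_dir / ".installed-version"
    # If the marker already records the desired version, do nothing.
    if marker.exists() and marker.read_text() == version:
        return False
    install_dir.mkdir(parents=True, exist_ok=True)
    # Stand-in for the real work (for example, unpacking a component).
    (install_dir / "component.txt").write_text(payload)
    # Write the marker last so an interrupted run is retried next time.
    marker.write_text(version)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "custom-component"
        print(apply_customization(target, "1.0", "demo"))  # first run does the work
        print(apply_customization(target, "1.0", "demo"))  # re-run is a no-op
```

Because the check comes before the work and the marker is written last, running the step on a freshly replaced node repeats the installation, while running it on an already customized node changes nothing.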
Microsoft has published full instructions for installing and using Spark (1.0) on HDInsight, along with separate instructions for R.
Azure HDInsight is evolving into a foundational cloud technology for the Redmond, Wash.-based software company.
Last week, Microsoft launched Azure Operational Insights, a preview of a software-as-a-service (SaaS) data center monitoring solution based on HDInsight. “Leveraging the power of Azure HDInsight, it gleans machine data across data center and cloud environments, and turns it into real-time operational intelligence to enable better-informed business decisions,” stated the company in a Nov. 13 announcement.