Google Unveils Open-Source Gumbo HTML Parser Tool

Google has released Gumbo as a pure C open-source tool for developers.

Google is adding another open-source tool for developers with the release of its Gumbo HTML parser, which is a C implementation of the HTML5 parsing algorithm.

The open-source code release was announced in an Aug. 14 post by Jonathan Tang, of the search features team, on the Google Open Source Blog.

"One of the big accomplishments of the HTML5 standard was to standardize the HTML-parsing algorithm, so that all browsers see the same HTML document in the same way," wrote Tang. "So far, most implementations of this algorithm have either been tied to specific browsers or rendering engines, or they've been written in specific scripting languages. This makes it hard to write quick one-off tools to manipulate and clean up HTML if you don't happen to be working in a language that already has an HTML5-compatible parsing library."

That's where Gumbo can be helpful, because it gives developers "a simple library that can serve as a basic building block for linters, refactoring tools, templating languages, page analysis and other small programs that need to manipulate HTML," wrote Tang. "It's written in pure C for ease of interfacing with other languages, and has no outside dependencies. Gumbo was built from the start to support source locations and correlating nodes in the parse tree with positions in the original text."

Developers can find additional details about Gumbo and its use, installation and more on the Gumbo project page.

Gumbo conforms fully to the HTML5 specification and is robust and resilient to bad input, according to the Gumbo project page on GitHub. It also supports source locations and references back to the original text, and has been tested on more than 2.5 billion pages from Google's index.
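To illustrate what correlating parse-tree nodes with positions in the original text involves, here is a minimal sketch in C that turns a byte offset into a line and column. It is not Gumbo's API: the `SourcePos` type and the `offset_to_pos` function are hypothetical names made up for this example, and a real parser would more likely record positions as it tokenizes rather than rescanning the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type for illustration only; not part of Gumbo's API. */
typedef struct {
    size_t line;    /* 1-based line number */
    size_t column;  /* 1-based column, counted in bytes */
    size_t offset;  /* 0-based byte offset into the original text */
} SourcePos;

/* Convert a byte offset into a line/column position by scanning the
 * original text from the start. Offsets past the end are clamped to len. */
SourcePos offset_to_pos(const char *text, size_t len, size_t offset) {
    SourcePos pos = {1, 1, 0};
    size_t i;

    assert(text != NULL);
    if (offset > len) {
        offset = len;
    }
    for (i = 0; i < offset; i++) {
        if (text[i] == '\n') {
            /* A newline ends the current line and resets the column. */
            pos.line++;
            pos.column = 1;
        } else {
            pos.column++;
        }
    }
    pos.offset = offset;
    return pos;
}
```

A linter or refactoring tool could store a position like this alongside each node it reports on, so that a warning about an element points back to the exact place in the input where that element begins.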

Gumbo is just one of many open-source tools and projects that Google has released to software developers in recent months.

In June, Google released its open-source Cloud Playground environment, where developers can quickly try out ideas on a whim without first setting up a local development environment for safely testing coding experiments outside the production infrastructure. Cloud Playground is presently limited to Python 2.7 App Engine apps.

Also in June, Google opened its Google Maps Engine API to developers so they can build consumer and business applications that incorporate the features and flexibility of Google Maps. By using the Maps Engine API, developers can now use Google's cloud infrastructure to add their data on top of a Google Map and share that custom mash-up with consumers, employees or other users. The maps can then be shared internally by companies or organizations or be published on the Web.

In May, Google's Go open-source programming language was updated to Version 1.1, bringing developers new capabilities and performance improvements such as a race detector for finding concurrency bugs and new standard library functionality. Go 1.1 arrived 14 months after the release of the original 1.0 version in March 2012.

There had been two minor "point releases" in between, but they fixed only critical issues and didn't amount to a reworking of the language. The new version includes significant performance-related improvements, according to the release announcement, including optimizations in the compiler and linker, garbage collector, goroutine scheduler, map implementation and parts of the standard library.

In April, Google released the open-source Android-based kernel code for its Glass project to encourage software developers to step up their development of Google Glass apps.

In January, Google announced that it was moving its Google Cloud Platform (GCP) over to the GitHub collaborative development environment to make it easier for software developers to contribute and continue the evolution of GCP. The GCP program has been growing since Google unveiled a new partner program in July 2012 to help business clients discover all of Google's available cloud services. GitHub is a rapidly growing collaborative software development platform for public and private code-sharing and hosting.