Facebook, LinkedIn Reflect on 2015: The Year in Open Source
"We've worked to scale our infrastructure as we reached 400 million LinkedIn members, so it's no surprise many of our open-source projects this year focus on building out our data pipelines and tools to help make sense of our data," Perisic said. "The infrastructure improvements we've made in Kafka have allowed us to handle 1.3 trillion messages per day, and Espresso now serves 2.2 million rows per second."
LinkedIn open-sourced Pinot, its real-time analytics infrastructure, in June. Pinot enables the company to slice, dice and scan through massive amounts of data in real time across a wide variety of products, Kishore Gopalakrishna, a senior software engineer on LinkedIn's data infrastructure team, said in a blog post.
"At LinkedIn, we have a large deployment of Pinot storing hundreds of billions of records and ingesting over a billion records every day," Gopalakrishna said. "Pinot serves as the backend for more than 25 analytics products for our customers and members. This includes products such as Who Viewed My Profile, Who Viewed My Posts and the analytics we offer on job postings and ads to help our customers be as effective as possible and get a better return on their investment. In addition, more than 30 internal products are powered by Pinot…"
In October, LinkedIn open-sourced PalDB, a lightweight companion for storing side data that the company developed to assist some of its machine learning efforts and storage needs. An issue that often comes up at LinkedIn is how to improve the usability and memory efficiency of side data, Matthieu Monsch, an engineer at LinkedIn, said in a blog post. Side data is the extra read-only data needed by a process to do its job, he said.
For instance, a list of stop words used by a natural language processing algorithm is side data, Monsch said. Machine learning models used in machine translation, content classification or spam detection are also side data. When this side data becomes large, it can create a bottleneck for the applications that depend on it. PalDB addresses the problem by providing a read-only embeddable database that makes it much easier to scale side data.
Explaining LinkedIn's open-source philosophy, Perisic said he believes participating in open-source projects makes engineers better because their work is exposed to the entire community.
"It seems paradoxical to think that developers write better software for others than they do for themselves, but it actually makes sense," Perisic said. "When software is written 'internally,' developers have a tendency to cut some corners, and I'm as guilty as anyone, especially around documenting, making code easily readable and reusable and having all the right tests in order. The Open Source community has choices and will simply lose patience in trying to figure what your code does if it is too obscure. 'Internally' you may not have a choice.
"With open source, developers' names are attached to the software they create and the entire community can look at it. This puts a human face on code and reputations on the line. Once a developer open sources some software, their names will be forever associated with it. Their design choices and bugs will be visible to all. This is a huge incentive to cross their T's and dot their I's. A developer wants to be associated with good stuff that is well written."
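As a rough illustration of the side-data pattern Monsch describes, the sketch below writes a small key-value file once and then opens it read-only through a memory map, so a process can look up entries such as a stop-word list without loading everything onto its heap. It is not PalDB's file format or API: the names `write_store` and `ReadOnlyStore`, the JSON encoding and the file layout are all assumptions made for this example.

```python
import json
import mmap
import os
import struct
import tempfile


def write_store(path, items):
    """Write every key/value pair once; the file is not modified afterwards.

    Hypothetical layout: an 8-byte index offset, the JSON-encoded values,
    then a JSON index mapping each key to (offset, length).
    """
    index = {}
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", 0))  # placeholder for the index offset
        for key, value in items.items():
            blob = json.dumps(value).encode("utf-8")
            index[key] = (f.tell(), len(blob))
            f.write(blob)
        index_offset = f.tell()
        f.write(json.dumps(index).encode("utf-8"))
        f.seek(0)
        f.write(struct.pack("<Q", index_offset))  # fill in the real offset


class ReadOnlyStore:
    """Read-only view of a file produced by write_store (illustrative only)."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # Map the file read-only; values are decoded only when requested.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        (index_offset,) = struct.unpack_from("<Q", self._map, 0)
        self._index = json.loads(self._map[index_offset:].decode("utf-8"))

    def get(self, key, default=None):
        entry = self._index.get(key)
        if entry is None:
            return default
        offset, length = entry
        return json.loads(self._map[offset:offset + length].decode("utf-8"))

    def close(self):
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "side_data.store")
    write_store(path, {"stop_words": ["a", "an", "the"], "spam_threshold": 0.9})
    store = ReadOnlyStore(path)
    print(store.get("stop_words"))
    store.close()
```

Because the file never changes after it is written, several processes on the same machine could map it at once and share the operating system's page cache, which is one plausible reason a read-only design helps side data scale.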