Structure the Unstructured
How to Mine Scientific Business Intelligence in the Cloud
Before the rise of the mammoth database, before e-mail, electronic laboratory notebooks (ELNs), and global Centers of Excellence, R&D innovation was centered on personal communication at the company lunch table. It was here that project team leaders (such as the head chemist, pharmacologist, biologist and other select stakeholders) would gather to share their knowledge. The rich insights that resulted led to a significant number of discoveries in areas ranging from pharmaceuticals and consumer packaged goods to specialty chemicals and heavy manufacturing.
Thanks to the swift pace of technological change, our ability to generate data has increased exponentially. The problem is that there is too much content and not enough context. Raw information dumped into data warehouses has replaced the knowledge-driven categorization and intelligence capabilities that dominated the lunch table.
Disjointed processes and disparate data silos [a product lifecycle management (PLM) application here, a chemistry system there] have replaced collaborative project ownership and decision making. As a result, the most valuable information is often hidden in a deluge of data, inaccessible to the researchers who need it and disconnected from other relevant sources of knowledge.
To usher in a new era of innovation, R&D organizations need to re-create the open, collaborative atmosphere that existed at the company lunch table, but on a scale that embraces the breadth and complexity of today's global scientific information landscape. Here's how.
Move Collaboration to the Cloud
Cloud computing, whether inside or outside the firewall, offers great possibilities when it comes to enabling richer communication. The Web provides an ideal forum for project stakeholders to interact and share ideas, regardless of their location, their areas of specialization or the format of the information. When a browser is all that's needed to get a seat at the table, collaboration can once again play a key role in the discovery process.
But there are technical considerations that first need to be taken into account. Because the data involved in modern scientific research is so vast and complex, it doesn't make sense, nor is it really possible, to take legacy infrastructure (such as a large chemistry or biology data warehouse) that's cemented to the floor and move it to the cloud. There are just too many transactional systems wrapped around these data hubs to pull out the center of the onion.
At the same time, installing thick-client technologies at every site to transact on one or many data warehouses would introduce too much latency. Instead of isolating information within disciplinary silos (such as an ELN that only categorizes biological assays), organizations should focus on enabling the integration, shared access and reporting of project-centric data via a cloud-based project data mart. This requires a services-based information management platform capable of extracting the most relevant scientific intelligence from diverse systems and formats, and integrating it in the cloud to enable streamlined collaboration and decision making.
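To make the idea of a project-centric data mart concrete, here is a minimal sketch in Python. It assumes two hypothetical silos (an ELN holding assay results and a chemistry registry) whose records share project and compound identifiers. All record layouts, field names and the `build_project_mart` helper are illustrative assumptions, not part of any real product.

```python
from collections import defaultdict

# Hypothetical records exported from two separate silos.
# Field names and identifiers are invented for illustration.
ELN_ASSAYS = [
    {"project": "PRJ-1", "compound": "CMP-7", "assay": "IC50", "value_nm": 12.5},
    {"project": "PRJ-2", "compound": "CMP-9", "assay": "IC50", "value_nm": 340.0},
]
CHEMISTRY_REGISTRY = [
    {"project": "PRJ-1", "compound": "CMP-7", "formula": "C9H8O4"},
    {"project": "PRJ-2", "compound": "CMP-9", "formula": "C8H10N4O2"},
]


def build_project_mart(*sources):
    """Group records from every source by project, then by compound."""
    mart = defaultdict(lambda: defaultdict(dict))
    for source in sources:
        for record in source:
            # Everything except the two keys is treated as a data field.
            fields = {
                key: value
                for key, value in record.items()
                if key not in ("project", "compound")
            }
            mart[record["project"]][record["compound"]].update(fields)
    # Return plain dicts so callers cannot create entries by accident.
    return {project: dict(compounds) for project, compounds in mart.items()}


mart = build_project_mart(ELN_ASSAYS, CHEMISTRY_REGISTRY)
print(mart["PRJ-1"])
```

In this sketch a collaborator asks one question ("what do we know about project PRJ-1?") and gets assay and registry data together, without querying each source system separately.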
For example, suppose a pharmaceutical company is working with a Contract Research Organization (CRO) on a drug discovery project. Today, many scientific organizations actually install their legacy IT systems at the outsourcer's site in order to exchange and analyze data. Not only is this costly, it's also highly inefficient, as systems now need to be maintained both within the organization's internal IT infrastructure and at the CRO site.
And the redundancies multiply as more departments, locations and partners become involved. With a cloud-based project data mart and reporting layer sitting on top of a services-based architecture, the critical information, workflows and transactions that collaborators need to access can be maintained globally, with a much lower seat cost and support burden.
Structure the Unstructured
The insights researchers are able to gain when conversing informally are extremely rich, because human brains are adept at making contextually relevant associations that a structured database cannot make. For example, a human would know immediately that the words "auto," "automobile" and "car" mean the same thing, or that a past experiment may be "kind of" similar to the one being conducted in a current project. This is what the lunch table of the past delivered.
But what happens when your organization's head pharmacologist is in Boston, the lead chemist is in Beijing, and the available information base involves an enormous breadth of sources and data formats? Those contextually relevant associations are not so easy to make.
Until organizations are able to "structure" (that is, categorize) the vast quantities of unstructured content at their disposal, they will miss out on a monumental amount of knowledge. This is where less rigid categorization technologies such as advanced semantic search and text analytics come in. But they have to be sophisticated enough to handle the highly complex nature of scientific data. For instance, a molecule may be represented by name, by an ID number or as an image, so your search solution must be "scientifically aware" enough to recognize these variations.
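The simplest form of this kind of recognition can be sketched in a few lines of Python. The example assumes a small, hand-built synonym table that maps each known representation of an entity (a common name, a systematic name, a registry number) to one canonical key. The table, the documents and the function names are hypothetical. A real scientifically aware system would also need to interpret chemical structures and images, which this sketch does not attempt.

```python
import re

# Hypothetical synonym table: each known representation maps to one canonical key.
SYNONYMS = {
    "aspirin": "aspirin",
    "acetylsalicylic acid": "aspirin",
    "cas 50-78-2": "aspirin",
    "auto": "car",
    "automobile": "car",
    "car": "car",
}


def canonical_terms(text):
    """Return the set of canonical entities mentioned in the text."""
    lowered = text.lower()
    found = set()
    for variant, canonical in SYNONYMS.items():
        # Word boundaries stop "car" from matching inside "carbon".
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            found.add(canonical)
    return found


def search(query, documents):
    """Return documents that mention any entity named in the query."""
    wanted = canonical_terms(query)
    return [doc for doc in documents if wanted & canonical_terms(doc)]


DOCS = [
    "Batch 14: acetylsalicylic acid showed improved stability.",
    "Supplier note for CAS 50-78-2, received in March.",
    "Caffeine solubility was measured at 25 C.",
]
hits = search("aspirin", DOCS)
for hit in hits:
    print(hit)
```

A query for "aspirin" retrieves the first two documents even though neither contains that word, because both mention a representation that resolves to the same canonical entity.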
Consider a company that needs to search a vast amount of unstructured content, ranging from external patents and journal articles to its own internal documents and research databases, in order to identify and extract information relevant to a key project. Using a scientifically aware text analysis application capable of recognizing chemical structures and biological sequences, researchers would be able to query the content and quickly pinpoint the most relevant information without having to know exactly how the data is represented. Without this capability, the time and cost involved in leveraging unstructured content would be too high and, most importantly, critical insights would be missed.
Simplify Data Loading and Reporting
The "virtual" lunch table will only be a success if collaborating in the cloud is as easy as, well, sitting down to lunch. And forcing cumbersome information management processes on researchers is the single fastest way to stifle innovation. Loading and reporting on data needs to be simple for users, either through a forms-based application run on a thin client or through a basic Web-based extraction, transformation and loading (ETL) service that allows collaborators to just click to deploy.
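As a rough illustration of what such an ETL service does behind the "click," the following Python sketch extracts rows from a collaborator's CSV upload, transforms them (dropping incomplete rows and converting values to a single unit), and loads them into a reporting table. The CSV layout, the unit table and the `results` schema are assumptions made for this example. An in-memory SQLite database stands in for the project data mart.

```python
import csv
import io
import sqlite3

# Hypothetical upload from a collaborator; column names are illustrative.
RAW_CSV = """compound,assay,value,unit
CMP-7,IC50,12.5,nM
CMP-9,IC50,0.34,uM
CMP-11,IC50,,nM
"""

# Conversion factors to nanomolar for the units used in this example.
TO_NANOMOLAR = {"nM": 1.0, "uM": 1000.0}


def extract(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with no measured value and convert the rest to nM."""
    clean = []
    for row in rows:
        if not row["value"]:
            continue
        value_nm = float(row["value"]) * TO_NANOMOLAR[row["unit"]]
        clean.append((row["compound"], row["assay"], value_nm))
    return clean


def load(rows, conn):
    """Insert the transformed rows into the reporting table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(compound TEXT, assay TEXT, value_nm REAL)"
    )
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT compound, value_nm FROM results ORDER BY value_nm"
).fetchall()
print(loaded)
```

The researcher only supplies the file. Unit handling and validation live in the service, which is the point of keeping loading simple for users.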
A flexible approach to information delivery is also required, one that empowers collaborators to view data in a format that best suits their research methods. These formats may range from a simple Web portal to sophisticated three-dimensional visualization, but the important thing is that the reporting is capable of integrating both the structured and unstructured content so that it can be easily analyzed via a single view.
A combination of scientifically aware search and a global service-oriented architecture (SOA) that brings together the intelligence previously marooned in isolated systems such as ELNs makes this integration possible.
Frank Brown, PhD, has served as Senior Vice President and Chief Science Officer for Accelrys since October 2006. Frank has extensive experience in the areas of computational chemistry and chemoinformatics. He is responsible for both the scientific direction of the company and all collaborative research with academic, government and industrial partners. Prior to joining Accelrys, Frank held positions of increasing responsibility at Johnson & Johnson, most recently as senior research fellow within the Office of the CIO. In this position, Frank oversaw the development of architecture for all R&D in the organization's pharma sector. Before Johnson & Johnson, Frank started the first chemoinformatics group in the industry at Glaxo Research Institute, and launched software products targeted to the pharmaceutical industry as vice president for product and business development at Oxford Molecular Group.
Frank has also served as an adjunct associate professor in the Department of Medicinal Chemistry, School of Pharmacy, at the University of North Carolina at Chapel Hill. He has also served as a chair for the American Chemical Society (ACS), Computers in Chemistry section, and on an NIH Special Study. Frank holds a PhD in physical organic chemistry from the University of Pittsburgh and completed postdoctoral studies in biophysics at the University of California at San Francisco. He can be reached at firstname.lastname@example.org.