IT Science Case Study: How U.S. Census Bureau Embraced the Digital Age

eWEEK IT SCIENCE: Every 10 years, the Census Bureau must build an accurate address list of every housing unit, motivate people to respond, analyze the data and release the results. This gets harder every time, and its data-gathering and analytics systems needed updating in a big way.


Here is the latest article in an eWEEK feature series called IT Science, in which we look at what actually happens at the intersection of new-gen IT and legacy systems.

Unless they’re brand new and right off the assembly line, the servers, storage and networking inside every IT system can be considered “legacy.” This is because the iteration of both hardware and software products is speeding up all the time. It’s not unusual for an app-maker, for example, to update or patch an application for security purposes a few times a month, or even a few times a week. Some apps are updated daily! Hardware moves a little slower, but manufacturing cycles are also speeding up.

These articles describe new-gen industry solutions. The idea is to look at real-world examples of how new-gen IT products and services are making a difference in production each day. Most of them are success stories, but there will also be others about projects that blew up. We’ll have IT integrators, system consultants, analysts and other experts helping us with these as needed.

Today’s Topic: Census Bureau Embraces the Digital IT World

The U.S. Census Bureau is the federal government’s largest statistical agency and the nation’s leading provider of quality data about its people and economy. Its most important initiative is the U.S. Census, conducted every 10 years, which counts every resident in the United States. It requires years of research, planning, and development of methods and infrastructure to ensure an accurate and complete count. The data collected by the census determines the number of seats each state has in the U.S. House of Representatives, and it is used to distribute more than $675 billion in federal funds to local communities. This funding supports education, healthcare, infrastructure improvements, and more.

Name the problem to be solved: Modernizing a paper-based system

The 2020 census requires counting an increasingly diverse and growing population of about 330 million in more than 140 million housing units. The Census Bureau must build an accurate address list of every housing unit, motivate people to respond, analyze the data and release the results. Each stage requires significant data processing to organize data into actionable intelligence.

Describe the strategy that went into finding the solution: 

Prior to the 2020 census, all data was collected by paper survey, and then the data was transported to the U.S. Census Bureau and input manually. For the first time in U.S. history, the 2020 census is being conducted primarily online instead of by mail. But this effort will generate an unprecedented amount of data that must be collected, stored, secured and interpreted.

To provide the processing capacity needed, bureau leadership established the Census Enterprise Data Lake (EDL) initiative. The EDL provides the processing capability for petabyte-scale data management and analytics while satisfying security and privacy requirements and controlling costs. The initiative is transforming how the agency processes demographic and economic data using open-source technology and high-performance cloud infrastructure.

List the key components in the solution: 

The Census Bureau chose Cloudera as the data platform for the 2020 census to help mine, process and extract insights used to inform important decisions at all levels of government. The deployment draws on Cloudera’s entire technology stack and its professional service offerings. Cloudera DataFlow is used to ingest data and provide real-time analytics. Hortonworks Data Platform serves as the data lake and repository for the massive amount of data collected. Hadoop Distributed File System, Apache Ranger, Apache Atlas and encryption of data at rest and data in motion are used to enable data sharing and to enforce security and data governance policies.

Kevin Smith, chief information officer at the U.S. Census Bureau, said: “The EDL will support the processing of big datasets quickly and easily with large, dynamically scalable compute and storage capabilities throughout the enterprise. The data lake also provides a centralized repository to consolidate operational paradata, response data, and cost data from multiple modes of data collection. It provides a single place to analyze all operational data and make informed decisions during operations.”
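
To make the idea of a “single place to analyze all operational data” concrete, here is a minimal sketch in plain Python. It is not the bureau’s implementation and does not use Cloudera’s tooling. The record layouts, the `case_id` key, the field names and every value are hypothetical, chosen only to illustrate merging response data, paradata and cost data from different collection modes into one view.

```python
from collections import defaultdict

# Hypothetical sample records; field names and values are illustrative
# assumptions, not real Census Bureau data or schemas.
responses = [
    {"case_id": "A1", "mode": "internet", "household_size": 3},
    {"case_id": "B2", "mode": "paper", "household_size": 2},
]
paradata = [
    {"case_id": "A1", "contact_attempts": 1},
    {"case_id": "B2", "contact_attempts": 4},
]
costs = [
    {"case_id": "A1", "cost_usd": 0.50},
    {"case_id": "B2", "cost_usd": 12.25},
]


def consolidate(*sources):
    """Merge records from several sources into one dict per case_id."""
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            merged[record["case_id"]].update(record)
    return dict(merged)


def average_by_mode(cases, field):
    """Average a numeric field across cases, grouped by collection mode."""
    values = defaultdict(list)
    for case in cases.values():
        if "mode" in case and field in case:
            values[case["mode"]].append(case[field])
    return {mode: sum(v) / len(v) for mode, v in values.items()}


cases = consolidate(responses, paradata, costs)
print(average_by_mode(cases, "cost_usd"))
```

Once the three feeds share one keyed view, a question such as “what does each collection mode cost per case?” becomes a single pass over the consolidated records rather than a join across separate systems.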

Describe how the deployment went, perhaps how long it took, and if it came off as planned:

This hybrid deployment is well under way, with many of the workloads running in AWS GovCloud. Cloudera’s consulting team augmented the government and systems integrator teams on site to ensure operational success.

Describe the result, new efficiencies gained, and what was learned from the project: 

The Census Bureau’s investment in data analytics, cloud computing and open-source technology supports the organization’s long-standing history of innovation. Now, filling out the census questionnaire is easier and quicker than ever before, because the platform enables respondents to automatically reuse their responses. The data is quickly analyzed for quality, which ultimately reduces the volume of redundant data.

Personal data is more secure than ever before. The EDL enables security, privacy and policy controls for all types of sensitive data and code at an enterprise level. As a result, the bureau can effectively manage and secure multiple, large datasets via automation and use metadata to monitor, link and aggregate datasets through the survey lifecycle until the final products are disseminated.

Describe ROI, carbon footprint savings, and staff time savings:

Data scientists can now share data and insights more easily within the bureau and across agencies while adhering to policies for security and data governance. Because of this new capability, the Census Bureau is able to help other agencies derive insights from the data. Those insights help ensure that resources are provided to those who need them and that the government can plan for future needs based on patterns of population growth and change.

In addition, because the 2020 census is digital, costs have dropped significantly thanks to the reduction in paper surveys. More importantly, the shift eases the load on U.S. Post Office resources in a crucial election year.

