Jeremy Martin | November 30, 2018

A Look at NASA's Cloud Data Warehouse

For a space agency that has sent probes to Pluto and landed autonomous craft on Mars, one might think that NASA has the most pristine data on earth. But even NASA has had to deal with data silos.

Like many organizations, NASA was looking a few years ago to analyze many isolated datasets. But instead of trying to gain a full picture of customer behavior, NASA was hoping to unify aviation data for a holistic view of U.S. flight patterns.

Now, thanks to NASA's cloud data warehouse, air traffic researchers can view and analyze archived flight data that has been collected and merged from dozens of air traffic facilities across the U.S., with fast update rates ranging from one second to 12 seconds for every flight’s position.

Here's how they did it.

 

Integrating Data Silos 

Before implementing the data warehouse, NASA had access only to flight, weather, and traffic flow-related data sources. Flight data updated only once a minute and included no information about flights on the ground at airports. Worse, the flight data sets came from 77 different Federal Aviation Administration (FAA) air traffic facilities, and the vast majority of them used different formats and standards.

To alleviate these disparities, NASA created the Sherlock Air Traffic Management (ATM) Data Warehouse, a tool that merges all of the air traffic facility data in order to produce analysis-ready, end-to-end flight information for the entire U.S. airspace.

 

Automated ETL 

To do so, Sherlock collects, archives, and processes many raw data feeds (e.g., FAA System Wide Information Management, NOAA weather observations and forecasts, operational facility data, traffic advisories, and delay statuses). Once the feeds are collected, Sherlock must transform the flight data before aggregating and storing it in a consistent format. This is no easy task, as aircraft often have overlapping and conflicting positions, flight plans, and time and airspace references.
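To make the problem of overlapping and conflicting positions concrete, here is a minimal Python sketch of one way to reconcile duplicate position reports. It is not Sherlock's actual logic, which this article does not describe. The `PositionReport` type, the `SOURCE_PRIORITY` ranking, and the sample values are all hypothetical, and the rule "prefer the higher-priority facility when two reports share a flight and timestamp" is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionReport:
    """One position report in an assumed common format (hypothetical schema)."""
    flight_id: str
    timestamp: int       # seconds since epoch, UTC
    lat: float
    lon: float
    altitude_ft: float
    source: str          # which facility type produced the report


# Assumption: a lower number means a more trusted source for the same instant.
SOURCE_PRIORITY = {"tower": 0, "center": 1}


def merge_reports(reports):
    """Keep one report per (flight, timestamp), preferring the trusted source."""
    best = {}
    for report in reports:
        key = (report.flight_id, report.timestamp)
        current = best.get(key)
        if current is None or (
            SOURCE_PRIORITY[report.source] < SOURCE_PRIORITY[current.source]
        ):
            best[key] = report
    # Return a single time-ordered track per flight.
    return sorted(best.values(), key=lambda r: (r.flight_id, r.timestamp))


# Illustrative data only: two facilities disagree about the same instant.
reports = [
    PositionReport("UAL123", 1000, 37.60, -122.40, 3000.0, "center"),
    PositionReport("UAL123", 1000, 37.61, -122.39, 2950.0, "tower"),
    PositionReport("UAL123", 1012, 37.62, -122.38, 2800.0, "center"),
]
merged = merge_reports(reports)
```

A real pipeline would likely need fuzzier matching (timestamps from different facilities rarely line up exactly), but the shape of the step is the same: pick a key, pick a tie-breaking rule, and emit one consistent track.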

 

Alongside Sherlock's ETL pipeline, NASA uses Hadoop, an open-source software framework for storing data and running applications on clusters of commodity hardware. Hadoop is well suited to aggregating a complex stream of U.S. flight data, since it provides massive storage for any kind of data, enormous processing power, and the ability to handle a very large number of concurrent tasks or jobs.

 

 

Analytics and Reporting 

The major value of a cloud data warehouse for NASA is that it serves as a platform for big data analytics, including data mining and machine learning. With Sherlock, NASA can generate real-time reports for everything from runway usage and taxi time to flight reroutes and optimal flight plans. It allows researchers to answer questions such as, “How much fuel could be saved if all flights into the San Francisco Airport used lower power for their final descent?” Or, “Would more accurate departure schedules reduce delays into busy Northeast airports, and at what rate?”

By answering such questions, NASA hopes to dramatically reduce aviation’s environmental impact and improve efficiency while maintaining safety in increasingly crowded skies. But such answers wouldn’t be possible without successful data integration.

 

 

The Challenges of Unstructured Data 

Even with the sophistication of its cloud data warehouse, NASA must still grapple with large amounts of unstructured data. Ideally, researchers could query across all of the data sources in Sherlock to answer questions that span multiple kinds of data (not just the flight data), without spending time and energy writing custom code to integrate multiple datasets. But these datasets are still very heterogeneous. In other words, a variety of data formats, field names, scientific units, and spatial/temporal values make Sherlock a patchwork of raw data.

Such lack of structure hampers field standardization across tables, making joins difficult. As a result, analysts can only query the data warehouse within isolated ‘data islands’. They can’t connect data tables or bridge across data sources without great effort.
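The following Python sketch illustrates why mismatched field names and units block a join, and how a per-dataset field mapping removes the obstacle. Every name here (`standardize`, `track_rows`, `plan_rows`, the field names, and the sample values) is hypothetical and does not reflect Sherlock's real tables.

```python
FT_PER_M = 3.28084  # feet per meter


def standardize(record, field_map, converters):
    """Rename fields to a common schema and convert units where needed."""
    out = {}
    for src_name, dst_name in field_map.items():
        value = record[src_name]
        convert = converters.get(dst_name)
        out[dst_name] = convert(value) if convert else value
    return out


# Two hypothetical datasets describing the same flight in different ways:
# different key names ("callsign" vs "flight") and different altitude units.
track_rows = [{"callsign": "UAL123", "alt_m": 1000.0}]
plan_rows = [{"flight": "UAL123", "cruise_alt_ft": 35000.0}]

tracks = [
    standardize(
        row,
        {"callsign": "flight_id", "alt_m": "altitude_ft"},
        {"altitude_ft": lambda meters: meters * FT_PER_M},
    )
    for row in track_rows
]
plans = [
    standardize(
        row,
        {"flight": "flight_id", "cruise_alt_ft": "cruise_altitude_ft"},
        {},
    )
    for row in plan_rows
]

# With a shared key name and shared units, the join is a dictionary lookup.
plans_by_id = {plan["flight_id"]: plan for plan in plans}
joined = [
    {**track, **plans_by_id[track["flight_id"]]}
    for track in tracks
    if track["flight_id"] in plans_by_id
]
```

The join itself is trivial. The effort lies in writing and maintaining a mapping like this for every dataset, which is the custom integration code analysts would rather avoid.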

 

The Value of a Single Schema 

OK, so maybe not all of us are geniuses dealing with aeronautical data. But for those of us trying to reconcile various customer datasets, there’s reason to hope.

If you use a variety of cloud applications (CRMs, marketing automation systems, ERPs, etc.), you can connect these data sources to Bedrock Data Fusion.

Fusion is a data warehouse that unifies your customer data from multiple cloud applications. You simply connect your SaaS applications, then let Fusion match, map, and merge every major object into a “Fused Database”, where formats are standardized and many data models are transformed into a single, common schema. With structured data at your disposal, it’s far easier to feed data to a business intelligence (BI) tool and accelerate your time to analytics.
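As a rough illustration of what "match, map, and merge" means in general, here is a small Python sketch. It is not Fusion's API or algorithm. The field maps, the `fuse` function, and the sample records are hypothetical, and matching on a normalized email address is an assumption made to keep the example short.

```python
# Hypothetical field maps from each source system to one common schema.
CRM_MAP = {"Email": "email", "FullName": "name", "Account": "company"}
MARKETING_MAP = {"email_address": "email", "lead_score": "score"}


def normalize_email(email):
    """Match step: treat addresses that differ only by case or spacing as equal."""
    return email.strip().lower()


def map_fields(record, field_map):
    """Map step: rename source fields to the common schema."""
    return {dst: record[src] for src, dst in field_map.items() if src in record}


def fuse(crm_records, marketing_records):
    """Merge step: combine records that share a normalized email."""
    fused = {}
    sources = [(crm_records, CRM_MAP), (marketing_records, MARKETING_MAP)]
    for records, field_map in sources:
        for record in records:
            mapped = map_fields(record, field_map)
            key = normalize_email(mapped["email"])
            mapped["email"] = key
            fused.setdefault(key, {}).update(mapped)
    return fused


# Illustrative data only.
crm = [{"Email": "Ada@Example.com ", "FullName": "Ada Lovelace", "Account": "Acme"}]
marketing = [{"email_address": "ada@example.com", "lead_score": 87}]
customers = fuse(crm, marketing)
```

Here the two source records collapse into one entry keyed by `ada@example.com`, carrying the name and company from the CRM and the score from the marketing system.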

 

Fusion is easy and free to try. Sign up in just 10 seconds to create your account.

 
