What is a Virtual Data Pipeline?

A virtual data pipeline is a series of repeatable actions, or jobs, that automate the movement and transformation of data across systems without physically moving the data out of the locations where they originate.

Data are extracted, usually from various SaaS applications; transformed, so that formats match, fields are de-duplicated, and records are merged; and loaded into a single place, such as a cloud data warehouse.

At the end of the pipeline, your data are in far better shape for analytics and insights.

Why You Need a Virtual Data Pipeline

A virtual data pipeline helps IT teams shift from reacting to one-off requests to running a process that helps the business work more effectively with its data.

Beyond the purely technical benefits, these are some of the key business wins teams can achieve with a virtual data pipeline:

1. Get a 360-Degree Customer View

Everyone wants a 360-degree customer view. But only 10 percent of businesses have one, as unifying data is time-consuming and complex, even for IT. A virtual data pipeline will extract and align data from multiple sources of customer data that would otherwise sit in separate silos.


2. Align Teams

Often, teams debate data rather than let it solve their problems. When sales, marketing, support, and finance frequently lack a master data set, they rely on analyzing their own data in silos. A virtual data pipeline combines information so the entire business can rally around a single, trusted source of information.


3. Improve Customer Experience

When teams are aligned on how to interact with customers, customers enjoy an amazing experience. But if your sales rep is unaware of a customer's support ticket, if your marketing team can't tell which products customers have expressed interest in, or if your chat system asks customers who they are every time they visit your website, the customers you invested in suffer. By implementing a virtual data pipeline, you can fuse many customer touchpoints into one view that every department can use.

4. Turn Investments Into Outcomes

A business tracks investments and business outcomes. If this tracking happens in discrete systems, the factors driving growth remain vague. By integrating data, marketing can see, for instance, how their investment in leads converted into better sales outcomes. You can explain which activities and milestones led to success, and how.


5. Ensure Compliance

Centralized customer data helps your business comply with the law. If your data are spread across many systems, you run the risk of violating the General Data Protection Regulation (GDPR), for example. Connecting applications to a centralized data warehouse helps you pin down which systems house customer data and automate how you meet regulatory requirements.


How To Set Up Your Virtual Data Pipeline

You can set up your virtual data pipeline in four steps.

  1. Connect data sources
  2. Consolidate & Normalize data
  3. Warehouse data
  4. Feed analytics, reports & dashboards   

1. Connect Data Sources

Customer data can reside in many SaaS systems at your organization, each of which becomes a connector in your pipeline. The average SME uses 12+ SaaS systems to manage customer data, including marketing automation tools like HubSpot, Marketo, and Salesforce Pardot; CRMs like Salesforce, Microsoft Dynamics, and NetSuite; and customer support tools like Zendesk and Help Scout, in addition to finance, event management, online meetings and webinars, and more.


Connecting data sources lets you view and manage a single representation of a user across all of your applications. Once connected, syncing keeps your data current in real time, improving the usefulness of each application.


Objects are record types in a given system. Example objects in your applications could include:

  • Lead
  • Opportunity
  • Ticket
  • Campaign
  • Contact
  • Account
  • Activity
  • Company
  • Product
  • Order
  • User

Related Objects

Connecting sources is more than having a unified Contact record. Each object has a relationship to other objects. And these object relationships must be preserved, as you will want to access this type of data in your analysis.


Mappings tell you where and how an object is connected to another object. If a lead in System A converts, the mapping determines where and how the lead will sync in System B.
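Conceptually, a mapping can be as simple as a lookup table from source fields to target fields. Below is a minimal Python sketch of that idea; the system names, field names, and records are hypothetical, not taken from any specific product:

```python
# Hypothetical field mapping: where each System A lead field lands
# when the lead syncs into System B as a contact.
LEAD_TO_CONTACT = {
    "first_name": "FirstName",
    "last_name": "LastName",
    "email": "Email",
    "company": "AccountName",
}

def map_lead(lead: dict) -> dict:
    """Translate a System A lead into a System B contact record."""
    return {
        target: lead[source]
        for source, target in LEAD_TO_CONTACT.items()
        if source in lead
    }

lead = {"first_name": "John", "last_name": "Smith",
        "email": "john@bedrock.example", "company": "Bedrock"}
print(map_lead(lead))
```

In practice a mapping also records sync direction and conflict rules, but the core idea is a declarative table the sync engine consults.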


2. Consolidate & Normalize Data

Now that you have your discrete data sets, you need to get them into a fully integrated, unified data set by consolidating and normalizing the data.

The benefits of normalized data are many: queries become simpler and faster, you can access multiple sources at once, and records are organized consistently across all tables.

Normalization should begin before consolidating multiple data sets. It is accomplished by:

  • De-duplicating
  • Standardizing Formats
  • Resolving Conflicts
  • Matching Common Identifiers
  • Data Modeling



De-duplicating

Sometimes there are so many duplicates in your applications that bringing data into a central system seems daunting. To avoid carrying duplicates into the consolidated data set, de-duplicate data within each silo first. Load the non-duplicate data and flag duplicates for clean-up later.
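A minimal Python sketch of this flag-and-load approach, keying on a normalized email address (the choice of field, and the sample records, are illustrative):

```python
def dedupe(records, key="email"):
    """Split records into unique entries and flagged duplicates,
    keeping the first record seen for each normalized key value.
    Records with a missing key are also flagged for review."""
    seen, unique, duplicates = set(), [], []
    for rec in records:
        k = (rec.get(key) or "").strip().lower()
        if k and k not in seen:
            seen.add(k)
            unique.append(rec)
        else:
            duplicates.append(rec)  # flag for clean-up later
    return unique, duplicates

records = [
    {"email": "John@Bedrock.example", "name": "John Smith"},
    {"email": "john@bedrock.example", "name": "J. Smith"},   # duplicate
    {"email": "jane@acme.example", "name": "Jane Doe"},
]
unique, flagged = dedupe(records)
```

Here the first occurrence wins; a real pipeline would apply a smarter survivorship rule, but the load-then-review split is the same.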


Standardizing Formats

Inconsistent formats are the scourge of many systems. Dates, or picklist values for fields such as states and countries, might all denote the same thing while using different formats.

To create consistency, you must look at how data is input and make sure fields are formatted the same across systems. This will reduce the amount of normalization needed.
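As a sketch, date normalization can try a list of known formats and coerce everything to one standard. The formats listed are assumptions about what your systems might emit:

```python
from datetime import datetime

# Hypothetical list of date formats observed across source systems;
# order matters for ambiguous values such as 01-02-2023.
KNOWN_FORMATS = ["%d-%m-%Y", "%Y-%m-%d", "%m/%d/%Y"]

def normalize_date(value: str) -> str:
    """Coerce any recognized date format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")
```

The same pattern, a lookup of known variants coerced to one canonical form, applies to picklist values like states and countries.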


Resolving Conflicts

Say a customer is recorded in two different systems as having two different addresses. How do you know which is correct?

Resolving such conflicts "by hand" is impractical, but luckily there are two automated approaches. You can use one or both.

System of Record

  • SoR determines which system overrides the other to ensure one record for any given piece of information. Think of SoR as the authoritative “single version of the truth”, ranking systems based on your business priorities. For example, your CRM could be the system of record for any data related to a sales opportunity.

Most Recently Updated

  • Uses the most recently updated value for a given field across systems. So if a customer's phone number differs between your CRM and your support system, the most recently updated value wins.
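Both strategies can be sketched in a few lines of Python. The system names, priority ranking, and sample values below are hypothetical:

```python
from datetime import datetime

# Hypothetical system-of-record ranking: lower number wins,
# so the CRM is authoritative here.
SYSTEM_PRIORITY = {"crm": 0, "support": 1, "marketing": 2}

def resolve(field_values, strategy="system_of_record"):
    """Pick the winning value for one field reported by several systems.

    field_values is a list of dicts like
        {"system": "crm", "value": "...", "updated": datetime(...)}.
    """
    if strategy == "system_of_record":
        return min(field_values,
                   key=lambda v: SYSTEM_PRIORITY[v["system"]])["value"]
    if strategy == "most_recent":
        return max(field_values, key=lambda v: v["updated"])["value"]
    raise ValueError(f"unknown strategy: {strategy}")

phone = [
    {"system": "crm", "value": "555-0100", "updated": datetime(2024, 1, 5)},
    {"system": "support", "value": "555-0199", "updated": datetime(2024, 6, 1)},
]
```

With the ranking above, the system-of-record strategy keeps the CRM's number even though the support system's entry is newer; the most-recent strategy does the opposite.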

Matching Common Identifiers

Once data are in the SQL database, fields must match between systems so relational models can feed your analytics. Maintaining the integrity of these object relationships lets you analyze a broader set of data.

Matching like records across systems is accomplished with common identifiers. When matching Contact records, for instance, you could start with an email address, as email provides the highest probability of a unique match across systems. Multi-level de-duplicating keys might incorporate additional supporting data, such as name, company, and address.
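A multi-level matching key can be sketched as a cascade: use email when present, and fall back to name plus company. All field names here are illustrative:

```python
def match_key(record):
    """Build a match key: email when present (highest-probability unique
    match), otherwise fall back to name plus company."""
    email = (record.get("email") or "").strip().lower()
    if email:
        return ("email", email)
    name = (record.get("name") or "").strip().lower()
    company = (record.get("company") or "").strip().lower()
    return ("name+company", f"{name}|{company}")

def match_records(system_a, system_b):
    """Pair up records from two systems that share a match key."""
    index = {match_key(r): r for r in system_a}
    return [(index[match_key(r)], r)
            for r in system_b if match_key(r) in index]
```

Normalizing case and whitespace inside the key is what lets "John@Bedrock.example" in one system match "john@bedrock.example" in another.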

Matching sets the stage for a standard data schema. For example, the record below could map to another data set wherever the Industry is Software or the Contact value reappears:

  • Industry: Software
  • Account: Bedrock
  • Contact: John Smith



Data Modeling

Modeling occurs between systems. For example: a CRM and a marketing automation system where Activity is a related object.

Modeling helps find common matches. If an email address appears in two tables, you can map the common identifiers across many systems and model the relationships between objects.
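The same common identifier can carry a related object's data across systems. Here is a small sketch, with invented records, where an email address shared by a CRM contact and a marketing-automation activity lets the activity inherit the contact's account:

```python
# Invented sample data: contacts from a CRM, activities from a
# marketing automation system, related through the email address.
contacts = [{"email": "john@bedrock.example", "account": "Bedrock"}]
activities = [
    {"email": "john@bedrock.example", "type": "webinar"},
    {"email": "unknown@nowhere.example", "type": "email_open"},
]

def relate(contacts, activities):
    """Attach each activity to its contact's account via the shared email."""
    by_email = {c["email"]: c for c in contacts}
    return [
        {**a, "account": by_email[a["email"]]["account"]}
        for a in activities
        if a["email"] in by_email
    ]
```

This is the in-memory equivalent of a relational join; once warehoused, the same relationship becomes a foreign key between tables.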


3. Warehouse the Data

Now that you have connected applications and normalized data, the third step in building your virtual data pipeline is warehousing the data.

Warehousing data is the recommended approach because it consolidates the data that are now normalized.

Traditionally, during this third step, each system’s data would enter its own warehouse using Extract, Transform, Load, or ETL:

  • CRM → CRM database.
  • MA → MA database.
  • Support → Support database.
  • ERPs → ERP database.  

With this siloed approach, you could either run SQL queries against each of these separate databases or work through many separate business intelligence (BI) dashboards.

Several problems arise from this siloed approach.

1. Data values differ.

    • One system may format dates as DD-MM-YYYY, another as YYYY-MM-DD. One may split customer names into two fields, first and last, while another keeps them in a single field. Such inconsistency makes data messy.
2. Database schemas use unique relational criteria.
    • If one table calls your people Contacts and another calls them Names, records are more prone to error, which hurts your chances of getting a 360-degree view of customers.
3. Systems remain isolated.
    • You get SQL access, but querying data is laborious and time-consuming.
    • As you run SQL queries, parsing which idiosyncrasies belong to which system makes analyzing data difficult.
    • You get a BI dashboard, but it visualizes data for only one of many SaaS applications.
    • Answering questions or visualizing data that runs across multiple data sets is complex and not easily repeated.

Keep in mind that warehousing is also very time-consuming. Joining tables typically presents unanticipated obstacles. And when data inputs invariably change, connected data sources could get out of sync.


Master Data Set

Warehousing data into one master set, rather than many silos, is the recommended approach. By loading each of your data silos into one cloud data warehouse, you combine data across silos and automatically create a master set without custom code.


Unlike the traditional analytics stack, a master data set combines data prep, ETL, warehousing, and modeling functions into a single layer. Implementation time drops from months to days, and the cost of ownership from seven figures to five.


4. Feed Analytics, Reports & Dashboards

Customer dashboards help you analyze the health of your business. By distilling big data sets into visual insights, executives can measure performance with ease and precision. Meanwhile, marketers, sales reps, and support teams can track customer trends and improve KPIs.

Crafting a great customer data dashboard is difficult. Even after accurate data are prepared in a standard format, designing for simplicity is a challenge. A dashboard may forecast behaviors and inspire actionable decisions, but its interface could be complex.

By automating how data feed your dashboard, you get a single interface that manages data flows across all connected applications, keeping your customer data up to date, delivering immediate ROI, and letting you capitalize on opportunities.

Dashboards & Business Intelligence Tools

With data connected, normalized, and warehoused, you can query the dataset either with SQL or with your favorite BI tools, such as Tableau or Looker. That is the magic of centralized data: you can automatically feed any analytics or reporting dashboards. This is especially helpful for IT directors and data analysts. Each function in your pipeline becomes customizable via a web-based interface.
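As an illustration of the query step, here is a sketch using an in-memory SQLite database as a stand-in for the cloud warehouse; the table, columns, and rows are invented:

```python
import sqlite3

# An in-memory SQLite database stands in for the cloud warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT, industry TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("john@bedrock.example", "Software"),
     ("jane@acme.example", "Retail"),
     ("sam@initech.example", "Software")],
)

# Once everything lives in one warehouse, a single query spans what
# used to be several silos.
rows = conn.execute(
    "SELECT industry, COUNT(*) FROM contacts "
    "GROUP BY industry ORDER BY industry"
).fetchall()
print(rows)
```

A BI tool like Tableau or Looker issues essentially the same SQL against the warehouse; it just renders the result as a chart instead of a list of tuples.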


Set Up Your Data Pipeline — What's Next?

Our Fusion product connects these four steps to help you automate your virtual data pipeline.

Try Fusion for free to see for yourself, or learn more here.