Overview of Data Warehousing
Before we explore what a data warehouse is, let's talk about why you would even want or need one in the first place.
For starters, data warehouses are immensely valuable data sources for analysis. They store large quantities of data efficiently, minimize input-output (I/O) for fast query results, and, by acting as a central repository of databases, scale far more easily than a single database or a load-balanced database cluster.
Still, in 2018 you needn't be a bona fide data scientist with a PhD to warehouse data. Technology has evolved to the extent that you may combine multiple data sources through a user interface (UI) that triggers a set of automated processes.
Warehousing data is important for reasons other than backup and storage. Most businesses wish to align their teams around their target audience or customer segments, which requires a thorough inspection of all the steps that occur during the buyer's journey. To craft inbound strategies that tailor content to specific prospects and customers, marketers may want to keep their contacts data in a single place in lieu of looking for contact information in multiple databases. Similarly, to hit or exceed quotas, sales teams want one go-to place for generating pipeline reports, dashboards, and data visualizations which add insight into closed-won deals and bookings revenue.
So whether you're an expert in SQL or a CEO looking for high-level answers, there is tremendous value in unifying your data sources into a single warehouse.
Transactional & Non-Transactional Data Sources
All data must be created somewhere, somehow. When you fill out a form, the email address you enter is data, as is the form that gets submitted. And when you go to the ATM, the dollars in your account must be debited from somewhere before they find their way elsewhere (hopefully into your wallet).
The first example (submitting a form) we'd call "non-transactional data". Less concerned with time than properties, non-transactional data usually includes information about things like:
- Customers (e.g. names, preferences)
- Products (e.g. versions, hierarchies, activities)
- Sites/Locations (e.g. addresses, cities, buildings)
- Accounts (e.g. contract details)
In contrast, the second example (ATM) counts as "transactional data", or more moment-to-moment statistics such as:
- Financial information (e.g. orders, invoices, payments)
- Time information (i.e. timestamps recorded when a transaction occurs)
A database may record either category of data, transactional or non-transactional. But the category of data collected will typically determine the kind of relational database you use to store it.
OLTP & OLAP
OLTP (on-line transaction processing) and OLAP (on-line analytical processing) are the two major types of relational database systems used in SQL and database development. The former handles transactions, storing and manipulating the data; the latter handles analysis. Both support business applications.
OLTP systems provide source data to data warehouses. Their transactions occur in large volumes but are short in duration (INSERT, UPDATE, DELETE), a simplicity that allows for fast query processing, fewer errors, and higher data integrity.
Examples: ATM, supermarket cash registers, text messages.
On-line Analytical Processing, or OLAP, handles larger, more complex datasets. The data is aggregated and historical, often stored in multi-dimensional schemas and joined tables, which slows processing time.
Examples: CRM contact data
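The contrast between the two workloads can be sketched with two representative queries. This is a minimal illustration using SQLite as a stand-in; the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, account TEXT, amount REAL, ts TEXT)")

# OLTP-style work: many short writes, one row at a time.
cur.execute("INSERT INTO payments (account, amount, ts) VALUES (?, ?, ?)",
            ("acct_42", -20.00, "2018-06-01T09:15:00"))
cur.execute("INSERT INTO payments (account, amount, ts) VALUES (?, ?, ?)",
            ("acct_42", -35.50, "2018-06-02T14:05:00"))
conn.commit()

# OLAP-style work: one longer read that aggregates history.
cur.execute("SELECT account, COUNT(*), SUM(amount) FROM payments GROUP BY account")
rows = cur.fetchall()
print(rows)  # → [('acct_42', 2, -55.5)]
```

The OLTP statements each touch a single row, which is why such systems stay fast and consistent; the OLAP query scans and aggregates the whole history, which is why analytical workloads favor a separate, warehouse-style store.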
Data Replication & Access
How we access data will also depend on whether it's transactional or non-transactional. Some REST APIs, for example, supply non-transactional data (e.g. companies, leads, opportunities, etc.) to the world online. Other services offer downloadable bulk data, limited by file size and the number of downloads permitted over a given period.
Replication is key to data storage: data must be copied from the database where it lives before it can be moved into another. Only once multiple data sources have been replicated and extracted can they be stored in a data warehouse, a relational database used for analysis.
If any of this is confusing, don't worry. The important thing to remember is that certain kinds of databases handle transactions and others store objects with properties. A data warehouse is what allows for scalable analytics, typically by storing and backing up multiple databases.
A data warehouse's job is thus to integrate data from multiple sources, storing them for in-database analytics with parallel processing that allows for faster insights than had the data remained siloed in their original data sources.
Batch Job Processing
How exactly does data flow from its original sources to the relational system?
For many years, data migrated via batch processes. Computers would work through a queue, or batch, of separate jobs (programs), each of which told the computer where to look for the data, what to look for in it, and where to put the data if it was moving to another destination.
Cron Jobs, Bulk Extracts, & REST APIs
But just like weighing a decision, executing a series of jobs takes time. These programs require complex logic in the form of code commands and shell scripts that dictate the schedules of these time-based jobs, or “cron jobs”.
While common, scheduling batch jobs this way isn't entirely efficient. Once a job has been scheduled, you can't terminate it partway through without killing the whole execution.
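A cron-scheduled batch job might look like the following sketch. The script path in the comment and the record shape are hypothetical; the crontab line is what would schedule the job nightly:

```python
# A nightly batch job, scheduled via a crontab entry such as:
#   0 2 * * * /usr/bin/python3 /opt/jobs/nightly_extract.py
# Once cron launches it, the job runs to completion; there is no
# built-in way to cancel it partway through without killing the process.

def run_batch(records):
    """Work through a queue of records: decide what to extract and what moves on."""
    extracted = []
    for record in records:          # each record says where to look...
        if record.get("changed"):   # ...what to look for in it...
            extracted.append(record["id"])  # ...and what gets moved along
    return extracted

if __name__ == "__main__":
    queue = [{"id": 1, "changed": True}, {"id": 2, "changed": False}]
    print(run_batch(queue))  # → [1]
```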
A bulk extract is a batch job in which a program pulls specific data from a specific data source. Executing one usually requires authenticating with a REST API, or Application Programming Interface.
Marketo’s Bulk Extract API, for example, provides interfaces for retrieving large sets of person and person-related data. These bulk extracts allow users to define the set of data they wish to retrieve, such as Leads (Persons) or Activities. After the job has been created, users then wait, sometimes many hours, for the job to complete before retrieving the file over HTTP.
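The create-then-poll pattern behind a bulk extract looks roughly like this. The client class below is a simulated stand-in, not Marketo's actual API; a real client would authenticate and issue HTTP requests:

```python
import time

class FakeBulkAPI:
    """Simulated bulk-extract service; stands in for HTTP calls to a real API."""
    def __init__(self):
        self._polls = 0

    def create_job(self, fields):
        # Real APIs return a job id after you define the data set to export.
        return {"job_id": "export-1", "fields": fields}

    def job_status(self, job_id):
        # Jobs run asynchronously; early polls report "processing".
        self._polls += 1
        return "completed" if self._polls >= 3 else "processing"

    def download(self, job_id):
        return "id,email\n1,a@example.com\n"

def bulk_extract(api, fields, poll_interval=0.01):
    """Create an export job, poll until it finishes, then fetch the file."""
    job = api.create_job(fields)
    while api.job_status(job["job_id"]) != "completed":
        time.sleep(poll_interval)   # in practice this wait can last hours
    return api.download(job["job_id"])

csv_data = bulk_extract(FakeBulkAPI(), ["id", "email"])
print(csv_data.splitlines()[0])  # → id,email
```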
Bulk extracts are less ideal than having all the data already in a warehouse. Most bulk extract APIs execute jobs asynchronously, rather than all at once, which takes a lot of time if you have a lot of data. Source applications, be they CRMs or ERPs, also tend to limit the number of export jobs you can run, as well as the file size.
While bulk extracts will call REST APIs to retrieve data, as mentioned, it’s easy to reach API limits by running too many jobs or asking for updates every time the data change. Older systems therefore tend to ingest data less frequently, once a day or even once a week, using a slightly more elaborate process known as ETL.
Extract, Transform, Load (ETL)
Short for Extract, Transform, Load, ETL performs three main functions to get data from the source to a data warehouse:
An ETL tool first reads data from one database and pulls it out, as previously described, so the data can be placed in another database. Extraction may require reading from many data sources, which is very time-consuming: data may be scattered across many shards or clusters, each requiring its own OAuth-based access. Reading data also requires many so-called "handshakes" to account for differences in format (.csv, JSON, etc.). Add to this the danger of APIs changing or being deprecated over time, and the risk of errors during your extraction rises significantly.
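Part of what extraction must reconcile is format: the same kind of record may arrive as CSV from one source and JSON from another. A minimal sketch of normalizing both into common rows (payloads are invented for illustration):

```python
import csv, io, json

csv_payload = "id,email\n1,a@example.com\n"
json_payload = '[{"id": 2, "email": "b@example.com"}]'

# Each "handshake" is a format-specific reader producing uniform dicts.
rows = list(csv.DictReader(io.StringIO(csv_payload)))
rows += [{"id": str(r["id"]), "email": r["email"]} for r in json.loads(json_payload)]

print(rows)
```

Note the extra work even in this toy case: CSV delivers every value as a string, while JSON preserves numbers, so the JSON branch coerces `id` to a string to keep the rows uniform.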
Next, the ETL tool will transform data formats and structures. If, for example, you use a data source that writes out months as the full name (e.g. January), and another data source uses a three-letter abbreviation for months (e.g. Jan.), and a third data source uses a numeral (e.g. 01), while all these values refer to the same thing (the month of January), you’re going to want like data to be in like format. Otherwise, when you go to report on the data, you’ll get incomplete information or write overly long, redundant queries.
Transformation is often the longest, most complex step in ETL, as there are often duplicates and different formats. Transformation thus aims for consistency and readability, which is easier said than done.
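The month example above can be handled by a small normalization step. This is a sketch; real pipelines fold many such rules into the transform phase:

```python
# Map each source's representation of a month onto one canonical form.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
CANON = {}
for i, name in enumerate(MONTHS, start=1):
    CANON[name.lower()] = name          # full name:   "january"
    CANON[name.lower()[:3]] = name      # abbreviation: "jan"
    CANON[f"{i:02d}"] = name            # numeral:      "01"

def normalize_month(value):
    """Reduce 'January', 'Jan.', and '01' to the single form 'January'."""
    return CANON[str(value).strip().rstrip(".").lower()]

print(normalize_month("January"), normalize_month("Jan."), normalize_month("01"))
# → January January January
```

With every source mapped to one representation, a report can group on the month column directly instead of repeating OR-clauses for each variant.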
The last phase is loading the modified data into a destination database. This can include all your master data — users, products, suppliers, places, sites, business units.
Automated Data Pipeline
ETL has many downsides. Once data is loaded into a database, it may change in the source application, requiring you to update it. Not only that, but new data is constantly being created: organizational structures change, businesses merge, fields and operating rules get modified, and new sources of data are added all the time.
So while an ETL’s loading schedule may run on any timeframe, you still have to maintain the batch process, oversee its execution, and write code for many APIs. Even then, the less often data gets loaded into the data warehouse, the less likely it is to be accurate, complete, and up-to-date. When data isn’t accurate, complete, or current, it’s often said to have “lost integrity”. And whether you’re a business analyst or a marketer measuring the success of a mobile app campaign, the less integrity your data has, the less trustworthy your analysis becomes.
To avoid the data integrity issue, Fusion has removed the manual labor and maintenance of ETL by automating the data pipeline instead. Fusion automatically authenticates with your cloud applications’ APIs and feeds a data warehouse which refreshes every 30 minutes, a cadence frequent enough to ensure whatever analysis you’re looking to conduct represents the latest and greatest information.
Fusion isn’t an ETL tool; it’s a self-serve platform that automates ETL. Furthermore, most ETL tools traditionally don’t warehouse data; Fusion does, all while securely backing up data on a continual basis and staying current with API updates.
Finally, in the world of ETL, data gets loaded into separate warehouses -- a sales warehouse for sales data, a marketing warehouse for leads and events, and so on. Fusion brings all the data sources into one cloud data warehouse rather than many. Automating the data pipeline in this fashion saves engineers months and companies hundreds of thousands of dollars, which is no surprise when you consider that preparing data for analysis consumes some 80% of the time in an analytics project.
On-Premise Data Warehouses vs. Cloud Data Warehouses
To further simplify setup, operation, and scaling of a relational database for use in applications, modern data warehouses tend to run as web services “in the cloud”. Fusion’s warehouse, for instance, uses Amazon Relational Database Service (Amazon RDS), which in contrast to most on-premise data warehouses can easily change size.
The advantages of a cloud data warehouse are myriad. With data in the cloud, users needn’t worry about hiring administrators to apply patches, back up data, encrypt logs for security, optimize memory, or provision hardware. So for data engineers who want fast performance but lack the resources to maintain an on-premise data warehouse, the cloud is a wise investment.
Data Lakes vs. Structured Data
Throughout this exploration, there has been a big elephant in the room: structure. In rare cases it’s fine to just throw all your data into a warehouse and make sense of it later. Other times, however, you’ll want to identify tables, clean up formats, secure sensitive information, and govern the data from an analytics standpoint.
This process of data governance, sometimes referred to as Master Data Management, or MDM, hints at the importance of “owning” data -- with a structure that fits your organization’s needs. Larger companies, for the most part, collect a greater variety, velocity, and volume of data. Many ingest hundreds of terabytes of structured and unstructured data every day -- log files, images, and so on. For this reason, larger companies often opt to store their data in a “data lake”, where different data silos feed a single repository.
Apache Hadoop is one common solution for enterprises that collect enormous quantities of data. With its distributed file system (HDFS) for storing extremely large datasets, Hadoop processes data by mapping it into key-value pairs, which are then shuffled into groups and reduced in size for transmission across a network. This process of preparing Big Data for analysis is known as MapReduce.
And this is where we return to the notion of batch processing, because MapReduce works through an entire dataset in a batch before handing off to a distributed data warehouse like Hive for SQL-based queries and analysis later. A relational database management system (RDBMS), in contrast, allows analysts to query data in near real-time, at low latency. So unless you work at a Netflix or a Facebook, implementing a data lake using Hadoop or a similar tool likely won’t suit your needs.
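MapReduce's map, shuffle, and reduce stages can be sketched in a few lines. This is an in-memory toy (a classic word count); Hadoop distributes each stage across machines:

```python
from collections import defaultdict

docs = ["the cat sat", "the cat ran"]

# Map: emit a key-value pair for every word in every record.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group values by key, as Hadoop does between map and reduce.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: collapse each group into a smaller result for transmission.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # → {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

The batch nature is visible even here: every stage consumes the entire output of the previous one, which is why results arrive only after the whole dataset has been processed, unlike the low-latency, per-query reads of an RDBMS.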
But an RDBMS isn’t perfect either. Imagine you work at an 80-person company that collects product data and has ten to twenty cloud applications for managing customer data (Zendesk for support; Salesforce for deals and opportunities; Marketo for leads and campaigns; QuickBooks for billing and accounting; and so on).
Well, each of these systems has a different schema, or data model, much the way downtown buildings all have unique blueprints and floor plans. Under the ETL paradigm, engineers would have to extract the data from each source application, transform it, load it into a separate data warehouse, then query the siloed data. For each data source, they must learn new API endpoints, wait for data to load, clean the data, triple-check for errors, and join tables across applications with different data models. So an RDBMS comes with its own set of challenges when it comes to transforming data from multiple data models.
Beyond accuracy, the dynamic nature of data forbids real-time answers. No sooner does a new customer convert or cancel a subscription than your data becomes obsolete, inaccurate, or incomplete. And when data is fragmented, stored in individual data warehouses or in the cloud applications themselves, data controllers and processors cannot know where and how their data is stored and structured at all times, even though regulations like GDPR demand exactly such an inventory of customer records.
Such data chaos spells danger — not just for engineers, but for IT managers, marketing operations, sales reps, support agents, accounting, system administrators, and CxOs looking for a unified view of their customer. This isn’t hyperbole. If you load marketing data into a marketing data warehouse, sales data into a sales data warehouse, and support data into a support data warehouse, how can marketing tell whether their campaigns have been successful? How does sales know which types of prospects turn into better deals? How can support know which deals are optimal for retention? How well can an organization run when all departments are reporting from different data sources, each with varying degrees of data integrity and accuracy?
The Power of a Single Schema
Take a look at the diagram above. Imagine you were using HubSpot and NetSuite. HubSpot uses Contact as an object; NetSuite, Lead. Fusion can read both the hubspot_contact_first_name field and the netsuite_lead_first_name field, then fuse the two fields into one table in your data warehouse called fused_contact, under the field first_name.
This is possible because Fusion creates a single schema out of the two systems’ different schemas. In this sense, Fusion is neither a data lake nor a set of siloed relational databases. It’s a unified data warehouse whose RDBMS accounts for the data models of every data source ingested, transforming multiple data models into a unified schema.
Each of these objects represents a separate table in the data warehouse. The power here is that you can easily join data from all your data sources. If you use NetSuite, Dynamics, and QuickBooks, Fusion aggregates the company data and activity data from all three so you can analyze them together without having to extract data from each of the source applications.
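Using SQLite as a stand-in for the warehouse, fusing the two source fields described above might look like this (table and column names follow the HubSpot/NetSuite example; the mechanics are a sketch, not Fusion's actual implementation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two source tables with different schemas for the same kind of record.
cur.execute("CREATE TABLE hubspot_contact (hubspot_contact_first_name TEXT)")
cur.execute("CREATE TABLE netsuite_lead (netsuite_lead_first_name TEXT)")
cur.execute("INSERT INTO hubspot_contact VALUES ('Ada')")
cur.execute("INSERT INTO netsuite_lead VALUES ('Grace')")

# One fused table mapping both source fields onto first_name.
cur.execute("CREATE TABLE fused_contact (first_name TEXT)")
cur.execute("""
    INSERT INTO fused_contact
    SELECT hubspot_contact_first_name FROM hubspot_contact
    UNION
    SELECT netsuite_lead_first_name FROM netsuite_lead
""")

cur.execute("SELECT first_name FROM fused_contact ORDER BY first_name")
fused = cur.fetchall()
print(fused)  # → [('Ada',), ('Grace',)]
```

The UNION also deduplicates: a contact present in both systems would appear once in fused_contact rather than twice.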
Amazon RDS, MySQL, & SQL
As an OLAP system, Fusion uses MySQL for its database, which is hosted on Amazon Web Services (AWS) and runs on RDS. This stack lets users query a SQL data warehouse in the cloud, in near real-time, with low latency.
With its MySQL architecture, Fusion provides full relational database support for structured columnar stores and comprehensive support for the SQL language. And because Fusion works as a data source for any service that accepts MySQL, it’s also great for developers who build modern data applications. With data backed up and stored in Fusion, you can query many data sources at once, asking one database rather than many, using a syntax the relational database understands, without hitting API limits.
This structured query language, SQL, can ask questions of many database engines. So whether you use PostgreSQL or Microsoft SQL Server, either will understand SQL syntax. And while we won’t delve into the differences between each database engine, which you choose for your data warehouse does matter.
Fusion, for example, uses MySQL as its database engine. We liked MySQL because it is open source, scalable, and can establish relationships between tables using primary and foreign keys. MySQL is also so popular a database engine that you can plug it into virtually any analytics tool and query it in SQL.
Given its popularity, many developers also use MySQL for building web applications, partly because MySQL treats a transaction as a single unit: if one operation fails, the entire transaction fails with it. For financial transactions, such data integrity is extremely important in a high-speed system designed to process millions of queries and thousands of transactions, while also offering memory caches, full-text indexes, and point-in-time recovery for performance monitoring and troubleshooting.
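That all-or-nothing behavior can be demonstrated with any transactional engine. SQLite is used here for a self-contained sketch; the semantics mirror MySQL's InnoDB:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    # Both sides of the transfer belong to one transaction...
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    raise RuntimeError("power failure mid-transfer")  # simulated fault
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # ...so a failure undoes the debit as well

balances = dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
print(balances)  # → {'alice': 100.0, 'bob': 50.0}
```

Because the debit and the credit commit together or not at all, no money can vanish between the two statements.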
Analytics & Business Applications
In a data warehouse, you’re likely not going to be changing data; you’ll be analyzing it or feeding it to applications you want to develop. That’s why Fusion transforms the data before loading it into the warehouse: duplicates, conflicts, and inconsistent formats muddy results. So if you're using two different systems and have the same contact in both, Fusion creates one fused contact for that person, merging the relevant data from each.
As nearly all analytics tools (Tableau, Looker, etc.) accept MySQL as a data source, having all your data in a MySQL database means you simply connect your warehouse to the business intelligence platform, then either begin writing SQL queries directly against any one of those data sources, or creating visualizations.
You can answer key questions about your business: Which leads create the most profitable customers? Through which channels do you capture them, and when? What products do these customers tend to buy, and how often do they renew or upgrade their subscriptions?
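With everything in one schema, a question like "which channels produce the most revenue" becomes a single join. SQLite again serves as a stand-in for the warehouse, and the tables and values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leads (id INTEGER PRIMARY KEY, channel TEXT);
    CREATE TABLE deals (lead_id INTEGER, amount REAL);
    INSERT INTO leads VALUES (1, 'organic'), (2, 'paid'), (3, 'organic');
    INSERT INTO deals VALUES (1, 500.0), (2, 200.0), (3, 700.0);
""")

# One query across what would otherwise live in two separate applications.
rows = conn.execute("""
    SELECT l.channel, SUM(d.amount) AS revenue
    FROM leads l JOIN deals d ON d.lead_id = l.id
    GROUP BY l.channel
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # → [('organic', 1200.0), ('paid', 200.0)]
```

This is exactly the query a BI tool would issue against the warehouse; without the unified schema, answering it would mean exporting from two applications and joining by hand.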
Use Fusion to automatically pull CRM and ERP data tables into an on-demand warehouse, then feed accounts, contacts, opportunities, invoices, and billing information into any marketing, finance, support, or customer application that supports MySQL as a data source.
Mapping Custom Fields in a Data Warehouse
Now while Fusion boils many schemas down to one, what if you want more control over how your fields are mapped?
Fusion lets you customize your field mappings right inside its UI. Such mappings make storage in your Fusion warehouse more efficient, and they give you a blueprint of your fused warehouse so that customer data is easy to query and visualize within your BI tool of choice. All you need to do is connect Fusion to any report, dashboard, or BI tool that accepts a MySQL database -- which covers the vast majority of today’s market.
Want to drive marketing and sales teams with a unified data warehouse? Try Fusion's 14-day free trial.