At first blush, data warehouse architecture may appear simple: source systems feed data to the warehouse, which you can then hook up to any BI tool for analytics, dashboards, and reports.
However, feeding data from source systems to the data warehouse is far more complex. Only certain data needs to be added to the warehouse, and loaded incrementally. Business requirements tend to change, as do the endpoints of operational systems like CRMs, ERPs, and marketing automation applications. Over time, a business may also want to add or remove a data source so their data feeds and business rules stay relevant with their vision. Lastly, whenever the records themselves change, the data warehouse must track and capture only those values, without changing the schema each time or jeopardizing data consistency and standardization.
What is Change Data Capture?
Change data capture (CDC) solves these data integration problems by monitoring, inserting, updating, and deleting changes to records or activity on table data.
The first step in this process is to identify that a change has been made to a data source. To do so, a CDC procedure will inspect the relevant table to see if there is a delta between the old values and new values, based on what’s stored in the source systems, then update the pertinent record accordingly.
There are several good methods for doing this — data modified, diff, and triggers, for example. But in a popular database like MySQL, one of the easiest ways to track data changes is by reading changes to binary logs stored in the transactional database.
While interpreting the changes in the transaction log has certain drawbacks, a log-based history ensures that only the rows that need to change will. And since changes are also captured independent of the source application, log-based CDC is generally considered the best approach for systems with high transaction volumes.
Why Change Data Capture Matters
Change data capture is important for several reasons. Foremost, it is more efficient and scalable. Take our cloud data warehouse, Fusion, which automates your data pipeline by ingesting data from popular SaaS applications like HubSpot, Salesforce, Marketo, and NetSuite.
To ingest raw data, Fusion “tails” the log files of every cloud application’s records, starting with the first created and ending with the last modified. (The term “tail” is a Unix command that prints the last few number of lines of a certain file, then terminates.) In this context, Fusion reads internal logs to capture every change in your data warehouse, so analysis is fast and accurate.
The next step in the Fusion algorithm is to detect changes — as either new records or modifications to existing ones — to the data in your connectors (i.e. cloud applications). During this stage Fusion checks each record ingested and compares it to the cached copy, asking, “do we need the next page?” If yes, Fusion will keep requesting records pages from your connectors. If not, this triggers the next phase in the algorithm: transformation.
While we now take CDC for granted, such an approach is revolutionary in terms of how data warehouses keep data up-to-date. Previously, data warehouses used to just let you report and analyze results from the past — such as last quarter’s sales or how product inventory fluctuated over time. Now, data strategists can get visibility into the past as well as see the present in near real-time, with such granularity that it's even possible to predict future outcomes.
Change data capture has, in this sense, revolutionized ETL. For at first, businesses simply wanted to acquire data more quickly. So to minimize data latency, they sped up batch ETL jobs until the data warehouse ingested data hourly or even more frequently.
But this ETL-on-steroids approach gets clunky fast. For one, ETL is heavily driven by business rules to which many different stakeholders must agree, and with a large ETL engine it's hard to ensure tables stay up-to-date or feel confident that you needn’t change the database schema. As data from the source systems grows, ETL performance can suffer and the data warehouse's expansion become difficult to manage and complicated to scale. More data must be cleansed and more data loaded into the data warehouse to create a read-only copy. Soon, reports are less accurate and the data warehouse's architecture a messy, bloated, sluggish mess.
To avoid such pitfalls, CDC minimizes the resources required for ETL processes by only addressing changes to the data. This design eliminates the need for bulk load updating and inconvenient batch windows. Instead, with CDC you get incremental loading of real-time data changes into your data warehouse with nearly zero downtime. Hence, CDC complements ETL. And the result is scalable, highly available, synchronous data; less impact on the database; and faster performance for populating real-time business intelligence dashboards. In short, with change data capture you can focus on what’s important: analyzing fresh, accurate data from all your source systems.
Want to unify data from all your SaaS systems? Sign up for free and take Fusion's cloud data warehouse for a spin.