A semantic layer lets organizations connect data warehouses, data lakes, and data lakehouses to an existing data ecosystem to ensure their continued relevance now and in the future.
Data warehouses, data lakes, and data lakehouses are arguably the most popular methods of integrating data today. They’re older, current, and newer integration approaches—respectively—for running analytics and applications across all enterprise data. Using a semantic layer to connect them to the rest of the data ecosystem in the cloud, on-premises, and at the edge ensures their continued relevance now and in the future.
Why? Data warehouses rely on transformations to integrate schemas across sources. Data lakes let organizations store all data together in a single repository, regardless of variations in format or structure. Data lakehouses combine the low-cost storage of the latter with the data modeling capabilities of the former.
Although each approach has pros and cons for enabling organizations to leverage analytics to solve business problems and create competitive advantage, they’re all based on the same basic data management method of collocating data at the storage layer. Such physical consolidation requires moving data with predominantly batch replication processes that are rigid and oftentimes brittle.
But there’s now a superior approach to traditional data management called data fabric that enables organizations to avoid endlessly replicating data for integrations. A data fabric allows organizations to integrate data at the computational layer, instead of the physical storage layer, with a semantic graph layer connecting, rather than collocating, all sources. Data never moves unless it is required for business processes, so the computation goes to the data instead of the data moving to the computation.
Data fabrics reinforce agility; decrease the time, effort, and cost of queries; and add a pivotal layer of business meaning to data that tremendously improves the quality of the insights generated from analyzing it. By describing data in business-friendly terms understood by data engineers, data scientists, and other data and analytics users, this semantic layer is essential for maximizing the value of integrating data in data warehouses, data lakes, and data lakehouses.
This contemporary integration style accelerates business comprehension of data for discovering insights, expedites queries, and yields more targeted, meaningful results for analytics use cases across data-intensive industries such as financial services, life sciences, and technology and software. By pairing the best of traditional integration approaches with a unified semantic layer, organizations can update these tools within a modern data fabric, reducing the time and cost of data discovery and integration while improving analytics insight.
Early binding
There’s no disputing the worth of data warehouses and data lakes for integrating data. Transformation techniques like ETL and ELT are effective at integrating data into a defined schema in warehouses and lakehouses, especially for relational technologies. Still, this method requires moving data out of sources to integrate it at the storage layer of these repositories, even in the most successful modern cloud warehouses like Snowflake. And although data lakes collocate data in its native formats (which still requires movement), they don’t resolve underlying differences in structure, schema, or format.
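To make the early binding concrete, here’s a minimal sketch of that kind of transformation in Python with pandas. The source file, column names, and target schema are purely illustrative; the point is that the warehouse schema is fixed before any question is asked of the data, so every change to requirements ripples back through the mapping.

```python
import pandas as pd

# Hypothetical CRM export; file name and column names are illustrative only.
crm = pd.read_csv("crm_accounts.csv")

# Early binding: the warehouse table's schema is fixed before any question is
# asked of the data, so every source must be reshaped into it during the load.
WAREHOUSE_COLUMNS = ["account_id", "account_name", "region_code", "annual_revenue_usd"]

transformed = pd.DataFrame({
    "account_id": crm["AcctNo"].astype(str),
    "account_name": crm["Name"].str.strip(),
    "region_code": crm["Territory"].map({"NA": "north_america", "EU": "europe"}),
    "annual_revenue_usd": crm["Revenue"].fillna(0).astype(float),
})[WAREHOUSE_COLUMNS]

# Load step (warehouse connection omitted). If the business later needs a new
# attribute or a different definition of "region", this mapping and the target
# table both have to be reworked and the data re-replicated.
transformed.to_parquet("dim_account.parquet", index=False)
```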
There are a couple of additional challenges to overcome with these early binding integration mechanisms as well. Moving data via batch replications is resource-intensive, time-consuming, and costly. Further, moving data in bulk costs more than moving it minimally, and it creates more difficult downstream tasks like, say, consent management for sensitive data. Organizations must also align the numerous differences in schema, business terminology, and definitions across sources. As a result, data modeling is burdensome, especially when business requirements or sources change and the schema must be manually reconstructed. Moreover, the schema itself is difficult for business users to understand because it’s created by IT teams and data modelers who typically don’t describe data in business terms.
Finally, this approach requires a single version of the truth that’s difficult to reconcile across the multiple, incompatible schemas that coexist across business units. This necessity creates conflict between departments: the sales team, for example, pushes for its definitions and meaning of data to be represented, while the marketing or legal team does the same. Ultimately, someone loses in these contests. The early binding required by these traditional integration approaches means firms must commit to a schema (one that isn’t in business terms) before integrating data. If anything changes, recalibrating the data model takes considerably more work, cost, and time, especially with relational technologies.
Data fabric architecture
The flexible architecture and loose coupling of a data fabric overcome the limitations of the early binding of data warehouses, data lakes, and lakehouses, making them much more agile and potent. Data stays wherever it is, whether it’s already in the aforementioned repositories or in their individual source systems. Because there’s no storage layer consolidation required, organizations don’t need elaborate, costly data pipelines for constantly moving data throughout their ecosystems. Instead, data is integrated at the computational layer for business processes at the time they’re required, not beforehand. Further, data is minimally integrated, that is, in response to particular queries or searches that are tied to specific business processes or outcomes; this is in stark contrast to the bulk, “move it all, we’ll sort it out later” style of the older approaches. For example, if there’s sales data in two S3 buckets, Databricks, and an on-premises relational data warehouse, the data only moves the least amount possible to integrate it for querying when it’s time to do so. This method effectively decouples business logic from the storage layer of different systems and implements it at the point of computation.
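As an illustrative sketch of that example, the snippet below assumes DuckDB with its httpfs and postgres extensions as a stand-in federation engine (the article prescribes no specific tool, and the Databricks source is omitted for brevity). The bucket paths, connection string, and table names are placeholders.

```python
import duckdb

# Minimal federation sketch; assumes DuckDB with its httpfs and postgres extensions.
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
# S3 credential/region configuration omitted for brevity.

# Attach the (hypothetical) on-premises relational warehouse.
con.execute("ATTACH 'host=onprem-dw dbname=sales user=analyst' AS dw (TYPE postgres)")

# One query spans two S3 buckets and the warehouse at request time; filters and
# projections are pushed down to each scan so as little data as possible moves.
result = con.execute("""
    SELECT r.region, SUM(o.amount) AS q3_sales
    FROM read_parquet('s3://bucket-a/orders/*.parquet') AS o
    JOIN read_parquet('s3://bucket-b/customers/*.parquet') AS c
      ON o.customer_id = c.customer_id
    JOIN dw.public.region_lookup AS r
      ON c.region_code = r.code
    WHERE o.order_date BETWEEN DATE '2024-07-01' AND DATE '2024-09-30'
    GROUP BY r.region
""").fetchdf()
print(result)
```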
There are several desirable outcomes of this approach, including support for multi-tenant schemas. In the earlier use case in which the legal and sales departments each vie to populate a data warehouse’s schema with their own departmental terms and definitions, such conflict would never occur. Those two schemas could peaceably coexist without having to choose one over the other because there’s no early binding commitment to a single version of the truth or its schema. Another advantage is the heightened flexibility of this methodology, which costs less and reduces time to insight because there’s no continual data wrangling to integrate different sources at the storage layer. The biggest benefit is the unified semantic layer connecting the entire ecosystem in terminology business users understand. This enables richer, more nuanced queries, and better results, when loading applications or analytics platforms for insights. The semantic layer translates directly into greater ROI on existing infrastructure for conventional integration methods while modernizing it to be future-proof.
A unified semantic layer
The unified semantic layer of a contemporary data fabric is characterized by three capabilities that deliver the foregoing advantages. The first, data virtualization, is an integral part of these benefits and this overall approach. This technology ensures that data only moves the minimal amount required to fulfill a specific business task, like determining features for a machine learning model. Data virtualization keeps data cached in its sources (such as a data lakehouse), so data doesn’t have to be copied. This virtualization layer connects all sources throughout the entire data ecosystem and supports the second capability: federated querying.
Because data remains in its source systems wherever they are, in the cloud, on-premises, or at the edge, users can query across them on demand. Query federation eliminates the need for bulk data movement and occurs only when a business objective necessitates it. Not only is this approach more economical than copying data before it’s actually used, it also improves performance and reduces network bandwidth usage. Plus, organizations retain greater control over their data, which is ideal for the surplus of data privacy and regulatory mandates deluging them and their risk mitigation teams.
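The principle can be sketched in a few lines of Python: each source adapter evaluates the filter locally and returns only the matching rows, so the computation travels to the data. The adapter interface and sample sources below are hypothetical, intended only to illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical adapter interface: each source evaluates the predicate itself,
# so unmatched rows never cross the network.
@dataclass
class SourceAdapter:
    name: str
    run_filtered: Callable[[str], Iterable[dict]]

def federated_query(sources: list[SourceAdapter], predicate: str) -> list[dict]:
    """Gather only the rows each source says match the business question."""
    rows: list[dict] = []
    for source in sources:
        rows.extend(source.run_filtered(predicate))
    return rows

# Stand-in adapters return canned rows; real ones would translate the predicate
# into a source-native query (SQL, an API call, and so on).
lake = SourceAdapter("s3_orders", lambda p: [{"order_id": 1, "region": "EU"}])
warehouse = SourceAdapter("onprem_dw", lambda p: [{"order_id": 7, "region": "EU"}])

print(federated_query([lake, warehouse], "region = 'EU'"))
```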
The third trait of a modern data fabric is the semantic graph data model, which eliminates many of the traditional hassles of other types of data models, especially relational ones. This underlying semantic knowledge graph model standardizes business concepts, their terms, and their definitions in language business users comprehend. That alone substantially increases the flexibility of data lakes, data lakehouses, and data warehouses in two chief ways. First, it enables organizations to accurately represent the complexity of the world through their data with vocabularies and taxonomies business users understand. Second, it allows organizations to dynamically use different schemas, at query time, with business logic at the computational layer, so there’s no need to pick one business unit’s schema over another’s.
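A minimal sketch of such a model, assuming the rdflib library and an entirely hypothetical business vocabulary, might look like this: concepts carry business-friendly labels, and queries are phrased in those terms rather than in warehouse column names.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Hypothetical business vocabulary; a real deployment would reuse a governed,
# shared ontology rather than defining terms ad hoc.
BIZ = Namespace("https://example.com/business/")

g = Graph()
g.bind("biz", BIZ)

# Business concepts carry plain-language labels.
g.add((BIZ.Customer, RDF.type, RDFS.Class))
g.add((BIZ.Customer, RDFS.label, Literal("Customer")))
g.add((BIZ.hasLifetimeValue, RDFS.label, Literal("lifetime value")))

# Instance data is described with the same business terms.
g.add((BIZ.acme, RDF.type, BIZ.Customer))
g.add((BIZ.acme, RDFS.label, Literal("Acme Corp")))
g.add((BIZ.acme, BIZ.hasLifetimeValue, Literal(125000)))

# Queries are phrased in business vocabulary, not warehouse column names.
query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX biz: <https://example.com/business/>
    SELECT ?name ?value WHERE {
        ?c a biz:Customer ;
           rdfs:label ?name ;
           biz:hasLifetimeValue ?value .
    }
"""
for name, value in g.query(query):
    print(name, value)
```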
Semantic graphs: Key to modernizing data integration
The malleable nature of semantic graph models makes them well suited to connecting diverse sources via a data fabric. These data models naturally evolve to accommodate new business requirements or additional sources. So when one wants to query across business units with different schemas, entities, and definitions (like those of the sales and marketing teams, for instance), it’s much easier to merge those two schemas into a third for comprehensive insight. The adaptive nature of semantic graph models is largely attributable to their standards-based approach, which supports schema combination, interoperability, and machine intelligence.
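For instance, in a hedged sketch using rdflib (with illustrative namespaces), two departmental vocabularies can coexist and be bridged by a thin mapping layer at query time, so neither schema has to win:

```python
from rdflib import Graph, Namespace, RDF, RDFS

# All namespaces and entity names below are hypothetical.
SALES = Namespace("https://example.com/sales/")
MKT = Namespace("https://example.com/marketing/")
SHARED = Namespace("https://example.com/shared/")

g = Graph()

# Each department keeps its own terms...
g.add((SALES.acme, RDF.type, SALES.Client))
g.add((MKT.globex, RDF.type, MKT.Prospect))

# ...and a small mapping schema relates them to a shared concept.
g.add((SALES.Client, RDFS.subClassOf, SHARED.Customer))
g.add((MKT.Prospect, RDFS.subClassOf, SHARED.Customer))

# A property path follows the mapping at query time, returning entities
# from both schemas without rebuilding either one.
query = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX shared: <https://example.com/shared/>
    SELECT ?entity WHERE { ?entity rdf:type/rdfs:subClassOf* shared:Customer . }
"""
for (entity,) in g.query(query):
    print(entity)
```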
Standards-based models (called ontologies) represent the real world with uniform taxonomies, vocabularies, and definitions of terms, eliminating the siloed approach to managing data. This capability encourages schema reuse across departments and use cases, decreasing the work, time, and data preparation required for applications or analytics. Applications naturally become easier to build this way, offering faster time to value. Best of all, end users can focus on the higher-level value of data’s meaning and relationships instead of the technical requirements for accessing, preparing, and understanding it.
Data warehouses, data lakes, and data lakehouses are still important to the enterprise. A semantic layer lets organizations connect them to an existing data ecosystem to ensure their continued relevance now and in the future. It’s a must for getting them to produce better analytics results faster and cheaper than they otherwise could, and it’s key to keeping pace with the modern enterprise.