DataVault is one of the strongest candidates to consider when deciding which methodology to adopt in your data architecture. At its core, DataVault offers an agile way to design and build efficient data warehouses.
Hubs can be defined as the central point that integrates a corporate business concept, which is identified by its Business Keys.
Regardless of the approach you follow, Bottom-Up or Top-Down, the first steps you take are very important. The analysis of the information systems must cover all the business concepts, with their business keys identified, and each one must have business value. You can only identify a business concept if it can be clearly understood. In any case, it is vital to do this analysis thoroughly from the beginning.
In the scenarios below, we will dive into some of the various design approaches for DataVault models and the various challenges associated with them. Our end goal is to apply a good design with the correct approach to ensure that our DataVault model is efficient, scalable, and adaptable to our data architecture.
Let us start from a brief definition of a model: a group of datasets that structure information across one or multiple data domains.
On one hand, the Bottom-Up approach for a DataVault model applies when we try to faithfully replicate existing information sources. Our DataVault architecture will be very heterogeneous, since it contains multiple models. You can imagine that each model will access multiple sources of information with many tables, but that does not mean we must collect everything! We are only interested in ingesting what will really add value to the business. Let's not forget that.
On the other hand, the Top-Down approach applies when the business analyzes the corporate business concepts across the different systems and tools available (data marts, MDM, Data Governance), or even starts from a Logical Architecture of corporate Information (LAI), which is nothing more than a map of functional information domains organized by layers, descending conceptually to a lower level until reaching the business concept. This LAI is usually based on industry reference models, studied and adapted for the corporate application. ELM standards can also be applied to analyze your business.
With this in hand, you predefine the datasets that you want to ingest into the earliest layers of your PSA (Persistent Staging Area) based on the logical information you need. That is why you build a process beforehand to take in the information from the different sources; you do not replicate the original data sources.
In terms of flexibility, the Top-Down case is by nature the one with the least impact on our DataVault model, since the extra layer that prepares the data into a coherent form sits outside this approach. This allows an even higher level of abstraction, making the model less susceptible to changes in the source systems.
Now, imagine the unthinkable happens: after crafting a solid and detailed analysis of our company, covering both the HUBs and satellites in your DataVault models, a corporate merger falls upon our company. You must demonstrate that your data architecture is the best place to store all the data together, because it is more scalable, flexible, and optimal, and can absorb all the corporate information at a lower cost than the other company's architecture.
In our scenario cases, we will use a Bottom-Up approach.
Initial Scenario
We are the multinational RETAILER_1 company, and this is the process that manages the information of the CUSTOMER business concept in our customer HUB:
In the previous diagram, we explain the different information mappings used to load data into our RDV (Raw Data Vault) architecture, defining how each field is generated.
On the left side, we have the PSA/NPSA phases where the different data sources are located; these will also apply in the upcoming scenarios. There is one information table per CUSTOMER country: one for Spain and another for Italy. Both have a similar schema and number of fields; only the field names differ. We will focus on our Business Key mapping (HUB).
On the right side, we describe the format of our HUB (in RDV):
- HUB_ID_CUSTOMER: Hash ID of the concatenation of the BK (Business Key) + tenants
- COD_TENANT: Tenant code; in this case we apply only a single level, but there can be multiple tenant levels (Multitenant).
- The Business Keys, defined as follows:
-> EDW_BK_COD_CUSTOMER: Business Key 1
-> EDW_BK_CLASS_CUSTOMER: Business Key 2
- EDW_DT_LOAD: System incorporation date
- RECORD_SOURCE: Track information of our data origin
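The hash ID construction described above can be sketched as follows. This is a minimal illustration, not the official DataVault hashing recipe: the `||` delimiter, the normalization rules, and the choice of MD5 are assumptions that each EDW team must agree on and apply consistently.

```python
import hashlib

def hub_hash_key(business_keys, tenant, delimiter="||"):
    """Build a hub hash ID from the ordered business keys plus the tenant.

    The delimiter and the MD5 algorithm are illustrative choices; any
    deterministic scheme agreed across the EDW works the same way.
    """
    # Normalize casing/whitespace so the same logical key always hashes equally
    parts = [str(k).strip().upper() for k in business_keys] + [str(tenant).strip().upper()]
    return hashlib.md5(delimiter.join(parts).encode("utf-8")).hexdigest().upper()

# HUB_ID_CUSTOMER = hash(EDW_BK_COD_CUSTOMER + EDW_BK_CLASS_CUSTOMER + COD_TENANT)
hub_id = hub_hash_key(["C0001", "RETAIL"], tenant="ES")
```

The normalization step matters: without it, `"c0001"` and `"C0001 "` would produce two different hub rows for the same customer.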
Although we have already said that a BK must be analyzed, defined, and accepted across the company, what happens on those occasions when a change cannot be avoided? While the refactoring scenario is clearly ruled out, DataVault allows us to make our architecture more flexible. Now we have to test our model and see in which scenario we should empower our architecture against both predictable and unpredictable changes.
For such a situation, we analyze which possible scenarios bring the least impact on our architecture. We will observe that the BK schema structure of the new company does not match the existing one in our DV architecture.
Here I stress once more the importance of a good analysis of our company's business concepts; even so, certain circumstances may force our HUBs to undergo variations. Our goal, then, is to build a truly robust, efficient, and scalable architecture.
We will walk through three different scenarios (at the end of this article we will briefly evaluate them all together; in the last story of this series, we will do a detailed evaluation).
We will zoom into the HUB mapping; the Satellite mapping is the same for all three scenarios.
It is based on the principle of having each BK in one single field.
In this scenario our mission is:
In the following figure we show how our new architecture should behave:
The main challenge for Single BK would be refactoring.
It is based on the principle of having all the BKs in one single field, concatenating them in a specific order.
In this scenario our mission is:
In the following figure we show how our architecture should behave:
The main challenges of Composed BK will be to make the methods for accessing, storing, and labeling the BKs more sophisticated.
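A minimal sketch of the Composed BK idea: all business keys collapse into one field, joined in a fixed, agreed order. The `||` delimiter and the guard against keys containing it are assumptions; the point is that splitting the field back only works if the order and delimiter are governed corporately.

```python
def composed_bk(business_keys, delimiter="||"):
    """Concatenate all business keys, in a fixed agreed order, into one field.

    The delimiter is an illustrative choice; it must never appear inside a
    key, so we guard against silent ambiguity when splitting back.
    """
    parts = [str(k).strip().upper() for k in business_keys]
    for p in parts:
        if delimiter in p:
            raise ValueError(f"delimiter {delimiter!r} found inside key {p!r}")
    return delimiter.join(parts)

def split_composed_bk(field, delimiter="||"):
    """Recover the individual keys; callers must know the agreed order."""
    return field.split(delimiter)
```

Note the trade-off the text mentions: the keys are no longer labeled, so consumers must know by position which segment is EDW_BK_COD_CUSTOMER and which is EDW_BK_CLASS_CUSTOMER.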
It is based on the principle of having all the BKs in one single field in a semi-structured format. The most common format used is JSON.
In this scenario our mission is:
The main challenge of Flexible BK will be to enrich our process with the new BKs.
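A minimal sketch of the Flexible BK idea, assuming JSON as the semi-structured format: each key keeps its label, and new BKs from the merged company can be added without changing the hub schema. The field names here are illustrative.

```python
import json

def flexible_bk(**keys):
    """Serialize labeled business keys into one semi-structured (JSON) field.

    sort_keys makes the output deterministic, so the same logical keys always
    produce the same string (and therefore the same hub hash, if hashed).
    """
    return json.dumps({k: str(v).strip().upper() for k, v in keys.items()},
                      sort_keys=True)

fbk = flexible_bk(EDW_BK_COD_CUSTOMER="C0001", EDW_BK_CLASS_CUSTOMER="RETAIL")
# Unlike the Composed BK, each key stays addressable by name:
cod = json.loads(fbk)["EDW_BK_COD_CUSTOMER"]
```

Because the keys are labeled, adding a new BK for the merged company is just one more JSON attribute, which is exactly the flexibility this scenario trades against more sophisticated access methods.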
Example of querying BK in our FBK field (in Snowflake). Extract from a dummy HUB_CUSTOMER HUB table:
If we want to filter on the specific EDW_ID_CUSTKEY (in Snowflake SQL):
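A hedged sketch of what such a filter could look like, assuming the FBK field is stored as a Snowflake VARIANT column named EDW_BK_FLEX on the dummy HUB_CUSTOMER table (table, column, and literal value are illustrative):

```sql
-- Assumed dummy schema: EDW_BK_FLEX is a VARIANT column holding the JSON BKs
SELECT HUB_ID_CUSTOMER,
       EDW_BK_FLEX:EDW_ID_CUSTKEY::STRING AS EDW_ID_CUSTKEY
FROM   HUB_CUSTOMER
WHERE  EDW_BK_FLEX:EDW_ID_CUSTKEY::STRING = 'C0001';
```

Snowflake's `column:attribute` path notation with a `::STRING` cast lets us address one labeled BK inside the semi-structured field directly.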
Author:
César Segura
Subject Matter Expert at SDG Group