Blog

Beginning BI

What does a tester need to know before starting to look at Business Intelligence (BI) and Big Data before they start learning more?

Ron Asher and Humza Rashid
21 May 2019

The aim of this post is to help you get started in understanding more about the BI world.

To make decisions, an organization needs to answer questions about what has happened and what will likely happen in the future. For example, a sales director may want to understand how the sales of a particular brand of laundry detergent changed in the last 24 months in a specific region. They may also want to know which deodorant sold the best this year or if any shampoo went out of stock in the last six months. They may also want to supplement this sales data with information about their customers.

To answer these questions, a huge amount of data likely needs to be gathered, processed and presented in charts or reports. At first it might be tempting to try to gather this data from existing transactional databases, such as a sales or inventory system, but that data is optimized for conserving space, write speed and requires a developer to join it all together to get the desired result. By the time the developer is engaged, determines how to generate the data, verifies the data fitsx` the need and produces a report, the need for the data may have passed.

BI systems were developed to make the process of answering questions faster and more efficient. BI is the process of taking large amounts of historical data and turning it actionable information, with a focus on speed of reporting and putting control of the reports into the hands of the user, not the developer.

What makes up a BI system?

At its most basic structure, a BI system has the components in the diagram below:

Each BI system is custom designed to fit the needs of the organization. From the base architecture level to the types of reports generated, the BI and data warehousing systems vary from implementation to implementation. For example, one system may have data ingested into a Hadoop data lake that is then ingested into multiple data warehouses. Other systems will have no data lake at all and only one data warehouse. Keep this in mind as you review this partial list of components. Each item really needs its own write up to do it justice, but provided below are short descriptions:

Data Source

This is the data coming into the system, but it is typically output from another system.
Data is typically time-based, for example: sales data from January 2018.
Source systems can be any system and you will frequently see:

Sales data
Inventory data
Location data

Data can come from systems internal or external to the enterprise.
Data can be generated or purchased.
Can include ANYTHING the user might find relevant to their possible future decisions. This is a major reason they can get big.

Data Ingestion

This is the act of moving data from the raw source into the system. It is also known as the ETL (extract, transform, load) layer.
For a data warehouse, this data will frequently be moved to a set of staging tables to before being imported. This provides the system a chance to clean up and format the data before being moved to a more permanent storage.
For a data lake, the data will be stored in a raw, or close to raw, manner to allow for more flexibility on reporting.

Core Data

Core data is the actual storage mechanism for the BI system. This can either be a data warehouse, a data lake or both.

A data warehouse stores structured data. This means that when the data is ingested into a data warehouse, it can be changed.
A data lake stores unstructured data.
Both the data warehouse and the data lake have multiple implementation options.

Core data will have to be able to store all the data that will give you an answer to the example question but also any question that can be asked in the enterprise. So, the amount of data will be vast and not readily query able.

Data Mart

Unlike a data warehouse, a data mart is a set of data designed around a specific department or line of business. A data warehouse is enterprise-wide and designed around collecting large amounts of interdisciplinary data.
There different type of data marts that fit the needs of different organizations:

Dependent data mart: This type of data mart relies on getting its data from a corporate data warehouse.
Independent data mart: This type of data mart has no relationship with the enterprise data warehouse or any other data mart. Its data is gathered from multiple operational systems that support a specific business area.
Hybrid data mart: A hybrid data mart combines the two concepts and can pull data from the enterprise data warehouse as well as multiple operational systems.

The tight focus in a data mart allows the user to quickly gather data relevant to their department or functional area.

Reporting

Reporting is the ultimate purpose of a BI system and allows the user to extract meaningful and actionable data. This data is then analysed and turned into information that is used it business decisions.
The data can be presented to the user in multiple formats such as, charts and graphs. The goal is to present the data in a way that is meaningful and understandable by the user.

Big Data or BI?

People use the terms big data and BI loosely, so it is not always clear what they are referring to unless you already understand the differences in those terms. BI is all about organizations making decisions and big data is just a set of processes and technologies used to generate, store and process vast amounts of data.

BI is used to make decisions that can be made with a little bit or a large amount of data. Most frequently, decisions are made with large amounts of data, which is why the term big data is used around BI frequently.

Let’s use a real-world example to help make it clear. A hospital, hotel and a bed warehouse all contain large amounts of beds. Let’s call it ‘big bedding’. The warehouse only stores beds, but both the hospital and the hotel have beds for people to sleep in. Just because you have a lot of beds, doesn’t mean that people are sleeping in them. Just because you have a lot of data, doesn’t mean that you are making decisions on it.

What is the difference between a data lake and a data warehouse?

Some folks use the terms data lake and data warehouse interchangeably, but they are different technologies to store large amounts of data with two different type of users looking at the data. The names help to understand the difference.

A real-life warehouse stores its contents in neat shelves in well-organized aisles with a system for locating items. A data warehouse is similar because it is very structured and has a specific place and format for all its data.

A real-life lake is a big pool of water and whatever falls into it stays where it falls; no one goes through and organizes it. A data lake is the same; the data is stored however it is ingested. It is not massaged or reorganized to certain formats for users to access.

Basic types of tests needed

Report validation
Dashboard performance testing
ETL testing (source, target validations, null field tests, etc)
Reconciliation testing
Connection/Database functionality testing

The first thing you need to remember when reading this post is that this is an oversimplification of what a BI system really is. The concept is so vast and so complicated that entire books could exist for each part of the system.

*Organizations are relying more and more on data to make decisions, but they can only do so if the data is trustworthy. That’s where BI testing comes in. Understanding the BI landscape is key for any successful BI testing initiative. That said, the concepts presented here are only scratching the surface. We encourage you look into the wealth of literature that covers the different parts of the BI process in more detail.

we have a lot to talk about