We’re sure you have been hearing the terms ‘big data’ and ‘analytics’ everywhere over the last few years. Some of you may know what they mean, while others may have only a vague idea. We’re here to give you a brief introduction to big data – what it is, what makes it important and how it impacts you.
Introducing Big Data
The term may seem self-explanatory: big data refers to the methods and technologies used to collect, collate, process and analyze very large sets of data. Having to work with data sets that exceed the computing power of a single machine is not a new problem, but the scale at which it now occurs is.
Big data professionals define the term in different ways, so it is not easy to pin down a single definition. In general, though, when we talk about big data we mean huge data sets, together with the strategies and technologies used to work with them. Since ‘large’ means different things to different people, we will call a data set large when it is too big for a typical single computer to process or store.
The Three V’s of Big Data
The ‘three V’s of big data’ were introduced in 2001 by analyst Doug Laney (then at META Group, which Gartner later acquired) to help explain what makes big data different from other kinds of data processing.
Volume
This one is a no-brainer: the sheer scale of the data you need to process is what determines whether it is big data in the first place. The data sets that big data professionals work with are orders of magnitude larger than traditional volumes of data, which means more consideration has to be given at every stage of the processing and storage life cycle.
Because this volume of data far outstrips the processing capability of a single computer, big data professionals have to overcome the challenge of connecting multiple computers, or even groups of computers, into clusters that can share the work. There are both hardware and software aspects to worry about: your algorithms, APIs and cluster management skills are all key.
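The divide-the-work idea behind cluster processing can be sketched with the classic map-reduce pattern. This is a minimal single-process illustration, not a real distributed system: the function names and the word-count task are our own, and on an actual cluster each chunk would live on a different machine.

```python
from collections import Counter
from functools import reduce

def map_count(chunk):
    # Map step: each worker counts words in its own chunk independently.
    return Counter(chunk.split())

def reduce_counts(total, partial):
    # Reduce step: merge a partial count into the running total.
    total.update(partial)
    return total

def word_count(chunks):
    # In a real cluster the map step runs in parallel on many machines;
    # here the chunks are just strings in one process.
    partials = map(map_count, chunks)
    return reduce(reduce_counts, partials, Counter())

print(word_count(["big data big", "data systems", "big systems"])["big"])  # 3
```

The point of the pattern is that the map step needs no coordination between workers, so adding machines scales the work almost linearly; only the final reduce brings the results back together.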
Velocity
Many of the data sets the average person works with are static, or grow very slowly. Big data differs significantly in that it moves through the system very rapidly. It usually comes from multiple sources at once, and big data systems are expected to process it in real time, or near real time, and feed the results back into the system as a whole.
To accommodate this requirement for almost instantaneous feedback, many big data professionals have moved away from batch-oriented processing toward real-time streaming. Given the speed at which data is constantly being added, you need robust systems to ensure nothing goes wrong along the way.
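One common building block of streaming systems is a sliding-window aggregate: instead of waiting for a batch to complete, you keep a running answer over the most recent slice of the stream. The class below is a minimal sketch of that idea in plain Python; the name and the 60-second window are our own choices, not any particular framework's API.

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events seen in the last `window` seconds of a stream."""

    def __init__(self, window):
        self.window = window
        self.events = deque()  # event timestamps, oldest first

    def record(self, timestamp):
        self.events.append(timestamp)
        # Evict events that have fallen out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()

    def count(self):
        return len(self.events)

counter = SlidingWindowCounter(window=60)
for t in [0, 10, 30, 65, 70]:  # event timestamps in seconds
    counter.record(t)
print(counter.count())  # only the events at 30, 65 and 70 remain -> 3
```

Because the count is updated as each event arrives, the answer is always current, which is exactly the near-real-time feedback that batch processing cannot provide.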
Variety
The data that big data professionals gather usually comes from disparate sources and varies widely in quality, which makes the work even more challenging.
For example, digital marketing professionals may need to collect data from social media feeds in real time and collate it with server logs and website usage statistics to get the insights they need. Each of these sources provides a different type of data, of a different quality, and often in a different format: GPS coordinates, rich media such as images and videos, structured log files and so on.
If you are a big data professional, one of your main concerns will be consolidating all of it into a single system.
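Consolidation usually means normalizing each source into one shared schema before anything else can happen. The sketch below shows the idea with two invented inputs – a hypothetical JSON social-media payload and a hypothetical server-log line – mapped into a common record shape; the field names and formats are assumptions for illustration, not any real feed's format.

```python
import json
from datetime import datetime, timezone

def from_social(raw_json):
    # Hypothetical social feed payload: JSON with "user" and "ts" (epoch seconds).
    rec = json.loads(raw_json)
    return {"source": "social", "user": rec["user"],
            "time": datetime.fromtimestamp(rec["ts"], tz=timezone.utc)}

def from_server_log(line):
    # Hypothetical log line: "<ISO timestamp> <user> <method> <path>".
    ts, user, method, path = line.split()
    return {"source": "server_log", "user": user,
            "time": datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)}

events = [
    from_social('{"user": "alice", "ts": 1705320000}'),
    from_server_log("2024-01-15T12:00:00 alice GET /home"),
]
# Both records now share one schema, so they can be stored,
# sorted and queried together despite coming from different sources.
print(sorted(e["source"] for e in events))  # ['server_log', 'social']
```

The design choice worth noting is that each source gets its own small adapter, so adding a new data source never touches the downstream system – only a new adapter is written.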
Other Big Data Characteristics
While the three V’s above are the most commonly accepted defining traits of big data, some people and organizations have proposed adding more. These include veracity (how trustworthy the data is), value (the business worth that can be extracted from it) and variability (how the meaning of data shifts over time).
We may go into each of these in a later post. What do you think of the current definitions of big data, and how accurate do you find them?