Unfortunately, big-data analysis cannot be accomplished using traditional processing methods. Some examples of big data sources are social media interactions, geographical coordinates collected by Global Positioning Systems, sensor readings, and retail store transactions. These examples give us a glimpse into what is known as the three Vs of big data:
- Volume: Huge amounts of data that are being generated every moment
- Variety: Different forms that data can come in–plain text, pictures, audio and video, geospatial data, and so on
- Velocity: The speed at which data is generated, transmitted, stored, and retrieved
The wide variety of sources that a business or company can use to gather data for analysis results in large amounts of information–and it only keeps growing. This requires special technologies for data storage and management that were not available or whose use was not widespread a decade ago.
Since big data is best handled through distributed computing instead of a single machine, we will only cover the fundamental concepts of this topic and its theoretical relationship with machine learning in the hope that it will serve as a solid starting point for a later study of the subject.
You can rest assured that big data is a hot topic in modern technology, but don’t just take our word for it–a simple Google search for big data salaries will reveal that professionals in this area are being offered very lucrative compensation.
While machine learning will help us gain insight from data, big data will allow us to handle massive volumes of data. These disciplines can be used either together or separately.
The first V in big data – volume
When we talk about volume as one of the dimensions of big data, one of the challenges is the physical space required to store it efficiently, considering its size and projected growth. Another challenge is that we need to retrieve, move, and analyze this data efficiently to get results when we need them.
At this point, you will no doubt see that there are yet other challenges associated with handling high volumes of information–the availability and maintenance of high-speed networks and sufficient bandwidth, along with the related costs, are only two of them.
For example, while a traditional data analysis application can handle a certain number of clients or retail stores, it may experience serious performance issues when scaled up 100x or 1000x. On the other hand, big data analysis–with the proper tools and techniques–can deliver a cost-effective solution without sacrificing performance.
While the huge volume of information can refer to a single dataset, it can also refer to thousands or millions of smaller sets put together. Think of the millions of e-mails sent, tweets and Facebook posts published, and YouTube videos uploaded every day and you will get a grasp of the vast amount of data that is being generated, transmitted, and analyzed as a result.
As organizations and companies are able to boost the volume of information and utilize it as part of their analysis, their business insight expands accordingly: they are able to increase consumer satisfaction, improve travel safety, protect their reputation, and even save lives–wildfire and natural disaster prediction being some of the top examples in this area.
The second V – variety
Variety does not only refer to the many sources where data comes from but also to the way it is represented (structural variety), the medium in which it gets delivered (medium variety), and its availability over time.
As an example of structural variety, we can mention that the satellite image of a forming hurricane is different from tweets sent out by people who are observing it as it makes its way over an area.
Medium variety refers to the medium in which the data gets delivered: an audio speech and the transcript may represent the same information but are delivered via different media. Finally, we must take into consideration that data may be available all the time, in real time (for example, a security camera), or only intermittently (when a satellite is over an area of interest).
Additionally, the study of data can't be restricted only to the analysis of structured data (traditional databases, tables, spreadsheets, and files), however valuable these time-tested resources may be.
As we already mentioned, in the era of big data, lots of unstructured data (SMSes, images, audio files, and so on) is being generated, transmitted, and analyzed using special methods and tools.
That said, it is no wonder that data scientists agree that variety actually means diversity and complexity.
The third V – velocity
When we consider velocity as one of the dimensions of big data, we may think it refers only to the speed at which data is transmitted from one point to another. However, as we indicated in the introduction, it means much more than that. It also implies the speed at which data is generated, stored, and retrieved for analysis. Failure to take advantage of data as it is being generated can lead to lost business opportunities.
Let’s consider the following examples to illustrate the importance of velocity in big data analytics:
- If you want to give your son or daughter a birthday present, would you consider what they wanted a year ago, or would you ask them what they would like today?
- If you are considering moving to a new career, would you take into consideration the top careers from a decade ago or the ones that are most relevant today and are expected to experience a remarkable growth in the future?
These examples illustrate the importance of using the latest available information in order to make a better decision. In real life, being able to analyze data as it is being generated is what allows advertising companies to offer advertisements based on your recent searches or purchases–almost in real time.
An application that illustrates the importance of velocity in big data is called sentiment analysis–the study of the public’s feelings about products or events. In 2013, countries in the European continent suffered what later became known as the horse-meat scandal.
According to Wikipedia, foods advertised as containing beef were found to contain undeclared or improperly declared horse or pork meat–as much as 100% of the meat content in some cases. Although horse meat is not harmful to health and is eaten in many countries, pork is a taboo food in the Muslim and Jewish communities.
Before the scandal hit the streets, Meltwater (a media intelligence firm) helped Danone, one of their customers, manage a potential reputation issue by alerting them about the breaking story that horse DNA had been found in meat products. Although Danone was confident that they didn’t have this issue with their products, having this information a couple of hours in advance allowed them to run another thorough check.
This, in turn, allowed them to reassure their customers that all their products were fine, resulting in an effective reputation-management operation.
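To make the idea of sentiment analysis more concrete, here is a minimal lexicon-based scorer. The word lists and example posts are made-up assumptions for illustration only; real media-intelligence systems such as Meltwater's use far more sophisticated models.

```python
import re

# Hypothetical sentiment lexicons (illustrative, not from any real product)
POSITIVE = {"safe", "fine", "great", "trust", "reassured"}
NEGATIVE = {"scandal", "horse", "undeclared", "recall", "contaminated"}

def sentiment_score(text: str) -> int:
    """Naive sentiment score: count of positive words minus negative words."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = [
    "horse dna found in meat products, another scandal",
    "our products are fine, customers can feel safe",
]
for post in posts:
    print(post, "->", sentiment_score(post))
```

Run over a live stream of posts, even a crude scorer like this can flag a sudden shift toward negative mentions of a brand–which is exactly the kind of early warning that gave Danone its head start.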
Introducing a fourth V – veracity
For introductory purposes, simplifying big data into the three Vs (volume, variety, and velocity) can be considered a good approach, as mentioned in the introduction. However, it is somewhat simplistic, in that there is (at least) one more dimension we must consider in our analysis–the veracity (or quality) of data.
In this context, quality encompasses volatility (for how long will the current data remain valid for decision making?) and validity (freedom from noise, imprecision, and bias). Quality also depends on the reliability of the data source. Consider, for example, the fact that as the Internet of Things takes off, more and more sensors will enter the scene, bringing some level of uncertainty as to the quality of the data being generated.
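As a simple illustration of veracity checks on sensor data, the sketch below drops readings that are stale (volatility) or physically implausible (validity). The field names, freshness window, and temperature range are hypothetical assumptions chosen for the example.

```python
import time

MAX_AGE_S = 60               # volatility: readings older than this are stale
VALID_RANGE = (-40.0, 85.0)  # validity: plausible temperature range in °C

def clean_readings(readings, now=None):
    """Keep only fresh, in-range readings; drop the rest as low-veracity."""
    now = time.time() if now is None else now
    return [
        r for r in readings
        if now - r["ts"] <= MAX_AGE_S
        and VALID_RANGE[0] <= r["temp_c"] <= VALID_RANGE[1]
    ]

now = 1_000_000.0
readings = [
    {"ts": now - 5,   "temp_c": 21.3},   # fresh and plausible -> keep
    {"ts": now - 5,   "temp_c": 999.0},  # sensor glitch -> drop
    {"ts": now - 600, "temp_c": 20.0},   # stale -> drop
]
print(clean_readings(readings, now=now))
```

Even this toy filter shows why veracity matters: without it, a single glitching sensor could skew an entire downstream analysis.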