Gone are the days when a finite amount of data accumulated in a database in a structured format, where every field or column had a definite data type and limited storage. Today's information is growing rapidly and is too diverse and massive to be handled by age-old conventional methods. Technology always needs to be upgraded to meet present demands, which is clearly good for an organization's competitive advancement. If that does not happen, success in business does not last long.
This is the reason the concept of Big Data was introduced. It is an umbrella term that covers managing massive amounts of data. To understand the concept more deeply, let's go through the three V's of Big Data management. Big Data has three vectors, also known as the three Vs or 3Vs, which are as follows:
Volume: The Volume vector refers to the substantially large quantities of data that keep increasing daily, in real time. Imagine the number of photographs uploaded to Facebook. It holds approximately 250 billion images. And it does not just hold images, but also comments, likes/reactions, statuses, shares, and so on, all of which can be edited or deleted at any given time. It also facilitates live streaming of video by countless users at overlapping or different times. Back in 2016, it had around 2.5 trillion posts, a figure one can hardly imagine. The site contains more profiles than China has people!
Another example is YouTube, where millions of videos of varying size get uploaded. The Volume is immense and consistently rising. Think about the IoT (Internet of Things), which will keep advancing decade after decade until even minute routines, such as setting your alarm clock, preparing coffee in the morning, or even restocking your groceries when your refrigerator runs empty, are automated. Obviously, Volume is a crucial part of Big Data. It is naturally one of the important three V's of Big Data management; without it, the data would not even be called "big".
Velocity: The Velocity vector is the rate at which data is generated and transmitted. Every second counts when uploading or downloading a particular type of file, be it an image, a document, an audio clip, or a video. Moreover, technology has moved beyond online chat into video calls, video conferences, and webinars, where one cannot afford to lose data. So, along with strong internet connectivity, one needs high-speed servers that can process information in real time.
And it does not stop there. Think about the millions of pieces of data being transferred, a large portion of which are encrypted. These include not only genuine, harmless information but also data sent by fraudulent individuals. Yes, it is possible for harmful files to pass through a firewall unnoticed, because they too can be encrypted. So cybersecurity systems have to analyze these kinds of threats swiftly, before there is a significant negative impact. The Velocity vector is a necessity in today's world and plays a vital role in the three V's of Big Data management. Clearly, it is not just the bulk we deal with, but also how swiftly we can manage it.
Variety: We are now in the era of unstructured data, where even two emails rarely contain the same kinds of information. There is always variation in the size and kind of an electronic mail and its attachments, if any. The Variety vector encompasses exactly that. As mentioned above, such data no longer fits a tabular format because of its varying size and type. The reason is obvious: it is practically impossible to keep audio, video, large text, and similar kinds of files in a rigid structured format.
The amount of information of different kinds being transferred around the world each second is indeed unimaginable. Technology is advancing with time, and Big Data keeps growing. The Variety is huge and needs to be managed with care so that the data can always be retrieved whenever needed.
Now the question arises: how is a server administrator going to store and fetch unstructured data? It is also possible that an organization has recently updated its technology, and retrieving the old data becomes challenging because the new system no longer supports the age-old method of storing it. Moreover, if the ways of storing have changed, the old information also needs to be migrated to the latest format. Furthermore, an end user deserves the right to access data in its raw form rather than a manipulated version. How would one achieve all that?
This is why software such as Apache Hadoop, NoSQL databases, Sqoop, and Presto have been introduced. Apache Hadoop is a free, Java-based software framework that can effectively store bulk data across a cluster. The Hadoop Distributed File System (HDFS) is a storage system that splits Big Data into blocks and distributes them across many nodes in a cluster. High availability comes from replicating the data across the cluster, with the replication factor set according to demand.
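The split-and-replicate idea behind HDFS can be sketched in a few lines of Python. This is a toy illustration only, not the real HDFS implementation: the tiny block size, node names, and round-robin placement policy below are made-up assumptions (real HDFS uses large blocks, typically 128 MB, and a NameNode/DataNode architecture with rack-aware placement).

```python
# Toy sketch of HDFS-style block splitting and replication.
# All sizes and node names are illustrative, not real HDFS values.

BLOCK_SIZE = 4          # bytes per block (tiny, for demonstration)
REPLICATION = 3         # copies of each block kept in the cluster
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round robin)."""
    placement = {}
    for i in range(len(blocks)):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"big data needs distribution"
blocks = split_into_blocks(data)
placement = place_replicas(blocks)

# The file survives a single node failure: every block still has
# replicas on other nodes.
failed = "node2"
for i, locations in placement.items():
    survivors = [n for n in locations if n != failed]
    assert survivors, f"block {i} lost!"

# Concatenating the blocks in order restores the original data.
assert b"".join(blocks) == data
```

The point of the sketch is the availability guarantee: because each block lives on several nodes, losing any one node leaves every block still reachable, which is exactly why replication underpins HDFS's fault tolerance.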
Hive is a distributed data-management system used mainly for data-mining purposes. It supports HiveQL (HQL) for accessing Big Data. In a NoSQL database, each row can have its own separate column values of differing type and size. If one wants to transfer structured data into Apache Hadoop or Hive, Sqoop can be used. Presto was introduced by Facebook as an open-source SQL-on-Hadoop query engine that can handle petabytes of data and retrieve results quickly.
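The schema flexibility that NoSQL offers, where each row carries its own columns, can be illustrated with plain Python dictionaries standing in for documents in a store. The field names and values below are hypothetical, chosen only to show records of differing shape.

```python
# Toy illustration of a NoSQL-style flexible schema: unlike a
# relational table, each record may carry its own set of fields,
# with values of different types and sizes.

records = [
    {"id": 1, "type": "image", "size_bytes": 2_400_000, "tags": ["beach"]},
    {"id": 2, "type": "comment", "text": "Nice photo!", "likes": 14},
    {"id": 3, "type": "video", "size_bytes": 95_000_000, "duration_s": 312},
]

def fields_of(record):
    """Return the set of field names a record actually has."""
    return set(record)

# No two records need to share the same columns.
assert fields_of(records[0]) != fields_of(records[1])

# Queries therefore have to tolerate missing fields, e.g. the
# total size of all media items, skipping records without one:
total_media = sum(r.get("size_bytes", 0) for r in records)
print(total_media)  # 97400000
```

This is the trade-off the paragraph describes: the store accepts any shape of record, so the burden of handling absent or differently typed fields shifts from the schema to the query.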
Big Data management is not easy; even a fraction of a second matters. Moreover, nobody can afford to lose any kind of data, or the consequences would be huge. All organizations do their best to achieve the most effective solution. Those who upgrade with time face fewer difficulties than those who cling to traditional methods because they find the required investment unworthy at first. Sooner or later the age-old methods will need to be updated, and that may become even more expensive in the future due to higher demands and greater scale. Then again, it also depends on the extent to which the data is deemed "outdated".