
Big Data: structured and unstructured data


written by Marco Belmondo (Chief Marketing Officer at Datrix group)

In the early years of Big Data, right after Doug Laney defined it as data characterized by at least one of three properties (volume, variety, and velocity), most people focused on the first one, volume. Over the years, in both academia and industry, it has become clear that the real value of Big Data lies in variety, meaning the heterogeneity of sources and formats.

In this context, practitioners talk more and more about unstructured data. The term refers to data that do not have a well-defined structure: put simply, texts, images, videos, or audio files. Such data (for example documents, photos, or social network posts) lack a well-defined, or rather a standardized, structure. This means they cannot be organized directly in tabular form, as one would in a spreadsheet.
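To make the distinction concrete, here is a minimal Python sketch. The sample table, the post, and the helper names (`load_rows`, `total_orders`, `word_counts`) are invented for illustration. It shows that a tabular question is answered by reading a known column, while free text first needs some structure derived from it.

```python
import csv
import io

# Structured data: every record has the same named fields, so it fits a table.
# (The sample rows are invented for illustration.)
STRUCTURED_CSV = """customer_id,country,orders
101,IT,3
102,FR,7
"""

# Unstructured data: free text with no fields that map to columns.
UNSTRUCTURED_POST = "Loved the new app update, but checkout still feels slow on my phone."


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def total_orders(rows):
    """A tabular question is answered by reading a known column."""
    return sum(int(row["orders"]) for row in rows)


def word_counts(text):
    """With free text, structure must first be derived, e.g. by counting words."""
    counts = {}
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts


rows = load_rows(STRUCTURED_CSV)
print(total_orders(rows))              # sum of the "orders" column
print(word_counts(UNSTRUCTURED_POST))  # a derived, very rough structure
```

Word counting is only the crudest way to impose structure on text; it stands in here for the richer techniques discussed later in the article.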

Despite these initial difficulties, numerous companies are now tackling these issues, and more and more startups offer specific solutions for analyzing texts or images. In this article, let us look at three fundamental points for approaching unstructured data: why to use it, with which technologies, and with which methodologies.

Applications that can be developed with unstructured data

Which use cases can be addressed with texts or images? First of all, it is worth separating the two areas: they share some challenges but rely on very different technologies and techniques.

Text analysis and natural language understanding

In the world of natural language, the best-known application is certainly the chatbot, a solution that can reproduce a conversation in natural language. Chatbots have many potential uses. In some companies they are used internally to make certain processes more efficient: think of ticket management, support for the field force, or the relationship between the business and IT. Chatbots can also play an important role in relations with the end consumer, allowing the company to be available 24 hours a day while reducing costs. In summary, some tasks can be automated with very simple chatbots; however, it is very difficult to build truly intelligent, autonomous chatbots able to answer the most disparate questions. The tech giants are trying to do so by developing increasingly capable virtual assistants (think of Siri, Google Home, Cortana).

Staying with text analysis (and Natural Language Processing algorithms), another relevant application of unstructured data is sentiment analysis, which can be used, for example, to understand a company’s reputation on the web. Finally, we can imagine credit scoring applications that combine alternative textual data with more traditional data sources to assess the creditworthiness of an individual or a company (Discover the Finscience applications on these issues!).
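As an illustration of sentiment analysis, the sketch below scores short posts with a tiny word lexicon. It is a deliberately simplified, lexicon-based approach, not the method of any specific product: the word lists, the sample posts, and the function names `sentiment_score` and `label` are all assumptions made for this example. Production systems typically rely on much larger lexicons or on trained Machine Learning models.

```python
import re

# Hypothetical mini-lexicon, invented for this example.
POSITIVE = {"great", "excellent", "love", "reliable", "fast"}
NEGATIVE = {"bad", "slow", "terrible", "broken", "late"}


def sentiment_score(text):
    """Return (positive hits - negative hits) / total hits, in [-1, 1].

    Returns 0.0 when the text contains no lexicon word at all.
    """
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for word in words if word in POSITIVE)
    neg = sum(1 for word in words if word in NEGATIVE)
    hits = pos + neg
    return 0.0 if hits == 0 else (pos - neg) / hits


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


posts = [
    "Great service, fast delivery and reliable support.",
    "Terrible experience: the app is slow and the order arrived late.",
    "The parcel arrived on Tuesday.",
]

for post in posts:
    print(label(sentiment_score(post)), "|", post)
```

A lexicon like this ignores negation ("not great"), sarcasm, and context, which is one reason the more capable approaches mentioned in this article are based on Machine Learning.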

Image and video analysis

Turning to image and video analysis, we can first of all imagine security applications (think of an algorithm able to recognize when a stranger breaks into private property!). Numerous applications are already in place in both manufacturing and media. In manufacturing, image analysis can be used for quality control or to optimize warehouse management. In media, there are applications that automatically extract information from images and videos (for instance Image Captioning) or that use facial recognition to capture viewers’ emotional reactions to an advertisement.

Other industries could benefit from the analysis of unstructured data, as well.

Using structured and unstructured data: how to be ready

Learning to extract insights from unstructured data is not easy.

First of all, it is necessary to invest in new technologies that can respond to different needs. The major technology vendors increasingly offer tools (primarily NoSQL / NewSQL databases or data lakes) that can store even data without a well-defined structure.

Secondly, it will be necessary to acquire specialized skills for developing Natural Language Processing or Computer Vision algorithms. These algorithms, currently going through a real “hype” phase, are becoming markedly more effective thanks to Machine Learning techniques.

Finally, the potential of unstructured data can only be realized by integrating it with more traditional, structured data sources. To start down this path, it is therefore necessary to address Data Integration problems, which are not always trivial.
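A simple way to picture this integration step is to derive a structured feature from free text and then join it onto an existing table. The sketch below does this with plain Python. The customer rows, the support messages, the keyword list, and the names `count_alerts` and `integrate` are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

# Structured source: one row per customer (invented sample data).
customers = [
    {"customer_id": 101, "segment": "retail", "open_invoices": 0},
    {"customer_id": 102, "segment": "business", "open_invoices": 2},
    {"customer_id": 103, "segment": "retail", "open_invoices": 1},
]

# Unstructured source: free-text support messages, tagged only with a customer id.
messages = [
    (101, "Thanks, the issue is solved."),
    (102, "I want a refund, this is the second complaint."),
    (102, "Still waiting for the refund."),
]

# Hypothetical keywords that signal a problem.
ALERT_KEYWORDS = ("refund", "complaint", "cancel")


def count_alerts(messages):
    """Turn free text into a structured feature: alert messages per customer."""
    alerts = defaultdict(int)
    for customer_id, text in messages:
        lowered = text.lower()
        if any(keyword in lowered for keyword in ALERT_KEYWORDS):
            alerts[customer_id] += 1
    return alerts


def integrate(customers, messages):
    """Left-join the text-derived feature onto the structured rows by customer_id."""
    alerts = count_alerts(messages)
    return [
        {**row, "alert_messages": alerts.get(row["customer_id"], 0)}
        for row in customers
    ]


for row in integrate(customers, messages):
    print(row)
```

Even in this toy case the join depends on a shared, reliable key (`customer_id`). In practice, matching unstructured content to the right structured record is often where Data Integration becomes non-trivial.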

To conclude, a major cultural change is needed. The main challenge is to start thinking of unstructured sources as valuable data sources that can also be used to automate some processes. When new “smart” tools are introduced, change management is crucial to support end users in adopting these applications.