Big Data: data analysis and architecture

Anonim

Big Data is a term that describes the large volume of data, both structured and unstructured, that floods businesses every day. But it is not the amount of data that is important; what matters is what organizations do with it. Big Data can be analyzed to obtain insights that lead to better decisions and strategic business moves (PowerData, 2015).

Big data describes a holistic information management strategy that includes and integrates many new types of data and data management along with traditional data (ORACLE, 2014).

Within a much broader definition, the four Vs must be taken into account in order to better understand the scope of the concept:

Volume

Volume refers to the amount of data. More volume does not simply mean more of the same data: Big Data involves processing large volumes of low-density information. According to ORACLE, the data held in Hadoop is typically unstructured (of unknown value), for example clicks on web pages, messages on social networks and in mobile applications, and network traffic, among others. Big Data's job is to convert this data into useful information. Storage size can range from tens of terabytes to hundreds of petabytes, depending on the organization.

Hadoop

The Hadoop system relieves developers of much of the difficulty of parallel programming. It provides an ecosystem that distributes files across nodes and allows several processes to run in parallel. It also has control modules for data monitoring and supports add-ons that make it easier to work with, manipulate, monitor and query the stored information.
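The idea of splitting a file into blocks and processing them in parallel can be sketched in plain Python. This is only a local illustration of the map and reduce steps, not the Hadoop API: the function names, the block size and the use of threads as stand-ins for cluster nodes are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines, block_size):
    """Split a list of lines into fixed-size blocks, one per hypothetical node."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: count the words in one block, independently of the others."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partial_counts):
    """Reduce step: merge the per-block counts into a single total."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def word_count(lines, block_size=2, workers=4):
    """Count words by mapping over blocks concurrently and reducing the results."""
    blocks = split_into_blocks(lines, block_size)
    # Each worker thread stands in for a cluster node (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(map_block, blocks))
    return reduce_counts(partial)


sample = [
    "big data needs parallel processing",
    "hadoop distributes data across nodes",
    "nodes process data in parallel",
]
print(word_count(sample)["data"])
```

Because each block is counted without reference to the others, the map step could run on separate machines; only the reduce step needs to see all partial results.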

Value

It is often said that data or information has value; however, merely having it generates no utility: an application must be found for it. There are many quantitative and research techniques for extracting value from data. A clear example is the analysis of customer preferences that many companies carry out in order to make relevant offers, using data such as location.

Storing and processing all the information has a cost. However, due to the increasing demand for online behavior analysis, the prices of computing and data storage have decreased, which makes it feasible to run statistical analyses on a large amount of information without segmenting it or using only a sample.

Being able to process all the information together is an innovation for decision making, since it allows decisions to be more accurate. Discovering valuable information requires the participation of analysts or specialists in the field, users and executives. In this way, Big Data systems must learn to recognize patterns in human behavior in order to predict it.

Variety

This aspect refers to unstructured data and to data that can be classified as semi-structured, such as text, audio and video. All of this data requires additional processing to derive meaning from it, as well as supporting metadata. In other words, this aspect tries to quantify the complexity of the information and reduce it.

Once it has been understood, unstructured data can be processed like structured data: it can be summarized, aligned, and plotted for audits. However, complexity increases when data from a known source changes without prior notice, which places a burden on the analysis.

Speed

It is the rate at which data is received and acted upon, for example by analyzing or processing it. Higher speed requires high memory capacity, not only in bytes but also in read throughput, which is why technologies such as cloud storage and fast internet connections are essential.

For example, some Internet of Things (IoT) applications have status and security components that require real-time evaluation and action.
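A minimal sketch of this kind of real-time evaluation is a sliding window over incoming sensor readings that raises an alert as soon as the recent average crosses a limit. The class name, window size and limit below are hypothetical values chosen for illustration.

```python
from collections import deque


class SlidingWindowMonitor:
    """Keep the most recent readings and flag when their average exceeds a limit."""

    def __init__(self, window_size, limit):
        # A deque with maxlen drops the oldest reading automatically.
        self.readings = deque(maxlen=window_size)
        self.limit = limit

    def add(self, value):
        """Record one reading and return True if an alert should be raised now."""
        self.readings.append(value)
        average = sum(self.readings) / len(self.readings)
        return average > self.limit


monitor = SlidingWindowMonitor(window_size=3, limit=50)
alerts = [monitor.add(value) for value in [40, 45, 70, 80, 30]]
print(alerts)  # [False, False, True, True, True]
```

Each reading is evaluated as it arrives, so the decision does not wait for a batch job over stored data.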

Another example is internet-enabled smart products, which work in real time and provide relevant information such as usage statistics, security and location, among others. E-commerce applications try to use these variables, combining the location of a smartphone with personal preferences to make offers through advertising. From an operational point of view, mobile applications have a huge user base and heavier network traffic, so users expect an immediate response.

Description

Once these principles are clear, Big Data can be described as data sets, or combinations of data sets, whose volume, value, variety and speed make them difficult to capture, record, manage, process and analyze with conventional technologies and tools, such as relational databases and statistics and visualization packages, within the time necessary for them to be useful.

There is no fixed size that a data set must reach to be considered Big Data, since the threshold keeps changing over time; currently, most analysts and professionals in the field refer to data sets of 30 terabytes and up. Big Data is also extremely complex in nature, because much of the data generated by current technologies is unstructured: Internet searches, social networks and the interactions that occur on them (Facebook, Twitter, Google, among others), page logs, device sensors (measurements, GPS location), laptops, smartphones, call center records, and even machinery and vehicles.

To use Big Data effectively, it must be combined with the structured data (relational database) of a conventional business application, such as an ERP or a CRM.
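As a rough sketch of that combination, the example below joins hypothetical click events (the kind of loosely structured data Big Data systems collect) with hypothetical CRM customer records by customer ID. All field names and values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical structured records, as they might come from a CRM table.
crm_customers = {
    101: {"name": "Ana", "segment": "premium"},
    102: {"name": "Luis", "segment": "standard"},
}

# Hypothetical events, e.g. parsed from web click logs.
click_events = [
    {"customer_id": 101, "page": "/offers"},
    {"customer_id": 102, "page": "/home"},
    {"customer_id": 101, "page": "/checkout"},
    {"customer_id": 999, "page": "/home"},  # no matching CRM record
]


def clicks_per_segment(customers, events):
    """Count events per customer segment, skipping events from unknown customers."""
    totals = defaultdict(int)
    for event in events:
        customer = customers.get(event["customer_id"])
        if customer is None:
            continue
        totals[customer["segment"]] += 1
    return dict(totals)


print(clicks_per_segment(crm_customers, click_events))
```

The events alone say what happened; the CRM record says who the customer is. The join is what turns raw activity into something a business can act on.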

Importance

Big Data is extremely useful at the business level because it provides answers to many questions that companies did not even know they had, which gives them a point of reference. With such a large volume of information, the data can be shaped in whatever way the company requires. By doing so, companies are able to identify problems in a more understandable way.

Collecting large amounts of data and finding specific trends within them allows companies to make decisions quickly, efficiently and smoothly. Importantly, it also allows them to eliminate problem areas long before the problems affect the company's reputation or damage its profits.

Big Data helps organizations take advantage of their information through analysis, using it to identify opportunities for growth or improvement. This enables smarter business moves, more efficient operations, higher profits, and more satisfied customers. The benefits of this tool include the following:

  • Cost reduction
  • Faster decision making
  • New products and services

Cost reduction

The most powerful data technologies, such as the Hadoop system and cloud-based analytics, should be put to use. They provide a cost advantage when it comes to storing large amounts of data, since there is a large supply of such services and it is expected to grow exponentially in the coming years. They also make it possible to identify more efficient ways of marketing.

Faster decision making

The speed of the Hadoop system and its analytics, combined with the ability to analyze new data sources, gives companies immediate access to information (either as a summary or as the specific data required), so that they can make decisions based on what they have learned (artificial intelligence).

Generate new products / services

Big Data offers the ability to analyze and measure customer needs and satisfaction through the analysis of customer information, making it possible to know with much greater confidence what customers want or need. Through analytics, companies create new products and services to meet customer demands, and can even uncover needs that customers did not know they had.

Application

As noted above, the reach of Big Data is vast; in practice the limits are set by the companies themselves, since it is up to them to decide what to do with the information. Below are ways in which this tool can be used in various sectors:

Health

The healthcare industry holds large amounts of information: patient records, general and specialized health plans, insurance and coverage information, and other data that is difficult to manage. All of this data can yield key insights when analyzed, which is why data analytics technology is vitally important to healthcare. By analyzing these large amounts of information, patient diagnoses and treatment options can be provided almost immediately, creating the possibility of treating diseases before they become irreversible.

Administration

One of the main challenges management faces is ensuring quality and increasing the productivity of operations with generally tight budgets. Big Data can streamline operations through technology, giving management a much broader view of activities.

Advertising

The increasing use of smartphones, as well as devices with GPS integration, allows advertisers to target consumers when they are near a specific store, such as a restaurant, bookstore, or coffee shop. This creates opportunities for service providers, such as earning more income, reaching new prospects, and improving their market position.
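Proximity targeting of this kind comes down to computing the distance between a device's GPS position and each store. The sketch below uses the haversine formula for great-circle distance; the store names, coordinates and radius are made-up values for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stores_nearby(user_position, stores, radius_km):
    """Return the names of the stores within radius_km of the user."""
    lat, lon = user_position
    return [
        name
        for name, (store_lat, store_lon) in stores.items()
        if haversine_km(lat, lon, store_lat, store_lon) <= radius_km
    ]


# Hypothetical store locations (latitude, longitude).
stores = {
    "coffee shop": (19.1800, -96.1400),
    "bookstore": (19.5400, -96.9100),
}
print(stores_nearby((19.1810, -96.1410), stores, radius_km=1.0))  # ['coffee shop']
```

In a real campaign the list of nearby stores would then be filtered by the consumer's preferences before any offer is sent.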

Sales

Customer service has become extremely important to all businesses, and customers have become demanding down to the smallest detail. Sales have evolved accordingly, as smarter buyers expect retailers to understand exactly what they need and when they need it.

Big Data can enable retailers to meet those demands. Armed with endless amounts of data from customer loyalty programs, shopping habits, and other sources, retailers not only have a deep understanding of their customers, but can also predict trends, recommend new products, and increase profitability.

Tourism

Customer satisfaction is key for the tourism industry, but it is difficult to measure, especially at the right time. Resorts and casinos, for example, have only a small chance of turning a bad customer experience around. Big Data analytics gives these companies the ability to collect customer data, apply analytics, and identify potential issues immediately, before it is too late.

Big Data challenges

The particular characteristics of Big Data (volume, value, variety, speed and veracity) pose many challenges for data quality:

Diversity in data sources and types

With so many sources, data types, and complex structures, the difficulty of data integration increases.

The data sources of Big Data are vast:

  • Internet and mobile data.
  • Internet of Things data.
  • Sectoral data collected by specialized companies.
  • Experimental data.

And the data types are just as varied:

  1. Unstructured data types: documents, videos, audio, etc.
  2. Semi-structured data types: software, spreadsheets, reports.
  3. Structured data types.

Only 20% of information is structured, and this can cause many errors if a data quality project is not undertaken.

Data volume

As we have already seen, the volume of data is enormous, and that complicates the execution of a data quality process within a reasonable time.

It is difficult to collect, clean, integrate and obtain high-quality data quickly. Transforming unstructured types into structured types and processing that data takes a long time.
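To give a sense of what that transformation involves, the sketch below turns free-text log lines into structured records with a regular expression and sets aside the lines it cannot parse. The log format shown is a made-up example, not a standard one.

```python
import re

# Pattern for a hypothetical log line such as:
# 10.0.0.1 [2018-03-01T10:00:00] "GET /home" 200
LOG_PATTERN = re.compile(
    r"(?P<ip>\d+\.\d+\.\d+\.\d+) "
    r"\[(?P<timestamp>[^\]]+)\] "
    r'"(?P<method>[A-Z]+) (?P<path>\S+)" '
    r"(?P<status>\d{3})"
)


def parse_line(line):
    """Return a structured record for one log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])  # store the status code as a number
    return record


def parse_lines(lines):
    """Split raw lines into structured records and rejected lines."""
    records, rejected = [], []
    for line in lines:
        record = parse_line(line)
        if record is None:
            rejected.append(line)
        else:
            records.append(record)
    return records, rejected


raw = [
    '10.0.0.1 [2018-03-01T10:00:00] "GET /home" 200',
    "corrupted line with no recognizable fields",
]
records, rejected = parse_lines(raw)
print(len(records), len(rejected))  # 1 1
```

Keeping the rejected lines, rather than silently dropping them, is what lets a data quality process measure how much of the input could not be structured.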

Volatility

The data changes rapidly, which gives it a very short period of validity. Handling this requires very high processing power.

If this is not done well, processing and analysis based on this data can produce erroneous conclusions, which can lead to errors in decision making.
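One simple guard against stale data is to discard records that are older than a chosen validity period before analyzing them. The sketch below assumes each record carries a `timestamp` field; the field name and the one-minute validity period are illustrative assumptions.

```python
from datetime import datetime, timedelta


def fresh_records(records, now, max_age):
    """Keep only the records whose timestamp is within max_age of now."""
    return [r for r in records if now - r["timestamp"] <= max_age]


now = datetime(2018, 3, 1, 12, 0, 0)
readings = [
    {"sensor": "a", "timestamp": datetime(2018, 3, 1, 11, 59, 30)},  # 30 seconds old
    {"sensor": "b", "timestamp": datetime(2018, 3, 1, 11, 50, 0)},   # 10 minutes old
]
valid = fresh_records(readings, now, max_age=timedelta(minutes=1))
print([r["sensor"] for r in valid])  # ['a']
```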

There are no unified data quality standards

In 1987 the International Organization for Standardization (ISO) published ISO 9000 standards to guarantee the quality of products and services. However, the study of data quality standards did not begin until the 1990s, and it was not until 2011 that ISO published the ISO 8000 data quality standards.

These standards still need to mature and be refined. Furthermore, research on big data quality has only recently begun and there are hardly any results.

The quality of big data is key, not only for obtaining competitive advantages but also for avoiding serious strategic and operational errors based on erroneous data, whose consequences can be severe.

Data Governance Plan

Governance means ensuring that data is authorized, organized and subject to the necessary user permissions in a database, with as few errors as possible, while maintaining privacy and security. Striking this balance is difficult, especially when where and how data is housed and processed is constantly changing.

Granular Data Access

You cannot have effective data governance without granular controls.

These granular controls can be achieved through access control expressions. These expressions use grouping and Boolean logic to control data access and authorization flexibly, with role-based permissions and visibility settings.
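A minimal sketch of such an expression evaluator follows. It is not the syntax of any particular product: here an expression is either an attribute name or a nested tuple starting with "and", "or" or "not", and the attribute names are hypothetical.

```python
def evaluate(expression, attributes):
    """Evaluate an access control expression against a set of user attributes.

    An expression is either an attribute name (a string) or a tuple whose first
    element is "and", "or" or "not", followed by sub-expressions.
    """
    if isinstance(expression, str):
        return expression in attributes
    operator, *operands = expression
    if operator == "and":
        return all(evaluate(operand, attributes) for operand in operands)
    if operator == "or":
        return any(evaluate(operand, attributes) for operand in operands)
    if operator == "not":
        if len(operands) != 1:
            raise ValueError("'not' takes exactly one operand")
        return not evaluate(operands[0], attributes)
    raise ValueError(f"unknown operator: {operator!r}")


# (analyst OR admin) AND NOT suspended
policy = ("and", ("or", "role:analyst", "role:admin"), ("not", "status:suspended"))

print(evaluate(policy, {"role:analyst"}))                      # True
print(evaluate(policy, {"role:analyst", "status:suspended"}))  # False
```

Nesting the tuples plays the role of grouping with parentheses, so a single policy can combine roles and exclusions.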

At the lowest level, confidential data is protected by hiding it; at the top level, confidentiality agreements are in place for data scientists and BI analysts. This can be done with data masking capabilities and different views, in which raw data is locked down as much as possible and access is gradually widened until, at the top, administrators are given the most visibility.

Having different levels of access makes security more integrated.
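The layered views described here can be sketched as a function that returns a different version of the same record depending on the role. The three roles, the sensitive field names and the masking rule are assumptions made for the example.

```python
SENSITIVE_FIELDS = {"email", "phone"}  # hypothetical field names
ACCESS_LEVELS = {"viewer": 0, "analyst": 1, "admin": 2}  # hypothetical roles


def mask_value(value):
    """Hide all but the last two characters of a value."""
    text = str(value)
    return "*" * max(len(text) - 2, 0) + text[-2:]


def view_for(record, role):
    """Build the view of a record that a given role is allowed to see."""
    level = ACCESS_LEVELS[role]
    view = {}
    for field, value in record.items():
        if field not in SENSITIVE_FIELDS or level == 2:
            view[field] = value              # non-sensitive, or full access
        elif level == 1:
            view[field] = mask_value(value)  # masked view
        # level 0: sensitive fields are omitted entirely
    return view


customer = {"name": "Ana", "email": "ana@example.com", "phone": "5551234567"}
print(view_for(customer, "viewer"))            # {'name': 'Ana'}
print(view_for(customer, "analyst")["phone"])  # ********67
```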

Data Protection

Governance does not occur without security at the end point of the chain. It is important to build a good perimeter and put a firewall around the data, integrated with existing authentication systems and standards. When it comes to authentication, it's important for companies to sync up with proven systems.

Authentication is a matter of integrating with LDAP, Active Directory, and other directory services. Tools such as Kerberos can also be used for authentication. The important thing is not to create a separate infrastructure, but to integrate with the existing one.

Encryption

The next step, after protecting the perimeter and authenticating all the granular data access being granted, is to make sure that files and personally identifiable information (PII) are encrypted and tokenized from end to end of the data pipeline.

Once someone is past the perimeter and has access to the system, protecting PII is extremely important. That data needs to be encrypted so that, regardless of who has access to it, they can run the analyses they need without exposing any of it.
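One way to let analysts work with records without seeing the underlying PII is to replace each PII value with a deterministic token. The sketch below uses a keyed hash (HMAC-SHA256) from the Python standard library; note that this is one-way pseudonymization rather than reversible encryption, and the key and field names are placeholders for illustration.

```python
import hashlib
import hmac


def tokenize(value, secret_key):
    """Return a deterministic token: the same value and key always give the same token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def tokenize_record(record, pii_fields, secret_key):
    """Replace the PII fields of a record with tokens and leave the rest untouched."""
    return {
        field: tokenize(str(value), secret_key) if field in pii_fields else value
        for field, value in record.items()
    }


key = b"example-key-do-not-use-in-production"  # placeholder key
purchase = {"email": "ana@example.com", "amount": 120}
safe = tokenize_record(purchase, {"email"}, key)
print(safe["amount"], len(safe["email"]))  # 120 64
```

Because the tokens are deterministic, two records from the same customer still share a token, so counts and joins keep working on the tokenized data.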

Audit and Analysis

The strategy does not work without an audit. That level of visibility and accountability at every step of the process is what enables IT to "govern" the data rather than simply setting access policies and controls and hoping for the best. It's also how companies can keep their strategies up to date in an environment where the way we view data and the technologies we use to manage and analyze it are changing every day.

We are in the infancy of Big Data and IoT (Internet of Things), and it is critical to be able to track access and recognize patterns in the data.

Auditing and analysis can be as simple as tracking JavaScript Object Notation (JSON) files.
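As a small illustration, the sketch below reads audit events stored one JSON object per line and counts denied access attempts per user. The event fields (`user`, `resource`, `result`) are a hypothetical schema, not one taken from any specific tool.

```python
import json
from collections import Counter


def load_audit_events(lines):
    """Parse audit events stored as one JSON object per line, skipping blank lines."""
    events = []
    for line in lines:
        line = line.strip()
        if line:
            events.append(json.loads(line))
    return events


def denied_attempts_per_user(events):
    """Count how many access attempts were denied for each user."""
    return Counter(e["user"] for e in events if e["result"] == "denied")


audit_log = [
    '{"user": "ana", "resource": "sales.csv", "result": "granted"}',
    '{"user": "luis", "resource": "payroll.csv", "result": "denied"}',
    "",
    '{"user": "luis", "resource": "payroll.csv", "result": "denied"}',
]
events = load_audit_events(audit_log)
print(denied_attempts_per_user(events))  # Counter({'luis': 2})
```

A repeated pattern of denied attempts by the same user is the kind of signal an audit is meant to surface.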

Unified Data Architecture

Ultimately, the IT manager who oversees the business data management strategy must think about the details of granular access, authentication, security, encryption, and auditing. But they should not stop there. They should also think about how each of these components integrates into the global data architecture, and about how that infrastructure will need to be scalable and secure, from data collection and storage to BI, analytics, and other third-party services. Data governance is as much about rethinking strategy and execution as it is about the technology itself.

It goes beyond a set of security rules. It is a single architecture in which these roles are created and synchronized across the entire platform and all the tools that are brought into it.

Thesis proposal

Proposal 1

Use Big Data to analyze information about society in Veracruz in order to prevent crime, by monitoring activity on networks in a way that facilitates the guidance and correction of individuals.

Proposal 2

Generate proposals for improving the social fabric, scaling up from the groups that lag furthest behind in order to achieve faster integration.

Reference sources

Data Management Specialists. (October, 2012). Big Data: What is it? Its importance, challenges and governance. Retrieved March 2018 from the PowerData website.

ORACLE. (August, 2014). Business big data. Retrieved March 2018 from the ORACLE Latin America website.

Quer, A. (September 5, 2013). How are Big Data and Hadoop related? Retrieved March 2018 from the PowerData website.

______________________

Hadoop: open source system used to store, process and analyze large volumes of data.

Terabyte (TB): equivalent to 10^12 bytes, that is, 1,000,000,000,000 (one trillion) bytes.

Petabyte (PB): equivalent to 10^15 bytes, that is, 1,000,000,000,000,000 bytes.

File: a logical set of information or data that is designated by a name and is configured as a complete stand-alone unit for the system or user.

Node: point of intersection or union of several elements that come together in the same place.

Add-on: an extension or addition; it may refer to an installable upgrade for IT projects.

Metadata: group of data that describes the informative content of an object called a resource.

Internet of Things: connects objects that were once linked via closed circuit, such as communicators, cameras and sensors, and enables them to communicate globally over the network of networks.
