Data, text and sentiment mining

Data mining could be defined as a process of discovering new and significant relationships, trends and patterns by exploring large amounts of data.

The availability of large amounts of information, together with a variety of computer tools, has moved data analysis toward a set of specialized techniques known collectively as data mining.

Data mining techniques aim to automatically discover the knowledge contained in the information stored, in an orderly manner, in large databases. The main purpose is to find patterns, profiles and trends by analyzing the data with technologies such as pattern recognition, neural networks, fuzzy logic, genetic algorithms and other advanced data analysis techniques.

Today data mining is used in many fields: science, finance and banking, market and business analysis, public and private health, education, industrial processes, medicine, biology and bioengineering, telecommunications, and other areas. (Perez Lopez & Santín González, 2007)

This article reviews the concept of data mining, its applications and its importance for decision-making in organizations.

What is data mining?

Data mining is a group of techniques used to extract valid, detailed and highly useful information from databases. It supports decision-making by helping to predict future trends and behaviors, which makes it a powerful tool for organizations.

The term data mining draws an analogy with the mining industry, which extracts minerals from the earth by blasting so that they come to the surface. In the same way, data mining exploits databases in order to extract the hidden information they contain.

Using various algorithmic tools and techniques, data mining searches databases for hidden patterns of interest in order to anticipate future situations and forecast them with a certain degree of probability. In this way it finds predictive information that even an expert could not easily find unaided. Data mining can be implemented on any hardware and software platform and can be integrated into online information systems. (Reinosa & Maldonado, 2012)

History of data mining

Data mining is not new. It dates back to the 1960s, when statisticians used the terms data fishing, data mining or data archaeology. In the 1980s people began to use the term KDD (knowledge discovery in databases), the overall process of extracting knowledge from data, of which data mining is one part. From then on, companies providing data mining services gradually emerged, and more than one hundred now exist. (Felix, 2002)

A well-known data mining success story dates from 1992, when an employee of the NCR corporation conducted a study for Osco Drugs of American Stores. The study found that between 5 and 7 p.m. two seemingly unrelated items were frequently bought together: diapers and beer. The conclusion was that many parents sent out to buy diapers at that hour also picked up a few cans of beer, and the store used this finding to place the beer refrigerator near the diaper display to encourage the purchase, whether planned or on impulse. This illustrates the unexpected results that data mining can produce and the decisions an organization can base on them. The organization must be agile in acting on the results; data mining alone is of no use unless its findings are put to work. (Reinosa & Maldonado, 2012)
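The kind of finding in the diapers-and-beer story can be illustrated with two standard association measures, support and confidence. The sketch below is only an illustration: the baskets are invented for the example and are not the Osco data, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for illustration only (not the Osco data).
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

most_common_pair, n = pair_counts(baskets).most_common(1)[0]
print(most_common_pair, n)
print(support({"diapers", "beer"}, baskets))
print(confidence({"diapers"}, {"beer"}, baskets))
```

Support measures how often the pair appears at all, while confidence measures how often beer appears in baskets that already contain diapers. A real study would also restrict the baskets to the time window of interest.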

Business intelligence

Data mining has its origin in information systems whose purpose was to collect information on a given topic in order to make decisions. As new software and hardware emerged, organizations were computerized and information systems began to support basic business processes such as sales, production and human resources; these are called management information systems. Over time, companies needed a basis for decision-making, and tools known as DSS (decision support systems) emerged to meet that need, among them EIS and OLAP as well as the various data mining tools.

EIS (executive information systems) are a set of tools that give company executives access to the status of activities and their management. They report any change in the company immediately by tracking the daily state of the organization through key indicators. The information typically requested includes weekly sales, partial balances and stock levels, usually presented as charts in spreadsheets. (Perez Lopez & Santín González, 2007)

OLAP (online analytical processing) tools make it easy to manipulate and transform data to produce new data. The goal of OLAP is to speed up queries over large amounts of data.
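As a rough illustration of the kind of query OLAP tools answer, the sketch below rolls up and slices a tiny fact table in plain Python. The sales rows, the dimension names and the helper functions are all hypothetical; a real OLAP engine precomputes and indexes such aggregates rather than scanning rows.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, month, amount).
sales = [
    ("North", "diapers", "2024-01", 120.0),
    ("North", "beer", "2024-01", 80.0),
    ("South", "diapers", "2024-01", 60.0),
    ("North", "diapers", "2024-02", 100.0),
    ("South", "beer", "2024-02", 90.0),
]
DIMENSIONS = ("region", "product", "month")

def roll_up(facts, by):
    """Sum the amount over every dimension not listed in `by`."""
    idx = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[3]
    return dict(totals)

def slice_facts(facts, dimension, value):
    """Keep only the facts where one dimension equals a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == value]

by_region = roll_up(sales, ["region"])
north_by_product = roll_up(slice_facts(sales, "region", "North"), ["product"])
print(by_region)
print(north_by_product)
```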

Data mining tools aim to extract patterns and trends in order to predict future behavior. Data mining analyzes the data, while OLAP and EIS facilitate access to the information so that the analysis can be more effective; in that sense they support data mining.

Which tool to use depends on the objective of the organization, so the choice should start from a basic question, as shown in the following table: (Braga, Valencia, & Carvajal, 2009)

For these systems to work there must be a data warehouse: a collection of internal or external historical data that describes a context or study area, is oriented towards a domain, and allows tools to be applied to describe, summarize and analyze the data to aid decision-making.

The warehouse is loaded through a process called ETL (extraction, transformation, load), which reads the source data, incorporates new data, creates keys, and so on. The following image explains how these systems work.
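In code, the extract, transform and load stages can be sketched as three small functions. This is a minimal sketch under stated assumptions: the CSV text, the column names and the `sales` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real ETL job would read from operational systems.
RAW_CSV = """customer,product,amount,date
Ana, Diapers ,12.50,2024-01-05
Luis,beer,8.00,2024-01-05
Ana,BEER,,2024-01-06
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize text, convert types, drop rows without an amount."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue
        clean.append((
            row["customer"].strip(),
            row["product"].strip().lower(),
            float(row["amount"]),
            row["date"].strip(),
        ))
    return clean

def load(rows, conn):
    """Load: write rows into the warehouse table, generating a surrogate key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "customer TEXT, product TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO sales (customer, product, amount, sale_date) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT id, customer, product, amount FROM sales ORDER BY id"
).fetchall()
print(loaded)
```

The `id` column plays the role of the keys that the ETL process creates; the row with the missing amount is discarded during transformation.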

Data mining techniques

Data mining techniques are classified as predictive, descriptive or auxiliary, and are organized as shown in the following image.

How do you create a data mining model?

To apply data mining you can follow these six steps:

- Define the problem
- Prepare the data
- Explore the data
- Generate models
- Explore and validate the models
- Deploy and update the models

The following figure shows these steps.

As the figure shows, this is a cyclical process: if the data are not sufficient to build the model, or the models are not adequate for the intended purpose, the same steps must be repeated to create a new model.

Define the problem

The first thing to do to create a mining model is to define the problem and consider how the data can be used to solve it.

At this point the business requirements are analyzed, the scope of the problem and the way the model will be evaluated are defined, and the specific objectives of the data mining project are established. The following questions can help:

- What are you looking for? What kinds of relationships are you trying to find?
- Does the problem you are trying to solve reflect the policies or processes of the business?
- Do you want to make predictions from the data mining model, or just look for interesting patterns and associations?
- What result do you want to predict?
- What data do you have and what type of information is in each column? If there are several tables, how are they related?
- Does the data need to be cleaned, aggregated or processed before it is used?
- How is the data distributed? Is it seasonal? Does it accurately represent business processes?

Prepare the data

The next step is to consolidate and clean the data identified in the previous step. The data may be widely scattered or contain inconsistencies, such as a customer who bought a product before it went on the market or who shops at a store located 20,000 km from home.

Cleaning is not only about removing invalid data; it also involves looking for correlations hidden in the data, identifying the most accurate data sources, and determining which columns are best suited for analysis.
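The two inconsistencies mentioned above (a purchase dated before the product launch, and a store implausibly far from the customer's home) can be flagged with simple rules. The sketch below is illustrative only: the records, the launch date and the 500 km threshold are assumptions made for the example.

```python
from datetime import date

# Hypothetical launch dates and purchase records, invented for illustration.
LAUNCH_DATES = {"widget": date(2024, 3, 1)}
MAX_PLAUSIBLE_KM = 500  # assumed threshold; the right value depends on the business

purchases = [
    {"customer": "c1", "product": "widget", "date": date(2024, 3, 10), "km_from_home": 4},
    {"customer": "c2", "product": "widget", "date": date(2024, 2, 1), "km_from_home": 7},
    {"customer": "c3", "product": "widget", "date": date(2024, 4, 2), "km_from_home": 20000},
]

def find_issues(record):
    """Return a list of reasons why a record looks inconsistent."""
    issues = []
    launch = LAUNCH_DATES.get(record["product"])
    if launch is not None and record["date"] < launch:
        issues.append("purchased before launch")
    if record["km_from_home"] > MAX_PLAUSIBLE_KM:
        issues.append("store implausibly far from home")
    return issues

valid = [r for r in purchases if not find_issues(r)]
flagged = {r["customer"]: find_issues(r) for r in purchases if find_issues(r)}
print(len(valid), flagged)
```

Flagging records with a reason, rather than deleting them silently, keeps the evidence needed to decide later whether the source system or the record itself is at fault.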

Explore the data

You need to know the data well to make sound decisions when creating mining models. Exploration techniques include calculating minimum and maximum values, calculating means and standard deviations, and examining the distribution of the data.
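These exploration measures are available in Python's standard library. The following sketch assumes a single hypothetical numeric column, `amounts`, and uses a coarse histogram as a stand-in for examining the distribution.

```python
import statistics
from collections import Counter

# Hypothetical column of order amounts, invented for illustration.
amounts = [12.0, 15.0, 15.0, 18.0, 22.0, 95.0]

def describe(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "median": statistics.median(values),
    }

def histogram(values, bin_width):
    """Count values per bin to get a rough view of the distribution."""
    bins = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))

summary = describe(amounts)
print(summary)
print(histogram(amounts, 20))
```

Comparing the mean with the median is a quick way to notice skew: here a single large value pulls the mean well above the median.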

Generate models

The fourth step is to generate the model using the knowledge gained while exploring the data. This requires defining which data columns will be used, which is done by creating a mining structure.

Explore and validate the models

The next step in the data mining process is to explore the previously obtained models and verify that they are effective prior to deployment. By testing the models, you can see which one offers better results for the problem initially posed.

If none of the models that have been created work, go back to the previous steps to either rethink the problem or re-investigate the data from the original set.
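One common way to compare candidate models is to hold out part of the data and measure accuracy on it. The sketch below is a toy illustration, not the procedure of any particular tool: the data set, the two candidate models (a majority-class baseline and a single-threshold rule) and the fixed every-third-row split are all assumptions chosen to keep the example deterministic.

```python
# Hypothetical labeled data: (monthly_visits, bought_again). Invented for illustration.
data = [(v, v >= 5) for v in range(1, 11)] * 3

# Deterministic holdout: every third row is kept aside for testing.
# A real project would shuffle the data or use cross-validation.
test = data[::3]
train = [row for i, row in enumerate(data) if i % 3 != 0]

def fit_majority(train):
    """Baseline model: always predict the most frequent training label."""
    positives = sum(1 for _, y in train if y)
    label = positives >= len(train) - positives
    return lambda x: label

def fit_threshold(train):
    """Pick the threshold on x that gives the best training accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted({x for x, _ in train}):
        acc = sum((x >= t) == y for x, y in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: x >= best_t

def accuracy(model, rows):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)

models = {"majority": fit_majority(train), "threshold": fit_threshold(train)}
scores = {name: accuracy(m, test) for name, m in models.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

If no candidate beats the baseline on the held-out rows, that is the signal to return to the earlier steps.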

Deploy and update the models

Finally, the models that performed best are deployed to a production environment, where they can perform different tasks according to the needs of the company.

Among the tasks that the model can perform are:

- Make predictions that can later be used to make business decisions
- Create content queries to retrieve rules, formulas and statistics from the model
- Embed the functionality of the model in an application (Microsoft, 2014)

Data mining application

Data mining is currently used in various fields, including:

- Financial analysis: in the banking and financial sector, data mining provides data for reliable, systematic analysis. It makes it possible to predict loan payments, analyze customer credit policies, classify and group customers to create specialized offers, and detect possible fraud and financial crimes.
- Retail: stores collect large amounts of information every day from sales, purchase histories and freight transportation. With this data, predictions can be made that allow stores to offer better service and improve customer retention. In this sector data mining can be used to analyze sales, customers, products, time and region; analyze the effectiveness of sales campaigns; and recommend products in a personalized way.
- Telecommunications: data mining can be used to identify telecommunications patterns, detect fraudulent activity and make better use of resources, thus improving the quality of service. (Lantares, 2014)

What is text mining?

Text mining is the location, analysis and organization of information in texts in order to produce new information that is not evident from simply reading the documents. The new information can be a pattern, a trend or a correlation, and the documents can be web pages, emails, a text field in a database or a plain text file.

Text mining comprises three fundamental activities:

- Retrieve the information: select the appropriate texts
- Extract the information contained in those texts: key data, facts and events
- Use data mining to find associations among the key data extracted (galeon.com, 2016)

How is text mining done?

The following four stages can be followed to implement text mining:

First stage: Set the objectives, to make clear what the investigation is looking for and how deep it should go.

Second stage: Preprocess the data by selecting, analyzing and reducing the texts or documents from which the information will be extracted. This is the most time-consuming stage.

Third stage: Choose the model or technique to be used, which depends on the objectives set and the tasks to be carried out.

Fourth stage: Analyze the results in order to use the information found to make the decisions that best suit the organization. (galeon.com, 2016)
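The second stage, preprocessing, is usually where most of the code goes. The sketch below shows one simple form of it, followed by a very basic analysis: tokenizing, removing stop words and counting in how many documents each term appears. The mini-corpus and the stop-word list are hypothetical and far smaller than anything used in practice.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and stop-word list, invented for illustration.
documents = [
    "The delivery was late and the package was damaged.",
    "Great product, fast delivery!",
    "The product stopped working after a week.",
]
STOP_WORDS = {"the", "a", "and", "was", "after"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def document_frequency(docs):
    """Count in how many documents each term appears."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df

df = document_frequency(documents)
shared_terms = sorted(term for term, n in df.items() if n > 1)
print(shared_terms)
```

Terms that recur across documents are candidates for the recurring themes that the fourth stage would then interpret.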

Text mining application

- Information extraction: text mining can extract information from large amounts of text on the web, making it possible to identify entities and their relationships, reveal significant information and make the data easier to understand.
- Document classification: makes it easier to retrieve and browse documents, especially in companies that keep a historical record of their activities and projects in documents. Text mining algorithms group the documents and obtain descriptive information about each group in order to understand them better.
- Summarization: produces a general description of a set of documents on a specific topic. Summarization methods fall into two categories: extractive and abstractive.
- Knowledge extraction: text mining makes it possible to build knowledge models from the information extracted from the documents.

What is sentiment or opinion mining?

Opinion or sentiment mining can be defined as the application of natural language processing, computational linguistics and text mining techniques to extract subjective information from content that people publish, for example in blogs or online product reviews. The analysis yields valuable information about whether opinions are positive or negative.

Opinion or sentiment mining applies text mining and can be done in two ways:

- Polarity detection: establishes whether an opinion is positive or negative and, in addition, tries to assign it a numerical value within an established range, giving a rating for that opinion.
- Feature-based sentiment analysis: identifies the features of a product mentioned in user reviews or opinions and obtains a polarity for each feature. (Brainsins, 2015)
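Both approaches can be sketched with a small word lexicon. This is a deliberately naive illustration, not a description of any production system: the lexicon, the list of product features and the clause-splitting rule are assumptions, and real systems use much larger resources or trained models.

```python
import re

# Tiny hypothetical lexicon and feature list; real systems use far larger resources.
LEXICON = {"great": 1, "good": 1, "fast": 1, "bad": -1, "slow": -1, "terrible": -1}
ASPECTS = {"battery", "screen", "delivery"}

def polarity(text):
    """Average lexicon score of the opinion words, in the range [-1, 1]."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def aspect_polarity(review):
    """Score each clause and attach the score to the features it mentions."""
    result = {}
    for clause in re.split(r"[.,;!?]|\bbut\b|\band\b", review.lower()):
        words = set(re.findall(r"[a-z]+", clause))
        for aspect in ASPECTS & words:
            result[aspect] = polarity(clause)
    return result

review = "Great screen, but the battery is terrible and delivery was slow."
print(polarity(review))
print(aspect_polarity(review))
```

The overall polarity hides the fact that one feature is praised and two are criticized, which is the reason for scoring features separately.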

Conclusion

Data, text and sentiment mining provide very useful tools for analyzing data and texts and for identifying patterns of behavior that support decision-making. They have many uses, but it is up to each organization to decide which technique to use based on how the problem is initially framed.

Bibliography

Braga, L. P., Valencia, L. I., & Carvajal, S. S. (2009). Introduction to data mining. São Paulo: National Union of Publishers.
Brainsins. (2015). Obtained from: http://www.brainsins.com/es/blog/mineriaopiniones/3555
galeon.com. (April 2, 2016). Galeon.com. Obtained from: http://textmining.galeon.com/
Lantares. (2014). Obtained from: http://www.lantares.com/blog/mineria-de-datosaplicaciones-que-ya-son-una-realidad
Microsoft. (2014). Obtained from: https://msdn.microsoft.com/esmx/library/ms174949%28v=sql.120%29.aspx
Perez Lopez, C., & Santín González, D. (2007). Data Mining: Techniques and Tools. Madrid: Thomson International Editions Paraninfo.
Reinosa, E. J., & Maldonado, C. A. (2012). Database. Mexico: Alfaomega.

Acknowledgments

To the Technological Institute of Orizaba for giving me the opportunity to train professionally, and to Professor Fernando Aguirre y Hernández for the knowledge he has shared in his course Fundamentals of Administrative Engineering, which has helped me learn and refine my skills in writing quality scientific articles.

Image taken from the book Introduction to data mining by Luis Pablo Vieira Braga and others, 2009.

Image taken from the book Data Mining: Techniques and Tools by César Perez López and Daniel Santín González
