Data, text and opinion mining

Technological development has eased the work of people in many sectors. In agriculture, for example, every task was once manual: preparing the land, planting, watering, composting and harvesting. Today much of that work is done by machines that assist or replace people and finish the job faster.

In the medical sector, robots have made great strides, even operating in places inaccessible to humans. The same thing happens in large factories, where technology has come to replace labor to some extent, achieving continuous production, without fatigue, without overtime.

In administration, technology supports decision-making: by analyzing information, predictions can be made, as this article will show.

The explosive growth of databases and the Internet, together with techniques and tools that automatically and efficiently generate information from stored data, allows us to discover patterns and relationships and to formulate models. These techniques have become enormously important in areas such as marketing strategy, decision support, financial planning, scientific data analysis, bioinformatics, text analysis, and web data analysis.

Technology is here to stay, and day by day it complements and eases people's work.

Data Mining

Definition

Data mining is the set of techniques and technologies that allow exploring large databases, automatically or semi-automatically, with the aim of finding repetitive patterns, trends or rules that explain the behavior of data in a given context.

Essentially, data mining arose to help understand the content of a data repository. To this end, it uses statistical methods and, in some cases, search algorithms related to artificial intelligence and neural networks.

In general, data is the raw material. Once the user attributes a particular meaning to it, it becomes information. When specialists develop or find a model, and interpreting the information through that model adds value, we speak of knowledge.

Data mining is presented as an emerging technology with several advantages: on the one hand, it is a good meeting point between researchers and business people; on the other, it saves a company large amounts of money and opens up new business opportunities. Furthermore, working with this technology requires attention to countless details, because the final product feeds "decision making".

Advantages

It is a good meeting point between researchers and business people.

This point refers to the fact that new technology is often acquired by large companies, which finance the projects behind it.

It saves a company large amounts of money and opens up new business opportunities.

This supports the previous point: when a project is good, a company finances it and earns more than it invested, and the technology lets the company open up other opportunities in the market.

Working with this technology implies taking care of a number of details because the final product involves "decision making".

The technology has made its way into the market and produced something you can offer, but you still have to check how effective the implementation was: is the company growing or shrinking? That is what this point refers to.

It contributes to strategic and tactical decision making by providing an automated way to identify key information in the volumes of data generated by traditional and e-business processes. It allows users to prioritize decisions and actions by showing which factors have the greatest impact on an objective, which customer segments are disposable, and which business units are bypassed and why.

In other words, data mining lets you concentrate on the decision itself, because the technology lays out the advantages and disadvantages involved, some of which are listed in this point.

It gives decision-making power to the business users who best understand the problem and its environment, and it is able to measure actions and results in the best way.

Thanks to data mining, problems can be divided by sector, with a specialized work group assigned to each one in order to optimize time and resources.

Generates descriptive models: in a context of defined business objectives, it allows companies, regardless of industry or size, to automatically explore, visualize and understand their data and to identify patterns, relationships and dependencies that affect the final results of the income statement (such as increased revenues, increased profits, cost containment and risk management).

Generates predictive models: it allows previously undiscovered relationships, identified through the data mining process, to be expressed as business rules or predictive models. These outputs can be communicated in traditional formats (presentations, reports, shared electronic information, embedded in applications, etc.) to guide the strategy and planning of the company.

Techniques

Data mining techniques come from artificial intelligence and statistics. They are simply algorithms, more or less sophisticated, that are applied to a data set to obtain results.

Among the most used are:

1. Neural Networks

In recent years this artificial intelligence technique has become one of the most frequently used instruments for detecting common categories in data, because neural networks can detect and learn complex patterns and characteristics of the data.

One of the main characteristics of neural networks is that they can work with incomplete and even contradictory data, which can be an advantage or a disadvantage depending on the problem. In addition, this technique has two forms of learning: supervised and unsupervised.
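As a minimal sketch of the supervised form of learning, the following trains a single perceptron, the simplest neural unit, on labeled examples. The function names, the learning rate and the AND data set are illustrative assumptions; real networks stack many such units and use more refined training rules.

```python
def predict(weights, bias, inputs):
    """Fire (1) when the weighted sum of the inputs exceeds zero."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Supervised learning: nudge the weights whenever a prediction is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Labeled examples of the logical AND function (illustrative data).
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
print([predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES])
```

The labels play the role of the "teacher": the unit only changes when its answer disagrees with them. Unsupervised learning would have no such labels to compare against.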

2. Decision Trees

This technique belongs to supervised learning. It is represented as a tree in which each node is a decision; together, these decisions generate rules for classifying a data set.

Decision trees are easy to use, support discrete and continuous attributes, and handle non-significant attributes and missing values well. Their main advantage is ease of interpretation.
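To make the link between tree nodes and classification rules concrete, here is a small sketch that builds a tree from labeled rows and then prints it as rules. It assumes discrete attributes only and uses Gini impurity to pick each split; the attribute names and data are hypothetical, and continuous attributes and missing values are left out for brevity.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def build_tree(rows, labels, attributes):
    """Leaves are class labels; inner nodes are dicts describing one decision."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # majority class

    def split_impurity(attribute):
        # Weighted impurity of the groups produced by splitting on attribute.
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attribute], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    best = min(attributes, key=split_impurity)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return {"attribute": best, "branches": branches}


def extract_rules(tree, conditions=()):
    """Flatten the tree into readable IF ... THEN ... rules."""
    if not isinstance(tree, dict):
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {premise} THEN {tree}"]
    rules = []
    for value, subtree in tree["branches"].items():
        rules.extend(extract_rules(subtree, conditions + ((tree["attribute"], value),)))
    return rules


def classify(tree, row):
    """Follow the decisions from the root to a leaf (unseen values raise KeyError)."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Hypothetical customers and whether they bought a product.
ROWS = [
    {"income": "high", "student": "no"},
    {"income": "high", "student": "yes"},
    {"income": "low", "student": "no"},
    {"income": "low", "student": "yes"},
]
LABELS = ["skip", "buy", "skip", "buy"]

tree = build_tree(ROWS, LABELS, ["income", "student"])
for rule in extract_rules(tree):
    print(rule)
```

In this toy data the "income" attribute carries no signal, so the tree never uses it, which is one way a tree copes with non-significant attributes.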

3. Genetic Algorithms

Genetic algorithms mimic the evolution of species through mutation, reproduction and selection, following the principle of survival of the fittest. They also provide programs and optimizations that can be used in building and training other structures such as neural networks.
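The three operators named above can be sketched in a few lines. The example below evolves bit strings toward a toy objective (as many 1 bits as possible); the objective, the tournament selection scheme, the rates and the population sizes are all assumptions chosen for illustration.

```python
import random


def fitness(individual):
    """Toy objective (an assumption): count the 1 bits, so all ones is optimal."""
    return sum(individual)


def select(population, rng, k=3):
    """Selection: the fittest of k randomly drawn individuals survives."""
    return max(rng.sample(population, k), key=fitness)


def crossover(parent_a, parent_b, rng):
    """Reproduction: single-point crossover of two parents."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(individual, rng, rate=0.05):
    """Mutation: flip each bit with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def evolve(length=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(population, key=fitness)  # keep the best unchanged (elitism)
        history.append(fitness(elite))
        children = [elite]
        while len(children) < pop_size:
            child = crossover(select(population, rng), select(population, rng), rng)
            children.append(mutate(child, rng))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
print(fitness(best), history[0])
```

Because the best individual is always copied into the next generation, the best fitness recorded in `history` can never decrease from one generation to the next.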

4. Clustering

Clustering techniques group data into a number of classes, which may or may not be preset, based on distance or similarity criteria, so that the members of a class are similar to each other and different from the members of other classes. They have given significant results in classifiers and pattern recognizers, for example in system modeling. Because of its flexible nature, clustering can easily be combined with other data mining techniques, resulting in a hybrid system.
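As one possible illustration of distance-based grouping with a preset number of classes, here is a plain k-means sketch. k-means is only one of many clustering algorithms; the initialization (the first k points) and the sample points are simplifying assumptions.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of numbers, using Euclidean distance."""
    centroids = list(points[:k])  # naive initialization: the first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids


# Two visibly separate groups of hypothetical 2-D observations.
POINTS = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]

assignment, centroids = kmeans(POINTS, k=2)
print(assignment)
print(centroids)
```

Points that receive the same index belong to the same class. Swapping `math.dist` for another distance or similarity measure changes what "similar" means without changing the rest of the procedure.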

5. Machine Learning

This artificial intelligence technique is used to infer knowledge from the results of applying any of the techniques mentioned above.

Data Mining Models

A data mining model is created by applying an algorithm to data, but it is more than just an algorithm or a metadata container: it is a set of data, statistics, and patterns that can be applied to new data to generate predictions and deduce relationships.

Applications of Data Mining Models

Data mining models can be applied in scenarios such as the following:

Forecasting: calculating sales and predicting server loads or server downtime.

Risk and probability: choosing the best customers for mailings, determining the probable break-even point for risk scenarios, and assigning probabilities to diagnoses or other outcomes.

Recommendations: determining which products can be sold together and generating recommendations.

Sequence search: analyzing the items that customers have placed in the shopping cart and predicting likely next events.

Grouping: distributing customers or events into groups of related elements, and analyzing and predicting affinities.

Generation of Data Mining Models

Generating a data mining model is part of a larger process that ranges from asking questions about the data and creating a model to answer them, to implementing the model in a work environment.

This process can be defined by the following six basic steps:

1. Define the Problem

The first step in the data mining process is to clearly define the problem and consider ways to use the data to provide an answer to the problem.

This step includes analyzing the business requirements, defining the scope of the problem, defining the metrics by which the model will be evaluated, and defining the specific objectives of the data mining project. These tasks translate into questions like the following:

What are you looking for? What types of relationships are you trying to find?

Does the problem you are trying to solve reflect the policies or processes of the business?

Do you want to make predictions from the data mining model, or just look for interesting patterns and associations?

What result or attribute do you want to predict?

What kind of data do you have, and what kind of information is in each column? If there are multiple tables, how are they related?

Do you need to clean, aggregate or process the data before you can use it?

How is the data distributed? Is the data seasonal? Does the data accurately represent business processes?

To answer these questions, a data availability study may need to be conducted to investigate the needs of business users for available data. If the data does not meet the needs of the users, the project might have to be redefined.

2. Prepare the Data

The second step in the data mining process is to consolidate and clean the data identified in the previous step.

The data can be scattered across the company and stored in different formats; it can also contain inconsistencies such as missing or incorrect entries. For example, the data may show that a customer purchased a product before it was even offered on the market, or that the customer regularly shops at a store 2,000 kilometers from home.

Data cleansing involves not only removing invalid data or interpolating missing values, but also looking for hidden correlations in the data, identifying the data sources that are most accurate, and determining which columns are best suited for analysis. For example, should I use the ship date or the order date? What influences sales the most: quantity, total price, or a discounted price? Incomplete data, bad data, and inputs that appear independent, but are in fact closely correlated, can influence model results in ways that are not expected.

Therefore, before you start building your mining models, you must identify these issues and determine how they will be corrected. In data mining, you are typically working with a large data set and you cannot examine the quality of the data for every transaction; therefore, you may need to use data profiling, and automatic data cleaning and filtering tools to explore the data and look for inconsistencies.
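A minimal sketch of the kind of automatic check described in this step, flagging the inconsistencies used as examples above. The record layout, the field names, the launch-date table and the distance threshold are all hypothetical.

```python
from datetime import date

# Hypothetical reference data: when each product went on sale.
PRODUCT_LAUNCH = {"P1": date(2020, 3, 1)}


def find_issues(record, max_distance_km=500):
    """Return a list of data-quality problems found in one purchase record."""
    issues = []
    for field in ("customer_id", "product_id", "purchase_date"):
        if record.get(field) is None:
            issues.append(f"missing {field}")
    launch = PRODUCT_LAUNCH.get(record.get("product_id"))
    purchased = record.get("purchase_date")
    if launch and purchased and purchased < launch:
        issues.append("purchased before launch")
    distance = record.get("store_distance_km")
    if distance is not None and distance > max_distance_km:
        issues.append("store unusually far from home")
    return issues


RECORDS = [
    {"customer_id": 1, "product_id": "P1", "purchase_date": date(2020, 1, 15), "store_distance_km": 3},
    {"customer_id": 2, "product_id": "P1", "purchase_date": date(2020, 6, 1), "store_distance_km": 2000},
    {"customer_id": None, "product_id": "P1", "purchase_date": date(2020, 6, 1), "store_distance_km": 5},
    {"customer_id": 4, "product_id": "P1", "purchase_date": date(2020, 6, 2), "store_distance_km": 4},
]

for record in RECORDS:
    print(record["customer_id"], find_issues(record))
```

Flagging rather than deleting keeps the decision about how to correct each issue with the analyst, in line with the advice to identify problems first and then decide how they will be corrected.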

3. Explore the Data

The third step in the data mining process is to explore the prepared data. You need to know the data to make the right decisions when building your data mining models. Exploration techniques include calculating minimum and maximum values, calculating the mean and standard deviation, and examining the distribution of the data.

For example, reviewing the maximum, minimum, and mean values could determine that the data is not representative of customers or business processes, and therefore you must obtain more balanced data or review the assumptions that are the basis of your expectations. Standard deviations and other distribution values can provide useful information on the stability and accuracy of the results. A large standard deviation may indicate that adding more data could help you improve your model. Data that deviates greatly from a standard distribution could be skewed or could represent an accurate picture of a real-life problem, but make it difficult to fit a model to the data.

By exploring the data to understand the business problem, you can decide whether the data set contains faulty data, and then you can devise a strategy to correct the problems or get a deeper description of the behaviors that are typical of the business.
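The exploration measures mentioned in this step can be computed with Python's standard `statistics` module. The column of order totals below is hypothetical; the median is added as an assumption because comparing it with the mean is a quick way to spot skew.

```python
import statistics


def summarize(values):
    """Basic profile of one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical order totals; the last value is a suspicious outlier.
order_totals = [10, 12, 11, 13, 12, 95]

profile = summarize(order_totals)
print(profile)
```

Here the mean sits far above the median and the standard deviation is large relative to most values, which suggests the maximum deserves a closer look before any model is built.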

4. Generate Models

The fourth step in the data mining process is to build the data mining model or models.

You must define which columns of data you want to use; to do this, a data mining structure is created. The mining structure is linked to the data source, but does not actually contain any data until it is processed. Processing the data mining structure generates aggregates and other statistical information that can be used for analysis.

Before processing the structure and the model, a mining model is simply a container that specifies the columns to use for input, the attribute that it is predicting, and parameters that tell the algorithm how to process the data. Processing a model is often called training. Training refers to the process of applying a specific mathematical algorithm to data in the structure to extract patterns. The patterns you find in the training process will depend on the selection of the training data, the algorithm you choose, and how the algorithm has been configured.

The parameters can also be used to fine-tune each algorithm, and filters can be applied to the training data to use a subset of the data, creating different results. After passing the data through the model, the mining model object contains the summaries and models that can be queried or used for prediction.

It is important to remember that whenever the data changes, you must update the mining structure and model.
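The idea of a model as an empty container that training fills with patterns can be sketched as follows. The `MiningModel` class, its `min_support` parameter, the `row_filter` argument and the counting algorithm are hypothetical illustrations, not the API of any particular product.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class MiningModel:
    """Before training this is only a container: columns, target and parameters."""

    input_columns: list
    target: str
    min_support: int = 1  # parameter that tunes the toy algorithm
    patterns: dict = field(default_factory=dict)

    def train(self, rows, row_filter=lambda row: True):
        """Learn which target value most often accompanies each input value."""
        counts = defaultdict(Counter)
        for row in filter(row_filter, rows):
            for column in self.input_columns:
                counts[(column, row[column])][row[self.target]] += 1
        self.patterns = {
            key: counter.most_common(1)[0][0]
            for key, counter in counts.items()
            if sum(counter.values()) >= self.min_support
        }
        return self

    def predict(self, row):
        """Vote with every learned pattern that matches the new row."""
        votes = Counter(
            self.patterns[(column, row[column])]
            for column in self.input_columns
            if (column, row[column]) in self.patterns
        )
        return votes.most_common(1)[0][0] if votes else None


# Hypothetical training data.
ROWS = [
    {"plan": "basic", "churn": "yes"},
    {"plan": "basic", "churn": "yes"},
    {"plan": "premium", "churn": "no"},
    {"plan": "premium", "churn": "no"},
    {"plan": "basic", "churn": "no"},
]

model = MiningModel(input_columns=["plan"], target="churn")
print(model.patterns)  # still empty: nothing has been processed yet
model.train(ROWS)
print(model.predict({"plan": "basic"}))
```

Training the same container with a different `row_filter` or `min_support` yields different patterns, and calling `train` again after the data changes refreshes them.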

5. Explore and Validate Models

The fifth step in the data mining process is to explore the data mining models that you have generated and verify their effectiveness.

Before implementing a model in a production environment, it is a good idea to test whether it works correctly. Also, when building one model, you typically create several with different configurations and test all of them to see which one provides the best results for your problem and your data.
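A simple way to compare several configurations is to score each one on data that was held back from training. In this sketch the "models" are threshold rules standing in for real trained models; the thresholds, labels and holdout rows are hypothetical.

```python
def make_threshold_model(threshold):
    """One candidate configuration: label customers by purchase amount."""
    return lambda amount: "high_value" if amount >= threshold else "regular"


def accuracy(model, labeled_rows):
    """Share of held-out rows that the model labels correctly."""
    return sum(model(x) == y for x, y in labeled_rows) / len(labeled_rows)


# Hypothetical held-out data: (purchase amount, true label).
HOLDOUT = [
    (20, "regular"), (45, "regular"), (60, "high_value"),
    (80, "high_value"), (55, "high_value"), (30, "regular"),
]

candidates = {t: make_threshold_model(t) for t in (25, 50, 75)}
scores = {t: accuracy(model, HOLDOUT) for t, model in candidates.items()}
best_threshold = max(scores, key=scores.get)
print(scores, best_threshold)
```

Accuracy is only one possible metric; the metrics chosen when the problem was defined are the ones that should decide which configuration goes to production.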

6. Implement and Update the Models

The last step in the data mining process is to implement the models that work best in a production environment.

Once the data mining models are in the production environment, different tasks can be carried out, depending on the needs. The following are some of the tasks you can perform:

Use the models to create predictions that can then be used to make business decisions.

Create content queries to retrieve statistics, rules, or formulas from the model.

Create a report that allows users to query directly against an existing mining model.

Update the models after review and analysis.

Update the models dynamically as more data enters the organization; making constant modifications to improve the effectiveness of the solution should be part of the implementation strategy.

Text Mining

Text mining is a branch of computational linguistics that tries to obtain information and knowledge from data sets that, in principle, have no order or were not originally arranged to convey that information. It is a key technique in today's world, in which data is continuously collected from different perspectives and about many different aspects of human activity.

Text mining should not be confused with information retrieval, which is the automatic retrieval of relevant documents through text indexing, classification, categorization, etc. The information that really interests text mining is contained in those documents, but in a general way: it is not found in any one specific text but is the global information shared by all the records, texts and documents in the collection. It is an analysis of what the texts in the collection have in common, offered indirectly: information that the collection yields to specialists even though it was not deliberately included when the collection was created for later dissemination to users.

Text Mining comprises three fundamental activities:

Information retrieval, that is, selecting the relevant texts.

Extraction of the information included in those texts: facts, events, key data, relationships between them, etc.

Data mining, as defined earlier, to find associations among the key data previously extracted from the texts.
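The three activities can be chained into a small pipeline. The sketch below is deliberately naive: retrieval is a keyword match, extraction is a lookup against an assumed vocabulary of key terms, and the mining step just counts which terms appear together. The documents and vocabulary are hypothetical.

```python
from collections import Counter
from itertools import combinations

DOCUMENTS = [
    "Battery life is short and the screen is dim.",
    "Great screen, but the battery drains fast.",
    "Shipping was slow.",
    "The battery and the charger both failed.",
]
VOCABULARY = {"battery", "screen", "charger", "shipping"}  # assumed key terms


def retrieve(documents, keyword):
    """Activity 1: keep only the texts relevant to the query."""
    return [doc for doc in documents if keyword in doc.lower()]


def extract(document):
    """Activity 2: pull out the key terms mentioned in one text."""
    text = document.lower()
    return sorted(term for term in VOCABULARY if term in text)


def mine_associations(term_lists):
    """Activity 3: count how often two key terms appear in the same text."""
    pairs = Counter()
    for terms in term_lists:
        pairs.update(combinations(terms, 2))
    return pairs


relevant = retrieve(DOCUMENTS, "battery")
associations = mine_associations(extract(doc) for doc in relevant)
print(associations.most_common())
```

The association between "battery" and "screen" is not stated in any single document; it only emerges from the collection as a whole, which is the kind of global information text mining is after.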

Applications

It is very useful for companies, administrations and organizations in general that, because of how they operate and what they do, generate a large number of documents and want to obtain information from all that volume of data. It can help them get to know their customers better: their habits, preferences, etc.

Stages

It is a relatively new, changing technique that can be adapted to different situations and cases, so there is no strict method to always follow. However, in general terms it could be said that these are the four main stages:

1. Determination of Objectives

Clarify what the research is looking for, decide how deep it should go, and define its limits clearly.

2. Data Pre-processing

It is the selection, analysis and reduction of the texts or documents from which the information will be extracted. This stage is the most time-consuming.
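One common way to reduce a text before analysis is to lowercase it, split it into word tokens and drop very frequent words that carry little meaning. This is only one possible form of pre-processing, and the tiny stop-word list below is an illustrative assumption.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it", "was"}


def preprocess(text):
    """Lowercase, tokenize on letters and apostrophes, and remove stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The delivery was fast, and the packaging is great!"))
```

The reduced token list is what later stages would count, compare or classify.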

3. Determination of the Model

Depending on the objectives set and the task to be carried out, some techniques or others can be used.

4. Analysis of Results

The extracted data is checked for coherence and searched for evidence, similarities, exceptions, etc., that can help the specialist or the user who commissioned the study draw conclusions for improving some aspect of their company, administration or organization.

Opinion Mining or Sentiment Analysis

Opinion mining refers to a series of applications of natural language processing, computational linguistics and text mining techniques that aim to extract subjective information from user-generated content, such as blog comments or product reviews. With this type of technology, a tangible and direct value, such as "positive" / "negative", can be extracted from a textual comment.

In general, there are two types of tasks related to Opinion Mining:

Polarity detection: determining whether an opinion is positive or negative. Beyond a basic polarity, you may also want a numerical value within a certain range, which in effect tries to assign an objective “rating” to the opinion.

Feature-based sentiment analysis: determining the different features of the product discussed in the user's opinion or review and, for each feature mentioned, extracting a polarity. These approaches are much more complex and much finer-grained than polarity detection.
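Both tasks can be illustrated with a very small lexicon-based sketch. The word lists, the scoring rule (a value between -1 and 1) and the clause splitting are simplifying assumptions; practical systems rely on much larger lexicons or trained classifiers.

```python
import re

# Tiny illustrative sentiment lexicons.
POSITIVE = {"great", "good", "excellent", "fast", "love"}
NEGATIVE = {"bad", "poor", "slow", "terrible", "short"}


def polarity_score(text):
    """Numeric polarity in [-1, 1]: average of +1/-1 over the opinion words found."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [1 for t in tokens if t in POSITIVE] + [-1 for t in tokens if t in NEGATIVE]
    return sum(hits) / len(hits) if hits else 0.0


def label(score):
    """Map a numeric score to a basic polarity."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def feature_polarity(review, features):
    """Score each clause separately and attach the result to the feature it mentions."""
    result = {}
    for clause in re.split(r"[.;,!?]| but ", review.lower()):
        for feature in features:
            if feature in clause:
                result[feature] = label(polarity_score(clause))
    return result


REVIEW = "The screen is great, but the battery life is short."

print(label(polarity_score(REVIEW)))
print(feature_polarity(REVIEW, ["screen", "battery"]))
```

The whole review scores as neutral because one positive and one negative word cancel out, while the feature-based view shows that the screen is praised and the battery criticized. This is the finer granularity that makes the second task more informative and harder.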

Conclusion

Data, Text and Opinion Mining are very important tools for analyzing the information of a company or organization and they are used to forecast based on the trends that have been present over a period of time.

Technology applied to administration tries to provide means that make an organization easier to control and help prevent errors that could occur.

These are the tools of the present and of the future, which is why more and more companies use them and why more specialists are needed every day.

