
Data and information mining

Anonymous

Simple definition of mining:

The process or business of digging in mines for minerals, metals, jewelry, etc.

Data Mining: What is Data Mining?

Overview

In general, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information, information that can be used to increase revenue, reduce costs, or both. Data mining software is one of a number of analytical tools for data analysis. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among fields in large relational databases.

Continuous innovation

Although data mining is a relatively new term, the technology is not. For years, companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports. However, continuous innovations in computing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

Example

For example, one Midwest grocery chain used the data mining capabilities of Oracle software to analyze local buying patterns. It discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays; on Thursdays, however, they bought only a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, it could move the beer display closer to the diaper display, and it could make sure that beer and diapers are sold at full price on Thursdays.
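A minimal sketch of how such an association can be quantified, using invented transaction data rather than the chain's actual records: the snippet computes the support and confidence of the hypothetical rule {diapers} → {beer}.

```python
# Minimal association-rule check on a toy transaction list (illustrative data only).
transactions = [
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "milk"},
    {"beer", "chips"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"diapers"}, {"beer"}
rule_support = support(antecedent | consequent, transactions)
confidence = rule_support / support(antecedent, transactions)

print(f"support(diapers -> beer) = {rule_support:.2f}")
print(f"confidence = {confidence:.2f}")  # how often beer appears when diapers do
```

A rule with high enough support and confidence is the kind of pattern a retailer could act on, as in the story above.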

The basics of data mining

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to the delivery of prospective, proactive information. Data mining is ready for application in the business community because it rests on three technologies that are now sufficiently mature:

  • Massive data collection
  • Computers with multiple powerful processors
  • Data mining algorithms

Business databases are growing at unprecedented rates. A recent META Group survey of data warehousing projects found that 19% of respondents were beyond the 50 gigabyte level, while 59% expected to be there soon. In some industries, such as retail, these figures can be much higher. The accompanying need for improved computational engines can now be met cost-effectively with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that far outperform older statistical methods.

In the evolution from business data to business information, each new step has built on the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical for data mining.

The data, information and knowledge

Data

Data are the facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:

  • Operational or transactional data, such as sales, cost, inventory, payroll, and accounting data
  • Non-operational data, such as industry sales, forecast data, and macroeconomic data
  • Metadata, data about the data itself, such as the logical database design or data dictionary definitions

Information

The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information about which products are selling and when.

Knowledge

Information can be converted into knowledge of historical patterns and future trends. For example, summary information on supermarket sales can be analyzed in light of promotional efforts to provide insight into consumer buying behavior. Therefore, a manufacturer or retailer could determine which items are the most susceptible to promotional efforts.

Data warehouses

Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term, although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. Dramatic technological advances are making this vision a reality for many companies, and equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining.

What can data mining do?

Data mining is primarily used by companies with a strong consumer focus (retail, financial, communication, and marketing organizations). It enables these companies to determine the relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It also enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data.

With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions that appeal to specific customer segments.

For example, Blockbuster Entertainment mines its historical video rental database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.

WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5-terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.

The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the guard position, John Williams attempted four jump shots and made each one. Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.

By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams, without having to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.

How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
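As an illustration of this kind of grouping, the sketch below clusters a small set of invented customer records (annual spend and visit frequency) with k-means. The feature names, numbers, and the choice of scikit-learn's KMeans are assumptions made for the example, not part of the original text.

```python
# Toy market-segmentation example: cluster customers by spend and visit frequency.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Invented data: two loose groups of customers (annual spend, visits per month).
frequent = rng.normal(loc=[900.0, 8.0], scale=[100.0, 1.0], size=(20, 2))
occasional = rng.normal(loc=[200.0, 1.5], scale=[50.0, 0.5], size=(20, 2))
customers = np.vstack([frequent, occasional])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
for label, center in enumerate(model.cluster_centers_):
    print(f"segment {label}: avg spend ~{center[0]:.0f}, visits/month ~{center[1]:.1f}")
```

Each cluster center then describes a candidate market segment (for example, "frequent high spenders" versus "occasional shoppers").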

Associations: Data can be extracted to identify associations. The beer-diaper example is an example of associative mining.

Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining is made up of five main elements:

  • Extract, transform, and load transaction data onto the data warehouse system
  • Store and manage the data in a multidimensional database system
  • Provide data access to business analysts and information technology professionals
  • Analyze the data with application software
  • Present the data in a useful format, such as a graph or table (a compressed sketch of these steps follows this list)
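The following is a miniature, hedged version of that loop, assuming pandas and an invented transactions table; a real deployment would load from source systems into a data warehouse and an OLAP engine rather than an in-memory DataFrame.

```python
# Miniature version of the extract -> store -> analyze -> present loop (toy data).
import pandas as pd

# 1. "Extract" some transaction records (hard-coded here instead of loaded from source systems).
transactions = pd.DataFrame({
    "store":   ["North", "North", "South", "South", "North"],
    "weekday": ["Thu", "Sat", "Thu", "Sat", "Sat"],
    "product": ["beer", "diapers", "beer", "beer", "beer"],
    "amount":  [7.5, 12.0, 7.5, 15.0, 7.5],
})

# 2./3. Store and expose the data in a multidimensional form (a store x weekday cube).
cube = transactions.pivot_table(index="store", columns="weekday",
                                values="amount", aggfunc="sum", fill_value=0)

# 4./5. Analyze and present the result as a table.
print(cube)
```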

Different levels of analysis are available:

  • Artificial neural networks: non-linear predictive models that learn through training and resemble biological neural networks in structure.
  • Genetic algorithms: optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
  • Decision trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments it using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
  • Nearest neighbor method: a technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). It is sometimes called the k-nearest neighbor technique (see the sketch after this list).
  • Rule induction: the extraction of useful rules from data based on statistical significance.
  • Data visualization: the visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
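A small hedged sketch of two of these levels of analysis, a CART-style decision tree and a k-nearest-neighbor classifier, on a standard toy dataset; using scikit-learn here is an assumption for illustration (its DecisionTreeClassifier is a CART-style learner, and CHAID is not included).

```python
# Toy comparison of a CART-style decision tree and a k-nearest-neighbor classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision tree: binary splits that generate classification rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
# Nearest neighbor: classify each record by the classes of the k most similar records.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("decision tree accuracy:", round(tree.score(X_test, y_test), 2))
print("k-NN accuracy:         ", round(knn.score(X_test, y_test), 2))
```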

What technological infrastructure is required?

Today, data mining applications are available on systems of all sizes, for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million per terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are two critical technological factors:

  • Database size: the more data being processed and maintained, the more powerful the system required.
  • Query complexity: the more complex the queries and the greater the number of queries being processed, the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications of less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support the largest applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures, such as massively parallel processors (MPP), to achieve order-of-magnitude improvements in query time. For example, NCR's MPP systems link hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the largest supercomputers.

Text mining

Text mining is a new and emerging field that attempts to extract meaningful information from natural language text. It can be broadly characterized as the process of analyzing text to extract information that is useful for particular purposes. Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with algorithmically. Nevertheless, in modern culture, text is the most common vehicle for the formal exchange of information. The field of text mining usually deals with texts whose function is the communication of facts, information, or opinions, and the motivation for trying to extract information from such text automatically is compelling, even if success is only partial.

The phrase "text mining" is generally used to refer to any system that analyzes large amounts of text and natural language and detects lexical or linguistic usage patterns in an attempt to extract likely useful information.

Text mining and data mining

Just as data mining can be loosely described as looking for patterns in data, text mining is about looking for patterns in text. However, the superficial similarity between the two conceals real differences. Data mining can be more fully characterized as the extraction of implicit, previously unknown, and potentially useful information from data. The information is implicit in the input data: it is hidden, unknown, and could hardly be extracted without recourse to automatic data mining techniques. With text mining, however, the information to be extracted is stated clearly and explicitly in the text. It is not hidden at all; most authors go to great pains to make sure that they express themselves clearly and unambiguously, and, from a human point of view, the only sense in which it is "previously unknown" is that human resource restrictions make it infeasible for people to read the text themselves. The problem, of course, is that the information is not couched in a manner that is amenable to automatic processing. Text mining strives to bring the text into a form that is suitable for consumption by computers directly, with no need for a human intermediary.

Although there is a clear philosophical difference, from the computer's point of view the problems are quite similar. Text is just as opaque as raw data when it comes to extracting information from it.

Another requirement that is common to both data and text mining is that the information extracted should be "potentially useful." In one sense, this means actionable: capable of providing a basis for actions to be taken automatically. In the case of data mining, this notion can be expressed in a relatively domain-independent way: actionable patterns are ones that allow non-trivial predictions to be made on new data from the same source. Performance can be measured by counting successes and failures, statistical techniques can be applied to compare different data mining methods on the same problem, and so on. However, in many text mining situations it is far harder to characterize what "actionable" means in a way that is independent of the particular domain. This makes it difficult to find fair and objective measures of success.

In many data mining applications, "potentially useful" is given a different interpretation: the key to success is that the information extracted must be comprehensible, in that it helps to explain the data. This is necessary whenever the result is intended for human consumption rather than automatic action. This criterion is less applicable to text mining because, unlike data mining, the input itself is comprehensible. Text mining with comprehensible output is tantamount to summarizing the salient features of a large body of text, which is a subfield in its own right: text summarization.
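For concreteness, one of the simplest forms text summarization can take is extractive: score each sentence by the frequency of its content words and keep the highest-scoring one. The sketch below is purely illustrative, with invented text and a hand-made stopword list.

```python
# Naive extractive summarization: keep the sentence with the highest content-word score.
import re
from collections import Counter

text = ("Data mining finds patterns in data. Text mining finds patterns in text. "
        "Both aim to turn raw material into useful, actionable information for people.")

stopwords = {"in", "to", "for", "the", "into", "both", "aim"}
sentences = re.split(r"(?<=[.!?])\s+", text)
words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
freq = Counter(words)

def score(sentence):
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(freq[t] for t in tokens if t not in stopwords)

summary = max(sentences, key=score)
print("summary:", summary)
```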

Text mining and natural language processing

Text mining appears to encompass the whole of automatic natural language processing and, arguably, much more besides: for example, the analysis of linkage structures such as bibliographic references in the academic literature and hyperlinks in the Web literature, both useful sources of information that lie outside the traditional domain of natural language processing. But in fact, most text mining efforts consciously shun the deeper, more cognitive aspects of classical natural language processing in favor of shallower techniques more akin to those used in practical information retrieval.

The reason is best understood in the context of the historical development of the subject of natural language processing. The field's roots lie in the machine translation projects of the late 1940s and early 1950s, whose enthusiasts assumed that strategies based on word-for-word translation would provide decent and useful rough translations that could easily be honed into something more accurate using techniques based on elementary syntactic analysis. But the only tangible result of these high-profile, heavily funded projects was the sobering realization that natural language, even at the level of an illiterate child, is an incredibly sophisticated medium that does not succumb to simplistic techniques. It depends fundamentally on what we think of as "common sense" knowledge, which, despite its everyday nature, is exceptionally difficult to encode and use algorithmically.

As a result of these embarrassing and much-publicized failures, researchers retreated into "toy worlds", notably the "blocks world" of geometric objects, shapes, colors, and stacking operations, whose semantics are clear and possible to encode explicitly. But it gradually became apparent that success in toy worlds, though initially impressive, did not translate into success on realistic pieces of text. Toy-world techniques deal well with artificially constructed sentences of what might be called the "Dick and Jane" variety, after the well-known series of children's stories of the same name. But they fail miserably when confronted with real text, whether it is painstakingly constructed and edited or produced under real-time constraints (such as casual conversation).

Meanwhile, researchers in other areas simply had to deal with actual text, with all its quirks, idiosyncrasies, and errors. Compression schemes, for example, must work well with all documents, whatever their contents, and avoid catastrophic failure even when processing outrageously deviant files (such as completely random input or binary files). Information retrieval systems must index documents of all types and allow them to be located effectively, whatever their subject matter or linguistic correctness. The key to text summarization and extraction algorithms is that they have to do a decent job on any text. Practical, working systems in these areas are topic-independent, and most are language-independent too. They operate by treating the input as though it were data, not language.

Text mining is an outgrowth of this "real text" mindset. Accepting that probably not much can be done with unrestricted input, can the ability to process huge amounts of text compensate for relatively simple techniques?

It is interesting that data mining also evolved out of a history of difficult relations between disciplines, in this case between machine learning, rooted in experimental computer science with ad hoc evaluation methodologies, and statistics, well founded theoretically but based on a tradition of testing explicitly stated hypotheses rather than seeking new information. Early machine learning researchers knew or cared little about statistics; early researchers on structured statistical hypotheses remained ignorant of parallel work in machine learning. The result was that similar techniques (for example, decision tree building and nearest-neighbor methods) arose in parallel from the two disciplines, and only later was a balanced perspective achieved.

Sentiment mining

Computers can be good at working with numbers, but can they crunch feelings?

The emergence of blogs and social networks has generated a market built around personal opinion: reviews, ratings, recommendations, and other forms of online expression. For computer scientists, this rapidly growing mountain of data is opening a tantalizing window into the collective consciousness of Internet users.

An emerging field known as sentiment analysis is taking shape around one of the unexplored frontiers of the computing world: translating the vagaries of human emotion into hard data.

The theory of "embodied cognition" suggests that a variety of mental activities are reflected in states of the body, such as postures, arm movements, and facial expressions. A study investigates the degree to which the profiles of computer users - their gender, feelings, and emotional experiences - can be evaluated from the movements of computer cursors.

In one experiment, participants (N = 372) watched three movie clips for two minutes each, rated their feelings afterward, and performed simple perception tasks three times, while a program traced the path of each participant's cursor every 20 milliseconds. The study investigated the degree to which features extracted from the cursor path could reveal the participants' profiles. The results indicated that a small number of trajectory variables were helpful in identifying which movie the participants watched, how they felt while viewing the movie, and their gender. This suggests that cursor movements provide rich information for mining a dynamic user profile.
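A hedged sketch of what "features extracted from the cursor path" might look like: the trace below is invented, and the features (total distance, average speed, direction changes) and the 0.5-radian threshold are illustrative choices, not the study's actual variables.

```python
# Extract simple features from a (time, x, y) cursor trace sampled every 20 ms (invented data).
import math

trace = [(0, 100, 100), (20, 104, 101), (40, 110, 103),
         (60, 112, 110), (80, 111, 118), (100, 109, 125)]

def trajectory_features(trace):
    distance, direction_changes, prev_angle = 0.0, 0, None
    for (t0, x0, y0), (t1, x1, y1) in zip(trace, trace[1:]):
        dx, dy = x1 - x0, y1 - y0
        distance += math.hypot(dx, dy)
        angle = math.atan2(dy, dx)
        # Count a direction change when the heading shifts by more than ~0.5 rad.
        if prev_angle is not None and abs(angle - prev_angle) > 0.5:
            direction_changes += 1
        prev_angle = angle
    duration_s = (trace[-1][0] - trace[0][0]) / 1000.0
    return {"distance_px": distance,
            "avg_speed_px_per_s": distance / duration_s,
            "direction_changes": direction_changes}

print(trajectory_features(trace))
```

Features like these would then be fed to an ordinary classifier to predict attributes such as the viewer's reported feelings.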

This is more than an interesting programming exercise. For many companies, online opinion has become a kind of virtual currency that can make or break a product on the market.

Yet many businesses struggle to make sense of the torrent of complaints and compliments that now swirls around their products online. As sentiment analysis tools begin to take shape, they could not only help companies improve their bottom line, but also eventually transform the experience of searching for information online.

Several new sentiment analysis companies are trying to tap into the growing corporate interest in what is being said online.

"Social media used to be this project for 25-year-old consultants," said Margaret Francis, vice president of product at Explorer Labs in San Francisco. Now, she said, top executives "are recognizing it as an incredibly rich vein of market intelligence."

Scout Labs, which is backed by the venture capital firm started by CNet founder Halsey Minor, has recently introduced a subscription service that allows clients to monitor blogs, news articles, online forums, and social media sites for trends in opinions about products, services or topics in the news.

In early May, the ticket reseller StubHub used Scout Labs' monitoring tool to identify a sudden surge of negative blog sentiment after rain delayed a Yankees-Red Sox game.

Stadium officials mistakenly told hundreds of fans that the game had been canceled, and StubHub denied fans' requests for refunds, arguing that the game had actually been played. But after spotting the trouble brewing online, the company offered discounts and credits to the affected fans. It is now re-evaluating its bad weather policy.

"This is a canary in a coal mine for us," said John Whelan, StubHub's director of customer service.

Yonkers-based Jodange offers a service for online publishers that allows them to incorporate opinion data from more than 450,000 sources, including mainstream news sources, blogs, and Twitter.

Based on research by Claire Cardie, a Cornell computer science professor, and Jan Wiebe of the University of Pittsburgh, the service uses a sophisticated algorithm that not only evaluates sentiments on particular topics, but also identifies the most influential opinion holders.

Jodange, whose early investors include the National Science Foundation, is currently working on a new algorithm that could use sentiment data to predict future developments, such as forecasting the impact of newspaper editorials on a company's stock price.

In a similar vein, the Financial Times recently introduced Newssift, an experimental program that tracks sentiment on business topics in the news, along with a specialized search engine that allows users to organize their queries by topic, organization, place, person and subject.

Using Newssift, a recent search for Wal-Mart reveals that sentiment about the company is running positive by a ratio of slightly better than two to one. When that search is refined with the suggested term "labor force and unions," however, the ratio of positive to negative sentiments drops closer to one to one.

These tools could help companies pinpoint the effect of specific issues on customer perceptions, helping them respond with appropriate marketing and public relations strategies.

For casual netizens, simpler incarnations of sentiment analysis are emerging in the form of lightweight tools like Tweetfeel, Twendz, and Twitrratr. These sites allow users to take the pulse of Twitter users on particular topics.

A quick search on Tweetfeel, for example, reveals that 77 percent of Twitter users like the movie "Julie & Julia." The same search on Twitrratr, however, reveals a couple of misfires. The site assigned a negative rating to a tweet that read "Julie and Julia were really lovely" because the same message ended with "we all feel very hungry after this", and the system took the word "hungry" to indicate a negative sentiment.

While the more advanced algorithms used by Scout Labs, Jodange, and Newssift employ sophisticated analytics to avoid such pitfalls, none of these services works perfectly. "Our algorithm is about 70 to 80 percent accurate," said Francis, who added that users can reclassify inaccurate results so the system learns from its mistakes.

Translating the slippery stuff of human language into binary values will always be an imperfect science, however. "Sentiments are very different from conventional facts," said Seth Grimes, the founder of the suburban Maryland consulting firm Alta Plana, pointing to the many cultural factors and linguistic nuances that make it difficult to turn a string of written text into a simple pro-or-con sentiment. ""Sinful" is a good word when applied to chocolate cake," he said. The simplest algorithms work by scanning keywords to categorize a statement as positive or negative, based on a simple binary analysis ("love" is good, "hate" is bad). But this approach fails to capture the subtleties that bring human language to life: irony, sarcasm, slang, and other idiomatic expressions. Reliable sentiment analysis requires parsing many linguistic shades of gray.
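A minimal sketch of that simplest keyword-scanning approach, with a toy lexicon invented for the example; it also shows why a word like "hungry" in a negative list can wash out an otherwise positive message, much like the Twitrratr misfire described above.

```python
# Keyword-lexicon sentiment scoring: the simple binary approach described above (toy lexicon).
positive = {"love", "lovely", "great", "good"}
negative = {"hate", "terrible", "awful", "hungry"}  # "hungry" included to show the misfire

def keyword_sentiment(text):
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweet = "Julie and Julia were really lovely, we all feel very hungry after this"
print(keyword_sentiment(tweet))  # one positive hit, one negative hit -> "neutral"; the nuance is lost
```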

"It's about confidence that can be expressed in subtle ways," said Bo Pang, a Yahoo researcher who co-wrote "Opinion Mining and Sentiment Analysis," one of the first scholarly books on sentiment analysis.

To get at the true intent of a statement, Pang developed software that analyzes it through several different filters, including polarity (is the statement positive or negative?), intensity (what is the degree of emotion being expressed?), and subjectivity (how partial or impartial is the source?).

For example, a preponderance of adjectives often signals a high degree of subjectivity, while statements built on verbs and nouns tend toward a more neutral point of view.
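A rough illustration of that adjective heuristic, assuming a tiny hand-made adjective list; a real system would use a part-of-speech tagger rather than a fixed word list.

```python
# Rough subjectivity signal: share of words that are (known) adjectives.
ADJECTIVES = {"great", "terrible", "lovely", "boring", "amazing", "slow", "reliable"}

def adjective_ratio(text):
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in ADJECTIVES for w in words) / max(len(words), 1)

opinion = "The service was terrible and the staff were slow but the food was amazing"
report = "The company reported quarterly revenue of 3 billion dollars on Tuesday"

print(f"opinionated text: {adjective_ratio(opinion):.2f}")  # higher ratio -> more subjective
print(f"neutral report:   {adjective_ratio(report):.2f}")
```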

As sentiment analysis algorithms grow more sophisticated, they should begin to yield more accurate results, which may point the way to more sophisticated filtering mechanisms. They could become part of everyday Web use.

"I see sentiment analysis becoming a standard feature of search engines," said Grimes, suggesting that these types of algorithms could start to influence both general web search purposes and more specialized searches in areas like e-commerce, travel reservations and movie reviews.

Pang envisions a search engine that fine-tunes results for users based on sentiment. For example, it could influence the ordering of search results for certain kinds of queries, such as "best hotel in San Antonio."

As search engines begin to incorporate more and more opinion data into their results, the distinction between fact and opinion may begin to blur to the point where, as David Byrne once put it, "all facts come with points of view."

Mixed feelings about the business of mining and manipulating emotions

In the charming new animated film "Inside Out," we are taken inside the head of Riley, an 11-year-old girl, to meet characters representing five of the six emotions that psychologists have characterized as universal: joy, sadness, fear, anger, and disgust. (The sixth emotion, surprise, was omitted, perhaps because movie producers, like most business people, hate surprises.) Without revealing any spoilers, suffice it to say that in Riley's head, as in the heads of most real girls her age, Joy loses some ground to Sadness, Anger, Fear, and the other, less cute members of the emotional circle.

In this film, and in films such as "Avatar" and "Toy Story," the animators were informed and inspired by psychologist Paul Ekman's pioneering work on mapping small changes in facial expression. All of the information about the expressions depicted in the film was derived from mining people's behaviors and feelings. But filmmakers are not the only professionals turning to Ekman for inspiration and guidance. The CIA, the TSA, and other security-conscious organizations employ facial-coding analysis to root out liars and people with malicious intent. And advertisers, eager to get inside consumers' heads and shape our decisions before we are even aware of making them, see gold in the commercial use of functional magnetic resonance imaging machines and in camera-based detection of our small smiles, grimaces, and eye movements. They are trying to test how advertisements make us feel, microsecond by microsecond, to ensure that emotional barriers to their message are minimized and that the joy or other emotional response it generates is maximized.

Many of the decisions that companies make today are based on large databases that they have been filling by observing individuals; sentiment mining is what makes it easy for these companies to decide which kinds of products to offer.

The Internet is an increasingly important part of our lives. Internet users share information and opinions on social networks, where they readily express their feelings, judgments, and personal emotions. Text mining and information retrieval techniques allow us to explore all this information and discover what kinds of opinions, claims, or assertions the authors are making.

In summary, mining in the area of data collection is used to determine what kind of information users are looking for, to make large volumes of information and text easier to use, to classify features, and to learn the preferences of a company's customers, all with the aim of serving the interested party. In general, companies collect this kind of information to know which products or services to present to the customer, how customers will react, and what they will be interested in.

On the other hand, the classification of information has become a great help to people who handle large amounts of data, thanks to ever-faster systems for processing that data.

References:

  • Witten, Ian H.; Frank, Eibe (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
  • Batyrshin, Ildar; Sheremetov, Leonid; Zadeh, Lotfi A. (2007). Perception-based Data Mining and Decision Making in Economics and Finance.
  • Pesaran, B. (2010). Neural correlates of decisions and actions. Current Opinion in Neurobiology.