Data Warehousing and Data Mining
Introduction
Data Mining:
The process of discovering meaningful patterns and trends, often previously unknown, by sifting through large amounts of data using pattern recognition, statistical, and mathematical techniques.
A group of techniques that find relationships that have not previously been discovered
What Is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their “inside stories”:
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
(Deductive) query processing.
Expert systems
Data Mining: Confluence of Multiple Disciplines
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Data Mining Applications
Data mining is a young discipline with wide and diverse applications
There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications
Some application domains
Biomedical and DNA data analysis
Financial data analysis
Retail industry
Telecommunication industry
Biomedical Data Mining and DNA Analysis
DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).
Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
Humans have roughly 20,000 protein-coding genes (early estimates were around 100,000)
Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
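To give a feel for how quickly the number of possible orderings grows, here is a minimal sketch. It assumes every position in a sequence can independently be any of the four nucleotides, which ignores biological constraints; the function name `count_sequences` is made up for this illustration.

```python
# The four nucleotides named above: adenine, cytosine, guanine, thymine.
NUCLEOTIDES = "ACGT"


def count_sequences(length: int) -> int:
    """Return how many distinct nucleotide sequences of this length exist.

    Assumption: each position is chosen independently from A, C, G, T,
    so the count is 4 ** length.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    return len(NUCLEOTIDES) ** length


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(f"length {n}: {count_sequences(n)} possible sequences")
```

Even a sequence of only 10 nucleotides already has more than a million possible orderings, and the count multiplies by 4 with every added position.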
Semantic integration of heterogeneous, distributed genome databases
Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
Data cleaning and data integration methods developed in data mining will help
DNA Analysis: Examples
Similarity search and comparison among DNA sequences
Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
Identify gene sequence patterns that play roles in various diseases
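One simple way to compare frequently occurring patterns between two classes is to count short substrings (k-mers) and see which ones appear in a much larger share of one class than the other. The sketch below is an illustration under stated assumptions, not a production method: the sequences are toy data invented for the example, and the helper names (`kmer_counts`, `class_kmer_support`, `discriminating_kmers`) are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def class_kmer_support(sequences, k):
    """Fraction of sequences in a class that contain each k-mer at least once."""
    if not sequences:
        raise ValueError("each class needs at least one sequence")
    support = Counter()
    for seq in sequences:
        # set(...) so that a k-mer counts once per sequence
        support.update(set(kmer_counts(seq, k)))
    return {kmer: count / len(sequences) for kmer, count in support.items()}


def discriminating_kmers(diseased, healthy, k, min_gap):
    """Return (k-mer, support gap) pairs whose gap between classes is large.

    The gap is support in the diseased class minus support in the healthy
    class; results are sorted by absolute gap, then alphabetically.
    """
    d = class_kmer_support(diseased, k)
    h = class_kmer_support(healthy, k)
    gaps = {kmer: d.get(kmer, 0.0) - h.get(kmer, 0.0) for kmer in set(d) | set(h)}
    selected = [(kmer, gap) for kmer, gap in gaps.items() if abs(gap) >= min_gap]
    return sorted(selected, key=lambda item: (-abs(item[1]), item[0]))


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    diseased = ["ACGTGA", "TTGTGA", "GTGACC"]
    healthy = ["ACGTAC", "TTACGG", "CCACGT"]
    for kmer, gap in discriminating_kmers(diseased, healthy, k=3, min_gap=0.9):
        print(kmer, gap)
```

In this toy data the 3-mers GTG and TGA occur in every diseased sequence and in no healthy one, so they are the only patterns that pass the 0.9 gap threshold. Real similarity search would typically use alignment-based tools and statistical testing rather than exact substring counts.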
Association analysis: identification of co-occurring gene sequences
Most diseases are not triggered by a single gene but by a combination of genes acting together
Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
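The co-occurrence idea can be sketched with the two standard association measures, support and confidence. This is a brute-force pair count over a handful of samples, not an optimized algorithm such as Apriori; the gene labels (`G1` to `G4`) and sample contents are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_support(samples, min_support):
    """Return gene pairs whose support (share of samples containing both) meets the threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    pair_counts = Counter()
    for genes in samples:
        # sorted(...) gives each pair a canonical order, e.g. ("G1", "G2")
        for pair in combinations(sorted(set(genes)), 2):
            pair_counts[pair] += 1
    n = len(samples)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}


def confidence(samples, antecedent, consequent):
    """Share of samples containing `antecedent` that also contain `consequent`."""
    with_antecedent = [s for s in samples if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)


if __name__ == "__main__":
    # Each set lists genes observed as active in one target sample (toy data).
    samples = [
        {"G1", "G2", "G3"},
        {"G1", "G2"},
        {"G2", "G3"},
        {"G1", "G2", "G4"},
    ]
    print(pair_support(samples, min_support=0.5))
    print(confidence(samples, "G1", "G2"))
    print(confidence(samples, "G2", "G1"))
```

Here G1 and G2 co-occur in three of four samples (support 0.75), and every sample with G1 also has G2 (confidence 1.0), while the reverse rule holds in only three of the four samples with G2 (confidence 0.75). The asymmetry is why both measures are usually reported together.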
Path analysis: linking genes to different disease development stages
Different genes may become active at different stages of the disease
Develop pharmaceutical interventions that target the different stages separately