BCA SEM-6 Data Mining and Data Warehouse important questions with answers and old paper



UNIT -1 

1) What is Data Mining? What kind of data can be mined?
ANS)
Data mining is the process of uncovering patterns and other valuable information from large data sets. Data mining has improved organizational decision-making through insightful data analysis. It is used to organize and filter data.
It is also known as Knowledge Discovery in Databases (KDD).
  • The key properties of data mining are: 
    • Automatic discovery of patterns 
    • Prediction of likely outcomes 
    • Creation of actionable information 
    • Focus on large data sets and databases 
  • Type of data that can be mined:
    • Flat Files
    • Relational Databases
    • Data Warehouse
    • Transactional Databases
    • Multimedia Databases
    • Spatial Databases
    • Time Series Databases
    • World Wide Web(WWW)
  • Flat Files
    • Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
    • Data stored in flat files has no relationships; if a relational database is stored in flat files, there are no relations between the tables.
    • Flat files are described by a data dictionary. E.g.: CSV files.
    • Application: Used in Data Warehousing to store data, etc.
  • Relational Databases
    • A Relational database is defined as the collection of data organized in tables with rows and columns.
    • Physical schema defines the structure of tables and
    • Logical schema defines the relationship among tables.
    • Standard API of relational database is SQL.
    • Application: Data Mining, ROLAP model, etc.
  • Data Warehouse
    • A data warehouse is defined as a collection of data integrated from multiple sources into a single, combined repository.
    • There are three types of datawarehouse: 
      • Enterprise data warehouse, 
      • Data Mart and 
      • Virtual Warehouse.
    • Two approaches can be used to update data in a data warehouse: 
      • Query-driven Approach and 
      • Update-driven Approach.
    • Application: Business decision making, Data mining, etc.
  • Transactional Databases
    • A transactional database is a collection of data organized by time stamps, dates, etc., to represent transactions in databases.
    • This type of database has the capability to roll back or undo its operation when a transaction is not completed or committed.
    • Highly flexible system where users can modify information without changing any sensitive information.
    • Follows ACID property of DBMS.
    • Application: Banking, Distributed systems, Object databases, etc.
  • Multimedia Databases
    • Multimedia databases consist of audio, video, image and text media.
    • They can be stored in object-oriented databases.
    • They are used to store complex information in pre-specified formats.
    • Application: Digital libraries, video-on demand, news-on demand, musical database, etc.
  • Spatial Database
    • Store geographical information.
    • Stores data in the form of coordinates, topology, lines, polygons, etc.
    • Application: Maps, Global positioning, etc.

  • Time-series Databases:
    • Time-series databases contain stock exchange data and user-logged activities.
    • They handle arrays of numbers indexed by time, date, etc.
    • They require real-time analysis.
    • Application: eXtremeDB, Graphite, InfluxDB, etc.
  • WWW
    • WWW (World Wide Web) is a collection of documents and resources like audio, video, text, etc., which are identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessed through web browsers via the Internet.
    • It is the most heterogeneous repository as it collects data from multiple resources.
    • It is dynamic in nature as the volume of data is continuously increasing and changing.
    • Application: Online shopping, Job search, Research, studying, etc.

2) Write a note on technologies used in data mining.
ANS

Several techniques are used in the development of data mining methods. Some of them are mentioned below: 

1. Statistics: 
  • Statistics uses mathematical analysis to express representations, models and summaries of empirical data or real-world observations.
  • Statistical analysis involves a collection of methods, applicable to large amounts of data, to draw conclusions and report trends.
2. Machine learning:
  • Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. When new data is entered into the computer, machine learning algorithms help the model to grow or change. 
  • In machine learning, an algorithm is constructed to predict the data from the available database.
  • The four types of machine learning are: 
    1. Supervised learning
    2. Unsupervised learning 
    3. Semi-supervised learning 
    4. Active learning
3. Information retrieval:
  • Information retrieval deals with uncertain representations of the semantics of objects (text, images).
  • For example: Finding relevant information from a large document. 
4. Database systems and data warehouse: 
  • Databases are used for the purpose of recording the data as well as data warehousing. 
  • Online Transactional Processing (OLTP) uses databases for day to day transaction purpose. 
  • To remove the redundant data and save the storage space, data is normalized and stored in the form of tables. 
  • Entity-Relational modeling techniques are used for relational database management system design.
  • Data warehouses are used to store historical data, which helps in taking strategic decisions for the business.
  • It is used for Online Analytical Processing (OLAP), which helps to analyze the data.
5. Decision support system:
  • It is a category of information system. 
  • It is very useful in the decision making of an organization.
  • It is an interactive software-based system which helps decision makers to extract useful information from the data to make decisions.

3) What do you mean by Data Mining? Write a note on steps in knowledge discovery.
ANS

Data mining is the process of uncovering patterns and other valuable information from large data sets. Data mining has improved organizational decision-making through insightful data analysis. It is used to organize and filter data. 
It is also known as Knowledge Discovery in Databases (KDD).

Steps Involved in KDD Process:


  • Data Cleaning: 
    • Data cleaning is defined as the removal of noisy and irrelevant data from the collection, where noise is a random error or variance in a measured variable.
    • It also handles missing values.
  • Data Integration: 
    • It is defined as combining similar types of data from multiple sources into a common source (Data Warehouse).
    • It uses Data Migration and Data Synchronization tools.
  • Data Selection: 
    • It is defined as the process where data relevant to the analysis is decided on and retrieved from the data collection.
    • Data selection can use neural networks, decision trees, clustering, regression, etc.
  • Data Transformation: 
    • It is defined as the process of transforming data into appropriate form required by a mining procedure.
    • Data Transformation is a two-step process:
    • Data Mapping: 
      • Assigning elements from source base to destination to capture transformations.
    • Code generation: 
      • Creation of the actual transformation program.
  • Data Mining: 
    • Data mining is defined as clever techniques that are applied to extract patterns which are potentially useful.
    • Transforms task relevant data into patterns.
    • Decides purpose of model using classification or characterization.
  • Pattern Evaluation: 
    • Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given measures.
    • It finds an interestingness score for each pattern.
    • Uses summarization and Visualization to make data understandable by user.
  • Knowledge representation: 
    • Knowledge representation is defined as a technique which uses visualization tools to represent data mining results.
    • Generate reports, tables, discriminant rules, classification rules, characterization rules, etc.
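As a toy illustration, the KDD steps above can be sketched in plain Python. The data, field names, and the "above-average sales" pattern are all invented for the example; a real pipeline would use dedicated tools at each step:

```python
# A toy walk-through of the KDD steps on a small, made-up sales dataset.
raw = [
    {"product": "A", "sales": 120, "region": "east"},
    {"product": "B", "sales": None, "region": "east"},  # missing value
    {"product": "C", "sales": 80,  "region": "west"},
    {"product": "A", "sales": 120, "region": "east"},   # duplicate record
]

# 1. Data cleaning: drop records with missing values and duplicates.
seen, cleaned = set(), []
for rec in raw:
    key = tuple(rec.items())
    if rec["sales"] is not None and key not in seen:
        seen.add(key)
        cleaned.append(rec)

# 2./3. Data integration and selection: keep only task-relevant fields.
selected = [{"product": r["product"], "sales": r["sales"]} for r in cleaned]

# 4. Data transformation: normalize sales into the 0..1 range.
max_sales = max(r["sales"] for r in selected)
transformed = [{**r, "sales": r["sales"] / max_sales} for r in selected]

# 5. Data mining: a trivial "pattern" - products with above-average sales.
avg = sum(r["sales"] for r in transformed) / len(transformed)
patterns = [r["product"] for r in transformed if r["sales"] > avg]

# 6./7. Pattern evaluation and knowledge presentation.
print(patterns)  # → ['A']
```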
4) Short note : Data Discrimination and Data Characterization
ANS
1. Data characterization
  • Data characterization is a summarization of the general characteristics or features of a target class of data. 
  • The data corresponding to the user-specified class are typically collected by a query. 
  • For example, to study the characteristics of software products with sales that increased by 10% in the previous year, the data related to such products can be collected by executing an SQL query on the sales database.
  • There are several methods for effective data characterization:
  • Simple data summaries based on statistical measures
  • An attribute-oriented induction technique can be used to perform data characterization without step-by-step user interaction.
  • The output of data characterization can be presented in various forms 
  • For Example, pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs. 
  • The resulting descriptions can also be presented as generalized relations or in rule form (called characteristic rules).
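A minimal sketch of such simple summaries based on statistical measures, assuming some made-up sales-growth figures for the target class:

```python
import statistics

# Made-up sales-growth figures (in %) for a target class of software products.
target_class = [12, 15, 10, 18, 11, 14]

# Simple data summaries based on statistical measures.
summary = {
    "count": len(target_class),
    "mean": statistics.mean(target_class),
    "min": min(target_class),
    "max": max(target_class),
    "stdev": round(statistics.stdev(target_class), 2),
}
print(summary)
```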
2. Data Discrimination:
  • It is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes. 
  • The target and contrasting classes can be specified by a user, and the corresponding data objects can be retrieved through database queries. 
  • For example, a user may want to compare the general features of software products with sales that increased by 10% last year against those with sales that decreased by at least 30% during the same period.
  • The methods used for data discrimination are similar to those used for data characterization.
  • Discrimination descriptions expressed in the form of rules are referred to as discriminant rules.
5) Write a note on sampling and its types.
ANS
Sampling is a process used in statistical analysis in which a predetermined number of observations are taken from a larger population. In simple words, the process of selecting a sample is called sampling. 
It helps a lot in research.
It is one of the most important factors determining the accuracy of your research/survey result.
If anything goes wrong with the sample, it will be directly reflected in the final result.
There are many sampling techniques, which are grouped into two categories:
  • Probability Sampling 
    • Simple Random Sampling
    • Stratified Sampling 
    • Systematic Sampling 
    • Cluster Sampling
    • Multi stage Sampling 
  • Non-Probability Sampling 
    • Convenience Sampling 
    • Purposive Sampling 
    • Quota Sampling
    • Referral / Snowball Sampling 
1. Simple Random Sampling:
  • Every element of the population has an equal chance of being selected as part of the sample.
  • It is used when we don't have prior information about the target population.
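A minimal Python sketch of simple random sampling (the population of 100 numbers and the fixed seed are purely illustrative):

```python
import random

population = list(range(1, 101))  # a toy population of 100 elements

random.seed(42)  # fixed seed so the example is reproducible
# Every element has an equal chance of being selected.
sample = random.sample(population, 10)
print(sample)
```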
2. Stratified Sampling:
This technique divides the elements of the population into small subgroups known as strata.
The division is done in such a way that elements within a group are similar to each other and dissimilar to elements of the other groups.
Then, elements are randomly selected from each of these strata.
This technique needs prior information about the population to create the subgroups.
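A sketch of stratified sampling on a made-up population with two strata (the "urban"/"rural" split and the 10% sampling fraction are arbitrary choices for the example):

```python
import random

# Toy population divided into strata (subgroups) by a known attribute.
strata = {
    "urban": list(range(0, 60)),    # 60 elements
    "rural": list(range(60, 100)),  # 40 elements
}

random.seed(0)
# Select elements randomly from each stratum, proportional to its size.
sample = []
for name, group in strata.items():
    k = round(len(group) * 0.1)  # 10% sampling fraction
    sample.extend(random.sample(group, k))
print(sample)
```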

3. Systematic Sampling:
In this technique, the selection of elements is systematic rather than random, except for the first element.
Elements of the sample are chosen at regular intervals from the population.
All elements are first put together in a sequence, where each element has an equal chance of being selected.
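A sketch of systematic sampling, where only the starting element is chosen at random and the rest follow at a fixed interval (population and sample size are made up):

```python
import random

population = list(range(100))  # elements already arranged in a sequence
n = 10                         # desired sample size
k = len(population) // n       # sampling interval

random.seed(1)
start = random.randrange(k)    # only the first element is chosen at random
# Every k-th element after the random start is selected.
sample = population[start::k]
print(sample)
```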

4. Cluster Sampling:
Our entire population is divided into clusters or sections and then the clusters are randomly selected. All the elements of the cluster are used for sampling. 
Clusters are identified using details such as age, sex, location etc. 
Cluster sampling can be done in either of two ways:
    1. Single-stage Cluster Sampling 
    2. Two-stage Cluster Sampling 
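A sketch of single-stage cluster sampling on made-up location clusters: whole clusters are picked at random, and every element of a chosen cluster enters the sample:

```python
import random

# Toy population grouped into clusters (e.g. by location).
clusters = {
    "north": [1, 2, 3, 4],
    "south": [5, 6, 7, 8],
    "east":  [9, 10, 11, 12],
    "west":  [13, 14, 15, 16],
}

random.seed(3)
# Single-stage cluster sampling: pick clusters at random,
# then use ALL elements of the chosen clusters.
chosen = random.sample(list(clusters), 2)
sample = [x for name in chosen for x in clusters[name]]
print(chosen, sample)
```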

5. Multi-stage Sampling:
It is a combination of two or more of the methods described above.

2. Non-Probability Sampling:
It does not rely on randomization. 
This technique is more reliant on the researcher's ability to select elements for a sample. The outcome of the sampling might be biased, and it is difficult for all elements of the population to have an equal chance of being part of the sample.
This type of sampling is also known as non-random sampling.

1. Convenience Sampling
Here the samples are selected based on availability. 
This method is used when the availability of samples is rare or costly.
So samples are selected based on convenience.

2. Purposive Sampling 
This is based on the intention or the purpose of study. 
Only those elements will be selected from the population which suits the best for the purpose of our study.

3. Quota Sampling:
This type of sampling depends on some pre-set standard. 
It selects a representative sample from the population.
The ratio of characteristics in the sample should be the same as in the population.
For example, if the ratio is 55% women and 45% men, the sample should have the same ratio.

4. Referral / Snowball Sampling:
This technique is used in situations where the population is completely unknown and rare. Therefore we take help from the first element that we select from the population and ask it to recommend other elements that fit the description of the sample needed.
This referral technique goes on, increasing the size of the sample like a snowball.

6) Write a note on Histogram and its types in detail.
ANS


7) Write a note on Association Rule mining.
ANS)
    Association Rule mining is a technique that helps to discover links between two or more items. It finds hidden patterns in the data set.
    Association rules are if-then statements that show the probability of relationships between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to discover sales correlations in data.

    Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. 

    The way the algorithm works is that you start from transaction data, for example a list of grocery items you have been buying for the last six months. It calculates the percentage of items being purchased together.

Rule evaluation metrics:
  • Support:
    Measures how often items A and B are purchased together, relative to the entire dataset.
                      Support = (Item A + Item B) / (Entire dataset)
  • Confidence:
    Measures how often item B is purchased when item A is purchased.
                      Confidence = (Item A + Item B) / (Item A)
  • Lift:
    Measures the strength of the rule relative to how often item B is purchased on its own.
                      Lift = (Confidence) / ((Item B) / (Entire dataset))
    Association rules are very useful in analyzing datasets. The data is collected using bar-code scanners in supermarkets. Such databases consist of a large number of transaction records, each listing all items bought by a customer in a single purchase. The manager can then see whether certain groups of items are consistently purchased together and use this data for adjusting store layouts, cross-selling, and promotions based on statistics.
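The three metrics above can be computed directly from a toy transaction list (the grocery items and the rule bread → milk are invented for illustration):

```python
# Toy market-basket transactions (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def support(*items):
    """Fraction of transactions containing all the given items."""
    return sum(set(items) <= t for t in transactions) / len(transactions)

# Rule: bread -> milk
sup = support("bread", "milk")  # (bread AND milk) / entire dataset
conf = sup / support("bread")   # how often milk appears when bread does
lift = conf / support("milk")   # confidence adjusted for milk's own popularity

print(sup, conf, lift)  # → 0.6 0.75 0.9375
```

A lift below 1, as here, suggests the two items are bought together slightly less often than their individual popularities would predict.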

8) Write a note on Apriori Algorithm.
ANS






