
Data Mining in Engineering

CHE1147H

FALL 2013

Friday 5-8 p.m.    Room: BA1240

Bahen Centre for Information Technology
40 St. George Street
Toronto, Ontario, Canada

  

Professor S. Sayad

saed.sayad@utoronto.ca

 

OBJECTIVES

Data mining is about explaining the past and predicting the future by analyzing data. The objective of this course is to enable you to obtain accurate, precise, quantitative and qualitative knowledge from data.
 

INTRODUCTION

An exceptional ability to deal with data is the defining characteristic of an engineer. Three major problems currently associated with engineering data are: the presence of experimental error in the data; the presence of both qualitative and quantitative data; and very large quantities of data. Two closely related fields are directed at overcoming these problems: data mining and statistics. Data Mining is a branch of Informatics that refers to a wide variety of methods used to discover patterns in data (i.e. to obtain information from data). Statistics is the branch of mathematics concerned with how to collect and analyze data in the presence of variability. This course will focus on Data Mining and the statistics required to accomplish it. Although the necessary statistics will be examined in class, an undergraduate statistics course is a prerequisite for this course.

Data Mining methods have been used for an extremely wide variety of applications. The list includes: computer vision techniques, control systems, traffic control, geographical information systems, electric power systems stability, medical diagnosis, electronic commerce, customer support, credit and other financial analysis, information retrieval, software development, business processes, assembly sequences in engineering, skill learning, transportation and production planning. In chemical engineering and chemistry, topics include image processing, color matching, elucidation of structure-property relationships as well as structure-biological function relationships, environmental assessment of processes, assessment of process efficiency and economic viability, detection of the hydrodynamic regime in two-phase flow, detection of coding and non-coding regions in DNA sequences, and analysis of screening data to identify classes of compounds that are promising cancer drugs. This course will acquaint you with data mining methods that have been found particularly useful in a variety of fields and will show you how to apply these methods to practical problems. As will be seen below, the essential challenge in the course is for you to learn the methods well enough to be able to apply them to diverse engineering data.

 

COURSE CONTENT

Although there will be some supplementary references (published papers, web sites, etc.), the course textbook will provide the backbone of the course. This textbook is online:

An Introduction to Data Mining

Topics of interest include methods of visualization, classification, clustering, regression and association. Visualization allows data miners to find patterns or structures in datasets using graphical methods. These methods include Scatterplot Matrices, Parallel Coordinates, Pixel-Oriented Methods, Icon-Based Methods, Dimensional Stacking, Treemaps and simple charts (line, bar and pie). Classification attempts to predict whether or not a case is a member of a particular qualitative class by building a model based on some independent variables. These methods include: ZeroR, OneR, Decision Tree, Naïve Bayesian, Instance-Based Learning, Artificial Neural Networks and Support Vector Machines. Regression methods emphasize fitting equations to data to predict the values of continuous variables. In addition to linear and non-linear regression, we are also interested in methods involving both qualitative and quantitative variables (e.g. Classification and Regression Trees). Clustering methods find groups of items that are similar. (Since the classes are unspecified before the analysis, clustering is sometimes referred to as unsupervised learning.) Clustering methods include K-Means Clustering and Self-Organizing Maps (SOM). Association methods create rules that describe how often events occur together. The Apriori algorithm is of particular interest.
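
As a rough illustration of what "how often events occur together" means for association methods, the following Python sketch computes the support of an itemset and the confidence of a rule. It is a minimal sketch, not the Apriori algorithm itself and not taken from the course textbook: the transactions, the item names and the function names (support, confidence, frequent_pairs) are all invented here for illustration.

```python
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
# Each transaction is the set of events (items) observed together.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B"},
    {"B", "C"},
    {"A"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """Brute-force search for item pairs whose support meets min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for pair in combinations(items, 2):
        s = support(pair, transactions)
        if s >= min_support:
            result[pair] = s
    return result


if __name__ == "__main__":
    print(support({"A", "B"}, transactions))
    print(confidence({"A"}, {"B"}, transactions))
    print(frequent_pairs(transactions, min_support=0.5))
```

In this made-up data, A and B appear together in three of the five transactions, and B appears in three of the four transactions that contain A. The brute-force pair search is only practical for a handful of items; Apriori avoids examining every combination by building larger candidate itemsets only from smaller ones that are already frequent.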

 

ASSIGNMENTS

As alluded to above, in lectures, examples will be drawn from a wide variety of applications. In some cases these methods have not previously been applied to engineering data. Thus, there will be many small projects in this course asking each of you to apply a recently learned method to such data. The preferred source of the data will be your own research. However, in some cases, data from the Internet or scientific publications may be used. Even computer-simulated data may occasionally be suitable. Assignments will be discussed in class, and the main basis for the mark in each case will be how well the example illustrates the Data Mining method. The course will not involve computer programming. However, you will be expected to be able to download and intelligently use specified software.
 

MARKS

The first presentation will make up 30% of the course mark, with the remaining 70% allocated to the final presentation. There will be no mid-term or final examinations.

 

Revised: August 29, 2013