CSE-5095-004 and Engineering-SEC001-1133




CAST 204, Mon/Wed 2:00pm – 3:15pm




         ITEB 138, TBD



            Jinbo Bi

            Phone: 486-1458


            Office hours: Tue. 2:30pm – 3:15pm

            Office: ITEB 233


            Abdulaziz Miyajan



            Office hours: by appointment

            Office: ITEB



The purpose of this course is to introduce students to the general topics and techniques of data mining and machine learning, with a specific application focus on biomedical informatics. The course presents multiple real-world medical problems with real patient data and shows how multiple analytic algorithms have been used in an integrated fashion to cope with these problems. It covers cutting-edge data mining technology that can successfully tackle problems that are complex, high-dimensional, and/or ambiguous in labeling. General topics of data mining, such as clustering, classification, regression, and dimension reduction, will be described. Attention will also be given to more advanced and recent topics. In particular, imprecisely supervised learning problems will be discussed, including multiple instance learning, metric learning, and learning with multi-labeler annotations. Throughout the course, practical medical and healthcare problems will be used as examples to demonstrate the adoption and effectiveness of data mining methods.

The course will consist of lectures, labs, paper reviews, and projects. Lectures will serve as the vehicle for introducing concepts and knowledge to students. Labs will be used to reinforce the material given in lectures, and student paper reviews will be used to study the state of the art from researchers in the field. Participation is encouraged during class.

As part of the course, students will work on a term project with the goal of applying any of the studied techniques to a problem selected from a list of projects. Students are also encouraged to propose and design their own problems, which must be approved by the instructor for class suitability. Teams of two to three students will be formed for each project. Each team is required to present in the classroom and submit a project report of 15-20 pages that includes the definition of the problem, the techniques used to solve it, and the experimental results obtained. This exercise will help the team gain a hands-on understanding of the material studied in this course and promote collaboration among team members.



  1. Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, ISBN-10: 0321321367
  2. Pattern Classification (2nd Edition) by Richard O. Duda, Peter E. Hart and David G. Stork, ISBN-10: 0471056693
  3. Pattern Recognition and Machine Learning (Information Science and Statistics) by Christopher M. Bishop, ISBN-10: 0387310738


  1. In-Class Lab Assignments (3): 30%
  2. Paper review (1): 10%
  3. Term Project (1): 50% (including 2 progress reports and 1 final report with software package)
  4. Participation: 10%
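
As a rough illustration of how the weights above combine, the following Python sketch computes a weighted course score. It assumes that every component is scored on a 0-100 scale and that the three lab scores are averaged before the 30% weight is applied; the syllabus states neither, and the names `WEIGHTS` and `course_score` are hypothetical.

```python
# Hypothetical sketch: combine component scores using the syllabus weights.
# Assumptions (not from the syllabus): every component is scored out of 100,
# and the three lab scores are averaged before the 30% weight is applied.

WEIGHTS = {
    "labs": 0.30,           # In-Class Lab Assignments (3)
    "paper_review": 0.10,   # Paper review (1)
    "term_project": 0.50,   # Term Project (1)
    "participation": 0.10,  # Participation
}


def course_score(lab_scores, paper_review, term_project, participation):
    """Return the weighted course score on a 0-100 scale."""
    if len(lab_scores) != 3:
        raise ValueError("the syllabus lists three lab assignments")
    labs_avg = sum(lab_scores) / len(lab_scores)
    return (
        WEIGHTS["labs"] * labs_avg
        + WEIGHTS["paper_review"] * paper_review
        + WEIGHTS["term_project"] * term_project
        + WEIGHTS["participation"] * participation
    )


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(course_score([90, 80, 100], paper_review=85,
                       term_project=92, participation=100))
```

Because the term project carries half of the weight, a 10-point change in the project score moves the total by 5 points under these assumptions, while the same change in the paper review moves it by 1 point.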





Lecture Notes









General topics in machine learning; review of basics



Project planning and introduction of previous students' projects



General classification and regression 



Linear models for regression


Support vector machines




Lab (1): homework assignment on classification


Medical problem 1: Clinical Decision Support Systems



Invited Lecture: EEG analyses for neurophysiological disorders, by Prof. Chen, a psychologist


General topics on clustering



Hierarchical clustering – traditional techniques


Spectral clustering – modern techniques 




Medical problem 2: Cardiac Ultrasound Image Categorization (dealing with medical images)



Invited Lecture: drug discovery, by a scientist from Pfizer Pharmaceuticals




Lab (2): introduction to Matlab and homework assignment on clustering of cardiac image data


Student paper review presentation



Student paper review presentation


Student paper review presentation



Student paper review presentation


General topics on dimension reduction



Unsupervised dimension reduction: PCA, CCA, ICA


Supervised dimension reduction: LASSO, group LASSO, or 1-norm SVM



Medical problem 3: Computerized Diagnostic Coding (dealing with natural language text data)


Lab (3): dimension reduction assignment on diagnostic coding data



Invited Lecture


Presentation of final term projects




Presentation of final term projects (cont.)



Presentation of final term projects (cont.)




Final Exam Week – no classes; make-up exam; term project reports are due on Friday





1.      Computers are allowed in the classroom for taking notes or for any activity related to the current class meeting.

2.      Assignments must be submitted electronically via HuskyCT. If an assignment is handed in late, 10 credits will be deducted for each additional day.

3.      Participating in the paper review earns 80% of the credit for the review assignment. Paper review presentation slides must be turned in via HuskyCT before the class in which the presentation is scheduled. The quality of your paper review presentation will be judged by the instructor (10 credits) and by the scores given by peer students in the class (10 credits).

4.      Assignments and paper reviews will be graded by the teaching assistant assigned to this course, under the guidance of and in consultation with the instructor.

5.      Final term projects will be graded by the instructor based on the clarity and creativity of the project report and on a comparison of the final presentations of all teams.
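
The late-submission and paper-review rules above can be sketched in Python as follows. This is only an illustration under assumptions the syllabus does not state: scores are out of 100 credits (so the 80% participation share equals 80 credits), a late penalty never reduces a score below zero, and lateness is counted in whole days. The function and constant names are hypothetical.

```python
# Hypothetical sketch of two grading rules from the policy list.
# Assumptions (not stated in the syllabus): scores are out of 100 credits,
# penalties never push a score below zero, and days_late is a whole number.

LATE_PENALTY_PER_DAY = 10      # credits deducted per day late (policy 2)
PARTICIPATION_CREDITS = 80     # credits for taking part in the review (policy 3)
MAX_INSTRUCTOR_CREDITS = 10    # instructor's quality score (policy 3)
MAX_PEER_CREDITS = 10          # peer students' quality score (policy 3)


def apply_late_penalty(raw_score, days_late):
    """Deduct 10 credits per day late, flooring the result at zero."""
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    return max(0, raw_score - LATE_PENALTY_PER_DAY * days_late)


def paper_review_score(instructor_credits, peer_credits):
    """Total for a presented review: 80 for participating plus two quality parts."""
    if not 0 <= instructor_credits <= MAX_INSTRUCTOR_CREDITS:
        raise ValueError("instructor credits must be between 0 and 10")
    if not 0 <= peer_credits <= MAX_PEER_CREDITS:
        raise ValueError("peer credits must be between 0 and 10")
    return PARTICIPATION_CREDITS + instructor_credits + peer_credits
```

For example, under these assumptions an assignment that would have earned 95 credits but arrives two days late is recorded as `apply_late_penalty(95, 2)`, and a review rated 8 by the instructor and 9 by peers is `paper_review_score(8, 9)`.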


1.      If a lab assignment or a paper review presentation is missed, there will be a take-home final exam to make up the credits.

2.      If two of the lab assignments or paper reviews are missed, there will be an additional assignment and a take-home exam to make up each of the two items.


A HuskyCT site has been set up for the class. You can access it by logging in with your NetID and password.  You must use HuskyCT for submitting assignments and check it regularly for class materials, grades, problem clarifications, changes in class schedule, and other class announcements.



Paper Review:

(Please select one paper from the following machine learning / data mining papers to present)


You can check this webpage to see your selection and the papers that have been chosen.


The following papers are from the International Conference on Machine Learning, 2011-2012.

1.      A Co-training Approach for Multi-view Spectral Clustering

Abhishek Kumar, Hal Daume III, University of Maryland

2.      Information-Theoretic Co-clustering

Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha

3.      Learning with Whom to Share in Multi-task Feature Learning

Zhuoliang Kang, Kristen Grauman, Fei Sha

4.      Automatic Feature Decomposition for Single View Co-training

Minmin Chen, Kilian Weinberger, Yixin Chen

5.      A Unified Probabilistic Model for Global and Local Unsupervised Feature Selection

Yue Guan, Jennifer Dy, Michael Jordan

6.      On Random Weights and Unsupervised Feature Learning

Andrew Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, Andrew Ng

7.      The Constrained Weight Space SVM: Learning with Ranked Features

Kevin Small, Byron Wallace, Carla Brodley, Thomas Trikalinos

8.      Support Vector Machines as Probabilistic Models

Vojtech Franc, Alexander Zien, Bernhard Schölkopf

9.      A Graph-based Framework for Multi-Task Multi-View Learning

Jingrui He, Rick Lawrence

10.  TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple Ratings

Chao Liu, Yi-Min Wang

11.  Convex Multitask Learning with Flexible Task Clusters

Wenliang Zhong, James Kwok

12.  Multi-level Lasso for Sparse Multi-task Regression

Aurelie Lozano, Grzegorz Swirszcz

13.  Output Space Search for Structured Prediction

Janardhan Rao Doppa, Alan Fern, Prasad Tadepalli

14.  A Complete Analysis of the l_1,p Group-Lasso

Julia Vogt, Volker Roth

15.  Information-Theoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation

Yuan Shi, Fei Sha

16.  A convex relaxation for weakly supervised classifiers

Armand Joulin, Francis Bach

17.  Inferring Latent Structure From Mixed Real and Categorical Relational Data

Esther Salazar, Lawrence Carin

18.  Learning Task Grouping and Overlap in Multi-task Learning

Abhishek Kumar, Hal Daume III


The following papers are from the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2011-2012.

19.  A Sparsity-Inducing Formulation for Evolutionary Co-Clustering
Shuiwang Ji*, Old Dominion Univ; Wenlu Zhang, Old Dominion University; Jun Liu, Siemens Corporate Research at Princeton

20.  On Socio-Spatial Group Query for Location-Based Social Networks
De-Nian Yang, Academia Sinica; Chih-Ya Shen*, National Taiwan University; Wang-Chien Lee, Pennsylvania State University; Ming-Syan Chen, NTU

21.  Robust Multi-Task Feature Learning
Pinghua Gong*, Tsinghua University; Jieping Ye, Arizona State University; Changshui Zhang, Tsinghua University

22.  Unsupervised Feature Selection for Linked Social Media Data
Jiliang Tang*, Arizona State University; Huan Liu, Arizona State University

23.  Learning from Crowds in the Presence of Schools of Thought
Yuandong Tian*, Carnegie Mellon University; Jun Zhu, Tsinghua University

24.  Event-based Social Networks: Linking the Online and Offline Social Worlds
Xingjie Liu*, The Pennsylvania State Univ; Qi He, IBM Almaden Research Center; Yuanyuan Tian, IBM Almaden Research; Wang-Chien Lee, Pennsylvania State University; John McPherson, IBM Almaden Research Center; Jiawei Han, University of Illinois at Urbana-Champaign

25.  On the Semantic Annotation of Places in Location-based Social Networks

Mao Ye, Dong Shou, Wang-Chien Lee, Peifeng Yin, Krzysztof Janowicz

26.  Two-locus association mapping in subquadratic runtime

Panagiotis Achlioptas; Bernhard Schölkopf, Max Planck Institute; Karsten Borgwardt, Max Planck Institutes

27.  Differentially Private Data Release for Data Mining

Noman Mohammed*, Concordia University; Rui Chen, Concordia University; Benjamin Fung, Concordia University; Mourad Debbabi, Concordia University; Philip Yu, University of Illinois at Chicago

28.  Collaborative Topic Models for Recommending Scientific Articles

Chong Wang*, Princeton University; David Blei, Princeton Univ

29.  Multi-View Clustering Using Mixture Models in Subspace Projections
Stephan Günnemann*, Ines Faerber, Thomas Seidl

30.  Subspace Correlation Clustering: Finding Locally Correlated Dimensions in Subspace Projections of the Data
Stephan Günnemann*, Ines Faerber, Kittipat Virochsiri, Thomas Seidl



Tools that may help with course projects (to be completed)

  1. Matlab Optimization Toolbox
  2. SVM_Light (support vector machines)
  3. LIBSVM (support vector machines)
  4. Bayesian Knowledge Discoverer (BKD): a computer program that learns Bayesian Belief Networks from databases
  5. Bayes net toolbox for Matlab
  6. TSP Demo
  7. LeNet (neural networks)
  8. Neural networks demo
  9. Neural networks flash demo
  10. GAUL (genetic algorithm)
  11. Java genetic algorithm demo
  12. A complete notebook GA
  13. A system for distributing statistical software, datasets, and information by electronic mail, FTP and WWW
  14. Tools for mining large databases: C5.0 and See5
  15. Description of the SLIPPER rule learner, a system that learns sets of rules from data, based on the original RIPPER rule learner
  16. Information about data mining and knowledge discovery in databases
  17. Clustering Algorithms


You are expected to adhere to the highest standards of academic honesty. Unless otherwise specified, collaboration on assignments is not allowed. Use of published materials is allowed, but the sources must be explicitly cited in your solutions. Violations will be reviewed and sanctioned according to the University Policy on Academic Integrity. Collaboration among team members is allowed only on the final term project that the team has selected.

“Academic integrity is the pursuit of scholarly activity free from fraud and deception and is an educational objective of this institution. Academic dishonesty includes, but is not limited to, cheating, plagiarizing, fabricating of information or citations, facilitating acts of academic dishonesty by others, having unauthorized possession of examinations, submitting work for another person or work previously used without informing the instructor, or tampering with the academic work of other students.”


If you have a documented disability for which you are or may be requesting an accommodation, you are encouraged to contact the instructor and the Center for Students with Disabilities or the University Program for College Students with Learning Disabilities as soon as possible to better ensure that such accommodations are implemented in a timely fashion.

Jinbo Bi ©2012/8-2012/12
Last revised: 8/27/2012