CSE 5095: String Algorithms and Applications in Bioinformatics - Spring 2013

Yufeng Wu
235 ITEB, ywu@engr.uconn.edu
Office hours:
ITEB 235, Tuesday/Thursday 9:30 am - 12 pm, or by appointment.


This is an algorithms course. Most of the topics concern combinatorial
algorithms for string processing. The goal is to survey the field of string algorithms
by covering important algorithmic ideas in string processing. We will also discuss
applications of string algorithms, especially in bioinformatics. However, the focus
of this course will be on algorithms.

This course is lecture-based. Students are required to read and present a research
paper on string algorithms and their applications. Each student should also carry out
an empirical study by implementing some string algorithms.

In particular, the planned subjects are:

1) Classic exact string matching algorithms and applications. Topics include:
Knuth–Morris–Pratt, Boyer-Moore, Karp-Rabin, Aho-Corasick, suffix trees
and suffix arrays.

2) Extensions of the classic string algorithms: approximate string matching, extensions
of basic sequence alignment, and other string matching heuristics (e.g. BLAST).
Probabilistic models of strings and patterns.

3) Burrows-Wheeler transform.
Algorithms in data compression and coding.

4) Applications in bioinformatics. Topics may include read mapping in high-throughput
sequencing and genome assembly.
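
To give a flavor of the first subject, here is a minimal Python sketch of Knuth-Morris-Pratt exact matching. It is an illustration only, not course material: the function names build_failure and kmp_search are made up for this example, and the failure-function formulation shown here is one common textbook presentation that may differ in detail from the one used in lectures.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]."""
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping occurrences
    return matches


if __name__ == "__main__":
    print(kmp_search("abracadabra", "abra"))
```

The text pointer i never moves backward, which is why the search takes time linear in the combined length of the text and the pattern.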

Prerequisites. As for background, essentially no biology is assumed. Some knowledge
of probability may help. The most relevant background is a graduate course on algorithms,
but a serious student who has only had an undergraduate algorithms course, or a smart,
mathematically mature student who has had neither, might also be able to follow the course.


Textbook: The following book is required. At least half of the lectures will be based on this book.
We will also cover some topics outside this book.

Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
by Dan Gusfield, 1997.
An excellent survey of string algorithms and bioinformatics applications.

Homework.  There will be written homework assignments from time to time.

Presentation.  Each student needs to select a particular subject in string algorithms and applications
in bioinformatics to present to the class. The student should contact the instructor about the
topic/paper first. I prefer presentations that cover some interesting algorithmic/technical aspects.

Projects. Each student should do a project on string algorithms and then write a
project report that summarizes the findings.
I prefer that each student works on his/her own project,
but team projects can be arranged with permission.
There are different kinds of projects.
Ideally, a student will choose to conduct some research: design a faster algorithm for a
string algorithmic problem, develop new algorithms for solving a practical problem (e.g. in
bioinformatics), or develop a
theoretical analysis of a string algorithmic problem.
Again, each project needs to be first approved by the instructor.

Exams. Currently I plan to have a final exam in this course. I believe this will help me see how well
the students in the class have learned the material.

Grading. Grades will be based on: homework (25%), project (30%), paper presentation (10%)
and final exam (35%). This is based on the current plan that there will be a final exam.