CSE 5840: String Algorithms and Applications in Bioinformatics - Fall 2015

Yufeng Wu
235 ITEB, ywu@engr.uconn.edu
Office hours:
ITEB 235, Monday/Wednesday 2:00 pm to 4:30 pm or by appointment.

This is an algorithms course. Most of the topics concern combinatorial
algorithms for string processing. The goal is to survey the field of string algorithms
by covering important algorithmic ideas in string processing. We will also discuss
applications of string algorithms in bioinformatics, especially in analyzing high-throughput
sequencing data. Sequence data analysis has been a major application of string
algorithms. I expect to cover various types of problems in analyzing sequence data.

This course is lecture-based. Students are required to read and present a research
paper on string algorithms and their applications. Each student should also carry out
an empirical study by implementing some string algorithms.

In particular, the planned subjects are:

1) Classic exact string matching algorithms and applications. Topics include:
Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick, suffix trees and suffix arrays.

2) Extensions of the classic string algorithms: approximate string matching, multiple sequence
alignment, other string matching heuristics (e.g. BLAST).

3) Burrows-Wheeler transform.
Algorithms in data compression and coding.

4) Applications in bioinformatics. Topics may include read mapping in high-throughput
sequencing, genome assembly, genetic variation calling, and compression.
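
As a small taste of subject 3, here is a minimal sketch of the Burrows-Wheeler transform
and its inverse. It is not course material: the function names bwt and inverse_bwt and the
"$" end marker are assumptions made for illustration, and the sorted-rotations construction
is the naive textbook one, far slower than the suffix-array-based methods used in practice.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, take the last column.

    Assumes `sentinel` does not occur in `text` and sorts before every
    character of `text` (true for "$" with ordinary letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel
    # Build every rotation of s and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last, sentinel="$"):
    """Invert the transform by repeatedly prepending the last column and sorting."""
    n = len(last)
    table = [""] * n
    for _ in range(n):
        # After k rounds, table holds the sorted length-k prefixes of all rotations.
        table = sorted(last[i] + table[i] for i in range(n))
    # The original string is the rotation that ends with the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)               # annb$aa
    print(inverse_bwt(transformed))  # banana
```

The transform tends to group equal characters together (note the runs of "a" and "n" above),
which is what makes it useful both for compression and for the read-mapping indexes
mentioned in subject 4.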

Prerequisites. Essentially no biology background is assumed. The most relevant
background is a graduate course on algorithms, but a serious student who has only had an
undergraduate algorithms course, or a smart, mathematically mature student who has had neither,
might also be able to follow the course.

Textbook: The following book is recommended but not required. We will cover some topics from
this book, but the majority of topics will be outside it. I will try to post relevant materials/links
on these topics.

Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
by Dan Gusfield, 1997.
An excellent survey of string algorithms and bioinformatics applications.

Homework.  There will be written homework assignments from time to time.

Presentation.  Each student needs to select a particular subject in string algorithms and their applications
in bioinformatics to present to the class. The student should contact the instructor about the
topic/paper first. I prefer presentations that highlight interesting algorithmic or technical aspects.

Projects. Each student should do a project on string algorithms and then write a
project report that summarizes the findings. I prefer that each student work on his/her own
project, but team projects are possible with permission.
There are different kinds of projects.
Ideally, a student will choose to conduct some research: design a faster algorithm for a
string algorithmic problem, develop a new algorithm for a practical problem (e.g. in
analyzing sequence data), or develop a theoretical analysis of a string algorithmic problem.
Alternatively, one may empirically evaluate the performance of string processing algorithms.
Again, each project needs to be approved by the instructor first.

Exams. I have not yet decided whether to hold a final exam in this course. If I do, it is likely to
be a take-home exam.

Grading. Grades will be based on: homework (25%), project (30%), paper presentation (10%)
and final exam (35%). This assumes that there will be a final exam.