Yufeng Wu

235 ITEB, ywu@engr.uconn.edu

Office hours: ITEB 235, Monday/Wednesday 2:00 pm to 4:30 pm or by appointment.

This is an algorithms course. Most of the topics will be about combinatorial

algorithms on string processing. The goal is to survey the field of string algorithms

by covering important algorithmic ideas in string processing. We will also discuss

applications of string algorithms in bioinformatics, especially in analyzing high-throughput

sequencing data. Sequence data analysis has been a major application of string

algorithms. I expect to cover various types of problems in analyzing sequence data.

This course is lecture-based. Students are required to read and present a research

paper on string algorithms and their applications. Each student should also perform

an empirical study by implementing some string algorithms.

In particular, the planned subjects are:

1) Classic exact string matching algorithms and applications. Topics include:

Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick, suffix trees and suffix arrays.

2) Extensions of the classic string algorithms: approximate string matching, multiple sequence

alignment, other string matching heuristics (e.g. BLAST).

3) Burrows-Wheeler transform. Algorithms in data compression and coding.

4) Applications in bioinformatics. Topics may include read mapping in high-throughput

sequencing, genome assembly, genetic variation calling, and compression.
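
As a small illustration of the kind of algorithm covered under subject 1, here is a minimal Python sketch of Knuth-Morris-Pratt exact matching. It is not course material: the function names failure_function and kmp_search are chosen only for this example, and the code favors readability over speed.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start positions of all (possibly overlapping)
    occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches
```

For example, kmp_search("abracadabra", "abra") scans the text once and never moves backward in it; the failure function tells the search how far to shift the pattern after a mismatch.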

Prerequisites. As for background, essentially no biology is assumed. The most relevant

background is a graduate course on algorithms, but a serious student who has only had an

undergraduate algorithms course, or a smart, mathematically mature student who has had neither,

might also be able to follow the course.

Textbook: The following book is recommended, but the majority of topics will be outside this book.

I will try to post relevant materials/links on these topics.

Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology

by Dan Gusfield, 1997. An excellent survey of string algorithms and bioinformatics applications.

Homework. There will be written homework assignments from time to time.

Presentation. Each student needs to select a particular subject in string algorithms and applications

in bioinformatics to present to the class. The student should contact the instructor about the

topic/paper first. I prefer presentations that highlight interesting algorithmic/technical aspects.

Projects. Each student should do a project on string algorithms. Then, each student needs to write a

project report that summarizes the findings. I prefer that each student works on his/her own project,

but team projects are allowed with permission. There are different kinds of projects.

Ideally, a student will choose to conduct some research: design a faster algorithm for some

string algorithmic problem, develop new algorithms for solving practical problems (e.g. in

analyzing sequence data) or develop some theoretical analysis of some string algorithmic problem.

Alternatively, one may evaluate empirically the performance of string processing algorithms.

Again, each project needs to be first approved by the instructor.

Exams. I have not yet decided whether to hold a final exam in this course. If I do, it is likely to be

a take-home exam.

Grading. Grades will be based on: homework (25%), project (30%), paper presentation (10%)

and final exam (35%). This assumes that there will be a final exam.