CSE 5840: String Algorithms and Applications in Bioinformatics
Fall 2015

Instructor: Yufeng Wu

Lecture: Tuesday/Thursday 11:00 am-12:15 pm.

Office Hours: ITE 235, Monday/Wednesday 2:00-4:30, or by appointment.
Note: some materials may be posted on HuskyCT.


Course Description. See the Syllabus.

Schedule. The planned schedule is here; the list below records what was actually covered:

Student presentation

Student presentation

11/19: Brief introduction to sorting by reversal

11/17: Unique decipherability
A book chapter by P. Pevzner on genome rearrangement by reversals.

Notes written by Gusfield on the Unique Decipherability problem.

11/12: Text compression

11/10: Sequence error correction
Papers and links on text compression: ref 1, ref 2, and ref 3.

Papers on sequence read error correction: paper 1, paper 2, and paper 3.

11/5: k-mer counting

11/3: Genome assembly
Three papers on k-mer counting covered in class: paper 1, paper 2, and paper 3.

Genome assembly from paired-end reads
The IDBA paper

10/29: Sequencing data analysis: genome assembly
Lecture 18

10/27: Sequencing data analysis: reads mapping
Lecture 17
Pevzner et al.'s paper using the Eulerian path approach.
BWT-based read mapping: the BWA paper.

Sequence read mapping: the MAQ paper.
Proposals for paper presentation and project assigned.
10/22: BLAST and PatternHunter.
See the Notes.

10/20: Approximate string matching.

Lecture 15
Gusfield: 12.3.

The original paper on spaced seeds (Link).
The algorithm for computing the probability of a seed hitting a region can be found on pages 9-10 of Keich et al. (Link). It differs slightly from the version covered in class but is very similar.
My explanation of the seeding algorithm, as presented in class (PDF).
This tutorial on spaced seeds can be useful.

10/15: Approximate string matching
Lecture 14

10/13: Compressed suffix array and a little 2D string matching
Lecture 13
Gusfield: 4.2 and 12.2.
This web page explains 2-dimensional string matching.
Basic dynamic programming for comparing two strings. If you have not yet learned the basic DP for string comparison, you should read it carefully.

Compressed suffix array: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
Another paper on compressed suffix arrays. I covered only a small part of the first CSA paper and none of the second. It would be good if someone in the class could tell us more about CSAs.
HW4 is posted in HuskyCT. Due: 10/27.
9/29: Pattern match with BWT
Lecture 12

10/8: BWT
Lecture 11

The main paper covered on BWT pattern matching is: Opportunistic data structures with applications (FOCS 2000)
There are many online references on BWT. There are also books on BWT (e.g. link).

9/29: Suffix array (continued)
Lecture 10

9/29: Suffix array
Lecture 9
Gusfield: Sections 7.14, 7.10.

The paper on the three-partition linear-time suffix array construction.
For reference only: my notes on suffix array (written a few years ago).
Gusfield's notes on LCP array
HW3 is posted in HuskyCT. Due: 10/13.
9/24: More applications of suffix tree
Lecture 8

9/22: Applications of suffix tree
Lecture 7
Gusfield: Chapters 7 and 9. I could not cover these chapters in full, but they are still worth reading.
Gusfield's writeup on O(n log n) tandem repeat finding.

HW2. Due: 10/1.
9/17: Suffix tree.
Lecture 6

9/15: Aho-Corasick algorithm
Lecture 5
Gusfield: Section 3.4. Chapter 5. Section 6.1.
An introduction to suffix tree by Dan Gusfield (PDF).
The writing by Gusfield on Ukkonen's algorithm (PDF).

9/10: Karp-Rabin and Aho-Corasick algorithms
Lecture 4

9/8: Boyer-Moore algorithm and its linear-time analysis
Lecture 3
Gusfield: Sections 3.4 and 4.4.

Gusfield: Section 3.2. Some links that might be useful:
About Boyer-Moore

9/3: Classic string matching algorithms: KMP and Boyer-Moore.
Lecture 2

9/1: Introduction to string matching. Different kinds of string matching/comparison algorithms. Z algorithm.
Lecture 1
Gusfield: Chapter 1, Sections 2.1-2.3, Section 3.2. Also some online references:
Z algorithm
Introduction to Knuth-Morris-Pratt algorithm
Introduction to Boyer-Moore
Basic dynamic programming for comparing two strings.
HW1. Due: 9/15.

Please submit electronically in HuskyCT. Please consider using LaTeX to write up your solutions.