CSE 5095: String Algorithms and Applications in Bioinformatics
Spring 2013


Instructor: Yufeng Wu

Lecture: Thursday 6:15--9:15 pm.

Office Hours: ITE 235, Wednesday 9:30-12:00 and 2:00-4:30, or by appointment.
Note: some materials may be posted on HuskyCT.

Announcements.

Course Description. See the Syllabus.

Schedule. The planned schedule is here, but this is what is actually happening:

Week
Topics
References
14
Student presentation
5/2: Approximate string matching (presented by L. Nazaryan)

4/30: Extension to BWT (presented by R. Jiang), and tandem repeat finding (presented by S. Mirzaei)
Extension to BWT:
An Extension of the Burrows Wheeler Transform and Applications to Sequence Comparison and Data Compression (http://link.springer.com/chapter/10.1007%2F11496656_16)

Tandem repeats:
http://bioinformatics.oxfordjournals.org/content/18/4/634.short
http://online.liebertpub.com/doi/abs/10.1089/cmb.2005.12.928
http://www.waset.org/journals/ijmbs/v6/v6-2.pdf
13
Student presentation
4/25: Hash-based reads mapping (presented by C. Chu)

4/23: RNA seq reads mapping (presented by S. Saha)

Papers on reads mapping:
"Hobbes: optimized gram-based methods for efficient read alignment"
http://www.ncbi.nlm.nih.gov/pubmed/?term=Hobbes%3Aoptimized+gram-based+methods+for+efficient+read+alignment
and "Accelerating read mapping with FastHASH"
http://www.ncbi.nlm.nih.gov/pubmed/?term=accelerating+read+mapping+with+fasthash

Papers on RNA-seq:
http://www.ncbi.nlm.nih.gov/pubmed/19289445
http://genomebiology.com/2010/11/3/R34
http://nar.oxfordjournals.org/content/38/14/4570
12
Student presentation
4/18: An approximate matching algorithm by G. Myers (presented by G. Ilie).

4/16: Fast Lempel-Ziv data compression (presented by A. Mamun).
Paper on approximate string matching.

Slides on data compression.
11
No class.

10
4/4: Probability of strings and sequences.
The paper by Chvatal and Sankoff on longest common subsequences.
9
3/28: Coding and compression of text.
Notes written by Gusfield on the Unique Decipherability problem.
The RECOMB'10 paper about sequence reads compression.
8
3/14: Applications in high-throughput sequencing.
Two papers (paper 1 and paper 2) on genome assembly with paired reads.
Pevzner et al.'s paper using Eulerian paths.
The BWT-based reads mapping: the BWA paper.
7
3/7: Burrows-Wheeler Transform
Compressed suffix array: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
Another paper on compressed suffix array
The main paper covered on BWT pattern matching is: Opportunistic data structures with applications (FOCS 2000)
There are many online references on BWT. There are also books on BWT (e.g. link).
6
2/28: More approximate string matching. BLAST: concepts and seeding. Introduction to sequence reads mapping.
Sections 12.2 and 12.3.
The original paper about spaced seeds (Link).
The algorithm for computing the probability of a seed hitting a region can be found on pages 9-10 of Keich et al. (Link). This algorithm is slightly different from what we covered in class, but very similar.
My explanation of the seeding algorithm, written for another class: it differs a little from what was presented this time, so it is for reference only (PDF).
This tutorial on spaced seeds can be useful.
Sequence reads mapping: the MAQ paper.

5
2/21: Two more applications of suffix trees: MUMs and maximal substrings of more than two strings. Introduction to sequence alignment. Approximate string matching.
Gusfield: Section 9.7 and Chapter 11 (I did not go over these topics in detail in class because most students have already learned them, but if you have not, you should study these sections carefully); Sections 12.2 and 12.7 (I only very briefly mentioned the Four-Russians approach; see more details in the link below); Section 4.2.

Introduction to sequence alignment by Gusfield.
The Four-Russians writeup by Gusfield.
4
2/14: Suffix array. More applications of suffix tree and array: tandem repeats, longest prefix-suffix matches for multiple strings, and maximal unique matches.
Gusfield: Sections 7.14, 7.10.

For reference only: my notes on suffix array (written a few years ago).
Notes on LCP array
The paper on the three-partition linear-time suffix array construction.
Gusfield's writeup on O(n log n) tandem repeat finding.

HW2: will appear on HuskyCT.
3
2/7: String matching with wildcards. Suffix tree. Applications of suffix tree.
Gusfield: Sections 3.5, 5.1-5.4, 6.1, 8.1-8.10.
2
1/31: Boyer-Moore (cont.), Karp-Rabin and Aho-Corasick.
Gusfield: Sections 3.4, 4.1 and 4.4.
Lecture slides on Aho-Corasick at another institution

HW1: posted on HuskyCT.
1
1/24: Introduction to string matching. Z algorithm. KMP. Boyer-Moore.
Gusfield: Chapter 1, Sections 2.1-2.3, Section 3.2. Also some online references:
Z algorithm
Introduction to Boyer-Moore