The enormous amount of biological data resulting from the completion of the Human Genome Project and advances in high-throughput technologies poses unprecedented challenges in data analysis. Toward the goal of understanding newly collected genomic data, this dissertation consists of three independent papers with a unified theme.
The first two papers share the common topic of looking for interesting regions in the genome. Exploring the genomic landscapes of different biological endpoints is an important approach for understanding biological processes and disease etiologies. Examples of such endpoints are DNA sequence, epigenetic marks, and binding sites for different transcription factors. Genome-wide measurements have been collected for these endpoints. Detecting regions of interest from these data can be framed as a general "bump finding" problem, where a bump is defined as a genomic location for which the data behave differently from the majority of the genome. In the first paper, a novel hidden Markov model (HMM) based method is proposed to search for CpG islands (CGI) in DNA sequence. The main advantage of this approach over others is that it summarizes the evidence for CGI status as probability scores, which provides flexibility in the definition of a CGI. The second paper proposes a hierarchical model to detect transcription factor binding sites by jointly analyzing multiple related ChIP-chip datasets. The model captures the locational correlation among datasets, which provides a basis for sharing information across experiments.
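To make the idea of "probability scores" concrete, the following is a minimal sketch of how a two-state HMM can assign each position a posterior probability of lying in an island-like state. It is not the dissertation's model: the single-base emissions, the state layout, and every parameter value (`START`, `TRANS`, `EMIT`) are illustrative assumptions, and the function name `posterior_island` is hypothetical.

```python
# Toy two-state HMM: state 0 = background, state 1 = island-like (C/G rich).
# All parameter values are illustrative assumptions, not fitted estimates.
STATES = (0, 1)
START = (0.5, 0.5)
TRANS = ((0.9, 0.1),
         (0.1, 0.9))
EMIT = ({"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1})


def posterior_island(seq):
    """Return P(state = island-like | whole sequence) for every position."""
    n = len(seq)
    if n == 0:
        return []

    # Forward pass, rescaled at each position to avoid numerical underflow.
    fwd = []
    for i, base in enumerate(seq):
        if i == 0:
            row = [START[s] * EMIT[s][base] for s in STATES]
        else:
            prev = fwd[-1]
            row = [EMIT[s][base] * sum(prev[r] * TRANS[r][s] for r in STATES)
                   for s in STATES]
        total = sum(row)
        fwd.append([v / total for v in row])

    # Backward pass, rescaled the same way.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        nxt = seq[i + 1]
        row = [sum(TRANS[s][r] * EMIT[r][nxt] * bwd[i + 1][r] for r in STATES)
               for s in STATES]
        total = sum(row)
        bwd[i] = [v / total for v in row]

    # Combine both passes and normalize per position.
    post = []
    for f, b in zip(fwd, bwd):
        weights = [f[s] * b[s] for s in STATES]
        post.append(weights[1] / (weights[0] + weights[1]))
    return post
```

Because the output is a probability per position rather than a yes/no call, a CGI definition can be tightened or relaxed afterwards simply by changing the threshold applied to these scores.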
The third paper focuses on data pre-processing for second-generation sequencing data from the ABI/SOLiD system. Capable of sequencing millions of short DNA fragments in parallel, second-generation sequencing has rapidly revolutionized genomic research. Among several available platforms, the SOLiD system from Applied Biosystems Inc. takes a unique approach, translating each pair of adjacent nucleotides into one of four colors. Colors reported from the SOLiD system are the result of complicated statistical manipulations of noisy fluorescence intensities, which introduce systematic biases that may mislead downstream analysis. In this paper, a version of quantile normalization was developed which substantially improves the yield and accuracy of calls at a small computational cost.
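The translation of adjacent nucleotide pairs into four colors can be illustrated with a short sketch. It assumes the commonly described two-base encoding in which A, C, G, T are numbered 0 to 3 and the color is the bitwise XOR of the two codes; the function names `encode_colors` and `decode_colors` are hypothetical, and the sketch covers only the idealized encoding, not the intensity-based color calling or the normalization developed in the paper.

```python
# Idealized two-base color encoding (assumed mapping: A=0, C=1, G=2, T=3).
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def encode_colors(seq):
    """Map each pair of adjacent nucleotides to a color in {0, 1, 2, 3}."""
    return [_CODE[a] ^ _CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Recover the nucleotide sequence from a known first base and its colors."""
    current = _CODE[first_base]
    bases = [first_base]
    for color in colors:
        current ^= color  # XOR with the color yields the next base's code
        bases.append(_BASE[current])
    return "".join(bases)
```

Decoding depends on every preceding color, so a single miscalled color alters all downstream bases of the read, which is one reason accurate color calls matter for downstream analysis.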
Three statistical applications in genomics: Redefining CpG islands, peak detection from multiple ChIP-chips, and data normalization for second generation sequencing.
Book Details
Author(s): Hao Wu
ISBN / ASIN: 1244588881
ISBN-13: 9781244588882