Three statistical applications in genomics: Redefining CpG islands, peak detection from multiple ChIP-chips, and data normalization for second generation sequencing.

Author: Hao Wu
ISBN: 9781244588882


Publisher: ProQuest, UMI Dissertation Publishing


Book Details

Author(s): Hao Wu

Publisher: ProQuest, UMI Dissertation Publishing

ISBN / ASIN: 1244588881

ISBN-13: 9781244588882


Description

The enormous amount of biological data resulting from the completion of the Human Genome Project and from advances in high-throughput technologies poses unprecedented challenges for data analysis. Toward the goal of understanding newly collected genomic data, this dissertation consists of three independent papers with a unified theme.

The first two papers share a common topic: looking for interesting regions in the genome. Exploring the genomic landscapes of different biological endpoints is an important approach to understanding biological processes and disease etiologies. Examples of such endpoints are DNA sequence, epigenetic marks, and binding sites for different transcription factors. Genome-wide measurements have been collected for these endpoints. Detecting regions of interest from these data can be framed as a general "bump finding" problem, where a bump is defined as a genomic location at which the data behave differently from the majority of the genome. In the first paper, a novel hidden Markov model (HMM) based method is proposed to search for CpG islands (CGIs) in DNA sequence. The main advantage of this approach over others is that it summarizes the evidence for CGI status as probability scores, which provides flexibility in the definition of a CGI. The second paper proposes a hierarchical model to detect transcription factor binding sites by jointly analyzing multiple related ChIP-chip datasets. The model captures the locational correlation among datasets, which provides a basis for sharing information across experiments.
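To illustrate what "probability scores" from an HMM can look like, here is a minimal sketch of the generic forward-backward algorithm on a toy two-state model. It is not the dissertation's model: the state names, the function `island_posteriors`, and every probability below are assumptions chosen only for illustration.

```python
# Toy two-state HMM: "island" vs "background".
# All parameters are illustrative assumptions, not values from the dissertation.
STATES = ("background", "island")
START = {"background": 0.9, "island": 0.1}
TRANS = {
    "background": {"background": 0.95, "island": 0.05},
    "island": {"background": 0.10, "island": 0.90},
}
EMIT = {
    "background": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "island": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def island_posteriors(seq):
    """Return P(state = island | whole sequence) for each position,
    computed with the scaled forward-backward algorithm."""
    n = len(seq)
    if n == 0:
        return []
    # Forward pass, rescaled at every position to avoid numerical underflow.
    fwd, scales = [], []
    for t, base in enumerate(seq):
        row = {}
        for s in STATES:
            if t == 0:
                prior = START[s]
            else:
                prior = sum(fwd[t - 1][r] * TRANS[r][s] for r in STATES)
            row[s] = prior * EMIT[s][base]
        c = sum(row.values())
        fwd.append({s: row[s] / c for s in STATES})
        scales.append(c)
    # Backward pass, reusing the forward scaling constants.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in STATES}
    for t in range(n - 2, -1, -1):
        nxt = seq[t + 1]
        bwd[t] = {
            s: sum(TRANS[s][r] * EMIT[r][nxt] * bwd[t + 1][r] for r in STATES)
            / scales[t + 1]
            for s in STATES
        }
    # The posterior is proportional to forward * backward at each position.
    post = []
    for t in range(n):
        w = {s: fwd[t][s] * bwd[t][s] for s in STATES}
        post.append(w["island"] / sum(w.values()))
    return post
```

Because the output is a score per position rather than a yes/no call, a stricter or looser island definition can be obtained afterwards by changing a cutoff, for example `[p > 0.5 for p in island_posteriors(seq)]`.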

The third paper focuses on data pre-processing for second-generation sequencing data from the ABI/SOLiD system. Capable of sequencing millions of short DNA fragments in parallel, second-generation sequencing has rapidly revolutionized genomic research. Among the several available platforms, the SOLiD system from Applied Biosystems Inc. takes a unique approach, translating each pair of adjacent nucleotides into one of four colors. The colors reported by the SOLiD system are the result of complicated statistical manipulations of noisy fluorescence intensities, which introduce systematic biases that may mislead downstream analysis. In this paper, a version of quantile normalization was developed that substantially improves the yield and accuracy of calls at a small computational cost.
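The abstract does not say which variant of quantile normalization was developed. As a point of reference only, the textbook form of the technique can be sketched as follows; the function name `quantile_normalize` and the idea of passing each intensity channel as one column are assumptions made for this example.

```python
def quantile_normalize(columns):
    """Force equal-length numeric columns to share one empirical distribution.

    This is the generic textbook procedure, not the dissertation's variant.
    Ties are broken by original order, which is a simplification.
    """
    if not columns:
        return []
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")
    # Reference distribution: mean of the k-th smallest values across columns.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(sc[k] for sc in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        # order[k] is the index of the k-th smallest value in this column.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            # Replace each value by the reference value of the same rank.
            out[i] = reference[rank]
        normalized.append(out)
    return normalized
```

After normalization every column contains the same set of values, so differences between columns that come only from scale or offset are removed while the rank order within each column is preserved.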