Probabilistic Models for Collecting, Analyzing,
and Modeling Expression Data
Hai-Son Phuoc Le
May 2013
CMU-ML-13-101
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Thesis Committee
Ziv Bar-Joseph, Chair
Christopher Langmead
Roni Rosenfeld
Quaid Morris
Submitted in partial fulfillment of the requirements
for the Degree of Doctor of Philosophy.
Copyright © 2013 Hai-Son Le
This research was sponsored by the National Institutes of Health under grant numbers
5U01HL108642 and 1R01GM085022, the National Science Foundation under grant numbers
DBI0448453 and DBI0965316, and the Pittsburgh Life Sciences Greenhouse. The
views and conclusions contained in this document are those of the author and should
not be interpreted as representing the official policies, either expressed or implied, of any
sponsoring institution, the U.S. government or any other entity.
Keywords: genomics, gene expression, gene regulation, microarray, RNA-Seq,
transcriptomics, error correction, comparative genomics, regulatory networks,
cross-species, expression database, Gene Expression Omnibus, GEO, orthologs,
microRNA, target prediction, Dirichlet Process, Indian Buffet Process, hidden Markov
model, immune response, cancer.
To Mom and Dad.
Abstract
Advances in genomics allow researchers to measure the complete set of transcripts
in cells. These transcripts include messenger RNAs (which encode for proteins) and
microRNAs, short RNAs that play an important regulatory role in cellular networks.
While this data is a great resource for reconstructing the activity of networks in cells,
it also presents several computational challenges. These challenges include the data
collection stage which often results in incomplete and noisy measurement, developing
methods to integrate several experiments within and across species, and designing
methods that can use this data to map the interactions and networks that are activated
in specific conditions. Novel and efficient algorithms are required to successfully
address these challenges.
In this thesis, we present probabilistic models to address the set of challenges
associated with expression data. First, we present a novel probabilistic error correction
method for RNA-Seq reads. RNA-Seq generates large and comprehensive datasets
that have revolutionized our ability to accurately recover the set of transcripts in cells.
However, sequencing reads inevitably contain errors, which affect all downstream
analyses. To address these problems, we develop an efficient hidden Markov model-based
error correction method for RNA-Seq data. Second, for the analysis of expression
data across species, we develop clustering and distance function learning methods for
querying large expression databases. The methods use a Dirichlet Process Mixture
Model with latent matchings and infer soft assignments between genes in two species to
allow comparison and clustering across species. Third, we introduce new probabilistic
models to integrate expression and interaction data in order to predict targets and
networks regulated by microRNAs.
Combined, the methods developed in this thesis provide a solution to the pipeline
of expression analysis used by experimentalists when performing expression experiments.
Acknowledgements
A Ph.D. may be the highest personal academic reward which many wish to achieve, but
the road leading to a Ph.D. is certainly not the work of a single person. I would like to
express my deepest gratitude to the multitude of people who have taught, helped, and
supported me during the joyful but also adventurous and challenging time at Carnegie
Mellon University. Certainly for me, writing these acknowledgements is one of the most
wonderful exercises in graduate school.
I am indebted to the generous support of my advisor, Ziv Bar-Joseph, who is truly
a source of inspiration and ideas. Not long after I started school, it became clear
to me that his academic success is a product of a remarkable balance of
work, family, and life-long passions. He is not only an academic father but also a life role
model. Ziv is always persistent and patient in answering a myriad of my questions.
Every week, I look forward to our meeting with a list of questions and always leave with
more ideas to work on. His instinct and fast thinking cut through the conceptual layers
of many problems so quickly and lead to questions for which it usually takes me weeks to
find good answers. Not only did I learn the technical and research methodology, but I
also developed an appreciation for high-impact research, which must be well-motivated
and driven by deliberate applications and substantial findings. He is so dedicated to the
research and detail-oriented about the results. On one occasion, Ziv showed up at my office
late in the evening, to my surprise. It turned out that he had gone home earlier, forgetting to
send materials needed for our paper submission due at midnight. He walked back to
school and gave them to me in person.
I appreciate the committee, Roni Rosenfeld, Chris Langmead, and Quaid Morris, for
their advice, comments, and suggestions to improve the work in this thesis and my oral
presentation. In particular, Quaid meticulously read the draft and suggested ways to
make the draft more readable. Roni insisted on making the presentation more accessible
to the audience.
I want to thank past and current members of the Systems Biology group at CMU.
Marcel Schulz is remarkable at selling new ideas and dedicated to new research collaboration.
His openness to share knowledge led to the first part of this thesis. Saket Navlakha
is instrumental in helping me improve my presentation skills. Our discussion about
research, philosophical aspects of life, and religion is always entertaining and makes
lunch more enjoyable. I enjoy small chats with Anthony Gitter, Shan Zhong, Aaron Wise,
and Guy Zinman, with whom I shared rooms and explored new cities during conferences.
I am grateful for the administrative help of Diane Stidle and Michelle Martin in scheduling
meetings, talks and paperwork.
I would like to thank my parents, Hai Le and Lanh Chau, for their unconditional love
and support. My dad, who taught me maths in first grade, introduced me to the world
of logical thinking. My humble and warm-hearted mom taught me how to listen and
treat people with care and respect. Although both were not physically with me during
my undergraduate and graduate study, their presence was always in my heart. My sister,
Tram Le, and her family are a source of comfort and encouragement in difficult times.
I cherish the time I spend with many new friends in Pittsburgh. Thao Pham, Hoang
Tran and Ha Nguyen cook delicious food and always welcome me to share their culinary
delights. I enjoy listening to Hang Nguyen discussing, debating and ranting about politics,
history or horoscopes. Hoan Ho always reminds me of a determined and strong-willed
person when I face a difficult task. Phuong Pham and Thang Ho motivated me to run
and trained with me. I am thankful for Suze Ninh’s diligent care and effort in revising my
writing and listening to my practice talks. Her sincere love comforts me in stressful times.
Tuan Nguyen’s sense of humor puts away worries and troubles. They are always available
to listen to my problems and cheerfully enjoy sips of whiskey or a bottle of beer.
I also thank other friends that I have exchanged ideas and interacted with: Lucia
Castellanos, Rob Hall, Tzu-Kuo Huang, Wooyoung Lee, Ankur Parikh, Liang Xiong, Min
Xu and Yang Xu. I will miss playing tennis and going to the gym with Hoan, Marcel,
Saket, Yang, Chao Shen, and Hua Shan.
Pittsburgh, PA
May 2013