
MEASURING AND ANTICIPATING THE IMPACT OF DATA REUSE by Kathleen Marie Fear A ... PDF

268 Pages·2013·5.09 MB·English


MEASURING AND ANTICIPATING THE IMPACT OF DATA REUSE

by Kathleen Marie Fear

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Information) in the University of Michigan, 2013

Doctoral Committee:
Professor Elizabeth Yakel, Chair
Assistant Professor Eytan Adar
Professor George C. Alter
Professor Margaret L. Hedstrom

ACKNOWLEDGEMENTS

I would like to thank my dissertation chair, Elizabeth Yakel, for her guidance and support throughout my time in the doctoral program. Working with Beth on my master’s thesis project inspired me to continue on to the doctoral program, and her ongoing encouragement kept me moving forward despite my missteps and stumbles. I am also especially grateful to Margaret Hedstrom, who brought me into the NSF IGERT Open Data program (National Science Foundation Grant No. 0903629), and in doing so played a major role in shaping my research and professional career. George Alter provided ready advice and guidance throughout this dissertation study that substantially shaped its intellectual direction, and I am also grateful for the generous access he granted me to the usage data and ICPSR study metadata that form the core of my study. I would also like to thank Eytan Adar for his guidance and suggestions throughout the process of this dissertation. I could not have asked for a better committee: you continually challenged me to produce my best work, and I feel privileged to have had the chance to work closely with you all.

This dissertation would not have been possible without generous access to data given to me by several sources. The bibliographic data I used came from the ICPSR Bibliography of Data-Related Literature, which is produced and maintained by Elizabeth Moss.
I would also like to thank Ixchel Faniel, Elizabeth Yakel, Adam Kriesberg and Morgan Daniels for their work on the Dissemination Information Packages for Information Reuse (DIPIR) project, Institute for Museum and Library Services, Grant # LG-06-10-0140-10, and for granting me access to the interview data from data reusers.

I would also like to thank all those who provided support of all kinds throughout this project. Veronica Falandino, Jen Todd, Sue Schuon, Karen Woollams and Lai Tutt were always willing and ready to answer my questions about the doctoral program, from travel funding to dissertation deadlines. I want to thank Dharma Akmon for reading and reviewing endless revisions of my proposal and then dissertation, and Matt Burton and Morgan Daniels for their valuable feedback as I worked through my dissertation proposal. Throughout my time at SI, I had the opportunity to work with excellent collaborators, all of whom contributed to my work and intellectual growth through their kindness, advice and support, especially Devan Donaldson, Paul Conway, and Kai Zheng. I am grateful to Mary Ann Mavrinac and Katie Clark at the University of Rochester for their patience and flexibility as I have worked to complete my dissertation and begin my career.

I am thankful to my parents for the sacrifices they made to send me along this path. Finally, I am grateful to my husband, Ryan, for his unwavering support and confidence in me, and to my sister, for always knowing how to lift my spirits.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF APPENDICES
CHAPTER 1
  1.1 BACKGROUND AND PROBLEM STATEMENT
  1.2 RESEARCH QUESTIONS AND DESIGN OVERVIEW
  1.3 KEY FINDINGS
  1.4 DATA AND DATA REUSE IN THE SOCIAL SCIENCES
  1.5 DATA SOURCES
  1.6 DEFINITIONS
  1.7 CONTRIBUTION AND SIGNIFICANCE OF THE STUDY
  1.8 OVERVIEW OF THE DISSERTATION
CHAPTER 2
  2.1 INCENTIVES AND DISINCENTIVES FOR SHARING DATA
  2.2 IDENTIFYING DATA REUSE
  2.3 MEASURING DATA REUSE IMPACT
  2.4 PREDICTING IMPACT
  2.5 CONCLUSION
CHAPTER 3
  3.1 STUDY OVERVIEW
    3.1.1 Identifying data reuse in a corpus of social science literature (Chapter 4)
    3.1.2 Quantifying the impact of data reuse (Chapter 5)
    3.1.3 Anticipating data impact (Chapter 6)
  3.2 STUDY SETTING AND CORE DATA
    3.2.1 Study sample
    3.2.2 Extracting study information
    3.2.3 The sample data
  3.3 LIMITATIONS
CHAPTER 4
  4.1 METHODS FOR IDENTIFYING REUSE
  4.2 PROCESSING THE DATA-RELATED PUBLICATIONS BIBLIOGRAPHY
  4.3 IDENTIFYING DATA CITATIONS AND ACKNOWLEDGEMENTS IN PUBLISHED LITERATURE
    4.3.1 Framework for categorizing data citations or acknowledgements
  4.4 FINDINGS
    4.4.1 Do authors cite or acknowledge data in primary papers?
    4.4.2 Do secondary authors cite data when they use data? Do authors cite data in documents where they do not use data?
  4.5 DISCUSSION: CITATION PATTERNS IN SOCIAL SCIENCE LITERATURE
CHAPTER 5
  5.1 COMPUTING IMPACT METRICS FROM CITATION DATABASES
  5.2 MEASURING DATA REUSE IMPACT
    5.2.1 Reuse count
    5.2.2 Secondary impact
    5.2.3 Diversity
    5.2.4 Downloaders
  5.3 DISCUSSION: COMPARING METRICS
CHAPTER 6
  6.1 ANTICIPATING IMPACT BY ANTICIPATING DATA REUSE
  6.2 INFLUENCES ON DATA REUSE
  6.3 INFLUENCES ON DATA REUSE IN THE SOCIAL SCIENCES
    6.3.1 Interviews with social science data reusers
    6.3.2 Research Question Fit / Dataset Size
    6.3.3 Information about the data collection process
    6.3.4 Data Producer Reputation
    6.3.5 Connection with data producer
    6.3.6 Discipline of data producers
    6.3.7 An additional factor: data in the published literature
    6.3.8 Institutional vs. individual authors
  6.4 CHOOSING REGRESSION MODELS
  6.5 FINDINGS
    6.5.1 What factors influence whether or not data are reused?
    6.5.2 What factors influence reuse impact?
  6.6 DISCUSSION: PREDICTORS OF DATA REUSE AND REUSE IMPACT
    6.6.1 Research question fit (size of dataset)
    6.6.2 Discipline of the data producer
    6.6.3 Data collection process information (processing status)
    6.6.4 Data producer reputation (h-index at time of data release)
    6.6.5 Connection with the data producer (co-authorship network size)
    6.6.6 Prominence of data (presence in research literature)
CHAPTER 7
  7.1 SUMMARY OF FINDINGS
  7.2 IMPLICATIONS
  7.3 PROBLEMS AND LIMITATIONS
  7.4 DIRECTIONS FOR FUTURE RESEARCH
  7.5 BROADER IMPACTS AND CONCLUSION
APPENDICES
BIBLIOGRAPHY

LIST OF TABLES

CHAPTER 3: METHODS
  Table 3.1 Study summary
  Table 3.2 Derivation of final study sample (beginning N = 8,471, final N = 273)
  Table 3.3 Information extracted from DDI study metadata
  Table 3.4 Distribution of ICPSR-assigned data types among proc. studies (N = 221)
  Table 3.5 Distribution of recoded data types among processed studies (N = 221)
  Table 3.6 Dist. of data type combos among studies using mult. types of data (N = 58)
  Table 3.7 Distribution of data sources among processed studies (N = 221)
  Table 3.8 Identification of individual auths. for institutional- or gov't-produced studies
  Table 3.9 Distribution of datasets (N = 253) across subject categories
  Table 3.10 Distribution of datasets (N = 312) within and outside of social sciences
CHAPTER 4: DATA CITATION PATTERNS IN THE SOCIAL SCIENCES
  Table 4.1 Distribution of docs in Bib. of Data-Related Literature by type (N = 2,323)
  Table 4.2 Distribution of docs in Bib. of Data-Related Literature, after coding and elimination of duplicates (N = 2,173)
  Table 4.3 Distribution of journal articles (N = 1,473) by type
  Table 4.4 Frequency of citation types across secondary publications (N = 449)
  Table 4.5 Frequency of combinations of citation types across sec. pubs (N = 449)
  Table 4.6 Proportion of papers citing data provider over time
  Table 4.7 Proportion of papers citing data producer over time
CHAPTER 5: MEASURING IMPACT
  Table 5.1 Summary of measures of data reuse impact
  Table 5.2 Top 10 highest impact studies according to reuse count
  Table 5.3 Top 10 highest impact datasets by secondary impact
  Table 5.4 Reuse publications (N = 449) per subject category
  Table 5.5 Top 10 highest impact datasets by Rao-Stirling diversity
  Table 5.6 Top 10 highest impact datasets by diversity (adapted Rao-Stirling)
  Table 5.7 Top 10 highest impact datasets by downloaders
  Table 5.8 Correlations between data reuse impact measures
  Table 5.9 Outlier datasets
CHAPTER 6: ANTICIPATING REUSE
  Table 6.1 Summary of literature on data reuse
  Table 6.2 Factors that influence data reuse
  Table 6.3 Initial set of independent variables: predictors of reuse
  Table 6.4 Sample size calculation
  Table 6.5 Descriptive statistics: Continuous variables
  Table 6.6 Descriptive statistics: Categorical variables
  Table 6.7 Logistic regression of reuse outcomes for 226 datasets
  Table 6.8 Logistic regression of reuse outcomes for 226 datasets
  Table 6.9 Linktest of 7-predictor model for reuse outcomes
  Table 6.10 Log. regression of reuse outcomes, incl. reuse before ICPSR release
  Table 6.11 Linktest on 8-predictor model for reuse outcomes
  Table 6.12 Logistic regression of reuse outcomes for 224 datasets
  Table 6.13 Log. regression of reuse outcomes for datasets not reused prior
  Table 6.14 Logistic regression of reuse outcomes for datasets not reused prior
  Table 6.15 Logistic regression of reuse outcomes for datasets not reused prior
  Table 6.16 Negative binomial regression of downloaders for 227 datasets
  Table 6.17 Negative binomial regression of downloaders for 227 datasets
  Table 6.18 Results of individual tests for reuse count
  Table 6.19 Results of individual tests for secondary impact
  Table 6.20 Results of individual tests for diversity

LIST OF FIGURES

CHAPTER 3: METHODS
  Figure 3.1 Studies released by ICPSR by year (N = 8,471)
  Figure 3.2 Non-series studies released by ICPSR by year (N = 1,257)
CHAPTER 4: DATA CITATION PATTERNS IN THE SOCIAL SCIENCES
  Figure 4.1 Decision tree for classifying publications
  Figure 4.2 Matrix of data citation types
  Figure 4.3 Hist. of elapsed time betw. publication of primary papers and data release
  Figure 4.4 Hist. of time elapsed (in years) betw. data release and non-prim pubs
  Figure 4.5 Elapsed time between data release and publication for non-prim pubs
CHAPTER 5: MEASURING IMPACT
  Figure 5.1 Hist. of reuse citations from journal arts. for studies with 1+ citations
  Figure 5.2 Cumulative percentage of studies cited (N = 273; 44 total studies cited)
  Figure 5.3 Download events by year (N = 1,173,873)
  Figure 5.4 Distribution of downloaders metrics
  Figure 5.5 Median new unique downloaders by year post release
  Figure 5.6 Median new unique downloaders by calendar year
CHAPTER 6: ANTICIPATING IMPACT
  Figure 6.1 Plots of standardized Pearson residuals and deviance residuals
  Figure 6.2 Plot of leverage for each study
  Figure 6.3 Plots of reuse count vs. predictors
  Figure 6.4 Plot of secondary impact vs. predictors
  Figure 6.5 Plot of diversity vs. predictors

LIST OF APPENDICES

APPENDIX A: List Of ICPSR Studies Included In Sample
APPENDIX B: Comparison of Major Citation Databases
APPENDIX C: Sample G-Index Calculation for Hypothetical Datasets
APPENDIX D: Impact Metric Scores and Ranking For 44 Datasets

CHAPTER 1
Introduction

1.1 Background and problem statement

In 2013, an economics graduate student identified an error made by high-profile researchers in a 2010 paper that had been an important influence on public policy (Herndon, Ash, & Pollin, 2013). In 2010, after several years of planning and work to coordinate and share data, a collaborative group of Alzheimer’s researchers began making breakthroughs on the detection and early diagnosis of that disease (Mueller et al., 2005). Between 2007 and 2010, scientists in genomics produced 1,150 new papers from data they did not collect themselves (Piwowar, Vision, & Whitlock, 2011). These successes in research were all made possible through data sharing: directly, in the case of the economics paper; through collaborative data production, in the case of the Alzheimer’s Disease Neuroimaging Initiative; and through contribution of data to a repository, the Gene Expression Omnibus, in the case of the genomics papers.
Scholars, funding agencies and public policy makers increasingly recognize sharing data for others to reuse as an important part of scholarship. Sharing data goes hand-in-hand with preserving them. There are numerous potential benefits to preserving data (Beagrie, Chruszcz, & Lavoie, 2008). Preserving and sharing data can increase the return on investment in research by ensuring the persistence of


