ebook img

DTIC ADA528330: Exploring the Use of Concept Spaces to Improve Medical Information Retrieval PDF

17 Pages·1.6 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview DTIC ADA528330: Exploring the Use of Concept Spaces to Improve Medical Information Retrieval

DecisionSupportSystems30(cid:14)2000.171–186 www.elsevier.comrlocaterdsw Exploring the use of concept spaces to improve medical information retrieval Andrea L. Houstona,), Hsinchun Chenb,1, Bruce R. Schatz c,2, Susan M. Hubbardd,3, Robin R. Sewelle,4, Tobun D. Ng f,5 a ISDSDepartment,CollegeofBusinessAdministration,LouisianaStateUni˝ersity,3178B4CEBA,BatonRouge,LA70803,USA bManagementInformationSystemsDepartment,Uni˝ersityofArizona,Tucson,AZ85721,USA c CommunityArchitectureforNetworkInformationSystems(CANIS)Laboratory,Uni˝ersityofIllinois,Urbana–Champaign, Urbana,IL61820,USA d DirectoroftheInternationalCancerInformationCenter,NationalCancerInstitute,Bethesda,MD20852,USA eSchoolofLibraryScience,Uni˝ersityofArizona,Tucson,AZ85721,USA fSchoolofComputerScience,CarnegieMellonUni˝ersity,Pittsburgh,PA15213,USA Abstract This research investigated the application of techniques successfully used in previous information retrieval research, to the more challenging area of medical informatics. It was performed on a biomedical document collection testbed, CANCERLIT,providedby the NationalCancer Institute(cid:14)NCI., which containsinformationon all types of cancer therapy. The quality or usefulness of terms suggested by three different thesauri, one based on MeSH terms, one based solely on termsfromthe documentcollection,andonebasedonthe UnifiedMedicalLanguageSystem(cid:14)UMLS. Metathesaurus,was exploredwiththe ultimategoalof improvingCANCERLITinformationsearchand retrieval. ResearchersaffiliatedwiththeUniversityofArizonaCancerCenterevaluatedlistsofrelatedtermssuggestedbydifferent thesauri for 12 different directed searches in the CANCERLIT testbed. The preliminary results indicated that among the thesauri,therewerenostatisticallysignificantdifferencesineithertermrecallorprecision.Surprisingly,therewasalmostno overlap of relevant terms suggested by the different thesauri for a given search. This suggests that recall could be significantlyimprovedby usinga combinedthesaurusapproach. q2000ElsevierScienceB.V. All rightsreserved. Keywords: Informationretrieval;Medicalinformatics;Medicalinformationretrieval;Conceptspace;MeSHterms;UMLSMetathesaurus 1. Introduction )Corresponding author. Tel.: q1-225-578-2503; fax: q1-225- 578-2511. Medicine is a dynamic field incorporating numer- E-mail addresses: [email protected] (cid:14)A.L. Houston., ousspecialties,eachwithitsownpreferredterminol- [email protected] (cid:14)H. Chen., [email protected] (cid:14)B.R. Schatz.,[email protected](cid:14)S.M.Hubbard.,[email protected] ogy. This diversity of vocabularies can be an obsta- (cid:14)R.R.Sewell.,[email protected](cid:14)T.D.Ng.. cle for medical professionals requiring access to 1Tel.:q1-520-621-4153. current medical information w16x. While advances in 2Tel.:q1-217-244-0651. medical database technology have improved infor- 3Tel.:q1-301-496-9096. 4Tel.:q1-520-621-2748. mation accessibility, retrieval speed, and searching 5Tel.:q1-412-268-4499. flexibility, they have not resolved the problems of 0167-9236r00r$-seefrontmatterq2000ElsevierScienceB.V.Allrightsreserved. PII: S0167-9236(cid:14)00.00097-X Report Documentation Page Form Approved OMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. 1. REPORT DATE 3. DATES COVERED 2000 2. REPORT TYPE 00-00-2000 to 00-00-2000 4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER Exploring the use of concept spaces to improve medical information 5b. GRANT NUMBER retrieval 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S) 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION University of Arizona ,Management Information Systems Department REPORT NUMBER ,Tucson,AZ,85721 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S) 11. SPONSOR/MONITOR’S REPORT NUMBER(S) 12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution unlimited 13. SUPPLEMENTARY NOTES 14. ABSTRACT This research investigated the application of techniques successfully used in previous information retrieval research, to the more challenging area of medical informatics. It was performed on a biomedical document collection testbed CANCERLIT, provided by the National Cancer InstituteNCI., which contains information on all types of cancer therapy. The quality or usefulness of terms suggested by three different thesauri, one based on MeSH terms, one based solely on terms from the document collection, and one based on the Unified Medical Language SystemUMLS.Metathesaurus, was explored with the ultimate goal of improving CANCERLIT information search and retrieval. Researchers affiliated with the University of Arizona Cancer Center evaluated lists of related terms suggested by different thesauri for 12 different directed searches in the CANCERLIT testbed. The preliminary results indicated that among the thesauri, there were no statistically significant differences in either term recall or precision. Surprisingly, there was almost no overlap of relevant terms suggested by the different thesauri for a given search. This suggests that recall could be significantly improved by using a combined thesaurus approach. 15. SUBJECT TERMS 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF ABSTRACT OF PAGES RESPONSIBLE PERSON a. REPORT b. ABSTRACT c. THIS PAGE Same as 16 unclassified unclassified unclassified Report (SAR) Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18 172 A.L.Houstonetal.rDecisionSupportSystems30(2000)171–186 vocabularydifferencesamongbiomedicalspecialties, We are investigating improving medical informa- variations in indexing and classificationsystems, nor tion retrieval by building on techniques successfully variations in information accessing systems. Medical applied to other information retrieval domains (cid:14)e.g. information retrieval, as a specialized case of infor- WormrFly Genome, the Internet, and a large scien- mation retrieval, is subject to classic information tific abstract collection.. Previous research demon- retrieval problems such as: Ainformation overloadB strated that the creation of automatically generated w3x,theAvocabulary problemB (cid:14)semanticbarrierw17x., concept spaces (cid:14)thesauri. is an efficient, effective synonymyandpolysemy.Italsohassomeinteresting technique to improve document precision and recall w x problems of its own 12,14,15. in directed searchesoflargeinformationspaces.The The community of medical information users is WormrFly genome research w5x indicated that a extremely varied in its level of biomedical expertise, combined thesaurus approach could improve recall its familiarity with various biomedical indexing vo- without sacrificing precision. Currently, we are in- cabularies and its information usage requirements. vestigating augmenting automatically generated con- For example, biomedical expertise ranges from pa- cept space terms with terms from existing medical tients and families encountering terms for the first thesauri: Medical Subject Headings (cid:14)MeSH. and the time, to specialistsin focusedresearchareaswhoare Unified Medical Language System (cid:14)UMLS. consideredexperts.Compoundingthisproblemisthe Metathesaurus, both developed and maintained by fact that there is no single commonly accepted the National Library of Medicine (cid:14)NLM.. biomedical indexing vocabulary. This lack of an information standard and the existence of thousands of different medical databases containing informa- 2. Literature review tion that can be formatted, indexed and stored in a There are four major approaches to textual medi- variety of different ways make it difficult, if not cal information retrieval. They are: keyword index- impossible, to locate and exchange medical informa- ing and retrieval (cid:14)traditional method., statistically tion. Users requiring information from a variety of based methods (cid:14)Salton-based syntactic techniques., medical sources may have to learn several different relevance feedback (cid:14)using searcher feedback to im- information retrieval systems and several different prove future searches., and semantic methods (cid:14)in- indexing vocabularies to locate the information they cluding extensions to Salton’s techniques and Natu- need. ral Language Processing.. Depending on medical information usage require- ments, the goals of indexing vocabularies may con- 2.1. Human indexing and keyword search flict. For example, biomedical research information, databasesof clinical studies or drug trials, and medi- The most common example of the keyword ap- cal insurance databases all need to have data orga- proach is human indexing (cid:14)using a standard set of nized or summarized by categories (cid:14)generalization.. pre-determined subject terms that domain experts However, primary care professionals dealing with assign to documents.. This approach relies entirely individual patient records require a detailed, precise on indexer expertise in both subject domain and and expressive vocabulary that can accurately de- standardindexingvocabularies.Previousresearchw2x w x scribe patient information 11,13. Patient records demonstrated that different, well-trained indexers of- can be a composite of every potential data format ten assign different terms to the same document (cid:14)numeric,freetext,tables,graphs,imagesandaudio.. (cid:14)synonymy. and that an indexer may use different Patientrecordinformationsystems,therefore,require terms for the same document at different times. a standard vocabulary that can specialize (cid:14)the direct Meanwhile,differentuserstendtousedifferentterms opposite of generalization. and accommodate a mas- to seek identical information (cid:14)polysemy.. Furnas et sive quantityof highlyvariableand volatileinforma- al. w10x, showed that the probability of two people tion, thereby increasing medical information system using the same term to classify an object was less challenges. than 20%. A.L.Houstonetal.rDecisionSupportSystems30(2000)171–186 173 Because of these discrepancies, an exact match 2.3. Rele˝ance feedback between searcher terms and indexer terms is un- likely, resulting in poor document recall and preci- Relevance feedback is a method that automates sion. Individual keywords are simply not adequate theintellectualprocessofevaluatingtheresultsofan discriminators of semantic content. Furthermore, initial search to improve future searches. It can be manual indexing is too time-consuming for process- used with both vector queries and Boolean searches. ing large volumes of information. Human indexing w x Salton and Buckley 20 discuss a variety of proce- methods need to be either supplemented or replaced dures for relevance feedback and outline three bene- by more effective and efficient techniques, hence, fits: automatic expansion of queries, gradual ad- the major research effort in developing automatic vancement towards the subject, and selective key indexing techniques. term emphasis. Their experiments demonstrated that relevancefeedbackcanbeveryeffective(cid:14)90%preci- sion w20x. and its usefulness has been well-docu- 2.2. Statistical techniques — ˝ector space and the- mented in TIPSTER conferences. The disadvantages saurus of this technique include the facts that concerns over processing speed and information storage outweigh Other approaches to addressing the vocabulary the benefits, and that it cannot improve effective problem include using either a thesaurus or a vector queries. spacedocumentrepresentationw19x.Thesauri(cid:14)mainly used to expand users’ queries by translating query terms into alternative terms that match document 2.4. Semantic approaches indexes.canbegeneratedmanuallyorautomatically. w x Srinivasan 21 presentsa medical informaticsexam- Another approach is to index documents semanti- ple that evaluates different query expansion strate- cally, allowing users to search using conceptual gies using a MEDLINE testbed. Most automatically meanings instead of keywords. Multi-dimensional generated thesauri are syntactically based techniques semantic space techniques attempt to enhance infor- that use vector space document representation and mation retrieval by placing documents (cid:14)vectors. by word statistical co-occurrence analysis. Many incor- meaning in a designated space. The most representa- porate other statistical techniques such as cluster tivemulti-dimensionalsemanticspacetechniquesare analysis, co-occurrence efficiency analysis, and fac- Metric Similarity Modeling (cid:14)MSM. and Latent Se- tor analysis. Relationships are then represented in mantic Indexing (cid:14)LSI.. mathematical matrices. MSMrepresentsbothqueriesanddocumentswith Most statistical methods concentrate on solving vectors in a multi-dimensional semantic space using synonymy by adding associative terms to keyword techniquesfromMulti-DimensionalScalingw1x.Doc- indexes. A major disadvantage is that some added umentvectorsarecomputedusingstandardstatistical terms have meanings that are different from the techniques and then placed in a multi-dimensional intended meanings, resulting in rapid degradation of semantic space, their positions determined by simi- w x information retrieval precision. Cimino et al. 7 has larity constraints. One disadvantage of MSM is that an interesting discussion on the problems of auto- it can only be used when external sources exist to matedmedicalinformationtranslationusingthesauri. determine similarityconstraints(cid:14)i.e. co-citationanal- Automatically generated neural-like thesauri (cid:14)con- ysis, relevance feedback or document classification cept spaces. provide an alternative to traditional information — Library of Congress Subject Head- thesauri. In a neural knowledge base, concepts ings or Compendex Classification Codes.. (cid:14)terms.arerepresentedasnodes,andrelationshipsas LSI w8x is an optimal method of MSM. It repre- weighted links. The associative memory feature of sents documents, queries and terms as vectors in a this thesaurus type allows a new paradigm for matrix determined by multi-dimensional Singular knowledge discovery and document searching using Value Decomposition. LSI takes advantage of the spreading activation algorithms (cid:14)e.g. Hopfield net.. fact that semantic relations exist within a document 174 A.L.Houstonetal.rDecisionSupportSystems30(2000)171–186 and attempts to place similar documents close to the Group Health Cooperative of Puget Sound. each other in a multi-dimensionalspace. Chute et al. Metathesaurus use has been shown to significantly w6x have applied the technique to medical informat- increase(cid:14)60–88%. documentrecallrate(cid:14)overMeSH ics. Deerwester et al. w8x have tested LSI on two terms. w18x. standard document collections (cid:14)MED and CISI — chosenbecauserelevancejudgmentsalreadyexisted. with promising results in both document recall and 3. CANCERLIT experiment precision. LSI proved equal to or better than either simple term matching or SMART, and better than Voorhees’ term disambiguation process. Unfortu- CANCERLITcontainsbibliographicrecords(cid:14)pre- nately,themeaningsofthesemathematicallyderived dominantly abstracts. from biomedical journals on semantic techniques are difficult to understand and research related to cancer biology, etiology, screen- computationally cumbersome. Their usefulness in ing, prevention, and treatment published between suggesting meaningful indexing or searching terms 1963 and today. Approximately 200 core journals has not been validated on a large real-world collec- accountfor the majorityof the collection. Additional tion. citations come from journals, scientific meeting pro- ceedings, books, dissertations, technical reports, and 2.5. UMLS other publications. The National Cancer Institute (cid:14)NCI. and NLM share processing costs, therefore, In 1986, NLM began developing the UMLS to many CANCERLIT citations are cross-indexed in address medical vocabulary problems by Aimproving MEDLINE. CANCERLIT is updated monthly to en- the ability of computer programs to ‘understand’ sure that the most current published cancer research biomedical meaning in user inquiries and then using results are available (cid:14)see acknowledgments.. There this understanding to retrieve and integrate relevant are more than 1.2 million records in the complete machine-readable informationB w16x. The UMLS has collection, which NCI estimates increases annually four components: the Metathesaurus, the Specialist by approximately 90000 abstracts. The record for- Lexicon, the Semantic Net, and the Information mat includes the following fields: authors and ad- Sources Map. The Metathesaurus is the largest and dresses, MeSH headings indexing the document, and most complex component, incorporating 589000 the document’s source, title, and abstract. More de- names for 253000 concepts from more than 30 tailed information is available from NCI. biomedical vocabularies, thesauri, and classification systems (cid:14)including MeSH, SNOMED, COSTAR, 3.1. CANCERLIT concept space ICD-9CM, and Dorland’s Illustrated Medical Dictio- nary, 27th edn... The Metathesaurus is not intended to serve as an Aoverarching classification systemB or Our CANCERLIT testbed contains 2 months of controlled vocabulary, but to facilitate translation CANCERLIT data (cid:14)May and June 1996.. The ap- and interpretation of biomedical terminology across proximately 10000 abstracts take up 40 MB of vocabularies. memory, and took roughly 1 h to create on an HP NLM makes UMLS copies available to resear- 9000 workstation. We have since expanded the chers for the development of biomedically related testbed to include the last 5 years of CANCERLIT expertsystems,automaticindexingandclassification documents (cid:14)seeai.bpa.arizona.edurCancerLit.. tools, and tools to index patient records. Research in Two different options were used to create the this area includes the SAPHIRE project (cid:14)Oregon prototype CANCERLIT concept space. A MeSH- Health Sciences University., the Internet Grateful basedCANCERLITconceptspacewascreatedusing Med interface for MEDLARS databases (cid:14)COACH existing MeSH terms that index each document, and Browser., the Interactive Query Workstation (cid:14)IQW., an Automatic Indexing CANCERLIT concept space the InterMed Vocabulary Server project (cid:14)Stanford, was generated using automatic indexing techniques Columbia, Harvard and the University of Utah., and (cid:14)word identification, stop wording, stemming, term- A.L.Houstonetal.rDecisionSupportSystems30(2000)171–186 175 phrase formation, and tf)idf term weighing.. Con- 3.1.1. Document collection cept space (cid:14)automatic thesaurus. creation is a stan- In any automatic thesaurus building effort, the dard process that can be applied to a variety of first task is to identify the collection of documents different kinds of textual information. Users with thatwillserveas the basisof the thesaurus.We used different levels of expertise have successfully used a 2-month collection of CANCERLIT documents. the output in different subject domains. Fig. 1 illus- trates the complete process as it might be applied to 3.1.2. Automatic indexing medical knowledge spaces. A brief overview of the The purpose of this step is to automatically iden- process is described below. tifyeachdocument’scontent.WeusedaSalton-based Fig.1.TheMedicalKnowledgeRepresentationSystem(cid:14)MKRS.Architecture. 176 A.L.Houstonetal.rDecisionSupportSystems30(2000)171–186 w x technique 19 that identifies document subject de- thesaurus is treated as a neuron and the asymmetric scriptors and computes descriptor frequency for the weight between any two terms is the unidirectional, entire collection. Then, a stop-word list eliminates weighted connection between neurons. With user- non-content bearing words (cid:14)e.g. AtheB, AaB, AonB, supplied terms as input patterns, the Hopfield algo- AinB., and a stemming algorithm identifies the re- rithm activates term neighbors (cid:14)strongly associated maining word stems. A document term-frequency terms., combines weights from all associated neigh- requirement removes AnoiseB. bors (cid:14)by addingcollectiveassociationstrengths., and repeats this process until convergence (cid:14)see Fig. 4.. 3.1.3. Co-occurrence analysis Fig.5illustratesaCANCERLITsession.Theuser Theimportanceofeachdescriptor(cid:14)term.inrepre- entered the term Abreast cancerB and the prototype’s senting document content varies. Using term fre- combined MeSHrAutomatic Index concept space quency and inverse document frequency, the cluster suggestsa listof relatedterms. Next, theuserselects analysis step assigns weights to each document term the term AFamily HistoryB from the list of related to represent term importance. Term frequency indi- terms, which combines it with the original term cateshowoftenaparticulartermoccursinthe entire Abreast cancerB to narrow the search. The prototype collection. Inverse document frequency (cid:14)indicating finds 506 documents that contain the two terms and term specificity. allows terms to have different the user selects the first document to review. strengths (cid:14)importance. based on specificity. A term 3.2. Experimental design can be a one-, two-, or three-word phrase. Fig. 2 describes the frequency computation. Cluster analy- Theprimarygoalin thispreliminaryphasewasto sis is then used to convert raw data indexes and evaluatetheusefulnessofsuggestedtermsfromthree weights into a matrix indicating term similarityrdis- different thesauri: similarity using a distance computation based on Chen and Lynch’s w4x asymmetric ACluster Func- v MeSH concept space — a thesaurus based on tionB which represents term association better than NLM’s controlled medical information retrieval the cosine function (cid:14)see Fig. 3 for more detail.. A controlled vocabulary: MeSH. net-like concept space of terms and weighted rela- v Auto Index concept space — our automatically tionships is then created, using the cluster function. generated thesaurus based exclusively on terms contained in the collection’s documents. 3.1.4. Associati˝e retrie˝al v Internet Grateful Med — the most commonly The Hopfield algorithmis ideal for concept-based cited on-line tool based on the UMLS Metathe- information retrieval. Each term in the network-like saurus. Fig.2.Frequencycomputation. A.L.Houstonetal.rDecisionSupportSystems30(2000)171–186 177 Fig.3.Clusteranalysiscomputations. Our subjects were five cancer researchers affili- begin a document search, and to suggest five related ated with the University of Arizona Cancer Center, terms for each search term. and a veterinarian. Phase one of the experiment During phase two, a subject-supplied search term involved evaluating term usefulness during a was entered into a thesaurus and subjects evaluated directed search of the CANCERLIT testbed. Twelve the top 40 thesaurus-suggested terms (cid:14)ArelevantB or searches were performed on each of the three the- Anot relevantB.. This step was repeated for the other sauri(cid:14)36totalsearches.. First,wedemonstratedeach two thesauri. The entire process was then repeated thesaurus using a subject-provided term. Then, each forothersubject-suppliedterm(cid:14)s..Theorderinwhich subject was asked to provide one or two terms to wesearchedthethesauriwaspre-assignedbysubject Fig.4.Hopfieldnetalgorithm. 178 A.L.Houstonetal.rDecisionSupportSystems30(2000)171–186 Fig. 5. CANCERLIT session: User entered Abreast cancerB in (cid:14)1.. CANCERLIT suggests related terms in (cid:14)2.. User selects AFamily HistoryB.Systemlocates506documentsrelatedtoAbreastcancerBandAFamilyHistoryBin(cid:14)3..Userchoosestoreadfirstdocumentin(cid:14)4.. numberin a randomfashion. We recordedthe verbal Subjectswere allowedto accessdocumentslinkedto protocols of the searching and evaluation process. the thesaurus-suggested terms, but we did not evalu- A.L.Houstonetal.rDecisionSupportSystems30(2000)171–186 179 ate that part of the search. Next, we solicited feed- information for term precision. There were no statis- back on the usefulness of the three thesauri, the tically significant differences among the three the- users’ searching experiences with CANCERLIT, sauriintermrecallorprecision,indicatingthatterms MEDLINE, and Internet Grateful Med, and our user suggested by our tool are comparable to terms sug- interface. Precision and recall for phase two was gested by Internet Grateful Med (cid:14)UMLS Metathe- computed. saurus-based. and MeSH indexing terms. Based on subject qualitative feedback, we believe this is par- tially due to the prototype’s size (cid:14)2 months worth of 3.3. Precision and recall data vs. the entire MEDLINE collection.. We were pleased that our tool could perform at a comparable Precision and recall were calculated as follows: level based on such limited input. (cid:14)1. Avery relevantB terms — 1 point; (cid:14)2. Apossibly An interestingresultfromthisresearchis the lack relevantB terms — 0.5 points;and(cid:14)3. Anot relevantB of duplicate relevant terms suggested by the three terms — zero points. For each search, all relevant different thesauri. In previous research, the most terms from each thesaurus were combined (cid:14)eliminat- relevant terms typically were suggested by all the- ing duplicates. for a total relevant score. Term recall sauri. We were surprised at the lack of term overlap for each thesaurus was calculated by dividing the (cid:14)for example, four searches had no overlapping relevant score for that thesaurus by the total relevant terms, and four searches had only one overlapping score for all thesauri. Term precision for each the- term., which suggests that a combined approach saurus was calculated by dividing the total relevant wouldprobablybemoreusefultosearchers.Subjects score for the thesaurus by the total number of terms confirmed this in their verbal feedback, and there is suggested by the thesaurus. supporting evidence in the literature w5,9,21x. Our Fig. 6 illustrates Minitab’s one-way ANOVA test data also support the literature’s premise that the- fortermrecallforeachthesaurus,andforthevarious saurus combinationcan increase recall without sacri- thesauri combinations. Fig. 7 illustrates the same ficing precision. Fig.6.Recallcomparisonbytermsource.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.