Table Of Content

Patent Retrieval: A Literature Review Walid Shalaby, Wlodek Zadrozny DepartmentofComputerScience UNCCharlotte {wshalaby, wzadrozn}@uncc.edu ABSTRACT in order to identify strengths and differences of corporate’s own Withtheeverincreasingnumberoffiledpatentapplicationsevery patentportfoliocomparedtootherkeyplayersworkingonrelated 7 year,theneedforeffectiveandefficientsystemsformanagingsuch technologies,4)patentrankingandscoringinordertoquantifythe 1 tremendousamountsofdatabecomesinevitablyimportant. Patent strengthoftheclaimsofanexistingoranewpatent,and5)prior 0 Retrieval(PR)isconsideredisthepillarofalmostallpatentanal- art search in order to retrieve patent documents and other scien- 2 ysistasks. PRisasubfieldofInformationRetrieval(IR)whichis tific publications relevant to a new patent application. All those n concernedwithdevelopingtechniquesandmethodsthateffectively patent-relatedactivitiesrequiretremendouslevelofdomainexper- a andefficientlyretrieverelevantpatentdocumentsinresponsetoa tisewhich,evenifavailable,mustbeintegratedwithhighlysophis- J given search request. In this paper we present a comprehensive ticatedandintelligentanalyticsthatprovidecognitiveandinterac- 2 reviewonPRmethodsandapproaches. Itisclearthat,recentsuc- tiveassistancetotheusers. cessesandmaturityinIRapplicationssuchasWebsearchcannot ] betransferreddirectlytoPRwithoutdeliberatedomainadaptation PatentRetrieval(PR)isthepillarofalmostallpatentanalysistasks. R and customization. Furthermore, state-of-the-art performance in PRisasubfieldofInformationRetrieval(IR)whichisconcerned I automaticPRisstillaroundaverage.Theseobservationsmotivates withdevelopingtechniquesandmethodsthateffectivelyandeffi- . s theneedforinteractivesearchtoolswhichprovidecognitiveassis- ciently retrieve relevant patent documents in response to a given c tancetopatentprofessionalswithminimaleffort.Thesetoolsmust search request. Although the field of IR has received huge ad- [ also be developed in hand with patent professionals considering vancesfromdecadesofresearchanddevelopment,researchinPR 1 theirpracticesandexpectations. Weadditionallytouchonrelated isrelativelynewerandmorechallenging.Ontheonehand,patents v taskstoPRsuchaspatentvaluation,litigation,licensing,andhigh- aremulti-page,multi-modal,multi-language,semi-structured,and 4 lightpotentialopportunitiesandopendirectionsforcomputational metadata rich documents. On the other hand, patent queries can 2 scientistsinthesedomains. beacompletemulti-pagepatentapplication.Theseuniquefeatures 3 maketraditionalIRmethodsusedforWeboradhocsearchinap- 0 1. INTRODUCTION propriateoratleastoflimitedapplicabilityinPR. 0 Patentsrepresentproxiesforeconomic,technological,andevenso- . Moreover, patent data is multi-modal and heterogeneous. As in- 1 cialactivities. TheIntellectualProperty(IP)systemmotivatesthe dicated by Lupu et al. [2], analyzing such data is a challenging 0 disclosure of novel technologies and ideas by granting inventors 7 exclusive monopoly rights on the economic value of their inven- task for many reasons; patent documents are lengthy with highly 1 tions. Patents,therefore,haveamajorimpactonenterprisesmar- complexanddomainspecificterminology. Toestablishtheirwork : ketvalue[1].Withthecontinuousriseinthenumberoffiledpatent novelty,inventorstendtousejargonandcomplexvocabularytore- v fertothesameconcepts. Theyalsousevagueandabstractterms applicationseveryyear,theneedforeffectiveandefficientsystems i X formanagingsuchtremendousamountsofdatabecomesinevitably inordertobroadenthescopeoftheirpatentprotectionmakingthe problemofpatentanalysislinguisticallychallenging. r important. a PR starts with a search request (query) which often represents a Typicalpatentanalysistasksinclude: 1)technologyexplorationin patent application under novelty examination. Therefore, several ordertocapturenewandtrendytechnologiesinaspecificdomain, methodsforqueryreformulation(QRE)havebeenproposedinor- and subsequently using them to create new innovative services, dertoselect,remove,orexpandtermsintheoriginalqueryforim- 2)technologylandscapeanalysisinordertoassessthedensityof provedretrieval.QREmethodsareeitherkeyword-based,semantic- patentfilingsofspecifictechnology,andsubsequentlydirectR&D based,orinteractive. Keyword-basedmethodsworkbysearching activities accordingly, 3) competitive analysis and benchmarking forexactmatchesbetweensearchquerytermsandthetargetcorpus, andthusfailtoretrieverelevantdocumentswhichusedifferentvo- cabularybuthavesimilarmeaningtooriginalquery.Inordertoal- leviatethevocabularymismatchproblem,semantic-basedmethods trytosearchbymeaningthroughexpandingqueriesand/ortarget corpuswithsimilarorrelatedtermsandthusbridgingthevocabu- larygap.Becauseneithermethodsprovedacceptableperformance, few interactive methods were proposed to allow users to interac- tivelycontrolQREwithreasonableeffort. Table1:Patentkindcodesofmajorpatentoffices Type USPTO(US) EPO(EP) WIPO(WO) A1 application applicationw/searchreport A2 republishedapplication applicationw/osearchreport A3 - searchreport A4 - supplementarysearchreport publicationofamendedclaims A9 modifiedapplication B1 grantedpatentw/oA1 grantedpatent(publication) - B2 grantedpatentw/A1 amendedB1 - Thisreviewaimstoprovideresearcherswithanillustrativeandcrit- 2.3 PatentFamilies icaloverviewofrecenttrends,challenges,andopportunitiesinPR. Apatentfamilyispatentdocumentsthatrefertothesameinvention Therestofthispaperisorganizedasfollows. Section2presents andarepublishedbydifferentpatentofficesaroundtheworld[3], somepreliminariesandbackgroundaboutpatentsdata. Section3 usuallyindifferentlanguagesdependingontheissuingpatentof- providesanoverviewofevaluationtracksanddatacollectionsfor fice. Patentfamiliescouldbeexploitedtoexpandthepriorartlist PRbenchmarking. AnillustrationofPRtasksispresentedinSec- oftopicpatentsastheydisclosethesameinvention. tion4. Section5presentsacomprehensivereviewonPRmethods andapproaches. Section6lightlytouchesonrelatedtaskssuchas 3. DATA&EVALUATIONTRACKS patent quality assessment, litigation, and licensing. Finally, con- This section presents an overview of evaluation tracks organized cludingremarksarepresentedinSection7. forpatentdataanalysisalongwithavailabledatacollectionswith focusontaskspertainingtoPR. 2. PRELIMINARIES 2.1 PatentDocumentsandKindCodes 3.1 CLEF-IPCollections Patent documents are mostly textual. They are highly structured TheConferenceandLabsoftheEvaluationForum4 (CLEF)isan withtypicalelements(sections)includingtitle, abstract, descrip- European series of workshopswhich started in 2001 to foster re- tion(a.k.abackgroundoftheinvention),andclaims. Thedescrip- search in Cross Language Information Retrieval (CLIR). The In- tion section articulate in details the technical specification of the tellectualProperty(IP)track(CLEF-IP)whichranbetween(2009- inventionanditspossibleembodiments. Theclaimssectionisthe 2013)wasorganizedto: 1)fosterresearchinpatentdataanalysis, mostsignificantoneasitdescribesthescopeofprotectionsought and 2) provide large and clean test collections of multi-language by the inventor and hence encodes the real value of the patent. patentdocuments,specificallyinthethreemainEuropeanlanguages Patent documents are lengthy with highly complex and domain (English,French,andGerman).Researchlabshavetheopportunity specificterminology. Theyalsocontainmultipledatatypes(e.g., to test their methods on multiple shared tasks such as PR, patent text, images, flow charts, formulae...etc) with a rich set of meta- classification,image-basedPR,imageclassification,flowchartrecog- dataandbibliographicinformation(e.g.,classificationcodes,cita- nition,andstructurerecognition[1,3–6]. tions, inventors, assignee, filing/publication dates, addresses, examiners...etc). TheCLEF-IPdatacollectionarepatentdocumentsextractedfrom theEPOdata. ItisprovidedthroughtheInformationResearchFa- Typically,eachpatenthasasetofpertainingdocumentswhichpub- cility5 (IRF) and hosted by Marec 6. Patent documents are pro- lishedthroughoutitslife-cycle.Alldocumentsareidentifiedbyan videdinXML format andhavecommonDocumentType Defini- alphanumeric name with a common naming convention. Names tion(DTD)schema. Thecollectionwasconstructedaccordingto startwithtwolettersidentifyingtheissuingpatentoffice(e.g.,US the proposed methodology by Graf and Azzopardi [7] and is di- andEP),thenthepatentnumberassequenceofdigits,andfinally videdintotwopools: asuffixindicatingthedocument’skindcode. Thekindcodeiden- tifies the stage in the patent life-cycle at which the document is published. Table 1 shows a brief description of kind codes used 1. Thecorpuspool:documentsselectedfromthispoolarepro- atmajorpatentofficesincludingtheUSPatentandTrademarkOf- vided for participating labs as training or lookup instances fice1(USPTO),theEuropeanPatentOffice2(EPO),andtheWorld dependingonthetask. IntellectualPropertyOrganization3(WIPO). 2. Thetopicspool:documentsselectedfromthispoolarecalled topicsandtheyrepresenttestingorevaluationinstancesde- 2.2 PatentClassification pending on the task. For example, in prior art search, the Patentofficesorganizepatentsbyassigningclassificationcodesto topicmightbeapatentapplicationdocumentforwhichitis eachofthembasedonthetechnicalfeaturesoftheinvention. The requiredtoretrievepriorart. patentclassificationsystemisahierarchicalone.Commonclassifi- cationsystemsincludetheInternationalPatentClassification(IPC), the US Patent Classification (USPC), and the Cooperative Patent TheXMLdocumentsconsistofthemaintextualsectionssuchas Classification(CPC). bibliographicdata,abstract,description,andclaims.Eachsection 1http://www.uspto.gov/ 4http://www.clef-initiative.eu/ 2http://www.epo.org/ 5http://www.ir-facility.org/ 3http://www.wipo.int/portal/en/index.html 6http://www.ir-facility.org/prototypes/marec iswritteninoneormorelanguages(English,French,and/orGer- 69patentapplications. man) and is denoted by a language code. At least the claims of grantedpatents(B1documents)arewritteninthethreelanguages 3.2 NTCIRCollections becauseitisEPOrequirementonceapatentapplicationisgranted. TheJapaneseNationalInstituteofInformaticsTestbedsandCom- munityforInformationaccessResearchproject7 (NTCIR)started CLEF-IP2009Collection:thisdatasetwasdesignedfortheprior in 1997 to support research in IR and other areas, focusing on artsearchtask[4]. Thecorpuspoolcontainsdocumentspublished CLIR.NTCIRhasbeenorganizingaseriesofworkshopsproviding between (1985-2000) (~2m documents pertaining to ~1m unique testcollectionstoresearchersforevaluatingtheirmethodologieson patents). The topics pool contains documents published between multipleCLIRtasks[8].BetweenNTCIR-3andNTCIR-11(2002- (2001-2006)(~0.7mdocumentspertainingto~0.5mindividualpatents). 2013), there has been dedicated tasks for patent data analysis in- Topicsaresetsofdocumentsfromthetopicspoolwithsizesrang- cludingpatentretrieval[9],classification,mining,andtranslation. ingfrom500to10,000topics.Topicswereassembledfromgranted patentdocumentsincludingabstract,description,andclaimssec- NTCIR-3: thePRtaskinNTCIR-3targetedthe"technologysur- tions.Citationinformationfromthebibliographicdatasectionwas vey" problem. The dataset for this task includes: 1) full text of excluded. Japanesepatentapplicationsbetween(1998-1999), 2)abstractof Japanesepatentapplicationsbetween(1995-1999)alongwiththeir Amajorpitfallinthisdatasetisitstopics,whichwerechosenfrom respectiveEnglishtranslations,and3)30searchtopicswhereeach granted patent documents (B1 documents). Initially, the creators topicincludesarelatednewspaperarticle. Thetaskistoretrieve ofthedatasetweremotivatedbyhavingtopicsfromgrantedpatent patentsrelevanttonewsarticles. Bothcross-genreexperimentsin documentswhichhaveclaimsinthreelanguages.Thiswasthought whichpatentswereretrievedbyanewspaperclipaswellasordi- toprovideakindofparallelcorpussuitableforCLIR.Theproblem naryadhocretrievalofpatentsbytopicswereconducted[9]. ofusingsuchdocumentsissimple,itcontradictsthepracticeofIP searchprofessionalswhostartwiththepatentapplicationdocument NTCIR-4: two PR tasks were organized in NTCIR-4 [10]: 1) notthegrantedone. patent map generation, and 2) invalidity search. The dataset for thePRtasksincludes:1)unexaminedJapanesepatentapplications CLEF-IP2010Collection: thisdatasetwascreatedfortheprior publishedbetween(1993-1997)alongwithEnglishtranslationsof art search and patent classification tasks [5]. The corpus pool of theabstract,and2)34searchtopicswhereeachtopicisaclaimof thisdatasetcontainsdocumentswithpublicationdatebefore2002 arejectedpatentapplicationwhichwasinvalidatedbecauseofex- (~2.6mdocumentspertainingto~1.9muniquepatents).Thetopics istingpriorart. Relevancejudgmentswereindividualpatentsthat pool contains documents published between (2002-2009) (~0.8m caninvalidateatopicclaimbyitsownorinconjunctionwithother documentspertainingto~0.6muniquepatents).Topicsfortheprior patents.Relevantpassagestotheinvalidatedclaimwerealsoanno- arttaskaretwosetsofdocumentsfromthetopicspool;asmallset tatedandaddedtotherelevancejudgments. of500topicsandalargersetof2000topics. UnliketheCLEF-IP 2009 dataset, topics are assembled from patent application docu- NTCIR-5:twoPRtaskswereorganizedinNTCIR-5[11]:1)doc- mentsratherthangrantedpatentdocuments. umentretrieval(invaliditysearch),and2)patentpassageretrieval. Thedatasetfortheinvaliditysearchtaskincludes: 1)unexamined CLEF-IP2011Collection: Thisdatasetwascreatedasatestcol- Japanesepatentapplicationspublishedbetween(1993-2002)along lectionforfourtasks:priorartsearch,patentclassification,image- withEnglishtranslationsoftheabstract, and2)1200searchtop- basedpriorartsearch,andimageclassification[1]. Thetopicsand icswhereeachtopicisaclaimofaninvalidatedpatentapplication. corpuspoolswerethesameasinCLEF-IP2010dataset. Forthe Relevancejudgmentsweregeneratedinamannersimilartotheone priorarttask, 3973topicswereprovidedasaseparatearchiveof usedinNTCIR-4invaliditysearchtask. patentapplicationdocuments. NTCIR-6: two PR tasks were organized in NTCIR-6 [12]: 1) CLEF-IP2012Collection: thisdatasetwascreatedasatestcol- Japaneseretrieval(invaliditysearch),and2)Englishretrieval. The lectionforthreetasks:passageretrievalstartingfromclaims,chem- datasetfortheJapaneseretrievaltaskisthesameoneusedinNTCIR- icalstructurerecognition,andflowchartrecognition[6].Thetopics 5butmoretopicswereused(1685topics). TheEnglishretrieval andcorpuspoolswerethesameasinCLEF-IP2010dataset. The taskwasfocusingonfindingallthecitationscitedbytheapplicant passageretrievaltaskisdesigneddifferentlyfrompreviousCLEF- andtheexaminer. Thedatasetforthistasksincludes: 1)granted IP prior art search collections. The purpose for this tasks is to patentsfromtheUSPTObetween(1993-2000),and2)3221search retrieve both documents and passages relevant to a set of claims. topicswhereeachtopicisagrantedpatentpublishedbetween(2000- Topics for the passage retrieval task were extracted from patent 2001). applicationspublishedafter2001. Relevancejudgmentswerethe highlyrelevantcitationsonly(i.e.,markedXorY)intheexamin- 3.3 TREC-CHEMCollections ers’searchreports(A4documents)ofchosentopicpatents. TheTREC-CHEMtrackwasorganizedtomotivatelargescalere- CLEF-IP2013Collection: thisdatasetwascreatedasatestcol- searchonchemicaldatasets,especiallychemicalpatentIR[13]. lectionfortwotasks:1)passageretrievalfromclaims,and2)struc- ture recognition from patent images [3]. The topics and corpus TREC-CHEM 2009: this collection was created as a test col- poolswerethesameasinCLEF-IP2010dataset.SimilartoCLEF- lection for two tasks [13]: 1) technology survey, and 2) prior art IP2012,theCLMtaskisdesignedtoretrievebothdocumentsand search. 18 topics were provided for the technology survey task passages relevant to a set of claims. Topics for the passage re- whererelevancejudgmentswereobtainedfromexpertsandchem- trievaltaskwereextractedfrompatentapplicationspublishedafter istrygraduatestudents.Forthepriorartsearch,1,000patentswere 2002. Overall, thetopicssetcontained148topicsextractedfrom 7http://research.nii.ac.jp/ntcir Table2:Scenariosofpriorartsearch Task Who When Purpose Output relatedworksearch inventor/prosecutor pre-grant allrelatedwork applicant’sdisclosure patentabilitysearch prosecutor/examiner pre-grant/examination noveltybreakingwork grant/modify/reject infringementsearch owner/investor post-grant relevantclaims/infringingproducts sue/license/clearance freedomtooperatesearch investor post-grant relevantclaims/relatedwork clearance invaliditysearch competitor/defendant post-grant noveltybreakingwork re-examine/IPR/PGR technologysurvey technologyanalyst pre/post-grant allpublishedpatents surveyreport providedastesttopicswhererelevancejudgmentswerecollected Prior-art search is the main theme of the CLEF-IP and NTCIR fromthecitationsoftopicpatentsaswellastheirfamilymembers. tracks. The importance of this task stems from the requirement Thesearchcorpuscontains~1.2mchemicalpatentsfileduntil2007 byallpatentofficesthatfiledpatentsmust constitute novel, non- atEPO,USPTO,andWIPO.Italsocontains59Kscientificarticles. obvious, andnon-abstractideas. Therefore, animportantactivity throughthepatentlife-cycleistothoroughlyensurethatnoearlier TREC-CHEM2010:thiscollectionwascreatedforthesametwo publishedpatentormaterialdescribingtheprescribedideasexist. tasksasinTREC-CHEM2009[14]. 30topicswereprovidedfor Thetaskcanbedefinedasfollows: the technology survey task. The search corpus contains ~1.3m chemicalpatentsand177Kscientificarticles.Relevancejudgments Problem: GivenapatentapplicationX,retrieveallrelateddocu- werecreatedthesamewayasinTREC-CHEM2009. mentstoX. TREC-CHEM2011:thiscollectionwascreatedforthesametwo Priorartsearchisatotalrecalltask,thereforeitdemonstratessev- tasksasinpreviousTREC-CHEMtracksbesidesanewchemical eral challenges. Search coverage is one of the main challenges, image recognition task. The technology survey task topics were because it is required to cover all previously published material biomedicalandpharmaceuticalpatents[15]. (patentornon-patentliterature)inallforms(electronicorprinted) whichisinfeasible. Anothermajorchallengeistheneedtosearch 3.4 OtherSources throughmaterialswrittenindifferentlanguages.Lastbutnotleast, OtherIPdatasourcesaredetailedin[16].Theseincludefullpatent traditional IR methods perform poorly when confronted with the textsaswellasbibliographicinformationfrommajorpatentoffices patentpriorartsearchtask. Mainlybecausethepatentlanguageis such as the USPTO, EPO, and WIPO. Bibliographic information fullofjargonanduserdefinedterminology.Inventorsintentionally for patents published from 1976 to 2006 is provided through the tendtousedifferentvocabularytoexpresssameorsimilarideasin NationalBureauofEconomicResearch8(NBER)andsubsequently ordertoestablishthenoveltyoftheirwork. cleanedandextendedtoincludepatentsuntil20139. Patentprose- cutionhistoriesareavailablethroughthePatentApplicationInfor- Priorartsearchisperformedatdifferentstagesofthepatentlife- mation Retrieval10 (PAIR). Patent assignments, filings, classifica- cycle,bydifferentstakeholders,forvariouspurposes,andforlim- tions,andpetitiondecisionsarealsoprovidedthroughtheUSPTO itedperiodoftime. Understandingthereal-lifepracticesofpatent bulkdownloadshostedbyGoogle11. professionalsiscriticaltobettersatisfytheirinformationneed[17]. Inotherwords,thesearchscenariodependsonwhenitisdone,by who,andforwhatreason(s).Table2showsthesevariousscenarios 4. PATENTRETRIEVALTASKS whicharedetailedbelow. ThegoalofPRistoretrieverelevantpatentdocumentstoagiven searchrequest(query). Thisrequestcantakedifferentformssuch RelatedWorkSearch: duringthepre-grantstage, inventorsand asasequenceofkeywords,amemo,oracompletetextdocument prosecutors run related work search to retrieve all relevant work (e.g.apatentapplication).Thepurposeofthistaskismanifold,for to the invention. Moreover, some patent offices request from in- example: ventors an applicant’s disclosure document specifying all related publicationswhenfilinganewapplication. • Retrieverelatedpatentstoagivenpatentapplicationinorder togatherrelatedwork,orinvalidateoneormoreofitsclaims. PatentabilitySearch: duringtheexaminationstage,patentexam- inersperformpatentabilitysearchinordertoensurethatthepro- • Explorepatentfilingactivityunderspecifictechnology. posedideasarenovel,non-obvious,andnon-abstract. Theoutput ofthistaskwouldbeasearchreportwithallretrievedrelevantpub- • Explore the competitive landscape of a given company by lications. Inthisreport, eachentrywillhaveaspecialcodeindi- lookingatothercompaniesfilingpatentssimilartothegiven catingwhetheritisjustarelatedpublication,ornoveltybreaking companypatents. one.Examinerswouldalsospecifywhichpassagesorfiguresinre- trievedpublicationsconstituterelevancy. Dependingonthesearch Because of these multiple objectives, various PR tasks were pro- findings,thepatentofficemightgrant,reject,orasktheapplicant posedtofulfilleachobjective,andmultipledatasetswereprovided tomodifythepatentapplication. Patentabilitysearchisalsoper- dependingonthegiventask. formedbypatentprosecutorsasasanitycheck.Althoughthistask should be of equal interest to prosecutors who file the patent ap- 8https://sites.google.com/site/patentdataproject/ plication as it is to examiners, prosecutors often do not dig deep 9http://rosencrantz.berkeley.edu/batchsql/ searching for relevant publications, and delegate finding relevant 10http://portal.uspto.gov/pair/PublicPair priorworktoexaminersinordertosavecosts. 11https://www.google.com/googlebooks/uspto-patents.html Figure1:TaxonomyofPatentRetrieval(PR)methods InfringementSearch:thistask,alsocalledproductclearancesearch, PostGrantReview13(PGR)infrontofthePatentTrialandAppeal aimstoensurewhetheranexistingoraproposedproductisinfring- Board14(PTAB). inganypublishedpatentclaim(s). Patentownersrequirethattype searchtofindoutifathirdpartyhasaproductwithfeatureswhich TechnologySurvey: anotherPRtaskwhere,inatypicalscenario, arecitedbyoneormoreclaimsoftheirpatents. Ifso,theymight business managers would request search professionals to prepare either sue or negotiate a license with that infringing party. The asurveyofpatentdocumentsgivenamemorandumtheyprepared scopeofthesearchherewouldincludeallrelatedpublicationsto from some source (e.g., news article) [9]. This basic scenario is commercialproducts(e.g.,productdescriptions,vouchers...etc). limitedonlytopatentliteratureanditisassumedthatpatentdocu- mentsarejustacollectionoftechnicalpapers. InvestorsandR&Dmanagers,ontheotherhand,requirethattype ofsearchtoensurenewlyproposedproduct(s)arenotinfringinga 5. PATENTRETRIEVALMETHODS published patent claim(s) and investment in such products would Inthissection,wepresentacomprehensivereviewofPRmethods be lucrative. The scope of search in this case would be limited and approaches. We start by presenting available test collections topatentandcopyrightedliteratureonly. Deepunderstandingand andevaluationmetrics. Then,weprovideataxonomyoftheseap- correctinterpretationofpatentclaimsisimperativeforbuildingthe proacheshighlightingtheircharacteristicsandlimitations. correctcorrespondencebetweenproductfeaturesandclaimsinor- dertoestablishordismissinfringement. AsshowninFigure1,PRmethodscanbecategorizeddependingon whichpiece(s)ofdatafromboththesearchqueriesandthesearch Freedom to Operate Search: this PR task extends beyond in- corpusareusedforretrievingrelevantdocuments. Keyword-based fringement search. Here, investors and R&D managers not only methodsutilizeonlytermsfromsearchqueriesandlookforexact needtomakesurethatproposedproductsdonotinfringeanexist- matches in the target corpus. Pseudo Relevance Feedback meth- ingpatentorcopyrightedmaterial,butalsotoensuretheyhavethe odsutilizetermsfromthetoprankedresultsofrunningtheinitial freedom to filepatents on theseproducts without worrying about query to improve the set of relevant retrieved results. Semantic- previouslypriorartthatmightinvalidatesuchinventions. Another basedmethodstrytoovercomethevocabularymismatchproblem objectiveoffreedomtooperatesearchistomakebetterinvestment betweenthesearchtermsandrelatedpatentsvocabularybymatch- decisionsandR&Dplansaccordingtoexistingpriorart. ingthembasedontheirmeanings.Metadata-basedmethodsexploit the language independent non-textual metadata and bibliographic InvaliditySearch: aspatentsguaranteemonopolyrightstotheir information in order to improve patent retrievability. Finally, in- owners on the economic value of granted inventions, companies teractivemethodsaimtobetterorganizeandpresentsearchresults andotherpartiesusuallymonitorgrantedpatentsoftheircompeti- totheusers. Moreover, throughinteraction, usersareengagedin tors or pertaining to their technology landscape to ensure com- aniterativeprocessofsearching,reviewing,andrefininghopingto petitive superiority. Therefore, invalidity search is performed to retrieveasmanyrelevantresultsaspossible. findpublishedmaterialthatwasmissedbythepatentofficeduring patentabilitysearch.Invaliditysearchisalsoconsideredasthefirst line of defense when a party is confronted with patent infringe- 5.1 TestCollections&EvaluationMeasures ment lawsuit. Again published material might include patent or As we highlighted in section 3, several datasets were created to non-patentliteraturesuchasbooks, newsarticles, academicperi- supportevaluatingdifferentPRtechniques. Inalmostallofthese odicals...etc. Afterfindingsuchvaliditybreakingmaterial,athird datasets,relevantdocumentstosearchquerieswerecollectedfrom partymightfileapost-grant(opposition)proceduredependingon thecitationsoftopicpatentdocuments(e.g.,CLEF-IP2009/2010/2011 thepatentofficepolicies. Forexample,theUSPTOprovidespro- collections). Becausethesecitationsrepresentrelatedpriorwork, cedures suchas re-examination, Inter PartesReview12 (IPR), and theyareappropriateonlyfortherelatedworksearchtask. 13http://www.uspto.gov/patents-application-process/appealing- 12http://www.uspto.gov/patents-application-process/appealing- patent-decisions/trials/post-grant-review patent-decisions/trials/inter-partes-review 14https://ptabtrials.uspto.gov Table3:Keyword-basedMethods Method Description Dataset MAP P R PRES Verberne&E.D’hondt[18] • removestopwords,andpunctuation CLEF-IP2009 0.05 0.01 0.22 - • useclaimsasBOW Magdyetal.[19] • removestopwords,andfrequentterms CLEF-IP2009 0.12 - 0.63 - • usedifferentsectionswithmanualweights • performIPCfiltering • usebigramswithtf>1 Mahdabietal.[20] • usequerylanguagemodelsondifferentsections CLEF-IP2010 0.12 - 0.60 0.49 • usequeriesof100terms • performIPCfiltering Wang&Lin[21] • uselinguistic-basedconcepts CLEF-IP2010 0.10 - 0.48 0.40 • concept weighting using weighted tf-idf and mutual information Konishi[22] • patternstoidentifyclaimcomponentsterms NTCIR-5 0.20 - - - • patternsforexplanationtermsfromdescription • rankboostingbasedonIPC InotherdatasetssuchasCLEF-IP2012/2013collections,relevant • Query Expansion (QE): where representative terms other documentswerecollectedfromnoveltybreakingcitationsfoundin thantheonesinQareextractedandmergedwithQtoform examiners’searchreports,therefore,thesedatasetsareappropriate Q¯.PseudoRelevanceFeedback(PRF)methodsarethemost forthepatentabilityandinvaliditysearchtasks. Thoughinvalidity prominentinthiscategorywheretermsfromtoprankedre- searchrequiresnon-patentliteratureaswell. sults of running Q are used to expand Q terms assuming thesetopresultsarerelevant[26].Othersemantic-basedQE StandardIRaswellasPR-specificevaluationmeasuresaregener- methodsworkbyexpandingQwithtermsofsimilarmean- allyusedtoevaluatePRsystemsincluding: ingssuchassynonymsorhyponyms. • Hybrid (QE & QR): where irrelevant terms are removed 1. Precision (P) and Recall (R) at top-K ranks (e.g., K={1, 5, fromQandmorerelevanttermsareappendedtoQtoform 10,50,100,1000}). Q¯. MosttechniquesusedforQEareappropriateforQRas well, where only terms appearing in the expansion list are 2. MeanAveragePrecision(MAP)[23]whichgenerallyfavors keptandallothersarepruned. earlyretrievalofrelevantdocumentswithlessfocusonre- call. 5.2.1 Keyword-basedMethods 3. NormalizedDiscountedCumulativeGain(nDCG)[24]which favorsnotonlyearlyretrievalofrelevantdocumentsbutalso Thissetoftechniquesretrievesrelevantdocumentsbylookingfor therespectiverankingqualityofthesedocuments. exact matches between search query term(s) and the target data. keywordsearchoperatesundertheclosedvocabularyassumption 4. Patent Retrieval Evaluation Score (PRES) [25] which was wherevocabularyisderivedsolelyfromtermsthatappearinthetar- proposed specifically for recall-oriented tasks such as PR. getsearchdata.Table3showssomekeyword-basedmethodsalong PRESfocusesontheoverallsystemrecallaswellasuser’s withtheirperformanceresultsonbenchmarkdatasets. Keyword- review effort which can be estimated from the rankings at basedtechniquesdifferin:1)whichelementsofthetargetdataare whichrelevantdocumentsareretrieved. indexed,2)whichquerytermsareselected/removed,3)therelative weightsofsuchterms,and4)thematchscoringfunction. 5.2 QueryReformulation(QRE) ThemostwidelyusedtechniquesforPRaretheQueryReformu- Query Reduction (QR): the rationale behind QR approaches is lation(QRE)techniques. Thesemethodsaimattransformingthe intuitiveaspatentsareverylongdocumentswithseveralsections. input query Q into Q¯ by means of reduction or expansion of Q Queryingwiththewholedocumentwouldbeimpracticalandinef- termsinordertoimprovetheretrievabilityofrelevantdocuments. ficient. SomeQRmethodsareposition-based;theyselectrelevant QREcanbeperformedthrough: termsbasedontheirpositioninthepatentdocument[18–20,31,32]. Forexample,VerberneandD’hondt[18]usedonlytermsfromthe claimssectionontheCLEF-IP2009collection. However,there- • Query Reduction (QR): where a representative subset of sults were moderate in terms in MAP compared to other runs on terms are selected from Q and used as Q¯ terms. Position- thesamecollection. basedmethodsarethemostcommonlyusedinthiscategory wheretermsfromspecificpartsorsectionsofthepatentdoc- Magdyetal.[19]experimentedusingtextfromdifferentsections umentareused,orgivenhighermatchingweightthanothers. of the topic patent on the CLEF-IP 2009 collection. The authors AnotherexampleofQRistheIPC-basedmethodswhichuti- usedvariouscombinationsofsectionsincluding: 1)shortsections lizetermsfromIPCdefinitionsasalexiconorstop-wordslist such as title, abstract, first line of the description, first sentence forQ. oftheclaims,and2)lengthysectionssuchasthedescriptionand Table4:PseudoRelevanceFeedbackMethods Method Description Dataset MAP PRES Bouadjeneketal.[27] • usedifferentmethodsofQEandQRfromthePRFset • CLEF-IP2010 • 0.13 • 0.55 • useRicchio,MMR,LM • CLEF-IP2011 • 0.10 • 0.45 Magdyetal.[19] • naivePRF CLEF-IP2009 0.05 - • removestop-words • usemostfrequentterms Mahdabi&Crestani[28] • buildregressionmodelusingrelevancescore,RFsim- CLEF-IP2010 0.16 0.56 ilarities...etc • usethemodeltoestimatetheeffectivenessofRF • usetop100RFandmaximizeAP Gangulyetal.[29] • performquerysegmentation CLEF-IP2010 0.14 0.47 • retainsegmentshighlytobegeneratedusingRFLM GolestanFaretal.[30] • manuallyannotateonerelevantRFresult CLEF-IP2010 0.2915 - • addtermsintheannotatedresulttothequery GolestanFaretal.[30] • assumerelevantRFresultsareknown CLEF-IP2010 0.4816 - • addtermsmorefrequentinrelevantthanirrelevantRF toquery theclaims. Theauthorsassigneddifferentweightstoeachsection linguisticanalysisofpatenttexts. manually. Their best scores were achieved using a combination ofallshortsectionsandpost-filteringretrieveddocumentskeeping 5.2.2 PseudoRelevanceFeedback(PRF) onlythosethatsharethesameIPCclassificationcodewiththetopic ThesemethodsareoneoftheprominenttechniquesusedforQRE. patent. The main challenge with such approach is how to assign PRF starts with an initial run of the given query Q. Then, terms therespectiveweightofeachsectionautomatically.Moreover,IPC fromtoprankedresultsareusedtoselect,remove,and/orexpand filteringwouldn’tbepossiblewhenonlypartialpatentapplication termsinQ,assumingthatthesetopresultsarerelevant.PRFisthus isavailableforpriorartsearch. advantageous as it works automatically without human interven- tionbutmightbecomputationallyinefficientespeciallywithlong Mahdabietal.[20]proposedaposition-basedQRmethodwhich queries.Table4showssomePRFmethodsalongwiththeirperfor- selectsrelevantquerytermsbybuildingtwoquerylanguagemod- manceresultsonbenchmarkdatasets. els using various sections of the topic patent: 1) a variant of the weighted log-likelihood model [33] , and 2) a model based on Despitetheireffectivenessandpopularity,severalchallengesarise theparsimoniouslanguagemodel[34]. Theirexperimentsshowed whenitcomestoPRF-basedQRE[27]suchas: 1)whichpart(s) thatqueriesconstructedfromtermsinthedescriptionsectionus- of the patent application should be used as the initial query?, 2) ingweightedlog-likelihoodgivebetterresultsthanothersections whichpart(s)oftheretrievedresultsshouldbeusedasthesource whichagreeswiththeresultsin[27,35,36]. Themainadvantage of expansion and/or reduction?, 3) what is the best length of the ofthisapproachisthat,respectiveweightsofquerytermsarede- expansion list in case of QE, or the best threshold for removing rived automatically from the query model. However, some chal- termsincaseofQR?, 4)whichpseudo-relevantresultsarereally lengesstillexistregardingtuningthemodelparameterssuchasthe relevantandhowmanyofthemshouldbeused?,and5)whatisthe smoothingparameterwhichwassetheuristically. bestrelevancescoringmodelforthesearchtask(e.g.,BM25[39], thevectorspacemodelwithtf-idfweighting...etc). QueryExpansion(QE):pattern-basedQREwasproposedinmany studies [22,37,38]. Wang and Lin [38] proposed patterns in the Bouadjeneketal.[27]providedathoroughevaluationontheCLEF- formofsyntacticrulesinordertoextractquerytermsasweighted IP2010/2011collectionstoaddresssomeoftheabovechallenges. concepts.In[22],Konishiproposedapattern-basedQEmethodfor Theauthorsexploredthescenariowhenonlypartialpatentappli- thepatentinvaliditysearchtaskontheNTCIR-5collection.Rather cationisavailableforpriorartsearch(e.g.,title,abstract,extended thanusingrawtermsfromtopicpatent’sclaimswhichareoftenab- abstract,ordescription). TheauthorstesteddifferentQEandQR stract,theauthor,usingpatternmatching,identifiesotherspecific generalmethodssuchasRocchio[40]andavariantoftheMaximal terms in the description and use them as expansion terms. First, MarginalRelevance(MMR)[41]. Theyalsotestedpatent-specific componentsoftheinventionareextractedfromthetopicclaimus- methods utilizing synonym sets [42], language models [29], and inghandcraftedpatterns. Secondly,explanationsentencesdescrib- IPC-based lexicon [43]. After experimenting various sections as ingcomponentsoftheinventionareextractedfromthedescription sourcesfortheinitialquerytermsaswellasexpansion/reduction using handcrafted patterns. Thirdly, terms from first and second sources, the results showed that, the description section among stepsareusedasthenewquery. TheresultsshowedthatthisQE othersectionsisthebesttouseastheinitialqueryincaseofboth approachworksbetterthanusingtermsextractedfromtheclaims QEandQR.QEwasnotbeneficialforthelongdescriptionqueries sectiononly. Themaindrawbackofthismethodisitsdependency asitalreadycontainsgoodcoverageofrelevantterms. However, onmanuallycodedpatternstoidentifypotentialterms.Meanwhile, QR on description queries was useful as it removed many of the itdemonstratesthepotentialofusingentitiesandtheirrelationsas retrievalfeaturesmotivatingtheneedfordeeperandmoregeneric 15Thisisasemi-supervisedperformance 16ThisisanOracleperformance noisyterms. Generally, QRoutperformedQEondescriptionand tion/expansionofrelevantqueryterms.Tomotivatetheefficacyof extendedabstractquerieswhichindicatesthat,withlongqueries, QREonretrievalperformance,theauthorsfirstdesignedanexperi- QR is effective for better retrieval performance. The results also mentwhererelevancejudgmentsofaquerypatentQwereassumed showedthatgenericQEmethodssuchasRicchioworksgenerally to be known in advance. After running Q, using PRF on top-K betterforQEthanpatent-specificQEmethods. Finally,theresults documents,onlytermsthataremorefrequentinretrievedrelevant showedthatBM25scoringworksbetterthantheTF-IDFscoring documents(thosefromrelevancejudgments)thanirrelevantdocu- on the long description queries for both QR and QE, while TF- mentsarekeptandusedasQ¯. Then, queryingusingQ¯ achieved IDFworksbetterthanBM25onshortandmedium-lengthtitleor abetterperformancethanstate-of-the-artontheEnglishsubsetof abstractqueries. Throughthiscomprehensiveexperimentalstudy, CLEF-IP2010collection.ToapproximateQ¯automatically,theau- theauthorsdidnotevaluatetheimpactofusingmultiplesections thors proposed four different methods hoping to identify relevant in combination as sources for QE or QR. More importantly, the vs. irrelevanttermsinQby: 1)removingtermswithhighdocu- studydoesnotprovideanyinsightsabouttherespectivevaluesof ment frequency in the top-100 retrieved documents, 2) removing numberofexpansiontermsortermremovalthresholdandwhether infrequent terms in Q, 3) using frequent terms in relevant docu- thesevaluesaresomewhatdeterministicorvarywidelycallingfor mentsassumingthetop-5retrieveddocumentsarerelevant,and4) interactivesetting. performingQRonQusingIPCdefinitionsasstop-words. Allof thefourmethodsfailedtoperformbetterthanthekeyword-based ToaddresstheproblemofpoorPRFresultsinPRcomparedtotra- baseline. Moreinterestingly,theauthorsdemonstratedthat,base- ditionalIR,BashirandRauber[44]proposedanovelapproachfor lineperformancecanbedoubledifonlyonerelevantdocumentwas PRF-basedQEwhichbuildsamodelthatlearnstoidentifybetter manuallyprovidedbytheuser. Thislastobservationmotivatesthe PRF results based on their similarity with the query patent over needforinteractiveQREasasimpleandeffectivemethodforPR. specificterms.Thesetermsarelearnedbybuildingaclassification modelthatclassifieswhetheratermwouldbeusefulforQEornot 5.2.3 Semantic-basedMethods according to some proximity features between the original query Aswementionedbefore, inPRqueriescanvaryfromfewterms termsandpseudo-relevantterms.Theauthors,throughexperiments (e.g., survey memo) to thousands of terms (e.g., full patent ap- onasubsetofUSPTOpatents,showedtheabilityofthismodelto plication). Straightforward keyword-based PR proved to be inef- introducemorerelevantQEtermsandsubsequentlyincreasingthe fectivesimplybecauseofthevocabularymismatchbetweenquery retrievability of individual patents. However, the authors did not termsandrelevantpatentscontent. Magdyetal.[19]showedthat, evaluatethismodelonaanyoftheavailabletestcollections.More- in the CLEF-IP 2009 collection, 12% of the relevant documents over,extractingsimilarityfeaturesandcomputingsimilaritieswith havenocommonwordswiththesearchtopics. Thismotivatesthe PRF results during query execution is computationally expensive needfornovelapproachestobridgethisvocabularymismatchgap. andtimeconsuming. Severalsemantic-basedmethodshavebeenproposedinattemptto match queries with relevant documents based on their meanings Along the same efforts, Mahdabi and Crestani [28] proposed a ratherthanrelyingonkeywordmatchesonly. Table5showssome frameworkforidentifyingeffectivePRFdocumentsatruntimeand semantic-based methods along with their performance results on then performing QE using terms from these relevant documents. benchmarkdatasets. The authors first proposed patent-specific features and then used themtobuildaregressionmodelwhichcalculatesarelevancyscore Dictionary-based: semantic-basedmethodsperformQREbyex- ofeachPRFdocument. ThoughresultsontheCLEF-IP2010col- pandingthequerytoincludeothertermsthathavesimilarmeanings lection were encouraging, several challenges still exist. For ex- to the original query terms. The first category of these methods ample,thecomputationalcomplexityofcalculatingtheregression arethedictionary-basedtechniqueswhichuseeithergeneric[42], modelfeaturesatruntime.AndPRFparameterstuning(e.g.,num- technical[51],orpatent-specificdictionaries[46,47,49,50,52,53] berofPRFdocumentstouse). forQRE.Genericdictionariescouldbeexistinglexicaldatabases suchasWordNet[54], whilepatent-specificdictionariesarelexi- In[29],Gangulyetal. proposedaPRFapproachwhichutilizesa caldatabasesgeneratedfrompatent-relateddatasuchasexaminer’s languagemodelforQRoflongqueriescomposedoffullpatentap- querylogs. Ineithercase, similarorrelatedtermstotheoriginal plications.Theauthorsarguedthat,naiveapplicationofPRFtoex- querytermsareretrievedfromsuchdictionariesandusedforQE. pandquerytermscouldaddnoisytermscausingquery-topicdrift. Moreover,naiveremovaloftermsthathasunittermfrequencyin MagdyandJones[42]exploredtheuseofWordNetforQEinPR the query could cause removal of useful terms and thus hurt re- on the CLEF-IP 2010 collection. Overall, adding synonyms and trieval effectiveness. Instead, the authors proposed a PRF-based hyponymsfornounsandverbsintheoriginalqueryincreasedthe QR technique which generates language model similarity scores MAPscoreslightly,whiledecreasedthePRESscoresignificantly. between query segments (sentences or n-grams) and top ranked Moreover, queryexecutiontimewasincreasedconsiderably. The results. Segments with top scores are kept and all others are re- authorsconsideredthisa"negative"result. AstheuseofWordNet moved. Results on the English subset of the CLEF-IP 2010 col- wasproventobeeffectiveinotherretrievaltasks[55,56],moreex- lectionshowedthattheproposedapproachoutperformsthebase- perimentsareneededtoaffirmtheauthors’conclusion. Forexam- lines.Parametertuningisstillthemaindownsideofthistechnique. ple,investigatingtheimpactofusingsynonymsonlyorhyponyms Theperformanceoftheproposedapproachwasunstablecompared only, and expanding terms belonging to specific sections or am- tothebaselineswithdifferentparametervalues. Specifically, the biguoustermsonly. window size, the number of pseudo-relevant documents, and the fractionoftermstoretain. Recently,moreresearchwasfocusedonutilizingdomain-specific andtechnicaldictionariesratherthanWordNet. Examiners’query In[30],GolestanFaretal.providedastudyonhybridQREwhich logs have been an important resource for building such technical aimstoautomaticallyapproximatetheoptimalQ¯bycarefulselec- thesauri. Tannebaum et al. [45–50] introduced an analysis of the Table5:Semantic-basedMethods Method Description Dataset MAP PRES Magdy&Jones[42] • useWordnetsynonymsandhyponymsforQE CLEF-IP2010 • 0.136 • 0.484 • slowprocessingtime • 0.140(cid:63) • 0.486(cid:63) • noimprovement Tannebaum&Andreas[45–50] • mine query logs for synonyms, co-occurring, and CLEF-IP2010 • 139 • 512 proximityterms • 139(cid:63) • 512(cid:63) • noimprovement • useuponrequest Magdy&Jones[42] • using synonyms learned from parallel translations CLEF-IP2010 • 144 • 485 (EN,GE,andFR) • 140(cid:63) • 486(cid:63) • improveMAPonly • useuponrequest (cid:63)indicatesbaselineperformance USPTOexaminers’searchquerylogs. Theiranalysis,thoughona thetypicalsearchpractices, whereitisneededtoretrieverelated subsetofquerylogs,revealedinterestinginsightsaboutpatentex- patentdocumentsnotrelatedclassificationcodes. aminers’searchbehaviorwhichcouldbeveryusefulfordesigning effective PR systems. For example, the authors noted that about Anothercorpus-basedmethodwasproposedbyMagdyandJones examiners’behaviorwhilesearchingforpriorart: 1)theaverage [42], where synonym sets were automatically generated from the query length is four terms, 2) search terms are mostly from the CLEF-IPpatentcorpus. Theauthorsutilizedparalleltranslations patentapplicationunderinvestigation,3)expansiontermsrepresent of patent sections in different languages to build a word-to-word smallpercentageofquerytermsandmostlyappearinthespecific translationmodelandinfersynonymyrelationwhenawordinone patentdomainterminology, 4)themajorityofquerytermsrepre- languageistranslatedtomultiplewordsinanotherlanguage.These sentssubjecttechnicalfeaturesthatappearsintheclaimssection, multiplewordsundersomeprobabilisticthresholdcouldbeconsid- whileverylittlepercentageofthemappearsinthedescriptionsec- eredsynonyms. Overallresultsusingthismethodwerebetterthan tion, 5) the majority of terms are nouns, followed by verbs, then PRF and Wordnet based QE, but worse than the keyword-based adjectives,and6)abouthalfofthequeryoperatorsusedare"OR", baseline in [36]. The authors also showed that, the performance followedby"AND",thenproximityoperators. ofthismethodonsometopicswasbetterthanthebaselinewhich indicates its potential. The issue they raised is how to more ef- Tannebaum et al. built upon these insights and introduced meth- fectively apply QE by selecting "good" terms [26], or predicting ods to automatically identify synonyms/equivalents, co-occurring QEperformancebeforehand[58,59]. Suchchallengescanalsobe terms,andproximityrelationsforexpandingquerytermsbymin- alleviatedsemi-automaticallybydevelopingintelligentandusable ingexaminers’searchlogs. Aswecannotice,learningexpansion interactive QE frameworks which engage users in such decision. termsfromquerylogsmightbemisleadingbecausenotallquery Finally, KrestelandSmyth[60]appliedtopicmodelingofsearch sessionssucceedtoidentifypriorart.Additionally,deeperanalysis hitsinordertobetterrankretrievedpatents.Theresultsonasmall ofthequerylogsconsideringothermetadatasuchasrelevanthits collectionoftheUSPTOpatentsshowedimprovedMAP. countmightbeusefulinthisregard.Ontheotherhand,itwouldbe moreusefulifwecanmodelthefeaturesoftheseterms,forexample,basedontheirlocation,frequency,part-of-speech...etc. From 5.2.4 Metadata-basedMethods effectiveness perspective, evaluating the generated lexical knowl- Patents are not only textual documents, they contain lots of non- edgeontheCLEF-IP2010collectiondidnotrecordsignificantim- textualmetadataandbibliographicinformationaswell(e.g.,cita- provement [49]. Therefor, the authors recommended using it in tions, tables, formulas, drawings, classification...etc). Combining aninteractivemoderatherthanautomaticmodetosemi-automate metadataanalysiswithtext-basedPRhasshownimprovementsin querygeneration. performanceintheliterature[51,61,62,64,65]. Metadatafeatures are also language independent making them advantageous when Corpus-based: the second category of semantic-based QRE is usedforCLIR.Table6showssomemetadata-basedmethodsalong the corpus-based methods. In these methods, textual corpora are withtheirperformanceresultsonbenchmarkdatasets analyzed to extract semantically related concepts to query terms whichcanbeusedforQE.Al-ShboulandMyaeng[57]proposeda Citation-based: The use of citation analysis for better PR is the Wikipedia-basedQEmethodwhichworksbyfirstcreatingasum- mostheavilyreportedtechniqueofmetadata-basedmethods.Naively maryofeachWikipediaarticlecontainingthemaincategory,allti- incorporating citations from topic patent applications as prior art tlesunderthemaincategory,andothercategorieswithin/outlinks provedtobeeffective,eliminatingtheneedfordeepercitationanal- tothemaincategory. Atquerytime,querytermsandphrasesare ysis [66]. However, citation extraction from patent texts is chal- matchedwithpagesummaries,then,phrasesfrommatchingpages lengingbecausethereisnostandardwritingstyleforpatentrefer- arescoredandselectedforQEundertheassumptionthattheyare ences.LopezandRomary[64]developedatoolforcitationmining semanticallyrelated.ExperimentsonthesubsetofUSPTOpatents whichidentifies, parses, normalizes, andconsolidatespatentcita- intheNTCIR-6collectionshowedincreaseinMAPoverotherQE tions. As citations might not be always available in all scenarios techniques. However,theauthorsusedIPCcodesratherthancita- (e.g., related work search, technology survey...etc), more mature tionsasrelevancejudgmentstotopicquerieswhichdoesnotreflect techniques are needed. Fujii [61] proposed using PageRank [67] anddocumentpopularityasanadditionalscoringtore-rankquery Table6:Metadata-basedMethods Method Description Dataset MAP PRES Fujii[61] • usePageRankonpatentscitationgraph NTCIR-6 • 0.075 - • usepatentpopularityamongtopresultswithweighted • 0.081 voting • 0.071(cid:63) Mahdabi&Crestani[62] • build query specific citation graph from PRF results CLEF-IP2011 • 0.105 • 0.481 andtheircitations • 0.099(cid:63) • 0.450(cid:63) • weightnodesusingPageRank • estimatequeryLMfromthegraphnodesconsidering theirPageRankscores Mahdabi&Crestani[63] • using time-aware random walk on weighted citation CLEF-IP2011 • 0.125 • 0.536 graph • 0.058(cid:63) • - (cid:63)indicatesbaselineperformance topresultsreturnedusingclaims-basedqueries. Theresultsofap- tioncodes,however,almostallpreviousresearchconsideredonly plying popularity scoring on the English subset of NTCIR-6 im- theprimaryclassbutnotsecondaryclassificationswhichmight,if proved MAP and recall over the raw text-based scoring. Incor- utilized,improvetheretrievalperformance. poratingPageRank,thoughintuitive,posesmanychallengesespe- ciallybecausepatentdocumentshavereferencestonon-patentlit- Hybrid: thesemethodsutilizevarioussourcesofmetadatatoim- eraturewhichwouldproduceincompletecitationgraph. Mahdabi provePRperformance. In[63],MahdabiandCrestanibuiltupon andCrestani[62]extendedtheirquerymodelingtechniquein[20] previousworkin[20,62]andproposedaQEmethodthatutilizes by incorporating term distributions of the PRF results as well as time-aware random walk on a weighted patent citations network. their citations in calculating the query language model. The au- Citationweightsarederivedfromvariousmetadata(e.g.,classifica- thorsfirstconstructaquery-specificcitationgraphusingPRFre- tioncodes,inventors,assignee...etc).Citationswithhigherweights sultsandtheircitationsandassignascoreforeachofthemusing areconsideredmoreinfluentialwhenperformingQE.Experiments PageRank. Then, aquerymodelisestimatedfromtermdistribu- ontheCLEF-IP2010/2011collectionsshowimprovedrecalland tions of the documents in the citation graph constrained by their MAP.Theauthorsin[53]proposedbuildingaquery-specificlexi- respectivePageRank.Finally,QEisperformedusingtheestimated confromIPCdefinitionpagesandusingitforQE.Unfortunately, querymodel.ExperimentsontheCLEF-IP2011collectionshowed thelexiconwouldbehelpfulonlyifthequeryrepresentsacomplete improved recall performance with no change in precision, which patentdocumentwithIPCcodesassignedtoitwhichisnotalways indicates the usefulness of using cited documents vocabulary for thecaseespeciallyattheearlystagesofthepatentlife-cycle. QE.Bestimprovementswereachievedusingthetop30PRFdoc- uments, 2-levels citation graph, and 100 expansion terms. How- 5.2.5 InteractiveMethods ever,wecannoticetwomaincomputationalchallengesusingthis InteractivePRisinevitable. Aswecannoticefromtheabovere- techniqueinreal-timesetting: 1)computingthePageRankofthe view,effectivefullyautomatedretrievalofpatentpriorartisvery 2-levelcitationsgraph,and2)estimatingthequerymodelfromtop challenging. Best methods perform around average in terms of PRFdocumentsaswellasdocumentsinthecitationgraph. PRESandmuchlessintermsofMAP.Additionally,thesemethods requiretuningalargenumberofparametersandthresholdswhose Classification-based:thesemethodsutilizeclassificationinforma- optimalvaluesdifferaccordingtothegivenqueryandthespecific tionofthetopicpatentandtheretrieveddocumentstoimprovethe informationneed. Forexample, decidingwhichpatentsectionto performanceofPR[68–72]. ThenaiveuseofIPCclassificationis use?,whichPRFresults?,andwhichexpansiontermsandtheirre- tofilterretrieveddocumentstokeeponlyonesthatsharethesame spective weights?. The answers of these questions are not deter- IPCclassificationcodeatsomelevel(e.g.,samesubclass)withthe ministicandprobablyrequiremultipleinteractioncycleswiththe topicpatent[19,73].Moresophisticateduseofclassificationinfor- userinordertosatisfyhis/herinformationneed. mationwasintroducedbyVermaandVarmain[74]whoproposed anewrepresentationofpatentdocumentsbasedonIPCclassifica- CurrentinteractivemethodsinPRaremorefocusedonbetterorga- tions.ThemethodutilizesIPCcodesassignedtothecorpuspatents nization,integration,andutilizationofstructuredandtextualpatent as well as codes of their citing documents to form an IPC class datathanonbetterretrievalperformance.Inotherwords,PRisad- vector. First,thevectorisinitializedfrompatent’sIPCcode,then dressedasaprofessionalsearchproblemratherthanpriorartsearch codesofcitingpatentsarepropagatedovermultipleiterations.The problem.In[75],FafaliosandTzitzikaspresentedakeyword-based mostsimilarpatentsareretrievedusingcosinesimilaritybetween interactivesearchframeworktosupportpatentsearch. Theinter- IPC class vectors and re-ranked using text-based search utilizing action elements are presented through post-analysis of search re- thetop20tf-idftopicpatentterms. ExperimentsontheCLEF-IP sultsintheformoffacetsbasedfeatureslikestaticmetadata(e.g., 2011collectionshowedimprovedrecallbutlowMAPscores. The IPC codes), textual clustering, named entity extraction, semantic instabilityofthepatentclassificationsystemposesarealchallenge enrichments, and others. The framework was applied on patent whenitcomestoincorporatingclassificationmetadataintoPRsys- search[76]andevaluatedusinguserstudyoftwelvepatentexam- tems. Overtime,newclassesareaddedtotheclassificationhierar- iners[77].Evaluationresponsesindicatedoverallacceptanceofthe chyandexistingclassesareexpanded.Inordertodoreliablesearch frameworkintermsofusability,easeofuse,efficiency,learnability. basedonclassificationcodes,thesechangesmustbeaccountedfor However,theauthorsdidnotreportontheeffectivenessorsuccess periodically. Moreover,patentsareassignedtomultipleclassifica- ofthesystemhelpingpatentexaminerstofindpriorart.