Responding to Retrieval: A Proposal to Use Retrieval Information for Better Presentation of Website Content CRavindranathChowdary AnilKumarSingh AnilNelakanti IIT(BHU) IIT(BHU) IIT (BHU) Varanasi,India221005 Varanasi,India221005 Varanasi,India221005 [email protected] [email protected] [email protected] 5 1 ABSTRACT of load that was met by computationally powerful servers. 0 Retrieval and content management are assumed to be mu- Thebiggerchallengewastoorganizeandmakeavailablethe 2 tually exclusive. In this paper we suggest that they need hugeamountofinformationinareadilyconsumablemanner. not beso. In theusual information retrieval scenario, some This required the third entity of retrieval systems. What n informationaboutqueriesleadingtoawebsite(dueto‘hits’ essentially wasatwo-way transaction betweenthehost and a or‘visits’)isavailabletotheserveradministratorofthecon- the client has become three-way with an IR system in the J cernedwebsite. Thisinformationcanusedtobetterpresent middle. 9 the content on the website. Further, we suggest that some 1 moreinformationcanbesharedbytheretrievalsystemwith Clients are served byhosts, a relation facilitated by IRsys- the content provider. This will enable the content provider tems. However, current day IR systems are more than just ] R (any website) to have a more dynamic presentation of the organizer ofweblinks. Theymodeluserchoicesandprefer- content that is in tune with the query trends, without vio- ences to serve them better. We argue that the three-entity I . lating the privacy of the querying user. The result will be unitoftheclient,IRsystemandthehostisgreaterthanthe s c abettersynchronizationbetweenretrievalsystemsandcon- sumofitsparts. Therelationbetweenthesethreeentitiesis [ tentproviders,withthepurposeofimprovingtheuser’sweb ignoredbythecurrentweb-servicearchitecture. Wepresent search experience. This will also give the content provider hereaproposalwhichwillexploitthisrelationshiptobetter 1 a say in this process, given that the content provider is the deliver some aspects for web service usage. v one who knows much more about the content than the re- 9 trieval system. It also means that the content presentation Web designers write content on the pages based on the in- 0 maychangeinresponsetoaquery. Intheend,theuserwill formation provided by the owner of the site. Content in a 5 beabletofindtherelevantcontentmoreeasilyandquickly. website is primarily organized based on the categorization 4 of the information and arranged appropriately by the de- 0 General Terms signer. Inthisentireprocessofthecurrentdesignparadigm, . 1 thequery hasnorole toplayduringthedesign orpresenta- ContentManagementSystems,PersonalizedContent,Infor- 0 tion phases of a website. But, when a search query is given mation sharing 5 to an IR system, it retrieves links of pages that are pre- 1 pared without taking query into consideration on the host 1. INTRODUCTION v: side. Thisis becauseretrieval and contentmanagement are Information retrieval (IR) systems have become integral to i considered mutuallyexclusive, that is, thecontentmanage- X daily activities of millions and will retain their prominence ment system does not knowabout the retrieval system and in years to come. One of the reasons for such importance r theretrievalsystemdoesnotknowhowthecontentprovider a of a good IR system is the amount of data that is available may respond to the query. Due to this shortcoming, both on the web and the pace at which it is increasing. The the content provider and IR system are under performing. numberofwebsitesreportedlyincreasedfromonein1991to In this paper, we try to address this issue by proposing an 1 morethanonebillion inSeptember2014 . Simultaneously, architecture that enables the server hosting the website to therewas anincreasing numberofusersavailing thehosted presentcontentthatisbasedonthequeryposedbytheuser. services. This increase in web usage is more than an issue 1http://www.internetlivestats.com/total-number-of- 2. THE PROPOSEDARCHITECTURE websites/ on 27/10/2014 The outline of the architecture we propose is presented in Figure 1. The scenario is that the user starts a retrieval systemandgivesaquery. Theretrievalsystempresentsthe search results to the user. Out of them, the users selects one and is taken to thedestination website. Whenthe user is taken to that website, the retrieval system also shares some information about the user and the query (subject to privacyrequirements: seeSection5)withtheserverhosting thewebsite. Thehostserverusesthisinformationtopresent the content such that the user might have a better search Feedback-based User's IR Results selection Anonymized Content user attributes request Content Query Host User IR System Management User stay Server System based feedback Content request Query Dependent User stay Content Host response Figure 1: The proposed architecture for more responsive IR and CMS systems experience (see Section 3). This presentation might, for ex- servermaypresentalistofoldactionmovies(couldbefrom ample,makeiteasierfortheusertofindcertainthings. The otherpagesofthehostserver)toAandalist ofnewaction host server will then provide feedback to the retrieval sys- movies to B, in both case in addition to the content at the tem (again subject to privacy requirements) based on the Link1. user’s stay in thewebsite and theuser’s activity duringthe stay. Since the information shared with the host server is Forgettingmaximumbenefitfromthiskindofarchitecture, anonymized, so will be the feedback given to the retrieval thecurrentContentManagement Systems(CMS) likeDru- system. The retrieval system will now use this feedback to pal,Joomla, Djangoetc. mayhavetoberedesignedtotake givebetterresultsinthefuture(seeSection4). Theoverall user queries into account for presenting the final web page result will be better synchronization between the retrieval tobeshowntotheuser. Thiswillallowthehostserver(and system and the host server for the purpose of presenting theCMS)toplayanactiveroleintheprocessofcontentre- better results to the user. Anonymization, opt-out option trieval. Sincethecontentproviderknowsmuchmore about andcustomizationwillbethecentralrequirements,enforced the content than the retrieval system, all that knowledge throughaprotocol(seeSection6),topreventanyabusethat could be used to present dynamic query-aware content to can result from sharing theinformation. theuser. 4. FEEDBACK-AWARE RETRIEVAL 3. QUERY-AWARECONTENTPRESENTA- Aclassicalorbare-bonedretrievalsystem[2]onlytakesinto TION account thequery for retrieval. Some modern retrieval sys- Current state of the art Web servers do not take the query tem go further and use the information available about the into consideration while presenting the content to the user. user for personalizing the results. However, they do not Lotofworkhasbeenreportedonimprovingthearchitecture take into account the user’s activity once the user has se- ofWebserversforvariousapplications[1,6,4]. Manymod- lected and visited one of the web pages. In the proposed els are available to compare thearchitectures of the servers architecture, anonymized information about the user’s ac- [3]. [7] discusses improving the performance of websites by tivitywillbemadeavailabletotheretrievalsystem. Itwill, using edge servers in Fog Computing Architecture. To the thus,bepossibletodesignalgorithmsthattakethisactivity best of our knowledge there is no attempt to use the query into account. Some work in this direction was proposed by by thehost serverto present thecontent totheuser. [5]. Ifahostservercanpresentthecontenttotheuserbasedon The details about this activity might include information thequery,thenitwillbebeneficialtoboththeuserandthe such as the other links on the website that the user clicked hostorganization. SupposethatuserAgivesquery“popular on and the total time that the user spent on the website movies in action genre + old” and that B gives “popular andonvariouspages. Aretrievalsystemmadeawareofthe movies in action genre + latest”. Let us assume that both feedback from the host server should, intuitively, perform the users get Link1 as their first link. We propose that better. thehostserverofLink1shouldpresentdifferentcontentsto each of them based on their query. In this case, the host Somemodernretrievalsystemsalsoprovideadditionallinks aspartofthesummary‘snippet’whilepresentingtheresults Asthisproposalisworkedoutinmoredetailinfuturework, ofretrievaltotheuser. Suchsnippetscanbebetterprepared more such requirements might be identified and will also with thesuggested feedback from thehost server. haveto beaddressed. Additionally, and importantly, the host server can provide Evenafteraddressingtheseissues,oneconcernstillremains extrainformation aboutitscontentaspartof thefeedback. regarding the proposed architecture. Even if the shared in- This extra information will be based on the query and the formation isanonymizedandthehostserverdoesnotknow knowledgeofthecontentthatisavailabletothehostserver. theidentityof theuser, theretrieval system may still know This will allow the content provider to have a say in the theidentityandbeabletoconnecttheactivityoftheuseron presentation of the snippet for the concerned website. The the visited website with the user’s identity. This raises the retrieval system may or may not use this information, de- question whether the retrieval system will come to acquire pendingontheretrievalandsnippetpreparationalgorithm. moreknowledgeabouttheuserthaniswarranted. Thismay be a problematic ethical issue and requires further investi- gation. 5. PRIVACYAND CUSTOMIZATION Ourproposalrequirestheretrievalsystemtosharesomein- 6. RETRIEVAL RESPONSE PROTOCOL formationabouttheuserandthequerywiththehostserver. There are many different kinds of retrieval systems. Sim- Italsorequiresthehostservertoprovidefeedbacktothere- ilarly, there are many different kinds of host servers and trieval system based on the user’s stay in the website after content management systems. If there is to be a flow of in- theuserselectedthewebsitefromtheretrievalresults. This formation between them as suggested in the preceding sec- extra sharing of information immediately raises the ques- tions, then it will have to be precisely regulated so that it tionsofprivacy. Ifourproposalisimplemented,itsdetailed ispossible toimplementsystemswithoutanyconflict. This version will need to include stringent requirements to ad- will require a well-defined and well-designed protocol. We dress all the possible privacy concerns. We list below some call thistheRetrieval ResponseProtocol (RRP). of theserequirements: TheRetrievalResponseProtocolwillregulatetheflowofin- • The first such requirement is that the user’s identity, formation betweentheretrievalsystemandthehostserver. The protocol will be used to initiate, maintain and close a even if known to the retrieval system, will not be re- retrieval session. Assoon as theuser select one result from vealed to the host server. Whatever information is the results provided by the retrieval system in response to shared with the host server will have to be strictly theuserquery,aretrievalsessionwillbeinitiated. Theend- anonymized so as not toreveal theuser’s identity. ing of the session will perhaps have to be timeout based as • The second requirement is that only the relevant in- there is no other way to know when the user has left the website. formation will be shared. If we view this information asalistofattribute-valuepairs,thenonlythatsubset Duringthetimethesessionisalive,theretrievalsystemwill of attribute-value pairs will be shared with the host firstsharetheinformationabouttheuserandthequerywith server that the host server needs to know in order to thehostserver. Afterthat,basedontheuser’sactivity,the betterpresent its content. hostserverwillprovidethefeedbacktotheretrievalsystem. • The third requirement is that an opt-out option will All the activity during this session will be subject to the privacy and customization requirements and the protocol beavailabletoboththeuserandthehostserver. The design will havetotake thisinto account. user will be made aware of the sharing of information and the user will decide whether this sharing is to be The protocol will have to be designed to regulate this re- allowed or not. The information will be shared only trievalsession. Weleavethedesignofthisprotocolforfuture if the user explicitly agrees to it. In the default case, work. there will be no sharing. Similarly, the host server willdecidewhethertoprovidefeedbacktotheretrieval 7. CONCLUSION system or not and thedefault will bethelatter. Inthecurrentinformationretrievalparadigm,thehostdoes • The fourth requirement is that both the user and the notusethequeryinformationforcontentpresentation. The host server must be able to customize sharing of in- retrievalsystemdoesnotknowwhathappensaftertheuser formation. If they decide to share information, they selects a retrieval result. And the host also does not have will further be given the option to select the specific access to the information which is available to the retrieval attributesthattheyarewillingtoshare. Forexample, system. We presented the outline of an architecture that iftheretrievalsystemknowsabouttheuser’slocation, addressestheseissues. Theaimistoprovideabettersearch age,genderandlanguage,thentheusermaydecideto experience to the user through better presentation of the share only location and language. contentbasedonthequeryandbetterretrievalresultsbased onthefeedbacktotheretrievalsystemfromthehostserver. • Theuserwillhavetobeinformedthattheactivityon The retrieval system will share some information with the thevisitedwebsitemaybeusedforprovidingfeedback host serverand thehostserverinturnwill providerelevant to the retrieval system. And theuser will then decide feedback to the retrieval system based on the user’s stay in whether and what part of the activity on the website thewebsite. Thehostusesallthequeryrelatedinformation canbeusedtoprovidefeedbacktotheretrievalsystem. for dynamic content presentation. This revised paradigm forinformationretrievalalsointroducestheissuesofprivacy which will have to be addressed stringently. It also needs a newprotocolforcontentretrievalresponse,whichwebriefly described. Thisprotocolwillregulatetheflowofinformation between the retrieval system and the host server subject to theprivacy and customization requirements. 8. REFERENCES [1] AliBegen,TankutAkgul,andMarkBaugher.Watching videoovertheweb: Part 1: Streamingprotocols. IEEE Internet Computing, 15(2):54–63, March 2011. [2] Sergey Brin and Lawrence Page. Theanatomy of a large-scale hypertextualweb search engine. In Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pages 107–117, Amsterdam,The Netherlands,The Netherlands,1998. Elsevier Science PublishersB. V. [3] AshifS. Harji, Peter A.Buhr, and Tim Brecht. Comparing high-performance multi-coreweb-server architectures. InProceedings of the 5th Annual International Systems and Storage Conference, SYSTOR’12, pages 1:1–1:12, New York,NY,USA, 2012. ACM. [4] Raoufehsadat Hashemian, Diwakar Krishnamurthy, Martin Arlitt,and Niklas Carlsson. Improvingthe scalability of a multi-core web server.In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, ICPE ’13, pages 161–172, NewYork,NY,USA,2013. ACM. [5] XuehuaShen,Bin Tan, and ChengXiang Zhai. Context-sensitiveinformation retrieval using implicit feedback.In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’05, pages 43–50, New York,NY,USA,2005. ACM. [6] Bryan Vealand AnnieFoong. Performance scalability of a multi-coreweb server. In Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, ANCS ’07, pages 57–66, NewYork,NY,USA,2007. ACM. [7] Jiang Zhu,D.S.Chan, M.S. Prabhu,P. Natarajan, Hao Hu,and F. Bonomi. Improvingweb sites performance usingedge servers in fog computingarchitecture. In IEEE 7th International Symposium on Service Oriented System Engineering (SOSE), pages 320–323, March 2013.