Using Database Technology to Improve Performance of Web Proxy Servers


Post on 19-Jan-2016


DESCRIPTION

Using Database Technology to Improve Performance of Web Proxy Servers. K. Cheng¹, Y. Kambayashi¹, M. Mohania². ¹ Kyoto University, Japan. ² Western Michigan University, USA. PowerPoint presentation.

TRANSCRIPT

  • Using Database Technology to Improve Performance of Web Proxy Servers

    K. Cheng¹, Y. Kambayashi¹, M. Mohania²
    ¹ Kyoto University, Japan
    ² Western Michigan University, USA

    WebDB'2001, Santa Barbara CA

  • Caching on web proxy servers

    Improve throughput of proxy servers
    Improve response times for end users
    Bridge the bandwidth gap between WAN and LAN
    Distribute workload from web servers
    [Diagram labels: Web Servers, Clients]


  • Characteristics of proxy caching

    Property               Traditional caching     Proxy caching
    Storage                Memory-based            Disk-based
    Cache size             Small                   Huge
    Object survival time   Short                   Long
    Algorithm              Simple                  Can be complex
    Who uses it?           Programmed processes    People with specific interests


  • Limitations of current caching schemes: case 1

    Tom found a very good page, P1, about car models.
    John is also looking for that kind of page, but he only found P2.
    Both P1 and P2 were cached, but Tom didn't know about P2 and John didn't know about P1.
    After several days, however, both were replaced because neither was visited again.
    As a result, Tom missed P2, John missed P1, and the cache missed 2 hits.
    State-of-the-art caching schemes cannot deal with this case!


  • Limitations of current caching schemes: case 2

    Suppose the users of a proxy server are mostly interested in XML, but rarely in Fuzzy.
    Suppose some clients retrieved pages P1 and P2.
    After checking the content of P1 and P2, we know P1 is an XML page and P2 is a Fuzzy page.
    Should we prefer to cache P1 or P2?


  • Why can't current schemes deal with these cases?

    Physical-object-based cache management
      Content transparency, hence a low utilization rate (Case 1)
      Approximately 60% of the data in cache is never used
      Approximately 90% of the data in cache is rarely used
    Usage-based object replacement
      Needlessly long stay time for irrelevant content (Case 2)


  • Our solution

    We propose a hierarchical data model for management of web data (physical pages, logical pages and topics).
    Object replacement based on
      Link structure (logical pages)
      Semantic similarity with other objects (topics)
    Facilitate active access to cache contents


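  • A hierarchical model for web data

    [Diagram: three layers connected by mappings. Topics (T1, T2), handled by the topic manager; logical pages (L1, L2, L3), handled by the logical page manager; physical pages (p1 to p6), handled by the physical page manager. Access paths shown: navigate, search, browse.]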
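The three layers and the mappings between them can be sketched as a small in-memory data structure. This is only an illustration under assumptions: the class names (`PhysicalPage`, `LogicalPage`, `Topic`, `HierarchicalCache`) and their methods are hypothetical and do not come from the slides, which only name the layers and the managers.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalPage:
    url: str
    size: int  # size in bytes


@dataclass
class LogicalPage:
    name: str
    physical_urls: set[str] = field(default_factory=set)


@dataclass
class Topic:
    name: str
    logical_names: set[str] = field(default_factory=set)


class HierarchicalCache:
    """Hypothetical container: topics -> logical pages -> physical pages."""

    def __init__(self) -> None:
        self.physical: dict[str, PhysicalPage] = {}
        self.logical: dict[str, LogicalPage] = {}
        self.topics: dict[str, Topic] = {}

    def add_physical(self, url: str, size: int) -> None:
        self.physical[url] = PhysicalPage(url, size)

    def add_logical(self, name: str, urls: set[str]) -> None:
        self.logical[name] = LogicalPage(name, set(urls))

    def add_topic(self, name: str, logical_names: set[str]) -> None:
        self.topics[name] = Topic(name, set(logical_names))

    def pages_for_topic(self, topic_name: str) -> set[str]:
        """Follow both mappings down from a topic to its physical pages."""
        urls: set[str] = set()
        for logical_name in self.topics[topic_name].logical_names:
            urls |= self.logical[logical_name].physical_urls
        return urls

    def owners(self, url: str) -> set[str]:
        """Logical pages that contain a physical page (it may be shared)."""
        return {lp.name for lp in self.logical.values() if url in lp.physical_urls}


cache = HierarchicalCache()
for url in ("p1", "p2", "p3"):
    cache.add_physical(url, size=1024)
cache.add_logical("L1", {"p1", "p2"})
cache.add_logical("L2", {"p2", "p3"})
cache.add_topic("T1", {"L1"})
```

Keeping the mappings as sets of names rather than object references is one possible choice; it makes a shared physical page (here `p2`) easy to detect through `owners`.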


  • Physical pages

    http://www.difa.unibas.it/webdb2001/instructionsPage/index.html
    ../icons/webdblogo.gif
    [Figure labels: Physical page A, Physical page B]


  • Logical page

    [Figure labels: A, B]


  • Managing physical pages

    Physical page
      HTML/plain text file (.html, .txt)
      Embedded media file (.gif, .png, .wav, .mp3)
      Application-generated file (.pdf, .ps, .doc)
    Managing physical pages based on
      URL (protocol, IP, port, path)
      Physical properties (e.g. size, cost)
      Usage (frequency, recency)


  • Constructing logical pages

    Basic logical pages
      Single multimedia document
      HTML (1) + embedded media files (1..*)
    Extended logical pages
      Several closely related, directly linked pages
      E.g. an HTML paper with sections on different multimedia documents
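A basic logical page (one HTML file plus its embedded media files) could be assembled roughly as follows. This is a minimal sketch, assuming that embedded media appear only as `<img src=...>` tags; the names `EmbeddedMediaParser` and `basic_logical_page` and the `example.org` URLs are hypothetical and not from the slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedMediaParser(HTMLParser):
    """Collects absolute URLs of images embedded in an HTML document."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.media_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Assumption: only <img> tags are treated as embedded media here.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the HTML page's URL.
                self.media_urls.append(urljoin(self.base_url, src))


def basic_logical_page(html_url: str, html_text: str) -> list[str]:
    """Return the physical pages forming one basic logical page:
    the HTML file first, followed by its embedded media files."""
    parser = EmbeddedMediaParser(html_url)
    parser.feed(html_text)
    parser.close()
    return [html_url] + parser.media_urls


page = basic_logical_page(
    "http://example.org/site/page/index.html",
    '<html><body><img src="../icons/logo.gif"></body></html>',
)
```

An extended logical page would additionally follow `<a href=...>` links to closely related pages; how "closely related" is decided is not specified in the slides, so it is left out of this sketch.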


  • Managing topics

    Defining a topic
      Topic =
      Popularity = f(F, R, P, U)
        F: access frequency of the topic
        R: time interval between last access time and current time
        P: number of logical pages belonging to the topic
        U: number of users accessing the topic
    Deciding membership of a logical page in a topic
      IR approaches (e.g. K-NN)
      ML approaches (e.g. Support Vector Machine, SVM)
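The two ideas on this slide can be illustrated with a short sketch. Both functions are assumptions made for illustration: the slides do not give the form of f, so `popularity` uses a hypothetical form that grows with F, P and U and decays with R, and `knn_topic` is a plain bag-of-words k-nearest-neighbour vote with cosine similarity, not necessarily the classifier the authors used.

```python
import math
from collections import Counter


def popularity(F: float, R: float, P: float, U: float) -> float:
    """Hypothetical f(F, R, P, U): more accesses, pages and users raise
    popularity; a longer time since the last access lowers it."""
    return (F + P + U) / (1.0 + R)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_topic(text: str, labeled: list[tuple[str, str]], k: int = 3) -> str:
    """Assign a topic by majority vote among the k most similar
    labeled texts; `labeled` holds (text, topic) pairs."""
    query = Counter(text.lower().split())
    ranked = sorted(
        labeled,
        key=lambda item: cosine(query, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(topic for _, topic in ranked[:k])
    return votes.most_common(1)[0][0]


labeled_pages = [
    ("xml schema parser", "XML"),
    ("xml document markup", "XML"),
    ("fuzzy logic sets", "Fuzzy"),
    ("fuzzy membership function", "Fuzzy"),
]
topic = knn_topic("xml markup parser", labeled_pages, k=3)
```

In practice the text of a logical page would be the concatenated text of its HTML physical pages, and term weighting such as TF-IDF would usually replace raw counts.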


  • Definitions

    We use the term Priority for object replacement. It is a function of several parameters, e.g. access frequency (F), time interval (R), size of object (S), retrieval cost (C), significance (G).
    Significance: importance of the topic


  • Caching policy: LRU-SP+

    Topic management
      Priority = f(F, R, G)
    Logical page management
      Basic logical pages only
      Priority = g(F, R)
    Physical page management
      LRU-SP: size-adjusted and popularity-aware LRU (K. Cheng et al., COMPSAC'00)
      Priority = h(F, R, S)
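The slide gives only the arguments of the three priority functions, not their form. The sketch below fills in hypothetical forms purely to show how the layers differ in what they consider; these are assumptions and are not the LRU-SP or LRU-SP+ formulas from the cited work.

```python
def logical_priority(F: float, R: float) -> float:
    """Hypothetical g(F, R): frequent, recently used pages rank higher."""
    return F / (1.0 + R)


def topic_priority(F: float, R: float, G: float) -> float:
    """Hypothetical f(F, R, G): usage scaled by topic significance G."""
    return G * F / (1.0 + R)


def physical_priority(F: float, R: float, S: float) -> float:
    """Hypothetical h(F, R, S): usage adjusted by object size S, so that
    large objects need more accesses to justify their cache space."""
    return F / (S * (1.0 + R))


# Two topics with identical usage: the more significant one ranks higher.
xml_topic = topic_priority(F=10, R=4, G=2.0)
fuzzy_topic = topic_priority(F=10, R=4, G=0.5)
```

With these forms, a topic that users care about (higher G) outlives an equally used but less significant one, which is the behaviour case 2 asks for.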


  • Evaluate & add new objects

    [Diagram: topics (T1, T2), logical pages (L1, L2, L3) and physical pages (P10, P11, P12, P20, P21, P22, P30, P31, P40, P41, P42), ordered from higher to lower priority. A new object D is evaluated; D is of higher priority.]


  • Replace an object

    Choose a candidate topic (T1)
    T1 has 1 logical page (L1); choose L1
    L1 has 3 physical pages (P10, P11, P12), where P12 is shared by L2
    Choose a victim P* from P10 and P11
    Replace P* with the new page
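The steps above can be sketched as a top-down walk through the hierarchy. This is a minimal sketch under assumptions: the candidate at each level is taken to be the lowest-priority entry, physical pages shared with another logical page are excluded from eviction, and the function name, the dictionary layout and the priority values are hypothetical.

```python
from typing import Optional


def choose_victim(
    topics: dict[str, list[str]],
    logical_pages: dict[str, list[str]],
    priority: dict[str, float],
) -> Optional[str]:
    """Pick a physical page to evict.

    topics: topic name -> logical page names
    logical_pages: logical page name -> physical page ids
    priority: priority of every topic, logical page and physical page
    """
    # Step 1: choose the candidate topic (assumed: lowest priority).
    topic = min(topics, key=lambda t: priority[t])
    # Step 2: choose a logical page within that topic.
    logical = min(topics[topic], key=lambda name: priority[name])
    # Step 3: physical pages shared with other logical pages are protected.
    shared = {
        page
        for name, pages in logical_pages.items()
        if name != logical
        for page in pages
    }
    candidates = [p for p in logical_pages[logical] if p not in shared]
    if not candidates:
        return None
    # Step 4: the victim is the lowest-priority unshared physical page.
    return min(candidates, key=lambda p: priority[p])


topics = {"T1": ["L1"], "T2": ["L2"]}
logical_pages = {"L1": ["P10", "P11", "P12"], "L2": ["P12", "P20"]}
priority = {
    "T1": 1.0, "T2": 3.0,
    "L1": 1.0, "L2": 2.0,
    "P10": 0.8, "P11": 0.3, "P12": 0.1, "P20": 0.9,
}
victim = choose_victim(topics, logical_pages, priority)
```

Here `P12` has the lowest priority but is shared with `L2`, so the victim is chosen from `P10` and `P11` only, mirroring the slide's example.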


  • Preliminary experiments

    Replay access logs of our proxy server (Squid)
      30 clients, 30 days
      873,824 requests, 21.30 GB data
      7 topics, priority [1..5]
    Significance factor ([0, 2])
      Measures the significance of each topic
    Hit rate (HR): percentage of requests satisfied by the cache
    Profit rate (PR): takes the significance of each topic into account
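The hit rate defined above is straightforward to compute from a replayed log. The sketch below also shows a significance-weighted variant, but note that this variant is an assumption: the profit rate formula is not recoverable from the slide, so `significance_weighted_hit_rate` should not be read as the authors' definition of PR. The log format (a list of `(topic, hit)` pairs) and the significance values are hypothetical.

```python
def hit_rate(log: list[tuple[str, bool]]) -> float:
    """Percentage of requests satisfied by the cache."""
    if not log:
        return 0.0
    hits = sum(1 for _, hit in log if hit)
    return 100.0 * hits / len(log)


def significance_weighted_hit_rate(
    log: list[tuple[str, bool]], significance: dict[str, float]
) -> float:
    """Assumed variant: each request counts with the significance of
    its topic, so hits on important topics are worth more."""
    total = sum(significance[topic] for topic, _ in log)
    if total == 0:
        return 0.0
    gained = sum(significance[topic] for topic, hit in log if hit)
    return 100.0 * gained / total


# Hypothetical replayed requests: (topic of the requested page, cache hit?)
log = [("XML", True), ("XML", False), ("Fuzzy", True), ("XML", True)]
significance = {"XML": 2.0, "Fuzzy": 0.5}  # values within [0, 2]

hr = hit_rate(log)
weighted = significance_weighted_hit_rate(log, significance)
```

A missed request on a highly significant topic lowers the weighted figure more than it lowers the plain hit rate, which is why the two can diverge for the same log.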


  • Baseline algorithm

    LRV (Rizzo et al., 1998)
      A physical-page-based algorithm
      Uses size (S) to predict further access to incoming objects
    Parameters in consideration
      Access frequency (F)
      Time interval (R)
      Size of objects (S)


  • Results: Hit Rates

    [Chart: hit rate against cache space in % of total unique data; annotated "20% up"]


  • Results: Profit Rates

    [Chart: profit rate against cache space in % of total unique data; annotated "30% up"]


  • Conclusion and future work

    Performance of caching proxies can be remarkably improved if cache contents are well organized and managed
    Proposed a hierarchical model and a cache management scheme based on that model
    Future work
      Tuning various parameters to achieve better performance (logical page clustering, priority balancing significance and popularity, etc.)
      More experiments

