
Artificial Intelligence

Roman Barták
Department of Theoretical Computer Science and Mathematical Logic

Decisions with multiple agents

What if the uncertainty is due to other agents and the decisions they make? And what if the decisions of those agents are in turn influenced by our decisions?

• agent design – game theory can analyze the agent's decisions and compute the expected utility for each decision (under the assumption that other agents are acting optimally according to game theory)
• mechanism design – inverse game theory makes it possible to define the rules of the environment so that the collective good of all agents is maximized (when each agent adopts the game-theoretic solution that maximizes its own utility)

Single-move games

Consider a restricted set of games, where all players take action simultaneously and the result of the game is based on this single set of actions
– what matters is that no player has knowledge of the other players' choices

A single-move game is defined by three components:
– players (or agents), like O (odd) and E (even)
– actions that the players can choose, like one or two fingers
– a payoff function that gives the utility to each player for each combination of actions by all players; the payoff matrix for two-finger Morra is as follows:

          O: one         O: two
E: one    E=+2, O=-2     E=-3, O=+3
E: two    E=-3, O=+3     E=+4, O=-4
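
The three components above can be written down directly. The following is a minimal sketch, assuming a dictionary keyed by (E's action, O's action); the names MORRA, payoff and is_zero_sum are illustrative and do not come from the lecture.

```python
# Payoffs for two-finger Morra, keyed by (E's action, O's action).
# Each value is the pair (utility for E, utility for O), copied from the matrix above.
MORRA = {
    ("one", "one"): (+2, -2),
    ("one", "two"): (-3, +3),
    ("two", "one"): (-3, +3),
    ("two", "two"): (+4, -4),
}

def payoff(e_action, o_action):
    """Return (utility for E, utility for O) for one simultaneous move."""
    return MORRA[(e_action, o_action)]

def is_zero_sum(game):
    """True if the players' utilities cancel out in every outcome."""
    return all(sum(utilities) == 0 for utilities in game.values())

print(payoff("one", "two"))   # (-3, 3)
print(is_zero_sum(MORRA))     # True
```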

Single-move games: solution and strategy

Each player in a game must adopt and then execute a strategy (policy).
– a pure strategy is a deterministic policy; for a single-move game, it is just a single action
– a mixed strategy is a randomized policy that selects actions according to a probability distribution; for two actions, it is written [p, a; (1-p), b]

The game's outcome is a numeric value for each player. A solution to a game is a strategy profile (an assignment of a strategy to each player) in which each player adopts a rational strategy.
– What does "rational" mean when each agent chooses only part of the strategy profile that determines the outcome?
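
Under a mixed strategy profile the outcome is an expected value rather than a single payoff. The sketch below assumes the two-finger Morra payoffs for player E and represents each mixed strategy as a dictionary from action to probability; the function name expected_payoff is illustrative.

```python
from fractions import Fraction
from itertools import product

# E's payoffs in two-finger Morra, keyed by (E's action, O's action);
# the game is zero-sum, so O's payoff is the negation.
E_PAYOFF = {
    ("one", "one"): 2, ("one", "two"): -3,
    ("two", "one"): -3, ("two", "two"): 4,
}

def expected_payoff(e_strategy, o_strategy):
    """Expected utility for E when both players randomize independently.

    Each strategy is a dict mapping an action to its probability.
    """
    return sum(
        e_strategy[a] * o_strategy[b] * E_PAYOFF[(a, b)]
        for a, b in product(e_strategy, o_strategy)
    )

pure_one = {"one": Fraction(1), "two": Fraction(0)}
uniform = {"one": Fraction(1, 2), "two": Fraction(1, 2)}

print(expected_payoff(pure_one, pure_one))  # 2
print(expected_payoff(uniform, uniform))    # 0
```

A pure strategy is the special case in which one action has probability 1.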

Prisoner's dilemma

Consider the following story:
– Two alleged burglars, Alice and Bob, are caught red-handed near the scene of a burglary and are interrogated separately.
– A prosecutor offers each a deal: if you testify against your partner as the leader of a burglary ring, you will go free while your partner will serve 10 years in prison.
– However, if you both testify against each other, you will both get 5 years.
– If you both refuse to testify, you will serve only 1 year each for the lesser charge of possessing stolen property.

Should they testify or refuse?
– The rational decision is to testify.

                Alice: testify    Alice: refuse
Bob: testify    A=-5, B=-5        A=-10, B=0
Bob: refuse     A=0, B=-10        A=-1, B=-1

Dominance

Testify is a dominant strategy for the Prisoner's dilemma.
– a strategy s for player p strongly dominates strategy s' if the outcome for s is better for p than the outcome for s', for every choice of strategies by the other player(s)
– a strategy s weakly dominates s' if s is better on at least one strategy profile and no worse on any other

It is irrational to play a dominated strategy, and irrational not to play a dominant strategy if one exists.

When each player has a dominant strategy, the combination of those strategies is called a dominant strategy equilibrium. A strategy profile forms an equilibrium if no player can benefit by switching strategies, given that every other player sticks with the same strategy. Such a profile is called a Nash equilibrium; every finite game has at least one, possibly in mixed strategies.

The outcome of a game is Pareto optimal if there is no other outcome that all players would prefer.
– An outcome is Pareto dominated by another outcome if all players would prefer the other outcome.

The Prisoner's dilemma arises from having a dominant strategy equilibrium (testify, testify) that is Pareto dominated by the outcome (refuse, refuse).
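
The definitions of strong dominance and Pareto dominance can be checked mechanically on the Prisoner's dilemma payoffs. This is a minimal sketch, assuming payoffs keyed by (Alice's action, Bob's action); the helper names are illustrative.

```python
# Prisoner's dilemma payoffs keyed by (Alice's action, Bob's action);
# each value is (utility for Alice, utility for Bob).
ACTIONS = ("testify", "refuse")
PD = {
    ("testify", "testify"): (-5, -5),
    ("refuse", "testify"): (-10, 0),
    ("testify", "refuse"): (0, -10),
    ("refuse", "refuse"): (-1, -1),
}

def utility(player, own, other):
    """Utility for player 0 (Alice) or 1 (Bob) given own and opponent actions."""
    profile = (own, other) if player == 0 else (other, own)
    return PD[profile][player]

def strongly_dominates(player, s, t):
    """True if s is strictly better than t for the player against every opponent action."""
    return all(utility(player, s, o) > utility(player, t, o) for o in ACTIONS)

def pareto_dominated(profile):
    """True if some other outcome is strictly preferred by both players."""
    a, b = PD[profile]
    return any(x > a and y > b for x, y in PD.values())

print(strongly_dominates(0, "testify", "refuse"))   # True
print(strongly_dominates(1, "testify", "refuse"))   # True
print(pareto_dominated(("testify", "testify")))     # True
print(pareto_dominated(("refuse", "refuse")))       # False
```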

No dominant strategy

Consider the following game:
– Acme, a video game console manufacturer, has to decide whether its next game machine will use Blu-ray discs or DVDs.
– Meanwhile, the video game software producer Best needs to decide whether to produce its next game on Blu-ray or DVD.
– The profits of both will be positive if they agree and negative if they disagree.

There is no dominant strategy equilibrium for this game, but there are two Nash equilibria. There are multiple acceptable solutions, but if each agent aims for a different one, then both agents will suffer. How can they agree on a solution?

Both should choose the Pareto-optimal Nash equilibrium, provided that one exists; (bluray, bluray) is the Pareto-optimal solution. What if there are multiple such solutions (for example, if (bluray, bluray) had payoff (5, 5), the same as (dvd, dvd))?
• agents can either guess or communicate
• coordination games (games in which players need to communicate)

                Acme: bluray     Acme: dvd
Best: bluray    A=+9, B=+9       A=-4, B=-1
Best: dvd       A=-3, B=-1       A=+5, B=+5
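
The two Nash equilibria can be found by testing every pure strategy profile for profitable unilateral deviations. The sketch below assumes payoffs keyed by (Acme's action, Best's action); is_nash and has_dominant_strategy are illustrative names.

```python
from itertools import product

ACTIONS = ("bluray", "dvd")
# Payoffs keyed by (Acme's action, Best's action); values are (Acme, Best).
GAME = {
    ("bluray", "bluray"): (9, 9),
    ("dvd", "bluray"): (-4, -1),
    ("bluray", "dvd"): (-3, -1),
    ("dvd", "dvd"): (5, 5),
}

def is_nash(profile):
    """True if neither player gains by unilaterally switching actions."""
    acme, best = profile
    u_acme, u_best = GAME[profile]
    acme_ok = all(GAME[(alt, best)][0] <= u_acme for alt in ACTIONS)
    best_ok = all(GAME[(acme, alt)][1] <= u_best for alt in ACTIONS)
    return acme_ok and best_ok

def has_dominant_strategy(player):
    """True if one action is strictly best for the player against every opponent action."""
    def u(own, other):
        profile = (own, other) if player == 0 else (other, own)
        return GAME[profile][player]
    return any(
        all(u(s, o) > u(t, o) for o in ACTIONS for t in ACTIONS if t != s)
        for s in ACTIONS
    )

equilibria = [p for p in product(ACTIONS, repeat=2) if is_nash(p)]
print(equilibria)                 # [('bluray', 'bluray'), ('dvd', 'dvd')]
print(has_dominant_strategy(0))   # False
print(has_dominant_strategy(1))   # False
```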

Mixed strategies

Consider the two-finger Morra game
– no pure-strategy equilibrium exists
  • if the total number of fingers is even, then O will want to switch
  • if the total is odd, then E will want to switch
– we must look for mixed strategies instead

Von Neumann developed a method for finding the optimal mixed strategy for two-player, zero-sum games (games in which the sum of the payoffs is always zero).
– the maximin technique
– we need to consider the payoffs of only one player (E)

Maximin technique (pure strategies)

Suppose we change the rules as follows:
– First E picks her strategy and reveals it to O. Then O picks his strategy, with knowledge of E's strategy.
  • This gives a turn-taking game to which we can apply the standard minimax algorithm.
  • Clearly, this game favors O, so we get a lower bound for the true utility of E (-3).
– Now suppose we change the rules to force O to reveal his strategy first, followed by E.
  • This gives an upper bound for the true utility of E (2).
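
Both bounds follow from a two-level minimax over the payoff matrix. A minimal sketch, assuming E's payoffs keyed by (E's action, O's action):

```python
# E's payoffs in two-finger Morra, keyed by (E's action, O's action).
ACTIONS = ("one", "two")
E_PAYOFF = {
    ("one", "one"): 2, ("one", "two"): -3,
    ("two", "one"): -3, ("two", "two"): 4,
}

# E moves first: O sees E's action and picks the reply that is worst for E,
# so E chooses the action whose worst case is best (maximin).
lower_bound = max(min(E_PAYOFF[(e, o)] for o in ACTIONS) for e in ACTIONS)

# O moves first: E sees O's action and picks the best reply,
# so O chooses the action that leaves E the smallest best case (minimax).
upper_bound = min(max(E_PAYOFF[(e, o)] for e in ACTIONS) for o in ACTIONS)

print(lower_bound, upper_bound)   # -3 2
```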

Maximin technique (mixed strategies)

We need to turn our analysis to mixed strategies [p, one; (1-p), two].

Once the first player has revealed his or her strategy, the second player might as well choose a pure strategy.

What is the value of p to get the best utility for E (left)?
– p = 7/12 and the payoff for E is -1/12

What is the value of q to get the best utility for O (right)?
– q = 7/12 and the payoff for E is -1/12

The optimal strategy for both players is [7/12, one; 5/12, two]
» maximin equilibrium (it is also a Nash equilibrium)
» the two-finger Morra game favors player O
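
The value p = 7/12 can be reproduced by intersecting the two lines that give E's expected payoff against each pure reply by O. This is a sketch under the assumption that the optimum lies at that intersection, which holds here because one line rises with p and the other falls; the names are illustrative.

```python
from fractions import Fraction

# E's payoffs in two-finger Morra, keyed by (E's action, O's action).
E_PAYOFF = {
    ("one", "one"): 2, ("one", "two"): -3,
    ("two", "one"): -3, ("two", "two"): 4,
}

def e_value_against(p, o_action):
    """E's expected payoff for [p, one; (1-p), two] against a pure reply by O."""
    return p * E_PAYOFF[("one", o_action)] + (1 - p) * E_PAYOFF[("two", o_action)]

# Each expected payoff is linear in p: value(p) = intercept + slope * p.
# The worst case for E is the lower of the two lines, which peaks where they cross.
intercepts = {o: e_value_against(Fraction(0), o) for o in ("one", "two")}
slopes = {o: e_value_against(Fraction(1), o) - intercepts[o] for o in ("one", "two")}

p = (intercepts["two"] - intercepts["one"]) / (slopes["one"] - slopes["two"])
value = e_value_against(p, "one")

print(p, value)   # 7/12 -1/12
```

Because the payoff matrix is symmetric, the same computation for O's mixing probability q also yields 7/12.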


Repeated games

What if the same game is repeated several times?

A repeated game is the simplest kind of multiple-move game:
• players face the same choice repeatedly, but each time with knowledge of the history of all players' previous choices
• payoffs are additive over time

Strategies for repeated games

The repeated version of the prisoner's dilemma:
1. the same players play 100 rounds
   – the rational strategy is still to testify (the last round is not a repeated game, so both players testify in it; the same argument then applies to round 99, and so on back to the first round)
   – each player earns a total jail sentence of 500 years
2. there is a 99% chance that the players meet again after each round
   – the expected number of rounds is still 100, but neither player knows for sure which round will be the last
   – perpetual punishment strategy: each player refuses unless the other player has ever played testify
   – the expected future payoff is -100 ($\sum_{t=0}^{\infty} 0.99^t \cdot (-1)$) if both players adopt this strategy
   – a player who deviates from the strategy and chooses testify will gain a score of 0, but then both players will play testify and the total expected future payoff becomes -495 ($0 + \sum_{t=1}^{\infty} 0.99^t \cdot (-5)$)
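
Both expected payoffs are geometric series and can be evaluated in closed form. A minimal sketch, assuming the per-round payoffs of the prisoner's dilemma and a 99% continuation probability; discounted_sum is an illustrative helper.

```python
from fractions import Fraction

def discounted_sum(payoff_per_round, continue_prob, first_round=0):
    """Closed form of the sum over t >= first_round of continue_prob**t * payoff_per_round."""
    return payoff_per_round * continue_prob ** first_round / (1 - continue_prob)

meet_again = Fraction(99, 100)

# Both players keep refusing: every round costs 1 year.
both_refuse = discounted_sum(-1, meet_again)

# One player testifies now (0 years), then both testify forever (5 years per round).
deviate = 0 + discounted_sum(-5, meet_again, first_round=1)

print(both_refuse, deviate)   # -100 -495
```

Since -100 is better than -495, deviating from perpetual punishment does not pay off.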

A famous strategy is called tit-for-tat:
• starting with refuse and then echoing the other player's previous move on all subsequent moves
• highly robust and effective against a wide variety of strategies
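
Tit-for-tat is easy to simulate. The sketch below assumes a fixed number of rounds and strategies written as functions of the opponent's move history; always_testify is a hypothetical opponent added for comparison.

```python
# Per-round payoffs keyed by (own move, other's move): years in prison as negative utility.
PAYOFF = {
    ("testify", "testify"): -5,
    ("testify", "refuse"): 0,
    ("refuse", "testify"): -10,
    ("refuse", "refuse"): -1,
}

def tit_for_tat(opponent_history):
    """Refuse first, then repeat the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "refuse"

def always_testify(opponent_history):
    """Testify in every round regardless of history."""
    return "testify"

def play(strategy_a, strategy_b, rounds):
    """Play the repeated game and return the total payoff of each player."""
    moves_a, moves_b = [], []
    total_a = total_b = 0
    for _ in range(rounds):
        a = strategy_a(moves_b)   # each strategy sees only the other player's past moves
        b = strategy_b(moves_a)
        total_a += PAYOFF[(a, b)]
        total_b += PAYOFF[(b, a)]
        moves_a.append(a)
        moves_b.append(b)
    return total_a, total_b

print(play(tit_for_tat, tit_for_tat, 100))      # (-100, -100)
print(play(tit_for_tat, always_testify, 100))   # (-505, -495)
```

Against a constant defector, tit-for-tat loses only the first round and matches the opponent afterwards.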

Mechanism design

So far we focused on the question "Given a game, what is a rational strategy?" What if we ask "Given that agents pick rational strategies, what game should we design?" We would like to design a game whose solutions, consisting of each agent pursuing its own rational strategy, result in the maximization of some global utility function.

This is called mechanism design or sometimes inverse game theory. It is used in economics and political science. In general, it allows us to construct smart systems out of a collection of more limited (even uncooperative) systems.

A mechanism consists of:
– a language for describing the set of allowable strategies that agents may adopt,
– a distinguished agent, the center, that collects reports of strategy choices from agents in the game,
– an outcome rule, known to all agents, that the center uses to determine the payoffs of each agent given their strategy choices

Auctions

An auction is a mechanism for selling some goods to members of a pool of bidders.

For simplicity, we concentrate on auctions with a single item for sale.

Each bidder i has a utility value v_i for having the item.
• In some cases, each bidder has a private value for the item.
  – An old piece of furniture has a different value for a furniture collector and for a young family.
• In other cases, the item has a common value, but there is uncertainty as to what the actual value is.
  – Different bidders have different information and hence different estimates of the item's true value.

Auction mechanism
– each bidder gets a chance to make a bid b_i
– the highest bid b_max wins the item, but the price paid need not be b_max (part of mechanism design)

English auction

The best-known auction mechanism is the ascending-bid, or English, auction.
– The center starts by asking for a minimum (or reserve) bid b_min.
– If some bidder is willing to pay that amount, the center then asks for b_min + d, for some increment d, and continues up from there.
– The auction ends when nobody is willing to bid any more.
– Then the last bidder wins the item, paying the price he bid.

How do we know if this is a good mechanism?
– one goal is to maximize expected revenue for the seller; another goal is to maximize a notion of global utility
– we say an auction is efficient if the goods go to the agent who values them most

The English auction is usually both efficient and revenue maximizing if there is
– a sufficient number of bidders to enter the game
– no collusion, that is, no unfair or illegal agreement by two or more bidders to manipulate prices
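
The ascending-bid procedure can be sketched as a loop over asking prices. The code below assumes that every bidder accepts any asking price up to its private value and that ties are broken in favor of the higher value; both are modelling assumptions, and the bidder names are hypothetical.

```python
def english_auction(values, reserve, increment):
    """Simulate an ascending-bid auction and return (winner, price paid).

    values maps each bidder to a private value v_i. Each bidder accepts any
    asking price that does not exceed that value. increment must be positive.
    Returns (None, None) if nobody meets the reserve price.
    """
    winner, price = None, None
    asking = reserve
    while True:
        # Bidders other than the current leader who accept the asking price.
        willing = [b for b, v in values.items() if b != winner and v >= asking]
        if not willing:
            return winner, price
        # Ties are broken by highest value; this is an arbitrary modelling choice.
        winner = max(willing, key=lambda b: values[b])
        price = asking
        asking += increment

print(english_auction({"ann": 120, "bob": 95, "cat": 70}, reserve=50, increment=10))
# ('ann', 90)
```

In this run the item goes to the bidder with the highest value, at a price within one increment of the second-highest value. With a single bidder the same function returns the reserve price.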

Collusion

An unfair or illegal agreement by two or more bidders to manipulate prices. It can happen in secret backroom deals or tacitly, within the rules of the mechanism.

Example of price manipulation within the rules of the mechanism
– In 1999, Germany auctioned ten blocks of cell-phone spectrum with a simultaneous auction (bids were taken on all ten blocks at the same time), using the rule that any bid must be a minimum of a 10% raise over the previous bid on a block.
– There were only two credible bidders, Mannesmann and T-Mobile.
– Mannesmann entered a bid of 20 million DEM on blocks 1-5 and 18.18 million DEM on blocks 6-10.
– T-Mobile interpreted Mannesmann's first bid as an offer: both parties could compute that a 10% raise on 18.18M is 19.99M. Mannesmann's bid was interpreted as the offer "we can each get half of the blocks for 20M".

What to do about it?
– a higher reserve price
– a sealed-bid first-price auction
– bring in a third bidder
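
The arithmetic behind the signal in the spectrum example is a single multiplication. A minimal sketch using exact decimals; minimum_raise is an illustrative helper, and the 19.99M quoted in the example corresponds to the computed value cut to two decimals.

```python
from decimal import Decimal

def minimum_raise(previous_bid, raise_rate=Decimal("0.10")):
    """Smallest admissible next bid under a minimum-percentage-raise rule."""
    return previous_bid * (1 + raise_rate)

low_bid = Decimal("18.18")    # million DEM, blocks 6-10
high_bid = Decimal("20")      # million DEM, blocks 1-5

print(minimum_raise(low_bid))              # 19.9980
print(minimum_raise(low_bid) < high_bid)   # True
```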

Truth-revealing

In general, both the seller and the global utility function benefit if there are more bidders. One way to encourage more bidders is to make the mechanism easier for them. It is desirable that the bidders have a dominant strategy, a strategy that works against all other strategies.
– an agent with a dominant strategy can just bid, without wasting time contemplating the other agents' possible strategies

Usually such a strategy involves the bidders revealing their true value v_i; then it is called a truth-revealing, or truthful, auction.

Properties of the English auction

The English auction has most of the desirable properties:
– bidders have a simple dominant strategy: keep bidding as long as the current cost is below your v_i
– this is not quite truth-revealing, because the winning bidder reveals only that his v_i ≥ b_0 + d, where b_0 is the highest bid among the other bidders (we know only a lower bound on v_i)

Some disadvantages of the English auction:
– if there is one clearly stronger bidder who can always bid higher than any other bidder, then the competitors may not enter at all, and the strong bidder ends up winning at the reserve price (this discourages competition)
– high communication costs, as the auction takes place in one room or all bidders have to have high-speed, secure communication lines

Sealed-bid auction

An alternative mechanism is the sealed-bid auction.
– each bidder makes a single bid and communicates it to the auctioneer without the other bidders seeing it
– the highest bid wins

There is no longer a simple dominant strategy
– the bid depends on the expected bids of the other agents
– let v_i be your utility value and b_0 be the expected maximum of all the other agents' bids
– then you should bid b_0 + ε (for some small ε), if that is less than v_i

Note that the agent with the highest v_i might not win the auction, reducing the bias toward an advantaged bidder (the auction is more competitive).
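
The bidding rule and its consequence can be illustrated with two hypothetical bidders whose beliefs about the competition differ. This is only a sketch: the function names, the bidders and the choice ε = 1 are assumptions, and the rule says nothing about what to do when b_0 + ε is not below v_i, so the code returns None in that case.

```python
def first_price_bid(value, expected_best_other, epsilon=1):
    """Bid just above the expected best rival bid, if that stays below the bidder's value.

    Returns None when outbidding the rivals would cost at least the item's value.
    """
    bid = expected_best_other + epsilon
    return bid if bid < value else None

def first_price_winner(bids):
    """Highest bid wins and the winner pays their own bid."""
    winner = max(bids, key=bids.get)
    return winner, bids[winner]

# Hypothetical bidders: ann values the item most but underestimates the competition.
bids = {
    "ann": first_price_bid(value=120, expected_best_other=60),   # bids 61
    "bob": first_price_bid(value=95, expected_best_other=80),    # bids 81
}
print(first_price_winner(bids))   # ('bob', 81)
```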

Sealed-bid second-price auction

A small change in the mechanism for sealed-bid auctions produces the sealed-bid second-price auction, also known as a Vickrey auction.
– The winner pays the price of the second-highest bid, b_0, rather than paying his own bid.
– The dominant strategy is now simply to bid v_i; the mechanism is truth-revealing.

Why is this a dominant strategy? The utility of agent i in terms of his bid b_i, his value v_i, and the best bid among the other agents b_0 is (v_i - b_0) if b_i > b_0, otherwise 0.
• when (v_i - b_0) > 0, then any bid that wins the auction is optimal, and bidding v_i in particular wins the auction
• when (v_i - b_0) < 0, then any bid that loses the auction is optimal, and bidding v_i in particular loses the auction
• so bidding v_i is optimal for all possible values of b_0, and in fact, v_i is the only bid that has this property
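
The dominance argument can be checked by brute force on a small grid of bids. A minimal sketch, assuming a value of 10 and bids between 0 and 20; the grid and the names are illustrative.

```python
def second_price_utility(bid, value, best_other_bid):
    """Utility of a bidder in a sealed-bid second-price auction.

    The bidder wins only if bid > best_other_bid and then pays best_other_bid.
    """
    return value - best_other_bid if bid > best_other_bid else 0

value = 10
candidate_bids = range(0, 21)
rival_bids = range(0, 21)

# Bidding the true value is never worse than any alternative bid, whatever the rivals do.
truthful_is_optimal = all(
    second_price_utility(value, value, b0) >= second_price_utility(bid, value, b0)
    for bid in candidate_bids
    for b0 in rival_bids
)
print(truthful_is_optimal)   # True

# Any other bid is strictly worse for at least one rival bid.
# Half-integer rival bids are included so that a rival can fall between two integer bids.
half_steps = [b / 2 for b in range(0, 41)]
others_fail_somewhere = all(
    any(second_price_utility(value, value, b0) > second_price_utility(bid, value, b0)
        for b0 in half_steps)
    for bid in candidate_bids if bid != value
)
print(others_fail_somewhere)   # True
```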

Common goods

Consider another type of game, in which countries set their policy for controlling air pollution.

Each country has a choice
• they can reduce pollution at a cost of -10 points for implementing the necessary changes
• or they can continue to pollute, which gives them a net utility of -5 (in added health costs, etc.) and also contributes -1 points to every other country (because the air is shared across countries)

What is the strategy of each country?
• Clearly, the dominant strategy for each country is "continue to pollute".
• If there are 100 countries and each follows this policy, then each country gets a total utility of -104.
• If every country reduced pollution, they would each have a utility of -10!
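
The numbers above follow from a one-line utility function. A minimal sketch; country_utility is an illustrative name.

```python
def country_utility(pollutes, other_polluters):
    """Utility of one country given its own choice and how many other countries pollute."""
    own = -5 if pollutes else -10      # health costs vs. cost of reducing pollution
    return own - 1 * other_polluters   # each other polluter contributes -1

n_countries = 100

# Polluting is better than reducing no matter what the others do (a dominant strategy).
pollute_dominates = all(
    country_utility(True, k) > country_utility(False, k) for k in range(n_countries)
)

print(pollute_dominates)                          # True
print(country_utility(True, n_countries - 1))     # -104
print(country_utility(False, 0))                  # -10
```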

Tragedy of the commons

Tragedy of the commons: if nobody has to pay for using a common resource, then it tends to be exploited in a way that leads to a lower total utility for all agents.

It is similar to the prisoner's dilemma: there is another solution to the game that is better for all parties, but there appears to be no way for rational agents to arrive at that solution.

Tragedy of the commons: taxes

A standard approach for dealing with the tragedy of the commons is to change the mechanism to one that charges each agent for using the commons (a carbon tax).

We need to ensure that all externalities (effects on global utility that are not recognized in the individual agents' transactions) are made explicit.

Another example:
– Suppose a city decides it wants to install some free wireless Internet transceivers. However, the number of transceivers they can afford is less than the number of neighborhoods that want them.
– The problem is that if they just ask each neighborhood council "how much do you value this free gift?", they would all have an incentive to lie and report a high value.
– A solution is to ask them to pay for it.

Vickrey-Clarke-Groves mechanism

1. the center asks each agent to report its value for receiving an item, b_i
2. the center allocates the goods to a subset A of the bidders. Let b_i(A) = b_i if i ∈ A, otherwise 0. The center chooses A to maximize the total reported utility B = Σ_i b_i(A)
3. each agent pays a tax equal to W_{-i} - B_{-i}, where
   B_{-i} = Σ_{j≠i} b_j(A)
   W_{-i} = total global utility if i were not in the game
   each winner would pay a tax equal to the highest reported value among the losers (losers pay nothing)
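
The three steps can be sketched for the case of identical items with at most one item per agent. This restriction, the function name vcg and the reported values are assumptions made for illustration.

```python
def vcg(bids, n_items):
    """Allocate n_items identical items, one per winner, and compute VCG taxes.

    bids maps each agent to its reported value b_i. Returns (winners, taxes).
    """
    def best_total(agents):
        """Maximum total reported utility achievable with the given agents."""
        return sum(sorted((bids[a] for a in agents), reverse=True)[:n_items])

    ranked = sorted(bids, key=bids.get, reverse=True)
    winners = set(ranked[:n_items])          # the subset A maximizing B

    taxes = {}
    for i in bids:
        others = [a for a in bids if a != i]
        w_minus_i = best_total(others)                         # W_{-i}
        b_minus_i = sum(bids[a] for a in winners if a != i)    # B_{-i}
        taxes[i] = w_minus_i - b_minus_i
    return winners, taxes

# Hypothetical reports for 3 transceivers and 5 neighborhoods.
winners, taxes = vcg({"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}, n_items=3)
print(sorted(winners))   # ['a', 'b', 'c']
print(taxes)             # {'a': 2, 'b': 2, 'c': 2, 'd': 0, 'e': 0}
```

Each winner pays 2, the highest reported value among the losers, and the losers pay nothing.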

Properties of the Vickrey-Clarke-Groves mechanism

Why does the VCG mechanism make the agents happy?
– all winners should be happy because they pay a tax that is less than or equal to their value
– all losers are as happy as they can be, because they value the goods less than the required tax

Why is it that this mechanism is truth-revealing?
– each agent maximizes his payoff, which is the value of getting an item minus the tax: v_i(A) - (W_{-i} - B_{-i})
– agent i knows that the center will maximize global utility using the reported values: Σ_j b_j(A) = b_i(A) + Σ_{j≠i} b_j(A)
– whereas agent i wants the center to maximize v_i(A) + Σ_{j≠i} b_j(A) - W_{-i}
– Since agent i cannot affect the value of W_{-i} (it depends only on the other agents), the only way i can make the center optimize what i wants is to report the true utility b_i = v_i
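
The truth-revealing property can be checked numerically by comparing the truthful report with every misreport on a small grid. The sketch assumes identical items with one item per winner, so that a winner's tax equals the highest losing report; the reports of the other agents are hypothetical.

```python
def vcg_payoff(report, true_value, other_reports, n_items):
    """Payoff of one agent under VCG with identical items and one item per winner.

    The agent wins if its report is among the n_items highest; a winner pays the
    highest losing report, which is the n_items-th highest report of the others.
    Ties are resolved against the agent (a simplifying assumption).
    """
    ranked_others = sorted(other_reports, reverse=True)
    threshold = ranked_others[n_items - 1] if len(ranked_others) >= n_items else 0
    wins = report > threshold
    return true_value - threshold if wins else 0

others = [5, 4, 2, 1]      # hypothetical reports of the other agents
n_items = 3

for true_value in (1, 3, 6):
    truthful = vcg_payoff(true_value, true_value, others, n_items)
    best_misreport = max(vcg_payoff(r, true_value, others, n_items) for r in range(0, 11))
    print(true_value, truthful, best_misreport)
# 1 0 0
# 3 1 1
# 6 4 4
```

In every case the best payoff reachable by any report equals the payoff of the truthful report.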

© 2016 Roman Barták, Department of Theoretical Computer Science and Mathematical Logic
