Machine Learning: Foundations


Posted on 14-Feb-2016









  • Machine Learning: Foundations
    Yishay Mansour, Tel-Aviv University

  • Typical Applications
    Classification/Clustering problems:
    - Data mining
    - Information retrieval
    - Web search
    - Self-customization (news, mail)

  • Typical Applications
    Control:
    - Robots
    - Dialog systems
    - Complicated software
      - driving a car
      - playing a game (backgammon)

  • Why Now?
    - Technology ready: algorithms and theory.
    - Information abundant: a flood of (online) data.
    - Computational power: sophisticated techniques.
    - Industry and consumer needs.

  • Example 1: Credit Risk Analysis
    - Typical customer: a bank.
    - Database: current clients' data, including:
      - basic profile (income, house ownership, delinquent accounts, etc.)
      - basic classification.
    - Goal: predict/decide whether to grant credit.

  • Example 1: Credit Risk Analysis
    Rules learned from data:
    IF Other-Delinquent-Accounts > 2
       AND Number-Delinquent-Billing-Cycles > 1
    THEN DENY CREDIT
    IF Other-Delinquent-Accounts = 0
       AND Income > $30k
    THEN GRANT CREDIT
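A minimal sketch of how the two learned rules above could be encoded as a decision function; the function name and the fallback case are hypothetical, since the slide's rules do not cover every customer.

```python
# Hypothetical encoding of the slide's two learned credit rules.
def credit_decision(other_delinquent_accounts, delinquent_billing_cycles, income):
    if other_delinquent_accounts > 2 and delinquent_billing_cycles > 1:
        return "DENY CREDIT"
    if other_delinquent_accounts == 0 and income > 30_000:
        return "GRANT CREDIT"
    return "NO RULE FIRED"  # the slide's rules are not exhaustive

print(credit_decision(3, 2, 25_000))   # DENY CREDIT
print(credit_decision(0, 0, 45_000))   # GRANT CREDIT
```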

  • Example 2: Clustering News
    - Data: Reuters news / Web data.
    - Goal: basic category classification (business, sports, politics, etc.); classify into (unspecified) subcategories.
    - Methodology: consider typical words for each category; classify using a distance measure.
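The methodology above can be sketched as a toy nearest-category classifier; the word lists and the overlap-based distance are invented for illustration, not the slide's actual measure.

```python
# Hypothetical "typical words" per category (invented for illustration).
CATEGORY_WORDS = {
    "business": {"market", "stock", "profit", "bank"},
    "sports":   {"game", "team", "score", "season"},
    "politics": {"election", "vote", "parliament", "policy"},
}

def classify(document_words):
    # One simple distance measure: negative word overlap with the category.
    def distance(category):
        return -len(CATEGORY_WORDS[category] & document_words)
    return min(CATEGORY_WORDS, key=distance)

print(classify({"the", "team", "won", "the", "game"}))  # sports
```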

  • Example 3: Robot Control
    - Goal: control a robot in an unknown environment.
    - Needs both to explore (new places and actions) and to use acquired knowledge to gain benefits.
    - The learning task controls what it observes!

  • A Glimpse into the Future
    Today's status:
    - First-generation algorithms: neural nets, decision trees, etc.
    - Well-formed databases.
    Future:
    - Many more problems: networking, control, software.
    - The main advantage is flexibility!

  • Relevant Disciplines
    - Artificial intelligence
    - Statistics
    - Computational learning theory
    - Control theory
    - Information theory
    - Philosophy
    - Psychology and neurobiology

  • Types of Models
    - Supervised learning: given access to classified data.
    - Unsupervised learning: given access to data, but no classification.
    - Control learning: selects actions and observes consequences; maximizes long-term cumulative return.

  • Learning: Complete Information
    - Two equally likely classes of points in the plane, with class distributions D1 and D2.
    - Compute the probability of the first class (the "smiley" points) given a point (x, y), using Bayes' formula.
    - Let p be that probability.

  • Predictions and Loss Model: Boolean Error
    - Predict a Boolean value; for each error we lose 1 (no error, no loss).
    - Compare the probability p to 1/2 and predict deterministically with the higher value.
    - Optimal prediction (for this loss), but it cannot recover the probabilities!

  • Predictions and Loss Model: Quadratic Loss
    - Predict a real number q for outcome 1.
    - Loss (q-p)² for outcome 1; loss ([1-q]-[1-p])² for outcome 0.
    - Expected loss: (p-q)².
    - Minimized for q = p (optimal prediction); recovers the probabilities.
    - Needs to know p to compute the loss!

  • Predictions and Loss Model: Logarithmic Loss
    - Predict a real number q for outcome 1.
    - Loss log(1/q) for outcome 1; loss log(1/(1-q)) for outcome 0.
    - Expected loss: -p log q - (1-p) log(1-q).
    - Minimized for q = p (optimal prediction); recovers the probabilities.
    - The loss does not depend on p!
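A small numerical check of the claim above: minimizing the expected logarithmic loss over a grid of candidate predictions q recovers the true probability p. The value p = 0.3 and the grid are chosen arbitrarily for illustration.

```python
import math

# Expected log loss for predicting q when the true probability is p.
def expected_log_loss(p, q):
    return -p * math.log(q) - (1 - p) * math.log(1 - q)

p = 0.3
grid = [i / 1000 for i in range(1, 1000)]          # candidate predictions q
best_q = min(grid, key=lambda q: expected_log_loss(p, q))
print(best_q)  # minimizer coincides with p (up to grid resolution)
```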

  • The Basic PAC Model
    - Unknown target function f(x).
    - Distribution D over the domain X.
    - Goal: find h(x) such that h(x) approximates f(x).
    - Given H, find h ∈ H that minimizes Pr_D[h(x) ≠ f(x)].

  • Basic PAC Notions
    - S: a sample of m examples (x, f(x)) drawn i.i.d. using D.
    - True error: ε(h) = Pr_D[h(x) ≠ f(x)].
    - Observed error: ε̂(h) = (1/m) |{ x ∈ S | h(x) ≠ f(x) }|.
    - Basic question: how close is ε̂(h) to ε(h)?
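One standard (Hoeffding-style) answer to the basic question above, sketched as code: for a finite class H, m ≥ ln(2|H|/δ) / (2ε²) samples suffice so that, with probability at least 1-δ, every h in H has observed error within ε of its true error. The concrete numbers below are illustrative.

```python
import math

# Sample size for uniform convergence over a finite hypothesis class H.
def sample_size(H_size, eps, delta):
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

# Example: |H| = 1000 hypotheses, accuracy 0.05, confidence 99%.
print(sample_size(H_size=1000, eps=0.05, delta=0.01))
```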

  • Bayesian Theory
    - Prior distribution over H.
    - Given a sample S, compute a posterior distribution.
    - Maximum Likelihood (ML): maximize Pr[S|h].
    - Maximum A Posteriori (MAP): maximize Pr[h|S].
    - Bayesian predictor: Σ_h h(x) Pr[h|S].
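A toy illustration of ML versus MAP on an invented example: H is three coin-bias hypotheses, the prior favors the fair coin, and the sample is a short run of flips. All numbers are made up; the point is that the prior can pull MAP away from ML.

```python
# Hypothetical hypothesis class: three possible coin biases.
H = {"fair": 0.5, "biased": 0.8, "very_biased": 0.95}
prior = {"fair": 0.85, "biased": 0.10, "very_biased": 0.05}
S = [1, 1, 1, 0, 1, 1]  # observed flips, 1 = heads

def likelihood(h):                      # Pr[S | h]
    bias = H[h]
    p = 1.0
    for flip in S:
        p *= bias if flip == 1 else 1 - bias
    return p

ml = max(H, key=likelihood)             # Maximum Likelihood choice
posterior = {h: prior[h] * likelihood(h) for h in H}   # Pr[h|S], unnormalized
map_h = max(posterior, key=posterior.get)              # Maximum A Posteriori
print(ml, map_h)  # ML picks the biased coin; the strong prior makes MAP pick fair
```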

  • Nearest Neighbor Methods
    - Classify using nearby examples.
    - Assume a structured space and a metric.
    (Figure: '+' and '-' training points with an unlabeled query point '?')
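The figure above can be sketched as a 1-nearest-neighbor rule: label the query '?' with the label of its closest training point under Euclidean distance. The coordinates are invented for illustration.

```python
# 1-nearest-neighbor classification in the plane.
def nearest_neighbor(query, examples):
    def dist(item):
        (x, y), _label = item
        return (x - query[0]) ** 2 + (y - query[1]) ** 2
    _, label = min(examples, key=dist)
    return label

# Invented '+'/'-' training points, as in the slide's figure.
examples = [((0, 0), "+"), ((1, 0), "+"), ((5, 5), "-"), ((6, 5), "-")]
print(nearest_neighbor((0.5, 0.5), examples))  # +
print(nearest_neighbor((5.5, 4.0), examples))  # -
```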

  • Computational Methods
    - How to find an h ∈ H with low observed error.
    - Heuristic algorithms for specific classes.
    - In most cases the computational task is provably hard.

  • Separating Hyperplane
    - Perceptron: sign(Σ_i x_i w_i). Find w_1, ..., w_n.
    - Limited representation.
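A minimal sketch of the perceptron's mistake-driven training rule (add y·x to the weights on each mistake), assuming linearly separable data so the loop terminates. The dataset, which includes a constant bias coordinate, is invented for illustration.

```python
def sign(a):
    return 1 if a >= 0 else -1

def train_perceptron(data, passes=100):
    w = [0.0, 0.0, 0.0]
    for _ in range(passes):
        mistakes = 0
        for x, y in data:
            # Mistake-driven update: w <- w + y * x on each error.
            if sign(sum(xi * wi for xi, wi in zip(x, w))) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:   # converged: all examples classified correctly
            break
    return w

# x = (bias, x1, x2); labels are +1 only when both inputs are 1 (AND).
data = [((1, 0, 0), -1), ((1, 1, 0), -1), ((1, 0, 1), -1), ((1, 1, 1), 1)]
w = train_perceptron(data)
print(w)
```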

  • Neural Networks
    - Sigmoidal gates: a = Σ_i x_i w_i and output = 1/(1 + e^(-a)).
    - Back-propagation.
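The sigmoidal gate above, written out directly (the forward pass only; back-propagation is not shown):

```python
import math

# One sigmoidal gate: a = sum_i x_i * w_i, output = 1 / (1 + e^(-a)).
def sigmoid_gate(x, w):
    a = sum(xi * wi for xi, wi in zip(x, w))
    return 1.0 / (1.0 + math.exp(-a))

print(sigmoid_gate([1.0, 2.0], [0.0, 0.0]))  # 0.5 (zero activation)
```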

  • Decision Trees
    (Figure: a decision tree with internal tests such as x1 > 5 and x6 > 2.)

  • Decision Trees
    - Limited representation.
    - Efficient algorithms.
    - Aim: find a small decision tree with low observed error.

  • Decision Trees
    PHASE I: construct the tree greedily, using a local index function:
    - Gini index: G(x) = x(1-x); entropy H(x); ...
    PHASE II: prune the decision tree while maintaining low observed error.
    Good experimental results.
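A sketch of PHASE I's local index in action: score each candidate threshold by the weighted Gini index G(x) = x(1-x) of the two resulting sides, and pick the best split for a single feature. The data is invented for illustration.

```python
# Gini index G(p) = p(1-p) of a list of 0/1 labels.
def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)       # fraction of positive labels
    return p * (1 - p)

# Greedy local choice: threshold minimizing the weighted Gini of both sides.
def best_threshold(values, labels):
    def split_score(t):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        n = len(labels)
        return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return min(values, key=split_score)

values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(values, labels))  # 3: this split separates the labels perfectly
```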

  • Complexity versus Generalization
    - Hypothesis complexity versus observed error.
    - More complex hypotheses have lower observed error, but might have higher true error.

  • Basic Criteria for Model Selection
    - Minimum Description Length (MDL): ε̂(h) + |code length of h|.
    - Structural Risk Minimization (SRM): ε̂(h) + sqrt{ log |H| / m }.
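The SRM criterion above can be sketched as code: among candidate hypothesis classes, pick the one minimizing observed error plus the sqrt(log|H|/m) penalty. The observed errors and class sizes below are invented for illustration.

```python
import math

# SRM score: observed error plus a complexity penalty for the class size.
def srm_score(observed_error, class_size, m):
    return observed_error + math.sqrt(math.log(class_size) / m)

m = 1000
candidates = {                      # (observed error, |H|), invented numbers
    "small H":  (0.20, 10),
    "medium H": (0.10, 10_000),
    "large H":  (0.08, 10 ** 12),   # lowest observed error, heaviest penalty
}
best = min(candidates, key=lambda k: srm_score(*candidates[k], m))
print(best)  # the middle class wins the error/complexity trade-off
```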

  • Genetic Programming
    - A search method. Example: decision trees.
    - Local mutation operations: change a node in a tree.
    - Cross-over operations: replace a subtree by another tree.
    - Keeps the best candidates: keep trees with low observed error.

  • General PAC Methodology
    - Minimize the observed error.
    - Search for a small-size classifier.
    - Hand-tailored search methods for specific classes.

  • Weak Learning
    - Small class of predicates H.
    - Weak learning: assume that for any distribution D, there is some predicate h ∈ H that predicts better than 1/2 + ε.
    - Weak Learning ⇒ Strong Learning.

  • Boosting Algorithms
    - Functions: a weighted majority of the predicates.
    - Methodology: change the distribution to target the hard examples.
    - The weight of an example is exponential in its number of incorrect classifications.
    - Extremely good experimental results and efficient algorithms.
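The weighting scheme above, sketched in its simplest form: each example's weight is exponential in how many predicates misclassified it, and renormalizing gives the new distribution that targets the hard examples. This shows the flavor of the update (the base `beta = 2.0` is an arbitrary choice, not AdaBoost's tuned value).

```python
# Distribution over examples, exponential in the number of mistakes on each.
def distribution(mistake_counts, beta=2.0):
    weights = [beta ** k for k in mistake_counts]
    total = sum(weights)
    return [w / total for w in weights]

# Three examples: the third was misclassified by 3 of the weak predicates,
# so it receives most of the probability mass.
print(distribution([0, 0, 3]))
```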

  • Support Vector Machine
    (Figure: mapping the data from n dimensions to m dimensions.)
    - Project the data to a high-dimensional space.
    - Use a hyperplane in the LARGE space.
    - Choose a hyperplane with a large MARGIN.
    (Figure: '+' and '-' points separated with a margin.)
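A toy illustration of the projection idea above: points on a line that no single threshold separates become separable by a hyperplane after an (invented) feature map x → (x, x²). This shows only the projection step, not margin maximization.

```python
# Hypothetical feature map into a higher-dimensional space.
def feature_map(x):
    return (x, x * x)

# Invented 1-D data: no threshold on x separates '+' from '-'.
points = [(-2, "-"), (-1, "+"), (1, "+"), (2, "-")]

# In the mapped space, the hyperplane x2 = 2.5 separates the classes.
def classify(x):
    _, x2 = feature_map(x)
    return "+" if x2 < 2.5 else "-"

print([classify(x) for x, _ in points])  # matches all four labels
```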

  • Other Models
    - Membership queries: the learner submits a point x and receives its label f(x).

  • Fourier Transform
    - f(x) = Σ_z a_z χ_z(x), where χ_z(x) = (-1)^⟨z,x⟩.
    - Many simple classes are well approximated using the large coefficients.
    - Efficient algorithms for finding the large coefficients.
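The expansion above can be computed exactly for a tiny example: a_z = E_x[f(x) χ_z(x)], evaluated here for a 2-bit function (AND, in the ±1 convention, chosen for illustration).

```python
from itertools import product

n = 2
def f(x):                       # AND of the two bits, as a +/-1 value
    return -1 if (x[0] == 1 and x[1] == 1) else 1

def chi(z, x):                  # parity character chi_z(x) = (-1)^<z,x>
    return (-1) ** sum(zi * xi for zi, xi in zip(z, x))

def coefficient(z):             # a_z = E_x[f(x) * chi_z(x)] over uniform x
    return sum(f(x) * chi(z, x) for x in product([0, 1], repeat=n)) / 2 ** n

for z in product([0, 1], repeat=n):
    print(z, coefficient(z))
```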

  • Reinforcement Learning
    - Control problems.
    - Changing the parameters changes the behavior.
    - Search for optimal policies.

  • Unsupervised Learning: Clustering
    (Figure slides.)

  • Basic Concepts in Probability
    For a single hypothesis h: given an observed error, bound the true error.
    - Markov inequality
    - Chebyshev inequality
    - Chernoff inequality
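A numerical comparison of the three inequalities above, on one illustrative case: the chance that the empirical mean of m fair coin flips exceeds 1/2 by more than γ. The sample size and deviation are arbitrary; the point is how much sharper each bound is than the last.

```python
import math

m, gamma = 1000, 0.05

# Markov:    Pr[mean >= 1/2 + gamma] <= E[mean] / (1/2 + gamma)
markov = 0.5 / (0.5 + gamma)
# Chebyshev: Pr[|mean - 1/2| >= gamma] <= Var[mean] / gamma^2,  Var = 1/(4m)
chebyshev = (0.25 / m) / gamma ** 2
# Chernoff (Hoeffding form): Pr[|mean - 1/2| >= gamma] <= 2 exp(-2 gamma^2 m)
chernoff = 2 * math.exp(-2 * gamma ** 2 * m)

print(markov, chebyshev, chernoff)  # the bounds sharpen left to right
```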

  • Basic Concepts in Probability
    Switching from h1 to h2: given the observed errors, predict whether h2 is better.
    - Using the total error rates.
    - Using only the cases where h1(x) ≠ h2(x) (a more refined comparison).
