Last lecture summary (SVM)


Page 1:

Last lecture summary (SVM)

Page 2:

• Support Vector Machine
• Supervised algorithm
• Works both as
  – classifier (binary)
  – regressor
• De facto better linear classification
• Two main ingredients:
  – maximum margin
  – kernel functions

Page 3:

Maximum margin

Which line is best?

$\boldsymbol{w} \cdot \boldsymbol{x} + b = 0$

Page 4:

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.

The maximum margin linear classifier is the optimum linear classifier.

This is the simplest kind of SVM (linear SVM).

• Maximum margin intuitively feels safest.
• Only support vectors are important.
• Works very well.

Page 5:

• The decision boundary is found by constrained quadratic optimization.

• The solution is found in the form

$\boldsymbol{w} = \sum_{i=1}^{n} y_i \alpha_i \, \boldsymbol{x}_i$

where the $\alpha_i$ are Lagrange multipliers.

• Only points on the margin (i.e. the support vectors $\boldsymbol{x}_i$) have $\alpha_i > 0$.

Page 6:

• w does not need to be explicitly formed, because:

$\boldsymbol{w} \cdot \boldsymbol{x} + b = \sum_{i=1}^{n} y_i \alpha_i \, \boldsymbol{x}_i \cdot \boldsymbol{x} + b$

• Training the SVM: find the set of parameters $\alpha_i$ and b.

• Classification with the SVM:

$\text{class}(\boldsymbol{x}_{\text{unknown}}) = \operatorname{sign}\!\left(\sum_{i=1}^{n} y_i \alpha_i \, \boldsymbol{x}_i \cdot \boldsymbol{x}_{\text{unknown}} + b\right)$
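To make the dual form concrete, here is a minimal scikit-learn sketch (the toy data is made up for illustration). For a fitted binary SVC, `dual_coef_` stores the products $y_i \alpha_i$ for the support vectors, so the decision value can be evaluated without ever forming w explicitly:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical, just for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_[0][j] = y_j * alpha_j for support vector j, so the sum below is
# exactly  sum_i y_i alpha_i (x_i . x) + b  from the slide.
x_new = np.array([1.5, 1.0])
decision = np.sum(clf.dual_coef_[0] * (clf.support_vectors_ @ x_new)) + clf.intercept_[0]

print(decision, np.sign(decision))                            # manual decision value and class
print(clf.decision_function([x_new]), clf.predict([x_new]))   # should agree with the library
```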

Page 7:

• Soft margin
  – Allows misclassification errors.
  – i.e. misclassified points are allowed to be inside the margin.
  – The penalty for classification errors is given by the capacity parameter C (a user-adjustable parameter).
  – Large C – a high penalty for classification errors.
  – Decreasing C lets more points move inside the margin.
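The effect of C is easy to see in scikit-learn; the sketch below, on synthetic overlapping data, is purely illustrative — a smaller C softens the margin and typically leaves more points inside it, which shows up as more support vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs (synthetic data) so that some points must violate the margin.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Lower C = weaker penalty on margin violations = usually more support vectors.
    print(f"C = {C:>6}: {clf.n_support_.sum()} support vectors")
```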

Page 8:

CSE 802. Prepared by Martin Law

Page 9:

Kernel functions

• The soft margin makes it possible to apply a linear classifier to linearly non-separable data sets.

• What else could be done? Can we generate a non-linear classification boundary just by extending the linear classifier machinery?

Page 10:

• We can map the input data points from the input space to the feature space by using an appropriate mapping $\phi$.

• In the feature space the discriminant function becomes

$f(\boldsymbol{x}) = \sum_{i=1}^{n} y_i \alpha_i \, \phi(\boldsymbol{x}_i) \cdot \phi(\boldsymbol{x}) + b$

Page 11:

Page 12:

• Once we decide on the mapping $\phi$, the coordinates of each data point in the feature space must be calculated (from the coordinates in the input space).

• However, explicitly calculating the coordinates in the feature space can be avoided by using the kernel trick.

Page 13:

• We know that the discriminant function is given by

$f(\boldsymbol{x}) = \sum_{i=1}^{n} y_i \alpha_i \, \boldsymbol{x}_i \cdot \boldsymbol{x} + b$

• In the feature space it becomes

$f(\boldsymbol{x}) = \sum_{i=1}^{n} y_i \alpha_i \, \phi(\boldsymbol{x}_i) \cdot \phi(\boldsymbol{x}) + b$

• And we define the kernel function

$k(\boldsymbol{x}, \boldsymbol{z}) = \phi(\boldsymbol{x}) \cdot \phi(\boldsymbol{z})$

Page 14:

• The kernel allows us to calculate the inner product directly from the coordinates in the input space.

• No transformation of the data points to the feature space is needed.

$\phi(\boldsymbol{x}) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right) \qquad k(\boldsymbol{x}, \boldsymbol{z}) = (\boldsymbol{x} \cdot \boldsymbol{z})^2$
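A quick NumPy check of the identity above — computing $\phi(\boldsymbol{x}) \cdot \phi(\boldsymbol{z})$ explicitly and computing $k(\boldsymbol{x}, \boldsymbol{z}) = (\boldsymbol{x} \cdot \boldsymbol{z})^2$ directly in the input space give the same number (the concrete vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    # Quadratic kernel evaluated directly in the input space.
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(phi(x) @ phi(z))   # inner product in the feature space -> 16.0
print(k(x, z))           # kernel trick, no mapping needed    -> 16.0
```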

Page 15:

Kernels

• Linear (dot) kernel

$k(\boldsymbol{x}, \boldsymbol{z}) = \boldsymbol{x} \cdot \boldsymbol{z}$

• Polynomial
  – simple, efficient for non-linear relationships
  – d – degree

$k(\boldsymbol{x}, \boldsymbol{z}) = (\boldsymbol{x} \cdot \boldsymbol{z} + 1)^d$

• Gaussian

$k(\boldsymbol{x}, \boldsymbol{z}) = \exp\!\left(-\frac{\|\boldsymbol{x} - \boldsymbol{z}\|^2}{2\sigma^2}\right)$
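As a reference, the three kernels can be written down directly in NumPy; the default values for d and σ below are arbitrary choices, not part of the original slides:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3):
    return (x @ z + 1) ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z, d=2), gaussian_kernel(x, z, sigma=1.5))
```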

Page 16:

Finishing SVM

Page 17:

SVM parameters

• Training sets the parameters $\alpha_i$ and b.
• The SVM has another set of parameters called hyperparameters:
  – the soft margin constant C,
  – any parameters the kernel function depends on:
    • linear kernel – no hyperparameter (except for C)
    • polynomial – degree
    • Gaussian – width of the Gaussian

Page 18:

• So which kernel and which parameters should I use?
• The answer is data-dependent.
• Several kernels should be tried.
• Try the linear kernel first and see whether the classification can be improved with non-linear kernels (trade-off between the quality of the kernel and the number of dimensions).
• Select the kernel, its parameters and C by cross-validation (see the sketch below).
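One common way to do this selection is a cross-validated grid search; the sketch below uses scikit-learn and the built-in iris data purely as an example, and the parameter grids are illustrative rather than recommended values:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try a linear kernel first, then non-linear kernels, selecting C and the
# kernel parameters by 5-fold cross-validation.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1], "C": [0.1, 1, 10]},
]

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```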

Page 19:

Computational aspects

• Classification of new samples is very quick; training takes longer (but is reasonably fast for thousands of samples).
• Linear kernel – scales linearly.
• Non-linear kernels – scale quadratically.

Page 20:

Multiclass SVM

• SVM is defined for binary classification.
• How do we predict more than two classes (multiclass)?
• Simplest approach: decompose the multiclass problem into several binary problems and train several binary SVMs.

Page 21:

• one-versus-one approach
  – Train a binary SVM for every pair of classes from the training set.
  – For a k-class problem this creates k(k − 1)/2 SVM models.
  – Prediction: a voting procedure assigns the example to the class with the maximum number of votes.

(figure: example of the voting procedure for a 4-class problem with the pairwise classifiers 1/2, 1/3, 1/4, 2/3, 2/4, 3/4)

Page 22:

• one-versus-all approach
  – For a k-class problem train only k SVM models.
  – Each is trained to predict one class (+1) vs. the rest of the classes (−1).
  – Prediction: winner-takes-all strategy – assign the new example to the class with the largest output value.

(figure: the four classifiers 1/rest, 2/rest, 3/rest, 4/rest)
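Both decompositions are available as wrappers in scikit-learn; the following sketch (on the 3-class iris data, chosen only for illustration) trains k(k − 1)/2 = 3 pairwise SVMs for one-versus-one and k = 3 SVMs for one-versus-all:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # one binary SVM per pair of classes
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one binary SVM per class

print(len(ovo.estimators_), len(ovr.estimators_))   # 3 pairwise models, 3 one-vs-rest models
print(ovo.predict(X[:3]), ovr.predict(X[:3]))
```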

Page 23:

Resources

• SVM and Kernels for Comput. Biol., Ratsch et al., PLOS Comput. Biol., 4 (10), 1-10, 2008

• What is a support vector machine, W. S. Noble, Nature Biotechnology, 24 (12), 1565-1567, 2006

• A tutorial on SVM for pattern recognition, C. J. C. Burges, Data Mining and Knowledge Discovery, 2, 121-167, 1998

• A User’s Guide to Support Vector Machines, Asa Ben-Hur, Jason Weston

Page 24:

• http://support-vector-machines.org/
• http://www.kernel-machines.org/
• http://www.support-vector.net/
  – companion to the book An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor
• http://www.kernel-methods.net/
  – companion to the book Kernel Methods for Pattern Analysis by Shawe-Taylor and Cristianini
• http://www.learning-with-kernels.org/
  – several chapters on SVM from the book Learning with Kernels by Schölkopf and Smola are available from this site

Page 25:

Software

• SVMlight – one of the most widely used SVM packages: fast optimization, can handle very large data sets, very efficient implementation of leave-one-out cross-validation; C++ code.

• SVMstruct – can model complex data, such as trees, sequences, or sets.

• LIBSVM – multiclass, weighted SVM for unbalanced data, cross-validation, automatic model selection; C++, Java.

Page 26:

Naïve Bayes Classifier

Page 27:

Example – Play Tennis

Page 28:

Example – Learning Phase

Outlook       Play=Yes   Play=No
Sunny           2/9        3/5
Overcast        4/9        0/5
Rain            3/9        2/5

Temperature   Play=Yes   Play=No
Hot             2/9        2/5
Mild            4/9        2/5
Cool            3/9        1/5

Humidity      Play=Yes   Play=No
High            3/9        4/5
Normal          6/9        1/5

Wind          Play=Yes   Play=No
Strong          3/9        3/5
Weak            6/9        2/5

P(Play=Yes) = 9/14 P(Play=No) = 5/14

P(Outlook=Sunny|Play=Yes) = 2/9

Page 29:

Example – prediction

• Answer this question: "Will we play tennis given that it's cool but sunny, humidity is high and a strong wind is blowing?"

• i.e., predict this new instance:
x' = (Outl=Sunny, Temp=Cool, Hum=High, Wind=Strong)

• A good strategy is to predict arg max P(Y | cool, sunny, high, strong), where Y is Yes or No.

Page 30:

Example – Prediction
x' = (Outl=Sunny, Temp=Cool, Hum=High, Wind=Strong)

Look up tables

P(Outl=Sunny|Play=No) = 3/5

P(Temp=Cool|Play=No) = 1/5

P(Hum=High|Play=No) = 4/5

P(Wind=Strong|Play=No) = 3/5

P(Play=No) = 5/14

P(Outl=Sunny|Play=Yes) = 2/9

P(Temp=Cool|Play=Yes) = 3/9

P(Hum=High|Play=Yes) = 3/9

P(Wind=Strong|Play=Yes) = 3/9

P(Play=Yes) = 9/14

P(Yes|x'): [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x'):  [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206

Given the fact P(Yes|x’) < P(No|x’), we label x’ to be “No”.
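The same arithmetic in a few lines of Python, using the probabilities from the lookup tables above:

```python
# Unnormalized scores P(x'|Y) * P(Y) for the instance x' above.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

print(round(p_yes, 4), round(p_no, 4))              # 0.0053  0.0206
print("Play =", "Yes" if p_yes > p_no else "No")    # No

# Normalizing turns the scores into proper posteriors:
print(round(p_no / (p_yes + p_no), 2))              # P(No|x') ~ 0.8
```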

Page 31:

Another Application

• Digit Recognition

• X1, …, Xn ∈ {0, 1} (black vs. white pixels)
• Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)

(figure: digit image → classifier → "5")

Page 32:

Bayes Rule

So how do we compute the posterior probability that the image represents a 5 given its pixels?

Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label (i.e. we will try to compute the likelihood).

$\underbrace{P(Y \mid X_1,\dots,X_n)}_{\text{posterior}} = \frac{\overbrace{P(X_1,\dots,X_n \mid Y)}^{\text{likelihood}}\;\overbrace{P(Y)}^{\text{prior}}}{\underbrace{P(X_1,\dots,X_n)}_{\text{normalization constant}}}$

Page 33:

• Let's expand this for our digit recognition task:

$P(Y=5 \mid X_1,\dots,X_n) = \frac{P(X_1,\dots,X_n \mid Y=5)\,P(Y=5)}{P(X_1,\dots,X_n)}$, and similarly for $Y = 6$.

• To classify, we’ll simply compute these two probabilities and predict based on which one is greater.

• For the Bayes classifier, we need to “learn” two functions, the likelihood and the prior.

Page 34:

Learning prior

• Let us assume training examples are generated by drawing instances at random from an unknown underlying distribution P(Y), and that a teacher then labels each instance with its Y value.

• A hundred independently drawn training examples will usually suffice to obtain a reasonable estimate of P(Y).

Page 35:

Learning likelihood

• Consider the number of parameters we must estimate when Y is boolean and X is a vector of n boolean attributes.

• In this case we need to estimate a set of parameters (i.e. probabilities) $\theta_{ij} = P(X = x_i \mid Y = y_j)$.

• Index i takes $2^n$ values and index j takes 2 values, so naively $2^n \cdot 2 = 2^{n+1}$ parameters;
  – however, for any fixed j the $\theta_{ij}$ must sum to 1 over i, so the number of independent parameters is

$2(2^n - 1)$

Page 36:

• So this corresponds to two distinct parameters for each of the $2^n$ distinct instances in the instance space for X.

• Worse yet, to obtain reliable estimates of each of these parameters, we will need to observe each of these distinct instances multiple times.

• For example, if X is a vector containing 30 boolean features, then we will need to estimate more than 2 billion parameters!

Page 37:
Page 38:

• The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters:
  – We'll run out of space.
  – We'll run out of time.
  – And we'll need tons of training data (which is usually not available).

Page 39:

The Naïve Bayes Model

• The Naïve Bayes Assumption: assume that all features are independent given the class label Y.

• Equationally speaking:

$P(X_1,\dots,X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$

Page 40:

Naïve Bayes Training

MNIST Training Data

Page 41:

Naïve Bayes Training

• Training in Naïve Bayes is easy:
  – Estimate P(Y=v) as the fraction of records with Y=v.
  – Estimate P(Xi=u | Y=v) as the fraction of records with Y=v for which Xi=u.

Page 42:

Naïve Bayes Training

• In practice, some of these counts can be zero.
• Fix this by adding "virtual" counts.
• This is called smoothing.
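A minimal sketch of this counting-based training, with add-one (Laplace) smoothing as one common way of adding virtual counts; the tiny record list is made up and the helper names are hypothetical:

```python
from collections import Counter, defaultdict

# Made-up training records: (feature dict, class label).
records = [
    ({"Outlook": "Sunny", "Wind": "Strong"}, "No"),
    ({"Outlook": "Sunny", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",  "Wind": "Weak"},   "Yes"),
]

class_counts = Counter(y for _, y in records)
feature_counts = defaultdict(Counter)            # (feature, class) -> Counter of values
for x, y in records:
    for feat, val in x.items():
        feature_counts[(feat, y)][val] += 1

def p_class(v):
    # P(Y=v): fraction of records with Y=v.
    return class_counts[v] / len(records)

def p_feature(feat, u, v, n_values):
    # P(X_feat=u | Y=v) with add-one smoothing, so unseen values never get probability 0.
    return (feature_counts[(feat, v)][u] + 1) / (class_counts[v] + n_values)

print(p_class("Yes"))                               # 2/3
print(p_feature("Outlook", "Overcast", "Yes", 3))   # smoothed: non-zero despite a zero count
```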

Page 43:

Naïve Bayes Training

For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.

Page 44:

Naïve Bayes Classification

Page 45:

Assorted remarks

• What's nice about Naïve Bayes is that it returns probabilities.
  – These probabilities can tell us how confident the algorithm is.
  – So… don't throw away these probabilities!
• The Naïve Bayes assumption is almost never true.
  – Still… Naïve Bayes often performs surprisingly well even when its assumptions do not hold.
  – It is a very good method for text processing.

Page 46:

Binary classifier performance

Page 47:

Confusion matrix

TP True Positives – is positive and is classified as positive
TN True Negatives – is negative and is classified as negative
FP False Positives – is negative, but is classified as positive
FN False Negatives – is positive, but is classified as negative

                          Known label
                      positive    negative
Predicted  positive      TP          FP
label      negative      FN          TN

also called a contingency table

Page 48:

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

                          Known label
                      positive    negative
Predicted  positive      TP          FP
label      negative      FN          TN

Page 49:

Information retrieval (IR)

• The user issues a query to find relevant documents in the database.

• IR systems allow the user to narrow down the set of documents that are relevant to a particular problem.

Page 50:

(figure: the document collection split into documents containing what I am looking for and documents not containing what I am looking for)

Page 51:

Precision = TP / (TP + FP)
How many of the things I consider to be true are actually true?

Recall = TP / (TP + FN)
How much of the true things do I find?

Page 52:

Precision

• A measure of exactness.
• A perfect precision score of 1.0 means that every result retrieved by a search was relevant.
• But it says nothing about whether all relevant documents were retrieved.

Page 53:

Recall

• A measure of completeness.
• A perfect recall score of 1.0 means that all relevant documents were retrieved by the search.
• But it says nothing about how many irrelevant documents were also retrieved.

Page 54:

Precision–Recall tradeoff

• Returning all documents leads to a perfect recall of 1.0.
  – i.e. all relevant documents are present in the returned set.
• However, precision is then not that great, as not every result is relevant.
• The relationship between them is inverse – it is possible to increase one at the cost of reducing the other.
• They are not discussed in isolation.
  – Either values of one measure are compared at a fixed level of the other measure (e.g. precision at a recall level of 0.75),
  – or both are combined into the F-measure.

Page 55:

F-measure

• Common F1 measure

$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

• General Fβ measure

$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$

• β – relative weight of precision
  – β = 1 – weight precision and recall by the same amount
  – β < 1 – more weight on precision
  – β > 1 – more weight on recall
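A one-function sketch of the formulas above (the precision and recall values are arbitrary):

```python
def f_beta(precision, recall, beta=1.0):
    # F1 when beta = 1; beta > 1 weights recall more, beta < 1 weights precision more.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

p, r = 0.8, 0.5
print(f_beta(p, r))             # F1    ~ 0.615
print(f_beta(p, r, beta=2.0))   # F2    ~ 0.541 (recall-heavy)
print(f_beta(p, r, beta=0.5))   # F0.5  ~ 0.714 (precision-heavy)
```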

Page 56:

Sensitivity & Specificity

• Measure how 'good' a test is at detecting binary features of interest (disease / no disease).

• There are 100 patients, 30 of whom have disease A.
• A test is designed to identify who has the disease and who does not.
• We want to evaluate how good the test is.

Page 57:

Sensitivity & Specificity

          Disease+   Disease−
Test+        25          2
Test−         5         68

Page 58:

Sensitivity & Specificity

sensitivity = 25/30
specificity = 68/70

          Disease+   Disease−   Total
Test+        25          2        27
Test−         5         68        73
Total        30         70       100

Page 59:

• Sensitivity measures the proportion of actual positives which are correctly identified as such (e.g. the percentage of sick people who are identified as sick).

TP/(TP + FN)

• Specificity measures the proportion of negatives which are correctly identified (e.g. the percentage of healthy people who are identified as healthy).

TN/(TN + FP)

Page 60:

Performance Evaluation

Precision, Positive Predictive Value (PPV):               TP / (TP + FP)
Recall, Sensitivity, True Positive Rate (TPR), Hit rate:  TP / P = TP / (TP + FN)
False Positive Rate (FPR), Fall-out:                      FP / N = FP / (FP + TN)
Specificity, True Negative Rate (TNR):                    TN / (TN + FP) = 1 − FPR
Accuracy:                                                 (TP + TN) / (TP + TN + FP + FN)
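Plugging the disease-test counts from the earlier table (TP = 25, FP = 2, FN = 5, TN = 68) into these definitions:

```python
TP, FP, FN, TN = 25, 2, 5, 68

precision   = TP / (TP + FP)                    # PPV            ~ 0.926
recall      = TP / (TP + FN)                    # sensitivity    ~ 0.833
fpr         = FP / (FP + TN)                    # fall-out       ~ 0.029
specificity = TN / (TN + FP)                    # TNR = 1 - FPR  ~ 0.971
accuracy    = (TP + TN) / (TP + TN + FP + FN)   #                  0.93

print(precision, recall, fpr, specificity, accuracy)
```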

Page 61:

Types of classifiers

• A discrete (crisp) classifier
  – The output is only a class label, e.g. a decision tree.
• A soft classifier
  – Yields a probability (score, confidence) for the given pattern.
  – A number representing the degree to which an instance is a member of a class.
  – A threshold is used to assign the instance to the (+) or the (−) class.
  – e.g. SVM, NN, naïve Bayes.

Page 62:

ROC Graph

• Receiver Operating Characteristics.
• Plot TPR vs. FPR
  – i.e. sensitivity vs. (1 − specificity).
  – TPR is on the Y axis, FPR on the X axis.
• An ROC graph depicts the relative trade-off between benefits (true positives) and costs (false positives).

Page 63:

(figure: the ROC plane – the diagonal through (0.5, 0.5) corresponds to random guessing; points above the diagonal are better, points below it worse; (0, 0) never issues a positive classification, (1, 1) always issues positive classifications, and (0, 1) is perfect classification)

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874.

Page 64:

A is more conservative than B.

Conservative classifiers make positive classifications only with strong evidence, so they make few false positive errors, but they often have low true positive rates as well.

Liberal classifiers make positive classifications with weak evidence, so they classify nearly all positives correctly, but they often have high false positive rates.

Page 65:

ROC Curve

Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers

Page 66:

Lowering the threshold corresponds to moving from the conservative to the liberal areas of the graph.

Fawcett, ROC Graphs: Notes and Practical Considerations for Researchers

Page 67:
Page 68:

AUC

• To compare classifiers we may want to reduce ROC performance to a single scalar value representing expected performance.

• A common method is to calculate the area under the ROC curve, abbreviated AUC.

• Its value will always be between 0 and 1.0.
• Random guessing has an area of 0.5.
• Any realistic classifier should have an AUC between 0.5 and 1.0.
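With a soft classifier's scores in hand, scikit-learn computes the ROC points and the AUC directly; the labels and scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # hypothetical true labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])   # classifier scores

# One (FPR, TPR) point per score threshold; lowering the threshold moves
# from the conservative toward the liberal region of the ROC graph.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(np.column_stack([thresholds, fpr, tpr]))

print("AUC =", roc_auc_score(y_true, y_scores))
```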

Page 69:

The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.

Classifier B is generally better than A; it has a higher AUC.

The exception is at FPR > 0.55, where A has a slight advantage.

So it is possible for a high-AUC classifier to perform worse in a specific region of ROC space than a low-AUC classifier.

But in practice the AUC performs very well and is often used when a general measure of predictiveness is desired.