
Announcements

- IBM Lecture on Watson Analytics will be next Monday, March 07, in RB 3201: http://carleton.ca/ims/rooms/river-building-3201/
- Schedule of project presentations: enter your preferences in the file shared on Slack.
- Details about Data Day 3.0:
  - Register (free) and attend Data Day on Tuesday, March 29: http://carleton.ca/cuids/cu-events/data-day-3-0-2/
  - Consider participating in the Graduate Student Poster Competition (prizes: $750, $500, and $250 for 1st, 2nd, and 3rd place, respectively): http://carleton.ca/cuids/cu-events/data-day-3-0-graduate-student-poster-competition/
  - Volunteers wanted. Please email Kathryn Elliot ([email protected]) if interested.

Machine Learning

February 29, 2016

Naïve Bayes Classification

Naive Bayes classifiers are especially useful for problems:

- with many input variables,
- with categorical input variables that have a very large number of possible values,
- of text classification.

Naive Bayes would be a good first attempt at solving the categorization problem.

Naïve Bayes Classification

- Applicable for a categorical response with categorical predictors.
- Bayes' theorem says that
\[
P(Y = y \mid X_1 = x_1, X_2 = x_2) = \frac{P(Y = y)\, P(X_1 = x_1, X_2 = x_2 \mid Y = y)}{P(X_1 = x_1, X_2 = x_2)}
\]

- The denominator can be expanded by conditioning on Y:
\[
P(X_1 = x_1, X_2 = x_2) = \sum_z P(X_1 = x_1, X_2 = x_2 \mid Y = z)\, P(Y = z)
\]

- The Naïve Bayes method is to assume the X_j are mutually conditionally independent given Y, i.e.
\[
P(X_1 = x_1, X_2 = x_2 \mid Y = z) = P(X_1 = x_1 \mid Y = z)\, P(X_2 = x_2 \mid Y = z)
\]

- Now the probabilities on the right-hand side can be estimated by counting from the data.
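For instance, with the discretized Default data used in the example that follows, the prior and one set of conditional probabilities can be computed directly from cross-tabulations. A minimal sketch (the naiveBayes() call below does the equivalent internally):

library(ISLR)    # Default data set
library(dplyr)   # mutate()
D <- mutate(Default, income=cut(income, 3), balance=cut(balance, 2))
prop.table(table(D$default))                                  # prior P(Y = y)
prop.table(table(Y=D$default, student=D$student), margin=1)   # P(student | Y = y)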

Example of Naïve Bayes

library(e1071)
library(ISLR)    # provides the Default data set
library(dplyr)   # provides mutate()
D <- mutate(Default, income=cut(income, 3), balance=cut(balance, 2))
nb.D <- naiveBayes(default~., data=D, subset=train)  # 'train'/'test' are index vectors defined elsewhere

* * *

A-priori probabilities:
Y
        No        Yes
0.96570645 0.03429355

Conditional probabilities:
     student
Y            No       Yes
  No  0.7073864 0.2926136
  Yes 0.6181818 0.3818182

     balance
Y     (-2.65,1.33e+03] (1.33e+03,2.66e+03]
  No        0.86454029          0.13545971
  Yes       0.09090909          0.90909091

     income
Y     (699,2.5e+04] (2.5e+04,4.93e+04] (4.93e+04,7.36e+04]
  No      0.3242510          0.5497159           0.1260331
  Yes     0.3927273          0.4836364           0.1236364

Example of Naïve Bayes

D <- mutate(Default, income=cut(income, 10), balance=cut(balance, 10))
nb.D <- naiveBayes(default~., data=D, subset=train)
nb.pred <- predict(nb.D, subset(D, test))
table(Actual=D$default[test], Predicted=nb.pred)

      Predicted
Actual   No  Yes
  No   1905   18
  Yes    40   18

Neural Networks

[Figure: a feed-forward network with an input layer (X1-X4), one hidden layer (Z1-Z5), and an output layer (Y1-Y3).]

Neural Networks

\[
Z_m = \sigma(\alpha_{0m} + \alpha_{1m} X_1 + \cdots + \alpha_{pm} X_p)
\]
\[
Y_j = \beta_{0j} + \beta_{1j} Z_1 + \cdots + \beta_{Mj} Z_M
\]

- The input neurons are attached to the predictors X_1, ..., X_p.
- The neurons in the hidden layer, Z_1, ..., Z_M, are linear combinations of the inputs, activated by the sigmoid function σ(v) = 1/(1 + e^{-v}).
- There may be zero, one, or multiple hidden layers, with each layer computed from linear combinations of the previous one.
- The last layer is attached to the outputs (a small forward-pass sketch follows this list).
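To make the two formulas concrete, here is a minimal forward-pass sketch for one hidden layer in base R; the weights alpha and beta are made-up illustrations, not fitted values.

sigmoid <- function(v) 1 / (1 + exp(-v))

p <- 3; M <- 5                          # 3 inputs, 5 hidden neurons, 1 output
set.seed(1)
alpha <- matrix(rnorm((p + 1) * M), nrow = p + 1)  # column m = (alpha_0m, ..., alpha_pm)
beta  <- rnorm(M + 1)                              # (beta_0, beta_1, ..., beta_M)

x <- c(0.2, -1.0, 0.5)                  # one observation
Z <- sigmoid(drop(c(1, x) %*% alpha))   # Z_m = sigma(alpha_0m + sum_j alpha_jm x_j)
Y <- beta[1] + sum(beta[-1] * Z)        # Y = beta_0 + sum_m beta_m Z_m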

Neural Networks Example

> library(nnet)
> nnet.fit <- nnet(default~., data=Default, subset=train, size=5)
# weights:  26
initial  value 6553.347412
iter  10 value 1136.024073
iter  20 value 1135.901203
final  value 1135.901077
converged

> summary(nnet.fit)
a 3-5-1 network with 26 weights
options were - entropy fitting
 b->h1 i1->h1 i2->h1 i3->h1
 -0.10  -0.22  -0.37  -0.47
 b->h2 i1->h2 i2->h2 i3->h2
  0.05  -0.46  -0.25   0.25
 b->h3 i1->h3 i2->h3 i3->h3
 -0.33   0.55   0.44   0.40
 b->h4 i1->h4 i2->h4 i3->h4
  0.30   0.27   0.08  -0.28
 b->h5 i1->h5 i2->h5 i3->h5
 -0.04   0.01  -0.06  -0.07
  b->o  h1->o  h2->o  h3->o  h4->o  h5->o
-22.19  -0.01   8.29  10.50   0.18   0.35

Neural Networks Example

> nnet.pred <- predict(nnet.fit, newdata=subset(Default, test), type="class")
> table(Actual=Default$default[test], Predicted=nnet.pred)
      Predicted
Actual   No
  No   1939
  Yes    76

- The table is missing the "Yes" column because the neural network didn't predict any positives.
- The neural network model is over-parametrized and there is a danger of over-fitting.
- The minimization is unstable, and random initialization leads to a different solution each time (one common mitigation is sketched below).
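One common mitigation, not shown on the slide, is to fix the random seed for reproducibility and add nnet's weight-decay penalty to damp the over-fitting; the decay value here is an arbitrary illustration:

set.seed(42)                            # fix the random initial weights
nnet.fit2 <- nnet(default~., data=Default, subset=train,
                  size=5, decay=1e-3,   # decay shrinks the weights toward zero
                  maxit=500)            # allow more iterations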

K-Means Clustering

- Pick a number of clusters, say K.
- Start with a random assignment of each observation to one of the K clusters.
- For each cluster, compute the centroid as the mean of the points in the cluster.
- Reassign observations to clusters, with each observation going to the cluster with the nearest centroid.
- Repeat until convergence (a minimal implementation is sketched below).
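As a complement to the built-in kmeans(), the loop above can be written out directly. A minimal sketch of the algorithm in base R (X is a numeric matrix of points; this is not the slide's implementation):

my.kmeans <- function(X, K, iters=100) {
  cl <- sample(K, nrow(X), replace=TRUE)              # random initial assignment
  for (i in seq_len(iters)) {
    # centroids: column means of each cluster's points (p x K matrix)
    cent <- sapply(seq_len(K), function(k) colMeans(X[cl == k, , drop=FALSE]))
    # squared Euclidean distance of every point to every centroid (n x K)
    d <- sapply(seq_len(K), function(k) colSums((t(X) - cent[, k])^2))
    new.cl <- max.col(-d)                             # nearest centroid per point
    if (all(new.cl == cl)) break                      # converged
    cl <- new.cl
  }
  list(cluster=cl, centers=t(cent))
}
# e.g. my.kmeans(as.matrix(pts[c('x','y')]), 2) on the data below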

K-Means Clustering

Example with simulated data.

library(ggplot2)  # provides qplot()
pts <- read.csv('pts_2clusters.csv', header=TRUE)
qplot(x, y, data=pts, color=cl) + labs(color="Cluster")

[Scatter plot of the simulated points (x vs. y), colored by the two actual clusters.]

K-Means Clustering

Solve for two clusters.

km.out <- kmeans(pts, 2)
qplot(x, y, data=mutate(pts, cl.1=factor(km.out$cluster)), color=cl.1)

[Scatter plot of the k-means solution with K = 2, points colored by predicted cluster.]

K-Means Clustering

[Scatter plots of k-means solutions with K = 3, 4, 5, and 10, points colored by predicted cluster.]

Hierarchical Clustering

- Here we don't pick the number of clusters in advance; this is decided by the algorithm.
- We need a distance or dissimilarity measure.
- Start with each point in its own cluster.
- Compute all pairwise dissimilarities and merge the two most similar clusters.
- Repeat until some stopping criterion is reached.
- To compute the dissimilarity between two clusters, A and B, one may look at different possibilities (compared in the sketch after this list):
  - Take the dissimilarity of the two centroids.
  - Compute all pairwise dissimilarities between points in A and points in B, then:
    - Complete linkage: take the maximum;
    - Single linkage: take the minimum;
    - Average linkage: take the average.
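The figures that follow show only the resulting cluster assignments, so here is a hedged sketch of how the three linkage rules would be compared with hclust(), assuming the pts data frame from the k-means example:

d <- dist(pts[c('x','y')])                 # Euclidean dissimilarities
linkages <- c("complete", "single", "average")
cuts <- sapply(linkages, function(m) cutree(hclust(d, method=m), k=2))
head(cuts)                                 # one column of cluster labels per linkage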

Hierarchical Clustering

Example with the same simulated data.

library(grDevices)
hc.out <- hclust(dist(pts[c('x','y')]), method="complete")
plot(hc.out, xlab="", main="Complete linkage", sub="")

[Dendrogram of the complete-linkage clustering (height 0-8); the leaf labels are observation indices and are omitted here.]

Hierarchical Clustering

cl.1 <- cutree(hc.out, k=2)
qplot(x, y, data=mutate(pts, cl.1=factor(cl.1)), color=cl.1)
plot(hc.out, xlab="", main="Complete linkage", sub="")

[Scatter plot of the two-cluster cut under complete linkage, points colored by cluster.]

Hierarchical Clustering

[Scatter plots of tree cuts: two clusters under complete, average, and single linkage, and three clusters under average linkage.]

Association rules

- Data is a binary matrix with columns corresponding to products and rows corresponding to baskets.
- Entry (i, j) is TRUE if customer i purchased product j.
- The Apriori algorithm looks at the most probable sets of products and combines them.

Association rules

- An association rule is a claim such as: A & B ⇒ C.
- Support for the rule is the probability of all items appearing together:
\[
\mathrm{Support}(A \,\&\, B \,\&\, C) = \frac{\text{Number of baskets with } A, B \text{ and } C}{\text{Total number of baskets}}
\]
- Confidence of a rule is the conditional probability of the implied item:
\[
\mathrm{Confidence}(A \,\&\, B \Rightarrow C) = \frac{\mathrm{Support}(A \,\&\, B \,\&\, C)}{\mathrm{Support}(A \,\&\, B)}
\]
- Lift of a rule is:
\[
\mathrm{Lift}(A \,\&\, B \Rightarrow C) = \frac{\mathrm{Confidence}(A \,\&\, B \Rightarrow C)}{\mathrm{Support}(C)}
\]
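These three quantities are ratios of counts, so they can be checked by hand. A minimal sketch, assuming mb is the logical basket matrix described above; the column names "A", "B", "C" are made-up illustrations:

support <- function(mb, items) mean(apply(mb[, items, drop=FALSE], 1, all))
supp.ABC <- support(mb, c("A", "B", "C"))
conf <- supp.ABC / support(mb, c("A", "B"))   # Confidence(A & B => C)
lift <- conf / support(mb, "C")               # Lift(A & B => C)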

Association rules

- We start by computing the supports of all single items and sort them.
- Then we prune at, say, 80% and compute the support of all rules with two items among the remaining ones.
- Sort and prune. Then proceed with rules with three items, not including pairs that have been pruned. And so on (a sketch of this level-wise pruning follows below).
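A hedged sketch of the first two levels of this pruning, again assuming the logical basket matrix mb; it illustrates the idea rather than reproducing the arules implementation:

minsup <- 0.8
supp1 <- colMeans(mb)                        # level 1: single-item supports
items <- names(supp1[supp1 >= minsup])       # prune infrequent items
pairs <- combn(items, 2, simplify=FALSE)     # level 2: pairs of surviving items
supp2 <- sapply(pairs, function(p) mean(mb[, p[1]] & mb[, p[2]]))
names(supp2) <- sapply(pairs, paste, collapse=" & ")
sort(supp2, decreasing=TRUE)                 # frequent pairs, most common first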

Association rules

> mb <- read.csv('MarketBasket.csv')  # Simulated data
> library(arules)
> rules <- apriori(mb, parameter=list(supp=0.8, conf=0.8, target="rules"))

Parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE     0.8      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)  (c) 1996-2004  Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[5 item(s), 500 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [12 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

Association rules

> inspect(rules)
   lhs        rhs  support confidence      lift
1  {}      => {V4}   0.932  0.9320000 1.0000000
2  {}      => {V1}   0.950  0.9500000 1.0000000
3  {}      => {V3}   1.000  1.0000000 1.0000000
4  {V4}    => {V1}   0.882  0.9463519 0.9961599
5  {V1}    => {V4}   0.882  0.9284211 0.9961599
6  {V4}    => {V3}   0.932  1.0000000 1.0000000
7  {V3}    => {V4}   0.932  0.9320000 1.0000000
8  {V1}    => {V3}   0.950  1.0000000 1.0000000
9  {V3}    => {V1}   0.950  0.9500000 1.0000000
10 {V1,V4} => {V3}   0.882  1.0000000 1.0000000
11 {V3,V4} => {V1}   0.882  0.9463519 0.9961599
12 {V1,V3} => {V4}   0.882  0.9284211 0.9961599