Experiments in genetic programming

Bouvet BigOne, 2012-03-29
Lars Marius Garshol, <[email protected]>
http://twitter.com/larsga


TRANSCRIPT

Page 1: Experiments in genetic programming

Page 2: The background

• Duke
  – open source data matching engine (Java)
  – can find near-duplicate database records
  – probabilistic configuration
  – http://code.google.com/p/duke/

• People find making configurations difficult
  – can we help them?

FIELD      RECORD 1    RECORD 2      PROBABILITY
Name       acme inc    acme inc      0.9
Assoc no   177477707                 0.5
Zip code   9161        9161          0.6
Country    norway      norway        0.51
Address 1  mb 113      mailbox 113   0.49
Address 2                            0.5
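The per-field probabilities above get combined into a single probability that the two records match. The slides don't show the formula, but the standard choice in probabilistic record linkage, and as far as I know roughly what Duke does, is naive Bayesian combination, where fields at 0.5 carry no evidence either way. A minimal sketch:

```python
def combine(probabilities):
    """Combine per-field match probabilities into one record-level
    probability with the naive Bayesian formula: the product of the
    probabilities, normalized against the product of their complements."""
    p_match = 1.0
    p_nonmatch = 1.0
    for p in probabilities:
        p_match *= p
        p_nonmatch *= (1.0 - p)
    return p_match / (p_match + p_nonmatch)

# The field probabilities from the table above
fields = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
combined = combine(fields)   # roughly 0.93
```

Note how the strong name match (0.9) dominates: the fields near 0.5 barely move the result, which is what lets a single confident comparator decide a match.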

Page 3: The idea

• Given
  – a test file showing the correct linkages
• can we
  – evolve a configuration
• using
  – genetic algorithms?

Page 4: What a configuration looks like

• Threshold for accepting matches
  – a number between 0.0 and 1.0
• For each property
  – a comparator function (Exact, Levenshtein, numeric...)
  – a low probability (0.0-0.5)
  – a high probability (0.5-1.0)
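A configuration of this shape is easy to represent directly. Here is a hypothetical sketch of generating one random configuration; the dict layout and the comparator names are illustrative, not Duke's actual API:

```python
import random

# Illustrative comparator names; Duke's real comparator classes differ
COMPARATORS = ["Exact", "Levenshtein", "Numeric",
               "JaroWinklerTokenized", "DiceCoefficient"]

def random_configuration(property_names):
    """Build one random individual: a global acceptance threshold plus,
    per property, a comparator and a low/high probability pair."""
    return {
        "threshold": random.uniform(0.0, 1.0),
        "properties": {
            name: {
                "comparator": random.choice(COMPARATORS),
                "low": random.uniform(0.0, 0.5),   # probability when values differ
                "high": random.uniform(0.5, 1.0),  # probability when values match
            }
            for name in property_names
        },
    }

config = random_configuration(["NAME", "ZIP_CODE", "COUNTRY"])
```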

Page 5: The hill-climbing problem

Page 6: How it works

1. Generate a population of 100 random configurations
2. Evaluate the population
3. Throw away the 25 worst, duplicate the 25 best
4. Randomly modify the entire population
5. Go back to 2
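Step 2 leaves open how a configuration is scored. Given the test file of correct linkages, a plausible fitness function is the F-measure of the links a configuration finds; the function below is an assumed sketch, not Duke's actual code:

```python
def f_measure(found_links, correct_links):
    """Score the record pairs a configuration produced against the
    known-correct pairs: the harmonic mean of precision and recall."""
    found, correct = set(found_links), set(correct_links)
    if not found or not correct:
        return 0.0
    true_positives = len(found & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(found)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)
```

Using one harmonic-mean number matters here: a configuration that links everything to everything gets perfect recall but terrible precision, and the F-measure punishes that.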

Page 7: Actual code

for generation in range(POPULATIONS):
    print "===== GENERATION %s ================================" % generation

    for c in population:
        f = evaluate(c)
        if f > highest:
            best = c
            highest = f

    show_best(best, False)

    # make new generation
    population = sorted(population, key = lambda c: 1.0 - index[c])

    # ditch lower quartile
    population = population[ : -25]

    # double upper quartile
    population = population[ : 25] + population

    # mutate
    population = [c.make_new(population) for c in population]
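The loop above depends on Duke-specific helpers (evaluate, index, show_best) that aren't shown. For reference, a self-contained toy version of the same select/duplicate/mutate cycle, maximizing a made-up one-dimensional fitness instead of a matching configuration:

```python
import random

def evaluate(c):
    # Toy stand-in for the real fitness (F-measure on the test file):
    # fitness peaks at c == 0.7
    return 1.0 - abs(c - 0.7)

POPULATIONS = 30          # number of generations
population = [random.random() for _ in range(100)]

for generation in range(POPULATIONS):
    # sort best-first by fitness
    population.sort(key=evaluate, reverse=True)
    # duplicate the upper quartile, ditch the lower quartile (25 + 75 = 100)
    population = population[:25] + population[:-25]
    # mutate: nudge every individual a little, clamped to [0, 1]
    population = [min(1.0, max(0.0, c + random.gauss(0, 0.05)))
                  for c in population]

best = max(population, key=evaluate)   # ends up close to 0.7
```

The same skeleton applies to configurations: only evaluate() and the mutation step change.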

Page 8: Actual code #2

class GeneticConfiguration:
    def __init__(self):
        self._props = []
        self._threshold = 0.0

    # set/get threshold, add/get properties

    def make_new(self, population):
        # either we make a number of random modifications, or we mate.
        # draw a number; if 0 modifications, we mate.
        mods = random.randint(0, 3)
        if mods:
            return self._mutate(mods)
        else:
            return self._mate(random.choice(population))

    def _mutate(self, mods):
        c = self._copy()
        for ix in range(mods):
            aspect = random.choice(aspects)
            aspect.modify(c)
        return c

    def _mate(self, other):
        c = self._copy()
        for aspect in aspects:
            aspect.set(c, aspect.get(random.choice([self, other])))
        return c

    def _copy(self):
        c = GeneticConfiguration()
        c.set_threshold(self._threshold)
        for prop in self.get_properties():
            if prop.getName() == "ID":
                c.add_property(Property(prop.getName()))
            else:
                c.add_property(Property(prop.getName(),
                                        prop.getComparator(),
                                        prop.getLowProbability(),
                                        prop.getHighProbability()))
        return c
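The aspects list is not defined on the slide. Presumably each aspect wraps one evolvable part of a configuration behind uniform get/set/modify operations, which is what lets _mutate and _mate stay generic. A hypothetical sketch of one such aspect, with a minimal stand-in configuration class:

```python
import random

class Conf:
    """Minimal stand-in for GeneticConfiguration (threshold only)."""
    def __init__(self, threshold=0.0):
        self._threshold = threshold
    def get_threshold(self):
        return self._threshold
    def set_threshold(self, value):
        self._threshold = value

class ThresholdAspect:
    """One evolvable aspect: the global threshold. get/set support
    mating (copying a value from a parent); modify supports mutation."""
    def get(self, conf):
        return conf.get_threshold()
    def set(self, conf, value):
        conf.set_threshold(value)
    def modify(self, conf):
        # mutation: replace the threshold with a fresh random value
        conf.set_threshold(random.uniform(0.0, 1.0))

aspects = [ThresholdAspect()]
```

The real list would hold one aspect per tunable: the threshold, and each property's comparator, low, and high probability.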

Page 9: But ... does it work?!?

Page 10: Linking countries

• Linking countries from DBpedia and Mondial
  – no common identifiers
• Manually I manage 95.4% accuracy
  – genetic script manages 95.7% in first generation
  – then improves to 98.9%
  – this was too easy...

DBPEDIA
  Id       http://dbpedia.org/resource/Samoa
  Name     Samoa
  Capital  Apia
  Area     2831

MONDIAL
  Id       17019
  Name     Western Samoa
  Capital  Apia, Samoa
  Area     2860

Page 11: The actual configuration

PROPERTY  COMPARATOR  LOW   HIGH
NAME      Exact       0.19  0.91
CAPITAL   Exact       0.25  0.86
AREA      Numeric     0.36  0.72

Threshold: 0.6

Confusing.

Why exact name comparisons?

Why is area comparison given such weight?

Who knows. There’s nobody to ask.

Page 12: Semantic dogfood

• Data about papers presented at semantic web conferences
  – has duplicate speakers
  – about 7,000 records, many long string values
• Manually I get 88% accuracy
  – after two weeks, the script gets 82% accuracy
  – but it's only half-way

Name         Grigorios Antoniou
Homepage     http://www.ics.forth.gr/~antoniou
Mbox_Sha1    f44cd7769f416e96864ac43498b082155196829e
Affiliation

Name         Grigoris Antoniou
Homepage     http://www.ics.forth.gr/~antoniou
Mbox_Sha1    f44cd7769f416e96864ac43498b082155196829e
Affiliation  http://data.semanticweb.org/organization/forth-ics

Page 13: The configuration

PROPERTY     COMPARATOR            LOW   HIGH
NAME         JaroWinklerTokenized  0.2   0.9
AFFILIATION  DiceCoefficient       0.49  0.61
HOMEPAGE     Exact                 0.09  0.67
MBOX_HASH    PersonNameComparator  0.42  0.87

Threshold: 0.91

Some strange choices of comparator.

PersonNameComparator?!?

DiceCoefficient is essentially the same as Exact for those values.

Otherwise as expected.
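Why DiceCoefficient collapses to Exact here: it measures token-set overlap, 2|A∩B| / (|A| + |B|), and values like affiliation URLs are single tokens, so the score can only be 1.0 for identical values or 0.0 otherwise. A small sketch of a whitespace-tokenized Dice coefficient illustrating this:

```python
def dice(s1, s2):
    """Dice coefficient over whitespace tokens: 2|A∩B| / (|A| + |B|)."""
    a, b = set(s1.split()), set(s2.split())
    if not a or not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

# Single-token values (like affiliation URLs) make Dice behave like Exact:
url = "http://data.semanticweb.org/organization/forth-ics"
dice(url, url)                  # 1.0, same as Exact
dice(url, "http://other.org/")  # 0.0, same as Exact

# Multi-token strings are where Dice actually differs from Exact:
dice("forth ics crete", "forth ics")  # partial credit for shared tokens
```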

Page 14: Hafslund

• I took a subset of customer data from Hafslund
  – roughly 3000 records
  – then made a difficult manual test file, where different parts of organizations are treated as different
  – so NSB Logistikk != NSB Bane
  – then made another subset for testing
• Manually I can do no better than 64% on this data set
  – interestingly, on the test data set I score 84%
• With a cut-down data set, I could run the script overnight, and have a result in the morning

Page 15: The progress of evolution

• 1st generation
  – best scores: 0.47, 0.43, 0.3
• 2nd generation
  – mutated 0.47 configuration scores 0.136, 0.467, 0.002, and 0.49
  – best scores: 0.49, 0.467, 0.4, and 0.38
• 3rd generation
  – mutated 0.49 scores 0.001, 0.49, 0.46, and 0.25
  – best scores: 0.49, 0.46, 0.45, and 0.42
• 4th generation
  – we hit 0.525 (modified from 0.21)

Page 16: The progress of evolution #2

• 5th generation
  – we hit 0.568 (modified from 0.479)
• 6th generation
  – 0.602
• 7th generation
  – 0.702
• ...
• 60th generation
  – 0.765
  – I'd done no better than 0.64 manually

Page 17: Evaluation

CONFIGURATION  TRAINING  TEST
Genetic #1     0.766     0.881
Genetic #2     0.776     0.859
Manual #1      0.57      0.838
Manual #2      0.64      0.803

PROPERTY        COMPARATOR       LOW   HIGH
NAME            Levenshtein      0.17  0.95
ASSOCIATION_NO  Exact            0.06  0.69
ADDRESS1        Numeric          0.02  0.92
ADDRESS2        PersonName       0.18  0.76
ZIP_CODE        DiceCoefficient  0.47  0.79
COUNTRY         Levenshtein      0.12  0.64
Threshold: 0.98

PROPERTY        COMPARATOR       LOW   HIGH
NAME            Levenshtein      0.42  0.96
ASSOCIATION_NO  DiceCoefficient  0.0   0.67
ADDRESS1        Numeric          0.1   0.61
ADDRESS2        Levenshtein      0.03  0.8
ZIP_CODE        DiceCoefficient  0.35  0.69
COUNTRY         JaroWinklerT.    0.44  0.68
Threshold: 0.95

Page 18: Does it find the best configuration?

• We don’t know
• The experts say genetic algorithms tend to get stuck at local maxima
  – they also point out that well-known techniques for dealing with this are described in the literature
• Rerunning tends to produce similar configurations

Page 19: The literature

http://www.cleveralgorithms.com/
http://www.gp-field-guide.org.uk/

Page 20: Conclusion

• Easy to implement
  – you don't need a GP library
• Requires reliable test data
• It actually works
• Configurations may not be very tweakable
  – because they don't necessarily make any sense
• This is a big field, with lots to learn

http://www.garshol.priv.no/blog/225.html