Regular Expressions Performance: Optimizing event capture, building better Ossim Agent plugins

Upload: a3sec

Post on 19-May-2015



About A3Sec

● AlienVault's spin-off

● Professional Services, SIEM deployments

● AlienVault's Authorized Training Center (ATC) for Spain and LATAM

● Team of more than 25 Security Experts

● Own developments and tool integrations

● Advanced Health Check Monitoring

● Web: www.a3sec.com, Twitter: @a3sec


About Me

● David Gil <[email protected]>

● Developer, Sysadmin, Project Manager

● Really believes in Open Source model

● Programming since he was 9 years old

● Ossim developer at its early stage

● Agent core engine (full regex) and first plugins

● Python lover :-)

● Debian package maintainer (a long, long time ago)

● Sci-Fi books reader and mountain bike rider

Summary

1. What is a regexp?

2. When to use regexp?

3. Regex basics

4. Performance Tests

5. Writing regexp (Performance Strategies)

6. Writing plugins (Performance Strategies)

7. Tools


Regular Expressions: What is a regex?

Regular expression:

(bb|[^b]{2})\d\d

Input strings: bb445, 2ac3357bb, bb3aa2c7, a2ab64b, abb83fh6l3hi22ui
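
A minimal sketch of how the pattern above behaves on the listed input strings, using Python's re.search. The names PATTERN, INPUTS and results are illustrative and do not come from the slides.

```python
import re

# Pattern from the slide: "bb" or two non-"b" characters, followed by two digits.
PATTERN = re.compile(r'(bb|[^b]{2})\d\d')

# Input strings from the slide.
INPUTS = ['bb445', '2ac3357bb', 'bb3aa2c7', 'a2ab64b', 'abb83fh6l3hi22ui']

# Map each input to the first matching substring, or None when nothing matches.
results = {}
for text in INPUTS:
    match = PATTERN.search(text)
    results[text] = match.group(0) if match else None

for text in INPUTS:
    print('%-18s -> %s' % (text, results[text]))
```

Three of the five strings contain a match (bb44, ac33 and bb83); bb3aa2c7 and a2ab64b do not.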


Regular Expressions: To RE or not to RE

● Regular expressions are almost never the right answer
  ○ Difficult to debug and maintain
  ○ Performance reasons, slower for simple matching
  ○ Learning curve

● Python string functions are small C loops: super fast!
  ○ startswith(), endswith(), split(), etc.

● Use standard parsing libraries!
  Formats: JSON, HTML, XML, CSV, etc.
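
As a rough illustration of the two alternatives above, the sketch below handles a simple line with string methods only and a JSON payload with the standard json module. The log line and the JSON document are hypothetical examples, not taken from the slides.

```python
import json

# Hypothetical log line, used only for illustration.
line = 'sshd: action=DENY src=10.0.0.1'

# Prefix and suffix checks need no regex.
is_sshd = line.startswith('sshd:')
ends_with_ip = line.endswith('10.0.0.1')

# split() turns the "key=value" tokens into a dict (the first token is the program name).
fields = dict(token.split('=', 1) for token in line.split()[1:])

# Structured formats: let the standard library parser do the work.
event = json.loads('{"action": "DENY", "port": 22}')

print(is_sshd, ends_with_ip, fields, event['port'])
```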

Regular Expressions: To RE or not to RE

Example: URL parsing

● regex:

^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$

● PHP parse_url() function:

$url = "http://username:password@hostname/path?arg=value#anchor";
print_r(parse_url($url));
(
    [scheme] => http
    [host] => hostname
    [user] => username
    [pass] => password
    [path] => /path
    [query] => arg=value
    [fragment] => anchor
)
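
The slide uses PHP; since the rest of the deck is in Python, here is a comparable sketch with the standard library's urllib.parse.urlsplit (Python 3; in Python 2 the same function lives in the urlparse module). The dictionary keys simply mirror the PHP output and are chosen for illustration.

```python
from urllib.parse import urlsplit

url = 'http://username:password@hostname/path?arg=value#anchor'
parts = urlsplit(url)

# Collect the same components that parse_url() prints on the slide.
parsed = {
    'scheme': parts.scheme,
    'host': parts.hostname,
    'user': parts.username,
    'pass': parts.password,
    'path': parts.path,
    'query': parts.query,
    'fragment': parts.fragment,
}

for key, value in parsed.items():
    print('%-8s => %s' % (key, value))
```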

Regular Expressions: To RE or not to RE

But there are a lot of reasons to use regex:
● powerful
● portable
● fast (with performance in mind)
● useful for complex patterns
● save development time
● short code
● fun :-)
● beautiful?


Regular Expressions: Basics - Characters

● \d, \D: digits / non-digits. \w, \W: word / non-word characters. \s, \S: whitespace / non-whitespace
>>> re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')
>>> re.findall(r'(\S+)\s+(\S+)', 'foo bar')

● ^, $: Begin/End of string
>>> re.findall(r'(\d+)', 'cba3456csw')
>>> re.findall(r'^(\d+)$', 'cba3456csw')

● . (dot): Any character
>>> re.findall(r'foo(.)bar', 'foo=bar')
>>> re.findall(r'(...)=(...)', 'foo=bar')
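
The slide lists these calls without their results. The sketch below runs the same calls and prints what re.findall returns; the variable names are illustrative only.

```python
import re

# \d matches one digit; the group captures only the month.
month = re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')

# \S+ is a run of non-whitespace, \s+ a run of whitespace.
pair = re.findall(r'(\S+)\s+(\S+)', 'foo bar')

# Unanchored: the digits may appear anywhere in the string.
digits_anywhere = re.findall(r'(\d+)', 'cba3456csw')

# Anchored with ^ and $: the whole string must be digits, so nothing matches.
digits_only = re.findall(r'^(\d+)$', 'cba3456csw')

# The dot matches any single character (except a newline by default).
separator = re.findall(r'foo(.)bar', 'foo=bar')
three_each = re.findall(r'(...)=(...)', 'foo=bar')

for name in ('month', 'pair', 'digits_anywhere', 'digits_only',
             'separator', 'three_each'):
    print(name, '=', globals()[name])
```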

Regular Expressions: Basics - Repetitions

● *, +: 0 or more / 1 or more repetitions
>>> re.findall(r'FO+', 'FOOOOOOOOO')
>>> re.findall(r'BA*R', 'BR')

● ?: 0 or 1 repetitions
>>> re.findall(r'colou?r', 'color')
>>> re.findall(r'colou?r', 'colour')

● {n}, {n,m}: exactly n / between n and m repetitions
>>> re.findall(r'\d{2}', '2013-07-21')
>>> re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')
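
For reference, a small sketch that runs the repetition examples above and keeps their results; the variable names are illustrative.

```python
import re

# + needs at least one "O"; * accepts zero "A"s.
one_or_more = re.findall(r'FO+', 'FOOOOOOOOO')
zero_or_more = re.findall(r'BA*R', 'BR')

# ? makes the "u" optional, so both spellings match.
american = re.findall(r'colou?r', 'color')
british = re.findall(r'colou?r', 'colour')

# {2} takes digits two at a time; {1,3} allows one to three digits per octet.
digit_pairs = re.findall(r'\d{2}', '2013-07-21')
ip_address = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')

print(one_or_more, zero_or_more, american, british, digit_pairs, ip_address)
```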

Regular Expressions: Basics - Groups

● [...]: Set of characters
>>> re.findall(r'[a-z]+=[a-z]+', 'foo=bar')

● ...|...: Alternation
>>> re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')

● (...) and \1, \2, ...: Group and backreferences
>>> re.findall(r'(\w+)=(\1)', 'foo=bar')
>>> re.findall(r'(\w+)=(\1)', 'foo=foo')

● (?P<name>...): Named group
>>> re.findall(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
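
A short sketch of what the group examples return. The named-group call is written with the closing parenthesis inside the pattern string, and re.search is used for it so the group can be read by name; the variable names are illustrative.

```python
import re

# Character set: one or more lowercase letters on each side of "=".
charset = re.findall(r'[a-z]+=[a-z]+', 'foo=bar')

# Alternation inside capturing groups.
alternation = re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')

# \1 must repeat exactly what group 1 captured.
backref_miss = re.findall(r'(\w+)=(\1)', 'foo=bar')
backref_hit = re.findall(r'(\w+)=(\1)', 'foo=foo')

# A named group can be read back by name from the match object.
day = re.search(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23').group('day')

print(charset, alternation, backref_miss, backref_hit, day)
```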


Regular Expressions: Greedy & Lazy quantifiers: *?, +?

● Greedy vs non-greedy (lazy)
>>> re.findall('A+', 'AAAA')
['AAAA']
>>> re.findall('A+?', 'AAAA')
['A', 'A', 'A', 'A']

● An overall match takes precedence over an overall non-match
>>> re.findall('<.*>.*</.*>', '<B>i am bold</B>')
>>> re.findall('<(.*)>.*</(.*)>', '<B>i am bold</B>')

● Minimal matching, non-greedy
>>> re.findall('<(.*)>.*', '<B>i am bold</B>')
>>> re.findall('<(.*?)>.*', '<B>i am bold</B>')
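
The last four calls are shown without output on the slide. The sketch below runs them so the difference between greedy and lazy matching is visible; the variable names are illustrative.

```python
import re

html = '<B>i am bold</B>'

# Greedy .* first grabs as much as it can, then gives characters back
# until the rest of the pattern can match.
whole = re.findall(r'<.*>.*</.*>', html)
tags = re.findall(r'<(.*)>.*</(.*)>', html)

# With nothing after it to force backtracking, greedy (.*) runs to the last ">".
greedy_tag = re.findall(r'<(.*)>.*', html)

# Lazy (.*?) stops at the first ">" that lets the pattern succeed.
lazy_tag = re.findall(r'<(.*?)>.*', html)

print(whole)       # the complete string
print(tags)        # opening and closing tag names
print(greedy_tag)  # everything up to the last ">"
print(lazy_tag)    # only the first tag name
```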


Regular Expressions: Performance Tests

Different implementations of a custom is_a_word() function:

● #1 Regexp

● #2 Char iteration

● #3 String functions


Regular Expressions: Performance Test #1

import re
import string

def is_a_word(word):
    # ascii_uppercase/ascii_lowercase work in Python 2 and 3
    # (string.uppercase/lowercase exist only in Python 2).
    CHARS = string.ascii_uppercase + string.ascii_lowercase
    regexp = r'^[%s]+$' % CHARS
    return "YES" if re.search(regexp, word) else "NOP"

# s is presumably the setup code that defines is_a_word; w is the test word.
timeit.timeit('is_a_word(%r)' % w, setup=s)

seconds         result  length   input
1.49650502205   YES     len=4    word
1.65614509583   YES     len=25   wordlongerthanpreviousone..
1.92520785332   YES     len=60   wordlongerthanpreviosoneplusan..
2.38850092888   YES     len=120  wordlongerthanpreviosoneplusan..
1.55924701691   NOP     len=10   not a word
1.7087020874    NOP     len=25   not a word, just a phrase..
1.92521882057   NOP     len=50   not a word, just a phrase bigg..
2.39075493813   NOP     len=102  not a word, just a phrase bigg..

The longer the target string, the slower the regex matching, whether it succeeds or fails.


Regular Expressions: Performance Test #2

import string

CHARS = string.ascii_uppercase + string.ascii_lowercase

def is_a_word(word):
    for char in word:
        # Stop at the first character that is not a letter.
        if char not in CHARS:
            return "NOP"
    return "YES"

# s is presumably the setup code that defines is_a_word; w is the test word.
timeit.timeit('is_a_word(%r)' % w, setup=s)

seconds          result  length   input
0.687522172928   YES     len=4    word
1.0725839138     YES     len=25   wordlongerthanpreviousone..
2.34717106819    YES     len=60   wordlongerthanpreviosoneplusan..
4.31543898582    YES     len=120  wordlongerthanpreviosoneplusan..
0.54797577858    NOP     len=10   not a word
0.547253847122   NOP     len=25   not a word, just a phrase..
0.546499967575   NOP     len=50   not a word, just a phrase bigg..
0.553755998611   NOP     len=102  not a word, just a phrase bigg..

On success: two nested Python loops (slow). On failure: it always stops at the same point, and so at the same time (the first space).


Regular Expressions: Performance Test #3

def is_a_word(word):
    # str.isalpha() runs as a single C loop over the string.
    return "YES" if word.isalpha() else "NOP"

# s is presumably the setup code that defines is_a_word; w is the test word.
timeit.timeit('is_a_word(%r)' % w, setup=s)

seconds          result  length   input
0.146447896957   YES     len=4    word
0.212563037872   YES     len=25   wordlongerthanpreviousone..
0.318686008453   YES     len=60   wordlongerthanpreviosoneplusan..
0.493942975998   YES     len=120  wordlongerthanpreviosoneplusan..
0.14647102356    NOP     len=10   not a word
0.146160840988   NOP     len=25   not a word, just a phrase..
0.147103071213   NOP     len=50   not a word, just a phrase bigg..
0.146239995956   NOP     len=102  not a word, just a phrase bigg..

Python string functions (fast and small C loops)
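
To repeat the comparison on your own machine, here is a self-contained sketch of a benchmark harness for the three variants. It is not the author's original script: the function names, the sample strings and the default repeat count are assumptions, and the regex is precompiled here for brevity. Note also that in Python 3 str.isalpha() accepts non-ASCII letters, unlike the explicit ASCII character class. Absolute timings will differ from the slides.

```python
import re
import string
import timeit

CHARS = string.ascii_uppercase + string.ascii_lowercase
WORD_RE = re.compile(r'^[%s]+$' % CHARS)


def is_a_word_regex(word):
    """#1 Regexp."""
    return "YES" if WORD_RE.search(word) else "NOP"


def is_a_word_loop(word):
    """#2 Char iteration."""
    for char in word:
        if char not in CHARS:
            return "NOP"
    return "YES"


def is_a_word_str(word):
    """#3 String functions."""
    return "YES" if word.isalpha() else "NOP"


IMPLEMENTATIONS = [is_a_word_regex, is_a_word_loop, is_a_word_str]

# Hypothetical samples in the spirit of the slides.
SAMPLES = ['word', 'wordlongerthanpreviousone',
           'not a word', 'not a word, just a phrase']


def benchmark(number=100000):
    """Time every implementation on every sample; return result rows."""
    rows = []
    for func in IMPLEMENTATIONS:
        for sample in SAMPLES:
            seconds = timeit.timeit(lambda: func(sample), number=number)
            rows.append((func.__name__, func(sample), len(sample), seconds))
    return rows


if __name__ == '__main__':
    for name, answer, length, seconds in benchmark():
        print('%-16s %s len=%-3d %.4f s' % (name, answer, length, seconds))
```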


Regular Expressions: Performance Strategies

Writing regex

● Be careful with repetitions (+, *, {n,m})
(abc|def){2,4} produces (abc|def)(abc|def)((abc|def)(abc|def)?)?
(abc|def){2,1000} produces ...

● Be careful with wildcards
re.findall(r'(ab).*(cd).*(ef)', 'ab cd ef') # slower
re.findall(r'(ab)\s(cd)\s(ef)', 'ab cd ef') # faster

● Longer target string -> slower regex matching
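
One way to see the wildcard cost is to time both patterns on a line that almost matches. The sketch below is an illustration under assumptions: the near-miss input (many "cd" tokens and no "ef") is invented to provoke backtracking, and the helper name time_search is hypothetical. Both patterns return the same result on the slide's input, and the timings depend on the machine.

```python
import re
import timeit

WILDCARD = re.compile(r'(ab).*(cd).*(ef)')
SPECIFIC = re.compile(r'(ab)\s(cd)\s(ef)')

# On the input from the slide both patterns find the same groups.
same_result = WILDCARD.findall('ab cd ef') == SPECIFIC.findall('ab cd ef')

# Hypothetical near-miss line: starts like a match but never contains "ef".
# Every "cd" is a candidate for the first .*, and for each one the second .*
# scans the rest of the line, so the wildcard version backtracks heavily.
near_miss = 'ab ' + 'cd ' * 300


def time_search(pattern, text, number=50):
    """Return the seconds spent running pattern.search(text) `number` times."""
    return timeit.timeit(lambda: pattern.search(text), number=number)


wildcard_seconds = time_search(WILDCARD, near_miss)
specific_seconds = time_search(SPECIFIC, near_miss)

print('same result on slide input:', same_result)
print('wildcard: %.4f s' % wildcard_seconds)
print('specific: %.4f s' % specific_seconds)
```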


Regular Expressions: Performance Strategies

Writing regex

● Use a non-capturing group when there is no need to capture and save text to a variable
(?:abc|def|ghi) instead of (abc|def|ghi)

● Pattern most likely to match first
(TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
TRAFFIC_(ALLOW|DROP|DENY)

● Use anchors (^ and $) to limit the scope
re.findall(r'(ab){2}', 'abcabcabc')
re.findall(r'^(ab){2}', 'abcabcabc') # failures occur faster
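
The three tips can be checked for equivalence with a few lines of Python. This sketch only shows that the rewritten patterns accept the same input; it does not measure speed. The test strings and variable names are illustrative.

```python
import re

# Non-capturing groups match the same text but store no group.
capturing = re.search(r'(abc|def|ghi)', 'xx def xx')
non_capturing = re.search(r'(?:abc|def|ghi)', 'xx def xx')
capturing_groups = capturing.groups()
non_capturing_groups = non_capturing.groups()

# Factoring out the common prefix accepts exactly the same actions.
long_form = re.compile(r'(?:TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)')
factored = re.compile(r'TRAFFIC_(?:ALLOW|DROP|DENY)')
actions = ['TRAFFIC_ALLOW', 'TRAFFIC_DROP', 'TRAFFIC_DENY', 'TRAFFIC_OTHER']
long_hits = [bool(long_form.fullmatch(action)) for action in actions]
factored_hits = [bool(factored.fullmatch(action)) for action in actions]

# Neither pattern matches, but the anchored one gives up after position 0
# instead of retrying at every offset of the string.
unanchored = re.findall(r'(ab){2}', 'abcabcabc')
anchored = re.findall(r'^(ab){2}', 'abcabcabc')

print(capturing_groups, non_capturing_groups)
print(long_hits, factored_hits)
print(unanchored, anchored)
```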


Regular Expressions: Performance Strategies

Writing Agent plugins

● A new process is forked for each loaded plugin
  ○ Use only the plugins that you really need!

● A plugin is a set of rules (regexp operations) for matching log lines
  ○ If a log entry doesn't match a plugin, it is tested (and fails) against ALL of the plugin's rules!
  ○ Reduce the number of rules, use a [translation] table
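
The idea behind the translation table can be sketched in plain Python: one rule captures the variable part and a lookup table maps it to an identifier, instead of one rule per value. This is not Ossim plugin configuration syntax; the pattern, the field names and the numeric ids are hypothetical.

```python
import re

# One rule that captures the action, instead of one rule per action.
RULE = re.compile(r'TRAFFIC_(?P<action>ALLOW|DROP|DENY)\s+src=(?P<src>\S+)')

# Hypothetical translation table: captured text -> event id.
TRANSLATION = {'ALLOW': 1, 'DROP': 2, 'DENY': 3}


def parse(line):
    """Return the parsed fields of a log line, or None if the rule fails."""
    match = RULE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # A dict lookup replaces what would otherwise be extra regex rules.
    fields['sid'] = TRANSLATION[fields.pop('action')]
    return fields


print(parse('TRAFFIC_DROP src=10.0.0.1'))
print(parse('unrelated log line'))
```

A line that does not match now costs one failed regex instead of three.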


Regular Expressions: Performance Strategies

Writing Agent plugins

● Alphabetical order for rule matching
  ○ Order your rules by priority, pattern most likely to match first

● Divide and conquer
  ○ A plugin is configured to read from a source file, use dedicated source files per technology
  ○ Also, use dedicated plugins for each technology
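
A simplified model of alphabetical rule matching, to show why naming matters. It assumes, as the slide states, that rules are tried in alphabetical order and that the first match wins; the rule names, the numeric prefixes and the patterns are hypothetical and do not reproduce the agent's real code.

```python
import re

# Hypothetical rules. The numeric prefixes make alphabetical order
# coincide with the intended priority (most frequent event first).
RULES = {
    '02_traffic_drop': re.compile(r'TRAFFIC_DROP'),
    '01_traffic_allow': re.compile(r'TRAFFIC_ALLOW'),  # assumed most frequent
    '99_fallback': re.compile(r'TRAFFIC_\w+'),
}


def first_matching_rule(line):
    """Try rules in alphabetical order; return (rule_name, attempts)."""
    for attempts, name in enumerate(sorted(RULES), start=1):
        if RULES[name].search(line):
            return name, attempts
    return None, len(RULES)


print(first_matching_rule('fw: TRAFFIC_ALLOW src=10.0.0.1'))
print(first_matching_rule('fw: TRAFFIC_DENY src=10.0.0.1'))
print(first_matching_rule('unrelated line'))
```

The most frequent event is resolved after a single regex attempt, while an unrelated line pays for every rule.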

Regular Expressions: Performance Strategies

Tool1 (20 logs/sec) -> /var/log/syslog
Tool2 (20 logs/sec) -> /var/log/syslog
Tool3 (20 logs/sec) -> /var/log/syslog
Tool4 (20 logs/sec) -> /var/log/syslog
Tool5 (20 logs/sec) -> /var/log/syslog
Total in /var/log/syslog: 100 logs/sec

5 plugins with 1 rule each, all reading /var/log/syslog
5 x 100 = 500 total regex/sec

Regular Expressions: Performance Strategies

Tool1 (20 logs/sec) -> /var/log/tool1
Tool2 (20 logs/sec) -> /var/log/tool2
Tool3 (20 logs/sec) -> /var/log/tool3
Tool4 (20 logs/sec) -> /var/log/tool4
Tool5 (20 logs/sec) -> /var/log/tool5
Total: 100 logs/sec

5 plugins with 1 rule each, reading /var/log/tool{1-5}
5 x 20 = 100 total regex/sec (5x faster)
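
The arithmetic behind the two layouts can be written as a tiny helper. It models the worst case, in which every plugin applies every one of its rules to each line of the file it reads; the function and parameter names are illustrative.

```python
def regex_evaluations_per_sec(plugins, rules_per_plugin, logs_per_sec_read):
    """Worst case: each plugin runs each rule on every line it reads."""
    return plugins * rules_per_plugin * logs_per_sec_read


# All five plugins read the shared syslog file (100 logs/sec).
shared_file = regex_evaluations_per_sec(5, 1, 100)

# Each plugin reads only its own file (20 logs/sec).
dedicated_files = regex_evaluations_per_sec(5, 1, 20)

speedup = shared_file / dedicated_files

print(shared_file, dedicated_files, speedup)
```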


Regular Expressions: Tools for testing Regex

Python:

>>> import re

>>> re.findall('(\S+) (\S+)', 'foo bar')

[('foo', 'bar')]

>>> result = re.search(

... '(?P<key>\w+)\s*=\s*(?P<value>\w+)',

... 'foo=bar'

... )

>>> result.groupdict()

{ 'key': 'foo', 'value': 'bar' }

Regular Expressions: Tools for testing Regex

Regex debuggers:
● Kiki
● Kodos

Online regex testers:
● http://gskinner.com/RegExr/ (java)
● http://regexpal.com/ (javascript)
● http://rubular.com/ (ruby)
● http://www.pythonregex.com/ (python)

Online regex visualization:
● http://www.regexper.com/ (javascript)


any (?:question|doubt|comment)+\?

A3Sec

web: www.a3sec.com

email: [email protected]

twitter: @a3sec

Spain Head Office
C/ Aravaca, 6, Piso 2
28040 Madrid
Tlf. +34 533 09 78

México Head Office
Avda. Paseo de la Reforma, 389 Piso 10

México DF

Tlf. +52 55 5980 3547