Regular Expressions Performance
Optimizing event capture: building better OSSIM Agent plugins
About A3Sec
● AlienVault's spin-off
● Professional Services, SIEM deployments
● AlienVault's Authorized Training Center (ATC) for Spain and LATAM
● Team of more than 25 Security Experts
● Own developments and tool integrations
● Advanced Health Check Monitoring
● Web: www.a3sec.com, Twitter: @a3sec
About Me
● David Gil <[email protected]>
● Developer, Sysadmin, Project Manager
● Really believes in the Open Source model
● Programming since he was 9 years old
● OSSIM developer in its early stages
● Agent core engine (full regex) and first plugins
● Python lover :-)
● Debian package maintainer (a long, long time ago)
● Sci-Fi books reader and mountain bike rider
Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools
Regular Expressions: What is a regex?
Regular expression:
(bb|[^b]{2})\d\d
Input strings: bb445, 2ac3357bb, bb3aa2c7, a2ab64b, abb83fh6l3hi22ui
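One way to check which of the input strings above contain a match is to run the pattern through Python's re.search. This is a minimal sketch; the variable names are illustrative and not part of the original slides.

```python
import re

# Pattern from the slide: "bb" or two non-"b" characters, followed by two digits.
pattern = re.compile(r'(bb|[^b]{2})\d\d')

inputs = ['bb445', '2ac3357bb', 'bb3aa2c7', 'a2ab64b', 'abb83fh6l3hi22ui']

# Map each input to its first matching substring, or None if nothing matches.
first_match = {}
for text in inputs:
    m = pattern.search(text)
    first_match[text] = m.group(0) if m else None

for text, found in first_match.items():
    print(text, '->', found)
```

Only three of the five strings match: bb3aa2c7 and a2ab64b never offer "bb" or two non-"b" characters immediately followed by two digits.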
Regular Expressions: To RE or not to RE
● Regular expressions are almost never the right answer
  ○ Difficult to debug and maintain
  ○ Performance reasons: slower for simple matching
  ○ Learning curve
● Python string functions are small C loops: super fast!
  ○ startswith(), endswith(), split(), etc.
● Use standard parsing libraries!
  ○ Formats: JSON, HTML, XML, CSV, etc.
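As a rough illustration of the last two points, the sketch below handles a simple key=value log line with string methods only and leaves a structured payload to the standard json module. The log line and the JSON document are made-up examples, not taken from a real device.

```python
import json

# Hypothetical log line, used only for illustration.
line = 'TRAFFIC_DROP src=10.0.0.1 dst=10.0.0.2'

# Simple prefix test: no regex needed.
is_traffic = line.startswith('TRAFFIC_')

# Split on whitespace, then split each key=value pair once.
action, *pairs = line.split()
fields = dict(pair.split('=', 1) for pair in pairs)

# Structured formats are better handled by their standard parser.
event = json.loads('{"action": "DROP", "port": 22}')

print(is_traffic, action, fields, event['port'])
```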
Regular Expressions: To RE or not to RE
Example: URL parsing
● regex:
^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
● parse_url() PHP function:
$url = "http://username:password@hostname/path?arg=value#anchor";
print_r(parse_url($url));
(
    [scheme] => http
    [host] => hostname
    [user] => username
    [pass] => password
    [path] => /path
    [query] => arg=value
    [fragment] => anchor
)
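The same idea carries over to Python, the language used in the rest of these slides. The sketch below uses urlsplit from the standard library (Python 3's urllib.parse) on the URL from the PHP example; it is offered as an equivalent illustration, not as something shown in the original talk.

```python
from urllib.parse import urlsplit

url = "http://username:password@hostname/path?arg=value#anchor"
parts = urlsplit(url)

# Each component is available as an attribute, with no hand-written regex.
print(parts.scheme)    # scheme
print(parts.hostname)  # host
print(parts.username)  # user
print(parts.password)  # pass
print(parts.path)      # path
print(parts.query)     # query
print(parts.fragment)  # fragment
```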
Regular Expressions: To RE or not to RE
But there are a lot of reasons to use regex:
● powerful
● portable
● fast (with performance in mind)
● useful for complex patterns
● save development time
● short code
● fun :-)
● beautiful?
Regular Expressions: Basics - Characters
● \d, \D: digit / non-digit. \w, \W: word character / non-word character. \s, \S: whitespace / non-whitespace
>>> re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')
>>> re.findall(r'(\S+)\s+(\S+)', 'foo bar')
● ^, $: Begin/End of string
>>> re.findall(r'(\d+)', 'cba3456csw')
>>> re.findall(r'^(\d+)$', 'cba3456csw')
● . (dot): Any character
>>> re.findall(r'foo(.)bar', 'foo=bar')
>>> re.findall(r'(...)=(...)', 'foo=bar')
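The character-class calls above can be run as a small script. This is a sketch in Python 3 with raw-string patterns; the variable names are only there to label each result.

```python
import re

# Character classes: capture the month, then two whitespace-separated tokens.
month = re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')
tokens = re.findall(r'(\S+)\s+(\S+)', 'foo bar')

# Anchors: the unanchored pattern finds the digits, the anchored one
# requires the whole string to be digits and so finds nothing.
digits_anywhere = re.findall(r'(\d+)', 'cba3456csw')
digits_only = re.findall(r'^(\d+)$', 'cba3456csw')

# Dot: matches any single character (except a newline by default).
separator = re.findall(r'foo(.)bar', 'foo=bar')
sides = re.findall(r'(...)=(...)', 'foo=bar')

for name, value in [('month', month), ('tokens', tokens),
                    ('digits_anywhere', digits_anywhere),
                    ('digits_only', digits_only),
                    ('separator', separator), ('sides', sides)]:
    print(name, value)
```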
Regular Expressions: Basics - Repetitions
● *, +: 0 or more, 1 or more repetitions
>>> re.findall(r'FO+', 'FOOOOOOOOO')
>>> re.findall(r'BA*R', 'BR')
● ?: 0 or 1 repetitions
>>> re.findall(r'colou?r', 'color')
>>> re.findall(r'colou?r', 'colour')
● {n}, {n,m}: exactly n, or between n and m repetitions
>>> re.findall(r'\d{2}', '2013-07-21')
>>> re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')
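The repetition operators behave as follows when the calls above are executed. A minimal Python 3 sketch; the variable names are illustrative.

```python
import re

one_or_more = re.findall(r'FO+', 'FOOOOOOOOO')   # + needs at least one O
zero_or_more = re.findall(r'BA*R', 'BR')         # * accepts zero A's

american = re.findall(r'colou?r', 'color')       # ? makes the u optional
british = re.findall(r'colou?r', 'colour')

pairs = re.findall(r'\d{2}', '2013-07-21')       # exactly two digits each
ip = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '192.168.1.25')

print(one_or_more, zero_or_more, american, british, pairs, ip)
```

Note that the last pattern only checks the shape of a dotted quad; it would also accept out-of-range values such as 999.999.999.999.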
Regular Expressions: Basics - Groups
● [...]: Set of characters
>>> re.findall(r'[a-z]+=[a-z]+', 'foo=bar')
● ...|...: Alternation
>>> re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')
● (...) and \1, \2, ...: Group and backreference
>>> re.findall(r'(\w+)=(\1)', 'foo=bar')
>>> re.findall(r'(\w+)=(\1)', 'foo=foo')
● (?P<name>...): Named group
>>> re.findall(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
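A runnable version of the group examples, as a sketch in Python 3. The last two lines use re.search to show how a named group is read back by name, which findall does not expose; that addition is illustrative and not from the slides.

```python
import re

char_set = re.findall(r'[a-z]+=[a-z]+', 'foo=bar')
alternation = re.findall(r'(foo|bar)=(foo|bar)', 'foo=bar')

# A backreference \1 must repeat exactly what group 1 captured.
different_sides = re.findall(r'(\w+)=(\1)', 'foo=bar')
same_sides = re.findall(r'(\w+)=(\1)', 'foo=foo')

# Named group: findall returns the captured text, search gives access by name.
day_list = re.findall(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
day = re.search(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23').group('day')

print(char_set, alternation, different_sides, same_sides, day_list, day)
```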
Regular Expressions: Greedy & Lazy quantifiers: *?, +?
● Greedy vs non-greedy (lazy)
>>> re.findall(r'A+', 'AAAA')
['AAAA']
>>> re.findall(r'A+?', 'AAAA')
['A', 'A', 'A', 'A']
● An overall match takes precedence over an overall non-match
>>> re.findall(r'<.*>.*</.*>', '<B>i am bold</B>')
>>> re.findall(r'<(.*)>.*</(.*)>', '<B>i am bold</B>')
● Minimal matching, non-greedy
>>> re.findall(r'<(.*)>.*', '<B>i am bold</B>')
>>> re.findall(r'<(.*?)>.*', '<B>i am bold</B>')
Regular Expressions: Performance Tests
Different implementations of a custom is_a_word() function:
● #1 Regexp
● #2 Char iteration
● #3 String functions
Regular Expressions: Performance Test #1
import re
import string

def is_a_word(word):
    # Build a character class with every ASCII letter, anchored at both ends.
    CHARS = string.ascii_uppercase + string.ascii_lowercase
    regexp = r'^[%s]+$' % CHARS
    return "YES" if re.search(regexp, word) else "NOP"

timeit.timeit(s, 'is_a_word(%s)' %(w))

1.49650502205   YES  len=4    word
1.65614509583   YES  len=25   wordlongerthanpreviousone..
1.92520785332   YES  len=60   wordlongerthanpreviosoneplusan..
2.38850092888   YES  len=120  wordlongerthanpreviosoneplusan..
1.55924701691   NOP  len=10   not a word
1.7087020874    NOP  len=25   not a word, just a phrase..
1.92521882057   NOP  len=50   not a word, just a phrase bigg..
2.39075493813   NOP  len=102  not a word, just a phrase bigg..

The longer the target string, the slower the regex matching, whether the match succeeds or fails.
Regular Expressions: Performance Test #2
import string

CHARS = string.ascii_uppercase + string.ascii_lowercase

def is_a_word(word):
    # Check every character; stop at the first one that is not a letter.
    for char in word:
        if char not in CHARS:
            return "NOP"
    return "YES"

timeit.timeit(s, 'is_a_word(%s)' %(w))

0.687522172928  YES  len=4    word
1.0725839138    YES  len=25   wordlongerthanpreviousone..
2.34717106819   YES  len=60   wordlongerthanpreviosoneplusan..
4.31543898582   YES  len=120  wordlongerthanpreviosoneplusan..
0.54797577858   NOP  len=10   not a word
0.547253847122  NOP  len=25   not a word, just a phrase..
0.546499967575  NOP  len=50   not a word, just a phrase bigg..
0.553755998611  NOP  len=102  not a word, just a phrase bigg..

On success, two nested loops run over the whole string (slow). Failures all stop at the same point, the first space, so they take the same time regardless of length.
Regular Expressions: Performance Test #3
def is_a_word(word):
    return "YES" if word.isalpha() else "NOP"

timeit.timeit(s, 'is_a_word(%s)' %(w))

0.146447896957  YES  len=4    word
0.212563037872  YES  len=25   wordlongerthanpreviousone..
0.318686008453  YES  len=60   wordlongerthanpreviosoneplusan..
0.493942975998  YES  len=120  wordlongerthanpreviosoneplusan..
0.14647102356   NOP  len=10   not a word
0.146160840988  NOP  len=25   not a word, just a phrase..
0.147103071213  NOP  len=50   not a word, just a phrase bigg..
0.146239995956  NOP  len=102  not a word, just a phrase bigg..

Python string functions are small, fast C loops: the quickest of the three in every case.
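To repeat the comparison on your own machine, the three variants can be timed side by side. This is a sketch for Python 3, not the harness used for the numbers above: the function names, the sample strings and the benchmark() helper are assumptions, and the regex is compiled once here rather than rebuilt on every call as in test #1. Note also that in Python 3 str.isalpha() accepts non-ASCII letters, so the three variants only agree on ASCII input.

```python
import re
import string
import timeit

CHARS = string.ascii_uppercase + string.ascii_lowercase
WORD_RE = re.compile(r'^[%s]+$' % CHARS)

def is_a_word_regex(word):
    return "YES" if WORD_RE.search(word) else "NOP"

def is_a_word_loop(word):
    for char in word:
        if char not in CHARS:
            return "NOP"
    return "YES"

def is_a_word_str(word):
    return "YES" if word.isalpha() else "NOP"

# Illustrative inputs: two words and two phrases of different lengths.
SAMPLES = ['word', 'wordlongerthanpreviousone',
           'not a word', 'not a word, just a phrase']

def benchmark(number=10000):
    """Return {function name: {sample: seconds for `number` calls}}."""
    timings = {}
    for func in (is_a_word_regex, is_a_word_loop, is_a_word_str):
        timings[func.__name__] = {
            sample: timeit.timeit(lambda: func(sample), number=number)
            for sample in SAMPLES
        }
    return timings

if __name__ == '__main__':
    for name, per_sample in benchmark().items():
        for sample, seconds in per_sample.items():
            print('%-16s %.4f  %s' % (name, seconds, sample))
```

Absolute timings depend on the interpreter version and hardware, so compare the three columns with each other rather than with the figures on the slides.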
Regular Expressions: Performance Strategies
Writing regex
● Be careful with repetitions (+, *, {n,m})
  (abc|def){2,4} produces (abc|def)(abc|def)((abc|def)(abc|def)?)?
  (abc|def){2,1000} produces ...
● Be careful with wildcards
  re.findall(r'(ab).*(cd).*(ef)', 'ab cd ef') # slower
  re.findall(r'(ab)\s(cd)\s(ef)', 'ab cd ef') # faster
● Longer target string -> slower regex matching
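The two points about repetitions and wildcards can be checked with a short script. This is a sketch under a few assumptions: the candidate strings and variable names are made up, non-capturing groups are used in the equivalence check only to keep the comparison simple, and the timing call is left without expected numbers because they vary by machine.

```python
import re
import timeit

text = 'ab cd ef'
wildcard = re.compile(r'(ab).*(cd).*(ef)')
specific = re.compile(r'(ab)\s(cd)\s(ef)')

# Both patterns extract the same groups from this input.
wildcard_result = wildcard.findall(text)
specific_result = specific.findall(text)

# {2,4} accepts the same strings as its written-out expansion.
bounded = re.compile(r'^(?:abc|def){2,4}$')
expanded = re.compile(r'^(?:abc|def)(?:abc|def)(?:(?:abc|def)(?:abc|def)?)?$')
candidates = ['abc', 'abcdef', 'abcdefabc', 'abcdefabcdef', 'abcdefabcdefabc']
agree = all(bool(bounded.search(c)) == bool(expanded.search(c))
            for c in candidates)

if __name__ == '__main__':
    for name, rx in (('wildcard', wildcard), ('specific', specific)):
        seconds = timeit.timeit(lambda: rx.findall(text), number=100000)
        print('%-8s %.4f s' % (name, seconds))
```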
Regular Expressions: Performance Strategies
Writing regex
● Use a non-capturing group when you do not need to capture the text and save it to a variable
  (?:abc|def|ghi) instead of (abc|def|ghi)
● Put the pattern most likely to match first, and factor out common prefixes
  (TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
  TRAFFIC_(ALLOW|DROP|DENY)
● Use anchors (^ and $) to limit the scope
  re.findall(r'(ab){2}', 'abcabcabc')
  re.findall(r'^(ab){2}', 'abcabcabc') # failures occur faster
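A short sketch of what these three strategies change in practice. The log line is a made-up example and the variable names are illustrative; the patterns are the ones from the slide, with a `=(\d+)` suffix added to the first pair so that there is something left to capture.

```python
import re

# Hypothetical firewall log line, for illustration only.
line = 'TRAFFIC_DROP src=10.0.0.1'

# Common prefix factored out: only the varying part is in the alternation.
action = re.search(r'TRAFFIC_(ALLOW|DROP|DENY)', line).group(1)

# Capturing vs non-capturing: the second form returns only what is needed.
captured = re.findall(r'(abc|def|ghi)=(\d+)', 'abc=1 ghi=22')
uncaptured = re.findall(r'(?:abc|def|ghi)=(\d+)', 'abc=1 ghi=22')

# Neither pattern matches, but the anchored one can give up after the
# first position instead of retrying at every offset in the string.
unanchored = re.findall(r'(ab){2}', 'abcabcabc')
anchored = re.findall(r'^(ab){2}', 'abcabcabc')

print(action, captured, uncaptured, unanchored, anchored)
```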
Regular Expressions: Performance Strategies
Writing Agent plugins
● A new process is forked for each loaded plugin
  ○ Use only the plugins that you really need!
● A plugin is a set of rules (regexp operations) for matching log lines
  ○ If a plugin doesn't match a log entry, it fails in ALL its rules!
  ○ Reduce the number of rules, use a [translation] table
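The idea behind the translation table can be sketched in plain Python: one rule captures the varying field, and a lookup table maps it to an event id, instead of one rule per value. This is not the OSSIM Agent's code or its plugin file syntax; the table contents, the rule, the field names and parse() are all hypothetical.

```python
import re

# Hypothetical translation table: maps the captured action to an event id.
TRANSLATION = {'ALLOW': 1, 'DROP': 2, 'DENY': 3}

# One rule with a capture group instead of three near-identical rules.
RULE = re.compile(r'TRAFFIC_(?P<action>ALLOW|DROP|DENY)\s+src=(?P<src>\S+)')

def parse(line):
    """Return (event_id, src) for a matching line, or None."""
    m = RULE.search(line)
    if m is None:
        return None
    return TRANSLATION[m.group('action')], m.group('src')

print(parse('TRAFFIC_DENY src=10.0.0.1'))
print(parse('unrelated log entry'))
```

With a single rule, a log entry that does not belong to the plugin costs one failed regex instead of three.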
Regular Expressions: Performance Strategies
Writing Agent plugins
● Rules are matched in alphabetical order
  ○ Order your rules by priority, with the pattern most likely to match first
● Divide and conquer
  ○ A plugin is configured to read from a source file: use dedicated source files per technology
  ○ Also, use dedicated plugins for each technology
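Regular Expressions: Performance Strategies
Tool1  20 logs/sec  ->  /var/log/syslog
Tool2  20 logs/sec  ->  /var/log/syslog
Tool3  20 logs/sec  ->  /var/log/syslog
Tool4  20 logs/sec  ->  /var/log/syslog
Tool5  20 logs/sec  ->  /var/log/syslog
Total: 100 logs/sec in /var/log/syslog
5 plugins with 1 rule each, all reading /var/log/syslog
5 x 100 = 500 total regex/sec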
Regular Expressions: Performance Strategies
Tool1  20 logs/sec  ->  /var/log/tool1
Tool2  20 logs/sec  ->  /var/log/tool2
Tool3  20 logs/sec  ->  /var/log/tool3
Tool4  20 logs/sec  ->  /var/log/tool4
Tool5  20 logs/sec  ->  /var/log/tool5
Total: 100 logs/sec
5 plugins with 1 rule each, reading /var/log/tool{1-5}
5 x 20 = 100 total regex/sec (5x faster)
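The arithmetic behind the two layouts can be written out as a small calculation. It assumes, as the slides do, that every line a plugin reads is tested against every one of its rules; the function name and the tuple layout are illustrative.

```python
def regex_per_second(plugins):
    """plugins: list of (lines_read_per_sec, number_of_rules) tuples."""
    return sum(lines * rules for lines, rules in plugins)

# Shared file: each of the 5 plugins reads all 100 lines/sec of syslog.
shared = regex_per_second([(100, 1)] * 5)

# Dedicated files: each plugin reads only its own tool's 20 lines/sec.
dedicated = regex_per_second([(20, 1)] * 5)

print(shared, dedicated, shared / dedicated)
```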
Regular Expressions: Tools for testing Regex
Python:
>>> import re
>>> re.findall(r'(\S+) (\S+)', 'foo bar')
[('foo', 'bar')]
>>> result = re.search(
...     r'(?P<key>\w+)\s*=\s*(?P<value>\w+)',
...     'foo=bar'
... )
>>> result.groupdict()
{'key': 'foo', 'value': 'bar'}
Regular Expressions: Tools for testing Regex
Regex debuggers:
● Kiki
● Kodos
Online regex testers:
● http://gskinner.com/RegExr/ (java)
● http://regexpal.com/ (javascript)
● http://rubular.com/ (ruby)
● http://www.pythonregex.com/ (python)
Online regex visualization:
● http://www.regexper.com/ (javascript)
any (?:question|doubt|comment)+\?
A3Sec
web: www.a3sec.com
email: [email protected]
twitter: @a3sec
Spain Head Office
C/ Aravaca, 6, Piso 2
28040 Madrid
Tlf. +34 533 09 78
México Head Office
Avda. Paseo de la Reforma, 389 Piso 10
México DF
Tlf. +52 55 5980 3547