Machine Learning in PHP
PHPCon Poland, Warsaw, October 2016
"Learn, someday this pain will be useful to you"
Agenda
• How to teach tricks to your PHP
• Application : searching for code in comments
• Complex learning
Speaker
• Damien Seguy
• Exakat CTO
• Static analysis of PHP code
Machine Learning
• Teaching the machine
• Supervised learning : learning then applying
• The application builds its own model: training phase
• It then applies this model to real cases: applying phase
Applications
• Play go, chess, tic-tac-toe and beat everyone else
• Fraud detection and risk analysis
• Automated translation or automated transcription
• OCR and face recognition
• Medical diagnostics
• Walk, welcome guests at hotels, play football
• Finding good PHP code
PHP Applications
• Recommendation systems
• Predicting user behavior
• Spam
• Converting users to customers
• ETA
• Detect code in comments
Real Use Case
• Identify code in comments
• Classic problem
• Good problem for machine learning
• Complex, no simple solution
• A lot of data and expertise are available
Supervised Training
• History data → Training → Model
• Real data → Model → Results
The Fann Extension
• ext/fann (https://pecl.php.net/package/fann)
• Fast Artificial Neural Network
• http://leenissen.dk/fann/wp/
• Neural networks in PHP
• Works on PHP 7, thanks to the hard work of Jakub Zelenka
• https://github.com/bukka/php-fann
Neural Networks
• Imitation of nature
• Input layer
• Output layer
• Intermediate layers
Initialisation
<?php
$num_layers = 3; // total layer count: input, hidden and output
$num_input = 5;
$num_neurons_hidden = 3;
$num_output = 1;
$ann = fann_create_standard($num_layers, $num_input, $num_neurons_hidden, $num_output);

// Activation function
fann_set_activation_function_hidden($ann, FANN_SIGMOID_SYMMETRIC);
fann_set_activation_function_output($ann, FANN_SIGMOID_SYMMETRIC);
Preparing Data
Raw data → Extract → Filter → Human review → FANN-ready
• Extract data from raw source
• Remove any useless data from extract
• Apply some human review to filtered data
• Format data for FANN
Expert At Work
// Test if the if is in a compressed format
// nie mowie po polsku
// There is a parser specified in `Parser::$KEYWORD_PARSERS`
// $result should exist, regardless of $_message
// TODO : fix this; var_dump($var);
// $a && $b and multidimensional
// numGlyphs + 1
//$annots .= ' /StructParent ';
// $cfg['Servers'][$i]['controlpass'] = 'pmapass';
// if(ob_get_clean()){
Input Vector
• 'length' : size of the comment
• 'countDollar' : number of $
• 'countEqual' : number of =
• 'countObjectOperator' : number of -> operators ($o->p)
• 'countSemicolon' : number of semi-colon ;
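The five characteristics above can be computed with plain string functions. The following is a minimal sketch of what a `makeVector()` helper might look like; the slides call a function of that name later but never show its body, so this implementation is an assumption. The order of the values follows the list above.

```php
<?php
// Hypothetical reconstruction of makeVector(): the talk uses it but does not show it.
function makeVector(string $comment): array
{
    return [
        strlen($comment),              // 'length'
        substr_count($comment, '$'),   // 'countDollar'
        substr_count($comment, '='),   // 'countEqual'
        substr_count($comment, '->'),  // 'countObjectOperator'
        substr_count($comment, ';'),   // 'countSemicolon'
    ];
}

// Sample comment taken from the slides.
print implode(' ', makeVector('// $x[3] or $x[] and multidimensional')) . "\n";
```

Note that a plain character count treats `==` as two equal signs, which may or may not be what a production version would want.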
Input Data
47 5 1
825 0 0 0 1
0
37 2 0 0 0
0
55 2 2 0 1
1
61 2 1 3 1
1
...
• First line: number of cases, number of incoming data, number of outgoing data
• Then, for each case: one line of incoming data, one line of outgoing data
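The training file starts with a header of three counts, followed by one line of inputs and one line of outputs per case. A small formatter along these lines could produce it; the function name and the two sample cases are illustrative assumptions, not code from the talk.

```php
<?php
// Sketch: serialise labelled vectors into FANN's plain-text training format.
// $cases is a list of [inputs, outputs] pairs; formatFannTrainingData() is a hypothetical name.
function formatFannTrainingData(array $cases): string
{
    if ($cases === []) {
        throw new InvalidArgumentException('At least one case is required');
    }
    [$firstInputs, $firstOutputs] = $cases[0];
    // Header: number of cases, inputs per case, outputs per case.
    $lines = [count($cases) . ' ' . count($firstInputs) . ' ' . count($firstOutputs)];
    foreach ($cases as [$inputs, $outputs]) {
        $lines[] = implode(' ', $inputs);
        $lines[] = implode(' ', $outputs);
    }
    return implode("\n", $lines) . "\n";
}

$cases = [
    [[37, 2, 0, 0, 0], [0]], // plain comment
    [[61, 2, 1, 3, 1], [1]], // commented-out code
];
print formatFannTrainingData($cases);
```

The returned string could then be written to a file such as 'incoming.data' with file_put_contents().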
* (at your option) any later version. * * Exakat is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU Affero General Public License for more details. * * You should have received a copy of the GNU Affero General Public License * along with Exakat. If not, see <http://www.gnu.org/licenses/>. * * The latest code can be found at <http://exakat.io/>. * */
// $x[3] or $x[] and multidimensional
//if ($round == 3) { die('Round '.$round);}
//$this->errors[] = $this->language->get('error_permission');
Black Magic
• Comment: // $x[3] or $x[] and multidimensional
• Data: 1 5 1 / 37 2 0 0 0 / 0
• → ext/fann →
• It's a comment
Training
<?php
$max_epochs = 500000;
$epochs_between_reports = 1000; // reporting interval; not shown on the slide, value assumed
$desired_error = 0.001;

// the actual training
if (fann_train_on_file($ann, 'incoming.data', $max_epochs, $epochs_between_reports, $desired_error)) {
    fann_save($ann, 'model.out');
}
fann_destroy($ann);
?>
Training
• 47 cases
• 5 characteristics
• 3 hidden neurons
• + 5 input + 1 output
• Duration : 5.711 s
Application
<?php
$ann = fann_create_from_file('model.out');

$comment = '//$gvars = $this->getGraphicVars();';

// makeVector() turns the comment into the 5-value input vector
$input = makeVector($comment);
$results = fann_run($ann, $input);

if ($results[0] > 0.8) {
    print "\"$comment\" -> $results[0] \n";
}
?>
Results > 0.8
• Answer between 0 and 1
• Values range from -14 to 0.999
• The closer to 1, the more surely it is code. The closer to 0, the more surely it is not.
• Is this a percentage? Is this a carrot count?
• It's a mix of counts…
Scores Distribution
[Chart: distribution of scores. Axis ticks run from -16 to 0 and from 60 to 100.]
Real Cases
• Tested on 14093 comments
• Duration 68.01ms
• Found 1960 issues (14%)
0.99999893 // $cfg['Servers'][$i]['controlhost'] = '';
0.99999928 //$_SESSION['Import_message'] = $message->getDisplay();
0.99999928 /* if (defined('SESSIONUPLOAD')) { // write sessionupload back into the loaded PMA session
$sessionupload = unserialize(SESSIONUPLOAD); foreach ($sessionupload as $key => $value) { $_SESSION[$key] = $value; }
// remove session upload data that are not set anymore foreach ($_SESSION as $key => $value) { if (mb_substr($key, 0, mb_strlen(UPLOAD_PREFIX)) == UPLOAD_PREFIX && ! isset($sessionupload[$key]) ) {
0.98780382 //LEAD_OFFSET = (0xD800 - (0x10000 >> 10)) = 55232
0.99361396 // We have server(s) => apply default configuration
0.98383027 // Duration = as configured
0.99999928 // original -> translation mapping
0.97590065 // = ( 59 x 84 ) mm = ( 2.32 x 3.31 ) in
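Found by FANN (machine learning) vs. target (expert work)
• True positive: found by FANN, confirmed by the expert
• False positive: found by FANN, rejected by the expert
• True negative: ignored by FANN, ignored by the expert
• False negative: ignored by FANN, but the expert calls it code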
True positive, false positive, true negative, false negative: examples found by FANN vs. target
0.99999923 // $cfg['Servers'][$i]['table_coords'] = 'pma__table_coords';
0.73295981 //(isset($attribs['height'])?$attribs['height']: 1);
0.99999851 // if ($key != null) did not work for index "0"
0.2104115 // the PASSWORD() function
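Comparing what FANN found with the expert's target can be automated once both are available as arrays. The sketch below counts the four outcomes at the 0.8 threshold used earlier; the function name, the scores and the expert labels are made up for illustration and are not results from the talk.

```php
<?php
// Sketch: compare FANN scores with the expert's labels at a given threshold.
// confusionMatrix() and the sample data are illustrative assumptions.
function confusionMatrix(array $scores, array $isCode, float $threshold = 0.8): array
{
    $matrix = ['truePositive' => 0, 'falsePositive' => 0, 'trueNegative' => 0, 'falseNegative' => 0];
    foreach ($scores as $i => $score) {
        $found = $score > $threshold; // FANN's verdict
        if ($found && $isCode[$i]) {
            $matrix['truePositive']++;
        } elseif ($found) {
            $matrix['falsePositive']++;
        } elseif ($isCode[$i]) {
            $matrix['falseNegative']++;
        } else {
            $matrix['trueNegative']++;
        }
    }
    return $matrix;
}

$scores = [0.99, 0.73, 0.95, 0.21];      // hypothetical FANN outputs
$isCode = [true, true, false, false];    // hypothetical expert labels
$matrix = confusionMatrix($scores, $isCode);
print_r($matrix);

// Precision: share of reported issues that the expert confirms.
$reported = $matrix['truePositive'] + $matrix['falsePositive'];
printf("precision: %.2f\n", $reported > 0 ? $matrix['truePositive'] / $reported : 0.0);
```

Tracking these four counts makes it possible to see whether a change in the training data or in the network actually reduces false positives.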
Results
• 1960 issues
• More than 50% false positives
• After an easy cleanup, 822 issues reported
• 14k comments analyzed in 68 ms (367 ms in PHP 5)
• Total coding time: 27 minutes
// = ( 59 x 84 ) mm = ( 2.32 x 3.31 ) in
/* vim: set expandtab sw=4 ts=4 sts=4: */
Learn Better, Not Harder
• Better training data
• Improve characteristics
• Configure the neural network
• Change algorithm
• Automate learning
• Update constantly
Diagram: history data → training → model; real data → model → results; feedback from the results into the training.
Better Training Data
• More data, more data, more data
• Varied situations, real case situations
• Include specific cases
• Experience is capital
• https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Improve Characteristics
• Add new characteristics
• Remove the ones that are less interesting
• Find the right set of characteristics
Network Configuration
• Input vector
• Intermediate neurons
• Activation function
• Output vector
[Chart: time of training (ms). Y-axis from 0 to 20000, x-axis from 1 to 10, one series each for 1, 2, 3 and 4 layers.]
Change Algorithm
• First add more data before changing algorithm
• Try cascade2 algorithm from FANN
• 0.6 => 0 found
• 0.5 => 2 found
• Not found by the first algorithm
• Ant colony, genetic algorithms, gravitational search, artificial immune systems, "nie mowie po polsku" ("I don't speak Polish"), annealing, harmony search, interior point search, tabu search
Finding The Best
• Test with 2-4 layers, 10 neurons
• Measure results
[Chart: measured results. Y-axis from 0 to 9000, x-axis from 1 to 13, one series each for 1, 2, 3 and 4 layers.]
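Testing several network shapes and measuring each one can be scripted. The sketch below enumerates configurations with 1 to 4 hidden layers of 10 neurons and, only when ext/fann and a training file are present, times the training of each. The `layerSizes()` helper, the file name 'incoming.data' and the training parameters are assumptions carried over from the earlier slides, not the speaker's benchmark code.

```php
<?php
// Sketch: enumerate candidate network shapes; layerSizes() is a hypothetical helper.
function layerSizes(int $inputs, int $hiddenLayers, int $neurons, int $outputs): array
{
    // Input layer, then N identical hidden layers, then the output layer.
    return array_merge([$inputs], array_fill(0, $hiddenLayers, $neurons), [$outputs]);
}

foreach ([1, 2, 3, 4] as $hiddenLayers) {
    $layers = layerSizes(5, $hiddenLayers, 10, 1);
    print $hiddenLayers . ' hidden layer(s): ' . implode('-', $layers) . "\n";

    // Timing only runs when ext/fann and the training file are available.
    if (function_exists('fann_create_standard_array') && file_exists('incoming.data')) {
        $ann = fann_create_standard_array(count($layers), $layers);
        if ($ann === false) {
            continue;
        }
        $start = microtime(true);
        fann_train_on_file($ann, 'incoming.data', 500000, 1000, 0.001);
        printf("  trained in %.1f ms\n", (microtime(true) - $start) * 1000);
        fann_destroy($ann);
    }
}
```

Training time alone says nothing about quality, so each configuration would still need to be scored against expert-reviewed comments.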
Deep Learning
• Chaining the neural networks
• Translators, scorers, auto-encoders
• Unsupervised Learning
Other Tools
• PHP ext/fann
• R language
• https://github.com/kachkaev/php-r
• Scikit-learn
• https://github.com/scikit-learn/scikit-learn
• Mahout
• https://mahout.apache.org/
Conclusion
• Machine learning is about data, not code
• There are tools to use it with PHP
• Fast to try, easy results or fast fail
• Use it for complex problems, that accepts error
http://www.exakat.io
@exakat
http://www.slideshare.net/dseguy/
PHP 7.1 Preparation Workshop
Dzięki czemu