Representing TF and TF-IDF transformations in PMML
Villu Ruusmann, Openscoring OÜ


TRANSCRIPT

Page 1: Representing TF and TF-IDF transformations in PMML

Representing TF and TF-IDF transformations in PMML

Villu Ruusmann, Openscoring OÜ

Page 3: Representing TF and TF-IDF transformations in PMML

TF-IDF

Global Term Frequency (TF-IDF) - TF, weighted by the "significance" of the term in the corpus of training documents.

<Apply function="*">
  <TextIndex textField="documentField">
    <FieldRef field="termField"/>
  </TextIndex>
  <FieldRef field="termWeightField"/>
</Apply>

sklearn.feature_extraction.text.TfidfTransformer
org.apache.spark.ml.feature.IDF
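For orientation, a minimal Python sketch (not from the slides) of the underlying arithmetic. The log-based IDF shown here is one common variant; scikit-learn and Spark ML apply their own smoothing, so the exact weights that end up in the term weight field will differ.

import math

def tf_idf(tf, n_documents, document_frequency):
    # IDF grows for terms that occur in few training documents ("significant" terms)
    idf = math.log(n_documents / document_frequency)
    return tf * idf

# A term occurring 3 times in the scored document, but in only 10 of 1000 training documents
print(tf_idf(3, 1000, 10))  # 3 * ln(100), roughly 13.8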

Page 4: Representing TF and TF-IDF transformations in PMML

PMML encoding (1/2)

The "centralized" TF-IDF function definition:

<DefineFunction name="tf-idf" dataType="double" optype="continuous">
  <ParamField name="document"/>
  <ParamField name="term"/>
  <ParamField name="weight"/>
  <Apply function="*">
    <TextIndex textField="document">
      <FieldRef field="term"/>
    </TextIndex>
    <FieldRef field="weight"/>
  </Apply>
</DefineFunction>

Page 5: Representing TF and TF-IDF transformations in PMML

PMML encoding (2/2)

Many "centralized" TF-IDF function invocations:

<DerivedField name="tf-idf(2017)" dataType="float" optype="continuous">
  <Apply function="tf-idf">
    <FieldRef field="tweetField"/>
    <Constant dataType="string">2017</Constant>
    <Constant dataType="double">5.4132</Constant>
  </Apply>
</DerivedField>

Many "localized" TF-IDF usages:

<Node>
  <SimplePredicate field="tf-idf(2017)" operator="lessThan" value="7.25"/>
</Node>

Page 6: Representing TF and TF-IDF transformations in PMML

PMML TF algorithm

1. Normalize the document.
2. Tokenize the term and the document. Trim tokens by removing leading and trailing (but not continuation) punctuation characters.
3. Count the occurrences of term tokens in document tokens, subject to the following constraints:
   3.1. Case-sensitivity
   3.2. Max Levenshtein distance (as measured in the number of single-character insertions, substitutions or deletions).
4. Transform the count to the final TF metric.

http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndex
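To make these steps concrete, here is a rough Python sketch (not from the slides) that walks through them under simplifying assumptions: whitespace tokenization, exact token comparison in place of the configurable Levenshtein matching, and only the "termFrequency" and "binary" variants of the final TF metric.

import string

def term_frequency(term, document, case_sensitive=False, binary=False):
    # 1. Normalize the document (string normalization proper is handled by
    #    TextIndexNormalization; here we only fold case when requested).
    if not case_sensitive:
        term, document = term.lower(), document.lower()

    # 2. Tokenize the term and the document; trim leading/trailing punctuation only.
    def tokenize(text):
        return [token.strip(string.punctuation) for token in text.split()]
    term_tokens, document_tokens = tokenize(term), tokenize(document)

    # 3. Count (possibly overlapping) occurrences of the term token sequence.
    n = len(term_tokens)
    count = sum(1 for i in range(len(document_tokens) - n + 1)
                if document_tokens[i:i + n] == term_tokens)

    # 4. Transform the count to the final TF metric.
    return min(count, 1) if binary else count

print(term_frequency("machine learning", "Machine learning, more machine learning!"))  # 2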

Page 7: Representing TF and TF-IDF transformations in PMML

String normalization

Ensuring that the unlimited, free-form text input complies with the limited, standardized vocabulary of the TextIndex element:

<TextIndexNormalization isCaseSensitive="false">
  <InlineTable>
    <Row>
      <string>[\u00c0-\u00c5]</string>
      <stem>a</stem>
      <regex>true</regex>
    </Row>
    <Row>
      <string>is|are|was|were</string>
      <stem>be</stem>
      <regex>true</regex>
    </Row>
  </InlineTable>
</TextIndexNormalization>
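A minimal Python sketch (not from the slides) of how such a normalization table could be applied, assuming the rows are applied top to bottom over the whole document and that isCaseSensitive="false" maps to case-insensitive regex matching; the PMML spec additionally covers recursive re-application and non-regex rows, which are omitted here.

import re

# (regex, stem) pairs mirroring the two InlineTable rows above
normalization_rules = [
    (r"[\u00c0-\u00c5]", "a"),   # fold accented A variants to plain "a"
    (r"is|are|was|were", "be"),  # map inflections of "to be" to the stem "be"
]

def normalize(document):
    for pattern, stem in normalization_rules:
        document = re.sub(pattern, stem, document, flags=re.IGNORECASE)
    return document

print(normalize("Åland is cold"))  # "aland be cold"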

Page 8: Representing TF and TF-IDF transformations in PMML

String tokenization

Two approaches for string tokenization using regular expressions (REs):

1. Define word separator RE and execute (Pattern.compile(wordSeparatorRE)).split(string)

2. Define word RE and execute ((Pattern.compile(wordRE)).matcher(string)).findAll()

Popular ML frameworks support both approaches.

PMML 4.2 and 4.3 only support the first approach. Hopefully, PMML 4.4 will support the second approach as well.

http://mantis.dmg.org/view.php?id=173
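To make the difference concrete, a small Python comparison with hypothetical REs (not from the slides); note how the choice of RE changes the resulting token stream.

import re

text = "It's a state-of-the-art tokenizer, isn't it?"

# Approach 1: split on a word-separator RE (the only style expressible in PMML 4.2/4.3)
word_separator_re = r"\s+"
print([t for t in re.split(word_separator_re, text) if t])
# ["It's", 'a', 'state-of-the-art', 'tokenizer,', "isn't", 'it?']

# Approach 2: collect all matches of a word RE (proposed for PMML 4.4)
word_re = r"\w+"
print(re.findall(word_re, text))
# ['It', 's', 'a', 'state', 'of', 'the', 'art', 'tokenizer', 'isn', 't', 'it']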

Page 9: Representing TF and TF-IDF transformations in PMML

Counting terms in a document

A "match" is a situation where the difference between term tokens [0, length] and document tokens [i, i + length] (where i is the match position) is less than or equal to the match threshold.

Match threshold is a function of TextIndex@isCaseSensitive and TextIndex@maxLevenshteinDistance attribute values. During case-insensitive matching (the default), the edit distance between two characters that only differ by case is considered to be 0, whereas during case-sensitive matching it is considered to be 1.

The matches may overlap if the "length" of term tokens is greater than one.

http://mantis.dmg.org/view.php?id=172
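One plausible Python reading of this rule (a sketch under the stated assumptions, not the reference implementation): sum the character-level edit distances of corresponding tokens in the window and test the sum against TextIndex@maxLevenshteinDistance.

def levenshtein(a, b, case_sensitive=False):
    # Character-level edit distance; with case-insensitive matching, characters
    # that differ only by case contribute a distance of 0.
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]

def count_matches(term_tokens, document_tokens, max_levenshtein_distance=0, case_sensitive=False):
    # Slide a window of len(term_tokens) over the document tokens; matches may overlap.
    n = len(term_tokens)
    count = 0
    for i in range(len(document_tokens) - n + 1):
        distance = sum(levenshtein(t, d, case_sensitive)
                       for t, d in zip(term_tokens, document_tokens[i:i + n]))
        if distance <= max_levenshtein_distance:
            count += 1
    return count

print(count_matches(["machine", "learning"], ["Machine", "lerning", "rocks"], 1))  # 1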

Page 10: Representing TF and TF-IDF transformations in PMML

Interoperability with Scikit-Learn (1/2)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(..,
    strip_accents = ..,   # If not None, handle using text normalization
    analyzer = "word",    # Set to "word"
    preprocessor = ..,    # If not None, handle using text normalization
    tokenizer = ..,       # If not None, handle using text tokenization
    token_pattern = None, # Set to None; use the "tokenizer" attribute instead
    lowercase = ..,       # If True, convert the document to lowercase and perform term matching in a case-insensitive manner
    binary = ..,          # Determines the transformation from counts to the final TF metric ("binary" for True, "termFrequency" for False)
    sublinear_tf = ..,    # If True, apply scaling to the final TF metric
    norm = None           # Set to None
)

Page 11: Representing TF and TF-IDF transformations in PMML

Interoperability with Scikit-Learn (2/2)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.feature_extraction.text import Splitter

pipeline = PMMLPipeline([
    ("tf-idf", TfidfVectorizer(analyzer = "word", preprocessor = None,
        strip_accents = None, tokenizer = Splitter(), token_pattern = None,
        stop_words = "english", ngram_range = (1, 2), binary = False,
        use_idf = True, norm = None))
])

from sklearn2pmml import sklearn2pmml

sklearn2pmml(pipeline, "pipeline.pmml")

Page 12: Representing TF and TF-IDF transformations in PMML

Q&A

[email protected]

https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml