Representing TF and TF-IDF transformations in PMML
Villu Ruusmann, Openscoring OÜ


TRANSCRIPT

Page 1: Representing TF and TF-IDF transformations in PMML

Representing TF and TF-IDF transformations in PMML

Villu Ruusmann, Openscoring OÜ

Page 3: Representing TF and TF-IDF transformations in PMML

TF-IDF

Global Term Frequency (TF-IDF) - TF, weighted by the "significance" of the term in the corpus of training documents.

<Apply function="*">
  <TextIndex textField="documentField">
    <FieldRef field="termField"/>
  </TextIndex>
  <FieldRef field="termWeightField"/>
</Apply>

sklearn.feature_extraction.text.TfidfTransformer
org.apache.spark.ml.feature.IDF
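For orientation, a minimal Python sketch (not from the slides) of the underlying arithmetic. The log-based IDF shown here is one common variant; scikit-learn and Spark ML apply their own smoothing, so the exact weights that end up in the term weight field will differ.

import math

def tf_idf(tf, n_documents, document_frequency):
    # IDF grows for terms that occur in few training documents ("significant" terms)
    idf = math.log(n_documents / document_frequency)
    return tf * idf

# A term occurring 3 times in the scored document, but in only 10 of 1000 training documents
print(tf_idf(3, 1000, 10))  # 3 * ln(100), roughly 13.8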

Page 4: Representing TF and TF-IDF transformations in PMML

PMML encoding (1/2)

The "centralized" TF-IDF function definition:

<DefineFunction name="tf-idf" dataType="double" optype="continuous">
  <ParamField name="document"/>
  <ParamField name="term"/>
  <ParamField name="weight"/>
  <Apply function="*">
    <TextIndex textField="document">
      <FieldRef field="term"/>
    </TextIndex>
    <FieldRef field="weight"/>
  </Apply>
</DefineFunction>

Page 5: Representing TF and TF-IDF transformations in PMML

PMML encoding (2/2)

Many "centralized" TF-IDF function invocations:

<DerivedField name="tf-idf(2017)" dataType="float" optype="continuous">
  <Apply function="tf-idf">
    <FieldRef field="tweetField"/>
    <Constant dataType="string">2017</Constant>
    <Constant dataType="double">5.4132</Constant>
  </Apply>
</DerivedField>

Many "localized" TF-IDF usages:

<Node>
  <SimplePredicate field="tf-idf(2017)" operator="lessThan" value="7.25"/>
</Node>

Page 6: Representing TF and TF-IDF transformations in PMML

PMML TF algorithm

1. Normalize the document.
2. Tokenize the term and the document. Trim tokens by removing leading and trailing (but not continuation) punctuation characters.
3. Count the occurrences of term tokens in document tokens, subject to the following constraints:
   3.1. Case-sensitivity
   3.2. Max Levenshtein distance (as measured in the number of single-character insertions, substitutions or deletions).
4. Transform the count to the final TF metric.

http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndex
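To make these steps concrete, here is a rough Python sketch (not from the slides) that walks through them under simplifying assumptions: whitespace tokenization, exact token comparison in place of the configurable Levenshtein matching, and only the "termFrequency" and "binary" variants of the final TF metric.

import string

def term_frequency(term, document, case_sensitive=False, binary=False):
    # 1. Normalize the document (string normalization proper is handled by
    #    TextIndexNormalization; here we only fold case when requested).
    if not case_sensitive:
        term, document = term.lower(), document.lower()

    # 2. Tokenize the term and the document; trim leading/trailing punctuation only.
    def tokenize(text):
        return [token.strip(string.punctuation) for token in text.split()]
    term_tokens, document_tokens = tokenize(term), tokenize(document)

    # 3. Count (possibly overlapping) occurrences of the term token sequence.
    n = len(term_tokens)
    count = sum(1 for i in range(len(document_tokens) - n + 1)
                if document_tokens[i:i + n] == term_tokens)

    # 4. Transform the count to the final TF metric.
    return min(count, 1) if binary else count

print(term_frequency("machine learning", "Machine learning, more machine learning!"))  # 2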

Page 7: Representing TF and TF-IDF transformations in PMML

String normalization

Ensuring that the unlimited, free-form text input complies with the limited, standardized vocabulary of the TextIndex element:

<TextIndexNormalization isCaseSensitive="false">
  <InlineTable>
    <Row>
      <string>[\u00c0-\u00c5]</string>
      <stem>a</stem>
      <regex>true</regex>
    </Row>
    <Row>
      <string>is|are|was|were</string>
      <stem>be</stem>
      <regex>true</regex>
    </Row>
  </InlineTable>
</TextIndexNormalization>
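A minimal Python sketch (not from the slides) of how such a normalization table could be applied, assuming the rows are applied top to bottom over the whole document and that isCaseSensitive="false" maps to case-insensitive regex matching; the PMML spec additionally covers recursive re-application and non-regex rows, which are omitted here.

import re

# (regex, stem) pairs mirroring the two InlineTable rows above
normalization_rules = [
    (r"[\u00c0-\u00c5]", "a"),   # fold accented A variants to plain "a"
    (r"is|are|was|were", "be"),  # map inflections of "to be" to the stem "be"
]

def normalize(document):
    for pattern, stem in normalization_rules:
        document = re.sub(pattern, stem, document, flags=re.IGNORECASE)
    return document

print(normalize("Åland is cold"))  # "aland be cold"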

Page 8: Representing TF and TF-IDF transformations in PMML

String tokenization

Two approaches for string tokenization using regular expressions (REs):

1. Define word separator RE and execute (Pattern.compile(wordSeparatorRE)).split(string)

2. Define word RE and execute ((Pattern.compile(wordRE)).matcher(string)).findAll()

Popular ML frameworks support both approaches.

PMML 4.2 and 4.3 only support the first approach. Hopefully, PMML 4.4 will support the second approach as well.

http://mantis.dmg.org/view.php?id=173
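To make the difference concrete, a small Python comparison with hypothetical REs (not from the slides); note how the choice of RE changes the resulting token stream.

import re

text = "It's a state-of-the-art tokenizer, isn't it?"

# Approach 1: split on a word-separator RE (the only style expressible in PMML 4.2/4.3)
word_separator_re = r"\s+"
print([t for t in re.split(word_separator_re, text) if t])
# ["It's", 'a', 'state-of-the-art', 'tokenizer,', "isn't", 'it?']

# Approach 2: collect all matches of a word RE (proposed for PMML 4.4)
word_re = r"\w+"
print(re.findall(word_re, text))
# ['It', 's', 'a', 'state', 'of', 'the', 'art', 'tokenizer', 'isn', 't', 'it']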

Page 9: Representing TF and TF-IDF transformations in PMML

Counting terms in a document

A "match" is a situation where the difference between term tokens [0, length] and document tokens [i, i + length] (where i is the match position) is less than or equal to the match threshold.

Match threshold is a function of TextIndex@isCaseSensitive and TextIndex@maxLevenshteinDistance attribute values. During case-insensitive matching (the default), the edit distance between two characters that only differ by case is considered to be 0, whereas during case-sensitive matching it is considered to be 1.

The matches may overlap if the "length" of term tokens is greater than one.

http://mantis.dmg.org/view.php?id=172
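One plausible Python reading of this rule (a sketch under the stated assumptions, not the reference implementation): sum the character-level edit distances of corresponding tokens in the window and test the sum against TextIndex@maxLevenshteinDistance.

def levenshtein(a, b, case_sensitive=False):
    # Character-level edit distance; with case-insensitive matching, characters
    # that differ only by case contribute a distance of 0.
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]

def count_matches(term_tokens, document_tokens, max_levenshtein_distance=0, case_sensitive=False):
    # Slide a window of len(term_tokens) over the document tokens; matches may overlap.
    n = len(term_tokens)
    count = 0
    for i in range(len(document_tokens) - n + 1):
        distance = sum(levenshtein(t, d, case_sensitive)
                       for t, d in zip(term_tokens, document_tokens[i:i + n]))
        if distance <= max_levenshtein_distance:
            count += 1
    return count

print(count_matches(["machine", "learning"], ["Machine", "lerning", "rocks"], 1))  # 1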

Page 10: Representing TF and TF-IDF transformations in PMML

Interoperability with Scikit-Learn (1/2)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(..,
    strip_accents = ..,   # If not None, handle using text normalization
    analyzer = "word",    # Set to "word"
    preprocessor = ..,    # If not None, handle using text normalization
    tokenizer = ..,       # If not None, handle using text tokenization
    token_pattern = None, # Set to None; use the "tokenizer" attribute instead
    lowercase = ..,       # If True, convert the document to lowercase and perform term matching in a case-insensitive manner
    binary = ..,          # Determines the transformation from counts to the final TF metric ("binary" for True, "termFrequency" for False)
    sublinear_tf = ..,    # If True, apply scaling to the final TF metric
    norm = None           # Set to None
)

Page 11: Representing TF and TF-IDF transformations in PMML

Interoperability with Scikit-Learn (2/2)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.feature_extraction.text import Splitter

pipeline = PMMLPipeline([
    ("tf-idf", TfidfVectorizer(analyzer = "word", preprocessor = None,
        strip_accents = None, tokenizer = Splitter(), token_pattern = None,
        stop_words = "english", ngram_range = (1, 2), binary = False,
        use_idf = True, norm = None))
])

from sklearn2pmml import sklearn2pmml

sklearn2pmml(pipeline, "pipeline.pmml")

Page 12: Representing TF and TF-IDF transformations in PMML

Q&A

[email protected]

https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml