Representing TF and TF-IDF transformations in PMML
Villu Ruusmann
Openscoring OÜ
TF
Local Term Frequency (TF) - The frequency of the term in a document.
<TextIndex textField="documentField">
  <FieldRef field="termField"/>
</TextIndex>
sklearn.feature_extraction.text.CountVectorizer
org.apache.spark.ml.feature.CountVectorizer
TF-IDF
Global Term Frequency (TF-IDF) - TF, weighted by the "significance" of the term in the corpus of training documents.
<Apply function="*">
  <TextIndex textField="documentField">
    <FieldRef field="termField"/>
  </TextIndex>
  <FieldRef field="termWeightField"/>
</Apply>
sklearn.feature_extraction.text.TfidfTransformer
org.apache.spark.ml.feature.IDF
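The product above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and exact token matching; the function names are hypothetical and belong to neither PMML nor any library.

```python
def term_frequency(document: str, term: str) -> int:
    """Count exact occurrences of a single-token term in a whitespace-tokenized document."""
    return document.split().count(term)


def tf_idf(document: str, term: str, weight: float) -> float:
    """Multiply the local term frequency by a precomputed, corpus-level term weight."""
    return term_frequency(document, term) * weight


if __name__ == "__main__":
    # The weight would come from the training corpus; 5.4132 is just a sample value.
    print(tf_idf("hello 2017 goodbye 2017", "2017", 5.4132))
```

The weight is a constant per term, which is why it can be stored in the model as a plain field or constant rather than recomputed at scoring time.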
PMML encoding (1/2)
The "centralized" TF-IDF function definition:
<DefineFunction name="tf-idf" dataType="double" optype="continuous">
  <ParamField name="document"/>
  <ParamField name="term"/>
  <ParamField name="weight"/>
  <Apply function="*">
    <TextIndex textField="document">
      <FieldRef field="term"/>
    </TextIndex>
    <FieldRef field="weight"/>
  </Apply>
</DefineFunction>
PMML encoding (2/2)
Many "centralized" TF-IDF function invocations:
<DerivedField name="tf-idf(2017)" dataType="float" optype="continuous">
  <Apply function="tf-idf">
    <FieldRef field="tweetField"/>
    <Constant dataType="string">2017</Constant>
    <Constant dataType="double">5.4132</Constant>
  </Apply>
</DerivedField>
Many "localized" TF-IDF usages:
<Node>
  <SimplePredicate field="tf-idf(2017)" operator="lessThan" value="7.25"/>
</Node>
PMML TF algorithm
1. Normalize the document.
2. Tokenize the term and the document. Trim tokens by removing leading and trailing (but not continuation) punctuation characters.
3. Count the occurrences of term tokens in document tokens subject to the following constraints:
   3.1. Case-sensitivity
   3.2. Max Levenshtein distance (as measured in the number of single-character insertions, substitutions or deletions).
4. Transform the count to the final TF metric.
http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndex
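Steps 2 to 4 can be sketched in Python as follows. This is an illustration only, not the reference implementation: it assumes a whitespace separator RE, ASCII punctuation (`string.punctuation`), exact matching (edit distance 0), and only the "termFrequency" and "binary" output transformations. The function names are hypothetical.

```python
import re
import string


def tokenize(text, word_separator_re=r"\s+"):
    """Split on the separator RE, then trim leading/trailing punctuation from each token.

    Punctuation inside a token (e.g. the apostrophe in "Don't") is kept.
    """
    trimmed = (token.strip(string.punctuation) for token in re.split(word_separator_re, text))
    return [token for token in trimmed if token]


def term_frequency(document, term, case_sensitive=False, local_term_weights="termFrequency"):
    """Count exact occurrences of the (possibly multi-token) term, then transform the count."""
    doc_tokens = tokenize(document)
    term_tokens = tokenize(term)
    if not case_sensitive:
        doc_tokens = [token.lower() for token in doc_tokens]
        term_tokens = [token.lower() for token in term_tokens]
    n = len(term_tokens)
    if n == 0:
        return 0
    # Slide a window of n tokens over the document and compare it with the term tokens.
    count = sum(
        1 for i in range(len(doc_tokens) - n + 1) if doc_tokens[i:i + n] == term_tokens
    )
    if local_term_weights == "binary":
        return 1 if count > 0 else 0
    return count


if __name__ == "__main__":
    print(tokenize("Hello, world! Don't stop."))
    print(term_frequency("The year 2017, the YEAR 2017!", "year 2017"))
```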
String normalization
Ensuring that the unlimited, free-form text input complies with the limited, standardized vocabulary of the TextIndex element:
<TextIndexNormalization isCaseSensitive="false">
  <InlineTable>
    <Row>
      <string>[\u00c0-\u00c5]</string>
      <stem>a</stem>
      <regex>true</regex>
    </Row>
    <Row>
      <string>is|are|was|were</string>
      <stem>be</stem>
      <regex>true</regex>
    </Row>
  </InlineTable>
</TextIndexNormalization>
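The effect of such a normalization table can be approximated in Python by applying each row as a substitution, in order. This sketch assumes the substitution runs over the whole text rather than per token, so a pattern such as `is` would also match inside longer words; that boundary handling is not modelled here. The `ROWS` list and `normalize` function are hypothetical names.

```python
import re

# (string, stem, regex) rows mirroring the InlineTable above.
ROWS = [
    ("[\u00c0-\u00c5]", "a", True),
    ("is|are|was|were", "be", True),
]


def normalize(text, rows=ROWS, case_sensitive=False):
    """Apply each row to the text in table order, replacing matches with the stem."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for pattern, stem, is_regex in rows:
        if not is_regex:
            # Non-regex rows are treated as literal strings.
            pattern = re.escape(pattern)
        # A function replacement avoids interpreting backslashes in the stem.
        text = re.sub(pattern, lambda _match, stem=stem: stem, text, flags=flags)
    return text


if __name__ == "__main__":
    print(normalize("\u00c5ngstr\u00f6m"))
    print(normalize("they were here"))
```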
String tokenization
Two approaches for string tokenization using regular expressions (REs):
1. Define word separator RE and execute (Pattern.compile(wordSeparatorRE)).split(string)
2. Define word RE and collect every match of (Pattern.compile(wordRE)).matcher(string), for example by calling find() in a loop
Popular ML frameworks support both approaches.
PMML 4.2 and 4.3 only support the first approach. Hopefully, PMML 4.4 will support the second approach as well.
http://mantis.dmg.org/view.php?id=173
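For comparison, the two approaches look like this in Python's `re` module. The regular expressions `\W+` and `\w+` are sample choices, and the function names are hypothetical.

```python
import re


def tokens_by_separator(text, word_separator_re=r"\W+"):
    """Approach 1: split the text wherever the separator RE matches."""
    # Splitting can leave empty strings at the ends, so filter them out.
    return [token for token in re.split(word_separator_re, text) if token]


def tokens_by_word(text, word_re=r"\w+"):
    """Approach 2: collect every substring that matches the word RE."""
    return re.findall(word_re, text)


if __name__ == "__main__":
    text = "Happy New Year 2017!"
    print(tokens_by_separator(text))
    print(tokens_by_word(text))
```

With complementary patterns the two approaches give the same tokens. They diverge when the word RE is more selective, for example when it only accepts tokens of two or more characters.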
Counting terms in a document
A "match" is a situation where the difference between term tokens [0, length] and document tokens [i, i + length] (where i is the match position), is less than or equal to the match threshold.
Match threshold is a function of TextIndex@isCaseSensitive and TextIndex@maxLevenshteinDistance attribute values. During case-insensitive matching (the default), the edit distance between two characters that only differ by case is considered to be 0, whereas during case-sensitive matching it is considered to be 1.
The matches may overlap if the "length" of term tokens is greater than one.
http://mantis.dmg.org/view.php?id=172
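The matching rule can be sketched in Python as below. One assumption is made explicit: the "difference" between the term tokens and a document window is taken to be the sum of the per-token Levenshtein distances. Case-insensitive matching is modelled by lowercasing both sides, which makes characters that differ only by case cost 0. All names are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, substitutions or deletions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def count_matches(doc_tokens, term_tokens, case_sensitive=False, max_levenshtein_distance=0):
    """Count window positions i where the term tokens match doc_tokens[i:i + length]."""
    if not case_sensitive:
        doc_tokens = [token.lower() for token in doc_tokens]
        term_tokens = [token.lower() for token in term_tokens]
    length = len(term_tokens)
    if length == 0:
        return 0
    count = 0
    for i in range(len(doc_tokens) - length + 1):
        window = doc_tokens[i:i + length]
        # Assumption: distances are summed over the token pairs of the window.
        distance = sum(levenshtein(t, d) for t, d in zip(term_tokens, window))
        if distance <= max_levenshtein_distance:
            count += 1
    return count


if __name__ == "__main__":
    # Windows advance one token at a time, so multi-token matches may overlap.
    print(count_matches(["a", "a", "a"], ["a", "a"]))
    print(count_matches(["colour"], ["color"], max_levenshtein_distance=1))
```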
Interoperability with Scikit-Learn (1/2)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(..,
    strip_accents = .., # If not None, handle using text normalization
    analyzer = "word", # Set to "word"
    preprocessor = .., # If not None, handle using text normalization
    tokenizer = .., # If not None, handle using text tokenization
    token_pattern = None, # Set to None. Use the "tokenizer" attribute instead
    lowercase = .., # If True, convert the document to a lowercase string and perform term matching in a case-insensitive manner
    binary = .., # Determines the transformation from counts to final TF metric ("binary" for True, and "termFrequency" for False)
    sublinear_tf = .., # If True, apply scaling to final TF metric
    norm = None # Set to None
)
Interoperability with Scikit-Learn (2/2)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.feature_extraction.text import Splitter

# Like sklearn.pipeline.Pipeline, PMMLPipeline takes a list of (name, step) tuples
pipeline = PMMLPipeline([
    ('tf-idf', TfidfVectorizer(
        analyzer = "word",
        preprocessor = None,
        strip_accents = None,
        tokenizer = Splitter(),
        token_pattern = None,
        stop_words = "english",
        ngram_range = (1, 2),
        binary = False,
        use_idf = True,
        norm = None))
])
from sklearn2pmml import sklearn2pmml
sklearn2pmml(pipeline, "pipeline.pmml")
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml