technologies large-scale search
TRANSCRIPT
![Page 1: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/1.jpg)
LARGE-SCALE SEARCH TECHNOLOGIES
![Page 2: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/2.jpg)
Why is building an internet search engine so hard?
Over 150 trillion webpages. Only 1 of every 3,000 webpages is indexed.
It is practically impossible to index all the words in every webpage
One cannot build a competitive search without huge amount of users’ data
It’s too expensive to compute search results on the fly
![Page 3: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/3.jpg)
State of the art technologies
State of the art technologies wouldn’t work to power large-scale web search engines
Scalability Noise
![Page 4: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/4.jpg)
Scalability challenges
The core algorithm of an inverted index scheme is based on a list intersection
This is a computationally inefficient operation, which is impractical when it comes to the web and its billions of webpages.
Scalability
![Page 5: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/5.jpg)
Noise reduction challenges
Textual measures: Word intersections, word distances, word positions, td-idf, BM25 etc.
No matter which textually-based approach you take, it might work well on one query, but poorly on the other.
Noise
![Page 6: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/6.jpg)
The conclusion
There is no point trying to index the parts of a webpage that are unlikely
to appear in prospective users’ search queries
![Page 7: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/7.jpg)
Query log
Suppose that for each webpage you had a list of all the search queries that ended with a user clicking on this webpage (say from Google)
Any part of a webpage that does interact with any of the search queries leading to it is completely irrelevant for the retrieval process
Hence, you could focus on only indexing the search queries.
![Page 8: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/8.jpg)
Example – Cyberpunk 2077
![Page 9: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/9.jpg)
Smaller index size
Less noise with more accurate results
User queries
![Page 10: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/10.jpg)
Predict first, index later
Indexing only the prospective user queries
Indexing problem is turned into a prediction problem
![Page 11: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/11.jpg)
Data collection
Long tail queries No in-depth understanding
Not independent Expansive infrastructures
Without massive data collection are we doomed to fail?
![Page 12: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/12.jpg)
Data prediction
Use AI to predict the prospective user's query
![Page 13: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/13.jpg)
What about recall and precision?
This solves the indexing problem
The next challenge is to maximize the recall and precision
![Page 14: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/14.jpg)
It’s all about the context
What is the release date of Cyberpunk 2077 to PS4 and Xbox One?
![Page 15: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/15.jpg)
Not an NLP problem
What is the release date of Cyberpunk 2077 to PS4 and Xbox One?
release date Cyberpunk 2077 PS4 Xbox One
![Page 16: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/16.jpg)
What will you choose?
What is the release date of Cyberpunk 2077 to PS4 and Xbox One?
Cyberpunk 2077 PS4 console
![Page 17: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/17.jpg)
Contextual Graph
The power of context
![Page 18: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/18.jpg)
Taxonomy vs. Contextuality
Nike
Shoes Sporting Goods
Running Basketball
Air Max
Cortez AdidasLebron James
StaticDynamic
![Page 19: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/19.jpg)
Contextual graph
NikeAir
MaxSporting Goods
30%
90%
5%
0.01%≈
Contextual Taxonomical
![Page 20: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/20.jpg)
Contextual graph
Nike
Shoes
Running Basketball
Sporting Goods
Air Max
Cortez
Lebron James
Adidas
0%≈0%≈
0%≈
2% 10% 8%
30%
15%
10%
5%
10%
90%
35%
25% 20%
30%
90% 80%
![Page 21: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/21.jpg)
Graph statistics
Input: 5 billion webpages
10 billion entities
300 billion edges
10 million entities added daily
Cities Books Shoes
Restaurants Cars Songs
Quotes Company Sites
![Page 22: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/22.jpg)
Challenges
Data collection
What to extract
How to connect
Scalable architecture
![Page 23: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/23.jpg)
Solution
Neuroscience instead of NLP
Brain inspired auto associative networks and statistical pattern recognition
![Page 24: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/24.jpg)
Technology
Applications to Search
![Page 25: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/25.jpg)
Possible query forms
Greater Toronto Area
![Page 26: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/26.jpg)
Possible query forms
![Page 27: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/27.jpg)
Query augmentation
![Page 28: TECHNOLOGIES LARGE-SCALE SEARCH](https://reader034.vdocuments.mx/reader034/viewer/2022051323/627bed5c7b54ae7af432b539/html5/thumbnails/28.jpg)
Contextual GraphIt’s all about the context