semantic web & semantic web processes
Post on 11-Feb-2016
Embed Size (px)
DESCRIPTIONSemantic Web & Semantic Web Processes. A course at Universidade da Madeira, Funchal, Portugal June 16-18, 2005 Dr. Amit P. Sheth Professor, Computer Sc., Univ. of Georgia Director, LSDIS lab CTO/Co-founder, Semagix , Inc. Special Thanks: Cartic Ramakrishnan , Karthik Gomadam. - PowerPoint PPT Presentation
Semantic Web & Semantic Web ProcessesA course at Universidade da Madeira, Funchal, PortugalJune 16-18, 2005
Dr. Amit P. ShethProfessor, Computer Sc., Univ. of GeorgiaDirector, LSDIS labCTO/Co-founder, Semagix, IncSpecial Thanks: Cartic Ramakrishnan, Karthik Gomadam
Semantic (Web) Technology Applications
Part III (b) Representative Enterprise Applications
Visualizer Content: BSBQ Application
-- Breaking News --Gore Demands That Recount RestartGore Says Fla. Can't Name Electors Bush Meets Colin Powell at RanchMarket Tumbles on Earnings WarningBarak Outlines His Peace Plan TALLAHASSEE, Florida (CNN) Though the two presidential candidates have until noon Wednesday to file briefs in Al Gore's appeal to the Florida Supreme Court, the outcome of two trials set on the same day in Leon County, Florida, may offer Gore his best hope for the presidency. Democrats in Seminole County are seeking to have 15,000 absentee ballots thrown out in that heavily Republican jurisdiction -- a move that would give Gore a lead of up to 5,000 votes statewide. Lawyers for the plaintiff, Harry Jacobs, claim the ballots should be rejected because they say County Elections Supervisor Sandra Goard allowed Republican workers to fill out voter identification numbers on 2,126 incomplete absentee ballot applications sent in by GOP voters, while refusing to allow Democratic workers to do the same thing for Democratic voters.
The GOP says that suit, and one similar to it from Martin County, demonstrates Democratic Party politics at its most desperate. Gore is not a party to either of those lawsuits. On Tuesday, the judge in the(1.33) - 12/06/00 - ABC(2.33) - 12/06/00 - CBS(3.12) - 12/06/00 - NNS(0.32) - 12/06/00 - CBS(1.33) - 12/06/00 - CBSDescriptionProduced by : CNN Posted Date : 12/07/2000 Reporter : David Lewis Event : Election 2000 Location : Tallahassee, Florida, USAPeople : Al Gore
Equity Research Dashboard with Blended Semantic Querying and BrowsingSheth et al, 2002 Managing Semantic Content for the Web
Global BankAim Legislation (PATRIOT ACT) requires banks to identify who they are doing business with
Problem Volume of internal and external data needed to be accessed Complex name matching and disambiguation criteria Requirement to risk score certain attributes of this data
Approach Creation of a risk ontology populated from trusted sources (OFAC etc); Sophisticated entity disambiguation Semantic querying, Rules specification & processing
Solution Rapid and accurate KYC checks Risk scoring of relationships allowing for prioritisation of results Full visibility of sources and trustworthinessAmit Sheth, From Semantic Search & Integration to Analytics, Proceedings of the KMWorld, October 26, 2004, Santa Clara, CA.
Ahmed Yaseer appears on Watchlistmember of organizationworks for CompanyAhmed Yaseer: Appears on Watchlist FBI Works for Company WorldCom Member of organization HamasThe Process
Global Investment BankExample of Fraud Prevention application used in financial servicesWorld Wide Web contentPublic RecordsBLOGS, RSSUn-structure text, Semi-structured Data Watch ListsLaw EnforcementRegulatorsSemi-structured Government Data
Law Enforcement AgencyAim Provision of an overarching intelligence system that provides a unified view of people and related information
Problem Need to create unique entities from across multiple disparate, non-standardised databases; Requirement to disambiguate dirty data Need to extract insight from unstructured text
Approach Multiple database extractors to disambiguate data and form relevant relationships Modelling of behaviours/patterns within very large ontology (6Mn+ entities)
Solution Merged and linked case data from multiple sources using effective identification, disambiguation, and link analysis Dynamic annotation of documents Single query across multiple datasets 360 view of an individual and relevant associations
Application of bespoke and pre-configured profiles for detailed investigationProfile CreationComplex QueryingSummary of ResultsInvestigation
Profile CreationComplex QueryingSummary of ResultsInvestigation
User configurable scoring profiles Profile based on direct matching with case characteristics Profiling based on link analysis through indirect relationships with other cases and informationProfile CreationComplex QueryingSummary of ResultsInvestigation
Free text searching across aggregated information sourcesGisondi, white ford expedition, main street, assault, traffic offencesProfile CreationComplex QueryingSummary of ResultsInvestigation
Unified view of direct and indirect results that best match the complex query and the profileProfile CreationComplex QueryingSummary of ResultsInvestigation
Knowledge Annotation of known entities from within free textDirect and indirect relationship scoring driven by risk weightingsAggregated knowledge from disparate sourcesProfile CreationComplex QueryingSummary of ResultsInvestigation
Scoring of key characteristics to drive relevance to original profile and query Identification of investigation pathVisualisation of resultsProfile CreationComplex QueryingSummary of ResultsInvestigation
Schematic of Ontological Approach to the Legitimate Access Problem
Legitimate Document Access Control Applicationdemo
Ontology QualityMany real-world ontologies may be described as semi-formal ontologies populated with partial or incomplete knowledgemay contain occasional inconsistencies, or occasionally violate constraints (e.g. all schema level constraints may not be observed in the knowledgebase that instantiates the ontology schema) often ontology is populated by many persons or by extracting and integrating knowledge from multiple sources analogy is dirty data which is usually a fact of life in most enterprise databases.From Semantic Search & Integration to Analytics
Ontology Representation ExpressivenessApplications vary in terms of expressiveness of representation needed.Trade-off between expressive power and computational complexity applies both to knowledge creation/maintenance and to inference mechanisms for such languages. It is often very difficult to capture the knowledge that instantiates the more expressive constructs/constraints. Many business applications end up using models/languages that lie closer to less expressive languages. On the other hand, we have seen a few applications, especially in scientific domains such as biology, where more expressive languages are needed.
Ontology Size / Population / FreshnessOntology population is critical. Among the ontologies developed by Semagix or using its technology, a median size of ontology is over 1 million instances/facts and relationship instances each (at least two have exceeded 10 million instances). This level of knowledge makes the system very powerful (as it is applied . Furthermore, in many cases, it is necessary to keep these ontologies current or updated with facts and knowledge on a daily or more frequent basis. Both the scale and freshness requirements dictate that populating ontologies with instance data needs to be automated.
Metadata Extraction Large scale metadata extraction and semantic annotation is possible. IBM WebFountain [Dill et al 2003] demonstrates the ability to annotate on a Web scale (i.e., over 4 billion pages), while Semagix Freedom related technology [Hammond et al 2002] demonstrates capabilities that work for a few million documents per day per server. However, the general trade-off of depth versus scale applies. Storage and manipulation of metadata for millions to hundreds of millions of content items requires database techniques with the challenge of improving performance and scale in presence of more complex structures
Semantic Technology Building BlocksA vast majority of the Semantic (Web) Technology Applications that have been developed or envisioned rely on three crucial capabilities: ontology creation, semantic annotation (metadata extraction) and querying/inferencing. Enterprise-scale applications share many requirements in these three respects with pan Web applications. All these capabilities must scale to many millions of documents and concepts (rather than hundreds to thousands) for current applications, and applications requiring billions of documents and concepts have also been discussed (esp. in intelligence and government space) but not yet deployed.
Primary Technical Capabilities/Key Research ChallengesTwo of the most basic semantic techniques are named entity identification, and semantic ambiguity resolution. [It would be nice to have relationship extraction too.] A tool for annotation is of little value if it does not support ambiguity resolution. Both require highly multidisciplinary approaches, borrowing for NLP/lexical analysis, statistical and IR techniques and possibly machine learning techniques. A high degree of automation is possible in meeting many real-world semantic disambiguation requirements, although pathological cases will always exist and complete automation is unlikely.
Content HeterogeneitySupport for heterogeneous content is key it is too hard to deploy separate products within a single enterprise to deal with structured, semi-structured and unstructured data/content management. New applications involve extensive types of heterogeneity in format, media and access/delivery mechanisms (e.g., news feed in RSS, NewsML news, Web posted article in HTML or served up dynamically through database query and XSLT transformation, analyst report in PDF or WORD, subscription se