GIR-WG @ OGF19

Grid Information Retrieval Working Group, January 30, 2007, Chapel Hill, NC

Page 1: GIR-WG  @ OGF19

GIR-WG @ OGF19

Grid Information Retrieval Working Group

January 30, 2007

Chapel Hill, NC

Page 2: GIR-WG  @ OGF19

Agenda

• IP Policy reminder
• Introduce participants
• GIR-WG charter & overview
• GIR document status review
• Reference implementations
• Mention of related work elsewhere
• Paul Kim presentation
• Chris Fallen presentation
• Discussion

Page 3: GIR-WG  @ OGF19

Session Particulars

OGF IP policies apply

GIR-WG chairs:

Dr. Greg Newby, Arctic Region Supercomputing Center

Dr. Paul Yangwoo Kim, Dongguk U.

Nassib Nassar, RENCI

Page 4: GIR-WG  @ OGF19

What is GIR-WG?

GIR-WG was chartered by OGF to develop standards and reference implementations for information retrieval (IR) on computational grids.

GIR-WG has published a Requirements document under GGF (GFD-I.027)

Our first Experimental document was published recently (GFD-E.082)

Progress on the Architecture document is dormant, awaiting practical experience

Practical experience is being gained and will result, at a minimum, in further Experimental documents.

Page 5: GIR-WG  @ OGF19

What is Information Retrieval?

IR is the science and method of delivering documents that are relevant to human information needs.

Rather than delivering sets of matching documents (as DBMS do), IR systems rank matching documents.

IR systems usually focus on textual input data (i.e., natural language), either unformatted or formatted (plain text, HTML, XML, etc.).
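
The contrast between set matching and ranking can be illustrated with a minimal sketch. The tiny corpus, the `boolean_match` and `ranked_match` helpers, and the raw term-count score are all assumptions made for illustration; real IR systems use more refined weighting.

```python
from collections import Counter

# Hypothetical three-document collection, for illustration only.
DOCS = {
    "d1": "grid computing for information retrieval",
    "d2": "retrieval of grid documents on a grid",
    "d3": "weather report for chapel hill",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def boolean_match(query, docs):
    """DBMS-style: the unordered set of documents containing every query term."""
    terms = set(tokenize(query))
    return {doc_id for doc_id, text in docs.items() if terms <= set(tokenize(text))}


def ranked_match(query, docs):
    """IR-style: documents matching any query term, ordered by a simple score."""
    terms = tokenize(query)
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in terms)  # raw term count (an assumption)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

For the query "grid retrieval", `boolean_match` returns the unordered set of d1 and d2, while `ranked_match` places d2 ahead of d1 because "grid" occurs twice in d2.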

Page 6: GIR-WG  @ OGF19

GIR-WG Charter

• The GIR WG will establish a specific set of requirements, an architecture, and detailed specifications for Information Retrieval (IR) on computational grids. GIR will provide document collection management, indexing/searching, and query processing services to grid users and applications.

• GIR Milestones:

• GIR Requirements Document - Stakeholder-driven list of service-level requirements for building a grid-based IR system. Published in 2005 as GFD-I.027.

• GIR Architecture Document - Describes the overall system, composed of integrated grid services, scenarios, etc. Draft under consideration since 2004; based on Experimental document outcomes, final version is expected in 2007.

• Experimental Documents - Experiences with GIR implementations or partial implementations (query processors, indexers, collection managers...). GFD-E.082 in 2006; others under consideration.

• GIR Recommendation Draft Document - Describes each service in detail, with sections for different implementation platforms (such as Web Services, Grid Services, standalone...). Draft is expected after Architecture document, in 2008.

• GIR Recommendation Final Document - After the Draft Recommendation, based on independent interoperable implementations and further practical experiences. Within 2 years of the Draft Recommendation.

Page 7: GIR-WG  @ OGF19

Why IR is a good candidate for Grid computing

• Excellent for “divide and conquer” coarse-grained parallelism
• Input items are discrete
• Coordination across subsets of a document collection can be minimal
• Results from multiple sources can be coordinated and relevance ranked together
• Queries may be handled independently
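
The "divide and conquer" pattern above can be sketched as follows. This is only an illustration: local threads stand in for grid nodes, and the `search_shard` and `grid_search` helpers, the sample shards, and the term-count score are assumptions rather than anything specified by GIR-WG.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def search_shard(shard, query_terms):
    """Score one subset of the collection; needs no knowledge of other shards."""
    hits = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in query_terms)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def grid_search(shards, query, top_k=3):
    """Fan the query out to every shard, then rank all hits together."""
    terms = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: search_shard(s, terms), shards))
    all_hits = [hit for hits in per_shard for hit in hits]
    # nlargest compares (score, doc_id) tuples and returns the highest first.
    return heapq.nlargest(top_k, all_hits)


# Hypothetical collection split into two independent subsets.
SHARDS = [
    {"a1": "grid grid retrieval", "a2": "unrelated text"},
    {"b1": "information retrieval on the grid"},
]
```

Because every shard here uses the same scoring function, the raw scores are directly comparable when merged. That no longer holds when the sources run different IR engines.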

Page 8: GIR-WG  @ OGF19

Significant Progress

o Documents:
  o “GIR Requirements” published
  o “GIR Architecture” in mid-draft (dormant)
  o Experimental document: published

o Implementation:
  o MCNC released a technology preview
  o Kim’s work: an experimental document
  o Newby’s work: heading to an experimental document
  o Nassar’s work: Sarcomere & Amberfish, open source toolkit based on GT4
  o Fallen & Newby distributed IR research

Page 9: GIR-WG  @ OGF19

Requirements overview (per GFD-I.027)

Desirability of Grid infrastructure for IR, notably enterprise IR:
• VO (for security, segmentation)
• Conceptual separation of functions (for indexing, collection management & query processing)
• Flexible but coarse-grained flow of control among elements
• Persistence of queries, collections and indexes

Three primary components:
• Collection manager: handles input gathering, transformation, transport, staging and delivery
• Indexer: core information retrieval collection representation
• Query processor: responds to user needs, including standing information needs (i.e., information filtering)
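
The separation into three components can be sketched as three small classes with a coarse-grained hand-off between them. This is a minimal in-process illustration, not the architecture defined in GFD-I.027: every class name, method name, and data structure below is an assumption, and persistence and security are left out.

```python
from collections import defaultdict


class CollectionManager:
    """Gathers and transforms input documents, then hands them on."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, raw_text):
        # "Transformation" here is just whitespace and case normalization.
        self._docs[doc_id] = " ".join(raw_text.lower().split())

    def deliver(self):
        return dict(self._docs)


class Indexer:
    """Builds the collection representation: a term -> {doc_id: count} map."""

    def build(self, docs):
        index = defaultdict(dict)
        for doc_id, text in docs.items():
            for term in text.split():
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
        return dict(index)


class QueryProcessor:
    """Answers ad hoc queries and keeps standing queries for filtering."""

    def __init__(self, index):
        self._index = index
        self._standing = []

    def search(self, query):
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id, count in self._index.get(term, {}).items():
                scores[doc_id] += count
        return sorted(scores, key=lambda d: (-scores[d], d))

    def add_standing_query(self, query):
        self._standing.append(query)

    def filter_new_document(self, text):
        """Return the standing queries that a newly arrived document matches."""
        words = set(text.lower().split())
        return [q for q in self._standing if set(q.lower().split()) <= words]


# Coarse-grained flow: collection manager -> indexer -> query processor.
manager = CollectionManager()
manager.add("d1", "Grid  Information Retrieval")
manager.add("d2", "Grid computing on the grid")
processor = QueryProcessor(Indexer().build(manager.deliver()))
processor.add_standing_query("grid retrieval")
```

Each class exchanges only plain dictionaries with the next, so in principle each could run as a separate service.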

Page 10: GIR-WG  @ OGF19

Implementation Approaches

• Do not rely on particular implementations or middleware (e.g., Globus)

• Pursue different types of Grid implementations:
  • Minimalist, home grown
  • Globus-based
  • Pure Web services

• These approaches can each be separate Experimental docs; they will be appendices in the Architecture doc

Page 11: GIR-WG  @ OGF19

GFD-E.082

• Kim: Grid Information Retrieval System for Dynamically Reconfigurable Virtual Organization

• Practical experience on re-allocation of GIR nodes based on system load
  • Indexer, collection manager or query processor, based on system load
  • Dynamic reallocation of nodes within a computational grid
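
Load-based reallocation of node roles might look roughly like the sketch below. The slide does not describe the policy used in GFD-E.082, so the rule here (move one node from the least-loaded role to the most-loaded role when the imbalance exceeds a threshold) is purely hypothetical, as are the `rebalance` function, the role names, and the load metric.

```python
ROLES = ("collection_manager", "indexer", "query_processor")


def rebalance(assignments, load, threshold=2.0):
    """Move one node from the least-loaded role to the most-loaded role.

    assignments: dict mapping node name -> role
    load: dict mapping role -> outstanding work units (hypothetical metric)
    Returns a new assignments dict; the input is left unchanged.
    """
    def per_node(role):
        nodes = [n for n, r in assignments.items() if r == role]
        # A role with work but no nodes is treated as infinitely loaded.
        if not nodes:
            return float("inf") if load.get(role, 0) > 0 else 0.0
        return load.get(role, 0) / len(nodes)

    busiest = max(ROLES, key=per_node)
    idlest = min(ROLES, key=per_node)
    donors = sorted(n for n, r in assignments.items() if r == idlest)
    # Keep at least one node per donor role, and act only on a clear imbalance.
    if len(donors) < 2 or per_node(busiest) < threshold * max(per_node(idlest), 1e-9):
        return dict(assignments)
    updated = dict(assignments)
    updated[donors[-1]] = busiest
    return updated


# Hypothetical four-node grid with a heavily loaded query processor.
NODES = {
    "n1": "indexer",
    "n2": "indexer",
    "n3": "query_processor",
    "n4": "collection_manager",
}
LOAD = {"indexer": 2, "query_processor": 40, "collection_manager": 5}
```

With the sample data, `rebalance(NODES, LOAD)` reassigns n2 from indexer to query processor and leaves the other nodes alone.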

Page 12: GIR-WG  @ OGF19

Nassar: Sarcomere

See http://sourceforge.net/projects/sarcomere/

• Sarcomere calls a collection of documents a "database". One or more "indexes" can be created per database. Each index represents an access point for searching the document collection. In theory, indexes can differ in how they constrain the queries (e.g. by fields), what kind of data structures are used, etc. At the moment only Amberfish full text indexes are supported (index type = "Amberfish").

• Current port types (very rudimentary and highly subject to change):
  • createDatabase
  • deleteDatabase
  • createIndex
  • deleteIndex
  • addDocument
  • Search

• Stay tuned for more developments!
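
The database and index model described above can be illustrated with a toy in-memory stand-in. This is not the Sarcomere API: only the operation names are borrowed from the port types on the slide (with `search` written in lower case), while the `ToyStore` class, every parameter, and the word-set "index" are assumptions. The index type is recorded but has no effect.

```python
class ToyStore:
    """In-memory stand-in; NOT the Sarcomere API, only similar operation names."""

    def __init__(self):
        # database name -> {"docs": {...}, "indexes": {...}}
        self._databases = {}

    def createDatabase(self, name):
        self._databases[name] = {"docs": {}, "indexes": {}}

    def deleteDatabase(self, name):
        del self._databases[name]

    def createIndex(self, database, index, index_type="Amberfish"):
        # Each toy index maps a document id to the set of words it contains.
        db = self._databases[database]
        postings = {d: set(t.lower().split()) for d, t in db["docs"].items()}
        db["indexes"][index] = {"type": index_type, "postings": postings}

    def deleteIndex(self, database, index):
        del self._databases[database]["indexes"][index]

    def addDocument(self, database, doc_id, text):
        db = self._databases[database]
        db["docs"][doc_id] = text
        # Keep every existing index of this database up to date.
        for idx in db["indexes"].values():
            idx["postings"][doc_id] = set(text.lower().split())

    def search(self, database, index, query):
        postings = self._databases[database]["indexes"][index]["postings"]
        terms = set(query.lower().split())
        return sorted(d for d, words in postings.items() if terms <= words)


store = ToyStore()
store.createDatabase("papers")
store.createIndex("papers", "fulltext")
store.addDocument("papers", "p1", "Grid information retrieval")
store.addDocument("papers", "p2", "Amberfish full text indexes")
```

A search always names both the database and one of its indexes, reflecting the idea that each index is a separate access point to the same document collection.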

Page 13: GIR-WG  @ OGF19

Newby: Multisearch

• How can we merge result sets from different IR engines?
  • Desire to merge based on global relevance
  • Challenging because different IR engines have different scoring/ranking algorithms
  • Challenging because different collections have different characteristics, influencing ranking

• Used for TREC by Fallen & Newby 2005, 2006

Page 14: GIR-WG  @ OGF19

• Results are merged based on statistical normalization

• No accounting for different IR engines or different collections
  • Simplifying assumptions that all IR rankings come from the same basic distribution
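
One way to merge by statistical normalization is to convert each engine's scores to z-scores before pooling them. The slides do not say which normalization Multisearch uses, so the z-score choice, the `normalize` and `merge` helpers, and the sample scores below are assumptions for illustration.

```python
from statistics import mean, pstdev


def normalize(results):
    """Convert one engine's raw scores to z-scores (zero mean, unit deviation)."""
    if not results:
        return []
    scores = [score for _, score in results]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores are equal, so nothing distinguishes the documents.
        return [(doc_id, 0.0) for doc_id, _ in results]
    return [(doc_id, (score - mu) / sigma) for doc_id, score in results]


def merge(result_sets):
    """Pool normalized results from every engine and rank them together."""
    pooled = [pair for results in result_sets for pair in normalize(results)]
    # Highest normalized score first; ties broken by document id.
    return sorted(pooled, key=lambda pair: (-pair[1], pair[0]))


# Two hypothetical engines that score on very different scales.
engine_a = [("a1", 0.75), ("a2", 0.5), ("a3", 0.25)]
engine_b = [("b1", 300.0), ("b2", 120.0), ("b3", 90.0)]
merged_ids = [doc_id for doc_id, _ in merge([engine_a, engine_b])]
```

With the sample scores, the merged order is b1, a1, a2, b2, b3, a3. Treating every engine's scores as draws from the same basic distribution is exactly the simplifying assumption noted above; it ignores real differences between engines and collections.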

Simple interface to an Axis/Tomcat backend

Page 15: GIR-WG  @ OGF19

Opportunities for Interaction

• OGSA-DAI has middleware that provides basic query and result set transport

• Search from multiple databases; add a higher-level merger

• Seems promising for GIR!
• http://www.ogsadai.org.uk

Page 16: GIR-WG  @ OGF19

Discussion of GIR-WG

• Your questions, thoughts and suggestions

Page 17: GIR-WG  @ OGF19

Get Involved!

• Visit http://www.gir-wg.org

• Subscribe to [email protected]

• Talk with chairs about data and reference implementations and documents