EGEE-II INFSO-RI-031688
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks
Information Dump
White Areas Lecture Laurence Field
30th January 2009
Overview
• What is a Grid?
• Information Models
• Glue 2.0
• The Information System
• The New BDII
• GStat 2.0
What is a Grid?
[Figure: the grid landscape, with labels: cross-organizational grids, intra-organizational grids, data centers, virtualization, volunteer computing, campus grids, clusters, cloud computing]
What is the problem?
• Organizations A and B are separate administrative domains
  – Independent policies, systems and authentication mechanisms
• Users have local access to their local system using local methods
• Users from A wish to collaborate with users from B
  – Pool the resources
  – Split tasks by specialty
  – Share common frameworks
[Figure: Organization A and Organization B]
The Solution
• The users from A and B create a Virtual Organization (VO)
  – Users have a unique identity but also the identity of the VO
• Organizations A and B support the Virtual Organization
  – Place “grid” interfaces at the organizational boundary
  – These map the generic “grid” functions/information/credentials to the local security functions/information/credentials
• Multi-institutional e-Science infrastructures
[Figure: Organization A and Organization B, joined by a Virtual Organization]
The Information System
[Figure: the Information System spanning Organization A and Organization B; labels: Users, Operations, Service]
Information Models
Information Model
• Abstract description of data
  – Description of values which are identified by attributes
  – Description of attribute groupings
  – Description of relationships between groupings
• Data → Information → Knowledge
  – Information model turns data into information
    Existence, Description, State
• Describes the components in a grid infrastructure
  – and hence the grid itself
• The Data Model is the implementation
  – LDAP, XML, Relational etc.
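The split between the abstract information model and its concrete data models can be illustrated with a minimal Python sketch. The `Entity` class, the attribute names and the `o=example` base are hypothetical and are not taken from any Glue rendering. The point is only that one abstract attribute grouping can be rendered in more than one data model.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Entity:
    """One attribute grouping from the abstract information model."""
    kind: str                      # name of the grouping, e.g. "Service"
    ident: str                     # unique identifier of this instance
    attributes: dict = field(default_factory=dict)


def to_ldif(entity: Entity, base: str = "o=example") -> str:
    """Render the entity as an LDIF-style text record (one data model)."""
    lines = [f"dn: {entity.kind}ID={entity.ident},{base}",
             f"objectClass: {entity.kind}"]
    lines += [f"{name}: {value}" for name, value in entity.attributes.items()]
    return "\n".join(lines) + "\n"


def to_xml(entity: Entity) -> str:
    """Render the same entity as XML (another data model)."""
    root = ET.Element(entity.kind, {"ID": entity.ident})
    for name, value in entity.attributes.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical example instance: existence, description and state.
service = Entity("Service", "svc-1", {"Type": "computing", "State": "ok"})
print(to_ldif(service))
print(to_xml(service))
```

Both functions read the same `Entity`, so a change to the abstract model shows up in every rendering.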
The Original MDS 2.x Schema
http://www.globus.org/toolkit/docs/2.4/mds/Schema.html
European DataGrid Project
• Found that the MDS schema was not sufficient for its needs
• Each functional area defined its own sub-schema
  – Workload management, data management, fabric management
  – Data storage and network monitoring
• Introduced the Computing Element (CE) entity, which described
  – the GRAM endpoint
  – the batch system
  – the state behind the endpoint
  – a simple description of the resource (homogeneous cluster)
• Introduced the Storage Element (SE) entity, which described
  – the storage endpoint
  – the Storage Element Protocol entity
Nordugrid
• The Nordugrid project started in May 2001
• Aimed to build a Nordic testbed
  – for wide-area computing and data handling
http://www.nordugrid.org/documents/arc_infosys.pdf
The World Wide Testbed
• A 2002 DataTAG initiative to create a worldwide Grid testbed
• Comprised
  – 8 European sites using the EDG 1.2 release
  – 9 U.S. sites using the VDT 1.1.3 release
• The EDG release contained additional information providers
  – which were not available in the VDT release
• The information was essential for the Resource Broker to function
• The information providers were installed on all the U.S. sites
  – An example of interoperability using the parallel deployment model
Origins and Aims
• GLUE: Grid Laboratory Uniform Environment
  – Started in April 2002
  – Joint activity between EU-DataTAG, US-iVDGL and EDG
    Focused on interoperability between US and EU HEP projects
  – Aimed to provide a common schema to facilitate interoperation
• Initial versions
  – v1.0 (released Nov 2002)
  – v1.1 (released April 2003)
• HEP-driven revisions
  – v1.2 (released Dec 2005)
  – v1.3 (released Oct 2006)
OSG and GLUE v1.2
• Both EGEE and OSG used GLUE v1.1
  – OSG (MDS + GLUE + their own Grid3 schema)
  – EGEE (GLUE + their own extensions)
• Relying on custom extensions breaks interoperability
  – Additional use cases need to be added to GLUE
• A proposal for GLUE v1.2 was discussed
  – An incremental approach was taken
  – Only make the minimal changes
  – Only solve problems found in deployment
  – Ensure backwards compatibility
• For non-backwards-compatible changes
  – Introduced the idea of defining Glue 2.0 at a future date
GLUE v1.3
• Last-minute changes for LHC start-up
  – Could not wait for Glue 2.0
  – Main focus was SRM 2.x
• Meeting in October 2006 to discuss proposed changes
  – 44 suggested changes, 30 accepted, 8 rejected and 5 duplicates
• Version 1.3 deployed at the beginning of 2007
  – Ongoing migration with respect to usage
• No requirement for v1.4
  – Suggests that things are not too bad
    No blocking issues that urgently require a schema change
• Proved useful in interoperation activities
  – OSG, NDGF, gin-info, Unicore, NAREGI etc.
• Interpretation of the schema has been tightened
  – The understanding of the schema has improved
  – Many additional documents describe usage
GLUE 2.0
Moving into the Open Grid Forum
• Conceptual and structural changes left for Glue 2.0
  – Discussion on GLUE 2.0 at the Oct 2006 meeting in London
• Decision made to define Glue 2.0 within OGF
  – Improve the acceptance of GLUE by other communities
    The OGF process should not create too much overhead
• GLUE-WG started in Jan 2007 at OGF19
  – Building on the 4 years of existing work
• Positive outcomes
  – GLUE widely accepted within OGF
    Seen as an important contribution
  – Grid Forge helped the activity coordination
  – Broad viewpoints limited assumptions
  – Increased participation from other projects
    And hence acceptance by those projects
Glue 2.0 Introduction
• Glue Schema Working Group created in the Open Grid Forum
  – Need demonstrated through the GIN activities
• Build upon existing experience
  – Consolidate over 4 years of production feedback
• Focus on use cases seen, not envisaged
  – Cross-Grid use cases
• Define an abstract Information Model
  – And a number of renderings: LDAP, XML, Relational, CIM etc.
• Start with abstract core concepts
  – Evolve into specific service types
• Ensure participation from existing production infrastructures
Glue 2.0 Key Concepts
[Figure: entities User Domain, Admin Domain, Resource; relationships Provides, Utilizes]
Glue 2.0 Key Concepts
[Figure: entities User Domain, Admin Domain, Resource, Share, Manager; relationships Negotiates Share with, Defined on, Utilizes, Manages, Provides]
Glue 2.0 Key Concepts
[Figure: entities User Domain, Admin Domain, Resource, Share, End Point, Access Policy, Mapping Policy, Manager; relationships Negotiates Share with, Defined on, Contacts, Maps User to, Has, Manages, Provides]
Glue 2.0 Key Concepts
[Figure: entities User Domain, Admin Domain, Service, Resource, Manager, Share, End Point, Activity, Access Policy, Mapping Policy; relationships Negotiates Share with, Provides, Manages, Runs, Defined on, Contacts, Maps User to, Has]
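As a rough illustration of how these key concepts fit together, the entities can be sketched as Python data classes. The field names, the direction and cardinality of each relationship, and the placement of the mapping policy on the endpoint are one reading of the diagram labels, not the Glue 2.0 specification. The example host name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Resource:
    name: str


@dataclass
class Manager:
    name: str
    manages: List[Resource] = field(default_factory=list)     # "Manages"


@dataclass
class Share:
    name: str
    user_domain: str                                          # "Negotiates Share with"
    defined_on: List[Resource] = field(default_factory=list)  # "Defined on"


@dataclass
class Endpoint:
    url: str
    access_policy: List[str] = field(default_factory=list)    # user domains that may contact it
    mapping_policy: Dict[str, Share] = field(default_factory=dict)  # "Maps User to"

    def share_for(self, user_domain: str) -> Optional[Share]:
        """Return the share a user domain is mapped to, if access is allowed."""
        if user_domain not in self.access_policy:
            return None
        return self.mapping_policy.get(user_domain)


@dataclass
class Service:
    name: str
    admin_domain: str                                         # "Provides"
    endpoints: List[Endpoint] = field(default_factory=list)
    shares: List[Share] = field(default_factory=list)
    managers: List[Manager] = field(default_factory=list)


cluster = Resource("cluster-1")
share = Share("vo-a-share", "vo-a", [cluster])
endpoint = Endpoint("https://ce.example.org", ["vo-a"], {"vo-a": share})
service = Service("compute", "org-a", [endpoint], [share],
                  [Manager("batch-system", [cluster])])
```

A user from `vo-a` contacting the endpoint is mapped to `vo-a-share`. A user from any other domain gets `None`, because the access policy does not list that domain.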
Glue 2.0 Computing Schema
[Figure: entities Computing Service, Computing End Point, Computing Share, Computing Manager, Execution Environment, Application Environment, Computing Activity; relationships Manages, Runs, Defined on, Maps User to, Can use, Mapped to]
Glue 2.0 Storage Schema
[Figure: entities Storage Service, Storage End Point, Storage Share, Storage Manager, Storage Resource, Storage Capacity, Share Capacity, Storage Access Protocol; relationships Defined on, Maps User to, Has, Offers, Manages]
Glue 2.0 Timeline
• Oct 2006, Decision taken to move into OGF
• Jan 2007 (OGF 19), First working group meeting
• June 2008 (OGF 23), Spec. entered public comment
• Aug 2008, Public comment period ended
• Nov 2008, Started addressing comments
• Jan 2009, Final spec. ready?
• Mar 2009, Glue 2.0 official OGF specification?
• 1st April 2009, Start work on Glue 2.1
Proposed Roll Out Plan
1. Create a hybrid schema file with both v1.3 and v2.0
  – Deploy across the infrastructure
    Should have negligible side effects
  – Est. 3 - 6 months after specification fixed
2. Update information providers
  – Publish Glue 2.0 information in addition to Glue 1.3
  – Deploy across the infrastructure
  – Est. 4 - 12 months after specification fixed
3. Update software and tooling as necessary
  – Est. 6 - 36 months after specification fixed
4. Remove Glue 1.3 providers when no longer required
  – Est. 36 - ?? months after specification fixed
Some Statistics
• 45 phone conferences
  – 1.5 hours each ~ 3 days talking
  – 5 people participating ~ 2 months FTE invested in total
    Split between projects (EGEE, WLCG, Teragrid, Nordugrid, DEISA)
    This does not include the time invested by the editor (OMII-Europe)
• 40 versions of the document
  – 347 days between first conference and initial specification
  – 46 pages, 12787 words
  – Document updated nearly every week
• 254 attributes
  – 28 objects
• Four different renderings
  – LDIF, XML, Relational and CIM
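The rounded figures above can be checked with a few lines of arithmetic. The conversion factors are assumptions, not values from the slides: "days talking" is read as 24-hour days, and an FTE month as 21 working days of 8 hours.

```python
CONFERENCES = 45
HOURS_EACH = 1.5
PARTICIPANTS = 5
HOURS_PER_WORK_DAY = 8      # assumed
WORK_DAYS_PER_MONTH = 21    # assumed

talk_hours = CONFERENCES * HOURS_EACH                 # total conference time
talk_days = talk_hours / 24                           # as round-the-clock days
person_hours = talk_hours * PARTICIPANTS              # effort across participants
fte_months = person_hours / HOURS_PER_WORK_DAY / WORK_DAYS_PER_MONTH
days_per_version = 347 / 40                           # document update interval

print(f"{talk_hours} hours of calls, about {talk_days:.1f} days")
print(f"{person_hours} person-hours, about {fte_months:.1f} FTE months")
print(f"one document version every {days_per_version:.1f} days on average")
```

Under these assumptions the calls add up to 67.5 hours (just under 3 days) and about 2 FTE months. A new document version roughly every 8.7 days is consistent with "updated nearly every week".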
The Information System
Globus MDS v2
• Monitoring and Discovery Service (MDS)
  – http://www.globus.org/toolkit/docs/2.4/mds/
• Information Providers (IP)
  – Scripts that get the information and return LDIF
• Grid Resource Information Service (GRIS)
  – Daemon that runs the IPs and answers LDAP queries
  – Registers to a GIIS
• Grid Information Index Service (GIIS)
  – Answers LDAP queries by querying registered GRISes or GIISes
• Both the GRIS and GIIS have a 30s cache
  – To reduce load and improve performance
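The effect of the 30-second cache can be sketched as a small time-to-live wrapper around an information provider. This is an illustration of the idea only, not MDS code. `CachedProvider` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Optional


class CachedProvider:
    """Re-run an information provider only when its cached output has expired."""

    def __init__(self, provider: Callable[[], str], ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._provider = provider
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[str] = None
        self._stamp = 0.0

    def query(self) -> str:
        now = self._clock()
        # Refresh on first use, or once the cached value is ttl seconds old.
        if self._value is None or now - self._stamp >= self._ttl:
            self._value = self._provider()
            self._stamp = now
        return self._value


def example_provider() -> str:
    # Stand-in for a script that returns LDIF text.
    return "dn: o=example\nobjectClass: organization\n"


gris = CachedProvider(example_provider, ttl=30.0)
print(gris.query())   # runs the provider
print(gris.query())   # served from the cache
```

Queries that arrive within the time-to-live never reach the provider. That reduces load, at the cost of answers that can be up to 30 seconds old.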
Original MDS Deployment
[Figure: hierarchy from top to bottom: Top GIIS, two Site GIISes, two GRISes per site, and their Providers; queries enter at the Top GIIS]
The BDII
• Berkeley Database Information Index
  – Standard OpenLDAP server
  – Updated by a Perl process
    Using LDAP URLs (ldapsearch) (GIIS mode)
    From a script (Information Provider) (GRIS mode)
• Why?
  – Because MDS didn’t work in a distributed environment
    Originally did not scale past 4 sites
    • 1 broken worker node could bring down the whole system!
    MDS was the problem, not LDAP
• BDII first used as top-level GIIS
  – Now used at the site and resource level
Information System Architecture
[Figure: hierarchy from top to bottom: Top BDII, two Site BDIIs, a GRIS and a Resource BDII per site, and their Providers; queries enter at the Top BDII]
BDII
• Multiple DB instances used to increase performance
  – Read only, write only and one spare for queries to finish
  – This functionality is enabled by the port forwarder
• List of sources to query from local file
  – Can be updated from a web page
  – More than one DB is used, with separate read and write
• Can also use a local LDIF file to modify DB after population
  – Can be updated from a web page
[Figure: port forwarder on port 2170 in front of three LDAP databases on ports 2171, 2172 and 2173; update cycle labels: ldapsearch, Write to cache, Update DB & Modify DB (FCR), Swap DBs]
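The read, write and spare roles behind the port forwarder can be sketched as a simple rotation. The rotation order is an assumption made for illustration, and `PortForwarder` is a hypothetical name. Only the port numbers come from the slide.

```python
class PortForwarder:
    """Track which backend database serves reads, takes writes, or drains."""

    PUBLIC_PORT = 2170

    def __init__(self, ports=(2171, 2172, 2173)):
        self.read_port, self.write_port, self.spare_port = ports

    def forward(self) -> int:
        """Backend port that queries arriving on the public port are sent to."""
        return self.read_port

    def swap(self) -> None:
        """Rotate roles after an update has filled the write database.

        The freshly written database starts serving reads, the old read
        database becomes the spare so running queries can finish, and the
        old spare becomes the next write target.
        """
        self.read_port, self.write_port, self.spare_port = (
            self.write_port, self.spare_port, self.read_port)


forwarder = PortForwarder()
print(forwarder.forward())   # port serving queries before the update
forwarder.swap()
print(forwarder.forward())   # port serving queries after the swap
```

Because queries only ever see a fully populated database, an update never exposes a half-written state. The price is three copies of the data.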
Load Balanced BDII
[Figure: queries reach six BDII instances, each on port 2170, through a DNS round-robin alias]
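The load-balancing effect of a round-robin alias can be sketched in a few lines. This is a simplification: a real DNS server rotates the order of the address records it returns, and clients and resolvers may cache them. The host names are hypothetical.

```python
from itertools import cycle


class RoundRobinAlias:
    """Hand out the hosts behind an alias in turn, one per lookup."""

    def __init__(self, hosts):
        self._hosts = cycle(list(hosts))

    def resolve(self) -> str:
        return next(self._hosts)


# Hypothetical BDII instances behind one alias, all listening on port 2170.
alias = RoundRobinAlias(f"bdii{n}.example.org" for n in range(1, 7))
for _ in range(8):
    print(f"ldap://{alias.resolve()}:2170")
```

Successive lookups are spread evenly over the instances, so no single BDII has to serve every client.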
Freedom of Choice
• Developed to meet a requirement from the VOs
  – Modifies the information to their liking
    White list and black list services
  – Only the VO manager can white list and black list the services
• Generates an LDIF modify file
  – Web based
• BDII can be configured to use this file
  – Will modify the database after population
  – For use only with top-level BDIIs
• Linked with the Site Functional Tests Portal
  – Can automatically remove a site if it fails a functional test
    It’s the VO’s choice
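An LDIF modify file of this kind could be generated along the following lines. The record layout follows the standard LDIF change-record syntax. The choice of removing a VO's `GlueCEAccessControlBaseRule` value to black-list a service is an assumption about how such a file might be built, not a description of the actual tool. The DN and VO name are hypothetical.

```python
def blacklist_record(dn: str, vo: str) -> str:
    """Build one LDIF change record that removes a VO's access rule from an entry."""
    return "\n".join([
        f"dn: {dn}",
        "changetype: modify",
        "delete: GlueCEAccessControlBaseRule",
        f"GlueCEAccessControlBaseRule: VO:{vo}",
        "-",
        "",
    ])


def modify_file(blacklisted_dns, vo: str) -> str:
    """Concatenate change records, separated by blank lines."""
    return "\n".join(blacklist_record(dn, vo) for dn in blacklisted_dns)


# Hypothetical black-listed service entry for a hypothetical VO.
dns = ["GlueCEUniqueID=ce.example.org:2119/jobmanager-pbs-short,"
       "mds-vo-name=example-site,o=grid"]
print(modify_file(dns, "examplevo"))
```

Applying such a file after the database has been populated leaves the published entry in place but hides it from the VO that black-listed it.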
Generic Information Provider
• Provides information about the grid service
  – Outputs LDIF information in accordance with the Glue Schema to stdout
• Information can be provided by
  – dynamic providers from the providers directory
  – static files from the ldif directory
  – dynamic plugins from the plugin directory
• Cache used to improve efficiency and reduce load
[Figure: the GIP with its Config File, Providers, Plugins and Cache, producing LDIF]
Generic Information Provider
[Figure: GIP flow: Read Config File → Fork off providers and plugins → Wait (response time) while providers and plugins write to cache → Read providers and plugins from cache → Read Static LDIF → LDAP_MODIFY → Print to stdout; annotations: process will time out, use cache if fresh]
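The flow can be approximated in Python as follows. This is a simplified sketch rather than the GIP itself. It uses threads and Python callables where the GIP forks processes, keeps the cache in a dictionary, and leaves out the LDAP_MODIFY step. The `run_gip` function and all of its parameters are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_gip(providers, static_ldif, cache, response_time=1.0,
            max_age=60.0, now=time.time):
    """Collect LDIF from providers, falling back to cached output.

    providers: dict of name -> zero-argument callable returning LDIF text
    cache:     dict of name -> (timestamp, ldif), updated in place
    """
    # Only run providers whose cached output is missing or too old.
    stale = [name for name in providers
             if name not in cache or now() - cache[name][0] > max_age]
    pool = ThreadPoolExecutor(max_workers=max(1, len(stale)))
    futures = {name: pool.submit(providers[name]) for name in stale}

    # Wait at most response_time in total for the providers to answer.
    deadline = time.monotonic() + response_time
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            cache[name] = (now(), future.result(timeout=remaining))
        except Exception:
            pass  # timed out or failed: keep whatever is already cached
    pool.shutdown(wait=False)

    # Static LDIF first, then the output of each provider that has any.
    parts = [static_ldif] + [cache[n][1] for n in providers if n in cache]
    return "\n".join(parts)


cache = {}
ldif = run_gip({"example": lambda: "dn: o=dynamic"}, "dn: o=static", cache)
print(ldif)
```

A provider that hangs costs at most `response_time`. Its previous output is reused from the cache, so one slow script does not block the whole answer.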
User Tools
• lcg-infosites and lcg-info
  – Can be used to query the information system
  – For more information see the User Guide
    https://edms.cern.ch/file/722398//gLite-3-UserGuide.pdf
• lcg-ManageVoTag
  – Used by the VOs to publish software environment tags
  – Publishes to /opt/edg/var/info/<VO>/<VO>.list
    Ensure the VO can write here!
  – Used by the plugin glite-info-dynamic-software-wrapper
Observations
• Problems observed in the information system
  – Not always due to the information system
    It is just where the problem is visible
  – Many problems at the information provider level
    Due to poor configuration
    Or poor fabric management affecting information providers
• Scalability and stability
  – Top-level BDIIs can become oversubscribed
  – BDIIs take too much time and resources (CPU/RAM) to update
  – Production problems are difficult to trace
    Requires more instrumentation in the code
  – BDIIs don’t work with low-bandwidth connections
Investigations
• Stress testing
  – ldapbench
Results
The New BDII
The New BDII v5
• Use only one LDAP database
  – Reduces complexity and relies on the stability of OpenLDAP
• Only do differential updates
  – Reduces the write interaction and update time
• Merge the GIP and the BDII
  – Only do LDAP_ADD and LDAP_MODIFY in one place
• Remove all internal caches
  – The database is the cache!
• Improved logging
  – Using the standard Python logger
  – More stats, which are available remotely
• Do more with less (KISS)!
New Architecture
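[Figure: Provider, Plugin and LDIF outputs are merged into new LDIF, compared (diff) with the content of the LDAP database on port 2170, and applied as LDAP_ADD and LDAP_MODIFY updates; queries are served on port 2170]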
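A differential update can be sketched as a comparison between the entries already in the database and the newly collected ones. The data layout (a dictionary from DN to attributes) and the function name are hypothetical. The delete operation is an assumption, since the slides only name LDAP_ADD and LDAP_MODIFY.

```python
def diff_entries(old, new):
    """Return the operations that turn the old set of entries into the new one.

    old, new: dict of DN -> dict of attribute name -> value
    """
    operations = []
    for dn, attributes in new.items():
        if dn not in old:
            operations.append(("add", dn, attributes))
        elif old[dn] != attributes:
            operations.append(("modify", dn, attributes))
    for dn in old:
        if dn not in new:
            operations.append(("delete", dn, None))
    return operations


old = {"cn=a,o=grid": {"state": "ok"},
       "cn=b,o=grid": {"state": "ok"},
       "cn=c,o=grid": {"state": "ok"}}
new = {"cn=a,o=grid": {"state": "ok"},        # unchanged: no write at all
       "cn=b,o=grid": {"state": "draining"},  # changed: one modify
       "cn=d,o=grid": {"state": "ok"}}        # new: one add
for operation in diff_entries(old, new):
    print(operation)
```

Unchanged entries generate no write at all, which is where the reduction in write interaction and update time comes from.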
Future Work
• Reducing the network load
  – Investigate the use of syncrepl
  – Update static information less frequently
• Reducing the query load
  – Query caching on the WN
    lcg-utils and Service Discovery API
• Failover queries
  – Local cache
  – Site level
  – Top levels
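The failover idea can be sketched as an ordered list of sources that are tried in turn. The function name and the stand-in sources are hypothetical. A real client would issue LDAP queries where the stubs below return strings.

```python
def failover_query(sources, request):
    """Try each (name, query_function) in order and return the first answer."""
    errors = []
    for name, query in sources:
        try:
            return name, query(request)
        except Exception as exc:   # a miss or an unreachable server
            errors.append((name, exc))
    raise RuntimeError(f"all sources failed: {errors}")


local_cache = {}


def query_local_cache(request):
    return local_cache[request]          # raises KeyError on a cache miss


def query_site_level(request):
    raise ConnectionError("site BDII unreachable")   # simulated outage


def query_top_level(request):
    return f"top-level answer for {request}"


sources = [("local cache", query_local_cache),
           ("site level", query_site_level),
           ("top level", query_top_level)]
print(failover_query(sources, "(objectClass=GlueService)"))
```

Ordering the sources from cheapest to most heavily loaded means the top-level BDIIs only see the queries that nothing closer could answer.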
GStat 2.0
Information Validation
• It is important that information is correct
  – Misconfigured sites have in the past
    Stopped services from running grid-wide!
    Caused black holes for job submission
• Information must agree with the Glue Schema
  – http://forge.gridforum.org/sf/projects/glue-wg
• And be accurate
  – Grid Status (gstat) does basic sanity checks for each site
  – http://goc.grid.sinica.edu.tw/gstat/
  – Grid Wiki gives solutions to common problems
  – http://goc.grid.sinica.edu.tw/gocwiki/FrontPage
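A basic sanity check of published information might look like the following sketch. The rules are illustrative assumptions and are not the checks gstat performs. The attribute names are taken from the Glue 1.x computing element schema and the example values are hypothetical.

```python
def check_entry(entry,
                required=("GlueCEUniqueID",),
                counters=("GlueCEStateFreeCPUs", "GlueCEStateWaitingJobs")):
    """Return a list of problems found in one published entry.

    entry: dict of attribute name -> string value
    """
    problems = []
    for attribute in required:
        if not entry.get(attribute):
            problems.append(f"missing {attribute}")
    for attribute in counters:
        value = entry.get(attribute)
        if value is None:
            continue
        try:
            if int(value) < 0:
                problems.append(f"negative {attribute}: {value}")
        except ValueError:
            problems.append(f"non-numeric {attribute}: {value}")
    return problems


good = {"GlueCEUniqueID": "ce.example.org:2119/jobmanager-pbs-short",
        "GlueCEStateFreeCPUs": "12", "GlueCEStateWaitingJobs": "0"}
bad = {"GlueCEStateFreeCPUs": "-1", "GlueCEStateWaitingJobs": "unknown"}
print(check_entry(good))
print(check_entry(bad))
```

Checks of this kind catch entries that are syntactically valid LDIF but would mislead a resource broker, such as a site that advertises nonsensical free CPU counts.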
The Original GStat
GStat 2.0 Core Concepts
• Monitor and test the information system
• Primary goals for GStat
  – Detect faults in the information system
  – Validate the information content
  – Display useful information with different views
• Build a sustainable architecture
  – Enabling decentralized operations
  – In a federated environment
• Redesign GStat in a modular way
  – Reusable components
  – Multi-location (site/ROC)
  – Multi-application (certification/operations)
GStat 2.0 Architecture
[Figure: modules Core, Validation, Monitoring and Visualization; inputs BDII and Nagios; other labels: Glue snapshot, Entities, Validation Scripts, Results, data, Graphs, Display]
Component Descriptions
• Core
  – Provides an SQL DB snapshot of the BDII content
  – Maintains an entity cache of what has been seen
• Validation
  – Validates the information content
  – Provides testing results for visualization or export
• Monitoring
  – Detects faults in the information system
  – Entity DB is used to configure which entities are monitored
  – Depends on WLCG Nagios sensors (collaboration)
  – Prepares monitoring data and graphs ready for visualization
• Visualization
  – Uses entity DB to generate the main structure
  – Visualizes the results of validation and monitoring
  – Provides different views for different user groups
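The two responsibilities of the Core component, a snapshot of the current content and a memory of every entity seen so far, can be sketched with the standard `sqlite3` module. The table layout and function name are hypothetical and do not reflect the GStat 2.0 schema.

```python
import sqlite3


def snapshot(conn, entries, timestamp):
    """Replace the snapshot with the current entries and update the entity cache.

    entries: dict of DN -> dict of attribute name -> value
    """
    conn.execute("CREATE TABLE IF NOT EXISTS attribute "
                 "(dn TEXT, name TEXT, value TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS entity "
                 "(dn TEXT PRIMARY KEY, first_seen INTEGER, last_seen INTEGER)")
    conn.execute("DELETE FROM attribute")          # the snapshot holds current content only
    for dn, attributes in entries.items():
        conn.executemany("INSERT INTO attribute VALUES (?, ?, ?)",
                         [(dn, name, value) for name, value in attributes.items()])
        # The entity cache remembers entries even after they disappear.
        conn.execute("INSERT OR IGNORE INTO entity VALUES (?, ?, ?)",
                     (dn, timestamp, timestamp))
        conn.execute("UPDATE entity SET last_seen = ? WHERE dn = ?",
                     (timestamp, dn))
    conn.commit()


conn = sqlite3.connect(":memory:")
snapshot(conn, {"cn=a,o=grid": {"state": "ok"},
                "cn=b,o=grid": {"state": "ok"}}, timestamp=100)
snapshot(conn, {"cn=a,o=grid": {"state": "ok"}}, timestamp=200)
print(conn.execute("SELECT dn, first_seen, last_seen FROM entity ORDER BY dn").fetchall())
```

After the second snapshot, `cn=b,o=grid` is gone from the attribute table but still present in the entity table with its old `last_seen` value. That is the kind of record a monitoring component needs in order to notice that something has vanished.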
GStat 2.0 Documentation
Example Usage
Nagios
Prototype Displays
Summary
• Glue 2.0 is coming
  – The transition will take time
• BDII v5 is coming
  – Needs rigorous testing
• New testing methods are coming
  – gstat-validate and ldapbench
• GStat 2.0 is coming
  – An instance can be installed for the Cert Testbed
  – Compatible with Nagios and WLCG
  – Extensible: you can build things on top!
• Future work
  – Focus: addressing scalability and stability