FP6−2004−Infrastructures−6-SSA-026409
www.eu-eela.org
E-infrastructure shared between Europe and Latin America
Giuseppe Platania
INFN Catania
First EELA ROC-on-Duty Tutorial
Itacuruçá Island, State of Rio de Janeiro, Brazil
29 November 2006 - 01 December 2006
Troubleshooting of common problems
Outline
• SECURITY
• JOB SUBMISSION
• SITE BDII
SECURITY
Security (1/5)
• GRAM Authentication test failure:
– Test: globusrun -a -r <ce_hostname>
GRAM Authentication test failure: authentication failed:
GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
init.c:499: globus_gss_assist_init_sec_context_async: Error during context initialization
init_sec_context.c:171: gss_init_sec_context: SSLv3 handshake problems
globus_i_gsi_gss_utils.c:888: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials
globus_i_gsi_gss_utils.c:847: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials:
Couldn't verify the remote certificate
OpenSSL Error: s3_pkt.c:1046: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate
– Solutions: on the CE, check that the CA RPMs are installed and that
port 2119 is not blocked by a firewall
– More details are on page 2 of the troubleshooting guide
Security (2/5)
• Invalid CRL: The available CRL has expired:
– Test: globusrun -a -r <ce_hostname>
GSS authentication failure
GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
accept_sec_context.c:170: gss_accept_sec_context: SSLv3 handshake problems
globus_i_gsi_gss_utils.c:881: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials
globus_i_gsi_gss_utils.c:854: globus_i_gsi_gss_handshake: SSLv3 handshake problems: Couldn't do ssl handshake
OpenSSL Error: s3_srvr.c:1816: in library: SSL routines, function SSL3_GET_CLIENT_CERTIFICATE: no certificate returned
globus_gsi_callback.c:351: globus_i_gsi_callback_handshake_callback: Could not verify credential
globus_gsi_callback.c:477: globus_i_gsi_callback_cred_verify: Could not verify credential
globus_gsi_callback.c:769: globus_i_gsi_callback_check_revoked: Invalid CRL: The available CRL has expired
Failure: GSS failed Major:000a0000 Minor:00000007 Token:00000000
– Solutions: on the CE, check whether the CRL has expired (see /var/log/globus-gatekeeper.log).
If it has, run: /opt/glite/libexec/fetch-crl.sh
– More details are on pages 3-4 of the troubleshooting guide
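The expiry check can also be done by hand on the CRL file itself. The sketch below is only an illustration: the helper name crl_expired is hypothetical, it expects the line printed by `openssl crl -in <crl_file> -noout -nextupdate`, and it relies on GNU date to parse that timestamp.

```shell
# crl_expired NEXTUPDATE_LINE [NOW_EPOCH]
# NEXTUPDATE_LINE is the text printed by
#   openssl crl -in <crl_file> -noout -nextupdate
# for example "nextUpdate=Nov 17 12:57:52 2006 GMT".
# Returns 0 if that time lies before NOW_EPOCH (default: the current time),
# 1 if the CRL is still valid, 2 if the date cannot be parsed.
# Assumption: GNU date is available (the -d option is not POSIX).
crl_expired() {
    next_str=${1#nextUpdate=}
    now_epoch=${2:-$(date +%s)}
    next_epoch=$(date -u -d "$next_str" +%s) || return 2
    [ "$next_epoch" -lt "$now_epoch" ]
}
```

Possible use on the CE: crl_expired "$(openssl crl -in <crl_file> -noout -nextupdate)" && /opt/glite/libexec/fetch-crl.sh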
Security (3/5)
• 501 501-FTPD GSSAPI error: GSS Major Status: General failure:
– Test: edg-gridftp-ls gsiftp://<ce_hostname>/
error the server sent an error response: 535 535-FTPD GSSAPI error: GSS Major Status: Authentication Failed
535-FTPD GSSAPI error: GSS Minor Status Error Chain:
535-FTPD GSSAPI error: accept_sec_context.c:170: gss_accept_sec_context: SSLv3 handshake problems
535-FTPD GSSAPI error: globus_i_gsi_gss_utils.c:881: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials
535-FTPD GSSAPI error: globus_i_gsi_gss_utils.c:854: globus_i_gsi_gss_handshake: SSLv3 handshake problems: Couldn't do ssl handshake
535-FTPD GSSAPI error: OpenSSL Error: s3_srvr.c:1816: in library: SSL routines, function SSL3_GET_CLIENT_CERTIFICATE: no certificate returned
535-FTPD GSSAPI error: globus_gsi_callback.c:351: globus_i_gsi_callback_handshake_callback: Could not verify credential
535-FTPD GSSAPI error: globus_gsi_callback.c:436: globus_i_gsi_callback_cred_verify: The certificate has expired: Credential with subject: /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected]/CN=proxy has expired.
535 FTPD GSSAPI error: accepting context
– Solutions: synchronize the clocks of the nodes (for example with NTP); with skewed clocks a valid proxy can appear expired
– More details are on page 5 of the troubleshooting guide
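A quick way to confirm a clock problem is to compare the epoch time reported by two nodes. The helper below is a minimal sketch: the name clock_skew_ok and the default tolerance of 300 seconds are assumptions made for illustration, not values taken from the troubleshooting guide.

```shell
# clock_skew_ok LOCAL_EPOCH REMOTE_EPOCH [MAX_SKEW_SECONDS]
# Returns 0 if the two timestamps differ by at most MAX_SKEW_SECONDS.
# The default of 300 seconds is a hypothetical tolerance.
clock_skew_ok() {
    max_skew=${3:-300}
    skew=$(( $1 - $2 ))
    # Take the absolute value of the difference.
    if [ "$skew" -lt 0 ]; then
        skew=$(( 0 - skew ))
    fi
    [ "$skew" -le "$max_skew" ]
}
```

Possible use from the UI, assuming ssh access to the remote node: clock_skew_ok "$(date +%s)" "$(ssh <ce_hostname> date +%s)" || echo "clocks differ"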
Security (4/5)
• 530 530 No local mapping for Globus ID
– Test: edg-gridftp-ls gsiftp://<ce_hostname>/
error the server sent an error response: 530 530 LCMAPS credential
mapping NOT successful
(see /var/log/gridftp-lcas_lcmaps.log)
LCMAPS 0: 2006-11-17.12:57:52.968661.0000029308.0000000006 :
lcmaps_plugin_voms_poolaccount-plugin_run(): no match
(or no poolaccount available) for group (/VO=gilda/GROUP=/gilda)
in /opt/edg/etc/lcmaps/gridmapfile
– Solutions: ensure that the pool account files (such as gildaxxx)
exist under /etc/grid-security/gridmapdir
– More details are on pages 6-8 of the troubleshooting guide
Security (5/5)
• 530 530 LCMAPS credential mapping NOT successful:
– Test: edg-gridftp-ls gsiftp://<ce_hostname>/
error the server sent an error response: 530 530 LCMAPS credential mapping
NOT successful
(see /var/log/gridftp-lcas_lcmaps.log)
LCMAPS 0: 2006-11-17.12:57:52.968661.0000029308.0000000006 :
lcmaps_plugin_voms_poolaccount-plugin_run(): no match
(or no poolaccount available) for group (/VO=gilda/GROUP=/gilda) in
/opt/edg/etc/lcmaps/gridmapfile
– Solutions: check whether:
o the VO is enabled
o the VOMS entries are present in /opt/edg/etc/lcmaps/gridmapfile
o all pool accounts are already in use
– More details are on page 9 of the troubleshooting guide
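To see whether all pool accounts are in use, the files in the gridmapdir can be counted. The sketch below assumes the usual gridmapdir convention, in which a leased pool account file gains a second hard link named after the encoded DN, so a free account has a link count of 1. The helper name free_pool_accounts is hypothetical, and -maxdepth is a common find extension rather than POSIX.

```shell
# free_pool_accounts GRIDMAPDIR PREFIX
# Prints how many pool account files PREFIX* in GRIDMAPDIR are still free.
# Assumption: a leased account is hard-linked to a DN-named file, so only
# files with exactly one link are counted as free.
free_pool_accounts() {
    find "$1" -maxdepth 1 -type f -name "$2*" -links 1 | wc -l | tr -d ' '
}
```

Possible use on the CE: free_pool_accounts /etc/grid-security/gridmapdir gilda. A result of 0 suggests that every gilda pool account is already taken.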
JOB SUBMISSION
Job Submission (1/8)
• 7 authentication failed:
– Reasons from edg-job-get-logging-info:
7 authentication failed: GSS Major Status: Authentication Failed GSS Minor Status Error Chain: init.c:497: globus_gss_assist_init_sec_context_async: Error during context initialization init_sec_context
– Solutions: check for a reverse lookup problem, either in /etc/hosts
on the client side or in the DNS configuration.
– More details are on page 10 of the troubleshooting guide
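One frequent form of the /etc/hosts problem is a line where the fully qualified name is missing or is not the first name after the IP address, so a reverse lookup returns a different name. The sketch below checks only that case, which is an assumption about what the reverse lookup problem looks like; the helper name hosts_reverse_ok is hypothetical.

```shell
# hosts_reverse_ok HOSTS_FILE FQDN
# Returns 0 if FQDN appears in HOSTS_FILE and is the first name on every
# line that mentions it; returns 1 otherwise.
hosts_reverse_ok() {
    awk -v name="$2" '
        { sub(/#.*/, "") }      # drop comments
        NF < 2 { next }         # skip blank or address-only lines
        {
            for (i = 2; i <= NF; i++)
                if ($i == name) { found = 1; if ($2 != name) bad = 1 }
        }
        END { if (found && !bad) exit 0; exit 1 }
    ' "$1"
}
```

Possible use on the client: hosts_reverse_ok /etc/hosts <ce_hostname>. The DNS side still has to be checked separately.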
Job Submission (2/8)
• Cannot read JobWrapper output .... :
– Reasons from edg-job-get-logging-info:
Cannot read JobWrapper output, both from Condor and from Maradona
– Solutions: fix the WN, CE, DNS or batch system configuration.
Check the PBS status by running pbsnodes and qstat.
Try restarting the PBS daemons on the CE (and on the WN).
The gatekeeper and the gridftpd on the CE might not map the DN to the same local user.
This can happen if one service is configured to use VOMS (via LCMAPS) while the other
relies on the standard grid-mapfile. Test this as follows:
$ globus-job-run my-CE /usr/bin/id
$ globus-url-copy file:/etc/group gsiftp://my-CE/tmp/test
$ globus-job-run my-CE /bin/ls -l /tmp/test
– More details are on pages 11-12 of the troubleshooting guide
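The output of the first and third test commands can be compared automatically. In the sketch below the helper name same_mapping is hypothetical; it assumes the usual output formats of id (uid=NUMBER(NAME) ...) and of ls -l (owner in the third column).

```shell
# same_mapping ID_OUTPUT LS_OUTPUT
# ID_OUTPUT: output of /usr/bin/id run through the gatekeeper.
# LS_OUTPUT: output of ls -l on the file copied through gridftp.
# Returns 0 if both services mapped the DN to the same local user.
same_mapping() {
    gk_user=$(printf '%s\n' "$1" | sed -n 's/^uid=[0-9]*(\([^)]*\)).*/\1/p')
    ftp_user=$(printf '%s\n' "$2" | awk '{ print $3; exit }')
    [ -n "$gk_user" ] && [ "$gk_user" = "$ftp_user" ]
}
```

Possible use: same_mapping "$(globus-job-run my-CE /usr/bin/id)" "$(globus-job-run my-CE /bin/ls -l /tmp/test)" || echo "gatekeeper and gridftpd map to different users"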
Job Submission (3/8)
• Brokerhelper: Cannot plan. No compatible resources :
– Reasons from edg-job-get-logging-info:
Cannot plan (a helper failed)
– Solutions: the status Cannot plan (a helper failed) means that a helper of the Workload
Manager failed. Matchmaking may fail because of a failing middleware component,
unavailable application software, or a wrong request in the job JDL:
middleware failure due to Information Service problems:
• the service is down
application software unavailable:
• the JDL requires a software version that is wrong or not supported by the site
wrong user request, when the user asks for:
• an unsupported CPU type/operating system/memory/VO
– More details are on pages 13-14 of the troubleshooting guide
Job Submission (4/8)
• ssh problem from WN to CE:
– TEST:
globus-job-run <ce_hostname>:2119/jobmanager-lcgpbs -q short /bin/id
The command returns no output
– Solutions: remove the shosts.equiv and ssh_known_hosts files from the /etc/ssh directory on the CE.
Re-run the following scripts on the CE (they usually also run as cron jobs):
• /opt/edg/sbin/edg-pbs-knownhosts
• /opt/edg/sbin/edg-pbs-shostsequiv
Re-run the following script on the WN (it usually also runs as a cron job):
• /opt/edg/sbin/edg-pbs-knownhosts
– More details are on page 15 of the troubleshooting guide
Job Submission (5/8)
• submit-helper script ... gave error: cache export dir ...:
– TEST: globus-job-run <ce_hostname>:2119/jobmanager-lcgpbs /bin/id
submit-helper script running on host lxb1761 gave error: cache_export_dir
(/home/dteam002/.lcgjm/globus-cache-export.Of5sOd) on gatekeeper did not
contain a cache_export_dir.tar archive
– Solutions:
The CE may not be running a gridftp daemon. Check on the CE:
o /etc/init.d/globus-gridftp status
o Restart it as needed
The gridftp port could also be blocked
– More details are on pages 16-17 of the troubleshooting guide
Job Submission (6/8)
• Globus error 3:
– Reason from edg-job-get-logging-info:
Got a job held event, reason: Globus error 3: an I/O operation failed
– Solutions: in the reported case the cause was very low memory.
queue_submit() in Helper.pm of GRAM checks the memory and
returns a NORESOURCES error if the free memory is less than 2%
of the total. NORESOURCES is GRAM error 3, so the failure is not necessarily I/O related.
Check the WN that shows the problem and reboot it
– More details are on page 18 of the troubleshooting guide
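The 2% rule can be reproduced by hand on a suspect node. The sketch below is not the code from Helper.pm: the helper name low_memory is hypothetical, and reading MemTotal and MemFree from a /proc/meminfo style file is an assumption about how free memory is measured.

```shell
# low_memory MEMINFO_FILE [MIN_PERCENT]
# Returns 0 if free memory is below MIN_PERCENT (default 2) of the total,
# 1 if there is enough free memory, 2 if MemTotal cannot be read.
low_memory() {
    awk -v min="${2:-2}" '
        $1 == "MemTotal:" { total = $2 }
        $1 == "MemFree:"  { free = $2 }
        END {
            if (total <= 0) exit 2
            if (free * 100 < total * min) exit 0
            exit 1
        }
    ' "$1"
}
```

Possible use on the node: low_memory /proc/meminfo && echo "free memory below 2% of total"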
Job Submission (7/8)
• Unspecified gridmanager error:
– Reason from edg-job-get-logging-info:
Got a job held event, reason: Unspecified gridmanager error
– Solutions: this happens when the user does not have permission to submit to the
given queue, or when the batch system is in a bad state.
Check both in the configuration of the CE.
– More details are on page 19 of the troubleshooting guide
Job Submission (8/8)
• GRAM Job submission failed:
– TEST: globus-job-run <ce_hostname>:2119/jobmanager-lcgpbs /bin/id
GRAM Job submission failed because the job manager failed to
open stderr (error code 74)
– Solutions: The UI does not have inbound connectivity for the
GLOBUS_TCP_PORT_RANGE (2000-2005).
Fix the UI’s firewall.
– More details are on page 20 of the troubleshooting guide
SITE BDII
Site BDII (1/3)
• Check if the GIIS is running on CE:
ldapsearch -x -h <ce_hostname> -p 2170 -b mds-vo-name=<site_name>,o=grid
ldap_bind: Can't contact LDAP server
• Solution: check whether the Site BDII is running on the CE:
o /etc/init.d/bdii status
o restart it as needed
Site BDII (2/3)
• The Site BDII does not publish CE information:
Ensure that /opt/bdii/etc/bdii-update.conf contains the CE LDAP URL, such as:
CE ldap://ce.localdomain:2135/mds-vo-name=local,o=grid
• Solution: put the CE LDAP URL in /opt/bdii/etc/bdii-update.conf and restart the
BDII service (check /opt/bdii/var/bdii.log for errors)
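The presence of the CE entry can be verified with a small check on the configuration file. In this sketch the helper name bdii_lists_host is hypothetical; it assumes one "NAME ldap://host:port/..." entry per line, as in the example above, and treats anything after a # as a comment, which is also an assumption.

```shell
# bdii_lists_host CONF_FILE HOSTNAME
# Returns 0 if CONF_FILE has an entry whose second field is an LDAP URL
# pointing at HOSTNAME, 1 otherwise.
bdii_lists_host() {
    awk -v host="$2" '
        { sub(/#.*/, "") }      # assumed comment syntax
        NF >= 2 && index($2, "ldap://" host ":") == 1 { ok = 1 }
        END { if (ok) exit 0; exit 1 }
    ' "$1"
}
```

Possible use: bdii_lists_host /opt/bdii/etc/bdii-update.conf ce.localdomain || echo "CE entry missing"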
Site BDII (3/3)
• Site BDII error:
• tail -f /opt/bdii/var/bdii.log
CE: ldap_bind: Can't contact LDAP server
Time for searches: 0 s
Time to update DB: 1 s
Grabbing port 2170 for 2172
Tue Sep 19 11:47:45 CEST 2006
Sleeping for 30
• Solution: ensure that the GRIS is running:
o /etc/init.d/globus-mds restart
o restart it as needed