FP6−2004−Infrastructures−6-SSA-026409

www.eu-eela.org

E-infrastructure shared between Europe and Latin America

Giuseppe Platania

INFN Catania

First EELA ROC-on-Duty Tutorial

Itacuruçá Island, State of Rio de Janeiro, Brazil

29 November 2006 - 01 December 2006

Troubleshooting of common problems

Outline

• SECURITY

• JOB SUBMISSION

• SITE BDII

SECURITY

Security (1/5)

• GRAM Authentication test failure:

– Test: globusrun -a -r <ce_hostname>

GRAM Authentication test failure: authentication failed:

GSS Major Status: Authentication Failed

GSS Minor Status Error Chain:

init.c:499: globus_gss_assist_init_sec_context_async: Error during context initialization

init_sec_context.c:171: gss_init_sec_context: SSLv3 handshake problems

globus_i_gsi_gss_utils.c:888: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials

globus_i_gsi_gss_utils.c:847: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials:

Couldn't verify the remote certificate

OpenSSL Error: s3_pkt.c:1046: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate

– Solutions: check on the CE that the CA RPMs are installed and that port 2119 is not blocked by a firewall

– More details are on page 2 of the troubleshooting guide
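
The firewall part of this check can be approximated from the UI with a plain TCP connection test. The Python sketch below is only an illustration: the function name `port_open` and the timeout value are assumptions, and 2119 is the gatekeeper port named on this slide. A successful connection shows only that the port is reachable, not that GSI authentication will succeed.

```python
import socket


def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, `port_open("<ce_hostname>", 2119)` returning False would point towards a firewall or a stopped gatekeeper rather than a missing CA package.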

Security (2/5)

• Invalid CRL: The available CRL has expired:

– Test: globusrun -a -r <ce_hostname>

GSS authentication failure

GSS Major Status: Authentication Failed

GSS Minor Status Error Chain:

accept_sec_context.c:170: gss_accept_sec_context: SSLv3 handshake problems

globus_i_gsi_gss_utils.c:881: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials

globus_i_gsi_gss_utils.c:854: globus_i_gsi_gss_handshake: SSLv3 handshake problems: Couldn't do ssl handshake

OpenSSL Error: s3_srvr.c:1816: in library: SSL routines, function SSL3_GET_CLIENT_CERTIFICATE: no certificate returned

globus_gsi_callback.c:351: globus_i_gsi_callback_handshake_callback: Could not verify credential

globus_gsi_callback.c:477: globus_i_gsi_callback_cred_verify: Could not verify credential

globus_gsi_callback.c:769: globus_i_gsi_callback_check_revoked: Invalid CRL: The available CRL has expired

Failure: GSS failed Major:000a0000 Minor:00000007 Token:00000000

– Solutions: check on the CE whether the CRL has expired (see /var/log/globus-gatekeeper.log)

If it has, run: /opt/glite/libexec/fetch-crl.sh

– More details are on pages 3-4 of the troubleshooting guide
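
A CRL's expiry can also be checked directly from its nextUpdate field. The sketch below assumes the input is a line in the form printed by `openssl crl -noout -nextupdate -in <crl_file>`; the function names are illustrative and the month name is parsed with the default (English) locale.

```python
from datetime import datetime, timezone


def crl_next_update(line: str) -> datetime:
    """Parse a line such as 'nextUpdate=Dec  1 12:00:00 2006 GMT'."""
    value = line.strip().split("=", 1)[1]
    # Whitespace in the format matches runs of spaces, so padded days parse too.
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def crl_expired(line: str, now: datetime) -> bool:
    """Return True if the CRL's nextUpdate time is not after 'now'."""
    return crl_next_update(line) <= now
```

Passing `datetime.now(timezone.utc)` as `now` gives the current state; an expired result means the CRL needs to be fetched again.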

Security (3/5)

• 501 501-FTPD GSSAPI error: GSS Major Status: General failure:

– Test: edg-gridftp-ls gsiftp://<ce_hostname>/

error the server sent an error response: 535 535-FTPD GSSAPI error: GSS Major Status: Authentication Failed

535-FTPD GSSAPI error: GSS Minor Status Error Chain:

535-FTPD GSSAPI error: accept_sec_context.c:170: gss_accept_sec_context: SSLv3 handshake problems

535-FTPD GSSAPI error: globus_i_gsi_gss_utils.c:881: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials

535-FTPD GSSAPI error: globus_i_gsi_gss_utils.c:854: globus_i_gsi_gss_handshake: SSLv3 handshake problems: Couldn't do ssl handshake

535-FTPD GSSAPI error: OpenSSL Error: s3_srvr.c:1816: in library: SSL routines, function SSL3_GET_CLIENT_CERTIFICATE: no certificate returned

535-FTPD GSSAPI error: globus_gsi_callback.c:351: globus_i_gsi_callback_handshake_callback: Could not verify credential

535-FTPD GSSAPI error: globus_gsi_callback.c:436: globus_i_gsi_callback_cred_verify: The certificate has expired: Credential with subject: /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected]/CN=proxy has expired.

535 FTPD GSSAPI error: accepting context

– Solutions: synchronize the clocks of the nodes

– More details are on page 5 of the troubleshooting guide

Security (4/5)

• 530 530 No local mapping for Globus ID

– Test: edg-gridftp-ls gsiftp://<ce_hostname>/

error the server sent an error response: 530 530 LCMAPS credential mapping NOT successful

(see /var/log/gridftp-lcas_lcmaps.log)

LCMAPS 0: 2006-11-17.12:57:52.968661.0000029308.0000000006 : lcmaps_plugin_voms_poolaccount-plugin_run(): no match (or no poolaccount available) for group (/VO=gilda/GROUP=/gilda) in /opt/edg/etc/lcmaps/gridmapfile

– Solutions: ensure that /etc/grid-security/gridmapdir contains the pool account files (such as gildaxxx)

– More details are on pages 6-8 of the troubleshooting guide

Security (5/5)

• 530 530 LCMAPS credential mapping NOT successful:

– Test: edg-gridftp-ls gsiftp://<ce_hostname>/

error the server sent an error response: 530 530 LCMAPS credential mapping NOT successful

(see /var/log/gridftp-lcas_lcmaps.log)

LCMAPS 0: 2006-11-17.12:57:52.968661.0000029308.0000000006 : lcmaps_plugin_voms_poolaccount-plugin_run(): no match (or no poolaccount available) for group (/VO=gilda/GROUP=/gilda) in /opt/edg/etc/lcmaps/gridmapfile

– Solutions: check whether:

o the VO is enabled

o /opt/edg/etc/lcmaps/gridmapfile contains the VOMS entries

o all pool accounts are already in use

– More details are on page 9 of the troubleshooting guide
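
Whether all pool accounts are in use can be estimated by inspecting the gridmapdir. The sketch below rests on an assumption not stated on the slides: that a leased pool account file is hard-linked to a file named after the user's DN, so a link count of 1 marks a free account. The function name and the `prefix` parameter (for example `gilda`) are illustrative.

```python
import os
import re
from typing import List


def free_pool_accounts(gridmapdir: str, prefix: str) -> List[str]:
    """List pool account files under gridmapdir that are not leased."""
    pattern = re.compile(re.escape(prefix) + r"\d+$")
    free = []
    for name in sorted(os.listdir(gridmapdir)):
        if not pattern.match(name):
            continue  # skip DN-named lease files and other VOs
        # Assumption: a leased account is hard-linked to a DN-named file,
        # so a link count of 1 means the account is still free.
        if os.stat(os.path.join(gridmapdir, name)).st_nlink == 1:
            free.append(name)
    return free
```

An empty result for `free_pool_accounts("/etc/grid-security/gridmapdir", "gilda")` would be consistent with the "no poolaccount available" message above.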

JOB SUBMISSION

Job Submission (1/8)

• 7 authentication failed:

– Reasons from edg-job-get-logging-info:

7 authentication failed: GSS Major Status: Authentication Failed GSS Minor Status Error Chain: init.c:497: globus_gss_assist_init_sec_context_async: Error during context initialization init_sec_context

– Solutions: check for a reverse-lookup problem in /etc/hosts on the client side or in the DNS configuration.

– More details are on page 10 of the troubleshooting guide

Job Submission (2/8)

• Cannot read JobWrapper output .... :

– Reasons from edg-job-get-logging-info:

Cannot read JobWrapper output, both from Condor and from Maradona

– Solutions: fix the WN, CE, DNS or batch system configuration.

Check the PBS status by running pbsnodes and qstat.

Try restarting the PBS daemons on the CE (and on the WN).

The gatekeeper and the gridftpd on the CE might not map the DN to the same local user. This can happen if one service is configured to use VOMS (via LCMAPS) while the other relies on the standard grid-mapfile. Test this as follows:

$ globus-job-run my-CE /usr/bin/id

$ globus-url-copy file:/etc/group gsiftp://my-CE/tmp/test

$ globus-job-run my-CE /bin/ls -l /tmp/test

– More details are on pages 11-12 of the troubleshooting guide
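
The three test commands are compared by eye: the user reported by `/usr/bin/id` (gatekeeper mapping) should own the file written through gridftp. A minimal sketch of that comparison follows; the function names are illustrative, and the inputs are assumed to be the standard output of `id` and one `ls -l` line.

```python
import re
from typing import Optional


def user_from_id(id_output: str) -> Optional[str]:
    """Extract the user name from `id` output such as 'uid=501(gilda001) ...'."""
    match = re.search(r"uid=\d+\(([^)]+)\)", id_output)
    return match.group(1) if match else None


def owner_from_ls(ls_line: str) -> Optional[str]:
    """Extract the owner (third column) from one `ls -l` line."""
    fields = ls_line.split()
    return fields[2] if len(fields) >= 3 else None


def same_mapping(id_output: str, ls_line: str) -> bool:
    """True if the gatekeeper and gridftpd mapped the DN to the same user."""
    gatekeeper_user = user_from_id(id_output)
    gridftp_user = owner_from_ls(ls_line)
    return gatekeeper_user is not None and gatekeeper_user == gridftp_user
```

A False result suggests that the two services use different mapping mechanisms, as described above.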

Job Submission (3/8)

• Brokerhelper: Cannot plan. No compatible resources :

– Reasons from edg-job-get-logging-info:

Cannot plan (a helper failed)

– Solutions: the status "Cannot plan (a helper failed)" means that a helper of the Workload Manager failed. Matchmaking may fail for several reasons: a failing middleware component, unavailable application software, or a wrong request in the job JDL:

middleware failure is due to Information Service problems:

• the service is down

application software unavailable:

• the JDL requires a software version that is wrong or not supported by the site

wrong user request takes place when the user asks for:

• an unsupported CPU type/operating system/memory/VO

– More details are on pages 13-14 of the troubleshooting guide

Job Submission (4/8)

• ssh problem from WN to CE:

– TEST:

globus-job-run <ce_hostname>:2119/jobmanager-lcgpbs -q short /bin/id

The command returns no output

– Solutions: remove the shosts.equiv and ssh_known_hosts files from the /etc/ssh directory on the CE.

Re-run the following scripts on the CE (they usually also run as cron jobs):

• /opt/edg/sbin/edg-pbs-knownhosts

• /opt/edg/sbin/edg-pbs-shostsequiv

Re-run the following script on the WN (it usually also runs as a cron job):

• /opt/edg/sbin/edg-pbs-knownhosts

– More details are on page 15 of the troubleshooting guide

Job Submission (5/8)

• submit-helper script ... gave error: cache export dir ...:

– TEST: globus-job-run <ce_hostname>:2119/jobmanager-lcgpbs /bin/id

submit-helper script running on host lxb1761 gave error: cache_export_dir (/home/dteam002/.lcgjm/globus-cache-export.Of5sOd) on gatekeeper did not contain a cache_export_dir.tar archive

– Solutions:

The CE may not be running a gridftp daemon. Check on the CE:

o /etc/init.d/globus-gridftp status

o Restart it as needed

The gridftp port could also be closed

– More details are on pages 16-17 of the troubleshooting guide

Job Submission (6/8)

• Globus error 3:

– Reason from edg-job-get-logging-info:

Got a job held event, reason: Globus error 3: an I/O operation failed

– Solutions: this error can occur when memory on the node is very low. queue_submit() in Helper.pm of GRAM checks the memory and returns a NORESOURCES error if the free memory is less than 2% of the total. NORESOURCES is GRAM error 3, so the cause is not necessarily I/O.

Identify the WN that has this problem and reboot it

– More details are on page 18 of the troubleshooting guide
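
The 2% rule can be reproduced on a suspect node to see whether it is near the limit. The sketch below is not the Helper.pm logic: it assumes Linux `/proc/meminfo` text as input and takes MemFree divided by MemTotal as the free fraction, which may differ from what GRAM actually measures. The function names are illustrative.

```python
import re


def free_memory_fraction(meminfo_text: str) -> float:
    """Return MemFree / MemTotal from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values["MemFree"] / values["MemTotal"]


def memory_too_low(meminfo_text: str, threshold: float = 0.02) -> bool:
    """True if free memory is below the threshold (2% by default)."""
    return free_memory_fraction(meminfo_text) < threshold
```

On a node, the text can be obtained with `open("/proc/meminfo").read()`.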

Job Submission (7/8)

• Unspecified gridmanager error:

– Reason from edg-job-get-logging-info:

Got a job held event, reason: Unspecified gridmanager error

– Solutions: either the user does not have permission to submit to the given queue, or the batch system is in a bad state. Check both in the configuration of the CE.

– More details are on page 19 of the troubleshooting guide

Job Submission (8/8)

• GRAM Job submission failed:

– TEST: globus-job-run <ce_hostname>:2119/jobmanager-lcgpbs /bin/id

GRAM Job submission failed because the job manager failed to open stderr (error code 74)

– Solutions: the UI does not have inbound connectivity for the GLOBUS_TCP_PORT_RANGE (2000-2005). Fix the UI’s firewall.

– More details are on page 20 of the troubleshooting guide

SITE BDII

Site BDII (1/3)

• Check if the GIIS is running on CE:

ldapsearch -x -h <ce_hostname> -p 2170 -b mds-vo-name=<site_name>,o=grid

ldap_bind: Can't contact LDAP server

• Solution: check whether the site BDII is running on the CE:

o /etc/init.d/bdii status

o restart it as needed

Site BDII (2/3)

• The Site BDII does not publish CE information:

Check that /opt/bdii/etc/bdii-update.conf contains the CE LDAP URL, such as:

CE ldap://ce.localdomain:2135/mds-vo-name=local,o=grid

• Solution: add the CE LDAP URL to /opt/bdii/etc/bdii-update.conf and restart the BDII service (check /opt/bdii/var/bdii.log for errors)
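
The presence of the CE entry can be verified with a small parser. This sketch assumes the file holds one "NAME URL" pair per line, as in the example on this slide, and that lines starting with `#` are comments; both the comment convention and the function names are assumptions.

```python
from typing import Dict
from urllib.parse import urlparse


def parse_bdii_update_conf(text: str) -> Dict[str, str]:
    """Map each entry name to its LDAP URL, ignoring blanks and comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            entries[parts[0]] = parts[1]
    return entries


def has_ldap_entry(text: str, name: str) -> bool:
    """True if 'name' is listed with an ldap:// URL that has a host and port."""
    url = parse_bdii_update_conf(text).get(name)
    if url is None:
        return False
    parsed = urlparse(url)
    return parsed.scheme == "ldap" and bool(parsed.hostname) and parsed.port is not None
```

Calling `has_ldap_entry(open("/opt/bdii/etc/bdii-update.conf").read(), "CE")` on the site BDII host would show whether the CE line is missing or malformed.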

Site BDII (3/3)

• Site BDII error:

• tail -f /opt/bdii/var/bdii.log

CE: ldap_bind: Can't contact LDAP server

Time for searches: 0 s

Time to update DB: 1 s

Grabbing port 2170 for 2172

Tue Sep 19 11:47:45 CEST 2006

Sleeping for 30

• Solution: ensure that the GRIS is running:

o /etc/init.d/globus-mds restart

o restart it as needed
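
When the log is long, the sources that cannot be contacted can be extracted automatically. The sketch below assumes that each failure appears as "NAME: ldap_bind: Can't contact LDAP server", as in the excerpt on this slide; the function name is illustrative.

```python
import re
from typing import List


def unreachable_sources(log_text: str) -> List[str]:
    """Return source names whose LDAP query failed to connect."""
    pattern = re.compile(r"^\s*(\S+):\s+ldap_bind: Can't contact LDAP server")
    failed = []
    for line in log_text.splitlines():
        match = pattern.match(line)
        # Report each source once, in order of first appearance.
        if match and match.group(1) not in failed:
            failed.append(match.group(1))
    return failed
```

For the excerpt above the result contains only "CE", which points to the GRIS on the CE.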