Hypertext Transfer Protocol http://localhost:8080/cocoon/projects/cscie12/slides/20070417/handout.html
1 of 110 4/17/2007 4:12 PM
Hypertext Transfer Protocol
April 17, 2007
Harvard University Division of Continuing Education
Extension School
Course Web Site: http://cscie12.dce.harvard.edu/
Copyright 1998-2007 David P. Heitmeyer
Instructor email: [email protected] Course staff email: [email protected]
HyperText Transfer Protocol
GET /
HTTP is a stateless protocol. Cookies provide a mechanism to "maintain state".
Cookie Central: The Unofficial Cookie FAQ http://www.cookiecentral.com/faq/ http://www.cookiecentral.com/
Maintaining State with Cookies
HTTP State Management Mechanism http://www.ics.uci.edu/pub/ietf/http/rfc2109.txt
Cookie Central: The Unofficial Cookie FAQ http://www.cookiecentral.com/faq/ http://www.cookiecentral.com/
Persistent Client State HTTP Cookies http://www.netscape.com/newsref/std/cookie_spec.html
    minerva% telnet www.npr.org 80
    Trying 216.35.221.77...
    Connected to www.npr.org.
    Escape character is '^]'.
    GET / HTTP/1.1
    Host: www.npr.org

    HTTP/1.1 200 OK
    Date: Tue, 10 Apr 2007 20:07:33 GMT
    Server: Apache
    Set-Cookie: Apache=140.247.197.240.289451144786054516; path=/
    Cache-Control: max-age=0
    Expires: Tue, 10 Apr 2007 20:07:33 GMT
    Transfer-Encoding: chunked
    Content-Type: text/html

    76c
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>NPR - National Public Radio - News, Arts, World, US.</title>
    <!-- content removed -->
    </html>
Cookie Example
Server returns cookie to HTTP client ("Set-Cookie" response header)
HTTP client returns cookie to server ("Cookie" request header)
ESPN Cookies
Set-Cookie:SWID=C8F9AF31-F170-42BF-9471-50A95DA24C17;path=/;expires=Tue, 10-Apr-2027 03:20:59 GMT;domain=.go.com;
Set-Cookie:DE2=dXNhO21hO2NhbWJyaWRnZTt0MTs1OzQ7NDs1MDY7MDQyLjM4MDstMDcxLjEzNpath=/; expires=Tue, 17 Apr 2007 03:00:00 GMT; domain=.go.com
    minerva% lwp-request -USed http://www.espn.com/
    GET http://espn.go.com/
    User-Agent: lwp-request/2.07

    GET http://www.espn.com/ --> 301 Moved Permanently
    GET http://espn.go.com/ --> 200 OK
    Cache-Control: no-cache
    Date: Tue, 10 Apr 2007 03:20:58 GMT
    Pragma: no-cache
    From: SPORTBARWEB08
    Accept-Ranges: bytes
    ETag: "802e571f7bc71:1762"
    Server: Microsoft-IIS/5.0
    Vary: Accept-Encoding
    Content-Length: 122217
    Content-Type: text/html; charset=iso-8859-1
    Content-Type: text/html; charset=windows-1252
    Last-Modified: Tue, 10 Apr 2007 03:19:21 GMT
    Cache-Expires: Tue, 10 Apr 2007 03:24:22 GMT
    Client-Date: Tue, 10 Apr 2007 03:21:02 GMT
    Client-Peer: 198.105.193.43:80
    Client-Response-Num: 1
    P3P: CP="CAO DSP COR CURa ADMa DEVa TAIa PSAa PSDa IVAi IVDi CONi OUR SAMo OTRo BUS PHY O
    Refresh: 3600
    Set-Cookie: SWID=C8F9AF31-F170-42BF-9471-50A95DA24C17; path=/; expires=Tue, 10-Apr-2027 0
    Set-Cookie: DE2=dXNhO21hO2NhbWJyaWRnZTt0MTs1OzQ7NDs1MDY7MDQyLjM4MDstMDcxLjEzNTs4NDA7MjI7O
Cookie Properties/Attributes
name
expires
domain
path
secure
HTTP State Management Mechanism, RFC 2965
RFC 2109, February 1997
RFC 2965, October 2000
name
comment
comment URL
discard
domain
max-age
path
port
secure
version
Additional Cookie Notes
Client:
300 total cookies
4 KB per cookie
20 cookies per server or domain
Cookie Example: Server Sets a Cookie
Form that will set a Cookie: http://cscie12.dce.harvard.edu/http/cookie.cgi
Set-Cookie HTTP Response Header:
Set-Cookie: YourName=David%20P.%20Heitmeyer; domain=cscie12.dce.harvard.edu; path=/http/;expires=Fri, 13-May-2005 18:05:04 GMT
    minerva% telnet 140.247.197.240 80
    Trying 140.247.197.240...
    Connected to 140.247.197.240.
    Escape character is '^]'.
    GET /http/cookie.cgi?name=David%20P.%20Heitmeyer HTTP/1.1
    Host: cscie12.dce.harvard.edu
    Connection: close

    HTTP/1.1 200 OK
    Connection: close
    Date: Wed, 13 Apr 2005 18:05:04 GMT
    Server: Apache/2.0.49 (Fedora)
    Content-Type: text/html; charset=ISO-8859-1
    Client-Date: Wed, 13 Apr 2005 18:05:04 GMT
    Client-Peer: 140.247.197.240:80
    Client-Response-Num: 1
    Client-Transfer-Encoding: chunked
    Set-Cookie: YourName=David%20P.%20Heitmeyer; \
        domain=cscie12.dce.harvard.edu; \
        path=/http/; \
        expires=Fri, 13-May-2005 18:05:04 GMT

    <?xml version="1.0" encoding="iso-8859-1"?>
    <!DOCTYPE html
        PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US"><head><title>For
    </head><body>
    <h1>Hello, David P. Heitmeyer</h1>
    </body></html>
    Connection closed by foreign host.
Cookie Example: Returning a Cookie
Form that will set a Cookie: http://cscie12.dce.harvard.edu/http/cookie.cgi
    minerva% telnet 140.247.197.240 80
    Trying 140.247.197.240...
    Connected to 140.247.197.240.
    Escape character is '^]'.
    GET /http/cookie.cgi HTTP/1.1
    Cookie: YourName=David%20P.%20Heitmeyer
    Host: cscie12.dce.harvard.edu
    Connection: close

    HTTP/1.1 200 OK
    Connection: close
    Date: Wed, 13 Apr 2005 18:11:40 GMT
    Server: Apache/2.0.49 (Fedora)
    Content-Type: text/html; charset=ISO-8859-1
    Client-Date: Wed, 13 Apr 2005 18:11:40 GMT
    Client-Peer: 140.247.197.240:80
    Client-Response-Num: 1
    Client-Transfer-Encoding: chunked

    <?xml version="1.0" encoding="iso-8859-1"?>
    <!DOCTYPE html
        PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
    <head><title>Form</title></head><body>
    <h1>Hello, David P. Heitmeyer</h1>
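How a server-side script might read the returned cookie can be sketched in Python. This is only an illustration, not the code behind cookie.cgi; the function name parse_cookie_header is an assumption made for this example. It uses the standard library's http.cookies and urllib.parse modules to recover the name from the "Cookie" request header shown above.

```python
from http.cookies import SimpleCookie
from urllib.parse import unquote


def parse_cookie_header(header_value: str) -> dict:
    """Return {cookie name: decoded value} for a Cookie request header value."""
    jar = SimpleCookie()
    jar.load(header_value)
    # The value in the transcript is percent-encoded (%20 for a space),
    # so decode it before using it in the page.
    return {name: unquote(morsel.value) for name, morsel in jar.items()}


cookies = parse_cookie_header("YourName=David%20P.%20Heitmeyer")
print("Hello, " + cookies["YourName"])
```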
Your Cookies
The Firefox Web Developer Toolbar has a "Cookies" section.
Mozilla Cookie Manager
Cookies and Session IDs
A UserID or SessionID (a long character/number string that is uniquely assigned) is often stored in a cookie. The SessionID is used as the key or identifier when storing information about the user or session.
For example, a user logs in to a site. If the username and password match, the server sets a cookie ("Set-Cookie") in the browser that contains a session id; the server also makes an entry in the website database that maps the session id to the username. When the cookie is returned, the session id is read and the username is looked up in the database.
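The login flow just described can be sketched as follows. This is a minimal illustration under stated assumptions: the cookie name SessionID, the in-memory dictionaries standing in for the website database, and the sample user are all hypothetical, and a real site would store password hashes rather than plaintext passwords.

```python
import secrets
from typing import Optional

# Stand-in for the website database: session id -> username (hypothetical).
SESSIONS = {}
# Hypothetical credential table, for illustration only.
USERS = {"jharvard": "crimson"}


def log_in(username: str, password: str) -> Optional[str]:
    """Return a Set-Cookie header value on success, None on failure."""
    if USERS.get(username) != password:
        return None
    session_id = secrets.token_hex(16)  # long, hard-to-guess identifier
    SESSIONS[session_id] = username     # map session id to username
    return f"SessionID={session_id}; path=/"


def user_for_cookie(cookie_header: str) -> Optional[str]:
    """Look up the username for a returned 'SessionID=...' Cookie header."""
    name, _, value = cookie_header.partition("=")
    if name.strip() != "SessionID":
        return None
    return SESSIONS.get(value.strip())
```

Only the session id travels in the cookie; the username stays on the server and is looked up on each request.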
Google Cookie Example
Using Google's "Preference" page and setting:
Search Language preference to: English, French, German
SafeSearch Filtering: Strict Filtering
Number of Results: 50
The Cookie name is: PREF
The Value is: ID=bb504f37cd318aa9:FF=1:LR=lang_en|lang_fr|lang_de:LD=en:NR=50:TM=1113416195:LM=111
This cookie contains a session id as well as the values of certain preferences in a colon-separated data structure.
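Splitting such a colon-separated value into its fields takes only a few lines. The sketch below uses the leading fields of the PREF value shown above; the function name parse_pref is an assumption, and reading NR as "number of results" and LR as the language list is inferred from the settings listed on this slide, not from any Google documentation.

```python
def parse_pref(value: str) -> dict:
    """Split a colon-separated list of key=value fields into a dict."""
    fields = {}
    for part in value.split(":"):
        key, _, val = part.partition("=")
        fields[key] = val
    return fields


pref = parse_pref("ID=bb504f37cd318aa9:FF=1:LR=lang_en|lang_fr|lang_de:LD=en:NR=50")
print(pref["ID"])             # the identifier portion of the cookie
print(pref["NR"])             # appears to match "Number of Results: 50"
print(pref["LR"].split("|"))  # appears to match the three search languages
```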
Cookies and Ad Tracking
Method: POST
Form that will set a Cookie: http://cscie12.dce.harvard.edu/http/cookie.cgi
    minerva% telnet 140.247.197.240 80
    Trying 140.247.197.240...
    Connected to 140.247.197.240.
    Escape character is '^]'.
    POST /http/cookie.cgi HTTP/1.1
    Host: cscie12.dce.harvard.edu
    Content-Length: 10
    Content-Type: application/x-www-form-urlencoded

    name=David
    HTTP/1.1 200 OK
    Date: Wed, 13 Apr 2005 19:31:11 GMT
    Server: Apache/2.0.49 (Fedora)
    Set-Cookie: YourName=David; domain=cscie12.dce.harvard.edu; path=/http/; expires=Fri, 13-
    Content-Length: 319
    Connection: close
    Content-Type: text/html; charset=ISO-8859-1

    <?xml version="1.0" encoding="iso-8859-1"?>
    <!DOCTYPE html
        PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
    <head><title>Form</title>
    </head><body>
    <h1>Hello, David</h1>
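With POST, the form data travels in the request body, and Content-Length must equal the number of bytes in that body (10 for "name=David"). The sketch below builds the same request text as the telnet session; the helper name build_post_request is an assumption for illustration, and the request is only assembled as a string, not sent.

```python
from urllib.parse import urlencode


def build_post_request(path: str, host: str, fields: dict) -> str:
    """Assemble a form POST request as text, with a matching Content-Length."""
    body = urlencode(fields)  # e.g. {"name": "David"} -> "name=David"
    lines = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        f"Content-Length: {len(body.encode('ascii'))}",
        "Content-Type: application/x-www-form-urlencoded",
        "",    # blank line separates headers from body
        body,
    ]
    return "\r\n".join(lines)


request = build_post_request(
    "/http/cookie.cgi", "cscie12.dce.harvard.edu", {"name": "David"}
)
print(request)
```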
WebDAV: an extension of HTTP
Web-based Distributed Authoring and Versioning
WebDAV Resources http://www.webdav.org/
From the WebDAV Resources:
WebDAV stands for "Web-based Distributed Authoring and Versioning". It is a set of extensions to the HTTP protocol which allows users to collaboratively edit and manage files on remote web servers.
HTTP Resources
W3C HTTP http://www.w3.org/Protocols/
HTTP Pocket Reference http://www.oreilly.com/catalog/httppr/ by Clinton Wong (O'Reilly).
Illustrated Guide to HTTP http://www.manning.com/hethmon/ by Paul Hethmon (Manning Publications; ISBN 0138582262); see sample chapters and resources online.
Other Readings:
W3C Recommendations Reduce 'World Wide Wait' http://www.w3.org/Protocols/NL-PerfNote.html
Apache Week: HTTP version 1.1 http://www.apacheweek.com/features/http11
WebTechniques: HTTP 1.1: What's in it for Me? http://www.webtechniques.com/archives/1997/08/webm/
Cookie Central: The Unofficial Cookie FAQ http://www.cookiecentral.com/faq/ http://www.cookiecentral.com/
Apache HTTP Server
Apache Software Foundation
Apache HTTP Server Project
Apache 1.3
Apache 2.x
Apache Modules
    PHP
    Perl
    Python
    many, many others
Apache: The Most Widely Used Web Server on the Public Internet
Netcraft Web Server Survey
In this Unit: Configuring Apache with .htaccess files
Custom Error Documents
Redirect
Rewrite
Directory Index
Setting HTTP Headers
    Expires
    Headers
Access Control
Requiring a Secure Connection (SSL)
Apache Configuration Overview
Server Configuration (httpd.conf)
Unless you are the server administrator, you generally will not have access to this file. On the DCE systems, you do not have read or write access to this file. Server configuration is read at server start or restart.
Per Directory (.htaccess)
Certain configuration directives for Apache can be placed within per-directory .htaccess files. The .htaccess file is read on a per-request basis.
.htaccess File Example
document root: /home/e12/htdocs
filename: .htaccess
location: /home/e12/htdocs/apache/.htaccess
contents:
    ErrorDocument 404 /apache/status404.html
filename: status404.html
location: /home/e12/htdocs/apache/status404.html
http://cscie12.dce.harvard.edu/apache/ZZZ.html
Scope of .htaccess files
Directives within .htaccess files apply to the directory that contains the .htaccess file and all its descendants.
Directives within the file /home/e12/htdocs/.htaccess would apply to all files within and "under" the public_html directory for the user cscie12.
Directives within the file /home/e12/htdocs/assignments/.htaccess would apply to all files within and "under" the public_html/assignments directory for the user cscie12.
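The scope rule can be made concrete by listing which .htaccess files could affect a given file. The sketch below is a simplification that starts at the document root (a real server may be configured to look in directories above it as well); the function name htaccess_candidates and the sample file path are assumptions for illustration.

```python
from pathlib import PurePosixPath


def htaccess_candidates(document_root: str, requested_file: str) -> list:
    """List the .htaccess paths that apply to a file, outermost first."""
    root = PurePosixPath(document_root)
    directory = PurePosixPath(requested_file).parent
    relative = directory.relative_to(root)  # ValueError if outside the root
    candidates = [str(root / ".htaccess")]
    current = root
    for part in relative.parts:
        current = current / part
        candidates.append(str(current / ".htaccess"))
    return candidates


for path in htaccess_candidates(
    "/home/e12/htdocs", "/home/e12/htdocs/assignments/hw1/index.html"
):
    print(path)
```

A file in assignments/hw1 is therefore subject to directives from three possible .htaccess files, with the deeper ones read later.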
Problems You Will Have with .htaccess files
Internal Server Error
Can't "see" the file
Incorrect Permissions
Problems You will encounter when using .htaccess files
500 Internal Server Error
If you begin seeing 500 Internal Server Error responses from the server after you have created or edited an .htaccess file, the most likely cause of the problem is incorrect permissions and/or an error in the directive syntax.
Permissions on the .htaccess file are not set correctly. Just like HTML and image files, the server must be able to read the .htaccess file. The simplest way to allow that is to make your .htaccess file readable by "other".
Syntax Error. An error in the syntax of a directive in the .htaccess file will result in a 500 Internal Server Error. In addition, correct usage of a directive that is not allowed in the .htaccess file will result in a 500 status code. Whether or not a directive is allowed depends upon the server configuration file (httpd.conf; AllowOverride) and the directive itself.
    minerva% pwd
    /home/courses/j/h/jharvard/public_html
    minerva% ls -l .htaccess
    -rw------- 1 jharvard founder 349 Nov 27 00:03 .htaccess
    minerva% chmod o+r .htaccess
    minerva% ls -l ~/public_html/.htaccess
    -rw----r-- 1 jharvard founder 349 Nov 27 00:03 .htaccess
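The same permission check and fix can be scripted. This is a small sketch using Python's os and stat modules on a UNIX system; the function names are assumptions for this example. add_other_read does what "chmod o+r" does in the session above.

```python
import os
import stat


def readable_by_other(path: str) -> bool:
    """True if the 'other' read bit (the r in -rw----r--) is set."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def add_other_read(path: str) -> None:
    """Equivalent of `chmod o+r path`: add the read bit for 'other'."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IROTH)
```

For example, calling add_other_read on a file with mode 600 (rw-------) leaves it with mode 604 (rw----r--).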
Problems You will encounter when using .htaccess files
You can't "see" your .htaccess file.
HTTP
The web server is typically configured to deny requests for .htaccess files. For example, the file corresponding to the URL http://cscie12.dce.harvard.edu/.htaccess exists and is readable by the Web server, but if we try to follow the link, we get a 403 Forbidden response.
UNIX
The ls command will not list files or directories that begin with a '.' (dot). In order to see the .htaccess file when you do a directory listing, use the -a (all) option: ls -a
SFTP
Sometimes your SFTP program will hide the "dot" files unless explicitly told to show them.
Apache Configuration Sections
Configuration directives can be limited by using "sections", such as
Directory
Location
Files
VirtualHost
DirectoryMatch
LocationMatch
FilesMatch
Within .htaccess
Note that only Files and FilesMatch can be used within .htaccess files.
Examples:
    <Files .htaccess>
        Order allow,deny
        Deny from all
    </Files>

    # deny access to any tilde backup files
    <Files *~>
        Order allow,deny
        Deny from all
    </Files>
Configuring Apache with .htaccess files
Custom Error Documents
Redirect
Rewrite
Directory Index
Setting HTTP Headers
    Expires
    Headers
Access Control
Custom Error Documents
ErrorDocument directive http://www.apache.org/docs/2.0/mod/core.html#errordocument
Custom Error Responses http://www.apache.org/docs/2.0/custom-error.html
.htaccess
    ErrorDocument 401 /apache/status401.html
    ErrorDocument 403 /apache/status403.html
    ErrorDocument 404 /apache/status404.html
HTTP Redirect
Fight Linkrot!
Do your part to Fight Linkrot! (Jakob Nielsen's Alertbox, http://www.useit.com/alertbox/980614.html)
Redirect
Rewrite
Meta http-equiv refresh
Redirecting Requests
HTTP Status Codes:
301 Moved permanently
302 Moved temporarily
Redirecting client requests can be very useful:
URL moves to a new location
resource removed
site structure is reorganized
Provide "friendly" or additional URLs to access a resource
Redirect
Redirect directive http://www.apache.org/docs/2.0/mod/mod_alias.html#redirect
.htaccess
    Redirect 302 /apache/dce.html http://www.dce.harvard.edu/
    Redirect 301 /apache/church_st http://map.harvard.edu/level3.cfm?mapname=camb_allston
Try it:
http://cscie12.dce.harvard.edu/apache/dce.html
http://cscie12.dce.harvard.edu/apache/church_st
    minerva% telnet cscie12.dce.harvard.edu 80
    Trying 140.247.197.240...
    Connected to cscie12.dce.harvard.edu.
    Escape character is '^]'.
    GET /apache/dce.html HTTP/1.1
    Host: cscie12.dce.harvard.edu
    Connection: close

    HTTP/1.1 302 Found
    Date: Wed, 13 Apr 2005 20:03:10 GMT
    Server: Apache/2.0.49 (Fedora)
    Location: http://www.dce.harvard.edu/
    Content-Length: 302
    Connection: close
    Content-Type: text/html; charset=iso-8859-1

    minerva% lwp-request -USed http://cscie12.dce.harvard.edu/apache/dce.html
    GET http://www.dce.harvard.edu/
    User-Agent: lwp-request/2.06

    GET http://cscie12.dce.harvard.edu/apache/dce.html --> 302 Found
    GET http://www.dce.harvard.edu/ --> 200 OK
    Connection: Close
    Date: Wed, 13 Apr 2005 20:01:26 GMT
    Accept-Ranges: bytes
    Server: Orion/2.0.6
    Content-Length: 3619
    Content-Type: text/html
    Content-Type: text/html; charset=iso-8859-1
    Last-Modified: Wed, 27 Oct 2004 18:45:00 GMT
    Client-Date: Wed, 13 Apr 2005 20:01:49 GMT
    Client-Peer: 140.247.198.100:80
    Client-Response-Num: 1
Rewrite
mod_rewrite http://www.apache.org/docs/2.0/mod/mod_rewrite.html
A Users Guide to URL Rewriting with the Apache Webserver http://www.engelschall.com/pw/apache/rewriteguide/
Rewrite uses regular expressions to match on a pattern and rewrite to a new location. For example, the Derek Bok Center site used to be a "user" account and had the "~bok_cen/" base. When moved to its own virtual host, all of the "~bok_cen" requests could be rewritten to the new site with a single rewrite rule.
Old URL: http://www.fas.harvard.edu/~bok_cen/tf/resources.html
(.*) matches on: /tf/resources.html
New URL: http://bokcenter.fas.harvard.edu/tf/resources.html
    # rewrite for Bok Center
    RewriteRule ^/~?bok_cen(.*) http://bokcenter.fas.harvard.edu$1 [R=301]
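The pattern in the Bok Center rule can be tried out with Python's re module, whose syntax agrees with Apache's for this simple pattern. This sketch only mimics the match-and-substitute step; it is not how mod_rewrite is implemented, and the function name rewrite_bok is an assumption.

```python
import re
from typing import Optional

# Same pattern as the RewriteRule: optional "~", then capture the rest in $1.
BOK_PATTERN = re.compile(r"^/~?bok_cen(.*)")


def rewrite_bok(path: str) -> Optional[str]:
    """Return the new URL for an old Bok Center path, or None if no match."""
    match = BOK_PATTERN.match(path)
    if match is None:
        return None
    # match.group(1) plays the role of $1 in the RewriteRule substitution.
    return "http://bokcenter.fas.harvard.edu" + match.group(1)


print(rewrite_bok("/~bok_cen/tf/resources.html"))
```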
Examples of Rewrite Uses
Provide a standard mechanism to access course Web sites within Harvard College.
http://www.courses.harvard.edu/<4 digit catalog number>
For example, Chemistry 7 has a catalog number of 5118, so the URL for the course Web site can be reached through:
http://www.courses.harvard.edu/5118
The "real" location of the site is:
http://my.harvard.edu/icb/icb.do?course=fas-chem7
HASCS Site Restructure
Dozens of rewrite directives were put in place when the HASCS site was restructured so that links to documents within the previous site would get redirected to the appropriate page in the new site.
Rewrite: Can be conditional
Rewrite rules can be conditional (match against almost any environment variable).
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^Lynx
    RewriteRule ^(index.html)?$ text/ [R=302]
Here we match on host and user-agent to deliver an error page explaining that their browser is not supported.
    # rewrite rule to catch IE Mac browsers since
    # the PIN Service does not support them as of 10/16/2006
    RewriteCond %{HTTP_HOST} "^login.icommons.harvard.edu$"
    RewriteCond %{HTTP_USER_AGENT} "MSIE 5.*\; Mac_PowerPC"
    RewriteRule ^/pinproxy.* /pin_error_ie_mac.html [R,L]
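The logic of the IE Mac rule (all conditions must match before the rule applies) can be sketched in Python. This is an illustration of the matching logic only, using the same regular expressions; the function name and the sample User-Agent strings are assumptions.

```python
import re

HOST_RE = re.compile(r"^login.icommons.harvard.edu$")
AGENT_RE = re.compile(r"MSIE 5.*; Mac_PowerPC")
PATH_RE = re.compile(r"^/pinproxy.*")


def rewrite_for_ie_mac(host: str, user_agent: str, path: str) -> str:
    """Return the error page path when every condition matches, else the path."""
    if HOST_RE.search(host) and AGENT_RE.search(user_agent) and PATH_RE.search(path):
        return "/pin_error_ie_mac.html"
    return path


# Hypothetical User-Agent values for illustration.
ie_mac = "Mozilla/4.0 (compatible; MSIE 5.23; Mac_PowerPC)"
firefox = "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3"
print(rewrite_for_ie_mac("login.icommons.harvard.edu", ie_mac, "/pinproxy/login"))
print(rewrite_for_ie_mac("login.icommons.harvard.edu", firefox, "/pinproxy/login"))
```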
An aside: Text-only sites and "link"
Meta-information can be used to describe alternate content.
W3C Web Content Accessibility Guidelines: alternate pages http://www.w3.org/TR/WAI-WEBCONTENT-TECHS/#alt-pages
In ~cscie12/public_html/index2.html
    <link title="Text-only version"
        rel="alternate"
        href="http://cscie12.dce.harvard.edu/text/index.html"
        media="aural, braille, tty"/>
Lynx view of index2.html provides the text-only version as a link.
Meta Refresh
Note: redirection may also be achieved on some browsers by using the http-equiv attribute of the <meta> element. More information and examples are provided at http://www.fas.harvard.edu/~web/tutorial/meta/refresh/ . The recommended method is to do it at the server level.
    <!-- in head -->
    <!-- will redirect in 10 seconds -->
    <meta http-equiv="Refresh" content="10; URL=http://www.harvard.edu/"/>
Directory Index and Listings
Note: Remember the difference between a directory having rwx-----x and rwx---r-x permissions?
DirectoryIndex http://www.apache.org/docs/2.0/mod/mod_dir.html
Would you prefer main.html or overview.html to be the default files returned when a directory is requested?
mod_autoindex http://www.apache.org/docs/2.0/mod/mod_autoindex.html
Provides for automatic indexing of a directory.
DirectoryIndex
    DirectoryIndex index.html main.html overview.html slide1.html
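The DirectoryIndex directive lists candidate files in order of preference, and the first one that exists is returned. A rough sketch of that selection in Python follows; the function name is an assumption, and real servers also handle cases (such as falling back to a directory listing) not shown here.

```python
import os
from typing import Optional


def pick_directory_index(directory: str, candidates: list) -> Optional[str]:
    """Return the first candidate file name that exists in the directory."""
    for name in candidates:
        if os.path.isfile(os.path.join(directory, name)):
            return name
    return None  # nothing matched; the server would try something else
```

With the directive above, a directory containing main.html and slide1.html but no index.html would be answered with main.html.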
More Control over Directory Listings
mod_autoindex
Basic
Custom
The details:
    minerva% pwd
    /home10/c/s/cscie12/public_html/autoindex/ex2
    minerva% ls -la
    total 28
    drwxr-xr-x 2 cscie12 courses 8192 Nov 27 13:28 .
    drwxr-xr-x 6 cscie12 courses 8192 Nov 27 13:11 ..
    -rw-r--r-- 1 cscie12 courses 207 Nov 27 13:12 .htaccess
    -rw-r--r-- 1 cscie12 courses 147 Nov 27 13:09 HEADER.html
    -rw-r--r-- 1 cscie12 courses 66 Nov 27 13:09 README.html
    -rw-r--r-- 1 cscie12 courses 4168 Nov 27 12:58 client-server.gif
    -rw-r--r-- 1 cscie12 courses 906 Nov 27 12:58 slide1.html
    -rw-r--r-- 1 cscie12 courses 743 Nov 27 12:58 slide2.html
    -rw-r--r-- 1 cscie12 courses 1208 Nov 27 12:58 slide3.html
    minerva% cat .htaccess
    IndexOptions FancyIndexing
    IndexOptions IconsAreLinks IconHeight=22 IconWidth=20 \
        NameWidth=* ScanHTMLTitles SuppressLastModified \
        SuppressSize SuppressColumnSorting \
        SuppressHTMLPreamble
    IndexIgnore *.gif ..
    minerva%
Setting HTTP Headers
Expires
Headers
Expires
Module mod_expires http://www.apache.org/docs/2.0/mod/mod_expires.html
.htaccess
    ExpiresActive On

    ExpiresByType text/html A3600
    # HTML expires in 1 hour

    ExpiresByType image/gif A2592000
    # GIF expires in 30 days

    ExpiresByType image/jpeg A2592000
    # JPEG expires in 30 days

    ExpiresByType image/png A2592000
    # PNG expires in 30 days

    # types not specified
    ExpiresDefault "now plus 1 day"
    # expires in 1 day
Or, expire based upon modification time of document:
    ExpiresActive On
    ExpiresByType text/html M86400
    # HTML expires 1 day after it was last modified
    ExpiresDefault M86400
From the Apache mod_expires documentation:
This module controls the setting of the Expires HTTP header in server responses. The expiration date can set to be relative to either the time the source file was last modified, or to the time of the client access.
The Expires HTTP header is an instruction to the client about the document's validity and persistence. If cached, the document may be fetched from the cache rather than from the source until this time has passed. After that, the cache copy is considered "expired" and invalid, and a new copy must be obtained from the source.
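In the directives above, A3600 means "access time plus 3600 seconds" and M86400 means "modification time plus 86400 seconds". The arithmetic behind the resulting Expires header can be sketched as follows; the function name and the sample timestamps are assumptions chosen for illustration.

```python
import calendar
from email.utils import formatdate


def expires_header(base_epoch_seconds: float, offset_seconds: int) -> str:
    """Format an Expires header for base time plus an offset, in GMT."""
    return "Expires: " + formatdate(base_epoch_seconds + offset_seconds, usegmt=True)


# A3600: base is the time of the client access (sample value).
access_time = calendar.timegm((2007, 4, 10, 20, 7, 33))
print(expires_header(access_time, 3600))

# M86400: base is the file's last-modified time (sample value).
modified_time = calendar.timegm((2007, 4, 10, 3, 19, 21))
print(expires_header(modified_time, 86400))
```

Note that with the M form, a file that has not been modified recently may already be past its expiration time when it is served.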
Do not cache
If you do not want your page cached, set these HTTP response headers:
    Cache-control: no-cache
    Pragma: no-cache
    Expires: <set to now>
In .htaccess in Apache, this would translate to:
    ExpiresDefault "now"
    Header set Pragma "no-cache"
Headers
mod_headers http://www.apache.org/docs/2.0/mod/mod_headers.html
The optional headers module allows for the customization of HTTP response headers. Headers can be merged, replaced or removed. The server will always add a "Server" and "Date" header to the HTTP response.
    Header set Author "David P. Heitmeyer"
Usertrack with Cookies
mod_usertrack
.htaccess file:
    CookieTracking on
    CookieStyle RFC2965
    CookieName MyCookie
    CookieExpires "1 month 3 days 2 hours"
    CookieDomain .dce.harvard.edu
    minerva% lwp-request -USed http://cscie12.dce.harvard.edu/apache/usertrack/
    GET http://cscie12.dce.harvard.edu/apache/usertrack/
    User-Agent: lwp-request/1.38

    GET http://cscie12.dce.harvard.edu/apache/usertrack/ --> 200 OK
    Connection: close
    Date: Wed, 13 Apr 2005 20:32:54 GMT
    Accept-Ranges: bytes
    Server: Apache/2.0.49 (Fedora)
    Content-Length: 59
    Content-Type: text/html; charset=UTF-8
    Client-Date: Wed, 13 Apr 2005 20:32:54 GMT
    Client-Peer: 140.247.197.240:80
    Client-Response-Num: 1
    Set-Cookie2: MyCookie=140.247.197.240.1113424374035983; \
        path=/; max-age=2858400; \
        domain=.dce.harvard.edu; version=1
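The max-age of 2858400 seconds in the Set-Cookie2 header is the CookieExpires period converted to seconds. The sketch below reproduces the arithmetic; treating a month as 30 days is an assumption that happens to match the observed value, and the function name is made up for this example.

```python
# Seconds per unit; "month" as 30 days is an assumption that matches
# the max-age value seen in the response above.
UNIT_SECONDS = {
    "month": 30 * 86400,
    "week": 7 * 86400,
    "day": 86400,
    "hour": 3600,
    "minute": 60,
    "second": 1,
}


def cookie_expires_seconds(spec: str) -> int:
    """Convert a period such as '1 month 3 days 2 hours' to seconds."""
    tokens = spec.split()
    total = 0
    for amount, unit in zip(tokens[0::2], tokens[1::2]):
        total += int(amount) * UNIT_SECONDS[unit.rstrip("s")]
    return total


# 2592000 + 259200 + 7200 seconds
print(cookie_expires_seconds("1 month 3 days 2 hours"))
```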
WWW Access Control
You can implement access control on all or part of your Web site so that:
users must provide a username and password (Basic Authentication);
users' computers must be within a particular domain
Basic Authentication: Warning
Basic Authentication alone does not provide the security and privacy to adequately protect truly confidential or personal information.
Basic Authentication is analogous to simply "closing a door" to parts of your Web site. It will prevent the casual or polite users from "opening the door", but will not prevent someone mildly determined from walking in.
Two issues that contribute to the lack of security and privacy are:
the content is transmitted over the network in plaintext
the usernames and passwords (submitted with each HTTP request) are transmitted over the network in plaintext
HTTP: Authenticate
    minerva% telnet 140.247.197.240 80
    Trying 140.247.197.240...
    Connected to 140.247.197.240.
    Escape character is '^]'.
    HEAD /apache/access/example1/ HTTP/1.1
    Host: cscie12.dce.harvard.edu

    HTTP/1.1 401 Authorization Required
    Connection: close
    Date: Wed, 13 Apr 2005 20:44:39 GMT
    Server: Apache/2.0.49 (Fedora)
    WWW-Authenticate: Basic realm="Basic Authentication Tutorial 1"
    Content-Length: 492
    Content-Type: text/html; charset=iso-8859-1
    Client-Date: Wed, 13 Apr 2005 20:44:39 GMT
    Client-Peer: 140.247.197.240:80
    Client-Response-Num: 1
HTTP: Authentication/Authorization
The username:password is sent MIME BASE 64 encoded (not encrypted).
    minerva% telnet 140.247.197.240 80
    Trying 140.247.197.240...
    Connected to 140.247.197.240.
    Escape character is '^]'.
    HEAD /apache/access/example1/ HTTP/1.1
    Host: cscie12.dce.harvard.edu
    Authorization: Basic Z3Vlc3Q6Z3Vlc3Q=

    HTTP/1.1 200 OK
    Connection: close
    Date: Wed, 13 Apr 2005 20:47:53 GMT
    Accept-Ranges: bytes
    Server: Apache/2.0.49 (Fedora)
    Content-Length: 124
    Content-Type: text/html; charset=UTF-8
    Client-Date: Wed, 13 Apr 2005 20:47:53 GMT
    Client-Peer: 140.247.197.240:80
    Client-Response-Num: 1
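Because Base64 is an encoding and not encryption, anyone who sees the Authorization header can recover the username and password. The sketch below decodes the value from the session above with Python's base64 module; the function names are assumptions for this example.

```python
import base64


def decode_basic_authorization(header_value: str) -> tuple:
    """Return (username, password) from an Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


def encode_basic_authorization(username: str, password: str) -> str:
    """Build the header value a browser would send for these credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


print(decode_basic_authorization("Basic Z3Vlc3Q6Z3Vlc3Q="))
print(encode_basic_authorization("guest", "guest"))
```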
Access Control Documentation
Apache
Apache FAQ has a section on user authentication.
Using User Authentication from Apache Week
Relevant Apache Module and Directive Documentation
    mod_access module
    mod_auth module
    require directive
    satisfy directive
Implementing Access Control
To implement access control, you must create a file named '.htaccess' that contains the proper configuration instructions. You may also need to create a ".htpasswd" file using the utility "htpasswd" and a ".htgroup" file.
htpasswd program
.htaccess file
htpasswd file
htgroup file
htpasswd file
.htpasswd This file contains usernames and encrypted passwords (username:enc_passwd). It is created andmanaged with the utility, "htpasswd", which can be run from the command line.
This file should notlie within your public_html. It should reside at the root level of your home directory (for example,/home/courses/j/h/jharvard/.htpasswd
This file needs to be readable by the Web Server.
Sample content:
view plain print ?
minerva% which htpasswd 1./usr/bin/htpasswd 2.minerva% htpasswd 3.Usage: htpasswd [-c] passwordfile username 4.The -c flag creates a new file. 5. 6.
view plain print ?
minerva% more ~e12/.htpasswd.demo 1.guest:79WeSn3vYGsKQ 2.guest2:wGcgIYLtHNIpM 3.guest3:j9VzpSX/C8Kr2 4.guest4:CjHmW1PWNFwXM 5.
htgroup file
.htgroup
This file contains group definitions (group_name:member1 member2 ...).
This file should not lie within your public_html. It should reside at the root level of your home directory (for example, /home/courses/j/h/jharvard/.htgroup).
This file needs to be readable by the Web Server.
Access Control Examples
For the examples given, the user "cscie12" is used. You should substitute your username and home directory appropriately.
The following .htpasswd.demo and .htgroup.demo files are used:
/home/e12/.htpasswd.demo
The .htpasswd.demo was generated by using the utility "htpasswd":

minerva% htpasswd
Usage: htpasswd [-c] passwordfile username
The -c flag creates a new file.
minerva% htpasswd -c /home/e12/.htpasswd.demo guest
Adding password for guest
New password: *****
Re-type password: *****

Password for "guest" (and all other entries) is "guest". Entries for guest2, guest3, and guest4 are created without the "-c" flag, since the .htpasswd.demo file already exists.
Contents of file:

guest:79WeSn3vYGsKQ
guest2:PR4APgA.4CKO.
guest3:5DbCMPbSDstj2
guest4:htPnr8jT4bI5E

.htgroup.demo
Contents of file:

VIP: guest guest4
Access Control Example 1
Any valid user in .htpasswd.demo is allowed access
The "AuthName" is the description that is displayed by the browser in the Basic Authentication dialog box.
Contents of sample .htaccess file:

AuthName "Basic Authentication Tutorial 1"
AuthType Basic
AuthUserFile /home/e12/.htpasswd.demo
require valid-user

Demonstration of Example 1
You may login as any of the following users (username:password):
guest:guest
guest2:guest
guest3:guest
guest4:guest

minerva% lwp-request -USed -C"guest:iforgot" http://cscie12.dce.harvard.edu/apache/access
GET http://cscie12.dce.harvard.edu/apache/access/example1/
Authorization: Basic Z3Vlc3Q6aWZvcmdvdA==
User-Agent: lwp-request/2.06

GET http://cscie12.dce.harvard.edu/apache/access/example1/ --> 401 Authorization Required
GET http://cscie12.dce.harvard.edu/apache/access/example1/ --> 401 Authorization Required
Connection: close
Date: Wed, 13 Apr 2005 20:53:42 GMT
Server: Apache/2.0.49 (Fedora)
WWW-Authenticate: Basic realm="Basic Authentication Tutorial 1"
Content-Length: 492
Content-Type: text/html; charset=iso-8859-1
Client-Date: Wed, 13 Apr 2005 20:53:42 GMT
Client-Peer: 140.247.197.240:80
Client-Response-Num: 1
Client-Warning: Credentials for 'guest' failed before
Title: 401 Authorization Required
X-Pad: avoid browser bug

minerva% lwp-request -USed -C"guest2:guest2" http://cscie12.dce.harvard.edu/apache/access
GET http://cscie12.dce.harvard.edu/apache/access/example1/
Authorization: Basic Z3Vlc3QyOmd1ZXN0Mg==
User-Agent: lwp-request/2.06

GET http://cscie12.dce.harvard.edu/apache/access/example1/ --> 401 Authorization Required
GET http://cscie12.dce.harvard.edu/apache/access/example1/ --> 200 OK
Connection: close
Date: Wed, 13 Apr 2005 20:59:05 GMT
Accept-Ranges: bytes
Server: Apache/2.0.49 (Fedora)
Content-Length: 124
Content-Type: text/html; charset=UTF-8
Client-Date: Wed, 13 Apr 2005 20:59:05 GMT
Client-Peer: 140.247.197.240:80
Client-Response-Num: 1
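The Authorization headers in the lwp-request output are simply "username:password" encoded with Base64. The following Python sketch (an illustration added here, with hypothetical function names) builds and decodes such a header using only the standard library.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an Authorization header for Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def decode_basic_auth(header_value):
    """Recover (username, password) from a 'Basic ...' header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    # Only the first colon separates the username from the password.
    username, _, password = decoded.partition(":")
    return username, password


print(basic_auth_header("guest", "iforgot"))
print(decode_basic_auth("Basic Z3Vlc3QyOmd1ZXN0Mg=="))
```

Because Base64 is an encoding and not encryption, anyone who can see the request can recover the password, which is one reason to combine Basic Authentication with SSL.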
Access Control Example 2
Only certain users in .htpasswd.demo are allowed access
Contents of sample .htaccess file:

AuthName "Basic Authentication Tutorial 2"
AuthType Basic
AuthUserFile /home/e12/.htpasswd.demo
require user guest2 guest3

Demonstration of Example 2
Only guest2 and guest3 are authorized:
guest2:guest
guest3:guest
Unauthorized:
guest:guest
guest4:guest
Access Control Example 3
Only members of a particular group are allowed access
Contents of .htaccess file:

AuthName "Basic Authentication Tutorial 3"
AuthType Basic
AuthUserFile /home/e12/.htpasswd.demo
AuthGroupFile /home/e12/.htgroup.demo
require group VIP

Contents of .htgroup.demo file:

VIP: guest guest4

Demonstration of Example 3
Only members of the group "VIP" (as defined by /home/e12/.htgroup.demo) are authorized (guest and guest4):
guest:guest
guest4:guest
Unauthorized:
guest2:guest
guest3:guest
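Examples 1 to 3 differ only in their "require" line. As a rough illustration (a Python sketch with hypothetical function names, not Apache's actual implementation), the three forms can be evaluated for a user whose password has already been verified as follows.

```python
def parse_htgroup(text):
    """Parse 'group: member1 member2 ...' lines into {group: set(members)}."""
    groups = {}
    for line in text.splitlines():
        name, sep, members = line.partition(":")
        if sep and name.strip():
            groups[name.strip()] = set(members.split())
    return groups


def is_authorized(user, require, groups):
    """Evaluate one 'require' line for an already-authenticated user."""
    words = require.split()
    if words and words[0].lower() == "require":
        words = words[1:]  # drop the directive name itself
    kind, *names = words
    if kind == "valid-user":
        return True  # any user found in the password file
    if kind == "user":
        return user in names
    if kind == "group":
        return any(user in groups.get(group, set()) for group in names)
    raise ValueError(f"unsupported require type: {kind}")


groups = parse_htgroup("VIP: guest guest4")
for user in ("guest", "guest2", "guest3", "guest4"):
    print(user, is_authorized(user, "require group VIP", groups))
```

With the demo group file, only guest and guest4 pass "require group VIP", which matches the demonstration for Example 3.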
Access Control Example 4
Only certain computers are allowed access
Contents of sample .htaccess file:

order deny,allow
deny from all
allow from 140.247
allow from 128.103
allow from .harvard.edu

Demonstration of Example 4
Computers that are on the Harvard network (computers with hostnames ending in .harvard.edu or with IP addresses beginning with 128.103 or 140.247) will have access; others will be denied.
Access Control Example 5
Only certain computers are denied access
Contents of sample .htaccess file:

order allow,deny
allow from all
deny from .fas.harvard.edu

Demonstration of Example 5
Connections from within the domain 'fas.harvard.edu' will be denied.
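The "order" directive decides both the default and which of "allow" and "deny" wins when both match. The Python sketch below models that logic for Examples 4 and 5; it is a simplification with hypothetical function names (the real server also verifies hostnames through DNS lookups), and the client addresses and hostnames in the usage lines are made up for illustration.

```python
def matches(spec, ip, host):
    """Return True if an 'allow from' / 'deny from' value matches the client."""
    if spec.lower() == "all":
        return True
    if spec[0].isdigit():
        # Partial IP address: compare whole dotted components.
        prefix = spec.rstrip(".")
        return ip == prefix or ip.startswith(prefix + ".")
    # (Partial) domain name: compare whole name components.
    domain = spec.lstrip(".").lower()
    host = host.lower()
    return host == domain or host.endswith("." + domain)


def host_allowed(order, allow, deny, ip, host):
    """Apply the rules for 'order deny,allow' or 'order allow,deny'."""
    allowed = any(matches(spec, ip, host) for spec in allow)
    denied = any(matches(spec, ip, host) for spec in deny)
    if order == "deny,allow":
        # Default is allow; an allow match overrides a deny match.
        return allowed or not denied
    if order == "allow,deny":
        # Default is deny; a deny match overrides an allow match.
        return allowed and not denied
    raise ValueError(f"unsupported order: {order}")


example4 = dict(order="deny,allow", deny=["all"],
                allow=["140.247", "128.103", ".harvard.edu"])
example5 = dict(order="allow,deny", allow=["all"], deny=[".fas.harvard.edu"])

print(host_allowed(**example4, ip="140.247.197.240", host="minerva.dce.harvard.edu"))
print(host_allowed(**example4, ip="192.0.2.10", host="www.example.com"))
print(host_allowed(**example5, ip="192.0.2.10", host="host.fas.harvard.edu"))
print(host_allowed(**example5, ip="192.0.2.10", host="www.example.com"))
```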
Access Control Example 6
Certain computers are allowed in; others must provide a username and password
Contents of sample .htaccess file:

order deny,allow
deny from all
allow from .yale.edu
AuthType Basic
AuthUserFile /home/e12/.htpasswd.demo
AuthName "Basic Authentication Tutorial 6"
require valid-user
satisfy any

Demonstration of Example 6
Connections from within ".yale.edu" will be allowed; others must provide a valid username and password.
Access Control Example 7
Only certain computers are allowed in and users must provide a valid username and password.
Contents of sample .htaccess file:

order deny,allow
deny from all
allow from .harvard.edu
AuthType Basic
AuthUserFile /home/e12/.htpasswd.demo
AuthName "Basic Authentication Tutorial 7"
require valid-user
satisfy all

Demonstration of Example 7 and satisfy all
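Examples 6 and 7 use the same host rules and the same user rules; only the "satisfy" line differs. A small Python sketch (hypothetical function name, illustration only) makes the difference explicit by printing the outcome for every combination of the two checks.

```python
def access_granted(satisfy, host_ok, user_ok):
    """Combine the host-based and user-based checks per the satisfy directive."""
    if satisfy == "any":
        return host_ok or user_ok   # either check is sufficient
    if satisfy == "all":
        return host_ok and user_ok  # both checks must pass
    raise ValueError(f"unsupported satisfy value: {satisfy}")


for satisfy in ("any", "all"):
    for host_ok in (True, False):
        for user_ok in (True, False):
            print(satisfy, host_ok, user_ok,
                  access_granted(satisfy, host_ok, user_ok))
```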
Requiring SSL (https://)
SSL (Secure Socket Layer) is a protocol that encrypts data between the client and the server. https is HTTP over SSL. More details in our last lecture on Security and Privacy.
Contents of sample .htaccess file:

SSLRequireSSL

Allowed: https://www.people.fas.harvard.edu/~heitmey/secure/index.html
Forbidden: http://www.people.fas.harvard.edu/~heitmey/secure/index.html
Details about enabling .htaccess and allowed directives
Context: can these directives be in .htaccess files?
AllowOverride: is the server configured to allow this group of directives to be overridden in this location?
Is the required module loaded?
Legal Directives I: Context
Certain Apache directives are legal within .htaccess files. Some are not. See the Apache Documentation for details. Specifically, look at the Context line that is given for the directive in question.
Apache Core Features http://www.apache.org/docs/2.0/mod/core.html
Apache Module List http://www.apache.org/docs/2.0/mod/
standard Apache Directives http://www.apache.org/docs/2.0/mod/directives.html
The following is an excerpt from the Apache HTTP Server Version 1.3 documentation
ErrorDocument directive
Syntax: ErrorDocument error-code document
Context: server config, virtual host, directory, .htaccess
Status: core
Override: FileInfo
Compatibility: The directory and .htaccess contexts are only available in Apache 1.1 and later.
Also, the "a" indicator on the Apache Quick Reference Card indicates that the directive is valid within an .htaccess file.
Legal Directives II: AllowOverride
Users are allowed to override certain aspects of the main server configuration. The main server configuration file (httpd.conf) contains an AllowOverride directive that determines which directives within .htaccess files Apache will process. The Override line that is given for each directive in the Apache documentation indicates which AllowOverride category must be enabled in order to use that directive within an .htaccess file.
For the FAS system, the main server configuration file has the following directive in place for users' public_html directories:

AllowOverride FileInfo AuthConfig Limit Indexes Options

The following is an excerpt from the Apache HTTP Server Version 1.3 documentation
ErrorDocument directive
Syntax: ErrorDocument error-code document
Context: server config, virtual host, directory, .htaccess
Status: core
Override: FileInfo
Compatibility: The directory and .htaccess contexts are only available in Apache 1.1 and later.
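The relationship between a directive's Override line and the server's AllowOverride setting can be expressed as a simple lookup. The Python sketch below is an illustration only: the table covers just a handful of directives used in this lecture (with the Override categories listed in the Apache documentation), and the function name is hypothetical.

```python
# Override category required by a few directives (illustrative subset only).
OVERRIDE_CATEGORY = {
    "errordocument": "FileInfo",
    "authtype": "AuthConfig",
    "authname": "AuthConfig",
    "authuserfile": "AuthConfig",
    "authgroupfile": "AuthConfig",
    "require": "AuthConfig",
    "order": "Limit",
    "allow": "Limit",
    "deny": "Limit",
}


def usable_in_htaccess(directive, allow_override):
    """Check whether the AllowOverride value enables the directive's category."""
    enabled = set(allow_override.split())
    if "None" in enabled:
        return False
    if "All" in enabled:
        return True
    # Directive names are case-insensitive, so look them up in lowercase.
    return OVERRIDE_CATEGORY[directive.lower()] in enabled


fas_setting = "FileInfo AuthConfig Limit Indexes Options"
print(usable_in_htaccess("ErrorDocument", fas_setting))
print(usable_in_htaccess("require", "FileInfo Indexes"))
```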
Legal Directives III: Apache Modules
Apache is distributed with several modules. These modules may or may not be active within the Apache server with which you are working. The Core features will always be available.
For example, if the Rewrite Module (mod_rewrite) has not been activated, none of the Rewrite directives will be available to use.
Refer to the Status and Module lines in the documentation for each directive and to the documentation for the specific Apache installation you are using.
Apache Modules
On the Apache (Apache/2.0.51) minerva.dce.harvard.edu web server, the following Apache modules are active:

LoadModule access_module modules/mod_access.so
LoadModule auth_module modules/mod_auth.so
LoadModule auth_anon_module modules/mod_auth_anon.so
LoadModule auth_dbm_module modules/mod_auth_dbm.so
LoadModule auth_digest_module modules/mod_auth_digest.so
LoadModule ldap_module modules/mod_ldap.so
LoadModule auth_ldap_module modules/mod_auth_ldap.so
LoadModule include_module modules/mod_include.so
LoadModule log_config_module modules/mod_log_config.so
LoadModule env_module modules/mod_env.so
LoadModule mime_magic_module modules/mod_mime_magic.so
LoadModule cern_meta_module modules/mod_cern_meta.so
LoadModule expires_module modules/mod_expires.so
LoadModule deflate_module modules/mod_deflate.so
LoadModule headers_module modules/mod_headers.so
LoadModule usertrack_module modules/mod_usertrack.so
LoadModule setenvif_module modules/mod_setenvif.so
LoadModule mime_module modules/mod_mime.so
LoadModule dav_module modules/mod_dav.so
LoadModule status_module modules/mod_status.so
LoadModule autoindex_module modules/mod_autoindex.so
LoadModule asis_module modules/mod_asis.so
LoadModule info_module modules/mod_info.so
LoadModule dav_fs_module modules/mod_dav_fs.so
LoadModule vhost_alias_module modules/mod_vhost_alias.so
LoadModule negotiation_module modules/mod_negotiation.so
LoadModule dir_module modules/mod_dir.so
LoadModule imap_module modules/mod_imap.so
LoadModule actions_module modules/mod_actions.so
LoadModule speling_module modules/mod_speling.so
LoadModule userdir_module modules/mod_userdir.so
LoadModule alias_module modules/mod_alias.so
LoadModule rewrite_module modules/mod_rewrite.so
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_ftp_module modules/mod_proxy_ftp.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule proxy_connect_module modules/mod_proxy_connect.so
LoadModule cache_module modules/mod_cache.so
LoadModule suexec_module modules/mod_suexec.so
LoadModule disk_cache_module modules/mod_disk_cache.so
LoadModule file_cache_module modules/mod_file_cache.so
LoadModule mem_cache_module modules/mod_mem_cache.so
LoadModule cgi_module modules/mod_cgi.so
LoadModule dav_svn_module /usr/lib/httpd/modules/mod_dav_svn.so
LoadModule authz_svn_module /usr/lib/httpd/modules/mod_authz_svn.so
Webmaster Tools
Site Icons ('favicon.ico')
Web Robots
  Link Checking
  Search Robots
Other Webmaster Tools
  HTML/CSS Validation
  Accessibility Compliance
  Web Site Mirroring
  Converting HTML to other formats
  HTTP Server Performance
Log Analysis
Site Icons
favicon.ico at the root of the web site, or a "link" element in the "head" element of the XHTML/HTML document
MSIE uses http://www.somesite.com/favicon.ico for icons in the bookmark list.
Firefox uses favicon.ico or link element, rel="icon", in the location bar, bookmark list and tab display.
The code in the 'head' of the XHTML would look something like:

<link rel="icon" href="images/mozilla-16.png" type="image/png"/>
<link rel="shortcut icon" href="images/mozilla.ico" type="image/x-icon"/>
SEO: Search Engine Optimization
Make your site ready for search engines
well-formed (and hopefully valid) HTML/XHTML.
use mark-up language for headings and lists
titles that stand on their own
"meta" keywords and description
An example using O'Reilly OnLamp.com
In "head" element of page:

<meta name="keywords" content="ONLamp.com,O'Reilly Network,oreillynet,
oreillynet.com,O'Reilly,OREILLY,o'reilly network,o'reilly,
onlamp.com,lamp,lampp,linux,apache,mysql,perl,python,
php,linux,bsd,web development,server development reference,
technical information,open source" />

<meta name="description" content="Welcome to ONLamp.com,
the high performance web development site from the O'Reilly Network
offering comprehensive Lamp developer information and resources.
O'Reilly Network's ONLamp site features original articles,
news and commentary." />
Firefox as a Web Development Tool
Web Developer Extension
Firefox Extension - Live HTTP Headers
Firefox Extension - Firebug
xurl and churl
xurl. A simple Perl script that extracts the links from a single page. Adapted from The Perl Cookbook.
minerva% xurl URL
churl. A simple Perl script that checks the links on a single page. Adapted from The Perl Cookbook.
minerva% churl URL
xurl
minerva% xurl http://www.extension.harvard.edu/
http://dceweb.harvard.edu/prod/sowxcrq.taf?school=EXT
http://dceweb.harvard.edu/prod/sswckce.taf?wgrp=EXT
http://my.extension.harvard.edu/
http://www.extension.harvard.edu/2006-07/courses/
http://www.extension.harvard.edu/2006-07/courses/DistanceEd/courses/
http://www.extension.harvard.edu/2006-07/courses/citations.jsp
http://www.extension.harvard.edu/2006-07/forms/
http://www.extension.harvard.edu/2006-07/help/directory.jsp;jsessionid=HGELDPGDCCIK
http://www.extension.harvard.edu/2006-07/images/go2.jpg
http://www.extension.harvard.edu/2006-07/images/home.jpg
http://www.extension.harvard.edu/2006-07/images/profiles/default-5.jpg
http://www.extension.harvard.edu/2006-07/images/snaps.jpg
http://www.extension.harvard.edu/2006-07/images/veri.gif
http://www.extension.harvard.edu/2006-07/news/;jsessionid=HGELDPGDCCIK
http://www.extension.harvard.edu/2006-07/news/chaisson.jsp;jsessionid=HGELDPGDCCIK
http://www.extension.harvard.edu/2006-07/news/creatures.jsp;jsessionid=HGELDPGDCCIK
http://www.extension.harvard.edu/2006-07/news/earthday.jsp;jsessionid=HGELDPGDCCIK
http://www.extension.harvard.edu/2006-07/news/gittleman.jsp;jsessionid=HGELDPGDCCIK
http://www.extension.harvard.edu/2006-07/news/retirement.jsp;jsessionid=HGELDPGDCCIK
http://www.extension.harvard.edu/2006-07/news/volunteers.jsp;jsessionid=HGELDPGDCCIK
http://www.extension.harvard.edu/2006-07/overview/
http://www.extension.harvard.edu/2006-07/overview/tradition.jsp
http://www.extension.harvard.edu/2006-07/overview/video/
http://www.extension.harvard.edu/2006-07/overview/welcome.jsp
http://www.extension.harvard.edu/2006-07/programs/
http://www.extension.harvard.edu/2006-07/programs/default.jsp#cert
http://www.extension.harvard.edu/2006-07/programs/info.jsp
http://www.extension.harvard.edu/2006-07/register/
http://www.extension.harvard.edu/2006-07/register/financial/
http://www.extension.harvard.edu/2006-07/register/financial/finaid.jsp
http://www.extension.harvard.edu/2006-07/register/guidelines/calendar/
http://www.extension.harvard.edu/2006-07/register/guidelines/international.jsp
http://www.extension.harvard.edu/2006-07/register/policies/transcripts.jsp
http://www.extension.harvard.edu/2006-07/stylesheets/home-print.css
http://www.extension.harvard.edu/2006-07/stylesheets/home-screen.css
http://www.extension.harvard.edu/DistanceEd/
http://www.extension.harvard.edu/chooser;jsessionid=HGELDPGDCCIK
http://www.google-analytics.com/urchin.js
https://dceweb.harvard.edu/prod/gowlogn3.taf
javascript:popUp('/2006-07/snapshots/')
javascript:popUp2('/2006-07/profiles/default.jsp?n=5')
mailto:[email protected]
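The xurl script itself is written in Perl and is not reproduced in these slides. As a rough equivalent, the following Python sketch (hypothetical class and function names, standard library only) collects the href and src attributes of a page, resolves them against the page's URL, and returns them sorted and without duplicates. The sample page and the www.example.edu host are invented for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect every href/src attribute value as an absolute URL."""

    URL_ATTRS = {"href", "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                # Relative links are resolved against the page's own URL.
                self.links.add(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the sorted, de-duplicated links found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return sorted(collector.links)


page = """<html><head>
<link rel="stylesheet" href="stylesheets/home-screen.css"/>
</head><body>
<a href="/overview/">Overview</a>
<img src="images/home.jpg" alt=""/>
<a href="/overview/">Overview again</a>
</body></html>"""

for link in extract_links(page, "http://www.example.edu/2006-07/"):
    print(link)
```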
churl
minerva$ churl http://www.extension.harvard.edu/
http://www.extension.harvard.edu/:
200 OK http://dceweb.harvard.edu/prod/sowxcrq.taf?school=EXT
200 OK http://dceweb.harvard.edu/prod/sswckce.taf?wgrp=EXT
200 OK http://my.extension.harvard.edu/
200 OK http://www.extension.harvard.edu/2006-07/courses/
200 OK http://www.extension.harvard.edu/2006-07/courses/DistanceEd/courses/
200 OK http://www.extension.harvard.edu/2006-07/courses/citations.jsp
200 OK http://www.extension.harvard.edu/2006-07/forms/
200 OK http://www.extension.harvard.edu/2006-07/help/directory.jsp;jsessionid=KPLBEAHDCC
200 OK http://www.extension.harvard.edu/2006-07/images/go2.jpg
200 OK http://www.extension.harvard.edu/2006-07/images/home.jpg
200 OK http://www.extension.harvard.edu/2006-07/images/profiles/default-5.jpg
200 OK http://www.extension.harvard.edu/2006-07/images/snaps.jpg
200 OK http://www.extension.harvard.edu/2006-07/images/veri.gif
200 OK http://www.extension.harvard.edu/2006-07/news/;jsessionid=KPLBEAHDCCIK
200 OK http://www.extension.harvard.edu/2006-07/news/chaisson.jsp;jsessionid=KPLBEAHDCCI
200 OK http://www.extension.harvard.edu/2006-07/news/creatures.jsp;jsessionid=KPLBEAHDCC
200 OK http://www.extension.harvard.edu/2006-07/news/earthday.jsp;jsessionid=KPLBEAHDCCI
200 OK http://www.extension.harvard.edu/2006-07/news/gittleman.jsp;jsessionid=KPLBEAHDCC
200 OK http://www.extension.harvard.edu/2006-07/news/retirement.jsp;jsessionid=KPLBEAHDC
200 OK http://www.extension.harvard.edu/2006-07/news/volunteers.jsp;jsessionid=KPLBEAHDC
200 OK http://www.extension.harvard.edu/2006-07/overview/
200 OK http://www.extension.harvard.edu/2006-07/overview/tradition.jsp
200 OK http://www.extension.harvard.edu/2006-07/overview/video/
200 OK http://www.extension.harvard.edu/2006-07/overview/welcome.jsp
200 OK http://www.extension.harvard.edu/2006-07/programs/
200 OK http://www.extension.harvard.edu/2006-07/programs/default.jsp#cert
200 OK http://www.extension.harvard.edu/2006-07/programs/info.jsp
200 OK http://www.extension.harvard.edu/2006-07/register/
200 OK http://www.extension.harvard.edu/2006-07/register/financial/
200 OK http://www.extension.harvard.edu/2006-07/register/financial/finaid.jsp
200 OK http://www.extension.harvard.edu/2006-07/register/guidelines/calendar/
200 OK http://www.extension.harvard.edu/2006-07/register/guidelines/international.jsp
200 OK http://www.extension.harvard.edu/2006-07/register/policies/transcripts.jsp
200 OK http://www.extension.harvard.edu/2006-07/stylesheets/home-print.css
200 OK http://www.extension.harvard.edu/2006-07/stylesheets/home-screen.css
200 OK http://www.extension.harvard.edu/DistanceEd/
200 OK http://www.extension.harvard.edu/chooser;jsessionid=KPLBEAHDCCIK
200 OK http://www.google-analytics.com/urchin.js
SKIP https://dceweb.harvard.edu/prod/gowlogn3.taf
SKIP javascript:popUp('/2006-07/snapshots/')
SKIP javascript:popUp2('/2006-07/profiles/default.jsp?n=5')
SKIP mailto:[email protected]
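The churl output suggests two steps: links that are not plain http: URLs are reported as SKIP, and the rest are requested and reported with their status. The Python sketch below reproduces that behaviour under those assumptions; it is not the Perl script from the lecture, the function names are hypothetical, and it uses a HEAD request so that no document body is transferred.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def should_check(url):
    """Only plain http: links are checked; all other schemes are skipped."""
    return urlsplit(url).scheme == "http"


def check_link(url, timeout=10):
    """Return a status string such as '200 OK' for one URL."""
    if not should_check(url):
        return "SKIP"
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return f"{response.status} {response.reason}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return f"{err.code} {err.reason}"
    except urllib.error.URLError as err:
        # No HTTP answer at all (DNS failure, refused connection, ...).
        return f"ERROR {err.reason}"


for url in ("https://dceweb.harvard.edu/prod/gowlogn3.taf",
            "javascript:popUp('/2006-07/snapshots/')"):
    print(check_link(url), url)
```

Calling check_link on an http: URL performs a real network request, so the usage lines above only exercise the SKIP path.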
Page Weight
Page weight of http://www.harvard.edu/
Firefox Web Developer Tool Bar
timefetch
[dheitmey@minerva dheitmey]$ timefetch -rv http://www.harvard.edu/
0.007 1.6kb: http://www.harvard.edu/images/global/shield1.gif
0.007 7.4kb: http://www.harvard.edu/images/global/shield2.gif
0.007 11.9kb: http://www.harvard.edu/images/global/banner.gif
0.006 2.6kb: http://www.harvard.edu/images/global/shield3.gif
0.007 0.2kb: http://www.harvard.edu/images/global/home2.gif
0.006 0.1kb: http://www.harvard.edu/images/global/nav_bullet.gif
0.006 0.7kb: http://www.harvard.edu/images/global/admissions.gif
0.006 0.4kb: http://www.harvard.edu/images/global/employment.gif
0.006 0.3kb: http://www.harvard.edu/images/global/libraries.gif
0.006 0.3kb: http://www.harvard.edu/images/global/museums.gif
0.006 0.2kb: http://www.harvard.edu/images/global/arts.gif
0.006 0.4kb: http://www.harvard.edu/images/global/president.gif
0.006 0.1kb: http://www.harvard.edu/images/global/nav2_bullet.gif
0.006 0.4kb: http://www.harvard.edu/images/global/administration.gif
0.006 0.3kb: http://www.harvard.edu/images/global/schools.gif
0.006 0.6kb: http://www.harvard.edu/images/global/neighbors.gif
0.006 0.3kb: http://www.harvard.edu/images/global/athletics.gif
0.006 0.3kb: http://www.harvard.edu/images/global/alumni.gif
0.006 0.3kb: http://www.harvard.edu/images/global/search.gif
0.006 0.7kb: http://www.harvard.edu/includes/inst_image/titles/062.gif
0.008 28.7kb: http://www.harvard.edu/includes/inst_image/images/062.jpg
0.006 0.8kb: http://www.harvard.edu/images/home/schools_header.gif
0.006 0.0kb: http://www.harvard.edu/images/home/spacer.gif
0.006 0.4kb: http://www.harvard.edu/images/home/sch_bus.gif
0.006 0.4kb: http://www.harvard.edu/images/home/sch_eng.gif
0.006 0.3kb: http://www.harvard.edu/images/home/sch_under.gif
0.006 0.4kb: http://www.harvard.edu/images/home/sch_gov.gif
0.006 0.5kb: http://www.harvard.edu/images/home/sch_dce.gif
0.006 0.5kb: http://www.harvard.edu/images/home/sch_grad.gif
0.006 0.3kb: http://www.harvard.edu/images/home/sch_dental.gif
0.006 0.2kb: http://www.harvard.edu/images/home/sch_law.gif
0.006 0.3kb: http://www.harvard.edu/images/home/sch_des.gif
0.006 0.3kb: http://www.harvard.edu/images/home/sch_med.gif
0.006 0.3kb: http://www.harvard.edu/images/home/sch_div.gif
0.006 0.4kb: http://www.harvard.edu/images/home/sch_health.gif
0.006 0.4kb: http://www.harvard.edu/images/home/sch_ed.gif
0.006 0.5kb: http://www.harvard.edu/images/home/sch_rad.gif
0.006 0.1kb: http://www.harvard.edu/images/home/event.gif
0.006 0.1kb: http://www.harvard.edu/images/home/research.gif
0.006 0.1kb: http://www.harvard.edu/images/home/video.gif
0.008 36.1kb: http://www.harvard.edu/images/home/news/070416a.jpg
0.008 46.9kb: http://www.harvard.edu/images/home/news/070415.jpg
0.006 0.4kb: http://www.harvard.edu/images/home/news/othernews.gif
0.007 19.1kb: http://www.harvard.edu/images/home/news/drew_t.jpg
0.006 0.4kb: http://www.harvard.edu/images/global/about.gif
0.006 0.1kb: http://www.harvard.edu/images/global/footer_bull.gif
0.006 0.4kb: http://www.harvard.edu/images/global/directories.gif
0.006 0.4kb: http://www.harvard.edu/images/global/contact.gif
0.006 0.3kb: http://www.harvard.edu/images/global/infotech.gif
0.006 0.3kb: http://www.harvard.edu/images/global/news.gif
0.006 0.4kb: http://www.harvard.edu/images/global/siteguide.gif
----------------------------------------------------
0.014 21.5kb: http://www.harvard.edu/
0.445 190.8kb: http://www.harvard.edu/ (incl.: img)
minerva% timefetch -h
Usage: /usr/local/bin/timefetch [dhFjrvXz] [-f host] [-a attempts] [-b broken_images]
       [-s size] [-T text] [-t timeout] http://url [http://url ... ]
/usr/local/bin/timefetch -h for help message

  -a  Number of attempts for the initial page fetch.
  -b  Minimum number of broken images to trigger alarm.
  -d  Debug: view all kinds of marginally useful output.
  -f  Force host: before doing recursive downloads, munge each URL
      and replace the host in the URL with some other host.
  -h  Help: print this help message.
  -j  Java: download java applets as well.
  -F  No frames: If the page is a frameset, do *not* fetch the frames.
      Default is to fetch them.
  -r  Recursive: download all images and calculate cumulative time.
  -s  Minimum size for the entire document (in kilobytes).
  -t  Timeout value for HTTP requests.
  -T  HTML text to scan for (such as "</html>"). Not case sensitive.
  -v  Verbose: print out URLs as they are downloaded.
  -X  Don't exit on errors, just try to continue.
  -z  Exit immediately on errors in fetching the main page.

  NOTE: This program always downloads embedded frames and prints
  a cumulative total for frames and framesets, even if you did not
  specify a recursive download.
timefetch examples
timefetch is in need of updating (does not parse CSS and will not get images referenced in CSS), but can still be a useful tool
timefetch will show the actual download time (often not useful) and the total kilobytes downloaded (often useful). Warning: timefetch will not execute JavaScript, nor does it fetch images included by CSS.
Compare to:
[dheitmey@minerva dheitmey]$ timefetch -rv http://cscie12.dce.harvard.edu/
0.089 17.9kb: http://images.amazon.com/images/P/059610197X.01._AA240_SCLZZZZZZZ_.jpg
1.054 11.2kb: http://ec1.images-amazon.com/images/P/0596009879.01._AA240_SCLZZZZZZZ_V37
----------------------------------------------------
0.788 17.6kb: http://isites.harvard.edu/icb/icb.do?keyword=k12622
2.033 46.6kb: http://cscie12.dce.harvard.edu/ (incl.: img)
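The "total kilobytes" figure is just the size of the page plus the sizes of the images it references. The Python sketch below shows that calculation; it is an illustration with hypothetical names, not the timefetch program. The fetch argument is any function that returns the bytes for a URL, so the example can use a small in-memory site (invented for the illustration) instead of the network. Like timefetch, the sketch ignores images referenced from CSS and anything added by JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the absolute URLs of all <img> elements, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                url = urljoin(self.base_url, src)
                if url not in self.images:
                    self.images.append(url)


def page_weight(url, fetch):
    """Return ({url: size in bytes}, total bytes) for a page and its images."""
    body = fetch(url)
    collector = ImageCollector(url)
    collector.feed(body.decode("utf-8", errors="replace"))
    sizes = {url: len(body)}
    for image_url in collector.images:
        sizes[image_url] = len(fetch(image_url))
    return sizes, sum(sizes.values())


# A made-up in-memory "site" standing in for real HTTP requests.
site = {
    "http://www.example.edu/":
        b'<html><body><img src="images/a.gif"><img src="/images/b.jpg"></body></html>',
    "http://www.example.edu/images/a.gif": b"x" * 200,
    "http://www.example.edu/images/b.jpg": b"x" * 1000,
}
sizes, total = page_weight("http://www.example.edu/", site.__getitem__)
for resource, size in sizes.items():
    print(f"{size / 1024:.1f}kb: {resource}")
print(f"{total / 1024:.1f}kb: total")
```

To measure a real page, fetch could be replaced by a function that downloads the URL, for example with urllib.request.urlopen(url).read().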
Web Robots
Robots, Spiders, Crawlers
As they "spider" a site, the robots can perform various actions, such as:
Gathering content for search engines or a website mirror
Validating, checking, or processing content
Spidering Behavior: an example with Lynx
minerva% lynx -traversal \
> -crawl \
> -realm cscie12.dce.harvard.edu \
> http://cscie12.dce.harvard.edu/lecture_notes/2006-07/20070410/toc.html

After lynx is done, here's a look at the files we have:

minerva% ls -l
total 540
-rw------- 1 dheitmey teaching 3559 Apr 17 15:34 lnk00000000.dat
-rw------- 1 dheitmey teaching 59557 Apr 17 15:34 lnk00000001.dat
-rw------- 1 dheitmey teaching 13851 Apr 17 15:34 lnk00000002.dat
-rw------- 1 dheitmey teaching 757 Apr 17 15:34 lnk00000003.dat
-rw------- 1 dheitmey teaching 3562 Apr 17 15:34 lnk00000004.dat
..... truncated ...
-rw------- 1 dheitmey teaching 838 Apr 17 15:34 lnk00000101.dat
-rw------- 1 dheitmey teaching 2659 Apr 17 15:34 lnk00000102.dat
-rw------- 1 dheitmey teaching 28336 Apr 17 15:34 reject.dat
-rw------- 1 dheitmey teaching 20385 Apr 17 15:34 traverse2.dat
-rw------- 1 dheitmey teaching 7709 Apr 17 15:34 traverse.dat

the "lnkNNNNNNNN.dat" files contain the text dump of the pages lynx retrieved
the "traverse.dat" files contain the list of links that lynx retrieved
the "reject.dat" files contain a list of URLs that lynx did not fetch (because they are outside the "realm" specified on the command line).
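The split between traversed and rejected URLs can be modelled with a short breadth-first traversal. The Python sketch below is an illustration only: it assumes the realm is a single hostname, as in the command above, and lynx's own rules for what counts as inside the realm may differ. The class and function names and the in-memory test site are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class AnchorCollector(HTMLParser):
    """Collect the absolute, fragment-free targets of <a href> links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(urldefrag(urljoin(self.base_url, href)).url)


def crawl(start_url, realm, fetch):
    """Breadth-first traversal; returns (traversed, rejected) URL lists."""
    traversed, rejected = [], []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        traversed.append(url)
        collector = AnchorCollector(url)
        collector.feed(fetch(url))
        for link in collector.links:
            if link in seen:
                continue  # each URL is handled only once
            seen.add(link)
            if urlsplit(link).hostname == realm:
                queue.append(link)
            else:
                rejected.append(link)  # outside the realm: record, do not fetch
    return traversed, rejected


# A made-up in-memory site standing in for real HTTP requests.
site = {
    "http://a.example/": '<a href="/p1.html">1</a> <a href="http://b.example/">ext</a>',
    "http://a.example/p1.html": '<a href="/">home</a> <a href="p2.html#top">2</a>',
    "http://a.example/p2.html": "no links here",
}
traversed, rejected = crawl("http://a.example/", "a.example", site.__getitem__)
print(traversed)
print(rejected)
```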
Robots Exclusion Standard (RES)
The Web Robots Pages http://www.robotstxt.org/wc/robots.html
A Standard for Robot Exclusion http://www.robotstxt.org/wc/norobots.html
The Web Robots FAQ
RES provides two mechanisms to instruct robots that visit your site:
1. robots.txt file
2. robots meta tag
robots.txt
Two directives:
User-Agent
Disallow
Note: robots.txt must lie at the root level of the server.
Examples of robots.txt files:
http://www.fas.harvard.edu/robots.txt
http://www.foxnews.com/robots.txt
find them at a couple of your favorite web sites
Why won't the following robots.txt files do anything useful? (They aren't at the root level of the server.)
http://www.people.fas.harvard.edu/~jharvard/robots.txt
http://www.fas.harvard.edu/computing/robots.txt
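A robot works out where to look for robots.txt from the scheme and host of the page alone, which is why a copy inside a user or department directory is never consulted. The following Python sketch (hypothetical function name) shows the calculation.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the only robots.txt location a robot checks for this page."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.people.fas.harvard.edu/~jharvard/index.html"))
print(robots_txt_url("http://www.fas.harvard.edu/computing/"))
```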
robots.txt Examples
http://www.npr.org/robots.txt

Disallow all robots from certain areas:
User-agent: *
Disallow: /cgi-bin
Disallow: /ramfiles/
Disallow: /*.smil
Disallow: /*.asx
Disallow: /*.ram
Disallow: /*.rmm
Disallow: /*.js
Disallow: /*.au
Disallow: /stations/force/force_localization.php?
Disallow: /rundowns/segment.php?

http://www.foxnews.com/robots.txt

User-agent: *
Disallow: /

User-agent: fusionbot
User-agent: Googlebot
Disallow: /printer_friendly_story

User-agent: Mediapartners-Google*
Disallow: /printer_friendly_story

User-agent: Teoma
Disallow: /printer_friendly_story

User-agent: yahoo-newscrawler
Disallow: /printer_friendly_story

User-agent: Yahoo! Slurp
Disallow: /printer_friendly_story

User-agent: newslookup-bot
Disallow: /printer_friendly_story

User-agent: gsa-crawler
Disallow: /printer_friendly_story
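Python's standard library includes a robots.txt parser, which gives a quick way to see how a well-behaved robot would read rules like these. The sketch below feeds it shortened versions of the two files (plain path prefixes only, without the wildcard lines); the robot name "ExampleBot", the helper name build_parser, and the story paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


npr_rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /ramfiles/
"""

fox_rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /printer_friendly_story
"""

npr = build_parser(npr_rules)
print(npr.can_fetch("ExampleBot", "http://www.npr.org/cgi-bin/search"))
print(npr.can_fetch("ExampleBot", "http://www.npr.org/index.html"))

fox = build_parser(fox_rules)
print(fox.can_fetch("ExampleBot", "http://www.foxnews.com/story/1"))
print(fox.can_fetch("Googlebot", "http://www.foxnews.com/story/1"))
print(fox.can_fetch("Googlebot", "http://www.foxnews.com/printer_friendly_story/1"))
```

In the second file, robots that are not named fall under "User-agent: *" and are excluded from the whole site, while a named robot such as Googlebot is only kept away from the printer-friendly pages.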
Robots meta element
name="robots"
content
    index or noindex
    follow or nofollow
The Robots meta element can be used on a per document basis.
OK to index page; OK to follow links on page
    <meta name="robots" content="index,follow"/>
OK to index page; Don't follow links on page
    <meta name="robots" content="index,nofollow"/>
Don't index page; OK to follow links on page
    <meta name="robots" content="noindex,follow"/>
Don't index page; Don't follow links on page
    <meta name="robots" content="noindex,nofollow"/>
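A robot has to read this element out of each page before deciding what to do. The following is a minimal sketch of that step using Python's html.parser. It assumes that a page without a robots meta element is treated as index,follow; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the index/follow decision from a robots meta element."""

    def __init__(self):
        super().__init__()
        # Assumed defaults when no robots meta element is present.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for token in (attributes.get("content") or "").lower().split(","):
            token = token.strip()
            if token == "noindex":
                self.index = False
            elif token == "nofollow":
                self.follow = False


def robots_policy(html: str) -> tuple:
    """Return (may_index, may_follow) for an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


print(robots_policy('<meta name="robots" content="noindex,follow"/>'))
```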
Link Checking Robots
Check the links on a single page or on an entire site.
If the robot follows links, it must issue a GET request to retrieve the document body; otherwise a HEAD request is enough.
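The GET versus HEAD choice can be sketched in a few lines. This assumes a checker built on Python's urllib.request; build_check_request is a hypothetical helper, and no request is actually sent here.

```python
from urllib.request import Request


def build_check_request(url: str, follow_links: bool) -> Request:
    """Prepare the request a link checker would send for url.

    GET is needed when the body will be parsed for further links;
    HEAD returns only the status line and headers, which is enough
    to confirm that the link resolves.
    """
    method = "GET" if follow_links else "HEAD"
    return Request(url, method=method)


request = build_check_request("http://cscie12.dce.harvard.edu/", follow_links=False)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would perform the actual check.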
Examples of Link Checking Robots
churl
checklink
webbot
checkbot
webcheck
lynx
checklink
W3C Link Checker http://validator.w3.org/docs/checklink
Use online: http://validator.w3.org/checklink
Use command line:
    minerva% checklink URL
Perl, Free
checklink
    minerva% checklink --help
    W3C checklink version 3.6.2.26 (c) 1999-2004 W3C
    Usage: checklink <options> <uris>
    Options:
      -s/--summary               Result summary only.
      -b/--broken                Show only the broken links, not the redirects.
      -e/--directory             Hide directory redirects, for example
                                 http://www.w3.org/TR -> http://www.w3.org/TR/
      -r/--recursive             Check the documents linked from the first one.
      -D/--depth n               Check the documents linked from the first one
                                 to depth n (implies --recursive).
      -l/--location uri          Scope of the documents checked in recursive mode.
                                 By default, for example for
                                 http://www.w3.org/TR/html4/Overview.html
                                 it would be http://www.w3.org/TR/html4/
      -n/--noacclanguage         Do not send an Accept-Language header.
      -L/--languages             Languages accepted (default: *).
      -q/--quiet                 No output if no errors are found. Implies -s.
      -v/--verbose               Verbose mode.
      -i/--indicator             Show progress while parsing.
      -u/--user username         Specify a username for authentication.
      -p/--password password     Specify a password.
      --hide-same-realm          Hide 401's that are in the same realm as the
                                 document checked.
      -t/--timeout value         Timeout for HTTP requests.
      -d/--domain domain         Regular expression describing the domain to
                                 which the authentication information will be
                                 sent.
      --masquerade "base1 base2" Masquerade base URI base1 as base2. See manual
                                 page for more information.
      -y/--proxy proxy           Specify an HTTP proxy server.
      -h/--html                  HTML output.
      -?/--help                  Show this message.
      -V/--version               Output version information.

    See "perldoc Net::FTP" for information about various environment variables
    affecting FTP connections and "perldoc Net::NNTP" for setting a default
    NNTP server for news: URIs.

    The W3C_CHECKLINK_CFG environment variable can be used to set the
    configuration file to use. See details in the full manual page, it can
    be displayed with:
      perldoc /usr/local/bin/checklink

    More documentation at: http://www.w3.org/2000/07/checklink
    Please send bug reports and comments to the www-validator mailing list:
      [email protected] (with 'checklink' in the subject)
      Archives are at: http://lists.w3.org/Archives/Public/www-validator/
checklink
    minerva% checklink -r -D 0 -s -b \
    > http://cscie12.dce.harvard.edu/lecture_notes/20070131/
webbot
webbot is part of the W3C Libwww package. http://www.w3.org/Robot/
    minerva% webbot

    W3C OpenSource Software
    -----------------------

     Webbot version 5.4.0
     using the W3C libwww library version 5.4.0.

     See "http://www.w3.org/Robot/User/CommandLine" for help
     See "http://www.w3.org/Robot/User/" for user information
     See "http://www.w3.org/Robot/" for general information

     Please send feedback to the <[email protected]> mailing list,
     see "http://www.w3.org/Library/#Forums" for details
webbot example
    minerva% webbot -img \
    > -depth 99 \
    > -prefix http://cscie12.dce.harvard.edu/lecture_notes/20060131/ \
    > -include http://cscie12.dce.harvard.edu/lecture_notes/20060131/ \
    > -404 404.log \
    > -l clf.log \
    > -referer referer.log \
    > -reject reject.log \
    > http://cscie12.dce.harvard.edu/lecture_notes/20060131/
    ...content removed...
    Robot....... Received element 0, attribute 5 with anchor 0x8073700
    Robot....... Found `http://cscie12.dce.harvard.edu/' -
    ............ Already checked
    Robot....... Received element 0, attribute 5 with anchor 0x8073688
    Robot....... Found `http://cscie12.dce.harvard.edu/lecture_notes/20060131/slide1.html' -
    ............ Already checked
    Robot....... done with http://cscie12.dce.harvard.edu/lecture_notes/20060131/slide0.html
     2 outstanding requests
    Robot....... done with http://cscie12.dce.harvard.edu/lecture_notes/20060131/images/verit
     1 outstanding request
    Robot....... done with http://cscie12.dce.harvard.edu/lecture_notes/20060131/images/KUSea
     Everything is finished...

    Accessed 62 documents in 2.61 seconds (23.79 requests pr sec)
     Did a GET on 53 document(s) and downloaded 182K bytes of document bodies (71396.8
     Did a HEAD on 9 document(s) with a total of 49K bytes

    Raw Log files:
     Logged 62 entries in general log file `clf.log'
     Logged 61 entries in referer log file `referer.log'
     Logged 51 entries in rejected log file `reject.log'
     Logged 0 entries in not found log file `404.log'
checkbot
Checkbot http://degraaff.org/checkbot/
minerva% checkbot
Checkbot Example Output
Checkbot Example
    minerva% checkbot --help
    Checkbot 1.75 command line options:

      --cookies          Accept cookies from the server
      --debug            Debugging mode: No pauses, stop after 25 links.
      --mailto address   Mail brief synopsis to address when done.
      --noproxy domains  Do not proxy requests to given domains.
      --verbose          Verbose mode: display many messages about progress.
      --url url          Start URL
      --match match      Check pages only if URL matches `match'
                         If no match is given, the start URL is used as a match
      --exclude exclude  Exclude pages if the URL matches 'exclude'
      --filter regexp    Run regexp on each URL found
      --ignore ignore    Ignore URLs matching 'ignore'
      --suppress file    Use contents of 'file' to suppress errors in output
      --file file        Write results to file, default is checkbot.html
      --note note        Include Note (e.g. URL to report) along with Mail message.
      --proxy URL        URL of proxy server for HTTP and FTP requests.
      --internal-only    Only check internal links, skip checking external links.
      --sleep seconds    Sleep this many seconds between requests (default 0)
      --style url        Reference the style sheet at this URL.
      --timeout seconds  Timeout for http requests in seconds (default 120)
      --interval seconds Maximum time interval between updates (default 10800)
      --dontwarn codes   Do not write warnings for these HTTP response codes
      --enable-virtual   Use only virtual names, not IP numbers for servers
      --language         Specify 2-letter language code for language negotiation

    Options --match, --exclude, and --ignore can take a perl regular expression
    as their argument

    Use 'perldoc checkbot' for more verbose documentation.
    Checkbot WWW page     : http://degraaff.org/checkbot/
    Mail bugs and problems: [email protected]
checkbot
Results are given in an HTML page: checkbot.html
    minerva% checkbot \
    > --url http://cscie12.dce.harvard.edu/lecture_notes/20060131/ \
    > --verbose
For the Programmer: Writing Your Own
The Perl modules LWP and WWW::Robot make writing robots almost trivial.
Examples in Perl Cookbook, published by O'Reilly
Perl and LWP by Sean Burke, published by O'Reilly
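The slide points to Perl's LWP and WWW::Robot. As a language-neutral illustration of the core step any such robot performs, here is a minimal sketch in Python (not Perl, and not the LWP API) that extracts the links from a page and resolves them against the page URL. The HTML snippet and class name are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from a and img elements."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> list:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


BASE = "http://cscie12.dce.harvard.edu/lecture_notes/20060131/"
PAGE = '<a href="slide1.html">Next</a> <img src="images/dot.gif" alt=""/> <a href="/">Home</a>'
for link in extract_links(BASE, PAGE):
    print(link)
```

A complete robot would add a queue of URLs to visit, a set of URLs already checked, and a robots.txt check before each request.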
Other Webmaster Tools
Checking HTML Pages
Web Site Mirroring
Document Version Control
Monitor HTTP Server Performance
Checking HTML Pages
HTML tidy, http://tidy.sourceforge.net/
W3C HTML Validation, http://validator.w3.org/
W3C CSS Validation, http://jigsaw.w3.org/css-validator/
WebXact (WAI and Section 508 Compliance), http://webxact.watchfire.com/ Watchfire.
Web Site Mirroring
GNU wget http://www.gnu.org/software/wget/wget.html
w3mir http://langfeldt.net/w3mir/
GNU wget
GNU wget http://www.gnu.org/software/wget/wget.html
    minerva% wget --help
w3mir
w3mir http://langfeldt.net/w3mir/
w3mir is an all-purpose HTTP copying and mirroring tool. The main focus of w3mir is to create and maintain a browseable copy of one, or several, remote WWW site(s). Used to the max, w3mir can retrieve the contents of several related sites and leave the mirror browseable via a local web server, or from a file system, such as directly from a CDROM.
HTTP Server Stress Test
ApacheBench (ab) is part of the Apache HTTP Server Distribution http://www.apache.org/httpd.html
minerva% ab -h
Apache JMeter http://jakarta.apache.org/jmeter/index.html
Apache Bench (ab)
    minerva% /usr/bin/ab -n 10000 -c 10 http://cscie12.dce.harvard.edu/lecture_notes/
    This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
    Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
    Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

    Benchmarking cscie12.dce.harvard.edu (be patient)
    Completed 1000 requests
    Completed 2000 requests
    Completed 3000 requests
    Completed 4000 requests
    Completed 5000 requests
    Completed 6000 requests
    Completed 7000 requests
    Completed 8000 requests
    Completed 9000 requests
    Finished 10000 requests


    Server Software:        Apache/2.0.51
    Server Hostname:        cscie12.dce.harvard.edu
    Server Port:            80

    Document Path:          /lecture_notes/
    Document Length:        1040 bytes

    Concurrency Level:      10
    Time taken for tests:   23.653070 seconds
    Complete requests:      10000
    Failed requests:        0
    Write errors:           0
    Total transferred:      12097254 bytes
    HTML transferred:       10406240 bytes
    Requests per second:    422.78 [#/sec] (mean)
    Time per request:       23.653 [ms] (mean)
    Time per request:       2.365 [ms] (mean, across all concurrent requests)
    Transfer rate:          499.43 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        1    8   4.2      9      39
    Processing:     6   14   4.3     14      44
    Waiting:        0    8   4.2      8      36
    Total:         18   23   4.9     21      47

    Percentage of the requests served within a certain time (ms)
      50%     21
      66%     21
      75%     22
      80%     23
      90%     31
      95%     38
      98%     39
      99%     40
     100%     47 (longest request)
Apache Bench (ab)
    minerva% /usr/sbin/ab -n 1000 -c 10 http://cscie12.dce.harvard.edu/tools/webcube.cgi
    This is ApacheBench, Version 1.3d <$Revision: 1.67 $> apache-1.3
    Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
    Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

    Benchmarking cscie12.dce.harvard.edu (be patient)
    Completed 100 requests
    Completed 200 requests
    Completed 300 requests
    Completed 400 requests
    Completed 500 requests
    Completed 600 requests
    Completed 700 requests
    Completed 800 requests
    Completed 900 requests
    Finished 1000 requests


    Server Software:        Apache/2.0.49
    Server Hostname:        cscie12.dce.harvard.edu
    Server Port:            80

    Document Path:          /tools/webcube.cgi
    Document Length:        58163 bytes

    Concurrency Level:      10
    Time taken for tests:   98.986587 seconds
    Complete requests:      1000
    Failed requests:        0
    Write errors:           0
    Total transferred:      58323402 bytes
    HTML transferred:       58171098 bytes
    Requests per second:    10.10 [#/sec] (mean)
    Time per request:       989.866 [ms] (mean)
    Time per request:       98.987 [ms] (mean, across all concurrent requests)
    Transfer rate:          575.39 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.1      0       1
    Processing:   191  981 966.2    719    9244
    Waiting:      188  810 750.6    586    6321
    Total:        191  981 966.2    719    9244

    Percentage of the requests served within a certain time (ms)
      50%    719
      66%   1065
      75%   1278
      80%   1468
      90%   1956
      95%   2452
      98%   4006
      99%   5499
     100%   9244 (longest request)
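The headline figures in the two ab reports follow from three inputs: the number of requests, the concurrency level, and the total time. The sketch below recomputes them; the formulas are inferred from the numbers shown in the reports, and ab_summary is a made-up helper name.

```python
def ab_summary(requests: int, concurrency: int, seconds: float) -> dict:
    """Recompute ab's summary rates from its three basic inputs."""
    return {
        # Completed requests divided by wall-clock time.
        "requests_per_second": round(requests / seconds, 2),
        # Mean time one client waits for one request.
        "ms_per_request": round(concurrency * seconds * 1000 / requests, 3),
        # Mean time per request seen by the server across all clients.
        "ms_per_request_across_all": round(seconds * 1000 / requests, 3),
    }


# Static page: 10000 requests, 10 concurrent clients, 23.653070 seconds.
print(ab_summary(10000, 10, 23.653070))
# CGI program: 1000 requests, 10 concurrent clients, 98.986587 seconds.
print(ab_summary(1000, 10, 98.986587))
```

Comparing the two runs, the static page is served at roughly 423 requests per second while the CGI program manages about 10, which is the contrast the two slides illustrate.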
HTTP Server Monitoring
Nagios http://www.nagios.org/
Cricket http://cricket.sourceforge.net/
Monitoring Apache with Cricket
Web Logs
Common Tools
Analog
Analog + Report Magic
AWStats
WebTrends
HTTP Server Logs
access log
error log
referer log (no longer common)
user-agent log (no longer common)
error log:
    [Thu Dec 2 11:26:08 1999] [notice] Apache/1.3.9 (Unix) configured -- resuming normal operat
    [Thu Dec 2 11:27:19 1999] [notice] caught SIGTERM, shutting down
    [Mon Dec 6 19:15:04 1999] [notice] Apache/1.3.9 (Unix) configured -- resuming normal operat
    [Mon Dec 6 19:27:33 1999] [notice] caught SIGTERM, shutting down
access log:
    is03.fas.harvard.edu - - [02/Dec/1999:11:26:42 -0500] "GET /server-status HTTP/1.0" 200 1544
    140.247.30.103 - - [02/Dec/1999:11:26:48 -0500] "GET / HTTP/1.0" 200 1622
    is03.fas.harvard.edu - - [02/Dec/1999:11:26:56 -0500] "GET /server-info HTTP/1.0" 200 45662
    140.247.30.104 - - [06/Dec/1999:19:16:58 -0500] "GET / HTTP/1.0" 200 1622
    140.247.27.63 - - [06/Dec/1999:19:17:08 -0500] "GET / HTTP/1.1" 200 1622
    140.247.27.63 - - [06/Dec/1999:19:17:09 -0500] "GET /apache_pb.gif HTTP/1.1" 200 2326
    is04.fas.harvard.edu - - [06/Dec/1999:19:18:32 -0500] "GET /server-status HTTP/1.0" 200 1546
Data in Access Logs
http://www.apache.org/docs/mod/mod_log_config.html
Typical Data:
Time
IP address / Hostname
Username (if under Authentication)
Request
User-Agent
Referrer URL
Response Status
Bytes returned
Possible Data:
The contents of a specified environment variable
Filename
The request protocol
The contents of specified HTTP request headers
The contents of specified HTTP response headers
Remote logname (from identd, if supplied)
The request method
The canonical Port of the server serving the request.
The process ID of the child that serviced the request.
The query string
First line of request
The time taken to serve the request.
The URL path requested.
The canonical ServerName of the server serving the request.
The server name according to the UseCanonicalName setting.
Log Formats
Common Log Format (CLF)
    host ident auth_user date request status bytes
User-Agent Log
    date user-agent
Referer Log
    date referrer-url request-url
Combined Log Format
    host ident authuser date request status bytes referrer user-agent
Custom Log Formats in Apache
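To show how the field lists above map onto an actual log line, here is a minimal parsing sketch. The regular expression is one possible reading of the Common and Combined formats, not an official grammar. The first sample line is taken from the access log shown earlier; the second is invented (client.example.com, ExampleBrowser) to exercise the two extra Combined fields.

```python
import re

# host ident auth_user [date] "request" status bytes ["referrer" "user-agent"]
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<auth_user>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?$'
)


def parse_log_line(line: str):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A dash in the bytes column means no body was returned.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


CLF_LINE = '140.247.30.103 - - [02/Dec/1999:11:26:48 -0500] "GET / HTTP/1.0" 200 1622'
COMBINED_LINE = (
    'client.example.com - - [14/Sep/1999:11:38:28 -0400] "GET /index.html HTTP/1.0" '
    '200 7281 "http://www.example.com/" "ExampleBrowser/1.0"'
)
print(parse_log_line(CLF_LINE))
print(parse_log_line(COMBINED_LINE))
```

For a Common Log Format line the referrer and user_agent entries come back as None, since those fields exist only in the Combined format.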
Combined Log Format
    damhab.103.60.61 - - [14/Sep/1999:11:37:46 -0400] "GET /~cscie12/syllabus/workshops.html HTT
    damhab.103.60.61 - - [14/Sep/1999:11:37:46 -0400] "GET /~cscie12/discussion/ HTTP/1.0" 404 2
    vicfux.115.63.27 - - [14/Sep/1999:11:38:28 -0400] "GET /~cscie12/ HTTP/1.0" 200 7281 "http:/
    vicfux.115.63.27 - - [14/Sep/1999:11:38:56 -0400] "GET /~cscie12/images/dce2.gif HTTP/1.0" 2
    vicfux.115.63.27 - - [14/Sep/1999:11:38:58 -0400] "GET /~cscie12/images/dot.gif HTTP/1.0" 20
    vicfux.115.63.27 - - [14/Sep/1999:11:38:58 -0400] "GET /~cscie12/images/syllabus.gif HTTP/1.
Web Server Logs: Two perspectives
Server Administrator
Content Provider
Web Server Logs: What we would like to know and what we can know
1. What is the busiest time?
2. How long do they stay?
3. How long did it take to fulfill a request?
4. How many requests were there for a specific resource?
5. Where are the users coming from?
6. What browsers are people using?
7. What pages have they been to?
8. How many were looking versus buying?
9. What requests resulted in errors (status 404, etc)?
10. Where do the users go when they leave the site?
11. Do they come back?
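Some of these questions can be answered directly from the access log. The sketch below tallies requests per hour, requests per resource, and error responses, using the seven access log lines shown earlier in these slides as input. The regular expression and the summarize helper are illustrative assumptions, not part of any log analysis tool.

```python
import re
from collections import Counter

LOG_LINES = [
    'is03.fas.harvard.edu - - [02/Dec/1999:11:26:42 -0500] "GET /server-status HTTP/1.0" 200 1544',
    '140.247.30.103 - - [02/Dec/1999:11:26:48 -0500] "GET / HTTP/1.0" 200 1622',
    'is03.fas.harvard.edu - - [02/Dec/1999:11:26:56 -0500] "GET /server-info HTTP/1.0" 200 45662',
    '140.247.30.104 - - [06/Dec/1999:19:16:58 -0500] "GET / HTTP/1.0" 200 1622',
    '140.247.27.63 - - [06/Dec/1999:19:17:08 -0500] "GET / HTTP/1.1" 200 1622',
    '140.247.27.63 - - [06/Dec/1999:19:17:09 -0500] "GET /apache_pb.gif HTTP/1.1" 200 2326',
    'is04.fas.harvard.edu - - [06/Dec/1999:19:18:32 -0500] "GET /server-status HTTP/1.0" 200 1546',
]

REQUEST_RE = re.compile(
    r'\[\d{2}/\w{3}/\d{4}:(?P<hour>\d{2}):[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def summarize(lines):
    """Count requests by hour of day, by path, and errors by path."""
    hours, paths, errors = Counter(), Counter(), Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue
        hours[match.group("hour")] += 1
        paths[match.group("path")] += 1
        if match.group("status")[0] in "45":  # 4xx and 5xx responses
            errors[match.group("path")] += 1
    return hours, paths, errors


hours, paths, errors = summarize(LOG_LINES)
print("busiest hour:", hours.most_common(1))
print("requests per resource:", paths.most_common())
print("errors:", dict(errors))
```

Questions such as how long visitors stay or where they go after leaving cannot be read off individual request lines in this way.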
Complicating Issues
HTTP is a stateless protocol
Local Cache
Proxy Cache
Proxy Servers
Shared Computers
Log Rotation
approximately 200 to 250 bytes per line (request)
For example, 1,000,000 requests per day (about 12 requests per second):
    log grows at about 2.8 KB per second (at 250 bytes per line)
    about 238 MB for 1,000,000 requests
compressed (gzip'ed) logs are 7 to 10% of original size!
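The growth figures can be checked with a few lines of arithmetic. The sketch assumes the upper estimate of 250 bytes per line and binary units (1 KB = 1024 bytes, 1 MB = 1024 KB), which is what reproduces the numbers on the slide; log_growth is a made-up helper name.

```python
def log_growth(requests_per_day: int, bytes_per_line: int) -> dict:
    """Estimate access log growth for a given daily request volume."""
    seconds_per_day = 24 * 60 * 60
    bytes_per_day = requests_per_day * bytes_per_line
    return {
        "requests_per_second": requests_per_day / seconds_per_day,
        "kb_per_second": bytes_per_day / seconds_per_day / 1024,
        "mb_per_day": bytes_per_day / 1024 ** 2,
    }


growth = log_growth(1_000_000, 250)
for name, value in growth.items():
    print(f"{name}: {value:.2f}")

# Compressed size at 7% and 10% of the original, in MB per day.
print(f"gzip'ed: {growth['mb_per_day'] * 0.07:.1f} to {growth['mb_per_day'] * 0.10:.1f} MB")
```

At these rates an uncompressed log passes 7 GB in about a month, which is the practical argument for rotating and compressing logs on a schedule.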
Tools for Log Analysis
Analog http://www.analog.cx/
    Stephen Turner
    UNIX, Windows, MacOS, others
    Free!!
Report Magic
WebTrends Log Analyzer http://www.webtrends.com/