W3C-LinkChecker-4.81/t/00compile.t:

use Test::More tests => 2;    # -*- perl -*-

use File::Spec ();

ok(system($^X, '-wTc', File::Spec->catfile('bin', 'checklink')) == 0);
require_ok('W3C::LinkChecker');

W3C-LinkChecker-4.81/lib/W3C/LinkChecker.pm:

# Dummy module for CPAN indexing purposes.
package W3C::LinkChecker;
use strict;
use vars qw($VERSION);
$VERSION = "4.81";
1;

W3C-LinkChecker-4.81/NEWS:

This document contains information about high level changes between
Link Checker releases.

Version 4.81 - 2011-10-16

- Work around some related problems (#12720, rt.cpan.org#54361).
- Eliminate some warnings (emitted by code, not from results).

Version 4.8 - 2011-04-02

- Avoid some robot delays by improving the order in which links are checked.
- Avoid some unnecessary HEAD requests in recursive mode.
- Clarify output wrt. links that have already been checked.
- Make connection cache size configurable, and increase the default to 2.
- Move JavaScript to an external file.
- Check applet and object archive links.

Version 4.7 - 2011-03-17

- Support for IRI.
- Support for more HTML5 links.
- Decode query string parameters as UTF-8.
- Decode command line arguments according to locale.
- New dependencies: Encode-Locale (command line mode only).
- Updated dependencies: libwww-perl >= 5.833, URI >= 1.53.

Version 4.6 - 2010-05-01

- Support for checking links in CSS.
- Results UI improvements, added "progress bar".
- Support for larger variety of character and content encodings.
- Support for HTTP responses with > 4kB header lines.
- Additional output suppress options in command line mode.
- Improved heuristics when passed non-absolute URLs.
- Support for cookies (command line only for now).
- More "false positive" failure avoidance efforts for "make test".
- The set of forbidden protocols is now configurable.
- New dependencies: CSS-DOM >= 0.09.
- Updated dependencies: Perl >= 5.8.

Version 4.5 - 2009-03-30

- Removed W3C trademarked icons from distribution tarball.
- Avoid "false positive" failures from "make test" in certain setups.
- Make quiet command line mode quieter.
- Lowered default timeout to 30 seconds.

Version 4.4 - 2009-02-12

- checking more elements and attributes, such as BLOCKQUOTE cite="",
  BODY background="", EMBED, etc
- Changes in the UI to make it match other validators more closely
- in HTML/cgi output, using javascript to show checklink status as it happens
- added support for HTML5 links
- softer wording for broken link results
- Add non-robot developer mode
- many bug fixes and code cleanup

Version 4.3 - 2006-10-22

- Various minor improvements to result output, both in text and HTML modes.
- Fix --quiet and checking multiple documents to match documentation.
- Eliminate various warnings (emitted by code, not from results).
- Documentation improvements.

Version 4.2.1 - 2005-05-15

- Include documentation of the reorganized access keys.
Version 4.2 - 2005-04-27

- Access key reorganization, making them less likely to conflict with
  browsers' "native" key bindings.
- Redirects are now checked for private IP addresses too.

Version 4.1 - 2004-11-24

- Added workarounds against browser timeouts in "summary only" mode.
- Improved caching and reuse of fetched /robots.txt information.
- Fixed a bug where a complete protocol response (including headers) was
  passed to the HTML parser, which caused unexpected behaviour.
- Minor user interface and installation related improvements.

W3C-LinkChecker-4.81/docs/linkchecker.js:

// Returns true if the URI input field identified by num contains
// non-whitespace content; otherwise focuses the field and returns false.
function uriOk(num)
{
    if (!document.getElementById) {
        return true;
    }
    var u = document.getElementById('uri_' + num);
    var ok = false;
    if (u.value.length > 0) {
        if (u.value.search) {
            ok = (u.value.search(/\S/) !== -1);
        }
        else {
            ok = true;
        }
    }
    if (!ok) {
        u.focus();
    }
    return ok;
}

// Updates one in-page progress display: replaces the heading text,
// widens the progress bar to the given percentage, and scrolls the
// log area to the bottom.
function show_progress(progress_id, progress_text, progress_percentage)
{
    var div = document.getElementById("progress" + progress_id);

    var head = div.getElementsByTagName("h3")[0];
    var text = document.createTextNode(progress_text);
    var span = document.createElement("span");
    span.appendChild(text);
    head.replaceChild(span, head.getElementsByTagName("span")[0]);

    var bar = div.getElementsByTagName("div")[0];
    bar.firstChild.style.width = progress_percentage;
    bar.title = progress_percentage;

    var pre = div.getElementsByTagName("pre")[0];
    pre.scrollTop = pre.scrollHeight;
}

W3C-LinkChecker-4.81/docs/checklink.html:

W3C Link Checker Documentation

About this service

In order to check the validity of the technical reports that W3C publishes, the Systems Team has developed a link checker.

A first version was developed in August 1998 by Renaud Bruyeron. Since it lacked some functionality, Hugo Haas rewrote it more or less from scratch in November 1999. It has been improved by Ville Skyttä and many other volunteers since.

The source code is available publicly under the W3C IPR software notice from CPAN (released versions) and a Mercurial repository (development and archived release versions).

What it does

The link checker reads an HTML or XHTML document or a CSS style sheet and extracts a list of anchors and links.

It checks that no anchor is defined twice.

It then checks that all the links are dereferenceable, including the fragments. It warns about HTTP redirects, including directory redirects.

It can recursively check part of a Web site.

There is a command line version and a CGI version. They both support HTTP basic authentication. This is achieved in the CGI version by passing through the authorization information from the user browser to the site tested.

Use it online

There is an online version of the link checker.

In the online version (and in general, when run as a CGI script), the number of documents that can be checked recursively is limited.

Both the command line version and the online one sleep at least one second between requests to each server, to avoid abuse and target server congestion.

Access keys

The following access keys are implemented throughout the site to help users of screen readers.

  1. Home: access key "1" leads back to the service's home page.
  2. Downloads: access key "2" leads to downloads.
  3. Documentation: access key "3" leads to the documentation index for the service.
  4. Feedback: access key "4" leads to the feedback instructions.

Install it locally

The link checker is written in Perl. It is packaged as a standard CPAN distribution, and depends on a few other modules which are also available from CPAN.

Install with the CPAN utility

If your system has a working installation of Perl, you should be able to install the link checker and its dependencies with a single command from the command line shell:

sudo perl -MCPAN -e 'install W3C::LinkChecker' (omit the sudo command if installing from an administrator account).

If this is the first time you have used the CPAN utility, you may have to answer a few setup questions before the tool downloads, builds and installs the link checker.

Install by hand

If for any reason the technique described above does not work, or if you prefer installing each package by hand, follow the instructions below:

  1. Install Perl, version 5.8 or newer.
  2. You will need the following CPAN distributions, as well as the distributions they depend on. Depending on your Perl version, you might already have some of these installed. The latest versions of these may require a recent version of Perl; as long as the minimum version requirements below are satisfied, everything should be fine, and an older version that works with your Perl will do. For an introduction to installing Perl modules, see The CPAN FAQ.
  3. Optionally, install the link checker configuration file etc/checklink.conf (contained in the link checker distribution package) into /etc/w3c/checklink.conf, or set the W3C_CHECKLINK_CFG environment variable to the location where you installed it (see the example after this list).
  4. Optionally, install the checklink script into a location in your web server which allows execution of CGI scripts (typically a directory named cgi-bin somewhere below your web server's root directory).
  5. See also the README and INSTALL file(s) included in the above distributions.
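
For example, to point the link checker at a configuration file installed in a non-default location, you could set the environment variable before running it (the path below is illustrative only):

export W3C_CHECKLINK_CFG=$HOME/etc/checklink.conf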

Running checklink --help shows how to use the command line version. The distribution package also includes more extensive POD documentation, use perldoc checklink (or man checklink on Unixish systems) to view it.
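
For example, the following invocations (the URIs and the username are illustrative only; all options are described in --help and the manual page) check a single document with summary output only, check a password-protected document while prompting for the password, and recursively check a site two levels deep while excluding links whose URIs end in ".pdf":

checklink --summary http://www.example.org/
checklink --user jdoe http://www.example.org/protected/
checklink --depth 2 --exclude '\.pdf$' http://www.example.org/docs/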

SSL/TLSv1 support for https in the link checker needs support for it in libwww-perl; see README.SSL in the libwww-perl distribution for more information.
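
As a quick sanity check, one-liners such as the following test whether one of the SSL provider modules commonly used by libwww-perl is installed (which of the two applies depends on your libwww-perl version; Perl prints an error instead of "ok" if the module is missing):

perl -MCrypt::SSLeay -e 'print "ok\n"'
perl -MIO::Socket::SSL -e 'print "ok\n"'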

In online mode, the link checker's output should not be buffered, in order to avoid browser timeouts. The link checker itself does not buffer its output, but in some cases output buffering needs to be explicitly disabled for it in the web server running it. One such case is Apache's mod_deflate compression module, which results in output buffering as a side effect. One way to disable it for the link checker (while leaving it enabled for other resources, if configured so elsewhere) is to add the following section to an appropriate place in the Apache configuration (assuming the link checker script's filename is checklink):

<Files checklink>
    SetEnv no-gzip
</Files>

If you want to enable the authentication capabilities with Apache, have a look at Steven Drake's hack.

The link checker honors proxy settings from the scheme_proxy environment variables. See LWP(3) and LWP::UserAgent(3)'s env_proxy method for more information.
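
For example, setting the following environment variable before a run (the proxy host is illustrative only) makes the link checker send all HTTP requests through that proxy:

export http_proxy=http://proxy.example.org:8080/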

Some environment variables affect the way the link checker uses FTP. In particular, passive mode is the default. See Net::FTP(3) for more information.

There are multiple alternatives for configuring the default NNTP server for use with news: URIs without explicit hostnames; see Net::NNTP(3) for more information.
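
For example, Net::NNTP honors the NNTPSERVER environment variable, so the following (with an illustrative hostname) would set the default server used for news: URIs:

export NNTPSERVER=news.example.org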

Robots exclusion

The link checker honors robots exclusion rules. To place rules specific to the W3C Link Checker in /robots.txt files, sites can use the W3C-checklink user agent string. For example, to allow the link checker to access all documents on a server and to disallow all other robots, one could use the following:

User-Agent: *
Disallow: /

User-Agent: W3C-checklink
Disallow:

Robots exclusion support in the link checker is based on the LWP::RobotUA Perl module. It currently supports the "original 1994 version" of the standard. The robots META tag, i.e. <meta name="robots" content="...">, is not supported. Other than that, the link checker's implementation goes all the way in trying to honor robots exclusion rules; if a /robots.txt disallows it, not even the first document submitted as the root for a link checker run is fetched.
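
The following minimal Perl sketch shows the underlying LWP::RobotUA mechanism (not the link checker's actual code; the contact address and URI are illustrative only):

use LWP::RobotUA;

# Identify as W3C-checklink, with an illustrative contact address.
my $ua = LWP::RobotUA->new('W3C-checklink', 'webmaster@example.org');

# delay() takes minutes; this waits at least one second between
# requests to the same server, like the link checker's default.
$ua->delay(1 / 60);

# If /robots.txt on the server disallows W3C-checklink, LWP::RobotUA
# refuses the request itself and returns a 403 response with a
# "Forbidden by robots.txt" message instead of fetching the page.
my $response = $ua->get('http://www.example.org/');
print $response->status_line(), "\n";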

Note that /robots.txt rules affect only user agents that honor it; it is not a generic method for access control.

Comments, suggestions and bugs

The current version has proven to be stable. It could, however, be improved; see the list of open enhancement ideas and bugs for details.

Please send comments, suggestions and bug reports about the link checker to the www-validator mailing list (archives), with 'checklink' in the subject. See the examples below:

Good:
    Subject: online checklink times out when accessed with Iceweasel 2.1.12
Bad:
    Subject: checklink
Bad:
    Subject: checklink does not work

Known issues

If a link checker run in "summary only" mode takes a long time, some user agents may stop loading the results page due to a timeout. We have placed workarounds hoping to avoid this in the code, but have not yet found one that would work reliably for all browsers. If you experience these timeouts, try avoiding "summary only" mode, or try using the link checker with another browser.

The W3C QA-dev Team
W3C-LinkChecker-4.81/docs/linkchecker.css0000644000000000000000000001750211537353303016654 0ustar rootroot/* Base Style Sheet for the W3C Link Checker. Copyright 2000-2011 W3C (MIT, INRIA, Keio). All Rights Reserved. See http://www.w3.org/Consortium/Legal/ipr-notice.html#Copyright */ html, body { line-height: 120%; color: black; background: white; font-family: "Bitstream Vera Sans", sans-serif; margin: 0; padding: 0; border: 0; } div#main { margin: 1em 2em; } div#main form { clear: both; background: #EAEBEE url(../images/round-tr.png) no-repeat top right; padding: 0.5em 1.3em; border-bottom: 1px solid #DCDDE0; } a img { border: 0; } a:link, a:visited { text-decoration: underline; color: #365D95; } a:hover, a:active { text-decoration: underline; color: #1F2126; } acronym:hover, abbr:hover { cursor: help; } abbr[title], acronym[title], span[title], strong[title] { border-bottom: thin dotted; cursor: help; } pre, code, tt { font-family: "Bitstream Vera Sans Mono", monospace; line-height: 100%; white-space: pre; } div.progress pre { height: 12em; font-size: small; overflow: auto; padding: 0.5em 1.3em; border: 1px solid #DCDDE0; margin-top: 0; } div.progress h3 { margin-bottom: 0; background: white; border: 1px solid #DCDDE0; border-bottom: 0; padding: .4em .8em; text-indent: 0; overflow: hidden; } div.progressbar { border: 1px solid #DCDDE0; border-bottom: 0; } div.progressbar div { height: .15em; width: 0; background-color: #55B05A; } fieldset { border: 0; padding: 0; } legend { font-size: 1.1em; padding: 1em 0 0.23em; letter-spacing: 0.06em; } fieldset p { margin: 0 !important; padding: 0.7em 0 0.5em 1em; border-top: 1px solid #CBCDD5; background: #EAEBEE url(../images/double.png) left top repeat-x; } input#uri { font-family: Monaco, "Courier New", Monospace; font-size: 0.9em; border: 1px solid #BBB; border-top: 1px solid #777; border-bottom: 1px solid #DDD; background: #FEFEFE url(../images/textbg.png) no-repeat top left; padding: 0.2em 0.2em; max-width: 1000px; font-variant: normal; width: 95%; margin: 0.3em 0 0 1em; } p.submit_button { padding: 0.6em 0 0; margin: 0; text-align: center; border-top: 1px solid #CBCDD5; background: #EAEBEE url(../images/double.png) left top repeat-x; } p.submit_button input { overflow: visible; width: auto; background: #FFF; color: #365D95; padding: 0.3em 0.4em 0.1em 0.3em; font-size: 1em; width: 9em; text-align: center; border-bottom: 2px solid #444; border-right: 2px solid #444; border-top: 1px solid #AAA; border-left: 1px solid #AAA; background: #EEE url(../images/grad.png) repeat-x top left; cursor: pointer; } p.submit_button input:active { color: #1F2126; border-bottom: 1px solid #AAA; border-right: 1px solid #AAA; border-top: 2px solid #444; border-left: 2px solid #444; } a:link img, a:visited img { border-style: none; } a img { color: black; /* The only way to hide the border in NS 4.x */ } ul.toc { list-style: none; } ol li { padding: .1em; } th { text-align: left; } .hideme { display: none; } /* These are usually targets and not links */ h1 a, h1 a:hover, h2 a, h2 a:hover, h3 a, h3 a:hover { color: inherit; background-color: inherit; } img { vertical-align: middle; } address img { float: right; width: 88px; } address { padding: 0 2em; font-size: small; text-align: center; color: #888; background-color: white; } p.copyright { margin-top: 5em; padding-top: .5em; font-size: xx-small; max-width: 85ex; text-align: justify; text-transform: uppercase; margin-left: auto; margin-right:auto; font-family: "Bitstream Vera Sans Mono", monospace; color: #888; 
line-height: 120%; } p.copyright a { color: #88F; text-decoration: none; } /* Various header(ish) things. Definitions cribbed from the CORE Styles. */ h1#title { font-family: "Myriad Web", "Myriad Pro", "Gill Sans", Helvetica, Arial, Sans-Serif; background-color: #365D95; color: #FDFDFD; font-size: 1.6em; font-weight: normal; background: url(../images/head-bl.png) bottom left no-repeat; padding-bottom: 0.430em; margin: 0; line-height: 1; } h1#title a, h1#title a img { background-color: #365D95; } h1 span { border-bottom: 1px solid #6383B1; border-color: #4E6F9E; } h1#title a:link, h1#title a:hover, h1#title a:visited, h1#title a:active { color: #FDFDFD !important; text-decoration: none; } h1#title img { vertical-align: middle; margin-right: 0.7em; } p#tagline { font-size: 0.7em; margin: -2em 0 0 12.1em; padding-bottom: 1em; letter-spacing: 0.1em; line-height: 100% !important; color: #D0DCEE; background-color: transparent; } div#banner { background: #365D95 url(../images/head-br.png) bottom right no-repeat; margin: 1.5em 2em; } h2 { font-size: 1.5em; text-align: left; font-weight: bold; font-style: normal; text-decoration: none; margin-top: 1em; margin-bottom: 1em; line-height: 120%; } h3 { font-size: 1.3em; font-weight: normal; font-style: normal; text-decoration: none; background-color: #EEE; text-indent: 1em; padding: .2em; border-top: 1px dotted black; } /* Navbar */ ul#menu { text-align: center; margin: 1em 2em; background: #EAEBEE url(../images/round-br.png) no-repeat bottom right; padding: 0.5em 0 0.3em; border-top: 1px solid #DCDDE0; } ul#menu span { display: none; } ul#menu a:link, ul#menu a:visited { background: #EAEBEE; color: #365D95; text-decoration: none; } ul#menu a:hover, ul#menu a:active { color: #1F2126; text-decoration: underline; } ul#menu li { display: inline; margin-right: 0.8em; } /* Results */ .report { width: 100%; } table.report { border-collapse: collapse; } table.report th { padding: .5em; background-color: #FCFCFC; } table.report td { padding: .5em; } dl.report { margin-left: 0 !important; margin-right: 0 !important; padding: 0; border-bottom: 1px solid #EAEBEE; border-left: 1px solid #EAEBEE; border-right: 1px solid #EAEBEE; } dl.report dt, dl.report dd { border-bottom: 0; } dl.report dt { border-top: 1px solid #EAEBEE; margin-top: .8em; padding-left: .5em; padding-top: .5em; font-weight: bold; } dl.report dt span.msg_loc, dl.report dt span.redirected_to { font-weight: normal; } dl.report dd { border-top: 0; margin: 0; text-indent: 0; padding: 0; margin-left: 1.5em; } dl.report dd.responsecode { padding-top: 1em; font-size: smaller; } dl.report dd.message_explanation { font-size: smaller; margin-bottom: 1.5em; } dl.report dd p{ padding: 0; line-height: 150%; } div.settings { font-size: smaller; float: right; } div.settings ul { margin: 0; padding-left: 1.5em; } .unauthorized { background-color: aqua; } .redirect { font-weight: normal; font-style: italic; } .broken { color: #A00; } dl.report .broken { font-weight: bold; } .multiple { color: fuchsia; } .dubious { font-style: italic; } span.err_type img { width: 1.2em; height: 1.2em; padding-bottom: .2em; margin-right: .5em; vertical-align: middle; } /* donation and sponsorship program */ div#don_program { border: 1px solid #55B05A; padding: .5em; line-height: 150%; text-align: center; margin-top: .5em; } div#don_program span#don_program_img { float: left; width: 150px; height: 60px; } div#don_program span#don_program_img img { vertical-align: middle; } div#don_program span#don_program_text { } div#don_program 
span#don_program_text a { font-weight: bold; }

W3C-LinkChecker-4.81/META.yml:

--- #YAML:1.0
name: W3C-LinkChecker
version: 4.81
abstract: W3C Link Checker
author:
    - W3C QA-dev Team
license: open_source
distribution_type: module
configure_requires:
    ExtUtils::MakeMaker: 0
build_requires:
    ExtUtils::MakeMaker: 0
requires:
    CGI: 0
    CGI::Carp: 0
    CGI::Cookie: 0
    Config::General: 2.06
    CSS::DOM: 0.09
    CSS::DOM::Constants: 0
    CSS::DOM::Style: 0
    CSS::DOM::Util: 0
    Encode: 0
    Encode::Locale: 0
    File::Spec: 0
    Getopt::Long: 2.17
    HTML::Entities: 0
    HTML::Parser: 3.40
    HTTP::Cookies: 0
    HTTP::Headers::Util: 0
    HTTP::Message: 5.827
    HTTP::Request: 0
    HTTP::Response: 1.50
    Locale::Country: 0
    Locale::Language: 0
    LWP::RobotUA: 1.19
    LWP::UserAgent: 0
    Net::hostent: 0
    Net::HTTP::Methods: 5.833
    Net::IP: 0
    Socket: 0
    Term::ReadKey: 2
    Test::More: 0
    Text::Wrap: 0
    Time::HiRes: 0
    URI: 1.53
    URI::Escape: 0
    URI::file: 0
resources:
    bugtracker: http://www.w3.org/Bugs/Public/
    homepage: http://validator.w3.org/checklink
    MailingList: http://lists.w3.org/Archives/Public/www-validator/
    repository: http://dvcs.w3.org/hg/link-checker/
no_index:
    directory:
        - t
        - inc
generated_by: ExtUtils::MakeMaker version 6.56
meta-spec:
    url: http://module-build.sourceforge.net/META-spec-v1.4.html
    version: 1.4

W3C-LinkChecker-4.81/SIGNATURE:

This file contains message digests of all files listed in MANIFEST,
signed via the Module::Signature module, version 0.68.

To verify the content in this distribution, first make sure you have
Module::Signature installed, then type:

    % cpansign -v

It will check each file's integrity, as well as the signature's
validity. If "==> Signature verified OK! <==" is not displayed, the
distribution may already have been compromised, and you should not run
its Makefile.PL or Build.PL.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

SHA1 b075772a968f5694bfbb4ce33eadf26566a25f47 MANIFEST
SHA1 e8175087619cebc9d0e0ead8ca06b9d8ee73b678 META.yml
SHA1 ab9150095a45776c2020e5781d19054c7018da8b Makefile.PL
SHA1 05b3e35c8352063f2c99efa9cd3881c208fa1bb0 NEWS
SHA1 f1f868ea73db7d39ab491ebb50c84de76cce4b44 README
SHA1 619d90efc63090552be8926418d69a0364989501 bin/checklink
SHA1 4406433ae670dd4f7be3f2c76d55aefb239e9bc9 bin/checklink.pod
SHA1 b188063249c820f0aa5a34b5f735e8f334a536e1 docs/checklink.html
SHA1 fa101fed018fc8e41beca63a0a667fb94c10a557 docs/linkchecker.css
SHA1 8fa71b54357c9ed6ac8e01ab600120032d35b080 docs/linkchecker.js
SHA1 92d01a8a6e7edcd200d70492f4e551984b97b7a0 etc/checklink.conf
SHA1 87c74944dbc80b5d6ab8aac1d09419607b15efff etc/perltidyrc
SHA1 bcb7896bee3764f85a03ab14495efc233f70e215 images/double.png
SHA1 ff9a7be7fee245dd81a7dc4124544d692a140119 images/grad.png
SHA1 61aeb3ea5616833678f66c7baa6db373eedcd86b images/head-bl.png
SHA1 bcb7bf006b79106309350bfa578e94af80aed82d images/head-br.png
SHA1 11243aa6b3463dd8d6a9b2e69027e42a1d9480ab images/info_icons/README
SHA1 a54abf3d12f207b81e19ea8ce783d37c6200cf40 images/info_icons/error.png
SHA1 3fd2638079cd0698655614a5a5afc97a976a4af4 images/info_icons/info.png
SHA1 552c52188188f560dc02a03200164de3045ac3f4 images/info_icons/warning.png
SHA1 1631ed7d5b20c2c036e61225854134f0674cb10a images/no_w3c.png
SHA1 401b5fba02d0d8484775a4a77503fa0d136b96ce images/round-br.png
SHA1 9eb1ee6188391715284a3db080e6e92d163864d9 images/round-tr.png
SHA1 cc01bd358bc1d6d42ca350ad0a4a42778ca4440e images/textbg.png
SHA1 7587466f1487eb446fe5da1a70d445e7b33efd36 lib/W3C/LinkChecker.pm
SHA1 962ba9fff082c4087239b55618ada2a8f1564464 t/00compile.t
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iQEVAwUBTpqXHId580Rxl2NsAQJqPAf/XrqTWrlZa9DFkWrnOSxIYsyDGPl14fCl
ohGFL7jBYxdKndEHo2aA7bA95EOypZVxakUIFcizpC5ujbrqasGPEnxinhJQYLqA
S+4G+yzen3DqbbLndd5eIWVLPS5992gXwuLaeZrNFlGv/kG892NSLGfu3JQiePlc
jNNUZ4dvwe+MHSSvs3DEkAPJqeIR7bx55tp+O7n5HX3ab/sqYIaqI2V3tXP/EHFy
PA/Ig9QFQmfB7SY3TFN7iUFuIDRqIQOzC/Ij/WqY1Uj9885zZJvq0GWT/huFvyVG
IGFM+sZp8gr6fr/bkB7de5xCoVUpkCz+mFkIJFQCu1cwJcP9pa81+g==
=W5f6
-----END PGP SIGNATURE-----

W3C-LinkChecker-4.81/etc/checklink.conf:

#
# Configuration file for the W3C Link Checker
#
# See Config::General(3) for the syntax; 'SplitPolicy' is 'equalsign' here.
#

#
# Trusted is a regular expression for matching "trusted" domains. This is
# used to restrict the domains where HTTP basic authentication will be sent.
# This is matched case insensitively against resources' hostnames.
#
# Not specifying a value here means that the basic authentication will only
# be sent to the same host where the authentication was requested from.
#
# For example, the following would allow sending the authentication to any
# host in the w3.org domain (and *only* there):
# Trusted = \.w3\.org$

#
# Allow_Private_IPs is a boolean flag (1/0) for specifying whether checking of
# links to non-public RFC 1918 IP addresses is allowed.
#
# The default, i.e. not specifying the value here, means that checking links
# on non-public IP addresses is disabled when checklink runs as a CGI script,
# and allowed in command line mode.
#
# For example, the following would disallow private IP addresses regardless
# of the mode:
# Allow_Private_IPs = 0

#
# Markup_Validator_URI and CSS_Validator_URI are formatted URIs to the
# respective validators. The %s in these will be replaced with the full
# "URI encoded" URI to the document being checked, and shown in the link
# checker results view in the online/CGI version.
#
# Defaults:
# Markup_Validator_URI = http://validator.w3.org/check?uri=%s
# CSS_Validator_URI = http://jigsaw.w3.org/css-validator/validator?uri=%s

#
# Doc_URI is the URI to the Link Checker documentation, shown in the
# results report in CGI mode, and the usage message in command line mode.
# The URIs to the CSS and JavaScript files in the generated HTML are also
# formed using this as their base URI. If you have installed the documentation
# locally somewhere, you may wish to change this to point to that location.
# This must be an absolute URI.
#
# Default:
# Doc_URI = http://validator.w3.org/docs/checklink.html

#
# Forbidden_Protocols is a comma separated list of additional protocols/URI
# schemes that the link checker is not allowed to use. The javascript and
# mailto schemes are always forbidden, and so is the file scheme when running
# as a CGI script.
#
# Default:
# Forbidden_Protocols = javascript,mailto

#
# Connection_Cache_Size is an integer denoting the maximum number of
# connections the link checker will keep open at any given time.
#
# Default:
# Connection_Cache_Size = 2

W3C-LinkChecker-4.81/etc/perltidyrc:

# perltidy(1) profile for the W3C Link Checker
--standard-error-output
--warning-output
--output-line-ending=unix
--maximum-line-length=79
--indent-columns=4
--continuation-indentation=4
--vertical-tightness=2
--paren-tightness=2
--brace-tightness=2
--square-bracket-tightness=2
--opening-sub-brace-on-new-line
--nospace-for-semicolon
--nooutdent-long-lines
--break-after-all-operators

W3C-LinkChecker-4.81/images/:

[Binary PNG image files omitted: double.png, grad.png, head-bl.png,
head-br.png, no_w3c.png, round-br.png, round-tr.png, textbg.png,
info_icons/error.png, info_icons/info.png, info_icons/warning.png]

W3C-LinkChecker-4.81/images/info_icons/README:

* error.png and info.png
  from: information icons set
  by: Jakub Jankiewicz
  license: public domain
  http://openclipart.org/media/files/kuba/2051

* warning.png
  from: Warning Notification
  by: eastshores
  license: public domain
  http://openclipart.org/media/files/eastshores/2833
rootrootPNG  IHDRn=hqtEXtSoftwareAdobe ImageReadyqe<PLTE6];aYz\{ehXa@IDATHc@PF&GRa4|F(1n@2 Q0 F(g|[IENDB`W3C-LinkChecker-4.81/bin/0000755000000000000000000000000011646514071013475 5ustar rootrootW3C-LinkChecker-4.81/bin/checklink0000755000000000000000000031771511646513354015377 0ustar rootroot#!/usr/bin/perl -wT # # W3C Link Checker # by Hugo Haas # (c) 1999-2011 World Wide Web Consortium # based on Renaud Bruyeron's checklink.pl # # This program is licensed under the W3C(r) Software License: # http://www.w3.org/Consortium/Legal/copyright-software # # The documentation is at: # http://validator.w3.org/docs/checklink.html # # See the Mercurial interface at: # http://dvcs.w3.org/hg/link-checker/ # # An online version is available at: # http://validator.w3.org/checklink # # Comments and suggestions should be sent to the www-validator mailing list: # www-validator@w3.org (with 'checklink' in the subject) # http://lists.w3.org/Archives/Public/www-validator/ (archives) use strict; use 5.008; # Get rid of potentially unsafe and unneeded environment variables. delete(@ENV{qw(IFS CDPATH ENV BASH_ENV)}); $ENV{PATH} = undef; # ...but we want PERL5?LIB honored even in taint mode, see perlsec, perl5lib, # http://www.mail-archive.com/cpan-testers-discuss%40perl.org/msg01064.html use Config qw(%Config); use lib map { /(.*)/ } defined($ENV{PERL5LIB}) ? split(/$Config{path_sep}/, $ENV{PERL5LIB}) : defined($ENV{PERLLIB}) ? split(/$Config{path_sep}/, $ENV{PERLLIB}) : (); # ----------------------------------------------------------------------------- package W3C::UserAgent; use LWP::RobotUA 1.19 qw(); use LWP::UserAgent qw(); use Net::HTTP::Methods 5.833 qw(); # >= 5.833 for 4kB cookies (#6678) # if 0, ignore robots exclusion (useful for testing) use constant USE_ROBOT_UA => 1; if (USE_ROBOT_UA) { @W3C::UserAgent::ISA = qw(LWP::RobotUA); } else { @W3C::UserAgent::ISA = qw(LWP::UserAgent); } sub new { my $proto = shift; my $class = ref($proto) || $proto; my ($name, $from, $rules) = @_; # For security/privacy reasons, if $from was not given, do not send it. # Cheat by defining something for the constructor, and resetting it later. my $from_ok = $from; $from ||= 'www-validator@w3.org'; my $self; if (USE_ROBOT_UA) { $self = $class->SUPER::new($name, $from, $rules); } else { my %cnf; @cnf{qw(agent from)} = ($name, $from); $self = LWP::UserAgent->new(%cnf); $self = bless $self, $class; } $self->from(undef) unless $from_ok; $self->env_proxy(); $self->allow_private_ips(1); $self->protocols_forbidden([qw(mailto javascript)]); return $self; } sub allow_private_ips { my $self = shift; if (@_) { $self->{Checklink_allow_private_ips} = shift; if (!$self->{Checklink_allow_private_ips}) { # Pull in dependencies require Net::IP; require Socket; require Net::hostent; } } return $self->{Checklink_allow_private_ips}; } sub redirect_progress_callback { my $self = shift; $self->{Checklink_redirect_callback} = shift if @_; return $self->{Checklink_redirect_callback}; } sub simple_request { my $self = shift; my $response = $self->ip_disallowed($_[0]->uri()); # RFC 2616, section 15.1.3 $_[0]->remove_header("Referer") if ($_[0]->referer() && (!$_[0]->uri()->secure() && URI->new($_[0]->referer())->secure())); $response ||= do { local $SIG{__WARN__} = sub { # Suppress some warnings, rt.cpan.org #18902 warn($_[0]) if ($_[0] && $_[0] !~ /^RobotRules/); }; # @@@ Why not just $self->SUPER::simple_request? 
$self->W3C::UserAgent::SUPER::simple_request(@_); }; if (!defined($self->{FirstResponse})) { $self->{FirstResponse} = $response->code(); $self->{FirstMessage} = $response->message() || '(no message)'; } return $response; } sub redirect_ok { my ($self, $request, $response) = @_; if (my $callback = $self->redirect_progress_callback()) { # @@@ TODO: when an LWP internal robots.txt request gets redirected, # this will a bit confusingly fire for it too. Would need a robust # way to determine whether the request is such a LWP "internal # robots.txt" one. &$callback($request->method(), $request->uri()); } return 0 unless $self->SUPER::redirect_ok($request, $response); if (my $res = $self->ip_disallowed($request->uri())) { $response->previous($response->clone()); $response->request($request); $response->code($res->code()); $response->message($res->message()); return 0; } return 1; } # # Checks whether we're allowed to retrieve the document based on its IP # address. Takes an URI object and returns a HTTP::Response containing the # appropriate status and error message if the IP was disallowed, 0 # otherwise. URIs without hostname or IP address are always allowed, # including schemes where those make no sense (eg. data:, often javascript:). # sub ip_disallowed { my ($self, $uri) = @_; return 0 if $self->allow_private_ips(); # Short-circuit my $hostname = undef; eval { $hostname = $uri->host() }; # Not all URIs implement host()... return 0 unless $hostname; my $addr = my $iptype = my $resp = undef; if (my $host = Net::hostent::gethostbyname($hostname)) { $addr = Socket::inet_ntoa($host->addr()) if $host->addr(); if ($addr && (my $ip = Net::IP->new($addr))) { $iptype = $ip->iptype(); } } if ($iptype && $iptype ne 'PUBLIC') { $resp = HTTP::Response->new(403, 'Checking non-public IP address disallowed by link checker configuration' ); $resp->header('Client-Warning', 'Internal response'); } return $resp; } # ----------------------------------------------------------------------------- package W3C::LinkChecker; use vars qw($AGENT $PACKAGE $PROGRAM $VERSION $REVISION $DocType $Head $Accept $ContentTypes %Cfg $CssUrl); use CSS::DOM 0.09 qw(); # >= 0.09 for many bugfixes use CSS::DOM::Constants qw(:rule); use CSS::DOM::Style qw(); use CSS::DOM::Util qw(); use Encode qw(); use HTML::Entities qw(); use HTML::Parser 3.40 qw(); # >= 3.40 for utf8_mode() use HTTP::Headers::Util qw(); use HTTP::Message 5.827 qw(); # >= 5.827 for content_charset() use HTTP::Request 5.814 qw(); # >= 5.814 for accept_decodable() use HTTP::Response 1.50 qw(); # >= 1.50 for decoded_content() use Time::HiRes qw(); use URI 1.53 qw(); # >= 1.53 for secure() use URI::Escape qw(); use URI::Heuristic qw(); # @@@ Needs also W3C::UserAgent but can't use() it here. use constant RC_ROBOTS_TXT => -1; use constant RC_DNS_ERROR => -2; use constant RC_IP_DISALLOWED => -3; use constant RC_PROTOCOL_DISALLOWED => -4; use constant LINE_UNKNOWN => -1; use constant MP2 => (exists($ENV{MOD_PERL_API_VERSION}) && $ENV{MOD_PERL_API_VERSION} >= 2); # Tag=>attribute mapping of things we treat as links. # Note: meta/@http-equiv gets special treatment, see start() for details. 
use constant LINK_ATTRS => { a => ['href'], # base/@href intentionally not checked # http://www.w3.org/mid/200802091439.27764.ville.skytta%40iki.fi area => ['href'], audio => ['src'], blockquote => ['cite'], body => ['background'], command => ['icon'], # button/@formaction not checked (side effects) del => ['cite'], # @pluginspage, @pluginurl, @href: pre-HTML5 proprietary embed => ['href', 'pluginspage', 'pluginurl', 'src'], # form/@action not checked (side effects) frame => ['longdesc', 'src'], html => ['manifest'], iframe => ['longdesc', 'src'], img => ['longdesc', 'src'], # input/@action, input/@formaction not checked (side effects) input => ['src'], ins => ['cite'], link => ['href'], object => ['data'], q => ['cite'], script => ['src'], source => ['src'], track => ['src'], video => ['src', 'poster'], }; # Tag=>[separator, attributes] mapping of things we treat as lists of links. use constant LINK_LIST_ATTRS => { a => [qr/\s+/, ['ping']], applet => [qr/[\s,]+/, ['archive']], area => [qr/\s+/, ['ping']], head => [qr/\s+/, ['profile']], object => [qr/\s+/, ['archive']], }; # TBD/TODO: # - applet/@code? # - bgsound/@src? # - object/@classid? # - isindex/@action? # - layer/@background,@src? # - ilayer/@background? # - table,tr,td,th/@background? # - xmp/@href? @W3C::LinkChecker::ISA = qw(HTML::Parser); BEGIN { # Version info $PACKAGE = 'W3C Link Checker'; $PROGRAM = 'W3C-checklink'; $VERSION = '4.81'; $REVISION = sprintf('version %s (c) 1999-2011 W3C', $VERSION); $AGENT = sprintf( '%s/%s %s', $PROGRAM, $VERSION, ( W3C::UserAgent::USE_ROBOT_UA ? LWP::RobotUA->_agent() : LWP::UserAgent->_agent() ) ); # Pull in mod_perl modules if applicable. eval { local $SIG{__DIE__} = undef; require Apache2::RequestUtil; } if MP2(); my @content_types = qw( text/html application/xhtml+xml;q=0.9 application/vnd.wap.xhtml+xml;q=0.6 ); $Accept = join(', ', @content_types, '*/*;q=0.5'); push(@content_types, 'text/css', 'text/html-sandboxed'); my $re = join('|', map { s/;.*//; quotemeta } @content_types); $ContentTypes = qr{\b(?:$re)\b}io; # Regexp for matching URL values in CSS. $CssUrl = qr/(?:\s|^)url\(\s*(['"]?)(.*?)\1\s*\)(?=\s|$)/; # # Read configuration. If the W3C_CHECKLINK_CFG environment variable has # been set or the default contains a non-empty file, read it. Otherwise, # skip silently. # my $defaultconfig = '/etc/w3c/checklink.conf'; if ($ENV{W3C_CHECKLINK_CFG} || -s $defaultconfig) { require Config::General; Config::General->require_version(2.06); # Need 2.06 for -SplitPolicy my $conffile = $ENV{W3C_CHECKLINK_CFG} || $defaultconfig; eval { my %config_opts = ( -ConfigFile => $conffile, -SplitPolicy => 'equalsign', -AllowMultiOptions => 'no', ); %Cfg = Config::General->new(%config_opts)->getall(); }; if ($@) { die <<"EOF"; Failed to read configuration from '$conffile': $@ EOF } } $Cfg{Markup_Validator_URI} ||= 'http://validator.w3.org/check?uri=%s'; $Cfg{CSS_Validator_URI} ||= 'http://jigsaw.w3.org/css-validator/validator?uri=%s'; $Cfg{Doc_URI} ||= 'http://validator.w3.org/docs/checklink.html'; # Untaint config params that are used as the format argument to (s)printf(), # Perl 5.10 does not want to see that in taint mode. 
($Cfg{Markup_Validator_URI}) = ($Cfg{Markup_Validator_URI} =~ /^(.*)$/); ($Cfg{CSS_Validator_URI}) = ($Cfg{CSS_Validator_URI} =~ /^(.*)$/); $DocType = ''; my $css_url = URI->new_abs('linkchecker.css', $Cfg{Doc_URI}); my $js_url = URI->new_abs('linkchecker.js', $Cfg{Doc_URI}); $Head = sprintf(<<'EOF', HTML::Entities::encode($AGENT), $css_url, $js_url); EOF # Trusted environment variables that need laundering in taint mode. for (qw(NNTPSERVER NEWSHOST)) { ($ENV{$_}) = ($ENV{$_} =~ /^(.*)$/) if $ENV{$_}; } # Use passive FTP by default, see Net::FTP(3). $ENV{FTP_PASSIVE} = 1 unless exists($ENV{FTP_PASSIVE}); } # Autoflush $| = 1; # Different options specified by the user my $cmdline = !($ENV{GATEWAY_INTERFACE} && $ENV{GATEWAY_INTERFACE} =~ /^CGI/); my %Opts = ( Command_Line => $cmdline, Quiet => 0, Summary_Only => 0, Verbose => 0, Progress => 0, HTML => 0, Timeout => 30, Redirects => 1, Dir_Redirects => 1, Accept_Language => $cmdline ? undef : $ENV{HTTP_ACCEPT_LANGUAGE}, Cookies => undef, No_Referer => 0, Hide_Same_Realm => 0, Depth => 0, # < 0 means unlimited recursion. Sleep_Time => 1, Connection_Cache_Size => 2, Max_Documents => 150, # For the online version. User => undef, Password => undef, Base_Locations => [], Exclude => undef, Exclude_Docs => undef, Suppress_Redirect => [], Suppress_Redirect_Prefix => [], Suppress_Redirect_Regexp => [], Suppress_Temp_Redirects => 1, Suppress_Broken => [], Suppress_Fragment => [], Masquerade => 0, Masquerade_From => '', Masquerade_To => '', Trusted => $Cfg{Trusted}, Allow_Private_IPs => defined($Cfg{Allow_Private_IPs}) ? $Cfg{Allow_Private_IPs} : $cmdline, ); undef $cmdline; # Global variables # What URI's did we process? (used for recursive mode) my %processed; # Result of the HTTP query my %results; # List of redirects my %redirects; # Count of the number of documents checked my $doc_count = 0; # Time stamp my $timestamp = &get_timestamp(); # Per-document header; undefined if already printed. See print_doc_header(). my $doc_header; &parse_arguments() if $Opts{Command_Line}; my $ua = W3C::UserAgent->new($AGENT); # @@@ TODO: admin address $ua->conn_cache({total_capacity => $Opts{Connection_Cache_Size}}); if ($ua->can('delay')) { $ua->delay($Opts{Sleep_Time} / 60); } $ua->timeout($Opts{Timeout}); # Set up cookie stash if requested if (defined($Opts{Cookies})) { require HTTP::Cookies; my $cookie_file = $Opts{Cookies}; if ($cookie_file eq 'tmp') { $cookie_file = undef; } elsif ($cookie_file =~ /^(.*)$/) { $cookie_file = $1; # untaint } $ua->cookie_jar(HTTP::Cookies->new(file => $cookie_file, autosave => 1)); } eval { $ua->allow_private_ips($Opts{Allow_Private_IPs}); }; if ($@) { die <<"EOF"; Allow_Private_IPs is false; this feature requires the Net::IP, Socket, and Net::hostent modules: $@ EOF } # Add configured forbidden protocols if ($Cfg{Forbidden_Protocols}) { my $forbidden = $ua->protocols_forbidden(); push(@$forbidden, split(/[,\s]+/, lc($Cfg{Forbidden_Protocols}))); $ua->protocols_forbidden($forbidden); } if ($Opts{Command_Line}) { require Text::Wrap; Text::Wrap->import('wrap'); require URI::file; &usage(1) unless scalar(@ARGV); $Opts{_Self_URI} = 'http://validator.w3.org/checklink'; # For HTML output &ask_password() if ($Opts{User} && !$Opts{Password}); if (!$Opts{Summary_Only}) { printf("%s %s\n", $PACKAGE, $REVISION) unless $Opts{HTML}; } else { $Opts{Verbose} = 0; $Opts{Progress} = 0; } # Populate data for print_form() my %params = ( summary => $Opts{Summary_Only}, hide_redirects => !$Opts{Redirects}, hide_type => $Opts{Dir_Redirects} ? 
'dir' : 'all', no_accept_language => !( defined($Opts{Accept_Language}) && $Opts{Accept_Language} eq 'auto' ), no_referer => $Opts{No_Referer}, recursive => ($Opts{Depth} != 0), depth => $Opts{Depth}, ); my $check_num = 1; my @bases = @{$Opts{Base_Locations}}; for my $uri (@ARGV) { # Reset base locations so that previous URI's given on the command line # won't affect the recursion scope for this URI (see check_uri()) @{$Opts{Base_Locations}} = @bases; # Transform the parameter into a URI $uri = &urize($uri); $params{uri} = $uri; &check_uri(\%params, $uri, $check_num, $Opts{Depth}, undef, undef, 1); $check_num++; } undef $check_num; if ($Opts{HTML}) { &html_footer(); } elsif ($doc_count > 0 && !$Opts{Summary_Only}) { printf("\n%s\n", &global_stats()); } } else { require CGI; require CGI::Carp; CGI::Carp->import(qw(fatalsToBrowser)); require CGI::Cookie; # file: URIs are not allowed in CGI mode my $forbidden = $ua->protocols_forbidden(); push(@$forbidden, 'file'); $ua->protocols_forbidden($forbidden); my $query = CGI->new(); for my $param ($query->param()) { my @values = map { Encode::decode_utf8($_) } $query->param($param); $query->param($param, @values); } # Set a few parameters in CGI mode $Opts{Verbose} = 0; $Opts{Progress} = 0; $Opts{HTML} = 1; $Opts{_Self_URI} = $query->url(-relative => 1); # Backwards compatibility my $uri = undef; if ($uri = $query->param('url')) { $query->param('uri', $uri) unless $query->param('uri'); $query->delete('url'); } $uri = $query->param('uri'); if (!$uri) { &html_header('', undef); # Set cookie only from results page. my %cookies = CGI::Cookie->fetch(); &print_form(scalar($query->Vars()), $cookies{$PROGRAM}, 1); &html_footer(); exit; } # Backwards compatibility if ($query->param('hide_dir_redirects')) { $query->param('hide_redirects', 'on'); $query->param('hide_type', 'dir'); $query->delete('hide_dir_redirects'); } $Opts{Summary_Only} = 1 if $query->param('summary'); if ($query->param('hide_redirects')) { $Opts{Dir_Redirects} = 0; if (my $type = $query->param('hide_type')) { $Opts{Redirects} = 0 if ($type ne 'dir'); } else { $Opts{Redirects} = 0; } } $Opts{Accept_Language} = undef if $query->param('no_accept_language'); $Opts{No_Referer} = $query->param('no_referer'); $Opts{Depth} = -1 if ($query->param('recursive') && $Opts{Depth} == 0); if (my $depth = $query->param('depth')) { # @@@ Ignore invalid depth silently for now. $Opts{Depth} = $1 if ($depth =~ /(-?\d+)/); } # Save, clear or leave cookie as is. my $cookie = undef; if (my $action = $query->param('cookie')) { if ($action eq 'clear') { # Clear the cookie. $cookie = CGI::Cookie->new(-name => $PROGRAM); $cookie->value({clear => 1}); $cookie->expires('-1M'); } elsif ($action eq 'set') { # Set the options. $cookie = CGI::Cookie->new(-name => $PROGRAM); my %options = $query->Vars(); delete($options{$_}) for qw(url uri check cookie); # Non-persistent. $cookie->value(\%options); } } if (!$cookie) { my %cookies = CGI::Cookie->fetch(); $cookie = $cookies{$PROGRAM}; } # Always refresh cookie expiration time. $cookie->expires('+1M') if ($cookie && !$cookie->expires()); # All Apache configurations don't set HTTP_AUTHORIZATION for CGI scripts. # If we're under mod_perl, there is a way around it... 
eval { local $SIG{__DIE__} = undef; my $auth = Apache2::RequestUtil->request()->headers_in()->{Authorization}; $ENV{HTTP_AUTHORIZATION} = $auth if $auth; } if (MP2() && !$ENV{HTTP_AUTHORIZATION}); $uri =~ s/^\s+//g; if ($uri =~ /:/) { $uri = URI->new($uri); } else { if ($uri =~ m|^//|) { $uri = URI->new("http:$uri"); } else { local $ENV{URL_GUESS_PATTERN} = ''; my $guess = URI::Heuristic::uf_uri($uri); if ($guess->scheme() && $ua->is_protocol_supported($guess)) { $uri = $guess; } else { $uri = URI->new("http://$uri"); } } } $uri = $uri->canonical(); $query->param("uri", $uri); &check_uri(scalar($query->Vars()), $uri, 1, $Opts{Depth}, $cookie); undef $query; # Not needed any more. &html_footer(); } ############################################################################### ################################ # Command line and usage stuff # ################################ sub parse_arguments () { require Encode::Locale; Encode::Locale::decode_argv(); require Getopt::Long; Getopt::Long->require_version(2.17); Getopt::Long->import('GetOptions'); Getopt::Long::Configure('bundling', 'no_ignore_case'); my $masq = ''; my @locs = (); GetOptions( 'help|h|?' => sub { usage(0) }, 'q|quiet' => sub { $Opts{Quiet} = 1; $Opts{Summary_Only} = 1; }, 's|summary' => \$Opts{Summary_Only}, 'b|broken' => sub { $Opts{Redirects} = 0; $Opts{Dir_Redirects} = 0; }, 'e|dir-redirects' => sub { $Opts{Dir_Redirects} = 0; }, 'v|verbose' => \$Opts{Verbose}, 'i|indicator' => \$Opts{Progress}, 'H|html' => \$Opts{HTML}, 'r|recursive' => sub { $Opts{Depth} = -1 if $Opts{Depth} == 0; }, 'l|location=s' => \@locs, 'X|exclude=s' => \$Opts{Exclude}, 'exclude-docs=s@' => \@{$Opts{Exclude_Docs}}, 'suppress-redirect=s@' => \@{$Opts{Suppress_Redirect}}, 'suppress-redirect-prefix=s@' => \@{$Opts{Suppress_Redirect_Prefix}}, 'suppress-temp-redirects' => \$Opts{Suppress_Temp_Redirects}, 'suppress-broken=s@' => \@{$Opts{Suppress_Broken}}, 'suppress-fragment=s@' => \@{$Opts{Suppress_Fragment}}, 'u|user=s' => \$Opts{User}, 'p|password=s' => \$Opts{Password}, 't|timeout=i' => \$Opts{Timeout}, 'C|connection-cache=i' => \$Opts{Connection_Cache_Size}, 'S|sleep=i' => \$Opts{Sleep_Time}, 'L|languages=s' => \$Opts{Accept_Language}, 'c|cookies=s' => \$Opts{Cookies}, 'R|no-referer' => \$Opts{No_Referer}, 'D|depth=i' => sub { $Opts{Depth} = $_[1] unless $_[1] == 0; }, 'd|domain=s' => \$Opts{Trusted}, 'masquerade=s' => \$masq, 'hide-same-realm' => \$Opts{Hide_Same_Realm}, 'V|version' => \&version, ) || usage(1); if ($masq) { $Opts{Masquerade} = 1; my @masq = split(/\s+/, $masq); if (scalar(@masq) != 2 || !defined($masq[0]) || $masq[0] !~ /\S/ || !defined($masq[1]) || $masq[1] !~ /\S/) { usage(1, "Error: --masquerade takes two whitespace separated URIs."); } else { require URI::file; $Opts{Masquerade_From} = $masq[0]; my $u = URI->new($masq[1]); $Opts{Masquerade_To} = $u->scheme() ? $u : URI::file->new_abs($masq[1]); } } if ($Opts{Accept_Language} && $Opts{Accept_Language} eq 'auto') { $Opts{Accept_Language} = &guess_language(); } if (($Opts{Sleep_Time} || 0) < 1) { warn( "*** Warning: minimum allowed sleep time is 1 second, resetting.\n" ); $Opts{Sleep_Time} = 1; } push(@{$Opts{Base_Locations}}, map { URI->new($_)->canonical() } @locs); $Opts{Depth} = -1 if ($Opts{Depth} == 0 && @locs); # Precompile/error-check regular expressions. if (defined($Opts{Exclude})) { eval { $Opts{Exclude} = qr/$Opts{Exclude}/o; }; &usage(1, "Error in exclude regexp: $@") if $@; } for my $i (0 .. 
$#{$Opts{Exclude_Docs}}) { eval { $Opts{Exclude_Docs}->[$i] = qr/$Opts{Exclude_Docs}->[$i]/; }; &usage(1, "Error in exclude-docs regexp: $@") if $@; } if (defined($Opts{Trusted})) { eval { $Opts{Trusted} = qr/$Opts{Trusted}/io; }; &usage(1, "Error in trusted domains regexp: $@") if $@; } # Sanity-check error-suppression arguments for my $i (0 .. $#{$Opts{Suppress_Redirect}}) { ${$Opts{Suppress_Redirect}}[$i] =~ s/ /->/; my $sr_arg = ${$Opts{Suppress_Redirect}}[$i]; if ($sr_arg !~ /.->./) { &usage(1, "Bad suppress-redirect argument, should contain \"->\": $sr_arg" ); } } for my $i (0 .. $#{$Opts{Suppress_Redirect_Prefix}}) { my $srp_arg = ${$Opts{Suppress_Redirect_Prefix}}[$i]; $srp_arg =~ s/ /->/; if ($srp_arg !~ /^(.*)->(.*)$/) { &usage(1, "Bad suppress-redirect-prefix argument, should contain \"->\": $srp_arg" ); } # Turn prefixes into a regexp. ${$Opts{Suppress_Redirect_Prefix}}[$i] = qr/^\Q$1\E(.*)->\Q$2\E\1$/ism; } for my $i (0 .. $#{$Opts{Suppress_Broken}}) { ${$Opts{Suppress_Broken}}[$i] =~ s/ /:/; my $sb_arg = ${$Opts{Suppress_Broken}}[$i]; if ($sb_arg !~ /^(-1|[0-9]+):./) { &usage(1, "Bad suppress-broken argument, should be prefixed by a numeric response code: $sb_arg" ); } } for my $sf_arg (@{$Opts{Suppress_Fragment}}) { if ($sf_arg !~ /.#./) { &usage(1, "Bad suppress-fragment argument, should contain \"#\": $sf_arg" ); } } return; } sub version () { print "$PACKAGE $REVISION\n"; exit 0; } sub usage () { my ($exitval, $msg) = @_; $exitval = 0 unless defined($exitval); $msg ||= ''; $msg =~ s/[\r\n]*$/\n\n/ if $msg; die($msg) unless $Opts{Command_Line}; my $trust = defined($Cfg{Trusted}) ? $Cfg{Trusted} : 'same host only'; select(STDERR) if $exitval; print "$msg$PACKAGE $REVISION Usage: checklink Options: -s, --summary Result summary only. -b, --broken Show only the broken links, not the redirects. -e, --directory Hide directory redirects, for example http://www.w3.org/TR -> http://www.w3.org/TR/ -r, --recursive Check the documents linked from the first one. -D, --depth N Check the documents linked from the first one to depth N (implies --recursive). -l, --location URI Scope of the documents checked in recursive mode (implies --recursive). Can be specified multiple times. If not specified, the default eg. for http://www.w3.org/TR/html4/Overview.html would be http://www.w3.org/TR/html4/ -X, --exclude REGEXP Do not check links whose full, canonical URIs match REGEXP; also limits recursion the same way as --exclude-docs with the same regexp would. --exclude-docs REGEXP In recursive mode, do not check links in documents whose full, canonical URIs match REGEXP. This option may be specified multiple times. --suppress-redirect URI->URI Do not report a redirect from the first to the second URI. This option may be specified multiple times. --suppress-redirect-prefix URI->URI Do not report a redirect from a child of the first URI to the same child of the second URI. This option may be specified multiple times. --suppress-temp-redirects Suppress warnings about temporary redirects. --suppress-broken CODE:URI Do not report a broken link with the given CODE. CODE is HTTP response, or -1 for robots exclusion. This option may be specified multiple times. --suppress-fragment URI Do not report the given broken fragment URI. A fragment URI contains \"#\". This option may be specified multiple times. -L, --languages LANGS Accept-Language header to send. The special value 'auto' causes autodetection from the environment. -c, --cookies FILE Use cookies, load/save them in FILE. 
The special value 'tmp' causes non-persistent use of cookies. -R, --no-referer Do not send the Referer HTTP header. -q, --quiet No output if no errors are found (implies -s). -v, --verbose Verbose mode. -i, --indicator Show percentage of lines processed while parsing. -u, --user USERNAME Specify a username for authentication. -p, --password PASSWORD Specify a password. --hide-same-realm Hide 401's that are in the same realm as the document checked. -S, --sleep SECS Sleep SECS seconds between requests to each server (default and minimum: 1 second). -t, --timeout SECS Timeout for requests in seconds (default: 30). -d, --domain DOMAIN Regular expression describing the domain to which authentication information will be sent (default: $trust). --masquerade \"BASE1 BASE2\" Masquerade base URI BASE1 as BASE2. See the manual page for more information. -H, --html HTML output. -?, -h, --help Show this message and exit. -V, --version Output version information and exit. See \"perldoc LWP\" for information about proxy server support, \"perldoc Net::FTP\" for information about various environment variables affecting FTP connections and \"perldoc Net::NNTP\" for setting a default NNTP server for news: URIs. The W3C_CHECKLINK_CFG environment variable can be used to set the configuration file to use. See details in the full manual page, it can be displayed with: perldoc checklink More documentation at: $Cfg{Doc_URI} Please send bug reports and comments to the www-validator mailing list: www-validator\@w3.org (with 'checklink' in the subject) Archives are at: http://lists.w3.org/Archives/Public/www-validator/ "; exit $exitval; } sub ask_password () { eval { local $SIG{__DIE__} = undef; require Term::ReadKey; Term::ReadKey->require_version(2.00); Term::ReadKey->import(qw(ReadMode)); }; if ($@) { warn('Warning: Term::ReadKey 2.00 or newer not available, ' . "password input disabled.\n"); return; } printf(STDERR 'Enter the password for user %s: ', $Opts{User}); ReadMode('noecho', *STDIN); chomp($Opts{Password} = ); ReadMode('restore', *STDIN); print(STDERR "ok.\n"); return; } ############################################################################### ########################################################################### # Guess an Accept-Language header based on the $LANG environment variable # ########################################################################### sub guess_language () { my $lang = $ENV{LANG} or return; $lang =~ s/[\.@].*$//; # en_US.UTF-8, fi_FI@euro... 
return 'en' if ($lang eq 'C' || $lang eq 'POSIX'); my $res = undef; eval { require Locale::Language; if (my $tmp = Locale::Language::language2code($lang)) { $lang = $tmp; } if (my ($l, $c) = (lc($lang) =~ /^([a-z]+)(?:[-_]([a-z]+))?/)) { if (Locale::Language::code2language($l)) { $res = $l; if ($c) { require Locale::Country; $res .= "-$c" if Locale::Country::code2country($c); } } } }; return $res; } ############################ # Transform foo into a URI # ############################ sub urize ($) { my $arg = shift; my $uarg = URI::Escape::uri_unescape($arg); my $uri; if (-d $uarg) { # look for an "index" file in dir, return it if found require File::Spec; for my $index (map { File::Spec->catfile($uarg, $_) } qw(index.html index.xhtml index.htm index.xhtm)) { if (-e $index) { $uri = URI::file->new_abs($index); last; } } # return dir itself if an index file was not found $uri ||= URI::file->new_abs($uarg); } elsif ($uarg =~ /^[.\/\\]/ || -e $uarg) { $uri = URI::file->new_abs($uarg); } else { my $newuri = URI->new($arg); if ($newuri->scheme()) { $uri = $newuri; } else { local $ENV{URL_GUESS_PATTERN} = ''; $uri = URI::Heuristic::uf_uri($arg); $uri = URI::file->new_abs($uri) unless $uri->scheme(); } } return $uri->canonical(); } ######################################## # Check for broken links in a resource # ######################################## sub check_uri (\%\$$$$;\$$) { my ($params, $uri, $check_num, $depth, $cookie, $referer, $is_start) = @_; $is_start ||= ($check_num == 1); my $start = $Opts{Summary_Only} ? 0 : &get_timestamp(); # Get and parse the document my $response = &get_document( 'GET', $uri, $doc_count, \%redirects, $referer, $cookie, $params, $check_num, $is_start ); # Can we check the resource? If not, we exit here... return if defined($response->{Stop}); if ($Opts{HTML}) { &html_header($uri, $cookie) if ($check_num == 1); &print_form($params, $cookie, $check_num) if $is_start; } if ($is_start) { # Starting point of a new check, e.g. from the command line # Use the first URI as the recursion base unless specified otherwise. push(@{$Opts{Base_Locations}}, $response->{absolute_uri}->canonical()) unless @{$Opts{Base_Locations}}; } else { # Before fetching the document, we don't know if we'll be within the # recursion scope or not (think redirects). if (!&in_recursion_scope($response->{absolute_uri})) { hprintf("Not in recursion scope: %s\n", $response->{absolute_uri}) if ($Opts{Verbose}); $response->content(""); return; } } # Define the document header, and perhaps print it. # (It might still be defined if the previous document had no errors; # just redefine it in that case.) if ($check_num != 1) { if ($Opts{HTML}) { $doc_header = "\n<hr />\n"; }
\n"; } else { $doc_header = "\n" . ('-' x 40) . "\n"; } } if ($Opts{HTML}) { $doc_header .= ("

\nProcessing\t" . &show_url($response->{absolute_uri}) . "\n
</h2>
\n\n"); } else { $doc_header .= "\nProcessing\t$response->{absolute_uri}\n\n"; } if (!$Opts{Quiet}) { print_doc_header(); } # We are checking a new document $doc_count++; my $result_anchor = 'results' . $doc_count; if ($check_num == 1 && !$Opts{HTML} && !$Opts{Summary_Only}) { my $s = $Opts{Sleep_Time} == 1 ? '' : 's'; my $acclang = $Opts{Accept_Language} || '(not sent)'; my $send_referer = $Opts{No_Referer} ? 'not sent' : 'sending'; my $cookies = 'not used'; if (defined($Opts{Cookies})) { $cookies = 'used, '; if ($Opts{Cookies} eq 'tmp') { $cookies .= 'non-persistent'; } else { $cookies .= "file $Opts{Cookies}"; } } printf( <<'EOF', $Accept, $acclang, $send_referer, $cookies, $Opts{Sleep_Time}, $s); Settings used: - Accept: %s - Accept-Language: %s - Referer: %s - Cookies: %s - Sleeping %d second%s between requests to each server EOF printf("- Excluding links matching %s\n", $Opts{Exclude}) if defined($Opts{Exclude}); printf("- Excluding links in documents whose URIs match %s\n", join(', ', @{$Opts{Exclude_Docs}})) if @{$Opts{Exclude_Docs}}; } if ($Opts{HTML}) { if (!$Opts{Summary_Only}) { my $accept = &encode($Accept); my $acclang = &encode($Opts{Accept_Language} || '(not sent)'); my $send_referer = $Opts{No_Referer} ? 'not sent' : 'sending'; my $s = $Opts{Sleep_Time} == 1 ? '' : 's'; printf( <<'EOF', $accept, $acclang, $send_referer, $Opts{Sleep_Time}, $s);
<h3>Settings used:</h3>
<ul>
<li>Accept: %s</li>
<li>Accept-Language: %s</li>
<li>Referer: %s</li>
<li>Sleeping %d second%s between requests to each server</li>
</ul>
EOF
        printf("
<p><a href=\"#%s\">
Go to the results.
</a></p>
\n", $result_anchor); my $esc_uri = URI::Escape::uri_escape($response->{absolute_uri}, "^A-Za-z0-9."); print "
<p>
For reliable link checking results, check "; if (!$response->{IsCss}) { printf("<a href=\"%s\">HTML validity</a> and ", &encode(sprintf($Cfg{Markup_Validator_URI}, $esc_uri))); } printf( "<a href=\"%s\">CSS validity</a> first.
</p>\n<p><a href=\"%s\">
Back to the link checker.
</a></p>
\n", &encode(sprintf($Cfg{CSS_Validator_URI}, $esc_uri)), &encode($Opts{_Self_URI}) ); printf(<<'EOF', $result_anchor);

Status:

EOF
        }
    }

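    # In summary-only mode, nothing else is printed until the whole document
    # has been checked, so tell the user up front that the wait is expected.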
    if ($Opts{Summary_Only} && !$Opts{Quiet}) {
        print '

' if $Opts{HTML}; print 'This may take some time'; print "... (why?)

" if $Opts{HTML}; print " if the document has many links to check.\n" unless $Opts{HTML}; } # Record that we have processed this resource $processed{$response->{absolute_uri}} = 1; # Parse the document my $p = &parse_document($uri, $response->base(), $response, 1, ($depth != 0)); my $base = URI->new($p->{base}); # Check anchors ############### print "Checking anchors...\n" unless $Opts{Summary_Only}; my %errors; while (my ($anchor, $lines) = each(%{$p->{Anchors}})) { if (!length($anchor)) { # Empty IDREF's are not allowed $errors{$anchor} = 1; } else { my $times = 0; $times += $_ for values(%$lines); # They should appear only once $errors{$anchor} = 1 if ($times > 1); } } print " done.\n" unless $Opts{Summary_Only}; # Check links ############# &hprintf("Recording all the links found: %d\n", scalar(keys %{$p->{Links}})) if ($Opts{Verbose}); my %links; my %hostlinks; # Record all the links found while (my ($link, $lines) = each(%{$p->{Links}})) { my $link_uri = URI->new($link); my $abs_link_uri = URI->new_abs($link_uri, $base); if ($Opts{Masquerade}) { if ($abs_link_uri =~ m|^\Q$Opts{Masquerade_From}\E|) { print_doc_header(); printf("processing %s in base %s\n", $abs_link_uri, $Opts{Masquerade_To}); my $nlink = $abs_link_uri; $nlink =~ s|^\Q$Opts{Masquerade_From}\E|$Opts{Masquerade_To}|; $abs_link_uri = URI->new($nlink); } } my $canon_uri = URI->new($abs_link_uri->canonical()); my $fragment = $canon_uri->fragment(undef); if (!defined($Opts{Exclude}) || $canon_uri !~ $Opts{Exclude}) { if (!exists($links{$canon_uri})) { my $hostport; $hostport = $canon_uri->host_port() if $canon_uri->can('host_port'); $hostport = '' unless defined $hostport; push(@{$hostlinks{$hostport}}, $canon_uri); } for my $line_num (keys(%$lines)) { if (!defined($fragment) || !length($fragment)) { # Document without fragment $links{$canon_uri}{location}{$line_num} = 1; } else { # Resource with a fragment $links{$canon_uri}{fragments}{$fragment}{$line_num} = 1; } } } } my @order = &distribute_links(\%hostlinks); undef %hostlinks; # Build the list of broken URI's my $nlinks = scalar(@order); &hprintf("Checking %d links to build list of broken URI's\n", $nlinks) if ($Opts{Verbose}); my %broken; my $link_num = 0; for my $u (@order) { my $ulinks = $links{$u}; if ($Opts{Summary_Only}) { # Hack: avoid browser/server timeouts in summary only CGI mode, bug 896 print ' ' if ($Opts{HTML} && !$Opts{Command_Line}); } else { &hprintf("\nChecking link %s\n", $u); my $progress = ($link_num / $nlinks) * 100; printf( '', $result_anchor, &encode($u), $progress) if (!$Opts{Command_Line} && $Opts{HTML} && !$Opts{Summary_Only}); } $link_num++; # Check that a link is valid &check_validity($uri, $u, ($depth != 0 && &in_recursion_scope($u)), \%links, \%redirects); &hprintf("\tReturn code: %s\n", $results{$u}{location}{code}) if ($Opts{Verbose}); if ($results{$u}{location}{success}) { # Even though it was not broken, we might want to display it # on the results page (e.g. because it required authentication) $broken{$u}{location} = 1 if ($results{$u}{location}{display} >= 400); # List the broken fragments while (my ($fragment, $lines) = each(%{$ulinks->{fragments}})) { my $fragment_ok = $results{$u}{fragments}{$fragment}; if ($Opts{Verbose}) { my @line_nums = sort { $a <=> $b } keys(%$lines); &hprintf( "\t\t%s %s - Line%s: %s\n", $fragment, $fragment_ok ? 'OK' : 'Not found', (scalar(@line_nums) > 1) ? 's' : '', join(', ', @line_nums) ); } # A broken fragment? 
$broken{$u}{fragments}{$fragment} += 2 unless $fragment_ok; } } elsif (!($Opts{Quiet} && &informational($results{$u}{location}{code}))) { # Couldn't find the document $broken{$u}{location} = 1; # All the fragments associated are hence broken for my $fragment (keys %{$ulinks->{fragments}}) { $broken{$u}{fragments}{$fragment}++; } } } &hprintf( "\nProcessed in %s seconds.\n", &time_diff($start, &get_timestamp()) ) unless $Opts{Summary_Only}; printf( '', $result_anchor, &time_diff($start, &get_timestamp())) if ($Opts{HTML} && !$Opts{Summary_Only}); # Display results if ($Opts{HTML} && !$Opts{Summary_Only}) { print("
\n
\n"); printf("

Results

\n", $result_anchor); } print "\n" unless $Opts{Quiet}; &links_summary(\%links, \%results, \%broken, \%redirects); &anchors_summary($p->{Anchors}, \%errors); # Do we want to process other documents? if ($depth != 0) { for my $u (map { URI->new($_) } keys %links) { next unless $results{$u}{location}{success}; # Broken link? next unless &in_recursion_scope($u); # Do we understand its content type? next unless ($results{$u}{location}{type} =~ $ContentTypes); # Have we already processed this URI? next if &already_processed($u, $uri); # Do the job print "\n" unless $Opts{Quiet}; if ($Opts{HTML}) { if (!$Opts{Command_Line}) { if ($doc_count == $Opts{Max_Documents}) { print( "
\n

Maximum number of documents ($Opts{Max_Documents}) reached!

\n" ); } if ($doc_count >= $Opts{Max_Documents}) { $doc_count++; print("

Not checking $u

\n"); $processed{$u} = 1; next; } } } # This is an inherently recursive algorithm, so Perl's warning is not # helpful. You may wish to comment this out when debugging, though. no warnings 'recursion'; if ($depth < 0) { &check_uri($params, $u, 0, -1, $cookie, $uri); } else { &check_uri($params, $u, 0, $depth - 1, $cookie, $uri); } } } return; } ############################################################### # Distribute links based on host:port to avoid RobotUA delays # ############################################################### sub distribute_links(\%) { my $hostlinks = shift; # Hosts ordered by weight (number of links), descending my @order = sort { scalar(@{$hostlinks->{$b}}) <=> scalar(@{$hostlinks->{$a}}) } keys %$hostlinks; # All link list flattened into one, in host weight order my @all; push(@all, @{$hostlinks->{$_}}) for @order; return @all if (scalar(@order) < 2); # Indexes and chunk size for "zipping" the end result list my $num = scalar(@{$hostlinks->{$order[0]}}); my @indexes = map { $_ * $num } (0 .. $num - 1); # Distribute them my @result; while (my @chunk = splice(@all, 0, $num)) { @result[@indexes] = @chunk; @indexes = map { $_ + 1 } @indexes; } # Weed out undefs @result = grep(defined, @result); return @result; } ########################################## # Decode Content-Encodings in a response # ########################################## sub decode_content ($) { my $response = shift; my $error = undef; my $docref = $response->decoded_content(ref => 1); if (defined($docref)) { utf8::encode($$docref); $response->content_ref($docref); # Remove Content-Encoding so it won't be decoded again later. $response->remove_header('Content-Encoding'); } else { my $ce = $response->header('Content-Encoding'); $ce = defined($ce) ? "'$ce'" : 'undefined'; my $ct = $response->header('Content-Type'); $ct = defined($ct) ? "'$ct'" : 'undefined'; my $request_uri = $response->request->url; my $cs = $response->content_charset(); $cs = defined($cs) ? "'$cs'" : 'unknown'; $error = "Error decoding document at <$request_uri>, Content-Type $ct, " . "Content-Encoding $ce, content charset $cs: '$@'"; } return $error; } ####################################### # Get and parse a resource to process # ####################################### sub get_document ($\$$;\%\$$$$$) { my ($method, $uri, $in_recursion, $redirects, $referer, $cookie, $params, $check_num, $is_start ) = @_; # $method contains the HTTP method the use (GET or HEAD) # $uri object contains the identifier of the resource # $in_recursion is > 0 if we are in recursion mode (i.e. it is at least # the second resource checked) # $redirects is a pointer to the hash containing the map of the redirects # $referer is the URI object of the referring document # $cookie, $params, $check_num, and $is_start are for printing HTTP headers # and the form if $in_recursion == 0 and not authenticating # Get the resource my $response; if (defined($results{$uri}{response}) && !($method eq 'GET' && $results{$uri}{method} eq 'HEAD')) { $response = $results{$uri}{response}; } else { $response = &get_uri($method, $uri, $referer); &record_results($uri, $method, $response, $referer); &record_redirects($redirects, $response); } if (!$response->is_success()) { if (!$in_recursion) { # Is it too late to request authentication? if ($response->code() == 401) { &authentication($response, $cookie, $params, $check_num, $is_start); } else { if ($Opts{HTML}) { &html_header($uri, $cookie) if ($check_num == 1); &print_form($params, $cookie, $check_num) if $is_start; print "
<p>
", &status_icon($response->code()); } &hprintf("\nError: %d %s\n", $response->code(), $response->message() || '(no message)'); print "
</p>
\n" if $Opts{HTML}; } } $response->{Stop} = 1; $response->content(""); return ($response); } # What is the URI of the resource that we are processing by the way? my $base_uri = $response->base(); my $request_uri = URI->new($response->request->url); $response->{absolute_uri} = $request_uri->abs($base_uri); # Can we parse the document? my $failed_reason; my $ct = $response->header('Content-Type'); if (!$ct || $ct !~ $ContentTypes) { $failed_reason = "Content-Type for <$request_uri> is " . (defined($ct) ? "'$ct'" : 'undefined'); } else { $failed_reason = decode_content($response); } if ($failed_reason) { # No, there is a problem... if (!$in_recursion) { if ($Opts{HTML}) { &html_header($uri, $cookie) if ($check_num == 1); &print_form($params, $cookie, $check_num) if $is_start; print "
<p>
", &status_icon(406); } &hprintf("Can't check links: %s.\n", $failed_reason); print "
</p>
\n" if $Opts{HTML}; } $response->{Stop} = 1; $response->content(""); } # Ok, return the information return ($response); } ######################################################### # Check whether a URI is within the scope of recursion. # ######################################################### sub in_recursion_scope (\$) { my ($uri) = @_; return 0 unless $uri; my $candidate = $uri->canonical(); return 0 if (defined($Opts{Exclude}) && $candidate =~ $Opts{Exclude}); for my $excluded_doc (@{$Opts{Exclude_Docs}}) { return 0 if ($candidate =~ $excluded_doc); } for my $base (@{$Opts{Base_Locations}}) { my $rel = $candidate->rel($base); next if ($candidate eq $rel); # Relative path not possible? next if ($rel =~ m|^(\.\.)?/|); # Relative path upwards? return 1; } return 0; # We always have at least one base location, but none matched. } ################################# # Check for content type match. # ################################# sub is_content_type ($$) { my ($candidate, $type) = @_; return 0 unless ($candidate && $type); my @v = HTTP::Headers::Util::split_header_words($candidate); return scalar(@v) ? $type eq lc($v[0]->[0]) : 0; } ################################################## # Check whether a URI has already been processed # ################################################## sub already_processed (\$\$) { my ($uri, $referer) = @_; # Don't be verbose for that part... my $summary_value = $Opts{Summary_Only}; $Opts{Summary_Only} = 1; # Do a GET: if it fails, we stop, if not, the results are cached my $response = &get_document('GET', $uri, 1, undef, $referer); # ... but just for that part $Opts{Summary_Only} = $summary_value; # Can we process the resource? return -1 if defined($response->{Stop}); # Have we already processed it? return 1 if defined($processed{$response->{absolute_uri}->as_string()}); # It's not processed yet and it is processable: return 0 return 0; } ############################ # Get the content of a URI # ############################ sub get_uri ($\$;\$$\%$$$$) { # Here we have a lot of extra parameters in order not to lose information # if the function is called several times (401's) my ($method, $uri, $referer, $start, $redirects, $code, $realm, $message, $auth ) = @_; # $method contains the method used # $uri object contains the target of the request # $referer is the URI object of the referring document # $start is a timestamp (not defined the first time the function is called) # $redirects is a map of redirects # $code is the first HTTP return code # $realm is the realm of the request # $message is the HTTP message received # $auth equals 1 if we want to send out authentication information # For timing purposes $start = &get_timestamp() unless defined($start); # Prepare the query # Do we want printouts of progress? my $verbose_progress = !($Opts{Summary_Only} || (!$doc_count && $Opts{HTML})); &hprintf("%s %s ", $method, $uri) if $verbose_progress; my $request = HTTP::Request->new($method, $uri); $request->header('Accept-Language' => $Opts{Accept_Language}) if $Opts{Accept_Language}; $request->header('Accept', $Accept); $request->accept_decodable(); # Are we providing authentication info? if ($auth && $request->url()->host() =~ $Opts{Trusted}) { if (defined($ENV{HTTP_AUTHORIZATION})) { $request->header(Authorization => $ENV{HTTP_AUTHORIZATION}); } elsif (defined($Opts{User}) && defined($Opts{Password})) { $request->authorization_basic($Opts{User}, $Opts{Password}); } } # Tell the user agent if we want progress reports for redirects or not. 
$ua->redirect_progress_callback(sub { &hprintf("\n-> %s %s ", @_); }) if $verbose_progress; # Set referer $request->referer($referer) if (!$Opts{No_Referer} && $referer); # Telling caches in the middle we want a fresh copy (Bug 4998) $request->header(Cache_Control => "max-age=0"); # Do the query my $response = $ua->request($request); # Get the results # Record the very first response if (!defined($code)) { ($code, $message) = delete(@$ua{qw(FirstResponse FirstMessage)}); } # Authentication requested? if ($response->code() == 401 && !defined($auth) && (defined($ENV{HTTP_AUTHORIZATION}) || (defined($Opts{User}) && defined($Opts{Password}))) ) { # Set host as trusted domain unless we already have one. if (!$Opts{Trusted}) { my $re = sprintf('^%s$', quotemeta($response->base()->host())); $Opts{Trusted} = qr/$re/io; } # Deal with authentication and avoid loops if (!defined($realm) && $response->www_authenticate() =~ /Basic realm=\"([^\"]+)\"/) { $realm = $1; } print "\n" if $verbose_progress; return &get_uri($method, $response->request()->url(), $referer, $start, $redirects, $code, $realm, $message, 1); } # @@@ subtract robot delay from the "fetched in" time? &hprintf(" fetched in %s seconds\n", &time_diff($start, &get_timestamp())) if $verbose_progress; $response->{IsCss} = is_content_type($response->content_type(), "text/css"); $response->{Realm} = $realm if defined($realm); return $response; } ######################################### # Record the results of an HTTP request # ######################################### sub record_results (\$$$$) { my ($uri, $method, $response, $referer) = @_; $results{$uri}{referer} = $referer; $results{$uri}{response} = $response; $results{$uri}{method} = $method; $results{$uri}{location}{code} = $response->code(); $results{$uri}{location}{code} = RC_ROBOTS_TXT() if ($results{$uri}{location}{code} == 403 && $response->message() =~ /Forbidden by robots\.txt/); $results{$uri}{location}{code} = RC_IP_DISALLOWED() if ($results{$uri}{location}{code} == 403 && $response->message() =~ /non-public IP/); $results{$uri}{location}{code} = RC_DNS_ERROR() if ($results{$uri}{location}{code} == 500 && $response->message() =~ /Bad hostname '[^\']*'/); $results{$uri}{location}{code} = RC_PROTOCOL_DISALLOWED() if ($results{$uri}{location}{code} == 500 && $response->message() =~ /Access to '[^\']*' URIs has been disabled/); $results{$uri}{location}{type} = $response->header('Content-type'); $results{$uri}{location}{display} = $results{$uri}{location}{code}; # Rewind, check for the original code and message. for (my $tmp = $response->previous(); $tmp; $tmp = $tmp->previous()) { $results{$uri}{location}{orig} = $tmp->code(); $results{$uri}{location}{orig_message} = $tmp->message() || '(no message)'; } $results{$uri}{location}{success} = $response->is_success(); # If a suppressed broken link, fill the data structure like a typical success. # print STDERR "success? " . $results{$uri}{location}{success} . ": $uri\n"; if (!$results{$uri}{location}{success}) { my $code = $results{$uri}{location}{code}; my $match = grep { $_ eq "$code:$uri" } @{$Opts{Suppress_Broken}}; if ($match) { $results{$uri}{location}{success} = 1; $results{$uri}{location}{code} = 100; $results{$uri}{location}{display} = 100; } } # Stores the authentication information if (defined($response->{Realm})) { $results{$uri}{location}{realm} = $response->{Realm}; $results{$uri}{location}{display} = 401 unless $Opts{Hide_Same_Realm}; } # What type of broken link is it? 
(stored in {record} - the {display} # information is just for visual use only) if ($results{$uri}{location}{display} == 401 && $results{$uri}{location}{code} == 404) { $results{$uri}{location}{record} = 404; } else { $results{$uri}{location}{record} = $results{$uri}{location}{display}; } # Did it fail? $results{$uri}{location}{message} = $response->message() || '(no message)'; if (!$results{$uri}{location}{success}) { &hprintf( "Error: %d %s\n", $results{$uri}{location}{code}, $results{$uri}{location}{message} ) if ($Opts{Verbose}); } return; } #################### # Parse a document # #################### sub parse_document (\$\$$$$) { my ($uri, $base_uri, $response, $links, $rec_needs_links) = @_; print("parse_document($uri, $base_uri, ..., $links, $rec_needs_links)\n") if $Opts{Verbose}; my $p; if (defined($results{$uri}{parsing})) { # We have already done the job. Woohoo! $p->{base} = $results{$uri}{parsing}{base}; $p->{Anchors} = $results{$uri}{parsing}{Anchors}; $p->{Links} = $results{$uri}{parsing}{Links}; return $p; } $p = W3C::LinkChecker->new(); $p->{base} = $base_uri; my $stype = $response->header("Content-Style-Type"); $p->{style_is_css} = !$stype || is_content_type($stype, "text/css"); my $start; if (!$Opts{Summary_Only}) { $start = &get_timestamp(); print("Parsing...\n"); } # Content-Encoding etc already decoded in get_document(). my $docref = $response->content_ref(); # Count lines beforehand if needed (for progress indicator, or CSS while # we don't get any line context out of the parser). In case of HTML, the # actual final number of lines processed shown is populated by our # end_document handler. $p->{Total} = ($$docref =~ tr/\n//) if ($response->{IsCss} || $Opts{Progress}); # We only look for anchors if we are not interested in the links # obviously, or if we are running a recursive checking because we # might need this information later $p->{only_anchors} = !($links || $rec_needs_links); if ($response->{IsCss}) { # Parse as CSS $p->parse_css($$docref, LINE_UNKNOWN()); } else { # Parse as HTML # Transform into for parsing # Processing instructions are not parsed by process, but in this case # it should be. It's expensive, it's horrible, but it's the easiest way # for right now. $$docref =~ s/\<\?(xml:stylesheet.*?)\?\>/\<$1\>/ unless $p->{only_anchors}; $p->xml_mode(1) if ($response->content_type() =~ /\+xml$/); $p->parse($$docref)->eof(); } $response->content(""); if (!$Opts{Summary_Only}) { my $stop = &get_timestamp(); print "\r" if $Opts{Progress}; &hprintf(" done (%d lines in %s seconds).\n", $p->{Total}, &time_diff($start, $stop)); } # Save the results before exiting $results{$uri}{parsing}{base} = $p->{base}; $results{$uri}{parsing}{Anchors} = $p->{Anchors}; $results{$uri}{parsing}{Links} = $p->{Links}; return $p; } #################################### # Constructor for W3C::LinkChecker # #################################### sub new { my $p = HTML::Parser::new(@_, api_version => 3); $p->utf8_mode(1); # Set up handlers $p->handler(start => 'start', 'self, tagname, attr, line'); $p->handler(end => 'end', 'self, tagname, line'); $p->handler(text => 'text', 'self, dtext, line'); $p->handler( declaration => sub { my $self = shift; $self->declaration(substr($_[0], 2, -1)); }, 'self, text, line' ); $p->handler(end_document => 'end_document', 'self, line'); if ($Opts{Progress}) { $p->handler(default => 'parse_progress', 'self, line'); $p->{last_percentage} = 0; } # Check ? $p->{check_name} = 1; # Check <[..] id="..">? 
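# (These two flags control whether anchors are collected from <a name="...">
# and from id attributes, respectively; doctype() below turns either off for
# document types where it does not apply.)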
$p->{check_id} = 1; # Don't interpret comment loosely $p->strict_comment(1); return $p; } ################################################# # Record or return the doctype of the document # ################################################# sub doctype { my ($self, $dc) = @_; return $self->{doctype} unless $dc; $_ = $self->{doctype} = $dc; # What to look for depending on the doctype # Check for ? $self->{check_name} = 0 if m%^-//(W3C|WAPFORUM)//DTD XHTML (Basic|Mobile) %; # Check for <* id="...">? $self->{check_id} = 0 if (m%^-//IETF//DTD HTML [23]\.0//% || m%^-//W3C//DTD HTML 3\.2//%); # Enable XML mode (XHTML, XHTML Mobile, XHTML-Print, XHTML+RDFa, ...) $self->xml_mode(1) if (m%^-//(W3C|WAPFORUM)//DTD XHTML[ \-\+]%); return; } ################################### # Print parse progress indication # ################################### sub parse_progress { my ($self, $line) = @_; return unless defined($line) && $line > 0 && $self->{Total} > 0; my $percentage = int($line / $self->{Total} * 100); if ($percentage != $self->{last_percentage}) { printf("\r%4d%%", $percentage); $self->{last_percentage} = $percentage; } return; } ############################# # Extraction of the anchors # ############################# sub get_anchor { my ($self, $tag, $attr) = @_; my $anchor = $self->{check_id} ? $attr->{id} : undef; if ($self->{check_name} && ($tag eq 'a')) { # @@@@ In XHTML, is mandatory # Force an error if it's not the case (or if id's and name's values # are different) # If id is defined, name if defined must have the same value $anchor ||= $attr->{name}; } return $anchor; } ############################# # W3C::LinkChecker handlers # ############################# sub add_link { my ($self, $uri, $base, $line) = @_; if (defined($uri)) { # Remove repeated slashes after the . or .. in relative links, to avoid # duplicated checking or infinite recursion. $uri =~ s|^(\.\.?/)/+|$1|o; $uri = Encode::decode_utf8($uri); $uri = URI->new_abs($uri, $base) if defined($base); $self->{Links}{$uri}{defined($line) ? $line : LINE_UNKNOWN()}++; } return; } sub start { my ($self, $tag, $attr, $line) = @_; $line = LINE_UNKNOWN() unless defined($line); # Anchors my $anchor = $self->get_anchor($tag, $attr); $self->{Anchors}{$anchor}{$line}++ if defined($anchor); # Links if (!$self->{only_anchors}) { my $tag_local_base = undef; # Special case: base/@href # @@@TODO: The reason for handling ourselves is that LWP's # head parsing magic fails at least for responses that have # Content-Encodings: https://rt.cpan.org/Ticket/Display.html?id=54361 if ($tag eq 'base') { # Ignore with missing/empty href. $self->{base} = $attr->{href} if (defined($attr->{href}) && length($attr->{href})); } # Special case: meta[@http-equiv=Refresh]/@content elsif ($tag eq 'meta') { if ($attr->{'http-equiv'} && lc($attr->{'http-equiv'}) eq 'refresh') { my $content = $attr->{content}; if ($content && $content =~ /.*?;\s*(?:url=)?(.+)/i) { $self->add_link($1, undef, $line); } } } # Special case: tags that have "local base" elsif ($tag eq 'applet' || $tag eq 'object') { if (my $codebase = $attr->{codebase}) { # Applet codebases are directories, append trailing slash # if it's not there so that new_abs does the right thing. $codebase .= "/" if ($tag eq 'applet' && $codebase !~ m|/$|); # TODO: HTML 4 spec says applet/@codebase may only point to # subdirs of the directory containing the current document. # Should we do something about that? 
$tag_local_base = URI->new_abs($codebase, $self->{base}); } } # Link attributes: if (my $link_attrs = LINK_ATTRS()->{$tag}) { for my $la (@$link_attrs) { $self->add_link($attr->{$la}, $tag_local_base, $line); } } # List of links attributes: if (my $link_attrs = LINK_LIST_ATTRS()->{$tag}) { my ($sep, $attrs) = @$link_attrs; for my $la (@$attrs) { if (defined(my $value = $attr->{$la})) { for my $link (split($sep, $value)) { $self->add_link($link, $tag_local_base, $line); } } } } # Inline CSS: delete $self->{csstext}; if ($tag eq 'style') { $self->{csstext} = '' if ((!$attr->{type} && $self->{style_is_css}) || is_content_type($attr->{type}, "text/css")); } elsif ($self->{style_is_css} && (my $style = $attr->{style})) { $style = CSS::DOM::Style::parse($style); $self->parse_style($style, $line); } } $self->parse_progress($line) if $Opts{Progress}; return; } sub end { my ($self, $tagname, $line) = @_; $self->parse_css($self->{csstext}, $line) if ($tagname eq 'style'); delete $self->{csstext}; $self->parse_progress($line) if $Opts{Progress}; return; } sub parse_css { my ($self, $css, $line) = @_; return unless $css; my $sheet = CSS::DOM::parse($css); for my $rule (@{$sheet->cssRules()}) { if ($rule->type() == IMPORT_RULE()) { $self->add_link($rule->href(), $self->{base}, $line); } elsif ($rule->type == STYLE_RULE()) { $self->parse_style($rule->style(), $line); } } return; } sub parse_style { my ($self, $style, $line) = @_; return unless $style; for (my $i = 0, my $len = $style->length(); $i < $len; $i++) { my $prop = $style->item($i); my $val = $style->getPropertyValue($prop); while ($val =~ /$CssUrl/go) { my $url = CSS::DOM::Util::unescape($2); $self->add_link($url, $self->{base}, $line); } } return; } sub declaration { my ($self, $text, $line) = @_; # Extract the doctype my @declaration = split(/\s+/, $text, 4); if ($#declaration >= 3 && $declaration[0] eq 'DOCTYPE' && lc($declaration[1]) eq 'html') { # Parse the doctype declaration if ($text =~ m/^DOCTYPE\s+html\s+(?:PUBLIC\s+"([^"]+)"|SYSTEM)(\s+"([^"]+)")?\s*$/i ) { # Store the doctype $self->doctype($1) if $1; # If there is a link to the DTD, record it $self->add_link($3, undef, $line) if (!$self->{only_anchors} && $3); } } $self->text($text) unless $self->{only_anchors}; return; } sub text { my ($self, $text, $line) = @_; $self->{csstext} .= $text if defined($self->{csstext}); $self->parse_progress($line) if $Opts{Progress}; return; } sub end_document { my ($self, $line) = @_; $self->{Total} = $line; delete $self->{csstext}; return; } ################################ # Check the validity of a link # ################################ sub check_validity (\$\$$\%\%) { my ($referer, $uri, $want_links, $links, $redirects) = @_; # $referer is the URI object of the document checked # $uri is the URI object of the target that we are verifying # $want_links is true if we're interested in links in the target doc # $links is a hash of the links in the documents checked # $redirects is a map of the redirects encountered # Get the document with the appropriate method: GET if there are # fragments to check or links are wanted, HEAD is enough otherwise. my $fragments = $links->{$uri}{fragments} || {}; my $method = ($want_links || %$fragments) ? 
'GET' : 'HEAD'; my $response; my $being_processed = 0; if (!defined($results{$uri}) || ($method eq 'GET' && $results{$uri}{method} eq 'HEAD')) { $being_processed = 1; $response = &get_uri($method, $uri, $referer); # Get the information back from get_uri() &record_results($uri, $method, $response, $referer); # Record the redirects &record_redirects($redirects, $response); } elsif (!($Opts{Summary_Only} || (!$doc_count && $Opts{HTML}))) { my $ref = $results{$uri}{referer}; &hprintf("Already checked%s\n", $ref ? ", referrer $ref" : "."); } # We got the response of the HTTP request. Stop here if it was a HEAD. return if ($method eq 'HEAD'); # There are fragments. Parse the document. my $p; if ($being_processed) { # Can we really parse the document? if (!defined($results{$uri}{location}{type}) || $results{$uri}{location}{type} !~ $ContentTypes) { &hprintf("Can't check content: Content-Type for '%s' is '%s'.\n", $uri, $results{$uri}{location}{type}) if ($Opts{Verbose}); $response->content(""); return; } # Do it then if (my $error = decode_content($response)) { &hprintf("%s.\n", $error); } # @@@TODO: this isn't the best thing to do if a decode error occurred $p = &parse_document($uri, $response->base(), $response, 0, $want_links); } else { # We already had the information $p->{Anchors} = $results{$uri}{parsing}{Anchors}; } # Check that the fragments exist for my $fragment (keys %$fragments) { if (defined($p->{Anchors}{$fragment}) || &escape_match($fragment, $p->{Anchors}) || grep { $_ eq "$uri#$fragment" } @{$Opts{Suppress_Fragment}}) { $results{$uri}{fragments}{$fragment} = 1; } else { $results{$uri}{fragments}{$fragment} = 0; } } return; } sub escape_match ($\%) { my ($a, $hash) = (URI::Escape::uri_unescape($_[0]), $_[1]); for my $b (keys %$hash) { return 1 if ($a eq URI::Escape::uri_unescape($b)); } return 0; } ########################## # Ask for authentication # ########################## sub authentication ($;$$$$) { my ($response, $cookie, $params, $check_num, $is_start) = @_; my $realm = ''; if ($response->www_authenticate() =~ /Basic realm=\"([^\"]+)\"/) { $realm = $1; } if ($Opts{Command_Line}) { printf STDERR <<'EOF', $response->request()->url(), $realm;
Authentication is required for %s.
The realm is "%s".
Use the -u and -p options to specify a username and password and the -d
option to specify trusted domains.
EOF
} else { printf( "Status: 401 Authorization Required\nWWW-Authenticate: %s\n%sConnection: close\nContent-Language: en\nContent-Type: text/html; charset=utf-8\n\n", $response->www_authenticate(), $cookie ? "Set-Cookie: $cookie\n" : "", ); printf( "%s W3C Link Checker: 401 Authorization Required %s ", $DocType, $Head ); &banner(': 401 Authorization Required'); &print_form($params, $cookie, $check_num) if $is_start; printf( '

%s You need "%s" access to %s to perform link checking.
', &status_icon(401), &encode($realm), (&encode($response->request()->url())) x 2 ); my $host = $response->request()->url()->host(); if ($Opts{Trusted} && $host !~ $Opts{Trusted}) { printf <<'EOF', &encode($Opts{Trusted}), &encode($host); This service has been configured to send authentication only to hostnames matching the regular expression %s, but the hostname %s does not match it. EOF } print "

\n"; } return; } ################## # Get statistics # ################## sub get_timestamp () { return pack('LL', Time::HiRes::gettimeofday()); } sub time_diff ($$) { my @start = unpack('LL', $_[0]); my @stop = unpack('LL', $_[1]); for ($start[1], $stop[1]) { $_ /= 1_000_000; } return (sprintf("%.2f", ($stop[0] + $stop[1]) - ($start[0] + $start[1]))); } ######################## # Handle the redirects # ######################## # Record the redirects in a hash sub record_redirects (\%$) { my ($redirects, $response) = @_; for (my $prev = $response->previous(); $prev; $prev = $prev->previous()) { # Check for redirect match. my $from = $prev->request()->url(); my $to = $response->request()->url(); # same on every loop iteration my $from_to = $from . '->' . $to; my $match = grep { $_ eq $from_to } @{$Opts{Suppress_Redirect}}; # print STDERR "Result $match of redirect checking $from_to\n"; if ($match) { next; } $match = grep { $from_to =~ /$_/ } @{$Opts{Suppress_Redirect_Prefix}}; # print STDERR "Result $match of regexp checking $from_to\n"; if ($match) { next; } my $c = $prev->code(); if ($Opts{Suppress_Temp_Redirects} && ($c == 307 || $c == 302)) { next; } $redirects->{$prev->request()->url()} = $response->request()->url(); } return; } # Determine if a request is redirected sub is_redirected ($%) { my ($uri, %redirects) = @_; return (defined($redirects{$uri})); } # Get a list of redirects for a URI sub get_redirects ($%) { my ($uri, %redirects) = @_; my @history = ($uri); my %seen = ($uri => 1); # for tracking redirect loops my $loop = 0; while ($redirects{$uri}) { $uri = $redirects{$uri}; push(@history, $uri); if ($seen{$uri}) { $loop = 1; last; } else { $seen{$uri}++; } } return ($loop, @history); } #################################################### # Tool for sorting the unique elements of an array # #################################################### sub sort_unique (@) { my %saw; @saw{@_} = (); return (sort { $a <=> $b } keys %saw); } ##################### # Print the results # ##################### sub line_number ($) { my $line = shift; return $line if ($line >= 0); return "(N/A)"; } sub http_rc ($) { my $rc = shift; return $rc if ($rc >= 0); return "(N/A)"; } # returns true if the given code is informational sub informational ($) { my $rc = shift; return $rc == RC_ROBOTS_TXT() || $rc == RC_IP_DISALLOWED() || $rc == RC_PROTOCOL_DISALLOWED(); } sub anchors_summary (\%\%) { my ($anchors, $errors) = @_; # Number of anchors found. my $n = scalar(keys(%$anchors)); if (!$Opts{Quiet}) { if ($Opts{HTML}) { print("
<h3>
Anchors
</h3>
\n
<p>
"); } else { print("Anchors\n\n"); } &hprintf("Found %d anchor%s.\n", $n, ($n == 1) ? '' : 's'); print("
</p>
\n") if $Opts{HTML}; } # List of the duplicates, if any. my @errors = keys %{$errors}; if (!scalar(@errors)) { print("
<p>
Valid anchors!
</p>
\n") if (!$Opts{Quiet} && $Opts{HTML} && $n); return; } undef $n; print_doc_header(); print('

') if $Opts{HTML}; print('List of duplicate and empty anchors'); print <<'EOF' if $Opts{HTML};

EOF print("\n"); for my $anchor (@errors) { my $format; my @unique = &sort_unique( map { line_number($_) } keys %{$anchors->{$anchor}} ); if ($Opts{HTML}) { $format = "\n"; } else { my $s = (scalar(@unique) > 1) ? 's' : ''; $format = "\t%s\tLine$s: %s\n"; } printf($format, &encode(length($anchor) ? $anchor : 'Empty anchor'), join(', ', @unique)); } print("\n
Anchor Lines
%s%s
\n") if $Opts{HTML}; return; } sub show_link_report (\%\%\%\%\@;$\%) { my ($links, $results, $broken, $redirects, $urls, $codes, $todo) = @_; print("\n
") if $Opts{HTML}; print("\n") if (!$Opts{Quiet}); # Process each URL my ($c, $previous_c); for my $u (@$urls) { my @fragments = keys %{$broken->{$u}{fragments}}; # Did we get a redirect? my $redirected = &is_redirected($u, %$redirects); # List of lines my @total_lines; push(@total_lines, keys(%{$links->{$u}{location}})); for my $f (@fragments) { push(@total_lines, keys(%{$links->{$u}{fragments}{$f}})) unless ($f eq $u && defined($links->{$u}{$u}{LINE_UNKNOWN()})); } my ($redirect_loop, @redirects_urls) = get_redirects($u, %$redirects); my $currloc = $results->{$u}{location}; # Error type $c = &code_shown($u, $results); # What to do my $whattodo; my $redirect_too; if ($todo) { if ($u =~ m/^javascript:/) { if ($Opts{HTML}) { $whattodo = 'You must change this link: people using a browser without JavaScript support will not be able to follow this link. See the Web Content Accessibility Guidelines on the use of scripting on the Web and the techniques on how to solve this.'; } else { $whattodo = 'Change this link: people using a browser without JavaScript support will not be able to follow this link.'; } } elsif ($c == RC_ROBOTS_TXT()) { $whattodo = 'The link was not checked due to robots exclusion ' . 'rules. Check the link manually.'; } elsif ($redirect_loop) { $whattodo = 'Retrieving the URI results in a redirect loop, that should be ' . 'fixed. Examine the redirect sequence to see where the loop ' . 'occurs.'; } else { $whattodo = $todo->{$c}; } } elsif (defined($redirects{$u})) { # Redirects if (($u . '/') eq $redirects{$u}) { $whattodo = 'The link is missing a trailing slash, and caused a redirect. Adding the trailing slash would speed up browsing.'; } elsif ($c == 307 || $c == 302) { $whattodo = 'This is a temporary redirect. Update the link if you believe it makes sense, or leave it as is.'; } elsif ($c == 301) { $whattodo = 'This is a permanent redirect. The link should be updated.'; } } my @unique = &sort_unique(map { line_number($_) } @total_lines); my $lines_list = join(', ', @unique); my $s = (scalar(@unique) > 1) ? 's' : ''; undef @unique; my @http_codes = ($currloc->{code}); unshift(@http_codes, $currloc->{orig}) if $currloc->{orig}; @http_codes = map { http_rc($_) } @http_codes; if ($Opts{HTML}) { # Style stuff my $idref = ''; if ($codes && (!defined($previous_c) || ($c != $previous_c))) { $idref = ' id="d' . $doc_count . 'code_' . $c . '"'; $previous_c = $c; } # Main info for (@redirects_urls) { $_ = &show_url($_); } # HTTP message my $http_message; if ($currloc->{message}) { $http_message = &encode($currloc->{message}); if ($c == 404 || $c == 500) { $http_message = '' . $http_message . ''; } } my $redirmsg = $redirect_loop ? ' redirect loop detected' : ''; printf(" %s Line%s: %s %s
Status: %s %s %s

%s %s

\n", # Anchor for return codes $idref, # Color &status_icon($c), $s, # List of lines $lines_list, # List of redirects $redirected ? join(' redirected to ', @redirects_urls) . $redirmsg : &show_url($u), # Realm defined($currloc->{realm}) ? sprintf('Realm: %s
', &encode($currloc->{realm})) : '', # HTTP original message # defined($currloc->{orig_message}) # ? &encode($currloc->{orig_message}). # ' -> ' # : '', # Response code chain join( ' -> ', map { &encode($_) } @http_codes), # HTTP final message $http_message, # What to do $whattodo, # Redirect too? $redirect_too ? sprintf(' %s', &bgcolor(301), $redirect_too) : '', ); if ($#fragments >= 0) { printf("
Broken fragments:
    \n"); } } else { my $redirmsg = $redirect_loop ? ' redirect loop detected' : ''; printf( "\n%s\t%s\n Code: %s %s\n%s\n", # List of redirects $redirected ? join("\n-> ", @redirects_urls) . $redirmsg : $u, # List of lines $lines_list ? sprintf("\n%6s: %s", "Line$s", $lines_list) : '', # Response code chain join(' -> ', @http_codes), # HTTP message $currloc->{message} || '', # What to do wrap(' To do: ', ' ', $whattodo) ); if ($#fragments >= 0) { if ($currloc->{code} == 200) { print("The following fragments need to be fixed:\n"); } else { print("Fragments:\n"); } } } # Fragments for my $f (@fragments) { my @unique_lines = &sort_unique(keys %{$links->{$u}{fragments}{$f}}); my $plural = (scalar(@unique_lines) > 1) ? 's' : ''; my $unique_lines = join(', ', @unique_lines); if ($Opts{HTML}) { printf("
  • %s#%s (line%s %s)
  • \n", &encode($u), &encode($f), $plural, $unique_lines); } else { printf("\t%-30s\tLine%s: %s\n", $f, $plural, $unique_lines); } } print("
\n") if ($Opts{HTML} && scalar(@fragments)); } # End of the table print("
\n") if $Opts{HTML}; return; } sub code_shown ($$) { my ($u, $results) = @_; if ($results->{$u}{location}{record} == 200) { return $results->{$u}{location}{orig} || $results->{$u}{location}{record}; } else { return $results->{$u}{location}{record}; } } sub links_summary (\%\%\%\%) { # Advices to fix the problems my %todo = ( 200 => 'Some of the links to this resource point to broken URI fragments (such as index.html#fragment).', 300 => 'This often happens when a typo in the link gets corrected automatically by the server. For the sake of performance, the link should be fixed.', 301 => 'This is a permanent redirect. The link should be updated to point to the more recent URI.', 302 => 'This is a temporary redirect. Update the link if you believe it makes sense, or leave it as is.', 303 => 'This rare status code points to a "See Other" resource. There is generally nothing to be done.', 307 => 'This is a temporary redirect. Update the link if you believe it makes sense, or leave it as is.', 400 => 'This is usually the sign of a malformed URL that cannot be parsed by the server. Check the syntax of the link.', 401 => "The link is not public and the actual resource is only available behind authentication. If not already done, you could specify it.", 403 => 'The link is forbidden! This needs fixing. Usual suspects: a missing index.html or Overview.html, or a missing ACL.', 404 => 'The link is broken. Double-check that you have not made any typo, or mistake in copy-pasting. If the link points to a resource that no longer exists, you may want to remove or fix the link.', 405 => 'The server does not allow HTTP HEAD requests, which prevents the Link Checker to check the link automatically. Check the link manually.', 406 => "The server isn't capable of responding according to the Accept* headers sent. This is likely to be a server-side issue with negotiation.", 407 => 'The link is a proxy, but requires Authentication.', 408 => 'The request timed out.', 410 => 'The resource is gone. You should remove this link.', 415 => 'The media type is not supported.', 500 => 'This is a server side problem. Check the URI.', 501 => 'Could not check this link: method not implemented or scheme not supported.', 503 => 'The server cannot service the request, for some unknown reason.', # Non-HTTP codes: RC_ROBOTS_TXT() => sprintf( 'The link was not checked due to %srobots exclusion rules%s. Check the link manually, and see also the link checker %sdocumentation on robots exclusion%s.', $Opts{HTML} ? ( '', '', "", '' ) : ('') x 4 ), RC_DNS_ERROR() => 'The hostname could not be resolved. Check the link for typos.', RC_IP_DISALLOWED() => sprintf( 'The link resolved to a %snon-public IP address%s, and this link checker instance has been configured to not access such addresses. This may be a real error or just a quirk of the name resolver configuration on the server where the link checker runs. Check the link manually, in particular its hostname/IP address.', $Opts{HTML} ? 
('', '') : ('') x 2), RC_PROTOCOL_DISALLOWED() => 'Accessing links with this URI scheme has been disabled in link checker.', ); my %priority = ( 410 => 1, 404 => 2, 403 => 5, 200 => 10, 300 => 15, 401 => 20 ); my ($links, $results, $broken, $redirects) = @_; # List of the broken links my @urls = keys %{$broken}; my @dir_redirect_urls = (); if ($Opts{Redirects}) { # Add the redirected URI's to the report for my $l (keys %$redirects) { next unless (defined($results->{$l}) && defined($links->{$l}) && !defined($broken->{$l})); # Check whether we have a "directory redirect" # e.g. http://www.w3.org/TR -> http://www.w3.org/TR/ my ($redirect_loop, @redirects) = get_redirects($l, %$redirects); if ($#redirects == 1) { push(@dir_redirect_urls, $l); next; } push(@urls, $l); } } # Broken links and redirects if ($#urls < 0) { if (!$Opts{Quiet}) { print_doc_header(); if ($Opts{HTML}) { print "

Links

\n

Valid links!

\n"; } else { print "\nValid links.\n"; } } } else { print_doc_header(); print('

') if $Opts{HTML}; print("\nList of broken links and other issues"); #print(' and redirects') if $Opts{Redirects}; # Sort the URI's by HTTP Code my %code_summary; my @idx; for my $u (@urls) { if (defined($results->{$u}{location}{record})) { my $c = &code_shown($u, $results); $code_summary{$c}++; push(@idx, $c); } } my @sorted = @urls[ sort { defined($priority{$idx[$a]}) ? defined($priority{$idx[$b]}) ? $priority{$idx[$a]} <=> $priority{$idx[$b]} : -1 : defined($priority{$idx[$b]}) ? 1 : $idx[$a] <=> $idx[$b] } 0 .. $#idx ]; @urls = @sorted; undef(@sorted); undef(@idx); if ($Opts{HTML}) { # Print a summary print <<'EOF';

There are issues with the URLs listed below. The table summarizes the issues and suggested actions by HTTP response status code.

EOF for my $code (sort(keys(%code_summary))) { printf('', &bgcolor($code)); printf('', $doc_count, $code, http_rc($code)); printf('', $code_summary{$code}); printf('', $todo{$code}); print "\n"; } print "\n
Code Occurrences What to do
%s%s%s
\n"; } else { print(':'); } &show_link_report($links, $results, $broken, $redirects, \@urls, 1, \%todo); } # Show directory redirects if ($Opts{Dir_Redirects} && ($#dir_redirect_urls > -1)) { print_doc_header(); print('

') if $Opts{HTML}; print("\nList of redirects"); print( "

\n

The links below are not broken, but the document does not use the exact URL, and the links were redirected. It may be a good idea to link to the final location, for the sake of speed.

" ) if $Opts{HTML}; &show_link_report($links, $results, $broken, $redirects, \@dir_redirect_urls); } return; } ############################################################################### ################ # Global stats # ################ sub global_stats () { my $stop = &get_timestamp(); my $n_docs = ($doc_count <= $Opts{Max_Documents}) ? $doc_count : $Opts{Max_Documents}; return sprintf( 'Checked %d document%s in %s seconds.', $n_docs, ($n_docs == 1) ? '' : 's', &time_diff($timestamp, $stop) ); } ################## # HTML interface # ################## sub html_header ($$) { my ($uri, $cookie) = @_; my $title = defined($uri) ? $uri : ''; $title = ': ' . $title if ($title =~ /\S/); my $headers = ''; if (!$Opts{Command_Line}) { $headers .= "Cache-Control: no-cache\nPragma: no-cache\n" if $uri; $headers .= "Content-Type: text/html; charset=utf-8\n"; $headers .= "Set-Cookie: $cookie\n" if $cookie; # mod_perl 1.99_05 doesn't seem to like it if the "\n\n" isn't in the same # print() statement as the last header $headers .= "Content-Language: en\n\n"; } my $onload = $uri ? '' : ' onload="if(document.getElementById){document.getElementById(\'uri_1\').focus()}"'; print $headers, $DocType, " W3C Link Checker", &encode($title), " ", $Head, " '; &banner($title); return; } sub banner ($) { my $tagline = "Check links and anchors in Web pages or full Web sites"; printf( <<'EOF', URI->new_abs("../images/no_w3c.png", $Cfg{Doc_URI}), $tagline);
EOF return; } sub status_icon($) { my ($code) = @_; my $icon_type; my $r = HTTP::Response->new($code); if ($r->is_success()) { $icon_type = 'error' ; # if is success but reported, it's because of broken frags => error } elsif (&informational($code)) { $icon_type = 'info'; } elsif ($code == 300) { $icon_type = 'info'; } elsif ($code == 401) { $icon_type = 'error'; } elsif ($r->is_redirect()) { $icon_type = 'warning'; } elsif ($r->is_error()) { $icon_type = 'error'; } else { $icon_type = 'error'; } return sprintf('%s', URI->new_abs("../images/info_icons/$icon_type.png", $Cfg{Doc_URI}), $icon_type); } sub bgcolor ($) { my ($code) = @_; my $class; my $r = HTTP::Response->new($code); if ($r->is_success()) { return ''; } elsif ($code == RC_ROBOTS_TXT() || $code == RC_IP_DISALLOWED()) { $class = 'dubious'; } elsif ($code == 300) { $class = 'multiple'; } elsif ($code == 401) { $class = 'unauthorized'; } elsif ($r->is_redirect()) { $class = 'redirect'; } elsif ($r->is_error()) { $class = 'broken'; } else { $class = 'broken'; } return (' class="' . $class . '"'); } sub show_url ($) { my ($url) = @_; return sprintf('%s', (&encode($url)) x 2); } sub html_footer () { printf("

%s

\n", &global_stats()) if ($doc_count > 0 && !$Opts{Quiet}); if (!$doc_count) { print <<'EOF';

This Link Checker looks for issues in links, anchors and referenced objects in a Web page, CSS style sheet, or recursively on a whole Web site. For best results, it is recommended to first ensure that the documents checked use Valid (X)HTML Markup and CSS. The Link Checker is part of the W3C's validators and Quality Web tools.

EOF } printf(<<'EOF', $Cfg{Doc_URI}, $Cfg{Doc_URI}, $PACKAGE, $REVISION);
%s
%s
EOF return; } sub print_form (\%$$) { my ($params, $cookie, $check_num) = @_; # Split params on \0, see CGI's docs on Vars() while (my ($key, $value) = each(%$params)) { if ($value) { my @vals = split(/\0/, $value, 2); $params->{$key} = $vals[0]; } } # Override undefined values from the cookie, if we got one. my $valid_cookie = 0; if ($cookie) { my %cookie_values = $cookie->value(); if (!$cookie_values{clear}) { # XXX no easy way to check if cookie expired? $valid_cookie = 1; while (my ($key, $value) = each(%cookie_values)) { $params->{$key} = $value unless defined($params->{$key}); } } } my $chk = ' checked="checked"'; $params->{hide_type} = 'all' unless $params->{hide_type}; my $requested_uri = &encode($params->{uri} || ''); my $sum = $params->{summary} ? $chk : ''; my $red = $params->{hide_redirects} ? $chk : ''; my $all = ($params->{hide_type} ne 'dir') ? $chk : ''; my $dir = $all ? '' : $chk; my $acc = $params->{no_accept_language} ? $chk : ''; my $ref = $params->{no_referer} ? $chk : ''; my $rec = $params->{recursive} ? $chk : ''; my $dep = &encode($params->{depth} || ''); my $cookie_options = ''; if ($valid_cookie) { $cookie_options = " "; } else { $cookie_options = " "; } print "

More Options





,

", $cookie_options, "

"; return; } sub encode (@) { return $Opts{HTML} ? HTML::Entities::encode(@_) : @_; } sub hprintf (@) { print_doc_header(); if (!$Opts{HTML}) { printf(@_); } else { print HTML::Entities::encode(sprintf($_[0], @_[1 .. @_ - 1])); } return; } # Print the document header, if it hasn't been printed already. # This is invoked before most other output operations, in order # to enable quiet processing that doesn't clutter the output with # "Processing..." messages when nothing else will be reported. sub print_doc_header () { if (defined($doc_header)) { print $doc_header; undef($doc_header); } } # Local Variables: # mode: perl # indent-tabs-mode: nil # cperl-indent-level: 4 # cperl-continued-statement-offset: 4 # cperl-brace-offset: -4 # perl-indent-level: 4 # End: # ex: ts=4 sw=4 et W3C-LinkChecker-4.81/bin/checklink.pod0000644000000000000000000002203711543731267016144 0ustar rootroot=encoding utf8 =head1 NAME checklink - check the validity of links in an HTML or XHTML document =head1 SYNOPSIS B [ I ] I ... =head1 DESCRIPTION This manual page documents briefly the B command, a.k.a. the W3C® Link Checker. B is a program that reads an HTML or XHTML document, extracts a list of anchors and lists and checks that no anchor is defined twice and that all the links are dereferenceable, including the fragments. It warns about HTTP redirects, including directory redirects, and can check recursively a part of a web site. The program can be used either as a command line tool or as a CGI script. =head1 OPTIONS This program follow the usual GNU command line syntax, with long options starting with two dashes (`-'). A summary of options is included below. =over 5 =item B<-?, -h, --help> Show summary of options. =item B<-V, --version> Output version information. =item B<-s, --summary> Show result summary only. =item B<-b, --broken> Show only the broken links, not the redirects. =item B<-e, --directory> Hide directory redirects - e.g. L -> L. =item B<-r, --recursive> Check the documents linked from the first one. =item B<-D, --depth> I Check the documents linked from the first one to depth I (implies B<--recursive>). =item B<-l, --location> I Scope of the documents checked (implies B<--recursive>). Can be specified multiple times in order to specify multiple recursion bases. If the URI of a candidate document is downwards relative to any of the bases, it is considered to be within the scope. If not specified, the default is the base URI of the initial document, for example for L it would be L. =item B<-X, --exclude> I Do not check links whose full, canonical URIs match I. Note that this option limits recursion the same way as B<--exclude-docs> with the same regular expression would. =item B<--exclude-docs> I In recursive mode, do not check links in documents whose full, canonical URIs match I. This option may be specified multiple times. =item B<--suppress-redirect> IURI> Do not report a redirect from the first to the second URI. The "-E" is literal text. This option may be specified multiple times. Whitespace may be used instead of "-E" to separate the URIs. =item B<--suppress-redirect-prefix> IURI> Do not report a redirect from a child of the first URI to the same child of the second URI. The \"->\" is literal text. This option may be specified multiple times. Whitespace may be used instead of "-E" to separate the URIs. =item B<--suppress-temp-redirects> Do not report warnings about temporary redirects. =item B<--suppress-broken> I Do not report a broken link with the given CODE. 
=item B<--suppress-broken> I<CODE:URI>

Do not report a broken link with the given CODE. CODE is the HTTP response code, or -1 for robots exclusion. The ":" is literal text. This option may be specified multiple times. Whitespace may be used instead of ":" to separate the CODE and the URI.

=item B<--suppress-fragment> I<URI>

Do not report the given broken fragment URI. A fragment URI contains "#". This option may be specified multiple times.

=item B<-L, --languages> I<LANGS>

The C<Accept-Language> HTTP header to send. In command line mode, this header is not sent by default. The special value C<auto> causes a value to be detected from the C<LANG> environment variable, and sent if found. In CGI mode, the default is to send the value received from the client as is.

=item B<-c, --cookies> I<FILE>

Use cookies, load/save them in I<FILE>. The special value C<tmp> causes non-persistent use of cookies, i.e. they are used but only stored in memory for the duration of this link checker run.

=item B<-R, --no-referer>

Do not send the C<Referer> HTTP header.

=item B<-q, --quiet>

No output if no errors are found. Implies B<--summary>.

=item B<-v, --verbose>

Verbose mode.

=item B<-i, --indicator>

Show progress while parsing as percentage of lines processed. No indicator is shown for documents containing no linefeeds.

=item B<-u, --user> I<USERNAME>

Specify a username for authentication.

=item B<-p, --password> I<PASSWORD>

Specify a password for authentication.

=item B<--hide-same-realm>

Hide 401's that are in the same realm as the document checked.

=item B<-S, --sleep> I<SECS>

Sleep the specified number of seconds between requests to each server. Defaults to 1 second, which is also the minimum allowed.

=item B<-t, --timeout> I<SECS>

Timeout for requests, in seconds. The default is 30.

=item B<-C, --connection-cache> I<N>

Maximum number of cached connections. Using this option overrides the C<Connection_Cache_Size> configuration file parameter; see its documentation below for the default value and more information.

=item B<-d, --domain> I<DOMAIN>

Perl regular expression describing the domain to which the authentication information (if present) will be sent. The default value can be specified in the configuration file. See the C<Trusted> entry in the configuration file description below for more information.

=item B<--masquerade> I<"real-prefix surrogate-prefix">

Perform a simple string substitution: URIs which begin with the string C<real-prefix> are rewritten using the C<surrogate-prefix> before being dereferenced. Useful for making a local directory masquerade as a remote one. For example:

    --masquerade "http://example.com/x/y/z/ file:///my/local/dir/"

If the document being checked contains a link to http://example.com/x/y/z/foo.html, then the local file system will be checked for file:///my/local/dir/foo.html.

B<--masquerade> takes a single argument consisting of two URIs, separated by whitespace. The quote marks are not part of the argument, but one usual way of providing a value with embedded whitespace is to enclose it in quotes.

=item B<-H, --html>

HTML output.

=back
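As an illustration, the options above can be combined; a recursive check of one section of a site, skipping archive files, might look like this (hypothetical URIs and patterns):

    checklink --recursive --depth 2 \
        --location http://www.example.org/docs/ \
        --exclude '\.(zip|gz)$' \
        http://www.example.org/docs/Overview.html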
C<Allow_Private_IPs> is a boolean flag indicating whether checking links
on non-public IP addresses is allowed. The default is true in command line
mode and false when run as a CGI script. For example, to disallow checking
non-public IP addresses, regardless of the mode, use:

    Allow_Private_IPs = 0

C<Forbidden_Protocols> is a comma separated list of additional
protocols/URI schemes that the link checker is not allowed to use. The
C<javascript> and C<vbscript> schemes are always forbidden, and so is the
C<file> scheme when running as a CGI script.

    Forbidden_Protocols = javascript,mailto

C<Markup_Validator_URI> and C<CSS_Validator_URI> are formatted URIs to the
respective validators. The C<%s> in these will be replaced with the full
"URI encoded" URI to the document being checked, and shown in the link
checker results view in the online/CGI version. The defaults are:

    Markup_Validator_URI = http://validator.w3.org/check?uri=%s
    CSS_Validator_URI = http://jigsaw.w3.org/css-validator/validator?uri=%s

C<Doc_URI> is a URI used for linking to the documentation, and CSS and
JavaScript files in the dynamically generated content of the link checker.
The default is:

    Doc_URI = http://validator.w3.org/docs/checklink.html

C<Connection_Cache_Size> is an integer denoting the maximum number of
connections the link checker will keep open at any given time. The
default is:

    Connection_Cache_Size = 2

An example configuration file combining these directives is shown below,
after this list.

=back
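
For instance, the directives described above could be combined into a
configuration file along the following lines (an illustrative sketch: the
C<Trusted> value is a placeholder, and the remaining values repeat the
documented defaults and examples):

    # /etc/w3c/checklink.conf -- example only, adjust to taste
    Trusted               = \.example\.org$
    Allow_Private_IPs     = 0
    Forbidden_Protocols   = javascript,mailto
    Markup_Validator_URI  = http://validator.w3.org/check?uri=%s
    CSS_Validator_URI     = http://jigsaw.w3.org/css-validator/validator?uri=%s
    Doc_URI               = http://validator.w3.org/docs/checklink.html
    Connection_Cache_Size = 2
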
=head1 ENVIRONMENT

checklink uses the libwww-perl library, which has a number of environment
variables affecting its behaviour. See L<LWP> for some pointers.

=over 5

=item B<W3C_CHECKLINK_CFG>

If set, overrides the path to the configuration file.

=back

=head1 SEE ALSO

The documentation for this program is available on the web at
L<http://validator.w3.org/docs/checklink.html>.

L, L, L, L, L.

=head1 AUTHOR

This program was originally written by Hugo Haas E<lt>hugo@w3.orgE<gt>,
based on Renaud Bruyeron's F<checklink.pl>. It has been enhanced by Ville
Skyttä and many other volunteers since. Use the
E<lt>www-validator@w3.orgE<gt> mailing list for feedback, and see
L<http://validator.w3.org/checklink> for more information.

This manual page was originally written by Frédéric Schütz
E<lt>schutz@mathgen.chE<gt> for the Debian GNU/Linux system (but may be
used by others).

=head1 COPYRIGHT

This program is licensed under the W3C® Software License,
L<http://www.w3.org/Consortium/Legal/copyright-software>.

=cut
W3C-LinkChecker-4.81/Makefile.PL0000644000000000000000000000644411566533610014704 0ustar rootroot
use 5.008;
use ExtUtils::MakeMaker;

WriteMakefile(
    NAME         => 'W3C::LinkChecker',
    ABSTRACT     => 'W3C Link Checker',
    AUTHOR       => 'W3C QA-dev Team',
    LICENSE      => 'open_source',
    VERSION_FROM => 'bin/checklink',
    PREREQ_PM    => {

        # Hard dependencies:
        CSS::DOM            => 0.09,
        CSS::DOM::Constants => 0,
        CSS::DOM::Style     => 0,
        CSS::DOM::Util      => 0,
        Encode              => 0,
        HTML::Entities      => 0,
        HTML::Parser        => "3.40",
        HTTP::Headers::Util => 0,
        HTTP::Message       => 5.827,
        HTTP::Request       => 0,
        HTTP::Response      => "1.50",
        LWP::RobotUA        => 1.19,
        LWP::UserAgent      => 0,
        Net::HTTP::Methods  => 5.833,
        Time::HiRes         => 0,
        URI                 => 1.53,
        URI::Escape         => 0,

        # Optional, but required if using a config file:
        Config::General => 2.06,

        # Optional, but required if private IPs are disallowed:
        Net::hostent => 0,
        Net::IP      => 0,
        Socket       => 0,

        # Optional, but required in command line mode:
        File::Spec   => 0,
        Getopt::Long => 2.17,
        Text::Wrap   => 0,
        URI::file    => 0,

        # Optional, used for password input in command line mode:
        Term::ReadKey => 2.00,

        # Optional, used for guessing language in command line mode:
        Locale::Country  => 0,
        Locale::Language => 0,

        # Optional, used when decoding arguments in command line mode:
        Encode::Locale => 0,

        # Optional, but required in CGI mode:
        CGI         => 0,
        CGI::Carp   => 0,
        CGI::Cookie => 0,

        # Optional, required if using cookies:
        HTTP::Cookies => 0,

        # Required for the test suite:
        File::Spec => 0,
        Test::More => 0,
    },
    PM        => {'lib/W3C/LinkChecker.pm' => '$(INST_LIB)/W3C/LinkChecker.pm'},
    EXE_FILES => ['bin/checklink'],
    MAN1PODS  =>
        {'bin/checklink.pod' => '$(INST_MAN1DIR)/checklink.$(MAN1EXT)',},
    META_MERGE => {
        resources => {
            homepage    => 'http://validator.w3.org/checklink',
            bugtracker  => 'http://www.w3.org/Bugs/Public/',
            repository  => 'http://dvcs.w3.org/hg/link-checker/',
            MailingList => 'http://lists.w3.org/Archives/Public/www-validator/',
        },
    },
    depend => {distdir => 'lib/W3C/LinkChecker.pm'},
    dist   => {TARFLAGS => '--owner=0 --group=0 -cvf'},
    clean  => {FILES => 'Makefile.PL.bak bin/checklink.bak'},
);

sub MY::postamble {
    return <<'MAKE_FRAG';
lib/W3C/LinkChecker.pm: Makefile.PL bin/checklink
	$(MKPATH) lib/W3C
	$(ECHO) "# Dummy module for CPAN indexing purposes." > $@
	$(ECHO) "package $(NAME);" >> $@
	$(ECHO) "use strict;" >> $@
	$(ECHO) "use vars qw(\$$VERSION);" >> $@
	$(ECHO) "\$$VERSION = \"$(VERSION)\";" >> $@
	$(ECHO) "1;" >> $@

PERLTIDY = perltidy --profile=etc/perltidyrc --backup-and-modify-in-place

perltidy:
	@for file in Makefile.PL bin/checklink ; do \
	    echo "$(PERLTIDY) $$file" ; \
	    $(PERLTIDY) $$file ; \
	done

MAKE_FRAG
}
W3C-LinkChecker-4.81/README0000644000000000000000000000311111537353310013575 0ustar rootroot
W3C-LinkChecker
===============

This distribution contains the W3C Link Checker.

The link checker can be run as a CGI script in a web server as well as on
the command line. The CGI version provides an HTML interface as seen at
<http://validator.w3.org/checklink>.

To install the distribution for command line use:

    perl Makefile.PL
    make
    make test
    make install    # as root

To install the CGI version, in addition to the above, copy the
bin/checklink script into a location in your web server from where
execution of CGI scripts is allowed, and make sure that the web server
user has execute permissions to the script. The CGI directory is typically
named "cgi-bin" somewhere under your web server root directory.
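
Once installed, the link checker can be run directly from the command
line; for example, to check a site recursively and print only a result
summary (the URI is a placeholder; the options are described in
bin/checklink.pod):

    checklink --summary --recursive http://www.example.org/
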
For more information, please consult the POD documentation in the
checklink.pod file, typically (in the directory where you unpacked the
source):

    perldoc ./bin/checklink.pod

...as well as the HTML documentation in docs/checklink.html.

COPYRIGHT AND LICENCE

Written by the following people for the W3C:

  - Hugo Haas
  - Ville Skyttä
  - W3C QA-dev Team

Copyright (C) 1994-2011 World Wide Web Consortium, (Massachusetts
Institute of Technology, European Research Consortium for Informatics and
Mathematics, Keio University). All Rights Reserved.

This work is distributed under the W3C(R) Software License [1] in the hope
that it will be useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[1] http://www.w3.org/Consortium/Legal/copyright-software
W3C-LinkChecker-4.81/MANIFEST0000644000000000000000000000154011542177677014072 0ustar rootroot
Makefile.PL
MANIFEST
META.yml
NEWS                   Overview of changes between releases
README                 Start by reading this
SIGNATURE
bin/checklink          The link checker
bin/checklink.pod      Manual page for the link checker
etc/checklink.conf     Optional configuration file
etc/perltidyrc         perltidy(1) profile
docs/checklink.html    Additional documentation
docs/linkchecker.css   Cascading style sheet used in docs and generated HTML
docs/linkchecker.js    JavaScript used in the generated HTML
images/double.png
images/grad.png
images/head-bl.png
images/head-br.png
images/no_w3c.png
images/round-br.png
images/round-tr.png
images/textbg.png
images/info_icons/README
images/info_icons/error.png
images/info_icons/info.png
images/info_icons/warning.png
lib/W3C/LinkChecker.pm    Dummy *.pm for CPAN indexing purposes
t/00compile.t