rawdog-2.21/config

# Sample rawdog config file. Copy this into your ~/.rawdog/ directory, and edit
# it to suit your preferences.

# All paths in this file should be either absolute, or relative to your .rawdog
# directory.

# If you want to include another config file, then use "include FILENAME".

# Times in this file are specified as a value and a unit (for instance,
# "4h"). Units available are "s" (seconds), "m" (minutes), "h" (hours),
# "d" (days) and "w" (weeks). If no unit is specified, rawdog will
# assume minutes.

# Boolean (yes/no) values in this file are specified as "true" or "false".

# rawdog can be extended using plugin modules written in Python. This
# option specifies the directories to search for plugins to load. If a
# directory does not exist or cannot be read, it will be ignored. This
# option must appear before any options that are implemented by plugins.
plugindirs plugins

# Whether to split rawdog's state amongst multiple files.
# If this is turned on, rawdog will use significantly less memory, but
# will do more disk IO -- probably a good idea if you read a lot of
# feeds.
splitstate false

# The maximum number of articles to show on the generated page.
# Set this to 0 for no limit.
maxarticles 200

# The maximum age of articles to show on the generated page.
# Set this to 0 for no limit.
maxage 0

# The age after which articles will be discarded if they do not appear
# in a feed. Set this to a larger value if you want your rawdog output
# to cover more than a day's worth of articles.
expireage 1d

# The minimum number of articles from each feed to keep around in the history.
# Set this to 0 to only keep articles that were returned the last time the feed
# was fetched.
# (If this is set to 0, or "currentonly" below is set to true,
# then rawdog will not send the RFC3229+feed "A-IM: feed" header when making
# HTTP requests, since it can't tell from the response to such a request
# whether any articles have been removed from the feed; this makes rawdog
# slightly less bandwidth-efficient.)
keepmin 20

# Whether to only display articles that are currently included in a feed
# (useful for "planet" pages where you only want to display the current
# articles from several feeds). If this is false, rawdog will keep a
# history of older articles.
currentonly false

# Whether to divide the articles up by day, writing a "dayformat" heading
# before each set.
daysections true

# The format to write day headings in. See "man strftime" for more
# information; for example:
# %A, %d %B        Wednesday, 21 January
# %Y-%m-%d         2004-01-21 (ISO 8601 format)
dayformat %A, %d %B

# Whether to divide the articles up by time, writing a "timeformat" heading
# before each set.
timesections true

# The format to write time headings in. For example:
# %H:%M            18:07 (ISO 8601 format)
# %I:%M %p         06:07 PM
timeformat %H:%M

# The format to display feed update and article times in. For example:
# %H:%M, %A, %d %B     18:07, Wednesday, 21 January
# %Y-%m-%d %H:%M       2004-01-21 18:07 (ISO 8601 format)
datetimeformat %H:%M, %A, %d %B

# The page template file to use, or "default" to use the built-in template
# (which is probably sufficient for most users). Use "rawdog -s page" to show
# the template currently in use as a starting-point for customisation.
# The following strings will be replaced in the output:
# __version__          The rawdog version in use
# __refresh__          The HTML 4 <meta http-equiv="Refresh" ...> header
# __items__            The aggregated items
# __num_items__        The number of items on the page
# __feeds__            The feed list
# __num_feeds__        The number of feeds listed
# You can define additional strings using "define" in this config file; for
# example, if you say "define myname Adam Sampson", then "__myname__" will be
# replaced by "Adam Sampson" in the output.
pagetemplate default

# Similarly, the template used for each item shown. Use "rawdog -s item" to
# show the template currently in use as a starting-point for customisation.
# The following strings will be replaced in the output:
# __title__            The item title (as an HTML link, if possible)
# __title_no_link__    The item title (as text)
# __url__              The item's URL, or the empty string if it doesn't
#                      have one
# __guid__             The item's GUID, or the empty string if it doesn't
#                      have one
# __description__      The item's descriptive text, or the empty string
#                      if it doesn't have a description
# __date__             The item's date as provided by the feed
# __added__            The date the article was received by rawdog
# __hash__             A hash of the article (useful for summary pages)
#
# All of the __feed_X__ strings from feeditemtemplate below will also be
# expanded here, for the feed that the article came from.
#
# You can define additional strings on a per-feed basis by using the
# "define_X" feed option; see the description of "feed" below for more
# details.
#
# Simple conditional expansion is possible by saying something like
# "__if_items__ hello __endif__"; the text between the if and endif will
# only be included if __items__ would expand to something other than
# the empty string. Ifs can be nested, and __else__ is supported.
# (This also works for the other templates, but it's most useful here.)
itemtemplate default

# The template used to generate the feed list (__feeds__ above). Use "rawdog
# -s feedlist" to show the current template.
# The following strings will be replaced in the output:
# __feeditems__        The feed items
feedlisttemplate default

# The template used to generate each item in the feed list. Use "rawdog
# -s feeditem" to show the current template.
# The following strings will be replaced in the output:
# __feed_id__          The feed's title with non-alphanumeric characters
#                      (and HTML markup) removed (useful for per-feed
#                      styles); you can use the "id" feed option below to
#                      set a custom ID if you prefer
# __feed_hash__        A hash of the feed URL (useful for per-feed styles)
# __feed_title__       The feed title (as an HTML link, if possible)
# __feed_title_no_link__
#                      The feed title (as text)
# __feed_url__         The feed URL
# __feed_icon__        An "XML button" linking to the feed URL
# __feed_last_update__
#                      The time when the feed was last updated
# __feed_next_update__
#                      The time when the feed will next need updating
feeditemtemplate default

# Where to write the output HTML to. You should place style.css in the same
# directory. Specify this as "-" to write the HTML to stdout.
# (You will probably want to make this an absolute path, else rawdog will write
# to a file in your ~/.rawdog directory.)
outputfile output.html
#outputfile /home/you/public_html/rawdog.html

# Whether to use a <meta http-equiv="Refresh" ...> tag in the generated
# HTML to indicate that the page should be refreshed automatically. If
# this is turned on, then the page will refresh every N minutes, where N
# is the shortest feed period value specified below.
# (This works by controlling whether the default template includes
# __refresh__; if you use a custom template, __refresh__ is always
# available.)
userefresh true

# Whether to show the list of active feeds in the generated HTML.
# (This works by controlling whether the default template includes
# __feeds__; if you use a custom template, __feeds__ is always
# available.)
showfeeds true

# The number of concurrent threads that rawdog will use when fetching
# feeds -- i.e. the number of feeds that rawdog will attempt to fetch at
# the same time. If you have a lot of feeds, setting this to be 20 or
# so will significantly speed up updates. If this is set to 1 (or
# fewer), rawdog will not start any additional threads at all.
numthreads 1

# The time that rawdog will wait before considering a feed unreachable
# when trying to connect. If you're getting lots of timeout errors and
# are on a slow connection, increase this.
# (Unlike other times in this file, this will be assumed to be in
# seconds if no unit is specified.)
timeout 30s

# Whether to ignore timeouts. If this is false, timeouts will be reported as
# errors; if this is true, rawdog will silently ignore them.
ignoretimeouts false

# Whether to show Python traceback messages. If this is true, rawdog will show
# a traceback message if an exception is thrown while fetching a feed; this is
# mostly useful for debugging rawdog or feedparser.
showtracebacks false

# Whether to display verbose status messages saying what rawdog's doing
# while it runs. Specifying -v or --verbose on the command line is
# equivalent to saying "verbose true" here.
verbose false

# Whether to attempt to fix bits of HTML that should start with a
# block-level element (such as article descriptions) by prepending "<p>"
# if they don't already start with a block-level element.
blocklevelhtml true

# Whether to attempt to turn feed-provided HTML into valid HTML.
# The most common problem that this solves is a non-closed element in an
# article causing formatting problems for the rest of the page.
# For this option to have any effect, you need to have PyTidyLib or mx.Tidy
# installed.
tidyhtml true

# Whether the articles displayed should be sorted first by the date
# provided in the feed (useful for "planet" pages, where you're
# displaying several feeds and want new articles to appear in the right
# chronological place). If this is false, then articles will first be
# sorted by the time that rawdog first saw them.
sortbyfeeddate false

# Whether to consider articles' unique IDs or GUIDs when updating rawdog's
# database. If you turn this off, then rawdog will create a new article in its
# database when it sees an updated version of an existing article in a feed.
# You probably want this turned on.
useids true

# The fields to use when detecting duplicate articles: "id" is the article's
# unique ID or GUID; "link" is the article's link. rawdog will find the first
# one of these that's present in the article, and ignore the article if it's
# seen an article before (in any feed) that had the same value. For example,
# specifying "hideduplicates id link" will first look for id/guid, then for
# link.
# Note that some feeds use the same link for all their articles; if you specify
# "link" here, you will probably want to specify the "allowduplicates" feed
# argument (see below) for those feeds.
hideduplicates id

# The period to use for new feeds added to the config file via the -a|--add
# option.
newfeedperiod 3h

# Whether rawdog should automatically update this config file (and its
# internal state) if feed URLs change (for instance, if a feed URL
# results in a permanent HTTP redirect). If this is false, then rawdog
# will ask you to make the necessary change by hand.
changeconfig true

# The feeds you want to watch, in the format "feed period url [args]".
# The period is the minimum time between updates; if less than period
# minutes have passed, "rawdog update" will skip that feed. Specifying
# a period less than 30 minutes is considered to be bad manners; it is
# suggested that you make the period as long as possible.
# Arguments are optional, and can be given in two ways: either on the end of
# the "feed" line in the form "key=value", separated by spaces, or as extra
# indented lines after the feed line.
# Possible arguments are:
# id                  Value for the __feed_id__ value in the item
#                     template for items in this feed (defaults to the
#                     feed title with non-alphanumeric characters and
#                     HTML markup removed)
# user                User for HTTP basic authentication
# password            Password for HTTP basic authentication
# format              "text" to indicate that the descriptions in this feed
#                     are unescaped plain text (rather than the usual HTML),
#                     and should be escaped and wrapped in a <pre> element
# X_proxy             Proxy URL for protocol X (for instance, "http_proxy")
# proxyuser           User for proxy basic authentication
# proxypassword       Password for proxy basic authentication
# allowduplicates     "true" to disable duplicate detection for this feed
# maxage              Override the global "maxage" value for this feed
# keepmin             Override the global "keepmin" value for this feed
# define_X            Equivalent to "define X ..." for item templates
#                     when displaying items from this feed
# You can provide a default set of arguments for all feeds using
# "feeddefaults". You can specify as many feeds as you like.
# (These examples have been commented out; remove the leading "#" on each line
# to use them.)
#feeddefaults
#	http_proxy http://proxy.example.com:3128/
#feed 1h http://example.com/feed.rss
#feed 30m http://example.com/feed2.rss id=newsfront
#feed 3h http://example.com/feed3.rss keepmin=5
#feed 3h http://example.com/secret.rss user=bob password=secret
#feed 3h http://example.com/broken.rss
#	format text
#	define_myclass broken
#feed 3h http://proxyfeed.example.com/proxied.rss http_proxy=http://localhost:1234/
#feed 3h http://dupsfeed.example.com/duplicated.rss allowduplicates=true
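
# The "define_myclass broken" example above becomes useful together with a
# custom item template. Below is a hypothetical sketch of such a template --
# the file name "item" and the exact markup are assumptions, not the shipped
# default; the CSS class names come from the bundled style.css, and
# __if_description__ follows the __if_X__ conditional convention described
# earlier. Select it with "itemtemplate item" in this file.

```html
<!-- ~/.rawdog/item: hypothetical custom item template.            -->
<!-- __myclass__ expands to whatever define_myclass set per feed,  -->
<!-- so the "broken" feed's items get class="item broken".         -->
<div class="item __myclass__">
<p class="itemheader"><span class="itemtitle">__title__</span>
<span class="itemfrom">(__feed_title__)</span></p>
__if_description__<div class="itemdescription">__description__</div>__endif__
</div>
```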

rawdog-2.21/rawdog

#!/usr/bin/env python
# rawdog: RSS aggregator without delusions of grandeur.
# Copyright 2003, 2004, 2005, 2006 Adam Sampson <ats@offog.org>
#
# rawdog is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2 of the License, or (at
# your option) any later version.
#
# rawdog is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with rawdog; see the file COPYING. If not, write to the Free
# Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA, or see http://www.gnu.org/.

from rawdoglib.rawdog import main
import sys, os

def launch():
	sys.exit(main(sys.argv[1:]))

if __name__ == "__main__":
	if os.getenv("RAWDOG_PROFILE") is not None:
		import profile
		profile.run("launch()")
	else:
		launch()

rawdog-2.21/style.css

/* Default stylesheet for rawdog. Customise this as you like.
   Adam Sampson <ats@offog.org> */
.xmlbutton {
	/* From Dylan Greene's suggestion:
           http://www.dylangreene.com/blog.asp?blogID=91 */
	border: 1px solid;
	border-color: #FC9 #630 #330 #F96;
	padding: 0 3px;
	font: bold 10px sans-serif;
	color: #FFF;
	background: #F60;
	text-decoration: none;
	margin: 0;
}
/* Scale down large images in feeds */
img {
	max-width: 100%;
	height: auto;
}
html {
	margin: 0;
	padding: 0;
}
body {
	color: black;
	background-color: white;
	margin: 0;
	padding: 10px;
	font-size: medium;
}
#header {
	background-color: #ffe;
	border: 1px solid gray;
	padding: 10px;
	margin-bottom: 20px;
}
h1 {
	font-weight: bold;
	font-size: xx-large;
	text-align: left;
	margin: 0;
	padding: 0;
}
#items {
}
.day {
	clear: both;
}
h2 {
	font-weight: bold;
	font-size: x-large;
	text-align: left;
	margin: 10px 0;
	padding: 0;
}
.time {
	clear: both;
}
h3 {
	font-weight: bold;
	font-size: large;
	text-align: left;
	margin: 10px 0;
	padding: 0;
}
.item {
	margin: 20px 30px;
	border: 1px solid gray;
	clear: both;
}
.itemheader {
	padding: 6px;
	margin: 0;
	background-color: #eee;
}
.itemtitle {
	font-weight: bold;
}
.itemfrom {
	font-style: italic;
}
.itemdescription {
	border-top: 1px solid gray;
	margin: 0;
	padding: 6px;
}
#feedstatsheader {
}
#feedstats {
}
#feeds {
	margin: 10px 0;
	border: 1px solid gray;
	border-spacing: 0;
}
#feedsheader TH {
	background-color: #eee;
	border-bottom: 1px solid gray;
	padding: 5px;
	margin: 0;
}
.feedsrow TD {
	padding: 5px 10px;
	margin: 0;
}
#footer {
	background-color: #ffe;
	border: 1px solid gray;
	margin-top: 20px;
	padding: 10px;
}
#aboutrawdog {
}
rawdog-2.21/MANIFEST.in

include COPYING
include MANIFEST.in
include NEWS
include PLUGINS
include README
include config
include rawdog
include rawdog.1
include style.css
include test-rawdog
include testserver.py
recursive-include rawdoglib *.py
rawdog-2.21/PKG-INFO0000644000471500047150000000105312552556407013440 0ustar  atsats00000000000000Metadata-Version: 1.1
Name: rawdog
Version: 2.21
Summary: RSS Aggregator Without Delusions Of Grandeur
Home-page: http://offog.org/code/rawdog/
Author: Adam Sampson
Author-email: ats@offog.org
License: UNKNOWN
Description: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: License :: OSI Approved :: GNU General Public License v2 or later (GPLv2+)
Classifier: Operating System :: POSIX
Classifier: Programming Language :: Python :: 2
Classifier: Topic :: Internet :: WWW/HTTP
rawdog-2.21/README

rawdog: RSS Aggregator Without Delusions Of Grandeur
Adam Sampson <ats@offog.org>

rawdog is a feed aggregator, capable of producing a personal "river of
news" or a public "planet" page. It supports all common feed formats,
including all versions of RSS and Atom. By default, it is run from cron,
collects articles from a number of feeds, and generates a static HTML
page listing the newest articles in date order. It supports per-feed
customizable update times, and uses ETags, Last-Modified, gzip
compression, and RFC3229+feed to minimize network bandwidth usage. Its
behaviour is highly customisable using plugins written in Python.

rawdog has the following dependencies:

- Python 2.6 or later (but not Python 3)
- feedparser 5.1.2 or later
- PyTidyLib 0.2.1 or later (optional but strongly recommended)

To install rawdog on your system, use distutils -- "python setup.py
install". This will install the "rawdog" command and the "rawdoglib"
Python module that it uses internally. (If you want to install to a
non-standard prefix, read the help provided by "python setup.py install
--help".)

rawdog needs a config file to function. Make the directory ".rawdog" in
your $HOME directory, copy the provided file "config" into that
directory, and edit it to suit your preferences. Comments in that file
describe what each of the options does.

You should copy the provided file "style.css" into the same directory
that you've told rawdog to write its HTML output to. rawdog should be
usable from a browser that doesn't support CSS, but it won't be very
pretty.

When you invoke rawdog from the command line, you give it a series of
actions to perform -- for instance, "rawdog --update --write" tells it
to do the "--update" action (downloading articles from feeds), then the
"--write" action (writing the latest articles it knows about to the HTML
file).

For details of all rawdog's actions and command-line options, see the
rawdog(1) man page -- "man rawdog" after installation.

You will want to run "rawdog -uw" periodically to fetch data and write
the output file. The easiest way to do this is to add a crontab entry
that looks something like this:

0,10,20,30,40,50 * * * *        /path/to/rawdog -uw

(If you don't know how to use cron, then "man crontab" is probably a good
start.) This will run rawdog every ten minutes.

If you want rawdog to fetch URLs through a proxy server, then set your
"http_proxy" environment variable appropriately; depending on your
version of cron, putting something like:

http_proxy=http://myproxy.mycompany.com:3128/

at the top of your crontab should be appropriate. (The http_proxy
variable will work for many other programs too.)

In the event that rawdog gets horribly confused (for instance, if your
system clock has a huge jump and it thinks it won't need to fetch
anything for the next thirty years), you can forcibly clear its state by
removing the ~/.rawdog/state file (and the ~/.rawdog/feeds/*.state
files, if you've got the "splitstate" option turned on).
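
The recovery step described above amounts to a single command. The paths
assume the default ~/.rawdog layout; the feeds/*.state files only exist
if you have "splitstate" turned on:

```shell
# Forcibly clear rawdog's saved state so the next "rawdog -u" starts fresh.
# The -f flag makes this safe to run even if the files don't exist.
rm -f ~/.rawdog/state ~/.rawdog/feeds/*.state
```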

If you don't like the appearance of rawdog, then customise the style.css
file. If you come up with one that looks much better than the existing
one, please send it to me!

This should, hopefully, be all you need to know. If rawdog breaks in
interesting ways, please tell me at the email address at the top of this
file.

rawdog-2.21/test-rawdog

#!/bin/sh
# test-rawdog: run some basic tests to make sure rawdog's working.
# Copyright 2013, 2014, 2015 Adam Sampson <ats@offog.org>
#
# rawdog is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2 of the License, or (at
# your option) any later version.
#
# rawdog is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with rawdog; see the file COPYING. If not, write to the Free
# Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA, or see http://www.gnu.org/.

# Default to the C locale, to avoid localised error messages.
default_LC_ALL="C"

# Try to find generic UTF-8 and Japanese UTF-8 locales. (They may not be
# installed.)
utf8_LC_ALL="$(locale -a | grep -a -i 'utf-\?8' | head -1)"
ja_LC_ALL="$(locale -a | grep -a -i 'ja_JP.utf-\?8' | head -1)"

# Default to UTC so that local times are reported consistently.
default_TZ="UTC"

statedir="testauto"

# Hostname and ports to run the test server on.
serverhost="localhost"
timeoutport="8431"
httpport="8432"

# Connections to this host should time out.
# (This is distinct from timeoutport above: if you connect to timeoutport, it
# will accept the connection but not do anything, whereas this will timeout
# while connecting.)
timeouthost=""

httpdir="$statedir/pub"
httpurl="http://$serverhost:$httpport"

usage () {
	cat <<EOF
Usage: test-rawdog [-b] [-k] [-r RAWDOG] [-T HOST]
  -b         also run tests for known-bad behaviour
  -k         keep going after a test fails
  -r RAWDOG  the rawdog executable to test (default: ./rawdog)
  -T HOST    a host to which connections will time out
Report bugs to <ats@offog.org>.
EOF
	exit 1
}

knownbad=false
keepgoing=false
rawdog="./rawdog"
while getopts bkr:T: OPT; do
	case "$OPT" in
	b)
		knownbad=true
		;;
	k)
		keepgoing=true
		;;
	r)
		rawdog="$OPTARG"
		;;
	T)
		timeouthost="$OPTARG"
		;;
	?)
		usage
		;;
	esac
done

# Start the server, and kill it when this script exits.
serverpid=""
trap 'test -n "$serverpid" && kill $serverpid' 0
python testserver.py "$serverhost" "$timeoutport" "$httpport" "$httpdir" &
serverpid="$!"

exitcode=0
die () {
	echo "Test failed:" "$@"
	exitcode=1
	if ! $keepgoing; then
		exit $exitcode
	fi
}

cleanstate () {
	rm -fr $statedir $httpdir
	mkdir -p $statedir $statedir/plugins $httpdir
	cp config $statedir/config

	export LC_ALL="$default_LC_ALL"
	export TZ="$default_TZ"
}

add () {
	echo "$1" >>$statedir/config
}

begin () {
	echo ">>> Testing $1"
	cleanstate
	add "showtracebacks true"
	cmdnum=0
}

equals () {
	if [ "$1" != "$2" ]; then
		die "expected '$1'; got '$2'"
	fi
}

exists () {
	local fn
	for fn in "$@"; do
		if ! [ -e "$fn" ]; then
			die "expected $fn to exist"
		fi
	done
}

not_exists () {
	local fn
	for fn in "$@"; do
		if [ -e "$fn" ]; then
			die "expected $fn not to exist"
		fi
	done
}

same () {
	exists "$1" "$2"
	if ! cmp "$1" "$2"; then
		die "expected $1 to have the same contents as $2"
	fi
}

contains () {
	local key
	local file="$1"
	exists "$file"
	shift
	for key in "$@"; do
		if ! grep -q "$key" "$file"; then
			cat "$file"
			die "expected $file to contain '$key'"
		fi
	done
}

not_contains () {
	local key
	local file="$1"
	exists "$file"
	shift
	for key in "$@"; do
		if grep -q "$key" "$file"; then
			cat "$file"
			die "expected $file not to contain '$key'"
		fi
	done
}

# Run rawdog.
runf () {
	cmdnum=$(expr $cmdnum + 1)
	outfile=$statedir/out$cmdnum
	$rawdog -d $statedir -V log$cmdnum "$@" >$outfile 2>&1
}

# Run rawdog, expecting it to exit 0.
run () {
	if ! runf "$@"; then
		cat $outfile
		die "exited non-0"
	fi
}

# Run rawdog, expecting it to crash with an exception message.
runcrash () {
	if runf "$@"; then
		cat $outfile
		die "exited 0"
	fi

	contains $outfile "Traceback (most recent call last)"
}

# Run rawdog, expecting it to exit non-0 (but not crash).
runn () {
	if runf "$@"; then
		cat $outfile
		die "exited 0"
	fi

	not_contains $outfile "Traceback (most recent call last)"
}

# Run rawdog, expecting no complaints.
runs () {
	run "$@"
	if [ -s $outfile ]; then
		cat $outfile
		die "expected no output"
	fi
}

# Run rawdog, expecting a complaint containing the first arg.
rune () {
	local key="$1"
	shift
	run "$@"
	contains $outfile "$key"
}

# Run rawdog, expecting it to exit non-0 with a complaint containing the first
# arg.
runne () {
	local key="$1"
	shift
	runn "$@"
	contains $outfile "$key"
}

make_text () {
	cat >"$1" <<EOF
This is manifestly not a feed.
EOF
}

make_html () {
	cat >"$1" <<EOF
<html>
<head>
  <title>Not a feed</title>
</head>
<body>
<p>This is manifestly not a feed.</p>
</body>
</html>
EOF
}

make_html_head () {
	cat >"$1" <<EOF
<html>
<head>
  <title>Not a feed</title>
EOF
	cat >>"$1"
	cat >>"$1" <<EOF
</head>
<body>
<p>This is manifestly not a feed.</p>
</body>
</html>
EOF
}

make_html_body () {
	cat >"$1" <<EOF
<html>
<head>
  <title>Not a feed</title>
</head>
<body>
<p>This is manifestly not a feed.</p>
EOF
	cat >>"$1"
	cat >>"$1" <<EOF
</body>
</html>
EOF
}

make_rss10 () {
	cat >"$1" <<EOF
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://example.org/">
<title>example-feed-title</title>
<link>http://example.org/</link>
<description>example-feed-description</description>
</channel>
<item rdf:about="http://example.org/item">
<title>example-item-title</title>
<link>http://example.org/item</link>
<description>example-item-description</description>
</item>
</rdf:RDF>
EOF
}

make_rss20 () {
	cat >"$1" <<EOF
<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>example-feed-title</title>
<link>http://example.org/</link>
<description>example-feed-description</description>
<item>
<title>example-item-title</title>
<link>http://example.org/item</link>
<description><![CDATA[example-item-description

]]></description>
</item>
</channel>
</rss>
EOF } make_rss20_desc () { cat >"$1" < example-feed-title http://example.org/ example-feed-description example-item-title http://example.org/item EOF cat >>"$1" cat >>"$1" < EOF } write_desc () { make_rss20_desc $httpdir/feed.rss add "feed 0 $httpurl/feed.rss" runs -uw } make_atom10 () { cat >"$1" < example-feed-title 2013-01-01T18:00:00Z example-feed-author http://example.org/feed-id example-item-title http://example.org/item-id 2013-01-01T18:00:00Z example-item-description EOF } make_atom10_with () { cat >"$1" < example-feed-title 2013-01-01T18:00:00Z example-feed-author http://example.org/feed-id example-item-title http://example.org/item-id 2013-01-01T18:00:00Z EOF cat >>"$1" cat >>"$1" < EOF } make_single () { cat >"$1" < example-feed-title 2013-01-01T18:00:00Z example-feed-author http://example.org/feed-id $2-title $4 2013-01-01T18:00:00Z $2-description EOF } make_range () { local i local from="$1" local to="$2" local file="$3" cat >"$file" < example-feed-title http://example.org/ example-feed-description EOF for i in $(seq $from $to); do cat >>"$file" < range-title-$i- http://example.org/item$i range-description-$i

]]>
EOF done cat >>"$file" < EOF } make_n () { make_range 1 "$@" } # Make time.time() return a fixed value. fake_time () { # A test can use this more than once within the same second, so the # .pyc's timestamp might not change. Ensure it gets deleted. rm -f $statedir/plugins/fake_time.* cat >$statedir/plugins/fake_time.py <$statedir/config.inc run -v -c config.inc -u contains $outfile "Starting update" begin "listing feeds" make_rss20 $httpdir/0.rss make_rss20 $httpdir/1.rss add "feed 0 $httpurl/0.rss" add "feed 0 $httpurl/1.rss" run -l contains $outfile $httpurl/0.rss $httpurl/1.rss runs -u run -l contains $outfile "Title: example-feed-title" begin "updating one feed" make_rss20 $httpdir/feed.rss add "feed 0 $httpurl/feed.rss" runs -u runs -f $httpurl/feed.rss begin "updating nonexistant feed" rune "No such feed" -f $httpurl/feed.rss begin "bad config syntax" add "foo" runne "Bad line in config" begin "config error is fatal" add "foo" cat >$statedir/plugins/crash.py <$statedir/extra.conf runne "Bad line in config" --config extra.conf begin "config error in include" echo "foo" >$statedir/extra.conf add "include extra.conf" runne "Bad line in config" begin "bad config directive" add "foo bar" runne "Unknown config command" begin "bad boolean value in config" add "sortbyfeeddate aubergine" runne "Bad value" begin "bad time value in config" add "timeout aubergine" runne "Bad value" begin "bad integer value in config" add "maxarticles aubergine" runne "Bad value" begin "bad inline feed argument" add "feed 0 $httpurl/feed.rss aubergine" runne "Bad feed argument" begin "bad feed argument line" add "feed 0 $httpurl/feed.rss" add " aubergine" runne "Bad argument line" begin "feed argument line with no feed" : >$statedir/config add " allowduplicates true" runne "First line in config cannot be an argument" begin "feeddefaults on one line" add "feeddefaults allowduplicates=true" runs begin "feeddefaults argument lines" add "feeddefaults" add " allowduplicates true" runs begin 
"argument lines in the wrong place" add "tidyhtml false" add " allowduplicates true" runne "Bad argument lines" begin "feed with no time" add "feed" runne "Bad line in config" begin "feed with no URL" add "feed 3h" runne "Bad line in config" begin "define with no name" add "define" runne "Bad line in config" begin "define with no value" add "define thing" runne "Bad line in config" begin "define" add "define myvar This is my variable!" echo "myvar(__myvar__)" >$statedir/page add "pagetemplate page" runs -uw contains $statedir/output.html "myvar(This is my variable!)" begin "missing config file" rm $statedir/config runne "Can't read config file" -u begin "empty config file" : >$statedir/config runs -uw begin "--config and include" make_rss20 $httpdir/feed.rss add "feed 0 $httpurl/feed.rss" runs -uw exists $statedir/output.html rm $statedir/output.html echo "outputfile second.html" >$statedir/config.inc runs -c config.inc -w exists $statedir/second.html not_exists $statedir/output.html rm $statedir/second.html add "include config.inc" runs -w exists $statedir/second.html not_exists $statedir/output.html rm $statedir/second.html begin "missing state dir" runn -d aubergine contains $outfile "No aubergine directory" begin "corrupt state file" echo this is not a valid state file >$statedir/state runne "means the file is corrupt" -u begin "empty state file" touch $statedir/state runne "means the file is corrupt" -u begin "corrupt splitstate file" make_rss20 $statedir/simple.rss add "splitstate true" add "feed 0 simple.rss" runs -u echo this is not a valid state file >$(echo $statedir/feeds/*.state) runne "means the file is corrupt" -u for run in first second feed-adding; do for state in false true; do begin "recover from crash on $run run, splitstate $state" make_rss20 $statedir/0.rss add "splitstate $state" add "feed 0 0.rss" if [ "$run" != first ]; then runs -u fi if [ "$run" = feed-adding ]; then make_rss20 $statedir/1.rss add "feed 0 1.rss" fi # Crash while updating, 
so we have both state files open. cat >$statedir/plugins/crash.py <$statedir/plugins/crash.py <$statedir/plugins/nolock.py <$statedir/lock.py <$statedir/plugins/junk.txt cat >$statedir/plugins/.hidden.py <$statedir/plugins/a.py <$statedir/plugins/b.py <$httpdir/empty.xml < example-feed-title http://example.org/ example-feed-description EOF add "feed 0 $httpurl/empty.xml" runs -u begin "HTTP 404" add "feed 0 $httpurl/notthere" rune "404" -u for proto in http https ftp; do if [ -n "$timeouthost" ]; then begin "$proto: connect timeout" add "timeout 1s" add "feed 0 $proto://$timeouthost/feed.xml" rune "Timeout while reading" -u fi begin "$proto: response timeout" add "timeout 1s" add "feed 0 $proto://$serverhost:$timeoutport/feed.xml" rune "Timeout while reading" -u done begin "ignoretimeouts true" add "timeout 1s" add "ignoretimeouts true" add "feed 0 http://$serverhost:$timeoutport/feed.xml" runs -u begin "0 period" make_rss20 $httpdir/simple.rss add "feed 0 $httpurl/simple.rss" runs -u rm $httpdir/simple.rss rune "404" -u begin "1h period" make_rss20 $httpdir/simple.rss add "feed 1h $httpurl/simple.rss" runs -u rm $httpdir/simple.rss runs -u begin "10 items" make_n 10 $httpdir/feed.rss add "feed 0 $httpurl/feed.rss" runs -uw output_n 10 begin "new articles are collected" make_n 3 $httpdir/feed.rss add "feed 0 $httpurl/feed.rss" runs -uw output_n 3 make_n 6 $httpdir/feed.rss runs -uw output_n 6 begin "outputfile" make_rss20 $httpdir/feed.rss add "feed 0 $httpurl/feed.rss" add "outputfile second.html" runs -uw contains $statedir/second.html example-feed-title begin "outputfile -" make_rss20 $httpdir/feed.rss add "feed 0 $httpurl/feed.rss" add "outputfile -" run -uw contains $outfile example-feed-title begin "maxarticles 10" make_n 20 $httpdir/feed.rss add "maxarticles 10" add "feed 0 $httpurl/feed.rss" runs -uw output_n 10 not_output_range 11 20 begin "maxage 30m" fake_time 1408794484.0 make_n 10 $httpdir/feed.rss add "maxage 30m" add "feed 0 $httpurl/feed.rss" runs 
-uw output_n 10 # This is 45 minutes later than the time above. fake_time 1408797184.0 make_n 20 $httpdir/feed.rss runs -uw not_output_range 1 10 output_range 11 20 begin "keepmin 10" make_n 20 $httpdir/feed.rss add "keepmin 10" add "expireage 0" add "feed 0 $httpurl/feed.rss" runs -uw output_n 20 make_n 5 $httpdir/feed.rss runs -uw # Should have the 5 currently in the feed, and 10 in total output_n 5 if [ $(grep range-title- $statedir/output.html | wc -l) != 10 ]; then die "Should contain 10 items" fi begin "currentonly true" make_n 10 $httpdir/feed.rss add "currentonly true" add "feed 0 $httpurl/feed.rss" runs -uw output_n 10 make_n 5 $httpdir/feed.rss runs -uw output_n 5 not_output_range 6 10 for state in false true; do begin "useids $state" add "useids $state" add "hideduplicates none" add "feed 0 $httpurl/feed.atom" echo "Original" | make_atom10_with $httpdir/feed.atom runs -uw contains $statedir/output.html Original echo "Revised" | make_atom10_with $httpdir/feed.atom runs -uw contains $statedir/output.html Revised if $state; then # Should have updated the existing article not_contains $statedir/output.html Original else # Should have kept both versions contains $statedir/output.html Original fi done dupecheck () { add "useids false" add "feed 0 $httpurl/feed.atom" make_single $httpdir/feed.atom item-a \ http://example.org/link/x http://example.org/id/0 runs -u make_single $httpdir/feed.atom item-b \ http://example.org/link/x http://example.org/id/1 runs -u make_single $httpdir/feed.atom item-c \ http://example.org/link/y http://example.org/id/1 runs -uw } begin "hideduplicates none" add "hideduplicates none" dupecheck contains $statedir/output.html item-a-title item-b-title item-c-title begin "hideduplicates id" add "hideduplicates id" dupecheck contains $statedir/output.html item-a-title item-c-title not_contains $statedir/output.html item-b-title begin "hideduplicates link" add "hideduplicates link" dupecheck contains $statedir/output.html item-b-title 
item-c-title not_contains $statedir/output.html item-a-title begin "hideduplicates link id" add "hideduplicates link id" dupecheck contains $statedir/output.html item-c-title not_contains $statedir/output.html item-a-title item-b-title begin "allowduplicates" add "feeddefaults allowduplicates=true" add "hideduplicates link id" dupecheck contains $statedir/output.html item-a-title item-b-title item-c-title begin "sortbyfeeddate false/true" # Debian bug 651080. for day in 01 02 03; do cat >$httpdir/$day.atom < example-feed-title-${day} 2013-01-${day}T18:00:00Z example-feed-author http://example.org/${day}/feed-id example-item-title-${day} http://example.org/${day}/item-id 2013-01-${day}01T18:00:00Z ENTRY-${day} EOF done entries () { grep 'ENTRY' $statedir/output.html | sed 's,.*ENTRY-\(..\).*,\1,' | xargs -n10 echo } add "feed 0 $httpurl/03.atom" runs -u add "feed 0 $httpurl/02.atom" runs -u add "feed 0 $httpurl/01.atom" add "sortbyfeeddate false" runs -uw equals "01 02 03" "$(entries)" add "sortbyfeeddate true" runs -w equals "03 02 01" "$(entries)" for dstate in false true; do for tstate in false true; do begin "daysections $dstate, timesections $tstate" cat >$httpdir/feed.rss < example-feed-title http://example.org/ example-feed-description Thu, 03 Jan 2013 18:00:00 +0000 item-1 http://example.org/1 Wed, 02 Jan 2013 18:00:00 +0000 item-2 http://example.org/2 Tue, 01 Jan 2013 19:00:00 +0000 item-3 http://example.org/3 Tue, 01 Jan 2013 18:00:00 +0000 item-4 http://example.org/4 EOF add "dayformat day(%d)" add "timeformat time(%H)" add "daysections $dstate" add "timesections $tstate" add "sortbyfeeddate true" add "feed 0 $httpurl/feed.rss" runs -uw if $dstate; then contains $statedir/output.html \ 'day(01)' 'day(02)' 'day(03)' else not_contains $statedir/output.html 'day(' fi if $tstate; then contains $statedir/output.html \ 'time(18)' 'time(19)' else not_contains $statedir/output.html 'time(' fi done done begin "default templates" make_rss20 $httpdir/simple.rss add 
"feed 0 $httpurl/simple.rss" runs -uw cp $statedir/output.html $statedir/output.html.orig for template in page item feedlist feeditem; do run -s $template cp $outfile $statedir/$template run --show $template same $outfile $statedir/$template add "${template}template ${template}" done run -w same $statedir/output.html.orig $statedir/output.html begin "show unknown template" run -s aubergine contains $outfile "Unknown template name: aubergine" begin "pre-2.15 template options" make_rss20 $httpdir/simple.rss add "feed 0 $httpurl/simple.rss" runs -uw cp $statedir/output.html $statedir/output.html.orig run -t cp $outfile $statedir/page run --show-template same $outfile $statedir/page run -T cp $outfile $statedir/item run --show-itemtemplate same $outfile $statedir/item add "template page" add "itemtemplate item" run -w same $statedir/output.html.orig $statedir/output.html echo MAGIC1__items__ >$statedir/page echo MAGIC2 >$statedir/item run -uw contains $statedir/output.html MAGIC1 MAGIC2 for template in page item feedlist feeditem; do begin "missing ${template} template file" add "${template}template ${template}" runne "Can't read template file" -u done begin "template conditionals" make_atom10 $httpdir/feed.atom cat >$statedir/item <$statedir/item make_atom10 $httpdir/feed.atom add "feed 0 $httpurl/feed.atom" add "itemtemplate item" runne "Character encoding problem" -uw if [ -n "$utf8_LC_ALL" ]; then begin "UTF-8 in template, UTF-8 locale" echo "char(ø)" >$statedir/item make_atom10 $httpdir/feed.atom add "feed 0 $httpurl/feed.atom" add "itemtemplate item" export LC_ALL="$utf8_LC_ALL" runs -uw contains $statedir/output.html "char(ø)" fi begin "UTF-8 in define, ASCII locale" make_atom10 $httpdir/feed.atom echo "expand(__thing__)" >$statedir/item add "itemtemplate item" add "feed 0 $httpurl/feed.atom" add " define_thing char(ø)" runne "Character encoding problem" -uw if [ -n "$utf8_LC_ALL" ]; then begin "UTF-8 in define, UTF-8 locale" make_atom10 $httpdir/feed.atom echo 
"expand(__thing__)" >$statedir/item add "itemtemplate item" add "feed 0 $httpurl/feed.atom" add " define_thing char(ø)" export LC_ALL="$utf8_LC_ALL" runs -uw contains $statedir/output.html "expand(char(ø))" fi begin "item dates" # Debian bug 651080. run -s item cp $outfile $statedir/item echo "__date__" >>$statedir/item make_atom10 $httpdir/feed.atom add "feed 0 $httpurl/feed.atom" add "sortbyfeeddate true" add "timeformat HEADING-%m-%d-%H:%M" add "datetimeformat ITEMDATE-%m-%d-%H:%M" add "itemtemplate item" runs -uw contains $statedir/output.html "HEADING-01-01-18:00" "ITEMDATE-01-01-18:00" begin "dates shown in local time" echo "__date__" >$statedir/item make_atom10 $httpdir/feed.atom add "feed 0 $httpurl/feed.atom" add "sortbyfeeddate true" add "timeformat HEADING-%m-%d-%H:%M" add "datetimeformat ITEMDATE-%m-%d-%H:%M" add "itemtemplate item" runs -u export TZ="GMT+5" runs -w contains $statedir/output.html "HEADING-01-01-13:00" "ITEMDATE-01-01-13:00" export TZ="$default_TZ" runs -w contains $statedir/output.html "HEADING-01-01-18:00" "ITEMDATE-01-01-18:00" if [ -n "$ja_LC_ALL" ]; then begin "dates shown in Japanese" echo "__date__" >$statedir/item make_atom10 $httpdir/feed.atom add "feed 0 $httpurl/feed.atom" add "sortbyfeeddate true" add "timeformat HEADING-%A-%c" add "datetimeformat ITEMDATE-%A-%c" add "itemtemplate item" export LC_ALL="$ja_LC_ALL" runs -uw # Japanese for Tuesday, in Unicode. tue="火曜日" contains $statedir/output.html "HEADING-$tue-" "ITEMDATE-$tue-" not_contains $statedir/output.html "Tuesday" export LC_ALL="$default_LC_ALL" runs -uw contains $statedir/output.html "HEADING-Tuesday" "ITEMDATE-Tuesday" fi begin "strange dates in feeds" # Python's time.strftime can't handle all possible dates, and the range of # dates that Python can work with in time_t format varies between platforms. 
# rawdog won't be able to display dates that Python can't handle, but it # should at least not crash if feedparser decides to present them # (for example, if feedparser misparses a timezone as a feed). echo "__date__" >$statedir/item add "itemtemplate item" add "sortbyfeeddate true" cat >$httpdir/feed.rss < example-feed-title http://example.org/ example-feed-description Date in 300 outside 32-bit time_t range http://example.org/item Mon, 1 Jan 0300 01:23:45 +0000 Date in 1750 using Julian calendar http://example.org/item Mon, 1 Jan 1750 01:23:45 +0000 Date in 1969 with negative time_t http://example.org/item Wed, 1 Jan 1969 01:23:45 +0000 Date in 2015 that feedparser 5.2.0 misparses as 300 http://zeptobars.ru/en/rss Fri, 20 Mar 15 17:32:14 +0300 EOF add "feed 0 $httpurl/feed.rss" runs -uw begin "item authors" cat >$httpdir/feed.atom < example-feed-title 2013-01-01T18:00:00Z http://example.org/feed-id author-1 example-item-title-1 http://example.org/item-id/1 2013-01-01T18:00:00Z example-item-description author-2 author2@example.org example-item-title-2 http://example.org/item-id/2 2013-01-01T18:00:00Z example-item-description author-3 http://example.org/author3 example-item-title-3 http://example.org/item-id/3 2013-01-01T18:00:00Z example-item-description author-4 author4@example.org http://example.org/author4 example-item-title-4 http://example.org/item-id/4 2013-01-01T18:00:00Z example-item-description http://a5.example.org example-item-title-5 http://example.org/item-id/5 2013-01-01T18:00:00Z example-item-description EOF cat >$statedir/item <author-2)" \ "author(author-3)" \ "author(author-4)" \ "author(http://a5.example.org)" begin "feed list templates" make_rss20 $httpdir/0.rss make_rss20 $httpdir/1.rss make_rss20 $httpdir/2.rss add "feed 0 $httpurl/0.rss" add "feed 0 $httpurl/1.rss" add "feed 0 $httpurl/2.rss" run -s feedlist cp $outfile $statedir/feedlist echo "FEEDLIST" >>$statedir/feedlist run -s feeditem cp $outfile $statedir/feeditem echo 
"FEEDITEM-__feed_url__" >>$statedir/feeditem add "feedlisttemplate feedlist" add "feeditemtemplate feeditem" run -w contains $statedir/output.html \ FEEDLIST \ FEEDITEM-$httpurl/0.rss FEEDITEM-$httpurl/1.rss FEEDITEM-$httpurl/2.rss begin "prefer content over summary" make_atom10_with $httpdir/1.atom <Content1 EOF make_atom10_with $httpdir/2.atom <Summary2 EOF # Note that feedparser 5.1.3 will do odd things if summary follows content -- # feedparser issue 412. make_atom10_with $httpdir/3.atom <Summary3 Content3 EOF add "useids false" add "hideduplicates none" add "feed 0 $httpurl/1.atom" add "feed 0 $httpurl/2.atom" add "feed 0 $httpurl/3.atom" runs -uw contains $statedir/output.html Content1 Summary2 Content3 not_contains $statedir/output.html Summary3 begin "showfeeds true/false" make_atom10 $httpdir/simple.atom add "feed 0 $httpurl/simple.atom" runs -u add "showfeeds true" runs -w contains $statedir/output.html $httpurl/simple.atom add "showfeeds false" runs -w not_contains $statedir/output.html $httpurl/simple.atom begin "userefresh true/false" make_atom10 $httpdir/0.atom make_atom10 $httpdir/1.atom # It should pick the lowest of these and convert to seconds. add "feed 1m $httpurl/0.atom" add "feed 2m $httpurl/1.atom" runs -u add "userefresh true" runs -w contains $statedir/output.html 'http-equiv="Refresh" content="60"' add "userefresh false" runs -w not_contains $statedir/output.html 'http-equiv="Refresh"' begin "HTTP basic authentication" make_rss20 $httpdir/private.rss add "feed 0 $httpurl/auth-TestUser-TestPass/private.rss" rune "401" -u add " user TestUser" add " password TestPass" runs -u # Generate a plugin to check that feedparser returned a particular HTTP status # code. checkstatus () { cat >$statedir/plugins/checkstatus.py <$httpdir/.rewrites "/old.rss /301/new.rss" rune "has been updated automatically" -uw # We should still have the original items at this point. 
output_range 1 10 runs -uw output_range 1 10 done begin "changeconfig for feed from included file" make_rss20 $httpdir/feed.rss add "changeconfig true" add "include config2" echo >$statedir/config2 "feed 0 $httpurl/301/feed.rss" rune "has been updated automatically" -u # FIXME: this behaviour is probably not what the user wanted. # rawdog should probably complain that it's trying to change # something but hasn't succeeded. not_contains $statedir/config "$httpurl/feed.rss" contains $statedir/config2 "$httpurl/301/feed.rss" not_contains $statedir/config2 "$httpurl/feed.rss" begin "changeconfig to same URL as existing feed" make_rss20 $httpdir/feed.rss add "changeconfig true" add "feed 0 $httpurl/feed.rss" runs -u add "feed 0 $httpurl/301/feed.rss" rune "already subscribed" -u for state in false true; do begin "changeconfig to URL of just-removed feed, splitstate $state" make_rss20 $httpdir/feed.rss add "splitstate $state" add "changeconfig true" add "feed 0 $httpurl/feed.rss" runs -u # Simulate the change failing, then succeeding. for i in 1 2; do : >$statedir/config add "splitstate $state" add "changeconfig true" add "feed 0 $httpurl/301/feed.rss" rune "has been updated automatically" -u contains $statedir/config "$httpurl/feed.rss" not_contains $statedir/config "$httpurl/301/feed.rss" done runs -u done begin "feed format text" make_rss20_desc $httpdir/feed.rss <three < four" begin "feed id" make_rss20 $httpdir/0.rss make_rss20 $httpdir/1.rss add "feed 0 $httpurl/0.rss id=blah" add "feed 0 $httpurl/1.rss" add "itemtemplate item" echo "feed-id(__feed_id__)" >$statedir/item runs -uw contains $statedir/output.html "feed-id(blah)" "feed-id(examplefeedtitle)" begin "shorttag expansion" #

bug fixed 2006-01-07. #
/ has a workaround in feedparser for sgmllib. add "tidyhtml false" write_desc <0
" \ "1
" \ "2
/" begin "broken processing instruction" write_desc < link

]]> EOF contains $statedir/output.html \ "$httpurl/rel-link" \ "$httpurl/rel-img" begin "Javascript removed" write_desc < span

]]> EOF not_contains $statedir/output.html "Annoying1" "Annoying2" begin "stray ] in URL" # This produced an "Invalid IPv6 URL" exception with feedparser r738. write_desc <link

]]> EOF contains $statedir/output.html not-broken if $knownbad; then begin "escaped slashes in URL" # feedparser issue 407: links with :// escaped get mangled (reported in # rawdog by Joseph Reagle). write_desc <link link link link ]]> EOF contains $statedir/output.html \ http://example.com/0 http://example.com/1 \ http://example.com/2 http://example.com/3 fi begin "add feed, actually a feed" make_rss20 $httpdir/feed.rss rune "Adding feed" -a $httpurl/feed.rss contains "$statedir/config" $httpurl/feed.rss begin "add feed, relative " # Debian bug 657206. make_rss20 $httpdir/feed.rss make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/feed.rss begin "add feed, absolute " make_rss20 $httpdir/feed.rss make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/feed.rss begin "add feed, typical blog" # Roughly what blogspot pages have. make_atom10 $httpdir/posts make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/posts not_contains "$statedir/config" "alt=rss" begin "add feed, avoid HTML " make_html $httpdir/dummy.html make_html_head $httpdir/page.html < EOF rune "Cannot find any feeds" -a $httpurl/page.html begin "add feed, with obvious URL" make_rss20 $httpdir/foo.rss make_html_body $httpdir/page.html <Here is our feed!

EOF rune "Adding feed" -a $httpurl/page.html if $knownbad; then begin "add feed, with non-obvious URL" # ... as boingboing.net currently has (old feedfinder doesn't find # this; it finds /atom.xml by brute force). make_rss20 $httpdir/foo make_html_body $httpdir/page.html <Here is our RSS feed!

EOF rune "Adding feed" -a $httpurl/page.html fi if $knownbad; then # Old feedfinder could find this because it tried appending lots of # likely suffixes to URLs. However, this generally isn't needed # nowdays; most of the feeds that it could find that way have proper # elements. begin "add feed, brute force" make_atom10 $httpdir/index.atom make_html $httpdir/page.html rune "Adding feed" -a $httpurl/page.html fi begin "add feed, no feeds to be found" make_html $httpdir/page.html rune "Cannot find any feeds" -a $httpurl/page.html begin "add feed, nonsense in HTML" # Debian bug 650776. This will provoke a HTMLParseError. make_rss20 $httpdir/feed.rss make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/feed.rss begin "add feed, already present" make_atom10 $httpdir/feed.atom add "feed 3h $httpurl/feed.atom" rune "already in the config file" -a $httpurl/feed.atom begin "add feed, prefer RSS 1.0 over nonsense" make_rss10 $httpdir/feed.rdf echo "this is nonsense" >$httpdir/feed.rss make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/feed.rdf begin "add feed, prefer RSS 2 over RSS 1.0" make_rss10 $httpdir/feed.rdf make_rss20 $httpdir/feed.rss make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/feed.rss begin "add feed, prefer .rss2 over .rss" make_rss20 $httpdir/feed.rss make_rss20 $httpdir/feed.rss2 make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/feed.rss2 begin "add feed, prefer Atom over RSS" make_rss10 $httpdir/feed.rdf make_rss20 $httpdir/feed.rss make_rss20 $httpdir/feed.rss2 make_atom10 $httpdir/feed.atom make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/feed.atom begin "add feed, prefer entries over comments" make_atom10 
$httpdir/comments.atom make_atom10 $httpdir/entries.atom make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/entries.atom begin "add feed, keep page order" make_atom10 $httpdir/0.atom make_atom10 $httpdir/1.atom make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/0.atom begin "add feed, ignore broken link" make_atom10 $httpdir/1.atom make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/1.atom begin "add feed, UTF-8 in attr" # This problem showed up in orbitbooks.net's front page. The intent is fine, # but it crashes Python 2.7's HTMLParser if it's not properly decoded. make_atom10 $httpdir/feed.atom make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/feed.atom begin "add feed, gzip-encoded response" make_rss20 $httpdir/feed.rss make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/gzip/page.html contains "$statedir/config" $httpurl/feed.rss begin "add feed, gzip-encoded feed" make_rss20 $httpdir/feed.rss make_html_head $httpdir/page.html < EOF rune "Adding feed" -a $httpurl/page.html contains "$statedir/config" $httpurl/gzip/feed.rss begin "remove feed" add "feed 3h $httpurl/0.rss" add "feed 3h $httpurl/1.rss" add "feed 3h $httpurl/2.rss" rune "Removing feed" -r $httpurl/1.rss contains "$statedir/config" $httpurl/0.rss $httpurl/2.rss not_contains "$statedir/config" $httpurl/1.rss begin "remove feed with options" add "feed 3h $httpurl/0.rss" add " define_foo 0a" add " define_foo 0b" add "feed 3h $httpurl/1.rss" add " define_foo 1a" add " define_foo 1b" add "feed 3h $httpurl/2.rss" add " define_foo 2a" add " define_foo 2b" rune "Removing feed" -r $httpurl/1.rss contains "$statedir/config" \ $httpurl/0.rss "foo 0a" "foo 0b" \ $httpurl/2.rss "foo 2a" "foo 2b" not_contains "$statedir/config" \ 
$httpurl/1.rss "foo 1a" "foo 1b"

begin "remove feed, preserving comments"
add "feed 3h $httpurl/0.rss"
add " define_foo 0a"
add "# Keep this comment"
add " define_foo 0b"
rune "Removing feed" -r $httpurl/0.rss
contains $statedir/config "# Keep this comment"
not_contains $statedir/config "foo 0a" "foo 0b"

begin "remove nonexistent feed"
add "feed 3h $httpurl/0.rss"
add "feed 3h $httpurl/1.rss"
add "feed 3h $httpurl/2.rss"
rune "not in the config file" -r $httpurl/3.rss

for state in false true; do
	for fetched in false true; do
		not=$(if ! $fetched; then echo "not "; fi)
		begin "remove feed, ${not}fetched, splitstate $state"
		make_rss20 $httpdir/feed.rss
		add "feed 0 $httpurl/feed.rss"
		add "splitstate $state"
		if $fetched; then
			runs -uw
			contains $statedir/output.html example-item-title
			if $state; then
				exists $statedir/feeds/*
			fi
		fi
		rune "Removing feed" -r $httpurl/feed.rss
		if $state; then
			not_exists $statedir/feeds/*
		fi
		runs -uw
		not_contains $statedir/output.html example-item-title
	done
done

# Run the plugins test suite if it's there.
if [ -e rawdog-plugins/test-plugins ]; then
	. rawdog-plugins/test-plugins
fi

exit $exitcode

rawdog-2.21/NEWS

- rawdog 2.21

Don't crash when asked to show a non-existent template ("-s foo") -- and fix test-rawdog so that it detects when a test is expected to fail with an error message, but actually crashes.

Use grep -a when searching for locales in test-rawdog, since some locales have non-ASCII names.

Fix some style problems reported by pylint.

Use cStringIO rather than StringIO in all modules (rather than some using one and some using the other).

Don't crash when feedparser returns a date that Python can't format. Since feedparser's date parser is pretty liberal, it can occasionally interpret an invalid date incorrectly (e.g. treating a time zone as a year number).
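The date-formatting fix just described amounts to a defensive wrapper around strftime. A minimal sketch of the idea (illustrative only; `safe_strftime` is a hypothetical name, not rawdog's actual code):

```python
import time

def safe_strftime(fmt, tm):
    # Dates that feedparser misparses (e.g. a time zone taken as a year
    # number) can make strftime raise; fall back to a placeholder
    # instead of crashing the whole run.
    try:
        return time.strftime(fmt, tm)
    except (ValueError, OverflowError):
        return "(unformattable date)"
```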
When an error occurs while parsing the config file, exit with a non-zero return code, rather than continuing with an incomplete config. This has always been the intended behaviour, but rawdog 2.13 accidentally removed the check.

- rawdog 2.20

Add a test for the maxage option (suggested by joelmo).

Add a rule to style.css to scale down large images.

When looking for SSL timeout exceptions, match any form of "timeout", "time out" or "timed out" in the message. This makes rawdog detect SSL connection timeouts correctly with Debian's Python 2.7.8 package (reported by David Suárez).

- rawdog 2.19

Make test-rawdog not depend on having a host it can test connection timeouts against, and add a -T option if you do have one.

When renaming a feed's state file in splitstate mode, don't fail if the state file doesn't exist -- which can happen if we get a 301 response for a feed the first time we fetch it. Also rename the lock file along with the state file.

Add some more comprehensive tests for the changeconfig option; in particular, test it more thoroughly with splitstate both on and off.

Don't crash if feedparser raises an exception during an update (i.e. assume that any part of feedparser's response might be missing, until we've checked that there wasn't an exception).

- rawdog 2.18

Be consistent about catching AttributeError when looking for attributes that were added to Rawdog during the 2.x series (spotted by Jakub Wilk).

Add some advice in PLUGINS about escaping template parameters. Willem reported that the enclosure plugin didn't do this, and having had a look at the others it seems to be a common problem.

Make feedscanner handle "Content-Encoding: gzip" in responses, as tumblr.com's webservers will use this even if you explicitly refuse it in the request.

- rawdog 2.17

Add a one-paragraph description of rawdog to the README file, for use by packagers.

Fix some misquoted dashes in the man page (spotted by lintian).
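The "Content-Encoding: gzip" handling described under 2.18 above can be sketched like this (a sketch of the idea only; `maybe_gunzip` and the header-dict shape are assumptions, not feedscanner's real interface):

```python
import gzip
import io

def maybe_gunzip(headers, body):
    # Some servers (tumblr's, for instance) send gzipped bodies even
    # when the client refused gzip, so check the response header and
    # decompress if necessary.
    if headers.get("content-encoding", "").lower() == "gzip":
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body
```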
Set LC_ALL=C and TZ=UTC when running tests, in order to get predictable results regardless of locale or timezone (reported by Etienne Millon).

Give sensible error messages on startup (rather than crashing) if the config file or a template file is missing, or contains characters that aren't in the system encoding.

Give test-rawdog some command-line options; you can now use it to test an installed version of rawdog, or rawdog running under a non-default Python version.

Add some more tests to the test suite, having done a coverage analysis to work out which features weren't yet being tested: date formatting in varying locales and timezones; RSS 1.0 support; --dump, -c, -f, -l, -N, -v and -W; include; plugin loading; feed format and id options; author formatting; template conditionals; broken 301 redirects; useids; content vs. summary; daysections/timesections; removing articles from a feed; keeping comments; numthreads; outputfile; various error messages.

Use author URIs retrieved from feeds when formatting author names (rather than ignoring them; this was the result of a feedparser change).

Make subclasses of Persistable call Persistable's constructor. (Identified by coverage analysis.)

Don't crash when trying to show a template that doesn't exist.

When removing a feed in splitstate mode, remove its lock file too.

- rawdog 2.16

Remove the bundled copy of feedparser, and document that it's now a dependency.

Update the package metadata in setup.py.

- rawdog 2.15

rawdog now requires Python 2.6 (rather than Python 2.2). This is the version in Debian and Red Hat's previous stable releases, so it should be safe to assume on current systems. Make setup.py complain if you have an inappropriate Python version.

Remove obsolete code that supported pre-2.6 versions of Python (timeoutsocket.py, conditional imports, 0/1 for bools, dicts for sets, locking without "with", various standard library features).

Tidy up the code formatting in a few places to make it closer to PEP 8.
Make the rawdog(1) man page describe all of rawdog's options, and make some other minor improvements to the documentation and help.

Remove the --upgrade option; I think it's highly unlikely that anybody still has any rawdog 1 state files around.

Make the code that manages the pool of feed-fetching threads only start as many threads as necessary (including none if there's only one feed to fetch), and generally tidy it up.

Add test-rawdog, a simple test suite for rawdog with a built-in webserver. You should be able to run this from the rawdog source directory to check that much of rawdog is working correctly. (If you have the rawdog plugins repo in a subdirectory called "rawdog-plugins", it'll run tests on some of the plugins too.)

Add a -V option, which is like -v but appends the verbose output to a file. This is mostly useful for testing.

Significantly rework the Persister class: there's now a Persisted class that can act as a context manager for "with" statements, which simplifies the code quite a bit, and it correctly handles persisted objects being opened multiple times and renamed. persister.py is now under the same license as the rest of rawdog (GPLv2+).

Fix a bug: if you're using splitstate mode, and a feed returns a 301 permanent redirect, rawdog needs to rename the state file and adjust the articles in it so they're attached to the feed's new URL. In previous versions this didn't work correctly for two reasons: it tried to load the existing articles from the new filename, and the resulting file got clobbered because it was already being used by --update.

Rework the locking logic in persister so that it uses a separate lock file. This fixes a (mostly) harmless bug: previously if rawdog A was waiting for rawdog B to finish, then rawdog A wouldn't see the changes rawdog B had written to the state file. More importantly, it means rawdog won't leave an empty ("corrupt") state file if it crashes during the first update or write.
Split state files are now explicitly marked as modified if any articles were expired from them. (This won't actually change rawdog's behaviour, since articles were only expired if some articles had been seen during the update, and that would also have marked the state as modified.)

When splitstate is enabled, make the feeds directory if it doesn't already exist. This avoids a confusing error message if you didn't make it by hand.

rawdog now complains if feedparser can't detect the type of a feed or retrieve any items from it. This usually means that the URL isn't actually a feed -- for example, if it's redirecting to an error page. rawdog can now report more than one error for a feed at once -- e.g. a permanent redirection to something that isn't a feed.

Show URLError exceptions returned by feedparser -- this means rawdog gives a sensible error message for a file: or ftp: URL that gives an error, rather than claiming it's a timeout. Plain filenames are now turned into file: URLs so you get consistent errors for both, and timeouts are detected by looking for a timeout exception.

Use a custom urllib2 handler to capture all the HTTP responses that feedparser sees when handling redirects. This means rawdog can now see both the initial and final status code (rather than the combined one feedparser returns) -- so it can correctly handle redirects to errors, and redirects to redirects.

Make "hideduplicates id link" work correctly in the odd corner case where an article has both id and link duplicated, but to different other articles.

Upgrade feedparser to version 5.1.3. As a result of the other changes below, rawdog's copy of feedparser is now completely unmodified -- so it should be safe to remove it and use your system version if you prefer (provided it's new enough).

Add a --dump option to pretty-print feedparser's output for a URL. The feedparser module used to do this if invoked as a script, but more recent versions of feedparser don't support this.
Use a custom urllib2 handler to do HTTP basic authentication, instead of a feedparser patch. This also fixes proxy authentication, which I accidentally broke by removing a helper class several releases ago.

Use a custom urllib2 handler to disable RFC 3229, instead of a feedparser patch. The behaviour is slightly different in that it now sends "A-IM: identity" rather than no header at all; this should have the same effect, though.

Remove the feedparser patch that provided "_raw" versions of content (before sanitisation) for use in the article hash, and use the normal version instead. Since we disable sanitisation at fetch time anyway, the only difference with current feedparser is that the _raw versions didn't have CP1252 encoding fixes applied -- so in the process of upgrading to this version, you'll see some duplicate articles on feeds with CP1252 encoding problems. Tests suggest this doesn't affect many feeds (3 out of the 1000-odd in my test setup).

Set feedparser behaviour using SANITIZE_HTML etc., rather than by directly changing the lists of elements it's looking for.

Replace feedfinder, which has unfixable unclear licensing, with the module that Decklin Foster wrote for his Debian package of rawdog (specifically rawdog_2.13.dfsg.1-1). I've renamed it to "feedscanner", on the grounds that it may be useful to other projects as well in the future. Put feedscanner's license notice into __license__, for consistency with feedparser.

Make feedscanner understand HTML-style as well as XHTML-style .

Fix Debian bug 657206: make feedscanner understand relative links (reported by Peter J. Weisberg).

Fix Debian bug 650776: make feedscanner not crash if it can't parse the URL it was given as HTML (reported by Jonathan Polley).

Make rawdog use feedscanner's preferred order of feeds in addition to its own.

Make feedscanner only return URLs that feedparser can parse successfully as feeds.
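The "A-IM: identity" workaround mentioned above can be sketched with a custom handler (shown with Python 3's urllib.request for illustration; rawdog itself targeted Python 2's urllib2, and the class name is hypothetical):

```python
import urllib.request

class DisableRFC3229Handler(urllib.request.BaseHandler):
    # Explicitly ask for the full feed ("A-IM: identity") rather than
    # omitting the header, so servers never reply with an RFC 3229
    # "feed" delta.
    def http_request(self, req):
        if not req.has_header("A-im"):
            req.add_header("A-IM", "identity")
        return req

    https_request = http_request
```

An opener built with `urllib.request.build_opener(DisableRFC3229Handler())` then applies the header to every request it makes.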
Make feedscanner look for links pointing to URLs with words in them that suggest they're probably feeds.

Make feedscanner check whether the URL it was given is already a feed before scanning it for links.

Make feedscanner decode the HTML it reads (silently ignoring errors) before trying to parse it.

Move rawdog's feed quality heuristic into feedscanner.

Simplify the options for dealing with templates: there is now a -s/--show command-line option that takes a template name as an argument (i.e. you do "rawdog -s item" rather than "rawdog -T"), and the "template" config file option is now called "pagetemplate". This simplifies the code, and makes it possible to add more templates without adding more command-line options. (For backwards compatibility, all the old command-line and config-file options are still accepted, and rawdog.get_template(config) will still return the page template.)

Add templates for the feed list and each item in the feed list (based on patch from Arnout Engelen).

Don't append an extra newline when showing a template.

- rawdog 2.14

When adding a new feed from a page that provides several feeds, make a more informed choice rather than just taking the first one: many blogs provide both content and comments feeds, and we usually want the former.

Add a note to PLUGINS about making sure plugin storage gets saved.

Use updated_parsed instead of the obsolete modified_parsed when extracting the feed-provided date for an item, and fall back to published_parsed and then created_parsed if it doesn't exist (reported by Cristian Rigamonti, Martintxo and chrysn). feedparser currently does fallback automatically, but it's scheduled to be removed at some point, so it's better for rawdog to do it.

- rawdog 2.13

Forcibly disable BeautifulSoup support in feedparser, since it returns unpickleable pseudo-string objects, and it crashes when trying to parse twenty or so of my feeds (reported by Joseph Reagle).
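The feed-ish-word heuristic mentioned at the start of this entry might look something like this (the word list and function name here are guesses for illustration, not feedscanner's actual code):

```python
import re

# Words in a URL that suggest the link is probably a feed.
FEED_WORDS = re.compile(r"\b(rss|atom|rdf|feed|xml)\b", re.IGNORECASE)

def looks_like_feed_url(url):
    # A link whose URL contains a feed-ish word is worth trying first.
    return FEED_WORDS.search(url) is not None
```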
Make the code that cleans up feedparser's return value more thorough -- in particular, turn subclasses of "unicode" into real unicode objects.

Decode the config file from the system encoding, and escape "define_"d strings when they're written to the output file (reported by Cristian Rigamonti).

Add the "showtracebacks" option, which causes exceptions that occur while a feed is being fetched to be reported with a traceback in the resulting error message.

Use PyTidyLib in preference to mx.Tidy when available (suggested by Joseph Reagle). If neither is available, "tidyhtml true" just does nothing, so it's now turned on in the provided config file. The mxtidy_args hook is now called tidy_args.

Allow template variables to start with an underscore (patch from Oberon Faelord).

Work around broken DOCTYPEs that confuse sgmllib.

If -v is specified, force verbose on again after reading a secondary config file (reported by Jonathan Phillips).

Resynchronise the feed list after loading a secondary config file; previously feeds in secondary config files were ignored (reported by Jonathan Phillips).

- rawdog 2.12

Make rawdog work with Python 2.6 (reported by Roy Lanek).

If feedfinder (which now needs Python 2.4 or later) can't be imported, just disable it.

Several changes as a result of profiling that significantly speed up writing output files:
- Make encode_references() use regexp replacement.
- Cache the result of locale.getpreferredencoding().
- Use tuple lists rather than custom comparisons when sorting.

Update feedparser to revision 291, which fixes the handling of elements (reported by Darren Griffith).

Only update the stored Etag and Last-Modified when a feed changes.

Add the "splitstate" option, which makes rawdog use a separate state file for each feed rather than one large one. This significantly reduces rawdog's memory usage at the cost of some more disk IO during --write.
The old behaviour is still the default, but I would recommend turning splitstate on if you read a lot of feeds, if you use a long expiry time, or if you're on a machine with limited memory.

As a result of the splitstate work, the output_filter and output_sort hooks have been removed (because there's no longer a complete list of articles to work with). Instead, there's now an output_sort_articles hook that works with a list of article summaries.

Add the "useids" option, which makes rawdog respect article GUIDs when updating feeds; if an article's GUID matches one we already know about, we just update the existing article's contents rather than treating it as a new article (like most aggregators do). This is turned on in the default configuration, since the behaviour it produces is generally more useful these days -- many feeds include random advertisements, or other dynamic content, and so the old approach resulted in lots of duplicated articles.

- rawdog 2.11

Avoid a crash when a feed's URL is changed and expiry is done on the same run.

Encode dates correctly in non-ASCII locales (reported by Damjan Georgievski).

Strengthen the warning in PLUGINS about the effects of overriding output_write_files (suggested by Virgil Bucoci).

Add the state directory to sys.path, so you can put modules that plugins need in your ~/.rawdog (suggested by Stuart Langridge).

When adding a feed, check that it isn't already present in the config file (suggested by Stuart Langridge).

Add --no-lock-wait option to make rawdog exit silently if it can't lock the state file (i.e. if there's already a rawdog running).

Update to the latest feedparser, which fixes an encoding bug with Python 2.5, among various other stuff (reported by Paul Tomblin, Tim Bishop and Joseph Reagle).

Handle the author_detail fields being None.

- rawdog 2.10

Work around a feedparser bug (returning a detail without a type field for posts with embedded SVG).

Pull in most of the changes from feedparser 4.1.
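The "useids" behaviour described above -- matching articles by GUID so a re-fetched entry updates in place rather than duplicating -- can be sketched as follows. The dict and field names here are illustrative, not rawdog's real state structures:

```python
# Articles keyed by GUID: a minimal sketch of GUID-based updating.
articles = {}

def add_or_update(guid, content):
    if guid in articles:
        # Known GUID: update the existing article's contents in place.
        articles[guid]["content"] = content
    else:
        # Unknown GUID: treat it as a new article.
        articles[guid] = {"content": content}

add_or_update("tag:example.org,2009:post-1", "first version")
add_or_update("tag:example.org,2009:post-1", "edited version")
print(len(articles))  # still one article, now with the edited content
```

With "useids" off, the second fetch of a feed whose entries change between fetches (rotating advertisements, say) would hash to a different article and be stored again, which is the duplication the entry above mentions.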
Fix a bug that stopped rawdog from working properly when no locale information was present in the environment, or on versions of Python without locale.getpreferredencoding() (reported by Michael Watkins).

Add --remove option to remove a feed from the config file (suggested by Wolfram Sieber).

Produce a more useful error message when $HOME isn't set (reported by Wolfram Sieber).

Fix a bug in the expiry code: if you were using keepmin, it could expire articles that were no longer current but should be kept.

Clean up the example config file a bit.

- rawdog 2.9

Fix a documentation bug about time formats (reported by Tim Bishop).

Fix a text-handling problem related to the locale changes (patch from Samuel Hym).

Fix use of the "A-IM: feed" header in HTTP requests. A previous upstream change to feedparser had modified it so that it always sent this header, which results in a subtle rawdog bug: if a feed returns a partial result (226) and then has no changes for a long time, rawdog can expire articles which should still be "current" in the feed. This version adds a "keepmin" option which makes a minimum number of articles be kept for each feed; this should avoid expiring articles that are still current. If you want the old behaviour, you can set "keepmin" to 0, in which case rawdog won't send the "A-IM: feed" header in its requests. rawdog also won't send that header if "currentonly" is set to true, since in that case the current set of articles is all rawdog cares about. (See Sam Ruby's discussion of the same problem in Planet.)

If the author's name is given as the empty string, fall back to the email address, URL or "author".

Change the labels in the feed information table to "Last fetched" and "Next fetched after", to match what rawdog actually does with the times it stores (reported by D. Stussy).
- rawdog 2.8

Fix authentication support -- feedparser now supports Basic and Digest authentication internally, but it needed tweaking to make it useful for rawdog (reported by Tim Bishop).

- rawdog 2.7

Make feedfinder smarter about trying to find the preferred type of feed (patch from Decklin Foster).

Add a plugin hook to let you modify mx.Tidy options (suggested by Jon Lasser).

Work correctly if the threading module isn't available (patch from Jack Diederich).

Update to feedparser 4.0.2, which includes some of our patches and fixes an unclear license notice (reported by Jason Diamond, Joe Wreschnig and Decklin Foster).

Fix a feedparser bug that caused things preceding shorttags to be duplicated when sanitising HTML.

Set the locale correctly when rawdog starts up (patch from Samuel Hym).

- rawdog 2.6

Allow maxage to be set per feed (patch from Craig Allen).

Support feeddefaults with no options on the same line, as used in the sample config file (reported by asher).

- rawdog 2.5

Ensure that all the strings in entry_info are in Unicode form, to make it easier for plugins to deal with them.

Fix a feedparser bug that was breaking feeds which include itunes elements (reported by James Cameron).

Make feedparser handle content types and modes in atom:content correctly (reported by David Dorward).

Make feedparser handle the new elements in Atom 1.0 (patch from Decklin Foster).

Remove some unnecessary imports found by pyflakes.

Add output_sorted_filter and output_write_files hooks, deprecating the output_write hook (which wasn't very useful originally, and isn't used by any of the plugins I've been sent).

Restructure the "write" code so that it should be far easier to write custom output plugins: there are several new methods on Rawdog for doing different bits of the write process.

When selecting articles to display, don't assume they're sorted in date order (a plugin might have done something different).

Don't write an extra newline at the end of the output file (i.e.
use f.write rather than print >>f), and be more careful about encoding when writing output to stdout.

Provide arbitrary persistent storage for plugins via a get_plugin_storage method on Rawdog (suggested by BAM).

Add -N option to avoid locking the state file, which may be useful if you're on an OS or filesystem that doesn't support locks (suggested by Andy Dustman).

If RAWDOG_PROFILE is set as an environment variable, rawdog will run under the Python profiler.

Make some minor performance improvements.

Change the "Error parsing feed" message to "Error fetching or parsing feed", since it really just indicates an error somewhere within feedparser (reported by Fred Barnes).

Add support for using multiple threads when fetching feeds, which makes updates go much faster if you've got lots of feeds. (The state-updating part of the update is still done sequentially, since parallelising it would mean adding lots of locking and making the code very messy.) To use this, set "numthreads" to be greater than 0 in your config file. Since it changes the semantics of one of the plugin hooks, it's off by default.

Update the GPL and LGPL headers to include the FSF's new address (reported by Decklin Foster).

- rawdog 2.4

Provide guid in item templates (suggested by Rick van Rein).

Update article-added dates correctly when "currentonly true" is used (reported by Rick van Rein).

Clarify description of -c in README and man page (reported by Rick van Rein).

If you return false from an output_items_heading function, then disable DayWriter (suggested by Ian Glover).

Fix description of article_seen in PLUGINS (reported by Steve Atwell).

Escape odd characters in links and guids, and add a sanity check that'll trip if non-ASCII somehow makes it to the output (reported by TheCrypto).

- rawdog 2.3

Make the id= parameter work correctly (patch from Jon Nelson).

- rawdog 2.2

Add "feeddefaults" statement to specify default feed options.
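The multi-threaded fetching added in rawdog 2.5 above (the "numthreads" option) -- parallel fetches followed by a sequential state-update pass -- can be sketched like this. rawdog itself is Python 2 code using the threading module, so the ThreadPoolExecutor here is a modern stand-in and fetch_feed is a hypothetical placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_feed(url):
    # Placeholder: rawdog would run feedparser over the URL here.
    return (url, "parsed feed data")

urls = ["http://example.org/a.xml", "http://example.org/b.xml",
        "http://example.org/c.xml"]

# Fetch in parallel...
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch_feed, urls))

# ...but apply the results to the stored state sequentially, mirroring
# the NEWS note that the state-updating part stays single-threaded.
state = {}
for url, data in results:
    state[url] = data
print(len(state))
```

Keeping the state update sequential avoids exactly the locking complexity the entry above mentions.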
Update feeds list from the config file whenever rawdog runs, rather than just when doing an update (reported by Decklin Foster).

Reload the config files after -a, so that "rawdog -a URL -u" has the expected behaviour (reported by Decklin Foster).

Add "define" statement and "define_X" feed option to allow the user to define extra strings for the template; you can use this, for example, to select classes for groups of feeds, generate different HTML for different sorts of feeds, or set the title in different pages generated from the same template (suggested by Decklin Foster).

Fix a logic error in the _raw changes to feedparser: if a feed didn't specify its encoding but contained non-ASCII characters, rawdog will now try to parse it as UTF-8 (which it should be) and, failing that, as ISO-8859-1 (in case it just contains non-UTF-8 junk).

Don't print the "state file may be corrupt" error if the user hits Ctrl-C while rawdog's loading it.

Add support for extending rawdog with plugin modules; see the "PLUGINS" file for more information.

Make "verbose true" work in the config file.

Provide __author__ in items, for use in feeds that support that (patch from Decklin Foster).

Fix conditional template expansion (patch from Decklin Foster).

Add "blocklevelhtml" statement to disable the "<p>" workaround for non-block-level HTML; this may be useful if you have a plugin that is doing different HTML sanitisation, or if your template already forces a block-level element around article descriptions.

Fix -l for feeds with non-ASCII characters in their titles.

Provide human-readable __feed_id__ in items (patch from David Durschlag), and add feed-whatevername class to the default item template; this should make it somewhat easier to add per-feed styles.

Handle feeds that are local files correctly, and handle file: URLs in feedparser (reported by Chris Niekel).

Allow feed arguments to be given on indented lines after the "feed" or "feeddefaults" lines; this makes it possible to have spaces in feed arguments.

Add a meta element to the default template to stop search engines indexing rawdog pages (patch from Rick van Rein).

Add new feeds at the end of the config file rather than before the first feed line (patch from Decklin Foster).

- rawdog 2.1

Fix a character encoding problem with format=text feeds.

Add proxyuser and proxypassword options for feeds, so that you can use per-feed proxies requiring HTTP Basic authentication (patch from Jon Nelson).

Add a manual page (written by Decklin Foster).

Remove extraneous #! line from feedparser.py (reported by Decklin Foster).

Update an article's modified date when a new version of it is seen (reported by Decklin Foster).

Support nested ifs in templates (patch from David Durschlag), and add __else__.

Make the README file list all the options that rawdog now supports (reported by David Durschlag).

Make --verbose work even if it's specified after an action (reported by Dan Noe and David Durschlag).

- rawdog 2.0

Update to feedparser 3.3. This meant reworking some of rawdog's internals; state files from old versions will no longer work with rawdog 2.0 (and external programs that manipulate rawdog state files will also be broken).
The new feedparser provides a much nicer API, and is significantly more robust; several feeds that previously caused feedparser internal errors or Python segfaults now work fine.

Add an --upgrade option to import state from rawdog 1.x state files into rawdog 2.x. To upgrade from 1.x to 2.x, you'll need to perform the following steps after installing the new rawdog:
- cp -R ~/.rawdog ~/.rawdog-old
- rm ~/.rawdog/state
- rawdog -u
- rawdog --upgrade ~/.rawdog-old ~/.rawdog (to copy the state)
- rawdog -w
- rm -r ~/.rawdog-old (once you're happy with the new version)

Keep track of a version number in the state file, and complain if you use a state file from an incompatible version.

Remove support for the old option syntax ("rawdog update write").

Remove workarounds for early 1.x state file versions.

Save the state file in the binary pickle format, and use cPickle instead of pickle so it can be read and written more rapidly.

Add hideduplicates and allowduplicates options to attempt to hide duplicate articles (based on patch from Grant Edwards).

Fix a bug when sorting feeds with no titles (found by Joseph Reagle).

Write the updated state file more safely, to reduce the chance that it'll be damaged or truncated if something goes wrong while it's being written (requested by Tim Bishop).

Include feedfinder, and add a -a|--add option to add a feed to the config file.

Correctly handle dates with timezones specified in non-UTC locales (reported by Paul Tomblin and Jon Lasser).

When a feed's URL changes, as indicated by a permanent HTTP redirect, automatically update the config file and state.

- rawdog 1.13

Handle OverflowError with parsed dates (patch from Matthew Scott).

- rawdog 1.12

Add "sortbyfeeddate" option for planet pages (requested by David Dorward).

Add "currentonly" option (patch from Chris Cutler).

Handle nested CDATA blocks in feed XML and HTML correctly in feedparser.
- rawdog 1.11

Add __num_items__ and __num_feeds__ to the page template, and __url__ to the item template (patch from Chris Cutler).

Add "daysections" and "timesections" options to control whether to split items up by day and time (based on patch from Chris Cutler).

Add "tidyhtml" option to use mx.Tidy to clean feed-provided HTML.

Remove the <div> wrapping __description__ from the default item template, and make rawdog add <p>...</p> around the description only if it doesn't start with a block-level element (which isn't perfect, but covers the majority of problem cases). If you have a custom item template and want rawdog to generate a better approximation to valid HTML, you should change "<p>__description__</p>" to "__description__".

HTML metacharacters in links are now encoded correctly in generated HTML ("foo?a=b&c=d" as "foo?a=b&amp;c=d").

Content type selection is now performed for all elements returned from the feed, since some Blogger v5 feeds cause feedparser to return multiple versions of the title and link (reported by Eric Cronin).

- rawdog 1.10

Add "ignoretimeouts" option to silently ignore timeout errors.

Fix SSL and socket timeouts on Python 2.3 (reported by Tim Bishop).

Fix entity encoding problem with HTML sanitisation that was causing rawdog to throw an exception upon writing with feeds containing non-US-ASCII characters in attribute values (reported by David Dorward, Dmitry Mark and Steve Pomeroy).

Include MANIFEST.in in the distribution (reported by Chris Cutler).

- rawdog 1.9

Add "clear: both;" to item, time and date styles, so that items with floated images in don't extend into the items below them.

Changed how rawdog selects the feeds to update; --verbose now shows only the feeds being updated.

rawdog now uses feedparser 2.7.6, which adds date parsing and limited sanitisation of feed-provided HTML; I've removed rawdog's own date-parsing (including iso8601.py) and relative-link-fixing code in favour of the more-capable feedparser equivalents.

The persister module in rawdoglib is now licensed under the LGPL (requested by Giles Radford).

Made the error messages that listed the state dir reflect the -b setting (patch from Antonin Kral).

Treat empty titles, links or descriptions as if they weren't supplied at all, to cope with broken feeds that specify "" (patch from Michael Leuchtenburg).

Make the expiry age configurable; previously it was hard-wired to 24 hours. Setting this to a larger value is useful if you want to have a page covering more than a day's feeds.

Time specifications in the config file can now include a unit; if no unit is specified it'll default to minutes or seconds as appropriate to maintain compatibility with old config files.
Boolean values can now be specified as "true" or "false" (or "1" or "0" for backwards compatibility). rawdog now gives useful errors rather than Python exceptions for bad values. (Based on suggestions by Tero Karvinen.)

Added datetimeformat option so that you can display feed and article times differently from the day and time headings, and added some examples including ISO 8601 format to the config file (patch from Tero Karvinen).

Forcing a feed to be updated with -f now clears its ETag and Last-Modified, so it should always be refetched from the server.

Short-form XML tags in RSS () are now handled correctly.

Numeric entities in RSS encoded content are now handled correctly.

- rawdog 1.8

Add format=text feed option to handle broken feeds that make their descriptions unescaped text.

Add __hash__ and unlinked titles to item templates, so that you can use multiple config files to build a summary list of item titles (for use in the Mozilla sidebar, for instance). (Requested by David Dorward.)

Add the --verbose argument (and the "verbose" option to match); this makes rawdog show what it's doing while it's running.

Add an "include" statement in config files that can be used to include another config file.

Add feed options to select proxies (contributed by Neil Padgen). This is straightforward for Python 2.3, but 2.2's urllib2 has a bug which prevents ProxyHandlers from working; I've added a workaround for now.

- rawdog 1.7

Fix code in iso8601.py that caused a warning with Python 2.3.

- rawdog 1.6

Config file lines are now split on arbitrary strings of whitespace, not just single spaces (reported by Joseph Reagle).

Include a link to the rawdog home page in the default template.

Fix the --dir argument: -d worked fine, but the getopt call was missing an "=" (reported by Gregory Margo).

Relative links (href and src attributes) in feed-provided HTML are now made absolute in the output.
(The feed validator will complain about feeds with relative links in, but there are quite a few out there.)

Item templates are now supported, making it easier to customise item appearance (requested by a number of users, including Giles Radford and David Dorward). In particular, note that __feed_hash__ can be used to apply a CSS style to a particular feed.

Simple conditions are supported in templates: __if_x__ .. __endif__ only expands to its contents if x is not empty. These conditions cannot be nested.

PyXML's iso8601 module is now included so that rawdog can parse dates in feeds.

- rawdog 1.5

Remove some debugging code that broke timeouts.

- rawdog 1.4

Fix option-compatibility code (reported by BAM).

Add HTTP basic authentication support (which means modifying feedparser again).

Print a more useful error if the statefile can't be read.

- rawdog 1.3

Reverted the "retry immediately" behaviour from 1.2, since it causes denied or broken feeds to get checked every time rawdog is run.

Updated feedparser to 2.5.3, which now returns the XML encoding used. rawdog uses this information to convert all incoming items into Unicode, so multiple encodings are now handled correctly. Non-ASCII characters are encoded using HTML numeric character references (since this allows me to leave the HTML charset as ISO-8859-1; it's non-trivial to get Apache to serve arbitrary HTML files with the right Content-Type, and using <meta> won't override HTTP headers).

Use standard option syntax (i.e. "--update --write" instead of "update write"). The old syntax will be supported until 2.0.

Error output from reading the config file and from --update now goes to stderr instead of stdout.

Made the socket timeout configurable (which also means the included copy of feedparser isn't modified any more).

Added --config option to read an additional config file; this lets you have multiple output files with different options.
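The __if_x__ .. __endif__ conditions described above can be sketched with two regex passes. expand here is a hypothetical illustration of the non-nested case (nesting arrived later, in rawdog 2.1), not rawdog's actual template engine:

```python
import re

def expand(template, bits):
    # First pass: keep a conditional's contents only when x is set
    # and non-empty in the bits dictionary.
    def cond(m):
        return m.group(2) if bits.get(m.group(1)) else ""
    template = re.sub(r"__if_(\w+?)__(.*?)__endif__", cond, template,
                      flags=re.S)
    # Second pass: substitute ordinary __x__ variables.
    return re.sub(r"__(\w+?)__", lambda m: bits.get(m.group(1), ""),
                  template)

print(expand("__if_title__<h2>__title__</h2>__endif__", {"title": "News"}))
# With no title set, the whole conditional disappears:
print(expand("__if_title__<h2>__title__</h2>__endif__", {}))
```

Because the inner pattern is non-greedy, a second __if_y__ block following the first on the same line is handled independently; nested blocks are not.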
Allow "outputfile -" to write the output to stdout; useful if you want to have cron mail the output to you rather than putting it on a web page. Added --show-template option to show the template currently in use (so you can customise it yourself), and "template" config option to allow the user to specify their own template. Added --dir option for people who want two lots of rawdog state (for two sets of feeds, for instance). Added "maxage" config option for people who want "only items added in the last hour", and made it possible to disable maxarticles by setting it to 0. - rawdog 1.2 Updated feedparser to 2.5.2, which fixes a bug that was making rawdog handle content incorrectly in Echo feeds, handles more content encoding methods, and returns HTTP status codes. (I've applied a small patch to correct handling of some Echo feeds.) Added useful messages for different HTTP status codes and HTTP timeouts. Since rawdog reads a config file, it can't automatically update redirected feeds, but it will now tell you about them. Note that for "fatal" errors (anything except a 2xx response or a redirect), rawdog will now retry the feed next time it's run. Prefer "content" over "content_encoded", and fall back correctly if no useful "content" is found. - rawdog 1.1 rawdog now preserves the ordering of articles in the RSS when a group of articles are added at the same time. Updated rawdog URL in setup.py, since it now has a web page. Updated rssparser to feedparser 2.4, and added very preliminary support for the "content" element it can return (for Echo feeds). - rawdog 1.0 Initial stable release. rawdog-2.21/rawdog.10000644000471500047150000001202212173317060013673 0ustar atsats00000000000000.TH RAWDOG 1 .SH NAME rawdog \- an RSS Aggregator Without Delusions Of Grandeur .SH SYNOPSIS .B rawdog .RI [ options ] .SH DESCRIPTION \fBrawdog\fP is a feed aggregator for Unix-like systems. 
.PP
\fBrawdog\fP uses the Python \fBfeedparser\fP module to retrieve articles from a number of feeds in RSS, Atom and other formats, and writes out a single HTML file, based on a template either provided by the user or generated by \fBrawdog\fP, containing the latest articles it's seen.
.PP
\fBrawdog\fP uses the ETags and Last-Modified headers to avoid fetching a file that hasn't changed, and supports gzip and delta compression to reduce bandwidth when it has. \fBrawdog\fP is configured from a simple text file; the only state kept between invocations that can't be reconstructed from the feeds is the ordering of articles.
.SH OPTIONS
This program follows the usual GNU command line syntax, with long options starting with two dashes (`\-').
.SS General Options
.TP
\fB\-d\fP \fIDIR\fP, \fB\-\-dir\fP \fIDIR\fP
Use \fIDIR\fP instead of the $HOME/.rawdog directory. This option lets you have two or more \fBrawdog\fP setups with different configurations and sets of feeds.
.TP
\fB\-N\fP, \fB\-\-no\-locking\fP
Do not lock the state file.
.IP ""
\fBrawdog\fP usually claims a lock on its state file, to stop more than one instance from running at the same time. Unfortunately, some filesystems don't support file locking; you can use this option to disable locking entirely if you're in that situation.
.TP
\fB\-v\fP, \fB\-\-verbose\fP
Print more detailed information about what \fBrawdog\fP is doing to stderr while it runs.
.TP
\fB\-V\fP \fIFILE\fP, \fB\-\-log\fP \fIFILE\fP
As with \fB\-v\fP, but write the information to \fIFILE\fP.
.TP
\fB\-W\fP, \fB\-\-no\-lock\-wait\fP
Exit silently if the state file is already locked.
.IP ""
If the state file is already locked, \fBrawdog\fP will normally wait until it becomes available, then run. However, if you've got a lot of feeds and a slow network connection, you might prefer \fBrawdog\fP to just give up immediately if the previous instance is still running.
.SS Actions
\fBrawdog\fP will perform these actions in the order given.
.TP
\fB\-a\fP \fIURL\fP, \fB\-\-add\fP \fIURL\fP
Try to find a feed associated with \fIURL\fP and add it to the config file.
.IP ""
\fIURL\fP may be a feed itself, or it can be an HTML page that links to a feed in any of a variety of ways. \fBrawdog\fP uses heuristics to pick the best feed it can find, and will complain if it can't find one.
.TP
\fB\-c\fP \fIFILE\fP, \fB\-\-config\fP \fIFILE\fP
Read \fIFILE\fP as an additional config file; any options provided in \fIFILE\fP will override those set in the main config file (with the exception of "feed", which is cumulative). \fIFILE\fP may be an absolute path or a path relative to your .rawdog directory.
.IP ""
Note that $HOME/.rawdog/config will still be read first even if you specify this option. \fB\-c\fP is mostly useful when you want to write the same set of feeds out using two different sets of output options.
.TP
\fB\-f\fP \fIURL\fP, \fB\-\-update\-feed\fP \fIURL\fP
Update the feed pointed to by \fIURL\fP immediately, even if its period hasn't elapsed since it was last updated. This is useful when you're publishing a feed yourself, and want to test whether it's working properly.
.TP
\fB\-l\fP, \fB\-\-list\fP
List brief information about each of the feeds that was known about at the time of the last update.
.TP
\fB\-r\fP \fIURL\fP, \fB\-\-remove\fP \fIURL\fP
Remove feed \fIURL\fP from the config file.
.TP
\fB\-s\fP \fITEMPLATE\fP, \fB\-\-show\fP \fITEMPLATE\fP
Print one of the templates currently in use to stdout. \fITEMPLATE\fP may be \fBpage\fP, \fBitem\fP, \fBfeedlist\fP or \fBfeeditem\fP. This can be used as a starting point if you want to design your own template for use with the corresponding \fBtemplate\fP option in the config file.
.TP
\fB\-u\fP, \fB\-\-update\fP
Fetch data from the feeds and store it. This could take some time if you've got lots of feeds.
.TP
\fB\-w\fP, \fB\-\-write\fP
Write out the HTML output file.
.SS Special Actions
If one of these options is specified, \fBrawdog\fP will perform only that action, then exit.
.TP
\fB\-\-dump\fP \fIURL\fP
Show what \fBrawdog\fP's feed parser returns for \fIURL\fP. This can be useful when trying to understand why \fBrawdog\fP doesn't display a feed correctly.
.TP
\fB\-\-help\fP
Provide a brief summary of all the options \fBrawdog\fP supports.
.SH EXAMPLES
\fBrawdog\fP is typically invoked from
.BR cron (1).
The following
.BR crontab (5)
entry would fetch data from feeds and write it to HTML once an hour, exiting if \fBrawdog\fP is already running:
.PP
.nf
.RS
0 * * * * rawdog \-Wuw
.RE
.fi
.SH FILES
$HOME/.rawdog/config
.SH SEE ALSO
.BR cron (1).
.SH AUTHOR
\fBrawdog\fP was mostly written by Adam Sampson, with contributions and bug reports from many of \fBrawdog\fP's users. See \fBrawdog\fP's NEWS file for a complete list of contributors.
.PP
This manual page was originally written by Decklin Foster, for the Debian project (but may be used by others).

rawdog-2.21/rawdoglib/__init__.py

__all__ = [
    'feedscanner',
    'persister',
    'rawdog',
]

rawdog-2.21/rawdoglib/feedscanner.py

"""Scan a URL's contents to find related feeds

This is a compatible replacement for Aaron Swartz's feedfinder module,
using feedparser to check whether the URLs it returns are feeds.

It finds links to feeds within the following elements:

- <link rel="alternate"> (standard feed discovery)
- <a>, if the href contains words that suggest it might be a feed

It orders feeds using a quality heuristic: the first result is the most
likely to be a feed for the given URL.

Required: Python 2.4 or later, feedparser
"""

__license__ = """
Copyright (c) 2008 Decklin Foster
Copyright (c) 2013, 2015 Adam Sampson

Permission to use, copy, modify, and/or distribute this software for
any purpose with or without fee is hereby granted, provided that the
above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
"""

import cStringIO
import feedparser
import gzip
import re
import urllib2
import urlparse
import HTMLParser

def is_feed(url):
    """Return true if feedparser can understand the given URL as a
    feed."""

    p = feedparser.parse(url)
    version = p.get("version")
    if version is None:
        version = ""
    return version != ""

def fetch_url(url):
    """Fetch the given URL and return the data from it as a Unicode
    string."""

    request = urllib2.Request(url)
    request.add_header("Accept-Encoding", "gzip")

    f = urllib2.urlopen(request)
    headers = f.info()
    data = f.read()
    f.close()

    # We have to support gzip encoding because some servers will use it
    # even if you explicitly refuse it in Accept-Encoding.
encodings = headers.get("Content-Encoding", "") encodings = [s.strip() for s in encodings.split(",")] if "gzip" in encodings: f = gzip.GzipFile(fileobj=cStringIO.StringIO(data)) data = f.read() f.close() # Silently ignore encoding errors -- we don't need to go to the bother of # detecting the encoding properly (like feedparser does). data = data.decode("UTF-8", "ignore") return data class FeedFinder(HTMLParser.HTMLParser): def __init__(self, base_uri): HTMLParser.HTMLParser.__init__(self) self.found = [] self.count = 0 self.base_uri = base_uri def add(self, score, href): url = urlparse.urljoin(self.base_uri, href) lower = url.lower() # Some sites provide feeds both for entries and comments; # prefer the former. if lower.find("comment") != -1: score -= 50 # Prefer Atom, then RSS, then RDF (RSS 1). if lower.find("atom") != -1: score += 10 elif lower.find("rss2") != -1: score -= 5 elif lower.find("rss") != -1: score -= 10 elif lower.find("rdf") != -1: score -= 15 self.found.append((-score, self.count, url)) self.count += 1 def urls(self): return [link[2] for link in sorted(self.found)] def handle_starttag(self, tag, attrs): attrs = dict(attrs) href = attrs.get('href') if href is None: return if tag == 'link' and attrs.get('rel') == 'alternate' and \ not attrs.get('type') == 'text/html': self.add(200, href) if tag == 'a' and re.search(r'\b(rss|atom|rdf|feeds?)\b', href, re.I): self.add(100, href) def feeds(page_url): """Search the given URL for possible feeds, returning a list of them.""" # If the URL is a feed, there's no need to scan it for links. if is_feed(page_url): return [page_url] data = fetch_url(page_url) parser = FeedFinder(page_url) try: parser.feed(data) except HTMLParser.HTMLParseError: pass found = parser.urls() # Return only feeds that feedparser can understand. 
return [feed for feed in found if is_feed(feed)] rawdog-2.21/rawdoglib/persister.py0000644000471500047150000001205112267755143016710 0ustar atsats00000000000000# persister: persist Python objects safely to pickle files # Copyright 2003, 2004, 2005, 2013, 2014 Adam Sampson # # rawdog is free software; you can redistribute and/or modify it # under the terms of that license as published by the Free Software # Foundation; either version 2 of the License, or (at your option) # any later version. # # rawdog is distributed in the hope that it will be useful, but # WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU # General Public License for more details. # # You should have received a copy of the GNU General Public License # along with rawdog; see the file COPYING. If not, write to the Free # Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA, or see http://www.gnu.org/. import cPickle as pickle import errno import fcntl import os import sys class Persistable: """An object which can be persisted.""" def __init__(self): self._modified = False def modified(self, state=True): """Mark the object as having been modified (or not).""" self._modified = state def is_modified(self): return self._modified class Persisted: """Context manager for a persistent object. The object being persisted must implement the Persistable interface.""" def __init__(self, klass, filename, persister): self.klass = klass self.filename = filename self.persister = persister self.lock_file = None self.object = None self.refcount = 0 def rename(self, new_filename): """Rename the persisted file. This works whether the file is currently open or not.""" self.persister._rename(self.filename, new_filename) for ext in ("", ".lock"): try: os.rename(self.filename + ext, new_filename + ext) except OSError, e: # If the file doesn't exist (yet), # that's OK. 
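The Persistable dirty-flag protocol above is what lets Persisted.close() skip the save when nothing changed. A Python 3 sketch of the pattern; FeedState is a hypothetical subclass for illustration, not part of rawdog:

```python
class Persistable:
    """Objects mark themselves modified so the persister knows
    whether a save is needed on close."""
    def __init__(self):
        self._modified = False

    def modified(self, state=True):
        """Mark the object as having been modified (or not)."""
        self._modified = state

    def is_modified(self):
        return self._modified

class FeedState(Persistable):
    """Hypothetical persistable object: a dict of articles."""
    def __init__(self):
        Persistable.__init__(self)
        self.articles = {}

    def add(self, key, value):
        self.articles[key] = value
        self.modified()  # flag the change so close() will save
```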
if e.errno != errno.ENOENT: raise e self.filename = new_filename def __enter__(self): """As open().""" return self.open() def __exit__(self, type, value, tb): """As close(), unless an exception occurred in which case do nothing.""" if tb is None: self.close() def open(self, no_block=True): """Return the persistent object, loading it from its file if it isn't already open. You must call close() once you're finished with the object. If no_block is True, then this will return None if loading the object would otherwise block (i.e. if it's locked by another process).""" if self.refcount > 0: # Already loaded. self.refcount += 1 return self.object try: self._open(no_block) except KeyboardInterrupt: sys.exit(1) except: print "An error occurred while reading state from " + os.path.abspath(self.filename) + "." print "This usually means the file is corrupt, and removing it will fix the problem." sys.exit(1) self.refcount = 1 return self.object def _get_lock(self, no_block): if not self.persister.use_locking: return True self.lock_file = open(self.filename + ".lock", "w+") try: mode = fcntl.LOCK_EX if no_block: mode |= fcntl.LOCK_NB fcntl.lockf(self.lock_file.fileno(), mode) except IOError, e: if no_block and e.errno in (errno.EACCES, errno.EAGAIN): return False raise e return True def _open(self, no_block): self.persister.log("Loading state file: ", self.filename) if not self._get_lock(no_block): return None try: f = open(self.filename, "rb") except IOError: # File can't be opened. # Create a new object. self.object = self.klass() self.object.modified() return self.object = pickle.load(f) self.object.modified(False) f.close() def close(self): """Reduce the reference count of the persisted object, saving it back to its file if necessary.""" self.refcount -= 1 if self.refcount > 0: # Still in use. 
return if self.object.is_modified(): self.persister.log("Saving state file: ", self.filename) newname = "%s.new-%d" % (self.filename, os.getpid()) newfile = open(newname, "w") pickle.dump(self.object, newfile, pickle.HIGHEST_PROTOCOL) newfile.close() os.rename(newname, self.filename) if self.lock_file is not None: self.lock_file.close() self.persister._remove(self.filename) class Persister: """Manage the collection of persisted files.""" def __init__(self, config): self.files = {} self.log = config.log self.use_locking = config.locking def get(self, klass, filename): """Get a context manager for a persisted file. If the file is already open, this will return the existing context manager.""" if filename in self.files: return self.files[filename] p = Persisted(klass, filename, self) self.files[filename] = p return p def _rename(self, old_filename, new_filename): self.files[new_filename] = self.files[old_filename] del self.files[old_filename] def _remove(self, filename): del self.files[filename] def delete(self, filename): """Delete a persisted file, along with its lock file, if they exist.""" for ext in ("", ".lock"): try: os.unlink(filename + ext) except OSError: pass rawdog-2.21/rawdoglib/rawdog.py0000644000471500047150000016245412552556127016166 0ustar atsats00000000000000# rawdog: RSS aggregator without delusions of grandeur. # Copyright 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2012, 2013, 2014, 2015 Adam Sampson # # rawdog is free software; you can redistribute and/or modify it # under the terms of that license as published by the Free Software # Foundation; either version 2 of the License, or (at your option) # any later version. # # rawdog is distributed in the hope that it will be useful, but # WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU # General Public License for more details. 
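Persisted.close() saves state with the write-new-then-rename idiom, so a crash mid-dump never leaves a half-written state file. A self-contained Python 3 sketch (function names ours; the original opens the file in text mode under Python 2):

```python
import os
import pickle
import tempfile

def atomic_pickle_save(obj, filename):
    """Dump obj to a sibling temp file, then rename it over filename.
    os.rename is atomic on POSIX filesystems, so readers always see
    either the old state or the new state, never a partial write."""
    newname = "%s.new-%d" % (filename, os.getpid())
    with open(newname, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    os.rename(newname, filename)

def pickle_load(filename):
    with open(filename, "rb") as f:
        return pickle.load(f)
```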
# # You should have received a copy of the GNU General Public License # along with rawdog; see the file COPYING. If not, write to the Free # Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA, or see http://www.gnu.org/. VERSION = "2.21" HTTP_AGENT = "rawdog/" + VERSION STATE_VERSION = 2 import rawdoglib.feedscanner from rawdoglib.persister import Persistable, Persister from rawdoglib.plugins import Box, call_hook, load_plugins from cStringIO import StringIO import base64 import calendar import cgi import feedparser import getopt import hashlib import locale import os import re import socket import string import sys import threading import time import types import urllib2 try: import tidylib except: tidylib = None try: import mx.Tidy as mxtidy except: mxtidy = None # Turn off content-cleaning, since we want to see an approximation to the # original content for hashing. rawdog will sanitise HTML when writing. feedparser.RESOLVE_RELATIVE_URIS = 0 feedparser.SANITIZE_HTML = 0 # Disable microformat support, because it tends to return poor-quality data # (e.g. identifying inappropriate things as enclosures), and it relies on # BeautifulSoup which is unable to parse many feeds. feedparser.PARSE_MICROFORMATS = 0 # This is initialised in main(). 
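rawdog keeps all of its generated output pure ASCII by replacing non-ASCII characters with HTML numeric character references (the encode_references helper defined in this module). A Python 3 sketch of the same transform:

```python
import re

high_char_re = re.compile(r"[^\x00-\x7f]")

def encode_references(s: str) -> str:
    """Replace each non-ASCII character with an HTML numeric
    character reference, e.g. 'é' becomes '&#233;'."""
    return high_char_re.sub(lambda m: "&#%d;" % ord(m.group(0)), s)
```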
persister = None system_encoding = None def get_system_encoding(): """Get the system encoding.""" return system_encoding def safe_ftime(format, t): """Format a time value into a string in the current locale (as time.strftime), but encode the result as ASCII HTML.""" try: u = unicode(time.strftime(format, t), get_system_encoding()) except ValueError, e: u = u"(bad time %s; %s)" % (repr(t), str(e)) return encode_references(u) def format_time(secs, config): """Format a time and date nicely.""" try: t = time.localtime(secs) except ValueError, e: return u"(bad time %s; %s)" % (repr(secs), str(e)) format = config["datetimeformat"] if format is None: format = config["timeformat"] + ", " + config["dayformat"] return safe_ftime(format, t) high_char_re = re.compile(r'[^\000-\177]') def encode_references(s): """Encode characters in a Unicode string using HTML references.""" def encode(m): return "&#" + str(ord(m.group(0))) + ";" return high_char_re.sub(encode, s) # This list of block-level elements came from the HTML 4.01 specification. block_level_re = re.compile(r'^\s*<(p|h1|h2|h3|h4|h5|h6|ul|ol|pre|dl|div|noscript|blockquote|form|hr|table|fieldset|address)[^a-z]', re.I) def sanitise_html(html, baseurl, inline, config): """Attempt to turn arbitrary feed-provided HTML into something suitable for safe inclusion into the rawdog output. The inline parameter says whether to expect a fragment of inline text, or a sequence of block-level elements.""" if html is None: return None html = encode_references(html) type = "text/html" # sgmllib handles "
/" as a SHORTTAG; this workaround from # feedparser. html = re.sub(r'(\S)/>', r'\1 />', html) # sgmllib is fragile with broken processing instructions (e.g. # ""); just remove them all. html = re.sub(r']*>', '', html) html = feedparser._resolveRelativeURIs(html, baseurl, "UTF-8", type) p = feedparser._HTMLSanitizer("UTF-8", type) p.feed(html) html = p.output() if not inline and config["blocklevelhtml"]: # If we're after some block-level HTML and the HTML doesn't # start with a block-level element, then insert a

tag # before it. This still fails when the HTML contains text, then # a block-level element, then more text, but it's better than # nothing. if block_level_re.match(html) is None: html = "

" + html if config["tidyhtml"]: args = { "numeric_entities": 1, "output_html": 1, "output_xhtml": 0, "output_xml": 0, "wrap": 0, } call_hook("mxtidy_args", config, args, baseurl, inline) call_hook("tidy_args", config, args, baseurl, inline) if tidylib is not None: # Disable PyTidyLib's somewhat unhelpful defaults. tidylib.BASE_OPTIONS = {} output = tidylib.tidy_document(html, args)[0] elif mxtidy is not None: output = mxtidy.tidy(html, None, None, **args)[2] else: # No Tidy bindings installed -- do nothing. output = "" + html + "" html = output[output.find("") + 6 : output.rfind("")].strip() html = html.decode("UTF-8") box = Box(html) call_hook("clean_html", config, box, baseurl, inline) return box.value def select_detail(details): """Pick the preferred type of detail from a list of details. (If the argument isn't a list, treat it as a list of one.)""" TYPES = { "text/html": 30, "application/xhtml+xml": 20, "text/plain": 10, } if details is None: return None if type(details) is not list: details = [details] ds = [] for detail in details: ctype = detail.get("type", None) if ctype is None: continue if TYPES.has_key(ctype): score = TYPES[ctype] else: score = 0 if detail["value"] != "": ds.append((score, detail)) ds.sort() if len(ds) == 0: return None else: return ds[-1][1] def detail_to_html(details, inline, config, force_preformatted=False): """Convert a detail hash or list of detail hashes as returned by feedparser into HTML.""" detail = select_detail(details) if detail is None: return None if force_preformatted: html = "

" + cgi.escape(detail["value"]) + "
" elif detail["type"] == "text/plain": html = cgi.escape(detail["value"]) else: html = detail["value"] return sanitise_html(html, detail["base"], inline, config) def author_to_html(entry, feedurl, config): """Convert feedparser author information to HTML.""" author_detail = entry.get("author_detail") if author_detail is not None and author_detail.has_key("name"): name = author_detail["name"] else: name = entry.get("author") url = None fallback = "author" if author_detail is not None: if author_detail.has_key("href"): url = author_detail["href"] elif author_detail.has_key("email") and author_detail["email"] is not None: url = "mailto:" + author_detail["email"] if author_detail.has_key("email") and author_detail["email"] is not None: fallback = author_detail["email"] elif author_detail.has_key("href") and author_detail["href"] is not None: fallback = author_detail["href"] if name == "": name = fallback if url is None: html = name else: html = "
" + cgi.escape(name) + "" # We shouldn't need a base URL here anyway. return sanitise_html(html, feedurl, True, config) def string_to_html(s, config): """Convert a string to HTML.""" return sanitise_html(cgi.escape(s), "", True, config) template_re = re.compile(r'(__[^_].*?__)') def fill_template(template, bits): """Expand a template, replacing __x__ with bits["x"], and only including sections bracketed by __if_x__ .. [__else__ ..] __endif__ if bits["x"] is not "". If not bits.has_key("x"), __x__ expands to "".""" result = Box() call_hook("fill_template", template, bits, result) if result.value is not None: return result.value encoding = get_system_encoding() f = StringIO() if_stack = [] def write(s): if not False in if_stack: f.write(s) for part in template_re.split(template): if part.startswith("__") and part.endswith("__"): key = part[2:-2] if key.startswith("if_"): k = key[3:] if_stack.append(bits.has_key(k) and bits[k] != "") elif key == "endif": if if_stack != []: if_stack.pop() elif key == "else": if if_stack != []: if_stack.append(not if_stack.pop()) elif bits.has_key(key): if type(bits[key]) == types.UnicodeType: write(bits[key].encode(encoding)) else: write(bits[key]) else: write(part) v = f.getvalue() f.close() return v file_cache = {} def load_file(name): """Read the contents of a template file, caching the result so we don't have to read the file multiple times. 
The file is assumed to be in the system encoding; the result will be an ASCII string.""" if not file_cache.has_key(name): try: f = open(name) data = f.read() f.close() except IOError: raise ConfigError("Can't read template file: " + name) try: data = data.decode(get_system_encoding()) except UnicodeDecodeError, e: raise ConfigError("Character encoding problem in template file: " + name + ": " + str(e)) data = encode_references(data) file_cache[name] = data.encode(get_system_encoding()) return file_cache[name] def write_ascii(f, s, config): """Write the string s, which should only contain ASCII characters, to file f; if it isn't encodable in ASCII, then print a warning message and write UTF-8.""" try: f.write(s) except UnicodeEncodeError, e: config.bug("Error encoding output as ASCII; UTF-8 has been written instead.\n", e) f.write(s.encode("UTF-8")) def short_hash(s): """Return a human-manipulatable 'short hash' of a string.""" return hashlib.sha1(s).hexdigest()[-8:] def ensure_unicode(value, encoding): """Convert a structure returned by feedparser into an equivalent where all strings are represented as fully-decoded unicode objects.""" if isinstance(value, str): try: return value.decode(encoding) except: # If the encoding's invalid, at least preserve # the byte stream. return value.decode("ISO-8859-1") elif isinstance(value, unicode) and type(value) is not unicode: # This is a subclass of unicode (e.g. BeautifulSoup's # NavigableString, which is unpickleable in some versions of # the library), so force it to be a real unicode object. return unicode(value) elif isinstance(value, dict): d = {} for (k, v) in value.items(): d[k] = ensure_unicode(v, encoding) return d elif isinstance(value, list): return [ensure_unicode(v, encoding) for v in value] else: return value timeout_re = re.compile(r'timed? 
?out', re.I) def is_timeout_exception(exc): """Return True if the given exception object suggests that a timeout occurred, else return False.""" # Since urlopen throws away the original exception object, # we have to look at the stringified form to tell if it was a timeout. # (We're in reasonable company here, since test_ssl.py in the Python # distribution does the same thing!) # # The message we're looking for is something like: # Stock Python 2.7.7 and 2.7.8: # # Debian python 2.7.3-4+deb7u1: # # Debian python 2.7.8-1: # return timeout_re.search(str(exc)) is not None class BasicAuthProcessor(urllib2.BaseHandler): """urllib2 handler that does HTTP basic authentication or proxy authentication with a fixed username and password. (Unlike the classes to do this in urllib2, this doesn't wait for a 401/407 response first.)""" def __init__(self, user, password, proxy=False): self.auth = base64.b64encode(user + ":" + password) if proxy: self.header = "Proxy-Authorization" else: self.header = "Authorization" def http_request(self, req): req.add_header(self.header, "Basic " + self.auth) return req https_request = http_request class DisableIMProcessor(urllib2.BaseHandler): """urllib2 handler that disables RFC 3229 for a request.""" def http_request(self, req): # Request doesn't provide a method for removing headers -- # so overwrite the header instead. req.add_header("A-IM", "identity") return req https_request = http_request class ResponseLogProcessor(urllib2.BaseHandler): """urllib2 handler that maintains a log of HTTP responses.""" # Run after anything that's mangling headers (usually 500 or less), but # before HTTPErrorProcessor (1000). 
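The timeout heuristic in is_timeout_exception boils down to one case-insensitive regex over the stringified exception. A Python 3 sketch (the wrapper name is ours):

```python
import re

# Matches "timeout", "time out", "timed out", etc., case-insensitively.
timeout_re = re.compile(r"timed? ?out", re.I)

def is_timeout_message(text: str) -> bool:
    """Return True if an error message looks like a timeout. urllib2
    discards the original exception, so rawdog has to grep the
    stringified form rather than checking the exception type."""
    return timeout_re.search(text) is not None
```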
handler_order = 900 def __init__(self): self.log = [] def http_response(self, req, response): entry = { "url": req.get_full_url(), "status": response.getcode(), } location = response.info().get("Location") if location is not None: entry["location"] = location self.log.append(entry) return response https_response = http_response def get_log(self): return self.log non_alphanumeric_re = re.compile(r'<[^>]*>|\&[^\;]*\;|[^a-z0-9]') class Feed: """An RSS feed.""" def __init__(self, url): self.url = url self.period = 30 * 60 self.args = {} self.etag = None self.modified = None self.last_update = 0 self.feed_info = {} def needs_update(self, now): """Return True if it's time to update this feed, or False if its update period has not yet elapsed.""" return (now - self.last_update) >= self.period def get_state_filename(self): return "feeds/%s.state" % (short_hash(self.url),) def fetch(self, rawdog, config): """Fetch the current set of articles from the feed.""" handlers = [] logger = ResponseLogProcessor() handlers.append(logger) proxies = {} for name, value in self.args.items(): if name.endswith("_proxy"): proxies[name[:-6]] = value if len(proxies) != 0: handlers.append(urllib2.ProxyHandler(proxies)) if self.args.has_key("proxyuser") and self.args.has_key("proxypassword"): handlers.append(BasicAuthProcessor(self.args["proxyuser"], self.args["proxypassword"], proxy=True)) if self.args.has_key("user") and self.args.has_key("password"): handlers.append(BasicAuthProcessor(self.args["user"], self.args["password"])) if self.get_keepmin(config) == 0 or config["currentonly"]: # If RFC 3229 and "A-IM: feed" is used, then there's # no way to tell when an article has been removed. # So if we only want to keep articles that are still # being published by the feed, we have to turn it off. handlers.append(DisableIMProcessor()) call_hook("add_urllib2_handlers", rawdog, config, self, handlers) url = self.url # Turn plain filenames into file: URLs. 
(feedparser will open # plain filenames itself, but we want it to open the file with # urllib2 so we get a URLError if something goes wrong.) if not ":" in url: url = "file:" + url try: result = feedparser.parse(url, etag=self.etag, modified=self.modified, agent=HTTP_AGENT, handlers=handlers) except Exception, e: result = { "rawdog_exception": e, "rawdog_traceback": sys.exc_info()[2], } result["rawdog_responses"] = logger.get_log() return result def update(self, rawdog, now, config, articles, p): """Add new articles from a feed to the collection. Returns True if any articles were read, False otherwise.""" # Note that feedparser might have thrown an exception -- # so until we print the error message and return, we # can't assume that p contains any particular field. responses = p.get("rawdog_responses") if len(responses) > 0: last_status = responses[-1]["status"] elif len(p.get("feed", [])) != 0: # Some protocol other than HTTP -- assume it's OK, # since we got some content. last_status = 200 else: # Timeout, or empty response from non-HTTP. last_status = 0 version = p.get("version") if version is None: version = "" self.last_update = now errors = [] fatal = False old_url = self.url if "rawdog_exception" in p: errors.append("Error fetching or parsing feed:") errors.append(str(p["rawdog_exception"])) if config["showtracebacks"]: from traceback import format_tb errors.append("".join(format_tb(p["rawdog_traceback"]))) errors.append("") fatal = True if len(responses) != 0 and responses[0]["status"] == 301: # Permanent redirect(s). Find the new location. 
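The 301-chain walk that follows in update() can be sketched as a standalone Python 3 function over the persister's response log (a list of dicts with "status" and optional "location" keys). Unlike the original, which is only reached when the first response is a 301, this sketch also handles the no-redirect case:

```python
def permanent_redirect_target(responses):
    """Walk a leading run of 301 responses and return the last
    Location header seen, or None if there was no permanent
    redirect (or it came without a new location)."""
    i = 0
    while i < len(responses) and responses[i]["status"] == 301:
        i += 1
    if i == 0:
        return None
    return responses[i - 1].get("location")
```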
i = 0 while i < len(responses) and responses[i]["status"] == 301: i += 1 location = responses[i - 1].get("location") if location is None: errors.append("The feed returned a permanent redirect, but without a new location.") else: errors.append("New URL: " + location) errors.append("The feed has moved permanently to a new URL.") if config["changeconfig"]: rawdog.change_feed_url(self.url, location, config) errors.append("The config file has been updated automatically.") else: errors.append("You should update its entry in your config file.") errors.append("") bozo_exception = p.get("bozo_exception") got_urlerror = isinstance(bozo_exception, urllib2.URLError) got_timeout = isinstance(bozo_exception, socket.timeout) if got_urlerror or got_timeout: # urllib2 reported an error when fetching the feed. # Check to see if it was a timeout. if not (got_timeout or is_timeout_exception(bozo_exception)): errors.append("Error while fetching feed:") errors.append(str(bozo_exception)) errors.append("") fatal = True elif config["ignoretimeouts"]: return False else: errors.append("Timeout while reading feed.") errors.append("") fatal = True elif last_status == 304: # The feed hasn't changed. Return False to indicate # that we shouldn't do expiry. return False elif last_status in [403, 410]: # The feed is disallowed or gone. The feed should be # unsubscribed. errors.append("The feed has gone.") errors.append("You should remove it from your config file.") errors.append("") fatal = True elif last_status / 100 != 2: # Some sort of client or server error. The feed may # need unsubscribing. errors.append("The feed returned an error.") errors.append("If this condition persists, you should remove it from your config file.") errors.append("") fatal = True elif version == "" and len(p.get("entries", [])) == 0: # feedparser couldn't detect the type of this feed or # retrieve any entries from it. 
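The status-code ladder above (304 means unchanged, 403/410 mean the feed is gone, any other non-2xx code is an error) can be summarised as a small Python 3 triage function; the name and return labels are ours, for illustration only:

```python
def classify_status(status: int) -> str:
    """Rough triage of an HTTP status, mirroring Feed.update():
    304 -> feed unchanged, skip expiry; 403/410 -> unsubscribe;
    other non-2xx -> report an error; 2xx -> process the feed."""
    if status == 304:
        return "unchanged"
    if status in (403, 410):
        return "gone"
    if status // 100 != 2:   # the Python 2 original uses "/"
        return "error"
    return "ok"
```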
errors.append("The data retrieved from this URL could not be understood as a feed.") errors.append("You should check whether the feed has changed URLs or been removed.") errors.append("") fatal = True old_error = "\n".join(errors) call_hook("feed_fetched", rawdog, config, self, p, old_error, not fatal) if len(errors) != 0: print >>sys.stderr, "Feed: " + old_url if last_status != 0: print >>sys.stderr, "HTTP Status: " + str(last_status) for line in errors: print >>sys.stderr, line if fatal: return False # From here, we can assume that we've got a complete feedparser # response. p = ensure_unicode(p, p.get("encoding") or "UTF-8") # No entries means the feed hasn't changed, but for some reason # we didn't get a 304 response. Handle it the same way. if len(p["entries"]) == 0: return False self.etag = p.get("etag") self.modified = p.get("modified") self.feed_info = p["feed"] feed = self.url article_ids = {} if config["useids"]: # Find IDs for existing articles. for (hash, a) in articles.items(): id = a.entry_info.get("id") if a.feed == feed and id is not None: article_ids[id] = a seen_articles = set() sequence = 0 for entry_info in p["entries"]: article = Article(feed, entry_info, now, sequence) ignore = Box(False) call_hook("article_seen", rawdog, config, article, ignore) if ignore.value: continue seen_articles.add(article.hash) sequence += 1 id = entry_info.get("id") if id in article_ids: existing_article = article_ids[id] elif article.hash in articles: existing_article = articles[article.hash] else: existing_article = None if existing_article is not None: existing_article.update_from(article, now) call_hook("article_updated", rawdog, config, existing_article, now) else: articles[article.hash] = article call_hook("article_added", rawdog, config, article, now) if config["currentonly"]: for (hash, a) in articles.items(): if a.feed == feed and hash not in seen_articles: del articles[hash] return True def get_html_name(self, config): if 
self.feed_info.has_key("title_detail"):
            r = detail_to_html(self.feed_info["title_detail"], True, config)
        elif self.feed_info.has_key("link"):
            r = string_to_html(self.feed_info["link"], config)
        else:
            r = string_to_html(self.url, config)
        if r is None:
            r = ""
        return r

    def get_html_link(self, config):
        s = self.get_html_name(config)
        if self.feed_info.has_key("link"):
            return '<a href="' + self.feed_info["link"] + '">' + s + '</a>'
        else:
            return s

    def get_id(self, config):
        if self.args.has_key("id"):
            return self.args["id"]
        else:
            r = self.get_html_name(config).lower()
            return non_alphanumeric_re.sub('', r)

    def get_keepmin(self, config):
        return self.args.get("keepmin", config["keepmin"])

class Article:
    """An article retrieved from an RSS feed."""

    def __init__(self, feed, entry_info, now, sequence):
        self.feed = feed
        self.entry_info = entry_info
        self.sequence = sequence

        self.date = None
        parsed = entry_info.get("updated_parsed")
        if parsed is None:
            parsed = entry_info.get("published_parsed")
        if parsed is None:
            parsed = entry_info.get("created_parsed")
        if parsed is not None:
            try:
                self.date = calendar.timegm(parsed)
            except OverflowError:
                pass

        self.hash = self.compute_initial_hash()

        self.last_seen = now
        self.added = now

    def compute_initial_hash(self):
        """Compute an initial unique hash for an article. The generated
        hash must be unique amongst all articles in the system (i.e.
it can't just be the article ID, because that would collide if more than one feed included the same article).""" h = hashlib.sha1() def add_hash(s): h.update(s.encode("UTF-8")) add_hash(self.feed) entry_info = self.entry_info if entry_info.has_key("title"): add_hash(entry_info["title"]) if entry_info.has_key("link"): add_hash(entry_info["link"]) if entry_info.has_key("content"): for content in entry_info["content"]: add_hash(content["value"]) if entry_info.has_key("summary_detail"): add_hash(entry_info["summary_detail"]["value"]) return h.hexdigest() def update_from(self, new_article, now): """Update this article's contents from a newer article that's been identified to be the same.""" self.entry_info = new_article.entry_info self.sequence = new_article.sequence self.date = new_article.date self.last_seen = now def can_expire(self, now, config): return (now - self.last_seen) > config["expireage"] def get_sort_date(self, config): if config["sortbyfeeddate"]: return self.date or self.added else: return self.added class DayWriter: """Utility class for writing day sections into a series of articles.""" def __init__(self, file, config): self.lasttime = [] self.file = file self.counter = 0 self.config = config def start_day(self, tm): print >>self.file, '
<div class="day">'
        day = safe_ftime(self.config["dayformat"], tm)
        print >>self.file, '<h2>' + day + '</h2>'
        self.counter += 1

    def start_time(self, tm):
        print >>self.file, '<div class="time">'
        clock = safe_ftime(self.config["timeformat"], tm)
        print >>self.file, '<h3>' + clock + '</h3>'
        self.counter += 1

    def time(self, s):
        try:
            tm = time.localtime(s)
        except ValueError:
            # e.g. "timestamp out of range for platform time_t"
            return
        if tm[:3] != self.lasttime[:3] and self.config["daysections"]:
            self.close(0)
            self.start_day(tm)
        if tm[:6] != self.lasttime[:6] and self.config["timesections"]:
            if self.config["daysections"]:
                self.close(1)
            else:
                self.close(0)
            self.start_time(tm)
        self.lasttime = tm

    def close(self, n=0):
        while self.counter > n:
            print >>self.file, "</div>
" self.counter -= 1 def parse_time(value, default="m"): """Parse a time period with optional units (s, m, h, d, w) into a time in seconds. If no unit is specified, use minutes by default; specify the default argument to change this. Raises ValueError if the format isn't recognised.""" units = { "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, } for unit, size in units.items(): if value.endswith(unit): return int(value[:-len(unit)]) * size return int(value) * units[default] def parse_bool(value): """Parse a boolean value (0, 1, false or true). Raise ValueError if the value isn't recognised.""" value = value.strip().lower() if value == "0" or value == "false": return False elif value == "1" or value == "true": return True else: raise ValueError("Bad boolean value: " + value) def parse_list(value): """Parse a list of keywords separated by whitespace.""" return value.strip().split(None) def parse_feed_args(argparams, arglines): """Parse a list of feed arguments. Raise ConfigError if the syntax is invalid, or ValueError if an argument value can't be parsed.""" args = {} for p in argparams: ps = p.split("=", 1) if len(ps) != 2: raise ConfigError("Bad feed argument in config: " + p) args[ps[0]] = ps[1] for p in arglines: ps = p.split(None, 1) if len(ps) != 2: raise ConfigError("Bad argument line in config: " + p) args[ps[0]] = ps[1] for name, value in args.items(): if name == "allowduplicates": args[name] = parse_bool(value) elif name == "keepmin": args[name] = int(value) elif name == "maxage": args[name] = parse_time(value) return args class ConfigError(Exception): pass class Config: """The aggregator's configuration.""" def __init__(self, locking=True, logfile_name=None): self.locking = locking self.files_loaded = [] self.loglock = threading.Lock() self.logfile = None if logfile_name: self.logfile = open(logfile_name, "a") self.reset() def reset(self): # Note that these default values are *not* the same as # in the supplied config file. 
The idea is that someone # who has an old config file shouldn't notice a difference # in behaviour on upgrade -- so new options generally # default to False here, and True in the sample file. self.config = { "feedslist" : [], "feeddefaults" : {}, "defines" : {}, "outputfile" : "output.html", "maxarticles" : 200, "maxage" : 0, "expireage" : 24 * 60 * 60, "keepmin" : 0, "dayformat" : "%A, %d %B %Y", "timeformat" : "%I:%M %p", "datetimeformat" : None, "userefresh" : False, "showfeeds" : True, "timeout" : 30, "pagetemplate" : "default", "itemtemplate" : "default", "feedlisttemplate" : "default", "feeditemtemplate" : "default", "verbose" : False, "ignoretimeouts" : False, "showtracebacks" : False, "daysections" : True, "timesections" : True, "blocklevelhtml" : True, "tidyhtml" : False, "sortbyfeeddate" : False, "currentonly" : False, "hideduplicates" : [], "newfeedperiod" : "3h", "changeconfig": False, "numthreads": 1, "splitstate": False, "useids": False, } def __getitem__(self, key): return self.config[key] def get(self, key, default=None): return self.config.get(key, default) def __setitem__(self, key, value): self.config[key] = value def reload(self): self.log("Reloading config files") self.reset() for filename in self.files_loaded: self.load(filename, False) def load(self, filename, explicitly_loaded=True): """Load configuration from a config file.""" if explicitly_loaded: self.files_loaded.append(filename) lines = [] try: f = open(filename, "r") for line in f.xreadlines(): try: line = line.decode(get_system_encoding()) except UnicodeDecodeError, e: raise ConfigError("Character encoding problem in config file: " + filename + ": " + str(e)) stripped = line.strip() if stripped == "" or stripped[0] == "#": continue if line[0] in string.whitespace: if lines == []: raise ConfigError("First line in config cannot be an argument") lines[-1][1].append(stripped) else: lines.append((stripped, [])) f.close() except IOError: raise ConfigError("Can't read config file: " + 
filename) for line, arglines in lines: try: self.load_line(line, arglines) except ValueError: raise ConfigError("Bad value in config: " + line) def load_line(self, line, arglines): """Process a configuration directive.""" l = line.split(None, 1) if len(l) == 1 and l[0] == "feeddefaults": l.append("") elif len(l) != 2: raise ConfigError("Bad line in config: " + line) # Load template files immediately, so we produce an error now # rather than later if anything goes wrong. if l[0].endswith("template") and l[1] != "default": load_file(l[1]) handled_arglines = False if l[0] == "feed": l = l[1].split(None) if len(l) < 2: raise ConfigError("Bad line in config: " + line) self["feedslist"].append((l[1], parse_time(l[0]), parse_feed_args(l[2:], arglines))) handled_arglines = True elif l[0] == "feeddefaults": self["feeddefaults"] = parse_feed_args(l[1].split(None), arglines) handled_arglines = True elif l[0] == "define": l = l[1].split(None, 1) if len(l) != 2: raise ConfigError("Bad line in config: " + line) self["defines"][l[0]] = l[1] elif l[0] == "plugindirs": for dir in parse_list(l[1]): load_plugins(dir, self) elif l[0] == "outputfile": self["outputfile"] = l[1] elif l[0] == "maxarticles": self["maxarticles"] = int(l[1]) elif l[0] == "maxage": self["maxage"] = parse_time(l[1]) elif l[0] == "expireage": self["expireage"] = parse_time(l[1]) elif l[0] == "keepmin": self["keepmin"] = int(l[1]) elif l[0] == "dayformat": self["dayformat"] = l[1] elif l[0] == "timeformat": self["timeformat"] = l[1] elif l[0] == "datetimeformat": self["datetimeformat"] = l[1] elif l[0] == "userefresh": self["userefresh"] = parse_bool(l[1]) elif l[0] == "showfeeds": self["showfeeds"] = parse_bool(l[1]) elif l[0] == "timeout": self["timeout"] = parse_time(l[1], "s") elif l[0] in ("template", "pagetemplate"): self["pagetemplate"] = l[1] elif l[0] == "itemtemplate": self["itemtemplate"] = l[1] elif l[0] == "feedlisttemplate": self["feedlisttemplate"] = l[1] elif l[0] == "feeditemtemplate": 
self["feeditemtemplate"] = l[1] elif l[0] == "verbose": self["verbose"] = parse_bool(l[1]) elif l[0] == "ignoretimeouts": self["ignoretimeouts"] = parse_bool(l[1]) elif l[0] == "showtracebacks": self["showtracebacks"] = parse_bool(l[1]) elif l[0] == "daysections": self["daysections"] = parse_bool(l[1]) elif l[0] == "timesections": self["timesections"] = parse_bool(l[1]) elif l[0] == "blocklevelhtml": self["blocklevelhtml"] = parse_bool(l[1]) elif l[0] == "tidyhtml": self["tidyhtml"] = parse_bool(l[1]) elif l[0] == "sortbyfeeddate": self["sortbyfeeddate"] = parse_bool(l[1]) elif l[0] == "currentonly": self["currentonly"] = parse_bool(l[1]) elif l[0] == "hideduplicates": self["hideduplicates"] = parse_list(l[1]) elif l[0] == "newfeedperiod": self["newfeedperiod"] = l[1] elif l[0] == "changeconfig": self["changeconfig"] = parse_bool(l[1]) elif l[0] == "numthreads": self["numthreads"] = int(l[1]) elif l[0] == "splitstate": self["splitstate"] = parse_bool(l[1]) elif l[0] == "useids": self["useids"] = parse_bool(l[1]) elif l[0] == "include": self.load(l[1], False) elif call_hook("config_option_arglines", self, l[0], l[1], arglines): handled_arglines = True elif call_hook("config_option", self, l[0], l[1]): pass else: raise ConfigError("Unknown config command: " + l[0]) if arglines != [] and not handled_arglines: raise ConfigError("Bad argument lines in config after: " + line) def log(self, *args): """Print a status message. If running in verbose mode, write the message to stderr; if using a logfile, write it to the logfile.""" if self["verbose"]: with self.loglock: print >>sys.stderr, "".join(map(str, args)) if self.logfile is not None: with self.loglock: print >>self.logfile, "".join(map(str, args)) self.logfile.flush() def bug(self, *args): """Report detection of a bug in rawdog.""" print >>sys.stderr, "Internal error detected in rawdog:" print >>sys.stderr, "".join(map(str, args)) print >>sys.stderr, "This could be caused by a bug in rawdog itself or in a plugin." 
print >>sys.stderr, "Please send this error message and your config file to the rawdog author." def edit_file(filename, editfunc): """Edit a file in place: for each line in the input file, call editfunc(line, outputfile), then rename the output file over the input file.""" newname = "%s.new-%d" % (filename, os.getpid()) oldfile = open(filename, "r") newfile = open(newname, "w") editfunc(oldfile, newfile) newfile.close() oldfile.close() os.rename(newname, filename) class AddFeedEditor: def __init__(self, feedline): self.feedline = feedline def edit(self, inputfile, outputfile): d = inputfile.read() outputfile.write(d) if not d.endswith("\n"): outputfile.write("\n") outputfile.write(self.feedline) def add_feed(filename, url, rawdog, config): """Try to add a feed to the config file.""" feeds = rawdoglib.feedscanner.feeds(url) if feeds == []: print >>sys.stderr, "Cannot find any feeds in " + url return feed = feeds[0] if feed in rawdog.feeds: print >>sys.stderr, "Feed " + feed + " is already in the config file" return print >>sys.stderr, "Adding feed " + feed feedline = "feed %s %s\n" % (config["newfeedperiod"], feed) edit_file(filename, AddFeedEditor(feedline).edit) class ChangeFeedEditor: def __init__(self, oldurl, newurl): self.oldurl = oldurl self.newurl = newurl def edit(self, inputfile, outputfile): for line in inputfile.xreadlines(): ls = line.strip().split(None) if len(ls) > 2 and ls[0] == "feed" and ls[2] == self.oldurl: line = line.replace(self.oldurl, self.newurl, 1) outputfile.write(line) class RemoveFeedEditor: def __init__(self, url): self.url = url def edit(self, inputfile, outputfile): while True: l = inputfile.readline() if l == "": break ls = l.strip().split(None) if len(ls) > 2 and ls[0] == "feed" and ls[2] == self.url: while True: l = inputfile.readline() if l == "": break elif l[0] == "#": outputfile.write(l) elif l[0] not in string.whitespace: outputfile.write(l) break else: outputfile.write(l) def remove_feed(filename, url, config): """Try to 
remove a feed from the config file.""" if url not in [f[0] for f in config["feedslist"]]: print >>sys.stderr, "Feed " + url + " is not in the config file" else: print >>sys.stderr, "Removing feed " + url edit_file(filename, RemoveFeedEditor(url).edit) class FeedFetcher: """Class that will handle fetching a set of feeds in parallel.""" def __init__(self, rawdog, feedlist, config): self.rawdog = rawdog self.config = config self.lock = threading.Lock() self.jobs = set(feedlist) self.results = {} def worker(self, num): rawdog = self.rawdog config = self.config while True: with self.lock: try: job = self.jobs.pop() except KeyError: # No jobs left. break config.log("[", num, "] Fetching feed: ", job) feed = rawdog.feeds[job] call_hook("pre_update_feed", rawdog, config, feed) result = feed.fetch(rawdog, config) with self.lock: self.results[job] = result def run(self, max_workers): max_workers = max(max_workers, 1) num_workers = min(max_workers, len(self.jobs)) self.config.log("Fetching ", len(self.jobs), " feeds using ", num_workers, " threads") workers = [] for i in range(1, num_workers): t = threading.Thread(target=self.worker, args=(i,)) t.start() workers.append(t) self.worker(0) for worker in workers: worker.join() self.config.log("Fetch complete") return self.results class FeedState(Persistable): """The collection of articles in a feed.""" def __init__(self): Persistable.__init__(self) self.articles = {} class Rawdog(Persistable): """The aggregator itself.""" def __init__(self): Persistable.__init__(self) self.feeds = {} self.articles = {} self.plugin_storage = {} self.state_version = STATE_VERSION self.using_splitstate = None def get_plugin_storage(self, plugin): try: st = self.plugin_storage.setdefault(plugin, {}) except AttributeError: # rawdog before 2.5 didn't have plugin storage. 
st = {} self.plugin_storage = {plugin: st} return st def check_state_version(self): """Check the version of the state file.""" try: version = self.state_version except AttributeError: # rawdog 1.x didn't keep track of this. version = 1 return version == STATE_VERSION def change_feed_url(self, oldurl, newurl, config): """Change the URL of a feed.""" assert self.feeds.has_key(oldurl) if self.feeds.has_key(newurl): print >>sys.stderr, "Error: New feed URL is already subscribed; please remove the old one" print >>sys.stderr, "from the config file by hand." return edit_file("config", ChangeFeedEditor(oldurl, newurl).edit) feed = self.feeds[oldurl] # Changing the URL will change the state filename as well, # so we need to save the old name to load from. old_state = feed.get_state_filename() feed.url = newurl del self.feeds[oldurl] self.feeds[newurl] = feed if config["splitstate"]: feedstate_p = persister.get(FeedState, old_state) feedstate_p.rename(feed.get_state_filename()) with feedstate_p as feedstate: for article in feedstate.articles.values(): article.feed = newurl feedstate.modified() else: for article in self.articles.values(): if article.feed == oldurl: article.feed = newurl print >>sys.stderr, "Feed URL automatically changed." def list(self, config): """List the configured feeds.""" for url, feed in self.feeds.items(): feed_info = feed.feed_info print url print " ID:", feed.get_id(config) print " Hash:", short_hash(url) print " Title:", feed.get_html_name(config) print " Link:", feed_info.get("link") def sync_from_config(self, config): """Update rawdog's internal state to match the configuration.""" # Make sure the splitstate directory exists. if config["splitstate"]: try: os.mkdir("feeds") except OSError: # Most likely it already exists. pass # Convert to or from splitstate if necessary. try: u = self.using_splitstate except AttributeError: # We were last run with a version of rawdog that didn't # have this variable -- so we must have a single state # file. 
u = False if u is None: self.using_splitstate = config["splitstate"] elif u != config["splitstate"]: if config["splitstate"]: config.log("Converting to split state files") for feed_hash, feed in self.feeds.items(): with persister.get(FeedState, feed.get_state_filename()) as feedstate: feedstate.articles = {} for article_hash, article in self.articles.items(): if article.feed == feed_hash: feedstate.articles[article_hash] = article feedstate.modified() self.articles = {} else: config.log("Converting to single state file") self.articles = {} for feed_hash, feed in self.feeds.items(): with persister.get(FeedState, feed.get_state_filename()) as feedstate: for article_hash, article in feedstate.articles.items(): self.articles[article_hash] = article feedstate.articles = {} feedstate.modified() persister.delete(feed.get_state_filename()) self.modified() self.using_splitstate = config["splitstate"] seen_feeds = set() for (url, period, args) in config["feedslist"]: seen_feeds.add(url) if not self.feeds.has_key(url): config.log("Adding new feed: ", url) self.feeds[url] = Feed(url) self.modified() feed = self.feeds[url] if feed.period != period: config.log("Changed feed period: ", url) feed.period = period self.modified() newargs = {} newargs.update(config["feeddefaults"]) newargs.update(args) if feed.args != newargs: config.log("Changed feed options: ", url) feed.args = newargs self.modified() for url in self.feeds.keys(): if url not in seen_feeds: config.log("Removing feed: ", url) if config["splitstate"]: persister.delete(self.feeds[url].get_state_filename()) else: for key, article in self.articles.items(): if article.feed == url: del self.articles[key] del self.feeds[url] self.modified() def update(self, config, feedurl=None): """Perform the update action: check feeds for new articles, and expire old ones.""" config.log("Starting update") now = time.time() socket.setdefaulttimeout(config["timeout"]) if feedurl is None: update_feeds = [url for url in self.feeds.keys() if 
self.feeds[url].needs_update(now)] elif self.feeds.has_key(feedurl): update_feeds = [feedurl] self.feeds[feedurl].etag = None self.feeds[feedurl].modified = None else: print "No such feed: " + feedurl update_feeds = [] numfeeds = len(update_feeds) config.log("Will update ", numfeeds, " feeds") fetcher = FeedFetcher(self, update_feeds, config) fetched = fetcher.run(config["numthreads"]) seen_some_items = set() def do_expiry(articles): """Expire articles from a list. Return True if any articles were expired.""" feedcounts = {} for key, article in articles.items(): url = article.feed feedcounts[url] = feedcounts.get(url, 0) + 1 expiry_list = [] feedcounts = {} for key, article in articles.items(): url = article.feed feedcounts[url] = feedcounts.get(url, 0) + 1 expiry_list.append((article.added, article.sequence, key, article)) expiry_list.sort() count = 0 for date, seq, key, article in expiry_list: url = article.feed if url not in self.feeds: config.log("Expired article for nonexistent feed: ", url) count += 1 del articles[key] continue if (url in seen_some_items and self.feeds.has_key(url) and article.can_expire(now, config) and feedcounts[url] > self.feeds[url].get_keepmin(config)): call_hook("article_expired", self, config, article, now) count += 1 feedcounts[url] -= 1 del articles[key] config.log("Expired ", count, " articles, leaving ", len(articles)) return count > 0 count = 0 for url in update_feeds: count += 1 config.log("Updating feed ", count, " of ", numfeeds, ": ", url) feed = self.feeds[url] if config["splitstate"]: feedstate_p = persister.get(FeedState, feed.get_state_filename()) feedstate = feedstate_p.open() articles = feedstate.articles else: articles = self.articles content = fetched[url] call_hook("mid_update_feed", self, config, feed, content) rc = feed.update(self, now, config, articles, content) url = feed.url call_hook("post_update_feed", self, config, feed, rc) if rc: seen_some_items.add(url) if config["splitstate"]: feedstate.modified() if 
config["splitstate"]: if do_expiry(articles): feedstate.modified() feedstate_p.close() if config["splitstate"]: self.articles = {} else: do_expiry(self.articles) self.modified() config.log("Finished update") def get_template(self, config, name="page"): """Return the contents of a template.""" filename = config.get(name + "template", "default") if filename != "default": return load_file(filename) if name == "page": template = """ """ if config["userefresh"]: template += """__refresh__ """ template += """ rawdog
__items__
""" if config["showfeeds"]: template += """

Feeds

__feeds__
""" template += """ """ return template elif name == "item": return """

__title__ [__feed_title__]

__if_description__
__description__
__endif__
""" elif name == "feedlist": return """ __feeditems__
FeedRSSLast fetchedNext fetched after
""" elif name == "feeditem": return """ __feed_title__ __feed_icon__ __feed_last_update__ __feed_next_update__ """ else: raise KeyError("Unknown template name: " + name) def show_template(self, name, config): """Show the contents of a template, as currently configured.""" try: print self.get_template(config, name), except KeyError: print >>sys.stderr, "Unknown template name: " + name def write_article(self, f, article, config): """Write an article to the given file.""" feed = self.feeds[article.feed] entry_info = article.entry_info link = entry_info.get("link") if link == "": link = None guid = entry_info.get("id") if guid == "": guid = None itembits = self.get_feed_bits(config, feed) for name, value in feed.args.items(): if name.startswith("define_"): itembits[name[7:]] = sanitise_html(value, "", True, config) title = detail_to_html(entry_info.get("title_detail"), True, config) key = None for k in ["content", "summary_detail"]: if entry_info.has_key(k): key = k break if key is None: description = None else: force_preformatted = (feed.args.get("format", "default") == "text") description = detail_to_html(entry_info[key], False, config, force_preformatted) date = article.date if title is None: if link is None: title = "Article" else: title = "Link" itembits["title_no_link"] = title if link is not None: itembits["url"] = string_to_html(link, config) else: itembits["url"] = "" if guid is not None: itembits["guid"] = string_to_html(guid, config) else: itembits["guid"] = "" if link is None: itembits["title"] = title else: itembits["title"] = '' + title + '' itembits["hash"] = short_hash(article.hash) if description is not None: itembits["description"] = description else: itembits["description"] = "" author = author_to_html(entry_info, feed.url, config) if author is not None: itembits["author"] = author else: itembits["author"] = "" itembits["added"] = format_time(article.added, config) if date is not None: itembits["date"] = format_time(date, config) else: 
itembits["date"] = "" call_hook("output_item_bits", self, config, feed, article, itembits) itemtemplate = self.get_template(config, "item") f.write(fill_template(itemtemplate, itembits)) def write_remove_dups(self, articles, config, now): """Filter the list of articles to remove articles that are too old or are duplicates.""" kept_articles = [] seen_links = set() seen_guids = set() dup_count = 0 for article in articles: feed = self.feeds[article.feed] age = now - article.added maxage = feed.args.get("maxage", config["maxage"]) if maxage != 0 and age > maxage: continue entry_info = article.entry_info link = entry_info.get("link") if link == "": link = None guid = entry_info.get("id") if guid == "": guid = None if not feed.args.get("allowduplicates", False): is_dup = False for key in config["hideduplicates"]: if key == "id" and guid is not None: if guid in seen_guids: is_dup = True seen_guids.add(guid) elif key == "link" and link is not None: if link in seen_links: is_dup = True seen_links.add(link) if is_dup: dup_count += 1 continue kept_articles.append(article) return (kept_articles, dup_count) def get_feed_bits(self, config, feed): """Get the bits that are used to describe a feed.""" bits = {} bits["feed_id"] = feed.get_id(config) bits["feed_hash"] = short_hash(feed.url) bits["feed_title"] = feed.get_html_link(config) bits["feed_title_no_link"] = detail_to_html(feed.feed_info.get("title_detail"), True, config) bits["feed_url"] = string_to_html(feed.url, config) bits["feed_icon"] = 'XML' bits["feed_last_update"] = format_time(feed.last_update, config) bits["feed_next_update"] = format_time(feed.last_update + feed.period, config) return bits def write_feeditem(self, f, feed, config): """Write a feed list item.""" bits = self.get_feed_bits(config, feed) f.write(fill_template(self.get_template(config, "feeditem"), bits)) def write_feedlist(self, f, config): """Write the feed list.""" bits = {} feeds = [(feed.get_html_name(config).lower(), feed) for feed in 
self.feeds.values()] feeds.sort() feeditems = StringIO() for key, feed in feeds: self.write_feeditem(feeditems, feed, config) bits["feeditems"] = feeditems.getvalue() feeditems.close() f.write(fill_template(self.get_template(config, "feedlist"), bits)) def get_main_template_bits(self, config): """Get the bits that are used in the default main template, with the exception of items and num_items.""" bits = {"version": VERSION} bits.update(config["defines"]) refresh = min([config["expireage"]] + [feed.period for feed in self.feeds.values()]) bits["refresh"] = '' f = StringIO() self.write_feedlist(f, config) bits["feeds"] = f.getvalue() f.close() bits["num_feeds"] = str(len(self.feeds)) return bits def write_output_file(self, articles, article_dates, config): """Write a regular rawdog HTML output file.""" f = StringIO() dw = DayWriter(f, config) call_hook("output_items_begin", self, config, f) for article in articles: if not call_hook("output_items_heading", self, config, f, article, article_dates[article]): dw.time(article_dates[article]) self.write_article(f, article, config) dw.close() call_hook("output_items_end", self, config, f) bits = self.get_main_template_bits(config) bits["items"] = f.getvalue() f.close() bits["num_items"] = str(len(articles)) call_hook("output_bits", self, config, bits) s = fill_template(self.get_template(config, "page"), bits) outputfile = config["outputfile"] if outputfile == "-": write_ascii(sys.stdout, s, config) else: config.log("Writing output file: ", outputfile) f = open(outputfile + ".new", "w") write_ascii(f, s, config) f.close() os.rename(outputfile + ".new", outputfile) def write(self, config): """Perform the write action: write articles to the output file.""" config.log("Starting write") now = time.time() def list_articles(articles): return [(-a.get_sort_date(config), a.feed, a.sequence, a.hash) for a in articles.values()] if config["splitstate"]: article_list = [] for feed in self.feeds.values(): with persister.get(FeedState, 
feed.get_state_filename()) as feedstate: article_list += list_articles(feedstate.articles) else: article_list = list_articles(self.articles) numarticles = len(article_list) if not call_hook("output_sort_articles", self, config, article_list): article_list.sort() if config["maxarticles"] != 0: article_list = article_list[:config["maxarticles"]] if config["splitstate"]: wanted = {} for (date, feed_url, seq, hash) in article_list: if not feed_url in self.feeds: # This can happen if you've managed to # kill rawdog between it updating a # split state file and the main state # -- so just ignore the article and # it'll expire eventually. continue wanted.setdefault(feed_url, []).append(hash) found = {} for (feed_url, article_hashes) in wanted.items(): feed = self.feeds[feed_url] with persister.get(FeedState, feed.get_state_filename()) as feedstate: for hash in article_hashes: found[hash] = feedstate.articles[hash] else: found = self.articles articles = [] article_dates = {} for (date, feed, seq, hash) in article_list: a = found.get(hash) if a is not None: articles.append(a) article_dates[a] = -date call_hook("output_write", self, config, articles) if not call_hook("output_sorted_filter", self, config, articles): (articles, dup_count) = self.write_remove_dups(articles, config, now) else: dup_count = 0 config.log("Selected ", len(articles), " of ", numarticles, " articles to write; ignored ", dup_count, " duplicates") if not call_hook("output_write_files", self, config, articles, article_dates): self.write_output_file(articles, article_dates, config) config.log("Finished write") def usage(): """Display usage information.""" print """rawdog, version """ + VERSION + """ Usage: rawdog [OPTION]... 
General options (use only once): -d|--dir DIR Use DIR instead of ~/.rawdog -N, --no-locking Do not lock the state file -v, --verbose Print more detailed status information -V|--log FILE Append detailed status information to FILE -W, --no-lock-wait Exit silently if state file is locked Actions (performed in order given): -a|--add URL Try to find a feed associated with URL and add it to the config file -c|--config FILE Read additional config file FILE -f|--update-feed URL Force an update on the single feed URL -l, --list List feeds known at time of last update -r|--remove URL Remove feed URL from the config file -s|--show TEMPLATE Show the contents of a template (TEMPLATE may be: page item feedlist feeditem) -u, --update Fetch data from feeds and store it -w, --write Write out HTML output Special actions (all other options are ignored if one of these is specified): --dump URL Show what rawdog's parser returns for URL --help Display this help and exit Report bugs to .""" def main(argv): """The command-line interface to the aggregator.""" locale.setlocale(locale.LC_ALL, "") # This is quite expensive and not threadsafe, so we do it on # startup and cache the result. 
global system_encoding system_encoding = locale.getpreferredencoding() try: SHORTOPTS = "a:c:d:f:lNr:s:tTuvV:wW" LONGOPTS = [ "add=", "config=", "dir=", "dump=", "help", "list", "log=", "no-lock-wait", "no-locking", "remove=", "show=", "show-itemtemplate", "show-template", "update", "update-feed=", "verbose", "write", ] (optlist, args) = getopt.getopt(argv, SHORTOPTS, LONGOPTS) except getopt.GetoptError, s: print s usage() return 1 if len(args) != 0: usage() return 1 if "HOME" in os.environ: statedir = os.environ["HOME"] + "/.rawdog" else: statedir = None verbose = False logfile_name = None locking = True no_lock_wait = False for o, a in optlist: if o == "--dump": import pprint pprint.pprint(feedparser.parse(a, agent=HTTP_AGENT)) return 0 elif o == "--help": usage() return 0 elif o in ("-d", "--dir"): statedir = a elif o in ("-N", "--no-locking"): locking = False elif o in ("-v", "--verbose"): verbose = True elif o in ("-V", "--log"): logfile_name = a elif o in ("-W", "--no-lock-wait"): no_lock_wait = True if statedir is None: print "$HOME not set and state dir not explicitly specified; please use -d/--dir" return 1 try: os.chdir(statedir) except OSError: print "No " + statedir + " directory" return 1 sys.path.append(".") config = Config(locking, logfile_name) def load_config(fn): try: config.load(fn) except ConfigError, err: print >>sys.stderr, "In " + fn + ":" print >>sys.stderr, err return 1 if verbose: config["verbose"] = True return 0 rc = load_config("config") if rc != 0: return rc global persister persister = Persister(config) rawdog_p = persister.get(Rawdog, "state") rawdog = rawdog_p.open(no_block=no_lock_wait) if rawdog is None: return 0 if not rawdog.check_state_version(): print "The state file " + statedir + "/state was created by an older" print "version of rawdog, and cannot be read by this version." print "Removing the state file will fix it." 
		return 1

	rawdog.sync_from_config(config)

	call_hook("startup", rawdog, config)

	for o, a in optlist:
		if o in ("-a", "--add"):
			add_feed("config", a, rawdog, config)
			config.reload()
			rawdog.sync_from_config(config)
		elif o in ("-c", "--config"):
			rc = load_config(a)
			if rc != 0:
				return rc
			rawdog.sync_from_config(config)
		elif o in ("-f", "--update-feed"):
			rawdog.update(config, a)
		elif o in ("-l", "--list"):
			rawdog.list(config)
		elif o in ("-r", "--remove"):
			remove_feed("config", a, config)
			config.reload()
			rawdog.sync_from_config(config)
		elif o in ("-s", "--show"):
			rawdog.show_template(a, config)
		elif o in ("-t", "--show-template"):
			rawdog.show_template("page", config)
		elif o in ("-T", "--show-itemtemplate"):
			rawdog.show_template("item", config)
		elif o in ("-u", "--update"):
			rawdog.update(config)
		elif o in ("-w", "--write"):
			rawdog.write(config)

	call_hook("shutdown", rawdog, config)

	rawdog_p.close()
	return 0

rawdog-2.21/rawdoglib/plugins.py

# plugins: handle add-on modules for rawdog.
# Copyright 2004, 2005, 2013 Adam Sampson
#
# rawdog is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# rawdog is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with rawdog; see the file COPYING. If not, write to the Free
# Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA, or see http://www.gnu.org/.
# The design of rawdog's plugin API was inspired by Stuart Langridge's
# Vellum weblog system:
# http://www.kryogenix.org/code/vellum/

import imp
import os

class Box:
	"""Utility class that holds a mutable value. Useful for passing
	immutable types by reference."""
	def __init__(self, value=None):
		self.value = value

plugin_count = 0

def load_plugins(dir, config):
	global plugin_count

	try:
		files = os.listdir(dir)
	except OSError:
		# Ignore directories that can't be read.
		return

	for file in files:
		if file == "" or file[0] == ".":
			continue

		desc = None
		for d in imp.get_suffixes():
			if file.endswith(d[0]) and d[2] == imp.PY_SOURCE:
				desc = d
		if desc is None:
			continue

		fn = os.path.join(dir, file)
		config.log("Loading plugin ", fn)
		f = open(fn, "r")
		imp.load_module("plugin%d" % (plugin_count,), f, fn, desc)
		plugin_count += 1
		f.close()

attached = {}

def attach_hook(hookname, func):
	"""Attach a function to a hook. The function should take the
	appropriate arguments for the hook, and should return either True or
	False to indicate whether further functions should be processed."""
	attached.setdefault(hookname, []).append(func)

def call_hook(hookname, *args):
	"""Call all the functions attached to a hook with the given
	arguments, in the order they were added, stopping if a hook function
	returns False. Returns True if any hook function returned False (i.e.
	returns True if any hook function handled the request)."""
	for func in attached.get(hookname, []):
		if not func(*args):
			return True
	return False

rawdog-2.21/setup.py

#!/usr/bin/env python

from distutils.core import setup
import sys

if sys.version_info < (2, 6) or sys.version_info >= (3,):
	print("rawdog requires Python 2.6 or later, and not Python 3.")
	sys.exit(1)

setup(name = "rawdog",
	version = "2.21",
	description = "RSS Aggregator Without Delusions Of Grandeur",
	author = "Adam Sampson",
	author_email = "ats@offog.org",
	url = "http://offog.org/code/rawdog/",
	scripts = ['rawdog'],
	data_files = [('share/man/man1', ['rawdog.1'])],
	packages = ['rawdoglib'],
	classifiers = [
		"Development Status :: 5 - Production/Stable",
		"Environment :: Console",
		"License :: OSI Approved :: GNU General Public License v2 or later (GPLv2+)",
		"Operating System :: POSIX",
		"Programming Language :: Python :: 2",
		"Topic :: Internet :: WWW/HTTP",
	])

rawdog-2.21/PLUGINS

# Writing rawdog plugins

## Introduction

As shipped, rawdog provides a fairly small set of features. To make it do more complex jobs, rawdog can be extended using plugin modules written in Python. This document is intended for developers who want to extend rawdog by writing plugins.

Extensions work by registering hook functions which are called by various parts of rawdog's core as it runs. These functions can modify rawdog's internal state in various interesting ways. An arbitrary number of functions can be attached to each hook; they are called in the order they were attached. Hook functions take various arguments depending on where they're called from, and return a boolean value indicating whether further functions attached to the same hook should be called.
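The dispatch logic in `call_hook` is small enough to exercise on its own. The following standalone sketch reproduces `attach_hook` and `call_hook` from `rawdoglib/plugins.py` to show the stop-on-False behaviour; the `"demo"` hook name and the attached lambdas are invented for illustration:

```python
# Minimal reproduction of rawdog's hook dispatcher (rawdoglib/plugins.py).
attached = {}

def attach_hook(hookname, func):
    """Attach a function to the named hook."""
    attached.setdefault(hookname, []).append(func)

def call_hook(hookname, *args):
    """Call attached functions in order, stopping when one returns False.
    Returns True if some function "handled" the call by returning False."""
    for func in attached.get(hookname, []):
        if not func(*args):
            return True
    return False

calls = []
attach_hook("demo", lambda x: calls.append("first") or True)    # keep going
attach_hook("demo", lambda x: calls.append("second") or False)  # stop here
attach_hook("demo", lambda x: calls.append("third") or True)    # never runs

handled = call_hook("demo", 42)
# handled is True, and the third function was never called
```

Note the inversion: a hook *function* returning False means "handled, stop processing", while `call_hook` itself returning True means "some function handled it".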
The "plugindirs" config option gives a list of directories to search for plugins; all Python modules found in those directories will be loaded by rawdog. In practice, this means that you need to call your file something ending in ".py" to have it recognised as a plugin.

## The plugins module

All plugins should import the `rawdoglib.plugins` module, which provides the functions for registering and calling hooks, along with some utilities for plugins. Many plugins will also want to import the `rawdoglib.rawdog` module, which contains rawdog's core functionality, much of which is reusable.

### rawdoglib.plugins.attach_hook(hook_name, function)

The attach_hook function adds a hook function to the hook of the given name.

### rawdoglib.plugins.Box

The Box class is used to pass immutable types by reference to hook functions; this allows several plugins to modify a value. It contains a single `value` attribute holding the value it wraps.

## Plugin storage

Since some plugins need to keep state between runs, the Rawdog object that most hook functions receive has a `get_plugin_storage` method; called with a plugin identifier for your plugin as its argument, it returns a reference to a dictionary that is persisted in the rawdog state file. The dictionary is empty to start with; you may store any pickleable objects you like in it.

Plugin identifiers should be strings based on your email address, in order to be globally unique -- for example, `org.offog.ats.archive`.

After changing a plugin storage dictionary, you must call `rawdog.modified()` to ensure that rawdog will write out its state file.

## Hooks

Most hook functions are called with `rawdog` and `config` as their first two arguments; these are references to the aggregator's Rawdog and Config objects. If you need a hook that doesn't currently exist, please contact me.
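To illustrate the Box class described above, the sketch below passes a feed title through two hook-style functions that each rewrite it in place. The `Box` definition is copied from `rawdoglib/plugins.py`; the two rewriting functions are hypothetical:

```python
class Box:
    """Utility class that holds a mutable value. Useful for passing
    immutable types by reference (copied from rawdoglib/plugins.py)."""
    def __init__(self, value=None):
        self.value = value

# Two hypothetical hook functions, each rewriting the boxed value in turn.
def uppercase_title(box):
    box.value = box.value.upper()
    return True  # let later hook functions run too

def tag_title(box):
    box.value = "[planet] " + box.value
    return True

title = Box("hello world")
for func in (uppercase_title, tag_title):
    func(title)
# title.value is now "[planet] HELLO WORLD"
```

Because both functions share the same Box, each sees the previous one's change — which is exactly why rawdog boxes immutable values like strings before handing them to hooks.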
The following hooks are supported:

### startup(rawdog, config)

Run when rawdog starts up, after the state file and config file have been loaded, but before rawdog starts processing command-line arguments.

### shutdown(rawdog, config)

Run just before rawdog saves the state file and exits.

### config_option(config, name, value)

* name: the option name
* value: the option value

Called when rawdog encounters a config file option that it doesn't recognise. The rawdoglib.rawdog.parse_* functions will probably be useful when dealing with config options. You can raise ValueError to have rawdog print an appropriate error message. You should return False from this hook if name is an option you recognise.

Note that using config.log in this hook will probably not do what you want, because the verbose flag may not yet have been turned on.

### config_option_arglines(config, name, value, arglines)

* name: the option name
* value: the option value
* arglines: a list of extra indented lines given after the option (which can be used to supply extra arguments for the option)

As config_option, but for options that can handle extra argument lines. If the options you are implementing should not take extra arguments, use the config_option hook instead.

### output_sort_articles(rawdog, config, articles)

* articles: the mutable list of (date, feed_url, sequence_number, article_hash) tuples

Called to sort the list of articles to write. The default action is simply to call the list's sort method; if you sort the list in a different way, you should return False from this hook to prevent rawdog from re-sorting it afterwards. Later versions of rawdog may add more items at the end of the tuple; bear this in mind when manipulating the items.

### output_write(rawdog, config, articles)

* articles: the mutable list of Article objects

Called immediately before output_sorted_filter; this hook exists for backwards compatibility and should not be used in new plugins.
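Putting the config_option hook together with attach_hook, a plugin that accepts a custom option might look like this sketch. The `maxwords` option name is invented; in a real plugin the handler would be registered with `attach_hook("config_option", on_config_option)` rather than called by hand, and `config` would be rawdog's Config object (which, like a dictionary, supports item assignment):

```python
# Hypothetical plugin sketch: recognise a "maxwords" config option.
def on_config_option(config, name, value):
    if name == "maxwords":
        # If int() raises ValueError, rawdog prints a config error message.
        config["maxwords"] = int(value)
        return False  # option recognised: stop further processing
    return True  # not ours: let other plugins (or rawdog's error) handle it

# In a real plugin: attach_hook("config_option", on_config_option)

# Plain dict as a stand-in for rawdog's Config object:
config = {}
on_config_option(config, "maxwords", "200")
# config["maxwords"] is now the integer 200
```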
### output_sorted_filter(rawdog, config, articles)

* articles: the mutable list of Article objects

Called after rawdog sorts the list of articles to write, but before it removes duplicate and excessively old articles. This hook can be used to implement alternative duplicate-filtering methods. If you return False from this hook, then rawdog will not do its usual duplicate-removing filter pass.

### output_write_files(rawdog, config, articles, article_dates)

* articles: the mutable list of Article objects
* article_dates: a dictionary mapping Article objects to the dates that were used to sort them

Called when rawdog is about to write its output to files. This hook can be used to implement alternative output methods. If you return False from this hook, then rawdog will not write any output itself (and the later output_ hooks will thus not be called). I would suggest not returning False here unless you plan to call the rawdog.write_output_file method from your hook implementation; failure to do so will most likely break other plugins.

### output_items_begin(rawdog, config, f)

* f: a writable file object (__items__)

Called before rawdog starts expanding the items template. This set of hooks can be used to implement alternative date (or other section) headings.

### output_items_heading(rawdog, config, f, article, date)

* f: a writable file object (__items__)
* article: the Article object about to be written
* date: the Article's date for sorting purposes

Called before each item is written. If you return False from this hook, then rawdog's normal time-based section headings will not be written.

### output_items_end(rawdog, config, f)

* f: a writable file object (__items__)

Called after all items have been written.

### output_bits(rawdog, config, bits)

* bits: a dictionary of template parameters

Called before expanding the page template. This hook can be used to add extra template parameters.
Note that template parameters should be valid HTML, with entities escaped, even if they're URLs or similar. You can use rawdog's `rawdoglib.rawdog.string_to_html` function to do this for you:

    the_thing = "This can contain arbitrary text & stuff"
    bits["thing"] = string_to_html(the_thing, config)

It's also a good idea for template parameter names to be valid Python identifiers, so that plugins that replace the template system with something smarter can turn them into local variables.

### output_item_bits(rawdog, config, feed, article, bits)

* feed: the Feed containing this article
* article: the Article being templated
* bits: a dictionary of template parameters

Called before expanding the item template for an article. This hook can be used to add extra template parameters. (See the documentation for `output_bits` for some advice on adding template parameters.)

### pre_update_feed(rawdog, config, feed)

* feed: the Feed about to be updated

Called before a feed's content is fetched. This hook can be used to perform extra actions before fetching a feed. Note that if `usethreads` is set to a positive number in the config file, this hook may be called from a worker thread.

### mid_update_feed(rawdog, config, feed, content)

* feed: the Feed being updated
* content: the feedparser output from the feed (may be None)

Called after a feed's content has been fetched, but before rawdog's internal state has been updated. This hook can be used to modify feedparser's output.

### post_update_feed(rawdog, config, feed, seen_articles)

* feed: the Feed that has been updated
* seen_articles: a boolean indicating whether any articles were read from the feed

Called after a feed is updated.

### article_seen(rawdog, config, article, ignore)

* article: the Article that has been received
* ignore: a Boxed boolean indicating whether to ignore the article

Called when an article is received from a feed. This hook can be used to modify or ignore incoming articles.
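The article_seen hook can be illustrated with a runnable sketch. The Box and Article classes here are minimal stand-ins for rawdog's own, and the `entry_info` attribute mimicking a feedparser entry dictionary is an assumption made for the demonstration:

```python
class Box:  # stand-in for rawdoglib.plugins.Box
    def __init__(self, value=None):
        self.value = value

class Article:  # stand-in; entry_info mimics a feedparser entry dict
    def __init__(self, entry_info):
        self.entry_info = entry_info

def drop_sponsored(rawdog, config, article, ignore):
    # Ignore incoming articles whose title contains an unwanted word,
    # by setting the boxed boolean to True.
    title = article.entry_info.get("title", "")
    if "sponsored" in title.lower():
        ignore.value = True

ignore = Box(False)
drop_sponsored(None, None, Article({"title": "Sponsored: buy stuff"}), ignore)
print(ignore.value)  # True
```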
### article_updated(rawdog, config, article, now)

* article: the Article that has been updated
* now: the current time

Called after an article has been updated (when rawdog receives an article from a feed that it already has).

### article_added(rawdog, config, article, now)

* article: the Article that has been added
* now: the current time

Called after a new article has been added.

### article_expired(rawdog, config, article, now)

* article: the Article that will be expired
* now: the current time

Called before an article is expired.

### fill_template(template, bits, result)

* template: the template string to fill
* bits: a dictionary of template arguments
* result: a Boxed Unicode string for the result of template expansion

Called whenever template expansion is performed. If you set the value inside result to something other than None, then rawdog will treat that value as the result of template expansion (rather than performing its normal expansion process); you can thus use this hook either for manipulating template parameters, or for replacing the template system entirely.

### tidy_args(config, args, baseurl, inline)

* args: a dictionary of keyword arguments for Tidy
* baseurl: the URL at which the HTML was originally found
* inline: a boolean indicating whether the output should be inline HTML or a block element

When HTML is being sanitised by rawdog and the "tidyhtml" option is enabled, this hook will be called just before Tidy is run (either via PyTidyLib or via mx.Tidy). It can be used to add or modify Tidy options; for example, to make it produce XHTML output.

### clean_html(config, html, baseurl, inline)

* html: a Boxed Unicode string containing the HTML being cleaned
* baseurl: the URL at which the HTML was originally found
* inline: a boolean indicating whether the output should be inline HTML or a block element

Called whenever HTML is being sanitised by rawdog (after its existing HTML sanitisation processes).
You can use this to implement extra sanitisation passes. You'll need to update the boxed value with the new, cleaned string.

### add_urllib2_handlers(rawdog, config, feed, handlers)

* feed: the Feed to which the request will be made
* handlers: the mutable list of urllib2 *Handler objects that will be passed to feedparser

Called before feedparser is used to fetch feed content. This hook can be used to add additional urllib2 handlers to cope with unusual protocol requirements; use `handlers.append` to add extra handlers.

### feed_fetched(rawdog, config, feed, feed_data, error, non_fatal)

* feed: the Feed that has just been fetched
* feed_data: the data returned from feedparser.parse
* error: the error string if an error occurred, or None if no error occurred
* non_fatal: if error is not None, a boolean indicating whether the error was non-fatal

Called after feedparser has been called to fetch the feed. This hook can be used to manipulate the received feed data or implement custom error handling.

## Obsolete hooks

The following hooks existed in previous versions of rawdog, but are no longer supported:

* output_filter (since rawdog 2.12); use output_sorted_filter instead
* output_sort (since rawdog 2.12); use output_sort_articles instead

## Examples

### backwards.py

This is probably the simplest useful example plugin: it reverses the sort order of the output.

    import rawdoglib.plugins

    def backwards(rawdog, config, articles):
        articles.sort()
        articles.reverse()
        return False

    rawdoglib.plugins.attach_hook("output_sort_articles", backwards)

### option.py

This plugin shows how to handle a config file option.

    import rawdoglib.plugins

    def option(config, name, value):
        if name == "myoption":
            print "Test plugin option:", value
            return False
        else:
            return True

    rawdoglib.plugins.attach_hook("config_option", option)

rawdog-2.21/testserver.py

# testserver: servers for rawdog's test suite.
# Copyright 2013 Adam Sampson
#
# rawdog is free software; you can redistribute and/or modify it
# under the terms of the GNU General Public License as published
# by the Free Software Foundation; either version 2 of the License,
# or (at your option) any later version.
#
# rawdog is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with rawdog; see the file COPYING. If not, write to the Free
# Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA, or see http://www.gnu.org/.

import BaseHTTPServer
import SimpleHTTPServer
import SocketServer
import base64
import cStringIO
import gzip
import hashlib
import os
import re
import sys
import threading
import time

class TimeoutRequestHandler(SocketServer.BaseRequestHandler):
    """Request handler for a server that just does nothing for a few
    seconds, then disconnects. This is used for testing timeout
    handling."""

    def handle(self):
        time.sleep(5)

class TimeoutServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer):
    """Timeout server for rawdog's test suite."""
    pass

class HTTPRequestHandler(SimpleHTTPServer.SimpleHTTPRequestHandler):
    """HTTP request handler for rawdog's test suite."""

    # do_GET/do_HEAD are copied from SimpleHTTPServer because
    # send_head isn't part of the API.
    def do_GET(self):
        f = self.send_head()
        if f:
            self.copyfile(f, self.wfile)
            f.close()

    def do_HEAD(self):
        f = self.send_head()
        if f:
            f.close()

    def send_head(self):
        # Look for lines of the form "/oldpath /newpath" in .rewrites.
        try:
            f = open(os.path.join(self.server.files_dir, ".rewrites"))
            for line in f.readlines():
                (old, new) = line.split(None, 1)
                if self.path == old:
                    self.path = new
            f.close()
        except IOError:
            pass

        m = re.match(r'^/auth-([^/-]+)-([^/]+)(/.*)$', self.path)
        if m:
            # Require basic authentication.
            auth = "Basic " + base64.b64encode(m.group(1) + ":" + m.group(2))
            if self.headers.get("Authorization") != auth:
                self.send_response(401)
                self.end_headers()
                return None
            self.path = m.group(3)

        m = re.match(r'^/digest-([^/-]+)-([^/]+)(/.*)$', self.path)
        if m:
            # Require digest authentication. (Not a good implementation!)
            realm = "rawdog test server"
            nonce = "0123456789abcdef"
            a1 = m.group(1) + ":" + realm + ":" + m.group(2)
            a2 = "GET:" + self.path

            def h(s):
                return hashlib.md5(s).hexdigest()

            response = h(h(a1) + ":" + nonce + ":" + h(a2))
            mr = re.search(r'response="([^"]*)"',
                           self.headers.get("Authorization", ""))
            if mr is None or mr.group(1) != response:
                self.send_response(401)
                self.send_header("WWW-Authenticate",
                                 'Digest realm="%s", nonce="%s"' % (realm, nonce))
                self.end_headers()
                return None
            self.path = m.group(3)

        m = re.match(r'^/(\d\d\d)(/.*)?$', self.path)
        if m:
            # Request for a particular response code.
            code = int(m.group(1))
            self.send_response(code)
            if m.group(2):
                self.send_header("Location", self.server.base_url + m.group(2))
            self.end_headers()
            return None

        encoding = None
        m = re.match(r'^/(gzip)(/.*)$', self.path)
        if m:
            # Request for a content encoding.
            encoding = m.group(1)
            self.path = m.group(2)

        m = re.match(r'^/([^/]+)$', self.path)
        if m:
            # Request for a file.
            filename = os.path.join(self.server.files_dir, m.group(1))
            try:
                f = open(filename, "rb")
            except IOError:
                self.send_response(404)
                self.end_headers()
                return None

            # Use the SHA1 hash as an ETag.
            etag = '"' + hashlib.sha1(f.read()).hexdigest() + '"'
            f.seek(0)

            # Oversimplistic, but matches what feedparser sends.
            if self.headers.get("If-None-Match", "") == etag:
                self.send_response(304)
                self.end_headers()
                return None

            size = os.fstat(f.fileno()).st_size
            mime_type = "text/plain"
            if filename.endswith(".rss") or filename.endswith(".rss2"):
                mime_type = "application/rss+xml"
            elif filename.endswith(".rdf"):
                mime_type = "application/rdf+xml"
            elif filename.endswith(".atom"):
                mime_type = "application/atom+xml"
            elif filename.endswith(".html"):
                mime_type = "text/html"

            self.send_response(200)
            if encoding:
                self.send_header("Content-Encoding", encoding)
            if encoding == "gzip":
                data = f.read()
                f.close()
                f = cStringIO.StringIO()
                g = gzip.GzipFile(fileobj=f, mode="wb")
                g.write(data)
                g.close()
                size = f.tell()
                f.seek(0)
            self.send_header("Content-Length", size)
            self.send_header("Content-Type", mime_type)
            self.send_header("ETag", etag)
            self.end_headers()
            return f

        # A request we can't handle.
        self.send_response(500)
        self.end_headers()
        return None

    def log_message(self, fmt, *args):
        f = open(self.server.files_dir + "/.log", "a")
        f.write(fmt % args + "\n")
        f.close()

class HTTPServer(BaseHTTPServer.HTTPServer):
    """HTTP server for rawdog's test suite."""

    def __init__(self, base_url, files_dir, *args, **kwargs):
        self.base_url = base_url
        self.files_dir = files_dir
        BaseHTTPServer.HTTPServer.__init__(self, *args, **kwargs)

def main(args):
    if len(args) < 4:
        print "Usage: testserver.py HOSTNAME TIMEOUT-PORT HTTP-PORT FILES-DIR"
        sys.exit(1)
    hostname = args[0]
    timeout_port = int(args[1])
    http_port = int(args[2])
    files_dir = args[3]

    timeoutd = TimeoutServer((hostname, timeout_port), TimeoutRequestHandler)
    t = threading.Thread(target=timeoutd.serve_forever)
    t.daemon = True
    t.start()

    base_url = "http://" + hostname + ":" + str(http_port)
    httpd = HTTPServer(base_url, files_dir, (hostname, http_port),
                       HTTPRequestHandler)
    httpd.serve_forever()

if __name__ == "__main__":
    main(sys.argv[1:])

rawdog-2.21/COPYING

GNU GENERAL PUBLIC LICENSE
Version 2, June 1991

Copyright (C) 1989, 1991 Free Software Foundation, Inc.
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble

The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.

Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and modification follow.

GNU GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you".

Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does.

1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.

You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:

a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.

b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.

c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole.
If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.

In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.

3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:

a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,

c) Accompany it with the information you received as to the offer to distribute corresponding source code.
(This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)

The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.

If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.

4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.

5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License.
Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it.

6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License.

7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices.
Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice.

This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License.

8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.

9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.

10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this.
Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this when it starts in an interactive mode:

Gnomovision version 69, Copyright (C) year name of author
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names:

Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (which makes passes at compilers) written by James Hacker.

<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice

This General Public License does not permit incorporating your program into proprietary programs.
If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License.