" workaround for non-block-level HTML; this may be useful if you have a plugin that is doing different HTML sanitisation, or if your template already forces a block-level element around article descriptions. Fix -l for feeds with non-ASCII characters in their titles. Provide human-readable __feed_id__ in items (patch from David Durschlag), and add feed-whatevername class to the default item template; this should make it somewhat easier to add per-feed styles. Handle feeds that are local files correctly, and handle file: URLs in feedparser (reported by Chris Niekel). Allow feed arguments to be given on indented lines after the "feed" or "feeddefaults" lines; this makes it possible to have spaces in feed arguments. Add a meta element to the default template to stop search engines indexing rawdog pages (patch from Rick van Rein). Add new feeds at the end of the config file rather than before the first feed line (patch from Decklin Foster). - rawdog 2.1 Fix a character encoding problem with format=text feeds. Add proxyuser and proxypassword options for feeds, so that you can use per-feed proxies requiring HTTP Basic authentication (patch from Jon Nelson). Add a manual page (written by Decklin Foster). Remove extraneous #! line from feedparser.py (reported by Decklin Foster). Update an article's modified date when a new version of it is seen (reported by Decklin Foster). Support nested ifs in templates (patch from David Durschlag), and add __else__. Make the README file list all the options that rawdog now supports (reported by David Durschlag). Make --verbose work even if it's specified after an action (reported by Dan Noe and David Durschlag). - rawdog 2.0 Update to feedparser 3.3. This meant reworking some of rawdog's internals; state files from old versions will no longer work with rawdog 2.0 (and external programs that manipulate rawdog state files will also be broken). The new feedparser provides a much nicer API, and is significantly more robust; several feeds that previously caused feedparser internal errors or Python segfaults now work fine. Add an --upgrade option to import state from rawdog 1.x state files into rawdog 2.x. To upgrade from 1.x to 2.x, you'll need to perform the following steps after installing the new rawdog: - cp -R ~/.rawdog ~/.rawdog-old - rm ~/.rawdog/state - rawdog -u - rawdog --upgrade ~/.rawdog-old ~/.rawdog (to copy the state) - rawdog -w - rm -r ~/.rawdog-old (once you're happy with the new version) Keep track of a version number in the state file, and complain if you use a state file from an incompatible version. Remove support for the old option syntax ("rawdog update write"). Remove workarounds for early 1.x state file versions. Save the state file in the binary pickle format, and use cPickle instead of pickle so it can be read and written more rapidly. Add hideduplicates and allowduplicates options to attempt to hide duplicate articles (based on patch from Grant Edwards). Fix a bug when sorting feeds with no titles (found by Joseph Reagle). Write the updated state file more safely, to reduce the chance that it'll be damaged or truncated if something goes wrong while it's being written (requested by Tim Bishop). Include feedfinder, and add a -a|--add option to add a feed to the config file. Correctly handle dates with timezones specified in non-UTC locales (reported by Paul Tomblin and Jon Lasser). When a feed's URL changes, as indicated by a permanent HTTP redirect, automatically update the config file and state. 
- rawdog 1.13

Handle OverflowError with parsed dates (patch from Matthew Scott).

- rawdog 1.12

Add "sortbyfeeddate" option for planet pages (requested by David Dorward).

Add "currentonly" option (patch from Chris Cutler).

Handle nested CDATA blocks in feed XML and HTML correctly in feedparser.

- rawdog 1.11

Add __num_items__ and __num_feeds__ to the page template, and __url__ to the item template (patch from Chris Cutler).

Add "daysections" and "timesections" options to control whether to split items up by day and time (based on patch from Chris Cutler).

Add "tidyhtml" option to use mx.Tidy to clean feed-provided HTML.

Remove the <p> wrapping __description__ from the default item template, and make rawdog add <p>...</p> around the description only if it doesn't start with a block-level element (which isn't perfect, but covers the majority of problem cases). If you have a custom item template and want rawdog to generate a better approximation to valid HTML, you should change "<p>__description__</p>" to "__description__".
" to "__description__". HTML metacharacters in links are now encoded correctly in generated HTML ("foo?a=b&c=d" as "foo?a=b&c=d"). Content type selection is now performed for all elements returned from the feed, since some Blogger v5 feeds cause feedparser to return multiple versions of the title and link (reported by Eric Cronin). - rawdog 1.10 Add "ignoretimeouts" option to silently ignore timeout errors. Fix SSL and socket timeouts on Python 2.3 (reported by Tim Bishop). Fix entity encoding problem with HTML sanitisation that was causing rawdog to throw an exception upon writing with feeds containing non-US-ASCII characters in attribute values (reported by David Dorward, Dmitry Mark and Steve Pomeroy). Include MANIFEST.in in the distribution (reported by Chris Cutler). - rawdog 1.9 Add "clear: both;" to item, time and date styles, so that items with floated images in don't extend into the items below them. Changed how rawdog selects the feeds to update; --verbose now shows only the feeds being updated. rawdog now uses feedparser 2.7.6, which adds date parsing and limited sanitisation of feed-provided HTML; I've removed rawdog's own date-parsing (including iso8601.py) and relative-link-fixing code in favour of the more-capable feedparser equivalents. The persister module in rawdoglib is now licensed under the LGPL (requested by Giles Radford). Made the error messages that listed the state dir reflect the -b setting (patch from Antonin Kral). Treat empty titles, links or descriptions as if they weren't supplied at all, to cope with broken feeds that specify "tag # before it. This still fails when the HTML contains text, then # a block-level element, then more text, but it's better than # nothing. if block_level_re.match(html) is None: html = "
" + html if config["tidyhtml"]: args = { "numeric_entities": 1, "input_encoding": "ascii", "output_encoding": "ascii", "output_html": 1, "output_xhtml": 0, "output_xml": 0, "wrap": 0, } call_hook("mxtidy_args", config, args, baseurl, inline) call_hook("tidy_args", config, args, baseurl, inline) if tidylib is not None: # Disable PyTidyLib's somewhat unhelpful defaults. tidylib.BASE_OPTIONS = {} output = tidylib.tidy_document(html, args)[0] elif mxtidy is not None: output = mxtidy.tidy(html, None, None, **args)[2] else: # No Tidy bindings installed -- do nothing. output = "
" + html + "" html = output[output.find("") + 6 : output.rfind("")].strip() html = html.decode("UTF-8") box = Box(html) call_hook("clean_html", config, box, baseurl, inline) return box.value def select_detail(details): """Pick the preferred type of detail from a list of details. (If the argument isn't a list, treat it as a list of one.)""" TYPES = { "text/html": 30, "application/xhtml+xml": 20, "text/plain": 10, } if details is None: return None if type(details) is not list: details = [details] ds = [] for detail in details: ctype = detail.get("type", None) if ctype is None: continue if TYPES.has_key(ctype): score = TYPES[ctype] else: score = 0 if detail["value"] != "": ds.append((score, detail)) ds.sort() if len(ds) == 0: return None else: return ds[-1][1] def detail_to_html(details, inline, config, force_preformatted=False): """Convert a detail hash or list of detail hashes as returned by feedparser into HTML.""" detail = select_detail(details) if detail is None: return None if force_preformatted: html = "" + cgi.escape(detail["value"]) + "" elif detail["type"] == "text/plain": html = cgi.escape(detail["value"]) else: html = detail["value"] return sanitise_html(html, detail["base"], inline, config) def author_to_html(entry, feedurl, config): """Convert feedparser author information to HTML.""" author_detail = entry.get("author_detail") if author_detail is not None and author_detail.has_key("name"): name = author_detail["name"] else: name = entry.get("author") url = None fallback = "author" if author_detail is not None: if author_detail.has_key("href"): url = author_detail["href"] elif author_detail.has_key("email") and author_detail["email"] is not None: url = "mailto:" + author_detail["email"] if author_detail.has_key("email") and author_detail["email"] is not None: fallback = author_detail["email"] elif author_detail.has_key("href") and author_detail["href"] is not None: fallback = author_detail["href"] if name == "": name = fallback if url is None: html = name else: html = "" + cgi.escape(name) + "" # We shouldn't need a base URL here anyway. return sanitise_html(html, feedurl, True, config) def string_to_html(s, config): """Convert a string to HTML.""" return sanitise_html(cgi.escape(s), "", True, config) template_re = re.compile(r'(__[^_].*?__)') def fill_template(template, bits): """Expand a template, replacing __x__ with bits["x"], and only including sections bracketed by __if_x__ .. [__else__ ..] __endif__ if bits["x"] is not "". If not bits.has_key("x"), __x__ expands to "".""" result = Box() call_hook("fill_template", template, bits, result) if result.value is not None: return result.value encoding = get_system_encoding() f = StringIO() if_stack = [] def write(s): if not False in if_stack: f.write(s) for part in template_re.split(template): if part.startswith("__") and part.endswith("__"): key = part[2:-2] if key.startswith("if_"): k = key[3:] if_stack.append(bits.has_key(k) and bits[k] != "") elif key == "endif": if if_stack != []: if_stack.pop() elif key == "else": if if_stack != []: if_stack.append(not if_stack.pop()) elif bits.has_key(key): if type(bits[key]) == types.UnicodeType: write(bits[key].encode(encoding)) else: write(bits[key]) else: write(part) v = f.getvalue() f.close() return v file_cache = {} def load_file(name): """Read the contents of a template file, caching the result so we don't have to read the file multiple times. 
file_cache = {}
def load_file(name):
    """Read the contents of a template file, caching the result so we don't
    have to read the file multiple times. The file is assumed to be in the
    system encoding; the result will be an ASCII string."""
    if not file_cache.has_key(name):
        try:
            f = open(name)
            data = f.read()
            f.close()
        except IOError:
            raise ConfigError("Can't read template file: " + name)

        try:
            data = data.decode(get_system_encoding())
        except UnicodeDecodeError, e:
            raise ConfigError("Character encoding problem in template file: " + name + ": " + str(e))

        data = encode_references(data)
        file_cache[name] = data.encode(get_system_encoding())
    return file_cache[name]

def write_ascii(f, s, config):
    """Write the string s, which should only contain ASCII characters, to
    file f; if it isn't encodable in ASCII, then print a warning message
    and write UTF-8."""
    try:
        f.write(s)
    except UnicodeEncodeError, e:
        config.bug("Error encoding output as ASCII; UTF-8 has been written instead.\n", e)
        f.write(s.encode("UTF-8"))

def short_hash(s):
    """Return a human-manipulatable 'short hash' of a string."""
    return hashlib.sha1(s).hexdigest()[-8:]

def ensure_unicode(value, encoding):
    """Convert a structure returned by feedparser into an equivalent where
    all strings are represented as fully-decoded unicode objects."""

    if isinstance(value, str):
        try:
            return value.decode(encoding)
        except:
            # If the encoding's invalid, at least preserve
            # the byte stream.
            return value.decode("ISO-8859-1")
    elif isinstance(value, unicode) and type(value) is not unicode:
        # This is a subclass of unicode (e.g. BeautifulSoup's
        # NavigableString, which is unpickleable in some versions of
        # the library), so force it to be a real unicode object.
        return unicode(value)
    elif isinstance(value, dict):
        d = {}
        for (k, v) in value.items():
            d[k] = ensure_unicode(v, encoding)
        return d
    elif isinstance(value, list):
        return [ensure_unicode(v, encoding) for v in value]
    else:
        return value

timeout_re = re.compile(r'timed? ?out', re.I)
def is_timeout_exception(exc):
    """Return True if the given exception object suggests that a timeout
    occurred, else return False."""

    # Since urlopen throws away the original exception object,
    # we have to look at the stringified form to tell if it was a timeout.
    # (We're in reasonable company here, since test_ssl.py in the Python
    # distribution does the same thing!)
    #
    # The message we're looking for is something like:
    #
    # Stock Python 2.7.7 and 2.7.8:
    #
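A quick illustration of the helpers above (again a sketch; the import path and the sample values are assumptions, not taken from the distribution):

# Sketch of ensure_unicode() and short_hash() in use; the sample entry
# and the rawdoglib.rawdog import path are assumptions for illustration.
from rawdoglib.rawdog import ensure_unicode, short_hash

entry = {"title": "Caf\xc3\xa9 news", "tags": ["a", "b"]}
print repr(ensure_unicode(entry, "UTF-8")["title"])
# prints: u'Caf\xe9 news'

# short_hash() gives a stable eight-character identifier for a string,
# suitable for use in filenames.
print short_hash("http://feed.example.com/feed.rss")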
__title__ [__feed_title__]
__if_description__Feed | RSS | Last fetched | Next fetched after |
---|
This is manifestly not a feed.
EOF } make_html_head () { cat >"$1" <This is manifestly not a feed.
EOF } make_html_body () { cat >"$1" <This is manifestly not a feed.
EOF cat >>"$1" cat >>"$1" <" # if they don't already start with a block-level element. blocklevelhtml true # Whether to attempt to turn feed-provided HTML into valid HTML. # The most common problem that this solves is a non-closed element in an # article causing formatting problems for the rest of the page. # For this option to have any effect, you need to have PyTidyLib or mx.Tidy # installed. tidyhtml true # Whether the articles displayed should be sorted first by the date # provided in the feed (useful for "planet" pages, where you're # displaying several feeds and want new articles to appear in the right # chronological place). If this is false, then articles will first be # sorted by the time that rawdog first saw them. sortbyfeeddate false # Whether to consider articles' unique IDs or GUIDs when updating rawdog's # database. If you turn this off, then rawdog will create a new article in its # database when it sees an updated version of an existing article in a feed. # You probably want this turned on. useids true # The fields to use when detecting duplicate articles: "id" is the article's # unique ID or GUID; "link" is the article's link. rawdog will find the first # one of these that's present in the article, and ignore the article if it's # seen an article before (in any feed) that had the same value. For example, # specifying "hideduplicates id link" will first look for id/guid, then for # link. # Note that some feeds use the same link for all their articles; if you specify # "link" here, you will probably want to specify the "allowduplicates" feed # argument (see below) for those feeds. hideduplicates id # The period to use for new feeds added to the config file via the -a|--add # option. newfeedperiod 3h # Whether rawdog should automatically update this config file (and its # internal state) if feed URLs change (for instance, if a feed URL # results in a permanent HTTP redirect). If this is false, then rawdog # will ask you to make the necessary change by hand. changeconfig true # The feeds you want to watch, in the format "feed period url [args]". # The period is the minimum time between updates; if less than period # minutes have passed, "rawdog update" will skip that feed. Specifying # a period less than 30 minutes is considered to be bad manners; it is # suggested that you make the period as long as possible. # Arguments are optional, and can be given in two ways: either on the end of # the "feed" line in the form "key=value", separated by spaces, or as extra # indented lines after the feed line. # possible arguments are: # id Value for the __feed_id__ value in the item # template for items in this feed (defaults to the # feed title with non-alphanumeric characters and # HTML markup removed) # user User for HTTP basic authentication # password Password for HTTP basic authentication # format "text" to indicate that the descriptions in this feed # are unescaped plain text (rather than the usual HTML), # and should be escaped and wrapped in a
element # X_proxy Proxy URL for protocol X (for instance, "http_proxy") # proxyuser User for proxy basic authentication # proxypassword Password for proxy basic authentication # allowduplicates "true" to disable duplicate detection for this feed # maxage Override the global "maxage" value for this feed # keepmin Override the global "keepmin" value for this feed # define_X Equivalent to "define X ..." for item templates # when displaying items from this feed # You can provide a default set of arguments for all feeds using # "feeddefaults". You can specify as many feeds as you like. # (These examples have been commented out; remove the leading "#" on each line # to use them.) #feeddefaults # http_proxy http://proxy.example.com:3128/ #feed 1h http://example.com/feed.rss #feed 30m http://example.com/feed2.rss id=newsfront #feed 3h http://example.com/feed3.rss keepmin=5 #feed 3h http://example.com/secret.rss user=bob password=secret #feed 3h http://example.com/broken.rss # format text # define_myclass broken #feed 3h http://proxyfeed.example.com/proxied.rss http_proxy=http://localhost:1234/ #feed 3h http://dupsfeed.example.com/duplicated.rss allowduplicates=true rawdog-2.22/testserver.py 0000644 0004715 0004715 00000016474 12745633226 015140 0 ustar ats ats 0000000 0000000 # testserver: servers for rawdog's test suite. # Copyright 2013, 2016 Adam Sampson# # rawdog is free software; you can redistribute and/or modify it # under the terms of that license as published by the Free Software # Foundation; either version 2 of the License, or (at your option) # any later version. # # rawdog is distributed in the hope that it will be useful, but # WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU # General Public License for more details. # # You should have received a copy of the GNU General Public License # along with rawdog; see the file COPYING. If not, write to the Free # Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, # MA 02110-1301, USA, or see http://www.gnu.org/. import BaseHTTPServer import SimpleHTTPServer import SocketServer import base64 import cStringIO import gzip import hashlib import os import re import sys import threading import time class TimeoutRequestHandler(SocketServer.BaseRequestHandler): """Request handler for a server that just does nothing for a few seconds, then disconnects. This is used for testing timeout handling.""" def handle(self): time.sleep(5) class TimeoutServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer): """Timeout server for rawdog's test suite.""" pass class HTTPRequestHandler(SimpleHTTPServer.SimpleHTTPRequestHandler): """HTTP request handler for rawdog's test suite.""" # do_GET/do_HEAD are copied from SimpleHTTPServer because send_head isn't # part of the API. def do_GET(self): f = self.send_head() if f: self.copyfile(f, self.wfile) f.close() def do_HEAD(self): f = self.send_head() if f: f.close() def send_head(self): # Look for lines of the form "/oldpath /newpath" in .rewrites. try: f = open(os.path.join(self.server.files_dir, ".rewrites")) for line in f.readlines(): (old, new) = line.split(None, 1) if self.path == old: self.path = new f.close() except IOError: pass m = re.match(r'^/auth-([^/-]+)-([^/]+)(/.*)$', self.path) if m: # Require basic authentication. 
auth = "Basic " + base64.b64encode(m.group(1) + ":" + m.group(2)) if self.headers.get("Authorization") != auth: self.send_response(401) self.end_headers() return None self.path = m.group(3) m = re.match(r'^/digest-([^/-]+)-([^/]+)(/.*)$', self.path) if m: # Require digest authentication. (Not a good implementation!) realm = "rawdog test server" nonce = "0123456789abcdef" a1 = m.group(1) + ":" + realm + ":" + m.group(2) a2 = "GET:" + self.path def h(s): return hashlib.md5(s).hexdigest() response = h(h(a1) + ":" + nonce + ":" + h(a2)) mr = re.search(r'response="([^"]*)"', self.headers.get("Authorization", "")) if mr is None or mr.group(1) != response: self.send_response(401) self.send_header("WWW-Authenticate", 'Digest realm="%s", nonce="%s"' % (realm, nonce)) self.end_headers() return None self.path = m.group(3) m = re.match(r'^/(\d\d\d)(/.*)?$', self.path) if m: # Request for a particular response code. code = int(m.group(1)) dest = m.group(2) self.send_response(code) if dest: if dest.startswith("/="): # Provide an exact value for Location (to simulate an # invalid response). dest = dest[2:] else: dest = self.server.base_url + dest self.send_header("Location", dest) self.end_headers() return None encoding = None m = re.match(r'^/(gzip)(/.*)$', self.path) if m: # Request for a content encoding. encoding = m.group(1) self.path = m.group(2) m = re.match(r'^/([^/]+)$', self.path) if m: # Request for a file. filename = os.path.join(self.server.files_dir, m.group(1)) try: f = open(filename, "rb") except IOError: self.send_response(404) self.end_headers() return None # Use the SHA1 hash as an ETag. etag = '"' + hashlib.sha1(f.read()).hexdigest() + '"' f.seek(0) # Oversimplistic, but matches what feedparser sends. if self.headers.get("If-None-Match", "") == etag: self.send_response(304) self.end_headers() return None size = os.fstat(f.fileno()).st_size mime_type = "text/plain" if filename.endswith(".rss") or filename.endswith(".rss2"): mime_type = "application/rss+xml" elif filename.endswith(".rdf"): mime_type = "application/rdf+xml" elif filename.endswith(".atom"): mime_type = "application/atom+xml" elif filename.endswith(".html"): mime_type = "text/html" self.send_response(200) if encoding: self.send_header("Content-Encoding", encoding) if encoding == "gzip": data = f.read() f.close() f = cStringIO.StringIO() g = gzip.GzipFile(fileobj=f, mode="wb") g.write(data) g.close() size = f.tell() f.seek(0) self.send_header("Content-Length", size) self.send_header("Content-Type", mime_type) self.send_header("ETag", etag) self.end_headers() return f # A request we can't handle. 
        self.send_response(500)
        self.end_headers()
        return None

    def log_message(self, fmt, *args):
        f = open(self.server.files_dir + "/.log", "a")
        f.write(fmt % args + "\n")
        f.close()

class HTTPServer(BaseHTTPServer.HTTPServer):
    """HTTP server for rawdog's test suite."""

    def __init__(self, base_url, files_dir, *args, **kwargs):
        self.base_url = base_url
        self.files_dir = files_dir
        BaseHTTPServer.HTTPServer.__init__(self, *args, **kwargs)

def main(args):
    # All four arguments are required.
    if len(args) < 4:
        print "Usage: testserver.py HOSTNAME TIMEOUT-PORT HTTP-PORT FILES-DIR"
        sys.exit(1)
    hostname = args[0]
    timeout_port = int(args[1])
    http_port = int(args[2])
    files_dir = args[3]

    timeoutd = TimeoutServer((hostname, timeout_port), TimeoutRequestHandler)
    t = threading.Thread(target=timeoutd.serve_forever)
    t.daemon = True
    t.start()

    base_url = "http://" + hostname + ":" + str(http_port)
    httpd = HTTPServer(base_url, files_dir, (hostname, http_port), HTTPRequestHandler)
    httpd.serve_forever()

if __name__ == "__main__":
    main(sys.argv[1:])

rawdog-2.22/MANIFEST.in

include COPYING
include MANIFEST.in
include NEWS
include PLUGINS
include README
include config
include rawdog
include rawdog.1
include style.css
include test-rawdog
include testserver.py
recursive-include rawdoglib *.py

rawdog-2.22/README

rawdog: RSS Aggregator Without Delusions Of Grandeur
Adam Sampson

rawdog is a feed aggregator, capable of producing a personal "river of news" or a public "planet" page. It supports all common feed formats, including all versions of RSS and Atom. By default, it is run from cron, collects articles from a number of feeds, and generates a static HTML page listing the newest articles in date order. It supports per-feed customizable update times, and uses ETags, Last-Modified, gzip compression, and RFC3229+feed to minimize network bandwidth usage. Its behaviour is highly customisable using plugins written in Python.

rawdog has the following dependencies:

- Python 2.6 or later (but not Python 3)
- feedparser 5.1.2 or later
- PyTidyLib 0.2.1 or later (optional but strongly recommended)

To install rawdog on your system, use distutils -- "python setup.py install". This will install the "rawdog" command and the "rawdoglib" Python module that it uses internally. (If you want to install to a non-standard prefix, read the help provided by "python setup.py install --help".)

rawdog needs a config file to function. Make the directory ".rawdog" in your $HOME directory, copy the provided file "config" into that directory, and edit it to suit your preferences. Comments in that file describe what each of the options does.

You should copy the provided file "style.css" into the same directory that you've told rawdog to write its HTML output to. rawdog should be usable from a browser that doesn't support CSS, but it won't be very pretty.

When you invoke rawdog from the command line, you give it a series of actions to perform -- for instance, "rawdog --update --write" tells it to do the "--update" action (downloading articles from feeds), then the "--write" action (writing the latest articles it knows about to the HTML file). For details of all rawdog's actions and command-line options, see the rawdog(1) man page -- "man rawdog" after installation.

You will want to run "rawdog -uw" periodically to fetch data and write the output file.
The easiest way to do this is to add a crontab entry that looks something like this:

0,10,20,30,40,50 * * * *    /path/to/rawdog -uw

(If you don't know how to use cron, then "man crontab" is probably a good start.) This will run rawdog every ten minutes.

If you want rawdog to fetch URLs through a proxy server, then set your "http_proxy" environment variable appropriately; depending on your version of cron, putting something like:

http_proxy=http://myproxy.mycompany.com:3128/

at the top of your crontab should be appropriate. (The http_proxy variable will work for many other programs too.)

In the event that rawdog gets horribly confused (for instance, if your system clock has a huge jump and it thinks it won't need to fetch anything for the next thirty years), you can forcibly clear its state by removing the ~/.rawdog/state file (and the ~/.rawdog/feeds/*.state files, if you've got the "splitstate" option turned on).

If you don't like the appearance of rawdog, then customise the style.css file. If you come up with one that looks much better than the existing one, please send it to me!

This should, hopefully, be all you need to know. If rawdog breaks in interesting ways, please tell me at the email address at the top of this file.