I've been hacking on mitmproxy recently, gearing up for a new release in the next week. One of the features I needed was a way to pretty-print XML-ish markup (HTML, SOAP, etc.) to make it easier to quickly scan through traffic not formatted for human eyes. I needed this function to cope robustly with incomplete or malformed data, which ruled out proper XML parsers like ElementTree. I also needed it to be fast on large-ish files, which ruled out BeautifulSoup. On the upside, I didn't need it to be perfect - as long as it didn't lose or corrupt data, getting the indentation mostly right would be good enough.
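
To make the ElementTree problem concrete, here is a minimal sketch of what a strict parser does with a truncated response body. The sample string is made up for illustration, not taken from real traffic:

```python
import xml.etree.ElementTree as ET

# Hypothetical truncated body: the closing tags never arrived.
truncated = "<html><body><p>Hello"

try:
    ET.fromstring(truncated)
    parsed = True
except ET.ParseError as exc:
    # A strict parser rejects the whole document rather than showing
    # whatever it managed to read.
    parsed = False
    print("ElementTree gave up:", exc)
```

For a traffic viewer that is the wrong trade-off: a half-formatted view of the body is still useful, an exception is not.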

Today I sat down and hacked up my own solution. It turned out to be about 40 lines of code, somewhat gnarled and ugly after being fine-tuned against a few dozen real-world data samples:

import re
import textwrap

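TAG = r"""
        <\s*
        (?!\s*[!"])                 # skip comments and doctypes
        (?P<close>\s*\/)?           # leading slash: closing tag
        (?P<name>\w+)
        (                           # attributes, matched lazily so that a
            [^'"\t >]+ |            # trailing slash is left for selfcont
            "[^\"]*"['\"]* |
            '[^']*'['\"]* |
            \s+
        )*?
        (?P<selfcont>\s*\/\s*)?     # trailing slash: self-contained tag
        \s*>
      """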
UNI = set(["br", "hr", "img", "input", "area", "link"])
INDENT = " "*4
def pretty_xmlish(s):
    """
        A robust pretty-printer for XML-ish data. 
        Returns a list of lines.
    """
    data, offset, indent, prev = [], 0, 0, None
    for i in re.finditer(TAG, s, re.VERBOSE|re.MULTILINE):
        start, end = i.span()
        name = i.group("name")
        # Emit any text between the previous tag and this one.
        if start > offset:
            txt = []
            for x in textwrap.dedent(s[offset:start]).split("\n"):
                if x.strip():
                    txt.append(indent*INDENT + x)
            data.extend(txt)
        # A closing tag dedents, unless it closes a void element that was
        # the previous tag (e.g. <br></br>), which never indented.
        if i.group("close") and not (name in UNI and name==prev):
            indent = max(indent - 1, 0)
        data.append(indent*INDENT + i.group().strip())
        offset = end
        # Only a plain opening tag indents what follows it.
        if not any([i.group("close"), i.group("selfcont"), name in UNI]):
            indent += 1
        prev = name
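    trail = s[offset:]
    if trail.strip():
        # Split trailing text too, so the result really is a list of lines.
        data.extend(x for x in trail.split("\n") if x.strip())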
    return data
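
The "don't lose or corrupt data" requirement is easy to check mechanically: squash all whitespace out of the input and the output and compare. What follows is a rough sketch of that idea, not part of the function above. The names `preserves_content` and `naive_pretty` are made up for illustration, and `naive_pretty` is just a stand-in formatter so the example runs on its own:

```python
import re

def preserves_content(original, lines):
    """True if the output differs from the input only in whitespace."""
    squash = lambda text: re.sub(r"\s+", "", text)
    return squash(original) == squash("\n".join(lines))

def naive_pretty(s):
    # Hypothetical stand-in formatter: one tag or text run per line.
    return [part.strip() for part in re.split(r"(<[^>]*>)", s) if part.strip()]

sample = "<div><p>unclosed <b>markup</div>trailing text"

# Reformatting may move whitespace around, but nothing else.
assert preserves_content(sample, naive_pretty(sample))
# Dropping a line must be detected as data loss.
assert not preserves_content(sample, naive_pretty(sample)[:-1])
```

Running a check like this over a pile of captured responses is a cheap way to catch a regex tweak that silently eats data.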

Little snippets of code like this are too trivial to spin out into an independent library, but I'd like to put them up somewhere public where other folks can use them. Right now I don't know of a good place to do this. There's snipplr, but they went with the wrong kind of "social" when they made a social snippet repository, ending up with a social-news-like site focused on upvotes and popularity. That seems like a total mismatch for the problem space. What I really want is some combination of asymmetric follow, change tracking, tags, powerful search and good curation tools - more like delicious.com (may it rest in peace) than Reddit. GitHub's gists are structurally much closer to this, but aren't quite there on curation and search. I also suspect that making every gist a full-fledged Git repo is overkill for snippet tracking, much as I love Git (and GitHub) for larger projects.

I'll no doubt be fine-tuning this function in the days to come - if you're interested, you'll have to keep an eye on this file here, which is less than ideal. If anyone knows of a better snippet sharer, though, let me know...