Python HTML Parser Performance

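Any work that involves processing web pages needs a capable HTML/XML parser: only after parsing the target file into a structured form can we extract specific information, delete specific information, traverse the document, and so on. I have used BeautifulSoup and ElementTree; in my experience they offer a lot of functionality, but their performance is not great. The article below analyzes and compares the performance of several commonly used parsers, including HTMLParser and others, for reference.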

In preparation for my PyCon talk on HTML I thought I’d do a comparison of several parsers and document models.

The situation is a little complex because there are different steps in handling HTML:

  1. Parse the HTML
  2. Parse it into something (a document object)
  3. Serialize it

Some libraries handle 1, some handle 2, some handle 1, 2, 3, etc. For instance, ElementSoup uses ElementTree as a document, but BeautifulSoup as the parser. BeautifulSoup itself has a document object included. HTMLParser only parses, while html5lib includes tree builders for several kinds of trees. There is also both XML and HTML serialization.
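To make the three steps concrete, here is a minimal sketch using only Python 3's standard library, where the HTMLParser module is named html.parser. The names TreeBuildingParser, parse_to_tree, and VOID_TAGS are hypothetical and invented for this illustration. This is not the code used in the benchmarks, and it only copes with well-formed input.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag (a partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuildingParser(HTMLParser):
    """Step 1 (tokenizing) feeds step 2 (building an ElementTree document)."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def parse_to_tree(source):
    """Run steps 1 and 2 and return the root element."""
    parser = TreeBuildingParser()
    parser.feed(source)
    parser.close()
    return parser.builder.close()


root = parse_to_tree('<p class="x">Hello <b>world</b></p>')
# Step 3: serialize the document object back to markup.
markup = ET.tostring(root, encoding="unicode", method="html")
```

Swapping any one piece (a different parser feeding the same tree, or a different serializer for the same tree) is what produces the combinations benchmarked below.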

So I’ve taken several combinations and made benchmarks. The combinations are:

  • lxml: a parser, document, and serializer. Also can use BeautifulSoup and html5lib for parsing.
  • BeautifulSoup: a parser, document, and serializer.
  • html5lib: a parser. It has a serializer, but I didn’t use it. It has a built-in document object (simpletree), but I don’t think it’s meant for much more than self-testing.
  • ElementTree: a document object and XML serializer (I think newer versions might include an HTML serializer, but I didn’t use it). It doesn’t have a parser, but I used html5lib to parse to it. (I didn’t use the ElementSoup.)
  • cElementTree: a document object implemented as a C extension. I didn’t find any serializer.
  • HTMLParser: a parser. It didn’t parse to anything. It also doesn’t parse lots of normal (but maybe invalid) HTML. When using it, I just ran documents through the parser, not constructing any tree.
  • htmlfill: this library uses HTMLParser, but at least pays a little attention to the elements as they are parsed.
  • Genshi: includes a parser, document, and serializer.
  • xml.dom.minidom: a document model built into the standard library, which html5lib can parse to. (I do not recommend using minidom for anything — some reasons will become apparent in this post, but there are many other reasons not covered why you shouldn’t use it.)

I expected lxml to perform well, as it is based on the C library libxml2. But it performed better than I realized, far better than any other library. As a result, if it weren’t for some persistent installation problems (especially on Macs), I would recommend lxml for just about any HTML task.

You can try the code out here. I’ve included all the sample data, and the commands I ran for these graphs are here. These tests use a fairly random selection of HTML files (355 total) taken from python.org.

Parsing

lxml:0.6; BeautifulSoup:10.6; html5lib ElementTree:30.2; html5lib minidom:35.2; Genshi:7.3; HTMLParser:2.9; htmlfill:4.5

The first test parses the documents. Things to note: lxml is 6x faster than even HTMLParser, even though HTMLParser isn’t doing anything (lxml is building a tree in memory). I didn’t include all the things html5lib can parse to, because they all take about the same amount of time. xml.dom.minidom is only included because it is so noticeably slow. Genshi is fairly fast, but it’s the most fragile of the parsers. html5lib, lxml, and BeautifulSoup are all fairly similarly robust. html5lib has the benefit of (at least in theory) parsing HTML the correct way.
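A timing comparison like this one can be approximated with a small harness. The sketch below is not the benchmark script behind the numbers above. It assumes Python 3, and time_parser and htmlparser_only are hypothetical names. It takes the best of a few repeats to reduce noise from the rest of the system.

```python
import time
from html.parser import HTMLParser


def time_parser(parse, documents, repeat=3):
    """Return the best wall-clock time in seconds to run parse() over all documents."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for doc in documents:
            parse(doc)
        best = min(best, time.perf_counter() - start)
    return best


def htmlparser_only(source):
    # Mirrors the HTMLParser setup described above: tokenize and discard.
    # The base class handlers do nothing, so no tree is built.
    parser = HTMLParser()
    parser.feed(source)
    parser.close()


documents = ["<html><body><p>Hello <b>world</b></p></body></html>"] * 100
elapsed = time_parser(htmlparser_only, documents)
```

Any other parse function with the same one-argument shape can be passed in to compare libraries on the same documents.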

While I don’t really believe it matters often, lxml releases the GIL during parsing.

Serialization

lxml:0.3; BeautifulSoup:2.0; html5lib ElementTree:1.9; html5lib minidom:3.8; Genshi:4.4

Serialization is pretty fast across all the libraries, though again lxml leads the pack by a long distance. ElementTree and minidom are only doing XML serialization, but there’s no reason that the HTML equivalent would be any faster. That Genshi is slower than minidom is surprising. That anything is worse than minidom is generally surprising.
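For readers unsure what separates XML from HTML serialization, here is a small illustration using the ElementTree that ships with Python 3, whose tostring function accepts a method argument. The benchmarks above predate this and used only the XML serializer; the variable names are just for the example.

```python
import xml.etree.ElementTree as ET

# Build <p>line one<br>line two</p> as a document object.
root = ET.Element("p")
root.text = "line one"
br = ET.SubElement(root, "br")
br.tail = "line two"

# XML output self-closes the empty element; HTML output leaves <br> unclosed.
as_xml = ET.tostring(root, encoding="unicode", method="xml")
as_html = ET.tostring(root, encoding="unicode", method="html")
```

Here as_xml is `<p>line one<br />line two</p>` and as_html is `<p>line one<br>line two</p>`. The two serializers walk the same tree and differ mainly in how they close and escape elements.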

Memory

lxml:26; BeautifulSoup:82; BeautifulSoup lxml:104; html5lib cElementTree:54; html5lib ElementTree:64; html5lib simpletree:98; html5lib minidom:192; Genshi:64; htmlfill:5.5; HTMLParser:4.4

The last test is of memory. I don’t have a lot of confidence in the way I made this test, but I’m sure it means something. This was done by parsing all the documents and holding the documents in memory, and using the RSS size reported by ps to see how much the process had grown. All the libraries should be imported when calculating the baseline, so only the documents and parsing should cause the memory increase.
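The measurement described here can be sketched roughly as follows. This is an assumption about the shape of the test, not the original script: rss_kb and memory_growth_kb are hypothetical names, and the ps flags shown (`-o rss= -p PID`, which report resident set size in kilobytes on Linux and macOS) may need adjusting on other systems.

```python
import os
import subprocess


def rss_kb(pid=None):
    """Ask ps for the resident set size of a process, in kilobytes."""
    pid = os.getpid() if pid is None else pid
    out = subprocess.check_output(["ps", "-o", "rss=", "-p", str(pid)])
    return int(out.decode().strip())


def memory_growth_kb(load_documents, rss=rss_kb):
    """Return (growth in KB, loaded documents) for one loading function.

    Import every library before calling this so the baseline already
    includes them; the returned documents must stay referenced so that
    nothing is freed before the second reading.
    """
    before = rss()
    held = load_documents()
    after = rss()
    return after - before, held
```

Typical use would be `growth, docs = memory_growth_kb(lambda: [parse(s) for s in sources])`, where parse and sources stand for whichever parser and file contents are being measured.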

HTMLParser is a baseline, as it just keeps the documents in memory as a string, and creates some intermediate strings. The intermediate strings don’t end up accounting for anything, since the memory used is almost exactly the combined size of all the files.

A tricky part of this measurement is that the Python allocator doesn’t let go of memory that it requests, so if a parser creates lots of intermediate strings and then releases them, the process will still hang onto all that memory. To detect this I tried allocating new strings until the process size grew (trying to detect allocated but unused memory), but this didn’t reveal much: only the BeautifulSoup parser, serialized to an lxml tree, showed much extra memory.
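The probing trick in this paragraph might look something like the sketch below. It is a guess at the approach, not the original code: unused_slack_bytes is a hypothetical name, and rss is assumed to be any callable that returns the current process size (for example, a wrapper around ps).

```python
def unused_slack_bytes(rss, chunk=64 * 1024, limit=1024):
    """Estimate memory the process holds but is not using.

    Allocates chunk-sized strings until the reported process size grows,
    and returns how many bytes were absorbed before that happened.
    """
    baseline = rss()
    hoard = []  # keep every string alive so the allocations accumulate
    absorbed = 0
    for _ in range(limit):
        hoard.append("x" * chunk)
        if rss() > baseline:
            return absorbed
        absorbed += chunk
    return absorbed
```

A large result suggests the parser left behind freed memory that was never returned to the operating system. The estimate is coarse, since process size usually grows in page-sized or larger steps.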

This is one of the only places where html5lib with cElementTree was noticeably different than html5lib with ElementTree. Not that surprising, I guess, since I didn’t find a coded-in-C serializer, and I imagine the tree building is only going to be a lot faster for cElementTree if you are building the tree from C code (as its native XML parser would do).

lxml is probably memory efficient because it uses native libxml2 data structures, and only creates Python objects on demand.

In Conclusion

I knew lxml was fast before I started these benchmarks, but I didn’t expect it to be quite this fast.

So in conclusion: lxml kicks ass. You can use it in ways you couldn’t use other systems. You can parse, serialize, parse, serialize, and repeat the process a couple times with your HTML before the performance will hurt you. With high-level constructs, many operations can happen in very fast C code without calling out to Python. As an example, if you do an XPath query, the query string is compiled into something native and traverses the native libxml2 objects, only creating Python objects to wrap the query results. In addition, things like the modest memory use make me more confident that lxml will act reliably even under unexpected load.
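As an illustration of the XPath point, the sketch below uses lxml.html.fromstring and the xpath method on the resulting element. lxml is a third-party package, so the import is guarded, and link_targets is a hypothetical helper name.

```python
try:
    import lxml.html  # third-party; install separately
except ImportError:
    lxml = None  # the function below needs lxml to be installed


def link_targets(source):
    """Return the href of every link, selected by a single XPath query."""
    doc = lxml.html.fromstring(source)
    # The traversal happens inside libxml2; only the matching attribute
    # values come back as Python strings.
    return doc.xpath("//a/@href")
```

For `<div><a href="/a">A</a><p><a href="/b">B</a></p></div>` this returns the two href values in document order, without any Python-level loop over the tree.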

I also am more confident about using a document model instead of stream parsing. It is sometimes felt that streamed parsing is better: you don’t keep the entire document in memory, and your work generally scales linearly with your document size. HTMLParser is a stream-based parser, emitting events for each kind of token (open tag, close tag, data, etc). Genshi also uses this model, with higher-level stuff like filters to make it feel a bit more natural. But the stream model is not the natural way to process a document; it’s actually a really awkward way to handle a document that is better seen as a single thing. If you are processing gigabyte files of XML it can make sense (and both the normally document-oriented lxml and ElementTree offer options when this happens). This doesn’t make any sense for HTML. And these tests make me believe that even really big HTML documents can be handled quite well by lxml, so a huge outlying document won’t break a system that is appropriately optimized for handling normal sized documents.
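For the gigabyte-XML case, one of the options ElementTree offers is iterparse, which yields elements as they are completed so they can be discarded along the way. A minimal sketch, assuming Python 3's standard library; count_records and the sample tag names are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def count_records(stream, tag):
    """Count elements named tag without keeping their contents in memory."""
    count = 0
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the finished element's text, attributes, and children.
            elem.clear()
    return count


sample = io.BytesIO(b"<log><entry>a</entry><entry>b</entry><other/></log>")
total = count_records(sample, "entry")
```

The loop body still works with ordinary element objects, so this keeps most of the convenience of the document model while bounding memory use.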

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
