Universal Encoding Detector

http://chardet.feedparser.org/
http://chardet.feedparser.org/docs/how-it-works.html
Reference: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Basic usage

The easiest way to use the Universal Encoding Detector library is with the detect function.


Example: Using the detect function

The detect function takes one argument, a
non-Unicode string. It returns a dictionary containing the
auto-detected character encoding and a confidence level from 0 to 1.

>>> import urllib
>>> rawdata = urllib.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
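
The result dictionary is typically used to decode the raw bytes. The following is a minimal sketch, not part of the library: decode_with_guess, min_confidence, and fallback are hypothetical names introduced here, and the 0.5 cutoff is an arbitrary assumption. It takes the dictionary returned by chardet.detect and switches to a default encoding when no encoding was detected (the encoding value is None), when the confidence is low, or when Python has no codec under the reported name.

```python
def decode_with_guess(raw, result, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a detection result such as chardet.detect(raw).

    `result` is a dict with 'encoding' and 'confidence' keys. The other
    parameters (min_confidence, fallback) are illustrative assumptions.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # Use the fallback if nothing was detected or the guess is too weak.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The reported name is not a codec Python knows; use the fallback.
        return raw.decode(fallback, errors="replace")


# A hand-written result stands in for chardet.detect(raw) here.
raw = u"日本語".encode("euc-jp")
text = decode_with_guess(raw, {"encoding": "EUC-JP", "confidence": 0.99})
```

With the library installed, the call would look like decode_with_guess(rawdata, chardet.detect(rawdata)). Passing errors="replace" keeps a wrong guess from raising an exception, at the cost of replacement characters in the output.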

Advanced usage

If you’re dealing with a large amount of text, you can call the Universal Encoding Detector library incrementally, and it will stop as soon as it is confident enough to report its results.

Create a UniversalDetector object, then call its feed method repeatedly with each block of text. If the detector reaches a minimum threshold of confidence, it will set detector.done to True.

Once you’ve exhausted the source text, call detector.close(), which will do some final calculations in case the detector didn’t hit its minimum confidence threshold earlier. Then detector.result will be a dictionary containing the auto-detected character encoding and confidence level, in the same format that the chardet.detect function returns.

Example: Detecting encoding incrementally

import urllib
from chardet.universaldetector import UniversalDetector

usock = urllib.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    # Feed one block at a time; stop once the detector is confident.
    detector.feed(line)
    if detector.done: break
# Finalize the guess in case the threshold was never reached.
detector.close()
usock.close()
print detector.result
{'encoding': 'EUC-JP', 'confidence': 0.99}
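
Feeding line by line suits text with regular line breaks, but data without newlines would arrive as one very large "line". One possible variation, sketched here under assumptions, reads fixed-size blocks instead. detect_stream and block_size are hypothetical names, and 4096 is an arbitrary choice. The function relies only on the feed, done, close, and result members described above, so it should accept a UniversalDetector instance together with any binary file-like object.

```python
def detect_stream(stream, detector, block_size=4096):
    """Feed `stream` to `detector` in fixed-size blocks and return its result.

    `stream` is any binary file-like object; `detector` is expected to
    offer feed(), done, close() and result, as UniversalDetector does.
    """
    while not detector.done:
        block = stream.read(block_size)
        if not block:  # end of stream
            break
        detector.feed(block)
    # close() finalizes the guess if the threshold was never reached.
    detector.close()
    return detector.result
```

With the library installed, a call might look like detect_stream(open(filename, 'rb'), UniversalDetector()). The caller remains responsible for closing the stream.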

If you want to detect the encoding of multiple texts (such as separate files), you can re-use a single UniversalDetector object. Just call detector.reset() at the start of each file, call detector.feed as many times as you like, and then call detector.close() and check the detector.result dictionary for the file’s results.

Example: Detecting encodings of multiple files

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print filename.ljust(60),
    # Clear the state left over from the previous file.
    detector.reset()
    for line in file(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print detector.result

Admin: Bryan Wu