An Improved Way to Extract Web Page Links with SGMLParser in Python

Most of the SGMLParser-based code I have seen online for extracting links from a web page looks like this:
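#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 only: sgmllib and urllib2 do not exist in Python 3.
from sgmllib import SGMLParser
import urllib
import urllib2
import socket
socket.setdefaulttimeout(210)  # global socket timeout, in seconds

class URLLister(SGMLParser):

    def reset(self):
        self.url = []
        SGMLParser.reset(self)

    # Called for every <a> start tag; collect its href attribute
    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.url.extend(href)

parser = URLLister()

myurl = 'http://www.gjjblog.com'
request = urllib2.Request(myurl)  # build the page request
opener = urllib2.build_opener()
page = opener.open(request)

if page.code == 200:
    predata = page.read()
    parser.feed(predata)
    print parser.url  # print the list of extracted URLs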
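For readers on Python 3, where sgmllib is not available, the same idea can be sketched with the standard library's html.parser module. This is only a rough equivalent, not the code from the original post: the names LinkLister and extract_links and the sample HTML string are made up for illustration, and it parses an inline string instead of fetching a page.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Collect the href value of every <a> tag, like URLLister above."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href' and v)


def extract_links(html_text):
    """Hypothetical helper: parse a string and return the hrefs found."""
    parser = LinkLister()
    parser.feed(html_text)
    parser.close()
    return parser.urls


# Made-up sample input, so the sketch runs without network access
sample = ('<p><a href="http://example.com/">abs</a> '
          '<a href="/about">root</a> <a name="x">no href</a></p>')
print(extract_links(sample))  # ['http://example.com/', '/about']
```

As with the SGMLParser version, the hrefs come back exactly as written in the page, so relative paths are still relative.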
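However, some of the links extracted this way are relative paths that do not include http://, so further processing is needed. I added a few functions and changed the program to the following: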
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 only: sgmllib and urllib2 do not exist in Python 3.
from sgmllib import SGMLParser
import urllib
import urllib2
import re
import socket
socket.setdefaulttimeout(210)  # global socket timeout, in seconds

class URLLister(SGMLParser):
    def reset(self):
        self.url = []
        SGMLParser.reset(self)

    # Called for every <a> start tag; collect its href attribute
    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.url.extend(href)

# Split the page URL into [host root, base directory]
def fenxiurl(xurl):
    aa = []
    xurl = xurl.lower()
    parts = xurl.split('http://')
    if len(parts) > 1:
        ss = parts[1]
        parts = ss.split('/')
        if len(parts) > 1:
            if parts[0] != '':
                aa.append('http://' + parts[0])

            # Rebuild everything up to the last '/' as the base directory
            s1 = 'http://'
            for r in range(len(parts) - 1):
                s1 = s1 + parts[r] + '/'

            aa.append(s1)
        else:
            # No path component: the URL itself is the host root
            aa.append(xurl)
            aa.append(xurl + '/')

    return aa

# Analyze each link and build absolute URLs
# furlsz: list of extracted hrefs; xx: [host root, base directory]
def ChuLiUrl(furlsz, xx):
    newurllist = []
    s = re.compile('^http://')
    y = re.compile('^/')
    for x in range(len(furlsz)):
        ssurl = furlsz[x].lower()

        m = s.search(ssurl)
        if m:
            # Already an absolute http:// URL
            newurllist.append(furlsz[x])
        else:
            # Skip links that are not http page links
            if ssurl.find('mailto') > -1: continue
            #if ssurl.find('ftp://') > -1: continue
            if ssurl.find('://') > -1: continue
            if ssurl.find('javascript:') > -1: continue

            n = y.search(ssurl)
            if n:
                # Starts with '/': relative to the host root
                newurllist.append(xx[0] + furlsz[x])
            else:
                # Otherwise: relative to the base directory
                newurllist.append(xx[1] + furlsz[x])

    # Remove duplicates
    return list(set(newurllist))

parser = URLLister()

myurl = 'http://www.gjjblog.com'
request = urllib2.Request(myurl)  # build the page request
opener = urllib2.build_opener()
page = opener.open(request)

if page.code == 200:
    predata = page.read()
    parser.feed(predata)
    urlsz = parser.url  # list of extracted URLs

    fenxihost = fenxiurl(myurl)
    print ChuLiUrl(urlsz, fenxihost)
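To make the per-link decision in ChuLiUrl easier to follow, here is a minimal sketch of the same rule applied to one href at a time. The helper name resolve_link, the two base values, and the sample hrefs are hypothetical and chosen only for illustration; the sketch uses Python 3 print syntax.

```python
def resolve_link(href, host_root, base_dir):
    """Apply the same rule as ChuLiUrl to a single href.

    Returns the absolute URL, or None when the link is skipped.
    """
    lowered = href.lower()
    if lowered.startswith('http://'):
        return href                  # already absolute
    if 'mailto' in lowered or '://' in lowered or 'javascript:' in lowered:
        return None                  # not an http page link
    if lowered.startswith('/'):
        return host_root + href      # relative to the host root
    return base_dir + href           # relative to the current directory


# Assumed values, in the shape fenxiurl returns for a page under /archives/
host_root = 'http://www.gjjblog.com'
base_dir = 'http://www.gjjblog.com/archives/'

for href in ['/about', 'page2.html', 'mailto:someone@example.com']:
    print(href, '->', resolve_link(href, host_root, base_dir))
# /about -> http://www.gjjblog.com/about
# page2.html -> http://www.gjjblog.com/archives/page2.html
# mailto:someone@example.com -> None
```
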

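The extracted link paths are now complete. I did all of this for the information spider service I am writing. It is still not perfect, though: if the trailing parameters of a URL contain something like ?xxx=/usrl/xxxx, the functions above give wrong results, because the analysis uses / as the separator throughout. I have not yet thought of a way to parse this properly, so I will keep using this version for now and revise it once I come up with a better method.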
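One possible way around the query-string problem, which is not part of the original post, is to let the standard library resolve the links: urljoin parses the page URL into its components first, so slashes inside the query string are not mistaken for directories. The sketch below assumes Python 3 (in Python 2 the same function lives in the urlparse module); the helper name absolutize and the sample URL are hypothetical.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def absolutize(page_url, hrefs):
    """Resolve each href against page_url; keep unique http(s) results."""
    resolved = {urljoin(page_url, h) for h in hrefs}
    return sorted(u for u in resolved if u.startswith(('http://', 'https://')))


# Made-up page URL whose query string contains slashes
page_url = 'http://www.gjjblog.com/list.php?next=/usr/page'

# Cutting at the last '/' picks up part of the query string as a directory
naive_base = page_url[:page_url.rfind('/') + 1]
print(naive_base + 'item.html')
# http://www.gjjblog.com/list.php?next=/usr/item.html

# urljoin resolves against the real path (/list.php) instead
print(urljoin(page_url, 'item.html'))
# http://www.gjjblog.com/item.html

print(absolutize(page_url, ['item.html', '/about', 'mailto:a@example.com']))
# ['http://www.gjjblog.com/about', 'http://www.gjjblog.com/item.html']
```

Unlike ChuLiUrl, this sketch also keeps https:// links, and it drops mailto: and javascript: links by checking the scheme of the resolved URL.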

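That is my improvement to web page link extraction. The write-up is a bit rough and I am still a beginner, so comments and corrections are welcome.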

This post is a repost. Original address: http://www.lpfrx.com/archives/176/

