linkcrawl.pl
Content is copyright
© 2002 Matthew Astley
$Id: linkcrawl-pl.html,v 1.3 2002/09/01 20:08:11 mca1001 Exp $
This page is just here to document the existence of
linkcrawl.pl. Hopefully you'll never see it; it was written
for internal use.
If you do ever see the thing, at least there will
be one hit in the search engines to give you some idea of what it is
and where it came from.
- User agent
- Of the form linkcrawl.pl/1.11 [rude mode]
libwww-perl/5.64
- Crawling speed
- Fast and rude, unless I've changed the inheritance to use
LWP::RobotUA since I wrote this. It's not multithreaded; it
tends to be CPU-bound on parsing the HTML anyway.
- Depth/breadth first?
- Configurable
- robots.txt
- It doesn't honour it. Several of the sites I need it to crawl forbid
robots, at least in production use, but I still need to crawl all over
them.
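For the polite variant mentioned above, LWP::RobotUA is a drop-in replacement for LWP::UserAgent that honours robots.txt and rate-limits requests. A minimal sketch; the agent name and contact address are placeholders, and the delay value is illustrative:

```perl
use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA requires an agent name and a contact address
# (both placeholders here), and waits between requests to the
# same host.
my $ua = LWP::RobotUA->new('linkcrawl.pl/1.11', 'webmaster@example.org');
$ua->delay(0.5);    # minimum delay between requests, in minutes

print "delay: ", $ua->delay, " minutes\n";
```

Swapping the parent class this way is the whole change; requests made through `$ua` then consult each site's robots.txt automatically.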
- Purpose
- Multipurpose link checker and enumeration tool.
- Neat features (IMHO)
- Follows links from framesets (the original reason I needed to
write it), indeed anything HTML::LinkExtor spits out.
- Fetches and lists each page once.
One use is preparing a list of URLs to hammer with an HTTP
performance tester such as
siege.
- Regexps define what is to be fetched
- Regexps are applied to links before they're followed, so I can
undo the link munging I had to do earlier.
- Link checking down to the bookmark/anchor level for HTML.
- Separate compartments for images, Java classes, hrefs etc.
- I can hang various things off the different MIME types that come
back. I plan to extend it to check for TODO lists, pages not
covered by robot rules, GET locations that have unwanted side
effects... whatever is needed, really.
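The extract-links, undo-munging, fetch-once pipeline described above can be sketched with HTML::LinkExtor. The frameset markup, the redirect-style munging, and the rewrite rule are all made up for illustration:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical page: a frameset, one munged link, one repeat.
my $html = <<'HTML';
<frameset>
  <frame src="nav.html">
  <frame src="main.html">
</frameset>
<a href="http://redirect.example.com/?to=http://www.example.org/">out</a>
<a href="nav.html">nav again</a>
HTML

# HTML::LinkExtor calls back with ($tag, attr => url, ...) for
# every link-bearing tag, including frame src attributes.
my @found;
my $p = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @found, values %attr;
});
$p->parse($html);
$p->eof;

# Undo the (hypothetical) link munging before following.
s{^http://redirect\.example\.com/\?to=}{} for @found;

# List each URL once, preserving first-seen order.
my %seen;
my @queue = grep { !$seen{$_}++ } @found;
print "$_\n" for @queue;
```

Printed in first-seen order, that yields `nav.html`, `main.html`, `http://www.example.org/`; the same `%seen` hash is what keeps the crawler from fetching a page twice.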
- Licence
- Not distributed; proprietary by default. If it turns out to be
useful, I can ask for a copyright waiver and then GPL it.
- How do I make it stop?
- If it's thrashing your server, run
whois or a reverse DNS query on the originating IP. If it's
from a domain I administer, there should be a phone number in the
TXT record attached to the relevant domain.
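As a concrete sketch of that lookup, using dig; the IP is a placeholder TEST-NET address, so substitute the one actually hitting you:

```shell
# 192.0.2.1 is a placeholder (TEST-NET) address; substitute the
# crawler's real source IP.
ip=192.0.2.1

# Reverse DNS: find the host name behind the IP.
host="$(dig +short -x "$ip" 2>/dev/null)"
echo "crawler host: ${host:-unknown}"

# If that yields a domain I administer, its TXT record should
# carry a contact phone number.
if [ -n "$host" ]; then
  dig +short TXT "${host#*.}"
fi
```

`whois "$ip"` works as a fallback when the reverse zone isn't delegated.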
So you get the picture. It's a very rude web crawler, and it
shouldn't be allowed Out.