linkcrawl.pl
Content is copyright
© 2002 Matthew Astley
$Id: linkcrawl-pl.html,v 1.3 2002/09/01 20:08:11 mca1001 Exp $
This page is just here to document the existence of
linkcrawl.pl. Hopefully you'll never see it; it was written
for internal use.
If you do ever see the thing, at least there will
be one hit in the search engines to give you some idea of what it is
and where it came from.
- User agent
- Of the form linkcrawl.pl/1.11 [rude mode]
libwww-perl/5.64
- Crawling speed
- Fast and rude, unless I've changed the inheritance to use
LWP::RobotUA since I wrote this. It's not multithreaded; it
tends to be CPU-bound on parsing the HTML anyway.
- Depth/breadth first?
- Configurable
- robots.txt
- It doesn't honour it. Several of the sites I need it to crawl forbid
robots, at least in production use, but I still need to crawl all over
them.
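For the polite variant mentioned above, LWP::RobotUA is a drop-in replacement for LWP::UserAgent that honours robots.txt and rate-limits requests. A minimal sketch; the agent name and contact address are placeholders, and the delay value is illustrative:

```perl
use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA requires an agent name and a contact address
# (both placeholders here), and waits between requests to the
# same host.
my $ua = LWP::RobotUA->new('linkcrawl.pl/1.11', 'webmaster@example.org');
$ua->delay(0.5);    # minimum delay between requests, in minutes

print "delay: ", $ua->delay, " minutes\n";
```

Swapping the parent class this way is the whole change; requests made through `$ua` then consult each site's robots.txt automatically.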
- Purpose
- Multipurpose link checker and enumeration tool.
- Neat features (IMHO)
- Follows links from framesets (the original reason I needed to
write it), indeed anything HTML::LinkExtor spits out.
- Fetches and lists each page once.
One use is preparing a list of URLs to hammer with an HTTP
performance tester such as
siege.
- Regexps define what is to be fetched
- Regexps are applied to links before they're followed, so I can
undo the link munging I had to do earlier.
- Link checking down to the bookmark/anchor level for HTML.
- Separate compartments for images, Java classes, hrefs etc.
- I can hang various things off the different MIME types that come
back. I plan to extend it to check for TODO lists, pages not
covered by robot rules, GET locations that have unwanted side
effects... whatever is needed, really.
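The extract-links, undo-munging, fetch-once pipeline described above can be sketched with HTML::LinkExtor. The frameset markup, the redirect-style munging, and the rewrite rule are all made up for illustration:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical page: a frameset, one munged link, one repeat.
my $html = <<'HTML';
<frameset>
  <frame src="nav.html">
  <frame src="main.html">
</frameset>
<a href="http://redirect.example.com/?to=http://www.example.org/">out</a>
<a href="nav.html">nav again</a>
HTML

# HTML::LinkExtor calls back with ($tag, attr => url, ...) for
# every link-bearing tag, including frame src attributes.
my @found;
my $p = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @found, values %attr;
});
$p->parse($html);
$p->eof;

# Undo the (hypothetical) link munging before following.
s{^http://redirect\.example\.com/\?to=}{} for @found;

# List each URL once, preserving first-seen order.
my %seen;
my @queue = grep { !$seen{$_}++ } @found;
print "$_\n" for @queue;
```

Printed in first-seen order, that yields `nav.html`, `main.html`, `http://www.example.org/`; the same `%seen` hash is what keeps the crawler from fetching a page twice.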
- Licence
- Not distributed; proprietary by default. If it turns out to be
useful, I can ask for a copyright waiver and then GPL it.
- How do I make it stop?
- If it's thrashing your server, run
whois or a reverse DNS query on the originating IP. If it's
from a domain I administer, there should be a phone number in the
TXT record attached to the relevant domain.
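As a concrete sketch of that lookup, using dig; the IP is a placeholder TEST-NET address, so substitute the one actually hitting you:

```shell
# 192.0.2.1 is a placeholder (TEST-NET) address; substitute the
# crawler's real source IP.
ip=192.0.2.1

# Reverse DNS: find the host name behind the IP.
host="$(dig +short -x "$ip" 2>/dev/null)"
echo "crawler host: ${host:-unknown}"

# If that yields a domain I administer, its TXT record should
# carry a contact phone number.
if [ -n "$host" ]; then
  dig +short TXT "${host#*.}"
fi
```

`whois "$ip"` works as a fallback when the reverse zone isn't delegated.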
So you get the picture. It's a very rude web crawler, and it
shouldn't be allowed Out.