Also, the Perl modules LWP, HTML::Parser, and URI provide tools for
downloading files from the web, processing them while they are being
downloaded, and extracting hyperlinks, among other functions. I found
this useful for repetitive, site-specific tasks in which I wanted to
filter out some of the files being downloaded.
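As a minimal sketch of how those modules fit together (the URL is just a placeholder; a real crawler would also track visited pages and recurse), you fetch a page with LWP::UserAgent, pull out the anchors with HTML::LinkExtor, and resolve relative links with URI:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Collect absolute hyperlinks from a chunk of HTML, resolving
# relative links against $base with the URI module.
sub extract_links {
    my ($html, $base) = @_;
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attrs) = @_;
        push @links, URI->new_abs($attrs{href}, $base)->as_string
            if $tag eq 'a' && $attrs{href};
    });
    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Fetch one page and list the links found on it.
# (example.com is a placeholder, not a real target site.)
my $ua  = LWP::UserAgent->new;
my $res = $ua->get('http://example.com/');
if ($res->is_success) {
    print "$_\n" for extract_links($res->content, $res->base);
}
```

Because the link callback runs while the page is parsed, this is also where you would filter out files you don't want before downloading them.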
> Subject: Corpora: Help please - downloading text from the Web
> Hi. Can anyone help me with the following:
> I'm looking for software - preferably freeware or shareware - to
> use to download text from Web sites, for use in a corpus.
> This will be from large sites, with a lot of files, sub-directories
> and internal links. Most basically, the software would simply download
> HTML files from the site, following internal links from the Home page.
> I've tried various "bots" that do this, but have had problems with all
> of them. So I'd welcome recommendations for software that others have
> found unproblematic (and powerful/multi-functioned) for this purpose.
> And if anyone knows of packages that are more specifically aimed at the
> task I'm undertaking, that would be even better.
> Also useful would be software that mapped out the structure of sites,
> giving an idea of the size of the files.
> I have a related question. What tools do you use once you have downloaded
> the HTML files to (batch-)convert them into reasonably clean "plain" text?
This archive was generated by hypermail 2b29 : Tue Mar 28 2000 - 09:13:33 MET DST