WebTreeCopy

by Thomas Bleier, (e9420329@student.tuwien.ac.at)



Introduction

Note: WebTreeCopy is not being maintained any more.

WebTreeCopy is a program that copies web pages to your local hard disk. You enter a URL, and WebTreeCopy loads this page and saves it to your disk. The key feature is that it scans the page for links to other objects, such as images or other pages, and downloads these objects as well. So you can copy a document including its images, a document consisting of many pages, or even a whole server (or even some part of the web ;-) to your local hard disk.

There are many such programs out there, so why have I written another one? Well, the original version was written for a course called Java Programming, held at the Technical University of Vienna by the Department for Information Systems. Since I personally need such a program sometimes, I decided to continue developing the result.

Features of WebTreeCopy:


Installation

The installation should be very easy. Just download the ZIP-File and extract it to a folder you like (Note: you should use an unzip-program that supports long filenames, e.g. the Info-Zip utilities, and you should unzip the file with the folders stored in the archive).

After that you have to run the file wtc.class in your Java environment. For example, if you use the Sun JDK, type java wtc on the command-line prompt. WebTreeCopy is currently implemented with JDK 1.0.2, so your Java environment should support this version.

If all goes right, you should see a message that WebTreeCopy is starting with the GUI, and the WebTreeCopy main-window should appear.


Usage

There are two ways to use WebTreeCopy. You can use the GUI to set all the options you want and then press the start button; another window opens in which WebTreeCopy tells you what it is doing. Or you can enter the options on the command line, and WebTreeCopy will print its output to the console.

If you start WebTreeCopy without any parameters, you can use the GUI. Start it with the parameter -? or -h to get help on the command-line parameters.

A description of all available options follows below.

Option                               Parameter     Description
URL to copy                          filename      a URL where WebTreeCopy should start copying.
Maximum recursion depth              -m number     the depth to which WebTreeCopy should follow links. You can think of the URL where WebTreeCopy starts as the root of a tree. WebTreeCopy follows the links on this page and copies the linked pages too. The number given here specifies the maximum level of the tree; at that level WebTreeCopy does not follow links any more.
Copy only subtree                    -s            if this option is set, only subtrees of the given start-URL are copied. This means that if the start-URL is something like http://www.abc.org/directory/subdir/ and the -s option is set, only pages below the directory /directory/subdir/ are processed. For example, a page http://www.abc.org/directory/subdir/anotherdir/apage.html is copied, but a page like http://www.abc.org/directory/nodir/somepage.html is left out. The pages in the directory /directory/subdir/ itself are copied, too.
Copy files from other servers        -o            with this option set, files on other servers are copied too.
Destination directory                -d directory  specifies the directory where the pages are stored locally. The default is the current directory.
Index-file name                      -i file       the name of the index-file. If a URL contains no filename (e.g. http://www.abc.org/somedir/), a default filename is used. With this option you can specify that name. It depends on the web-server used. Often it is index.html, so this is the default value.
All files in one directory           -c            if this option is set, all files are copied into one directory, regardless of the original directory structure. Normally the directory structure is recreated as it is on the server the files are copied from.
Concurrent threads                   ??? number    WebTreeCopy starts the given number of threads. Each thread looks for a page to copy in the pool of pages and copies this page. The links found in the copied pages are added to the pool. The threads work concurrently.
Logfile                              -l file       logs all messages to the file file.
Quiet mode                           -q            if this option is set, no messages except errors are printed. It can only be set on the command-line interface.
Show extended messages/verbose mode  -v            with this option WebTreeCopy shows a lot of information about what it is doing. Otherwise only the URLs that are copied are shown.
Ask for each URL                     -a            if this option is set, WebTreeCopy displays a prompt at each URL, where you can decide whether this URL should be copied or not.
Copy images/pictures                 -p            with this option you can control whether the images included in the pages are copied or not.
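
To make the -s rule more concrete, here is a minimal sketch of how such a subtree check could look. It is not taken from the WebTreeCopy sources: the class SubtreeFilter and the method isInSubtree are made-up names, and the code assumes a current Java version rather than JDK 1.0.2. It also assumes that "below the start-URL" means same host plus a path that starts with the directory part of the start-URL.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of WebTreeCopy.
class SubtreeFilter {

    // Returns true if candidateUrl is on the same host as startUrl and its
    // path lies in or below the directory of startUrl.
    static boolean isInSubtree(String startUrl, String candidateUrl) {
        URI start = URI.create(startUrl);
        URI candidate = URI.create(candidateUrl);

        // Assumption: pages on other hosts are never part of the subtree.
        if (start.getHost() == null
                || !start.getHost().equalsIgnoreCase(candidate.getHost())) {
            return false;
        }

        // Reduce the start path to its directory, e.g. /a/b/page.html -> /a/b/
        String startPath = start.getPath();
        String startDir = startPath.substring(0, startPath.lastIndexOf('/') + 1);

        return candidate.getPath().startsWith(startDir);
    }

    public static void main(String[] args) {
        String start = "http://www.abc.org/directory/subdir/";
        // The two example pages from the option table above.
        System.out.println(isInSubtree(start,
                "http://www.abc.org/directory/subdir/anotherdir/apage.html"));
        System.out.println(isInSubtree(start,
                "http://www.abc.org/directory/nodir/somepage.html"));
    }
}
```

With the two example URLs from the table, the first page is accepted and the second is rejected.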
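
The options -d, -i and -c together decide where a page ends up on disk. The following sketch shows one way this mapping could work. Again, this is only an illustration under assumptions: LocalPathMapper and toLocalPath are invented names, the code uses a current Java version, and it ignores the host name and things like ".." in paths, which a real tool would have to deal with.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of WebTreeCopy.
class LocalPathMapper {

    // destDir   corresponds to the -d option
    // indexName corresponds to the -i option (e.g. "index.html")
    // flat      corresponds to the -c option (all files in one directory)
    static Path toLocalPath(String url, Path destDir, String indexName, boolean flat) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        // A URL without a filename gets the index-file name appended.
        if (path.endsWith("/")) {
            path = path + indexName;
        }
        if (flat) {
            // Keep only the filename and drop the directory structure.
            return destDir.resolve(path.substring(path.lastIndexOf('/') + 1));
        }
        // Recreate the server's directory structure below destDir.
        return destDir.resolve(path.substring(1));
    }

    public static void main(String[] args) {
        Path dest = Paths.get("mirror");
        System.out.println(toLocalPath("http://www.abc.org/somedir/", dest, "index.html", false));
        System.out.println(toLocalPath("http://www.abc.org/somedir/pic.gif", dest, "index.html", true));
    }
}
```

Note that with the flat layout two files with the same name from different directories would overwrite each other in this sketch.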
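
The interplay of the recursion depth and the concurrent threads can be sketched as follows. This is a simplified model and not the actual WebTreeCopy code: CrawlPool, Job and crawl are made-up names, the code uses java.util.concurrent from a current Java version (which JDK 1.0.2 does not have), and instead of loading pages from the network it looks up links in a map that is passed in. It assumes that the start-URL is level 0 and that pages at the maximum level are still copied, but their links are not followed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model for illustration only; not part of WebTreeCopy.
class CrawlPool {

    // One entry in the pool: a URL and its level in the tree.
    static final class Job {
        final String url;
        final int depth;
        Job(String url, int depth) { this.url = url; this.depth = depth; }
    }

    // links stands in for "download the page and parse its links".
    static Set<String> crawl(String startUrl, int maxDepth, int threads,
                             Map<String, List<String>> links) {
        BlockingQueue<Job> pool = new LinkedBlockingQueue<>();
        Set<String> seen = ConcurrentHashMap.newKeySet();
        AtomicInteger pending = new AtomicInteger(1); // jobs queued or in progress

        seen.add(startUrl);
        pool.add(new Job(startUrl, 0));

        Runnable worker = () -> {
            try {
                while (pending.get() > 0) {
                    Job job = pool.poll(50, TimeUnit.MILLISECONDS);
                    if (job == null) {
                        continue; // pool empty right now, other threads may add more
                    }
                    // A real tool would copy job.url to disk here.
                    if (job.depth < maxDepth) {
                        for (String link : links.getOrDefault(job.url, Collections.emptyList())) {
                            if (seen.add(link)) { // add each URL to the pool only once
                                pending.incrementAndGet();
                                pool.add(new Job(link, job.depth + 1));
                            }
                        }
                    }
                    pending.decrementAndGet(); // this job is done
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(worker);
            t.start();
            workers.add(t);
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("d"));
        links.put("d", Arrays.asList("e"));
        // With a maximum depth of 2, "e" (level 3) is never reached.
        System.out.println(new TreeSet<>(crawl("a", 2, 3, links))); // prints [a, b, c, d]
    }
}
```

The counter of pending jobs is what lets the threads stop: a thread only gives up when the pool is empty and no other thread is still working on a page that might add new links. In this sketch a page that can be reached by several paths keeps the level at which it was found first.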

The graphical interface also has browse-buttons for file and directory names. These are disabled for now, because the JDK does not support such dialogs (as far as I know). If these dialogs are added to Java, I will enable the buttons.


Have fun...

I hope you have fun using WebTreeCopy. The current version is 0.9, because it is not fully tested. So you should consider this software to be in BETA-state.

The source-code is available here.

The usual disclaimers apply, so don't be surprised if you lose data or something after using this software <GRIN>.

Many thanks to Jef Poskanzer for his HTML-Parser in Java, which is used in WebTreeCopy.

If you use WebTreeCopy, please let me know. It's free, as I said, but I am interested in knowing if somebody uses this program ;-). My mail-address is e9420329@student.tuwien.ac.at.


Page created by Thomas Bleier, (e9420329@student.tuwien.ac.at), last modified 4Feb97