FtpLocate - make your own FTP Search Engine
INTRODUCTION
FtpLocate is an FTP search engine written in Perl.
It has the following features:
-
It indexes multiple FTP servers. The username, password and
initial index directory can be set for each FTP server.
-
It is very fast! FtpLocate uses glimpse as the indexer. A query against 4
million records takes less than 3 seconds on average.
-
It is very easy to install.
-
It provides two types of search:
search by name:
Results are grouped by FTP server. The server nearest to the client
(chosen by domain name) is displayed first. Besides the file
name, file size and file date, the file description is also shown
if available.
search by description:
Users can find the files they want without knowing the filename. For
example, searching with the keywords "windows;ftp;server"
returns all FTP server programs available for the Windows platform.
Files with the same description are grouped together.
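The exact rule used to pick the nearest server is not documented here. One plausible reading of "chosen by domain name" is to count how many trailing domain labels the client and the server share. The Python sketch below illustrates that idea only; the function names and the extra host names are made up for the example and are not part of FtpLocate.

```python
def shared_suffix_labels(host_a, host_b):
    """Count the trailing domain labels that two host names have in common."""
    a = host_a.lower().rstrip(".").split(".")
    b = host_b.lower().rstrip(".").split(".")
    count = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        count += 1
    return count


def order_servers(client_host, servers):
    """Return servers sorted so the longest shared domain suffix comes first.

    sorted() is stable, so servers with equal scores keep their input order.
    """
    return sorted(servers, key=lambda s: -shared_suffix_labels(client_host, s))


if __name__ == "__main__":
    # Hypothetical client and server names, for illustration only.
    servers = ["ftp.cdrom.com", "ftp.ntu.edu.tw", "ftp.ee.ncku.edu.tw"]
    for host in order_servers("pc1.ee.ncku.edu.tw", servers):
        print(host)
```

A client whose address cannot be resolved to a name shares no labels with any server, so under this assumption the original server order is kept.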
-
It has both a web and a text version. The web version is a set of CGI
programs. The text version is a simple web client that communicates with
the CGI programs over HTTP. It is handy when no browser is available.
-
It can collect the filelist from three sources:
direct
The filelist is collected by sending the 'ls -lR' command directly to
the remote FTP server.
file://path/filename.of.ls-lR
The filelist is collected by parsing the ls-lR file on the
remote FTP server. Supported formats are Z, gz, zip and plain text.
a URL like http://other.ftplocate/cgi-bin/ftplocate/flserv.pl
The filelist is requested from the filelist database
of another FtpLocate server. This is useful if the FTP server is far away
from your FtpLocate server.
ps: The transfer of filelists between FtpLocate
servers uses HTTP and supports proxies.
Just set the environment
variable 'http_proxy' to 'http://your_proxy_server:3128/'
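Whichever source is used, the raw material is an 'ls -lR' style listing. The Python sketch below shows roughly how such a listing can be turned into records. It is not the code in flcollect.pl: it assumes the common nine-column Unix format (permissions, links, owner, group, size, three date fields, name) and silently skips lines that do not fit, and the sample listing is invented.

```python
import re

# One entry line: type char, 9 permission chars, links, owner, group,
# size, three date fields, then the name (which may contain spaces).
ENTRY = re.compile(
    r"^([-dlbcps])[-rwxsStT]{9}\S*\s+\d+\s+\S+\s+\S+\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(.+)$"
)


def parse_ls_lr(text):
    """Yield (directory, name, size, kind) for each entry in ls -lR output."""
    current_dir = "."
    for raw in text.splitlines():
        line = raw.rstrip()
        match = ENTRY.match(line)
        if match:
            kind, size, name = match.group(1), int(match.group(2)), match.group(3)
            if kind == "l":
                # Symbolic links are listed as "name -> target".
                name = name.split(" -> ", 1)[0]
            yield current_dir, name, size, kind
        elif line.endswith(":"):
            # A header such as "./pub:" starts a new directory block.
            current_dir = line[:-1]
        # Anything else ("total 8", blank lines) is ignored.


# Invented sample listing, for illustration only.
SAMPLE = """
.:
total 8
drwxr-xr-x  2 ftp  ftp   512 Feb 20  2000 pub
-rw-r--r--  1 ftp  ftp  1024 Feb 20 12:00 README

./pub:
total 4
-rw-r--r--  1 ftp  ftp  2048 Jan  5  1999 ftplocate-2.01.tar.gz
lrwxr-xr-x  1 ftp  ftp    21 Jan  5  1999 latest -> ftplocate-2.01.tar.gz
"""

if __name__ == "__main__":
    for record in parse_ls_lr(SAMPLE):
        print(record)
```

Listings from real servers vary (missing group column, device files, non-Unix formats), so a production parser needs more cases than this one.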
-
It generates summaries for all indexed FTP servers, including directory
count, file count, total file size, etc.
-
It generates the maplist of FtpLocate servers on the Internet. Each FtpLocate
server registers itself with the master server and gets the most up-to-date
server list from it.
-
It caches the results of user queries to speed up the response to repeated
requests.
-
It generates the hot list of user queries. To increase the cache hit ratio,
a training program is provided that rebuilds the results of these queries
in the cache after the database is re-indexed.
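The interplay of cache, hot list and training can be pictured with the small Python sketch below. It is an in-memory model only: FtpLocate keeps its cache in a directory on disk, and the class and method names here are invented for the example.

```python
from collections import Counter


class QueryCache:
    """Toy model of a query cache with a hot list and post-index training."""

    def __init__(self, search_fn):
        self.search_fn = search_fn   # the slow search, e.g. a call to the indexer
        self.cache = {}              # query -> cached result
        self.counts = Counter()      # query -> number of times it was asked

    def query(self, q):
        """Answer from the cache when possible, otherwise search and store."""
        self.counts[q] += 1
        if q not in self.cache:
            self.cache[q] = self.search_fn(q)
        return self.cache[q]

    def hot_list(self, top_n=10):
        """Return the most frequently asked queries, hottest first."""
        return [q for q, _ in self.counts.most_common(top_n)]

    def reindex(self, top_n=10):
        """Drop stale results after a re-index, then pre-fill the hot queries."""
        self.cache.clear()
        for q in self.hot_list(top_n):
            self.cache[q] = self.search_fn(q)


if __name__ == "__main__":
    cache = QueryCache(lambda q: ["result for " + q])
    cache.query("perl")
    cache.query("perl")
    cache.query("glimpse")
    cache.reindex(top_n=1)   # only "perl" is rebuilt into the cache
    print(cache.hot_list())
```

Clearing the cache on every re-index keeps stale results from being served; the training step pays the search cost once, up front, for the queries most likely to be repeated.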
-
It generates the history list of user queries.
-
The output is split into pages (100 records each), so users don't have
to wait for a large result transfer to complete.
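The paging arithmetic is simple; a Python sketch follows. The page size of 100 comes from the text above, while the function names and 1-based page numbering are assumptions made for the example.

```python
PAGE_SIZE = 100  # records per page, as stated above


def page_count(total_records, page_size=PAGE_SIZE):
    """Number of pages needed to show total_records (ceiling division)."""
    return (total_records + page_size - 1) // page_size


def get_page(records, page, page_size=PAGE_SIZE):
    """Return the records on the given 1-based page (empty past the end)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * page_size
    return records[start:start + page_size]


if __name__ == "__main__":
    results = ["file%d" % i for i in range(250)]
    print(page_count(len(results)))    # pages needed for 250 records
    print(len(get_page(results, 3)))   # size of the last, partial page
```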
-
It is designed to minimize downtime. The search engine is
unavailable only during the indexing stage.
ps: Indexing takes little time
compared to data collecting, so the search engine serves users
most of the time.
EXAMPLE
We have indexed 27 FTP sites on a Celeron 450 machine with 256MB of RAM.
924322 directories and 3823568 files were found, with a total size of 1064GB.
Collecting the file lists takes about 5 hours; their total size is 410MB.
Indexing the file lists takes 30 minutes; the file list index size is 18MB.
After filelist indexing is done, we use the index to get the list of files
containing description information. The description parser currently recognizes
Linux lsm, FreeBSD package index, Simtel 00index and RFC index. If a file
is unrecognized, the description parser will try a wild guess. :)
Fetching the description files takes about 2 hours; their total size
is 34MB.
Parsing and indexing the descriptions takes 5 minutes; the index size is
9MB.
Most queries in this example finish within 3 seconds.
ps: The example is available at
http://ftp.ee.ncku.edu.tw/cgi-bin/ftplocate/flsearch.pl (filename search)
and
http://ftp.ee.ncku.edu.tw/cgi-bin/ftplocate/dsearch.pl (description search)
REQUIREMENTS
-
Unix platform
-
Perl 5.005 or above
-
Apache or any web server able to execute CGI programs
-
Glimpse 4.1 (a great indexing tool by cs.arizona.edu)
ps: FtpLocate was developed on FreeBSD 3.1 Release, Perl 5.00502,
Apache 1.3.4 and Glimpse 4.1
DOWNLOAD
FILES
-
install.pl
the auto install program
documents
-
readme.zhtw.html
Chinese readme
-
readme.english.html
English readme
-
help.zhtw.html
Chinese help file
-
help.english.html
English help file
system files
-
config.site
the most important file; specifies the FTP servers to be indexed
-
config
defines most global variables
-
lang.zhtw
string definition file for Chinese
-
lang.english
string definition file for English
-
flmodule.pl
functions used in various FtpLocate programs
data collecting and indexing programs
-
indexer.sh
a shell script to do data collecting and indexing, used in cron table
-
flcollect.pl
collect file list from ftp servers
-
flindex.pl
index the collected file list
-
dcollect.pl
collect the description files from ftp servers
-
dindex.pl
parses the description files and indexes them
-
flfilter.pl
program used by flindex.pl when calling the glimpse indexer
-
fltrain.pl
program used to train the search engine
search CGI programs
-
flsearch.pl
FtpLocate filename search engine
-
dsearch.pl
FtpLocate description search engine
misc CGI programs
-
flsummary.pl
list summaries of indexed ftp servers
-
flmap.pl
list other FtpLocate servers
-
fltop.pl
list the hottest queries
-
flhistory.pl
list the query history
text based client
-
ftplocate
text based client
log files (created by install program)
-
log.system
FtpLocate system log file; records the data collecting and indexing history
-
log.user
FtpLocate user query log file
-
log.map
FtpLocate server list map
data directories (created by install program)
-
filelist
used to store filelists of different ftp servers
-
desc
used to store description files
-
cache
used to store the results of user queries (cleared after each data
re-index)
INSTALL
-
untar ftplocate-2.xx.tar.gz, then change to the extracted directory 'ftplocate-2.xx'
-
run './install.pl'. The install program will check the requirements
and determine most settings for you.
-
edit the file 'config.site' to specify the FTP sites to be indexed
by the FtpLocate server
-
run indexer.sh to do data collecting and indexing
-
use your browser to test your FTP search engine...:)
TROUBLESHOOTING
If you have any problems, please
-
check if your CGI system is okay
-
check the disk space for $TMPDIR and $CACHEDIR defined in 'config'
-
check the permissions of $CACHEDIR; it must be writable by
your CGI user
-
check the path for external programs defined in $CMD_xxx
-
check log.system to see what happened
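The disk space and permission checks can be scripted. The Python sketch below is one way to do it and is not shipped with FtpLocate; the 10MB threshold and the example paths are arbitrary assumptions, so substitute the $TMPDIR and $CACHEDIR values from your own 'config'. Run it as the CGI user, because os.access() tests the permissions of whoever runs the script.

```python
import os
import shutil


def check_dir(path, min_free_bytes=10 * 1024 * 1024):
    """Return a list of problems found for one directory (empty if none)."""
    if not os.path.isdir(path):
        return ["%s: directory does not exist" % path]
    problems = []
    # Creating files in a directory needs both write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("%s: not writable by the current user" % path)
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append("%s: only %d bytes free" % (path, free))
    return problems


if __name__ == "__main__":
    # Hypothetical locations; use the values from your 'config' file.
    for directory in ("/tmp", "/usr/local/ftplocate/cache"):
        for problem in check_dir(directory):
            print(problem)
```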
CHANGES
-
2.01
Change filelist source keyword "file" to "file://..." and support Z, zip formats
Fix a bug in dcollect.pl that caused ftpget to fail
Fix maplist-related function
Fix domain-name-related function
-
2.00
More modular design
An install program is provided to ease the installation.
The username, password and initial index directory are now assignable
Filelist source can now be direct, file, or another FtpLocate server
Maplist function is added to maintain the FtpLocate server list
The text server is now a CGI program. The text client now acts like
a web client.
Fix the ftp timeout problem in filelist collecting
Fix the DNS timeout problem in user query
-
1.50
Support description search
Display descriptions when listing files
Use glimpse for indexing
-
1.00
Initial release.
TODO
-
make the description parser recognize more formats
-
search over multiple FtpLocate servers at the same time
-
better server choice algorithm
Any help or suggestions are welcome.
Distributed System Lab.
E.E. NCKU.
Taiwan
tung@turtle.ee.ncku.edu.tw
02/20/2000