Overview

elk is an open-source command-line web crawler that recursively searches websites for files and text. The name is inspired by the horns of the elk. Its features range from regular-expression-based file and text search to copying an entire website's content to the local file system. elk is a Python-based utility, developed and tested on Python 2.5 and hosted on SourceForge. The latest available version of elk is 0.1.



Usage

elk has been tested with Python 2.5 on Windows, Cygwin and Ubuntu. The main script is elk.py, which takes a number of arguments and switches; a typical command looks like:

python elk.py [switches] [urls]

Below is a complete list of the arguments and switches, with an explanation of each:

-a [--all]            All resources and links (excluding foreign links)
-l [--locallinks]     All local hyperlinks
-f [--foreignlinks]   Foreign hyperlinks
-p [--pagelinks]      Page anchors
-e [--emaillinks]     Email links
-i [--images]         Images
-c [--css]            Style sheets
-r [--rss]            RSS feeds
-j [--script]         Scripts
-t [--text]           Text files
-v [--verbose]        Verbose (print results)
-u [--usage]          Usage
-s [--search]         Search text files with the regular expression passed as an argument
-n [--name]           Search files with names matching the passed regular expression
-m [--mime]           Specific MIME types (pipe-separated)
-o [--output]         Output directory to save files
-d [--depth]          Depth of search, passed as an argument (default is 1)
-I [--ignorecase]     Case-insensitive search
-S [--save]           Save matched files
-X [--saveall]        Save all files (useful for downloading entire website content; overrides most other switches)

The default arguments can be set in the elk.ini configuration file.



Examples

Search all images with names ending with "jpg"
python elk.py -vi -n '.*jpg$' http://python.org

Search all files with names ending with "txt" containing text 'company' and save to "c:/mysearch"
python elk.py -S -n '.*txt$' -s 'company' -o c:/mysearch http://python.org

Download entire website content down two levels to "c:/mysearch"
python elk.py -X -d 2 -o c:/mysearch http://python.org
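
To make the meaning of a depth limit concrete, here is a minimal sketch of a depth-limited, breadth-first traversal over a small in-memory link graph. It is not elk's code: the function name crawl, the sample SITE dictionary, and the convention that depth 1 covers only the start page are all assumptions made for illustration, and the sketch is written for Python 3 rather than the Python 2.5 that elk targets.

```python
from collections import deque

# Hypothetical site map: page -> pages it links to (no network access needed).
SITE = {
    "/": ["/about", "/docs"],
    "/about": ["/team"],
    "/docs": ["/docs/install", "/"],
    "/team": [],
    "/docs/install": ["/docs/install/windows"],
    "/docs/install/windows": [],
}


def crawl(links, start, max_depth):
    """Return pages in visiting order, going at most max_depth levels deep.

    Assumption: depth 1 means only the start page itself is visited.
    """
    seen = {start}
    order = []
    queue = deque([(start, 1)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth >= max_depth:
            continue  # deepest allowed level: do not follow its links
        for target in links.get(page, []):
            if target not in seen:  # skip pages already queued or visited
                seen.add(target)
                queue.append((target, depth + 1))
    return order


if __name__ == "__main__":
    print(crawl(SITE, "/", 1))
    print(crawl(SITE, "/", 2))
```

Under these assumptions, a depth of 2 visits the start page and the pages it links to directly, and the seen set keeps a link back to the start page from causing a loop.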

Search for all email addresses on the website, in all text content
python elk.py -vt -s '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._%-]+\.[a-zA-Z]{2,6}' http://python.org

Search for all files with 'application/zip' mime type and name that ends with 'gz'
python elk.py -vm 'application/zip' -n '.*gz$' http://python.org
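
Before starting a crawl, it can help to try a pattern on a few sample strings with Python's standard re module. The sketch below does that for a file name pattern and an email pattern like the ones in the examples above. The helper names matching_names and find_emails are hypothetical, and how elk applies a pattern internally (re.match or re.search, and which flags) is an assumption here. The email pattern escapes the dot before the top-level domain and places the hyphen last in each character class so that it is taken literally rather than as a range.

```python
import re

# Patterns modelled on the examples above.
NAME_PATTERN = r".*jpg$"
EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._%-]+\.[a-zA-Z]{2,6}"


def matching_names(pattern, names, ignore_case=False):
    """Return the names matched by pattern (hypothetical helper, not elk's API)."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [name for name in names if regex.search(name)]


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return re.findall(EMAIL_PATTERN, text)


if __name__ == "__main__":
    names = ["logo.jpg", "banner.JPG", "notes.txt", "jpg-readme.html"]
    print(matching_names(NAME_PATTERN, names))
    print(matching_names(NAME_PATTERN, names, ignore_case=True))
    print(find_emails("Write to webmaster@example.org or press@example.com."))
```

The ignore_case argument plays the role that the -I switch is described as playing on the command line; without it, "banner.JPG" is not matched by '.*jpg$'.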



Download

elk-0.1.tgz - GPL v3



Advanced usage

elk is multithreaded, and the size of its thread pool can be changed in the elk.ini file. If you run into MIME errors, you can extend support to new MIME types by adding them to the mimelib.py script. Additional request headers can be set under the Headers section of elk.ini, and the RequestData section of the same file is used to pass GET/POST data with the request.
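
As a rough illustration of how settings of this kind can be read, the sketch below parses an INI-style text with Python 3's configparser module. Only the section names Headers and RequestData come from the description above; the Pool section, every key name, and the load_settings helper are assumptions made for this example, so check the elk.ini shipped with elk for the real layout. On Python 2.5 the equivalent module is named ConfigParser.

```python
import configparser

# Hypothetical configuration text; the Pool section and all keys are assumed.
SAMPLE_INI = """
[Pool]
threads = 4

[Headers]
User-Agent = elk-example
Accept-Language = en

[RequestData]
q = python
"""


def load_settings(text):
    """Parse INI text into (headers, request_data, thread_count)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case, e.g. "User-Agent"
    parser.read_string(text)
    headers = dict(parser["Headers"]) if parser.has_section("Headers") else {}
    data = dict(parser["RequestData"]) if parser.has_section("RequestData") else {}
    # Fall back to a single thread when no pool size is configured.
    threads = parser.getint("Pool", "threads", fallback=1)
    return headers, data, threads


if __name__ == "__main__":
    headers, data, threads = load_settings(SAMPLE_INI)
    print(headers)
    print(data)
    print(threads)
```

Setting optionxform to str matters for headers because configparser lower-cases option names by default.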



Feedback

If you have any questions, please email me at kamran.zafar@xeustechnologies.org. You can also submit defects or feature requests by logging a ticket in the elk tracking system; do not forget to enter your email address if you log a ticket anonymously.


Written by Kamran