Tool to summarise proxy access files into summary CSV files
$30-100 USD
Closed
Posted about 20 years ago
Paid on delivery
A command-line tool is required to summarise proxy log files into much shorter CSV files. The tool needs to be extremely fast and memory-efficient. It also needs to be well written, well commented, and extensible.
## Deliverables
Speed: able to process a 30MB (uncompressed) access log in less than 2 hours on a single Linux system with a 3GHz Xeon CPU, never exceeding 1GB total Resident Set Size memory usage
Arguments:
[ -t X ] maximum number of seconds between subsequent hits that still counts as continuous surfing
[ -d 1|2 ] debug level: 1 produces some debug output, 2 produces copious debug output
[ -v ] verbose output if given
[ -p ] show an ASCII-based progress bar
[ -f filename ] use the given proxy log file as input; the file may be cleartext, gzipped, or bzip2-compressed (the program must autodetect the file format and process accordingly)
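The autodetection requirement can be met by sniffing the file's first bytes before choosing a decompressor: gzip streams begin with 0x1f 0x8b, and bzip2 streams with "BZh". A minimal sketch (the enum and function names are my own, not part of the spec):

```c
#include <stdio.h>
#include <string.h>

/* Input format, detected from the file's leading bytes. */
enum log_format { FMT_PLAIN, FMT_GZIP, FMT_BZIP2 };

/* gzip files start with 0x1f 0x8b; bzip2 files start with "BZh".
 * Anything else is treated as cleartext. */
enum log_format detect_format(const unsigned char *magic, size_t len)
{
    if (len >= 2 && magic[0] == 0x1f && magic[1] == 0x8b)
        return FMT_GZIP;
    if (len >= 3 && magic[0] == 'B' && magic[1] == 'Z' && magic[2] == 'h')
        return FMT_BZIP2;
    return FMT_PLAIN;
}
```

Actual decompression would then go through zlib/libbz2 or a pipe to gunzip/bunzip2, either of which falls under the buyer's third-party-component clause below and would need agreement.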
The CLF fields that map to the CSV summary fields must be defined in an include file, so that file formats and the summary tool can easily be modified in future.
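One way to satisfy the include-file requirement is a header of 1-based field constants; the parser then refers only to these names, so a different log layout means editing one file. The header name, the status-code field, and the exact comments are my assumptions, inferred from the field numbers given below and the sample log entries:

```c
/* logfields.h -- hypothetical include file mapping CLF fields to the
 * CSV summary columns.  Field numbers are 1-based, matching the spec
 * ("field 1 is the first entry").  Change a constant here to retarget
 * the tool at a different log layout without touching the parser. */
#ifndef LOGFIELDS_H
#define LOGFIELDS_H

#define FIELD_IP        1   /* source IP address                  */
#define FIELD_USER      3   /* user name, '-' if absent           */
#define FIELD_DATETIME  4   /* [dd/Mon/yyyy:hh:mm:ss              */
#define FIELD_URL       6   /* requested URL                      */
#define FIELD_STATUS    7   /* HTTP response code (from samples)  */
#define FIELD_BYTES     8   /* object size in bytes               */
#define FIELD_CATEGORY  9   /* [CATEGORY ...                      */

#define MAX_CATEGORIES  256 /* spec: never more than 256 unique   */

#endif /* LOGFIELDS_H */
```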
CSV summary file format (for non-zero-length or 2XX HTTP response URLs only): User/IP address, Site, Category, Bytes, Number, Time
CSV summary is to be sorted by User/IP, Site, Category
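The required three-key sort falls out naturally from a qsort(3) comparator over the aggregated rows. The struct layout and field sizes below are assumptions for illustration; the spec only fixes the column order:

```c
#include <stdlib.h>
#include <string.h>

/* One aggregated CSV row (layout is an assumption, not given in the
 * spec). */
struct summary {
    char user[64];      /* user name or IP address */
    char site[256];     /* hostname                */
    char category[64];  /* e.g. "BANKING"          */
    unsigned long bytes;
    unsigned long number;
    unsigned long seconds;
};

/* qsort comparator implementing the required User/IP, Site, Category
 * ordering. */
int cmp_summary(const void *a, const void *b)
{
    const struct summary *x = a;
    const struct summary *y = b;
    int r;
    if ((r = strcmp(x->user, y->user)) != 0) return r;
    if ((r = strcmp(x->site, y->site)) != 0) return r;
    return strcmp(x->category, y->category);
}
```

Usage would be a single `qsort(rows, n, sizeof rows[0], cmp_summary);` just before writing the CSV out.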
Description of CSV format required
(All proxy log fields referred to are space-separated; field 1 is the first entry)
User/IP address
---------------
If the user (field 3) is not '-', use the user name; otherwise summarise using the IP address
Site
---------------
Use the hostname in field 6 (remove the leading scheme://, remove the trailing path after the hostname, and remove the :port if specified)
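The three stripping rules above reduce to skipping everything up to a "://" and then cutting at the first '/' or ':'. A sketch, with the function name and output-buffer convention assumed:

```c
#include <stdio.h>
#include <string.h>

/* Extract the bare hostname from a URL: strip a leading "scheme://",
 * a trailing ":port", and anything after the first '/'.  Writes at
 * most outlen-1 characters plus a NUL terminator into out. */
void extract_site(const char *url, char *out, size_t outlen)
{
    const char *p = strstr(url, "://");
    size_t n, i;

    if (p != NULL)
        url = p + 3;            /* skip "scheme://"          */

    n = strcspn(url, "/:");     /* host ends at path or port */
    if (n >= outlen)
        n = outlen - 1;
    for (i = 0; i < n; i++)
        out[i] = url[i];
    out[n] = '\0';
}
```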
Category
---------------
Use field 9 and remove the leading '['; there will never be more than 256 unique categories.
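The 256-category guarantee invites interning category names into a small fixed table, so each log entry carries only a small integer id. A sketch of one way to do that (the function name, table sizes, and return convention are my assumptions):

```c
#include <string.h>

#define MAX_CATEGORIES 256  /* spec: never more than 256 unique */

/* Fixed intern table; with at most 256 entries a linear search is
 * both simple and fast enough. */
static char categories[MAX_CATEGORIES][64];
static int ncategories = 0;

/* Map a raw field-9 token such as "[BANKING" to a small integer id,
 * stripping the leading '['.  Returns -1 if the table is full. */
int intern_category(const char *field9)
{
    int i;
    if (*field9 == '[')
        field9++;
    for (i = 0; i < ncategories; i++)
        if (strcmp(categories[i], field9) == 0)
            return i;
    if (ncategories == MAX_CATEGORIES)
        return -1;
    strncpy(categories[ncategories], field9, 63);
    categories[ncategories][63] = '\0';
    return ncategories++;
}
```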
Bytes
---------------
Use field 8 (this is already in bytes)
Number
---------------
Number of objects retrieved (one log entry is one object)
Time
--------------
This value is to be in seconds.
An arbitrary value is given to the program via the '-t' optarg; it defines the length of time in seconds between HTTP object retrievals that is still considered continuous surfing by a user/IP (defaults to 120 seconds if no -t option is provided).
Time is calculated by measuring the date/time (field 4) difference between subsequent log entries for the same user/IP address.
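Under one reading of the rule above, the surfing time is the sum of all inter-hit gaps that do not exceed the -t threshold; gaps larger than the threshold end a surfing period and contribute nothing. A sketch of that accumulation, assuming the timestamps have already been parsed from field 4 and grouped per user/IP (the spec does not pin down the exact accounting, so this is an interpretation):

```c
/* Accumulate continuous-surfing seconds for one user/IP's stream of
 * hits.  ts[] holds per-hit timestamps (seconds since epoch) in log
 * order; a gap of at most threshold seconds between consecutive hits
 * counts as continuous surfing and is added to the total.  The
 * threshold defaults to 120 when no -t option is given. */
long surf_seconds(const long *ts, int n, long threshold)
{
    long total = 0;
    int i;
    for (i = 1; i < n; i++) {
        long gap = ts[i] - ts[i - 1];
        if (gap <= threshold)
            total += gap;   /* same surfing period continues */
        /* else: period ended; a new one starts at ts[i] */
    }
    return total;
}
```

For example, hits at t = 0, 60, 120, then 1000, 1030 with the default 120-second threshold give 60 + 60 + 30 = 150 seconds across two surfing periods.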
From the CSV summary file, the following information should be ascertainable:
The number of web sites user X surfed in the COMPUTING category.
The amount of data user X downloaded from site Y.
The number of objects IP address X retrieved from site Y.
The time user X spent surfing category BANKING.
Sample CSV summary entry
[login to view URL],[login to view URL],BANKING,10965,24,185
Description of above
Source IP address [login to view URL] downloaded 24 HTTP objects totalling 10965 bytes from
site [[login to view URL]][1], which falls into category BANKING, and was doing so
continuously for (possibly multiple) periods totalling 185 seconds.
Sample proxy log entries
[login to view URL] - 3000 [5/Mar/2004:07:12:13 +1000] "<[login to view URL]>" 200 673 [COMPUTING "Report All" "COMPUTING"]
[login to view URL] - 3000 [5/Mar/2004:07:12:13 +1000] "<[login to view URL]>" 200 673 [COMPUTING "Report All" "COMPUTING"]
[login to view URL] - [login to view URL] [5/Mar/2004:07:54:07 +1000] "[login to view URL]" 200 0 [BANKING "Report All" "BANKING"]
Coding language: ANSI C, source code required (not C++)
All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
## Platform
Coding language: ANSI C, source code required (not C++)