System Administration for the Web

System Administration for the Web - Day 5 notes

Notes:
Administrivia:

Homework from before spring break due next week

Course Notes:

questions..
shell scripts
cron jobs and log rotation
text utilities and datamining

cut
sort
uniq
grep

pipes and redirection

Shell Scripts
Shell scripts are simple programs written with the commands that the shell can
understand, including its own built-in commands (which are listed on a shell's
manpage) as well as any other programs on the system.

Shell scripts are used to perform simple system operations like starting
daemons (see below), and performing accounting. In recent years, scripting
languages such as perl have risen to prominence because of their usefulness
in writing programs for the web.

Most shell scripts used by the system are written in sh's scripting language
(but that doesn't stop people from writing scripts in csh, bash, ksh or any other
shell's language).

We'll talk more about shell scripting next week. For now, just take a look at what
a shell script looks like.

One thing to note is that the first line in the file is "magic": it tells the
system to interpret the file with the program specified after the #! in the first
line, in this case with the basic shell sh. Because the system must first
execute the file, and the shell must then read the file in as its input, the user
trying to run the script must have both read and execute permissions on the script.
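
Here's a rough sketch of both points, the #! line and the permissions. It isn't from class: the script name hello.sh and the scratch directory are made up for illustration.

```shell
#!/bin/sh
# Hypothetical demo: hello.sh and the scratch directory are made-up names.
demo_dir=$(mktemp -d)

# Write a two-line script whose first line names the interpreter.
cat > "$demo_dir/hello.sh" <<'EOF'
#!/bin/sh
echo "interpreted by sh"
EOF

# Give the owner read and execute permission so the script can be run by name.
chmod u+rx "$demo_dir/hello.sh"

hello_output=$("$demo_dir/hello.sh")
echo "$hello_output"
```

If you take the chmod line out, running the script by name should fail with a "permission denied" error, although "sh hello.sh" would still work because then only read permission is needed.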

Example: /etc/rc2.d/S99audit

#!/sbin/sh
#
# Copyright (c) 1997 by Sun Microsystems, Inc.
# All rights reserved.
#
#ident "@(#)audit 1.5 97/12/08 SMI"

case "$1" in
'start')
        if [ -f /etc/security/audit_startup ]; then
                echo 'starting audit daemon'
                /etc/security/audit_startup
                /usr/sbin/auditd &
        fi
        ;;

'stop')
        if [ -f /etc/security/audit_startup ]; then
                /usr/sbin/audit -T
        fi
        ;;

*)
        echo "Usage: $0 { start | stop }"
        exit 1
        ;;
esac
exit 0

cron jobs

The cron daemon is a process that executes prescheduled tasks on an
automated basis.

cron is typically used to run automated backups, do system accounting, clean up
temporary files, distribute data to remote computers, remind people about
important dates, and do anything else that needs to be done on a routine basis.

The example given in class was a routine job of rotating the web-server logs. A
simple implementation might look like this (a shell script):

#!/bin/sh
# -f: don't complain the first time, when there is no old log to remove
rm -f /var/logs/apache/access.log.1.gz
mv /var/logs/apache/access.log /var/logs/apache/access.log.1
gzip /var/logs/apache/access.log.1

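This would only give you one rotation of log history, and a real script would
also have to make the webserver reopen its log file after the rename, but you
get the idea. Many Linux distributions come with "logrotate", and Apache comes
with "rotatelogs", to facilitate this process.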

There are several different versions of cron, and unfortunately the OCF
doesn't allow general-user access to cron. Access to cron is controlled
with the /etc/cron.d/cron.allow and /etc/cron.d/cron.deny
files.

A brief introduction nonetheless:

on AT&T style systems, a user submits a file (called a crontab) by:
crontab mycrontab

if the user already has a crontab, she can retrieve it with:
crontab -l > mycrontab

The crontab file contains entries in the following format:

minute hour day month weekday command
with the ranges: 0-59 0-23 1-31 1-12 0-6 (0 = Sunday)

the command is a command or script recognizable by sh.

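Examples (from the crontab manpage):

Mail a birthday greeting (at noon, on Feb. 14th):

0 12 14 2 * mailx john%Happy Birthday!%Time for lunch.

(the first % ends the command, and the text after it is fed to mailx on its
standard input, with each later % turned into a newline.)

Run a command only on Mondays (at midnight):

0 0 * * 1 command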
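
To tie this back to the log rotation job, here's a sketch of a crontab entry that would run a rotation script every Saturday at midnight. The script path /usr/local/sbin/rotate-logs is hypothetical, and the actual install step is left commented out because it replaces whatever crontab you already have.

```shell
#!/bin/sh
# Sketch only: /usr/local/sbin/rotate-logs is a hypothetical script path.
crontab_file=$(mktemp)

# Fields: minute hour day-of-month month weekday command
# 0 0 * * 6  ->  00:00 on every Saturday (weekday 6, counting from Sunday = 0)
cat > "$crontab_file" <<'EOF'
0 0 * * 6 /usr/local/sbin/rotate-logs
EOF

# Sanity check: five time fields plus a one-word command makes six fields.
field_count=$(awk '{print NF}' "$crontab_file")
echo "fields: $field_count"

# To install it (this replaces any crontab you already have):
#   crontab "$crontab_file"
```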

Text Utilities

Now we'll get into using utilities to examine text files such as the
system's web logs. System Administrators often find it useful to comb
through large amounts of data, such as logfiles, to discover properties
of that data. For instance, you could examine the logs generated by your
webserver to learn which pages are your most popular pages.

What is a text utility?

Text utilities are some of the oldest programs in the unix canon. In
the early days of unix, the philosophy was to create lots of little commands to
do very specific things. Later, especially with the development that went on
at UC Berkeley in the 80s, programs started becoming more generalized, getting
more and more flags (options). Unix text utilities are little programs that
were created to allow users to extract useful information from large text
files.

What is a logfile, and how can we get useful info from it?
Do you have a webpage on the OCF? Have you looked at a weblog analysis before?

The "access log" is the textfile in which the webserver makes a record of every
time someone views a webpage.

each line looks like this:

169.229.76.87 - - [25/Feb/2001:00:10:37 -0800] "GET /~superb/comedy/equation.gif HTTP/1.1" 200 15081 "http://www.ocf.berkeley.edu/~superb/comedy/comedy.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"

There are several pieces of information on this line, from left to right:
169.229.76.87 - the IP address of the requestor
the first dash - the identity of the requestor as reported by identd (almost never available, so normally just a dash)
the 2nd dash - the username for the requestor if they had to log in to the site
[timestamp] - the date, time and displacement from UTC
"GET stuff" - what request was made to the server, using what protocol
first number - the status code of the request. 200 means success
second number - the number of bytes transferred
"URL" - the referrer URL (the page the request came from)
"Mozilla" - the user agent string: the browser and operating system the requestor reports

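As a rough sketch of how those pieces can be pulled apart on the command line, the following uses awk on the sample line quoted above. The variable names are made up, and splitting on spaces and double quotes like this assumes the log really is in the format shown.

```shell
#!/bin/sh
# The sample line is the one quoted in the notes above.
log_line='169.229.76.87 - - [25/Feb/2001:00:10:37 -0800] "GET /~superb/comedy/equation.gif HTTP/1.1" 200 15081 "http://www.ocf.berkeley.edu/~superb/comedy/comedy.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"'

# Splitting on spaces: field 1 is the requestor's address, field 7 the requested path.
client_ip=$(printf '%s\n' "$log_line" | awk '{print $1}')
request_path=$(printf '%s\n' "$log_line" | awk '{print $7}')

# Splitting on double quotes keeps the quoted fields whole:
# field 3 holds " status bytes ", field 4 the referrer, field 6 the user agent.
status_code=$(printf '%s\n' "$log_line" | awk -F'"' '{split($3, n, " "); print n[1]}')
byte_count=$(printf '%s\n' "$log_line" | awk -F'"' '{split($3, n, " "); print n[2]}')
referrer=$(printf '%s\n' "$log_line" | awk -F'"' '{print $4}')

echo "$client_ip requested $request_path: status $status_code, $byte_count bytes"
```
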
side note:
if you're curious about who was making the request, you can translate
the IP address to a hostname using nslookup:

% nslookup 169.229.76.87
Server: apocalypse.OCF.Berkeley.EDU
Address: 128.32.191.249

Name: fre-76-87.Reshall.Berkeley.EDU
Address: 169.229.76.87

So someone in the dorms with a win98 box was looking at SUPERB's website.

The access logs are normally found in:

/var/log/httpd/access.log

on the OCF, you will find it on our webserver, death, at:

/opt/httpd/logs/

if you are on another system and you can't find the logs,
you can use the locate command (locate access.log)

If you look in that directory, you see:

% ls
access_log access_log.small error_log.6.gz ssl_mutex.19614
access_log.1.gz cgiwrap.log error_log.7.gz ssl_mutex.26565
access_log.2.gz cgiwrap.log.1.gz error_log.8.gz ssl_mutex.28411
access_log.3.gz cgiwrap.log.2.gz error_log.9.gz ssl_mutex.340
access_log.4.gz error_log httpd.pid ssl_mutex.341
access_log.5.gz error_log.1.gz old_logs@ ssl_mutex.35
access_log.6.gz error_log.2.gz phf_log ssl_request_log
access_log.7.gz error_log.3.gz phf_log.1.gz ssl_scache.dir
access_log.8.gz error_log.4.gz ssl_engine_log ssl_scache.pag
access_log.9.gz error_log.5.gz ssl_mutex.19499 stats/

There are several types of logs, and you'll notice that each log has multiple
versions, i.e. access_log.#.gz. Those files are the logs from previous weeks.
The way we've got our system set up, every Saturday we take the current
access_log file, rename it access_log.1 and compress it with the GNU zip (gzip)
program. If there is already an access_log.1.gz, that file is first renamed
access_log.2.gz, and so on. This process is called log rotation. If
there is an access_log.9.gz, that will be thrown away because we've decided not
to keep that data around. We could keep all our old data, but we would
eventually run out of hard disk space.
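
The rotation scheme just described can be sketched as a small sh function. This is not the OCF's actual script: rotate_log and its arguments are made-up names, it runs here on a scratch copy, and a real version would also have to tell the webserver to reopen its log file.

```shell
#!/bin/sh
# Sketch only: rotate_log and its arguments are hypothetical names.
rotate_log() {
    log=$1          # path to the live log, e.g. access_log
    keep=$2         # number of compressed generations to keep, e.g. 9

    # Throw away the oldest generation, then shift each remaining one up by one.
    rm -f "$log.$keep.gz"
    i=$((keep - 1))
    while [ "$i" -ge 1 ]; do
        if [ -f "$log.$i.gz" ]; then
            mv "$log.$i.gz" "$log.$((i + 1)).gz"
        fi
        i=$((i - 1))
    done

    # Rename the live log to generation 1 and compress it.
    if [ -f "$log" ]; then
        mv "$log" "$log.1"
        gzip "$log.1"
    fi
}

# Try it on a scratch copy rather than on real logs.
scratch_dir=$(mktemp -d)
echo "week one" > "$scratch_dir/access_log"
rotate_log "$scratch_dir/access_log" 9
echo "week two" > "$scratch_dir/access_log"
rotate_log "$scratch_dir/access_log" 9
ls "$scratch_dir"
```

Working from the highest number down matters: shifting .1.gz to .2.gz before .2.gz has moved out of the way would overwrite a week of data.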

Let's take a look at how large a recent access_log file is:

death [158] zcat access_log.1.gz | wc -l
2017601
> du -k access_log.1
412728 access_log.1

The access_log is over two million lines, and 400 megs! Using such a large file
will slow everything down. In order to actually be able to do some real work,
we'll use a much smaller version of this file, access_log.small, which has only
5000 lines.
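
The notes don't say how access_log.small was made, but one plausible way to cut a sample like it is with tail. In this sketch a generated stand-in file takes the place of the real log, and the directory name is made up.

```shell
#!/bin/sh
# Sketch only: a generated stand-in file takes the place of the real access_log.
work_dir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 20000; i++) print "line " i }' > "$work_dir/access_log"

# Keep only the most recent 5000 lines.
tail -n 5000 "$work_dir/access_log" > "$work_dir/access_log.small"

small_lines=$(wc -l < "$work_dir/access_log.small" | tr -d ' ')
echo "$small_lines lines"
```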

Before we get into actually processing the log, let's look at the logfile
being updated in real time, as people check out our webpages. To do this,
do "tail -f access_log"

Remember that tail gives us the last 10 lines of a file. The "-f"
option causes tail to wait at the end of the file for additional data to
be appended. When the data is appended, tail displays it, and waits for
more again. This way, you can see the record of each new web-hit as it
happens! Oftentimes, you'll see a large number of requests for similar
pages one after another. This is because if a page has a lot of graphics,
each picture causes the browser to make a request.

While you're viewing the output of "tail -f access_log", whatever text you
enter will be ignored, so you can safely hit return repeatedly to separate
out different log entries.

To stop viewing the output of "tail -f", hit control-C.

Ok, as we discovered above, the access_log is really huge, and growing
all the time, so we're going to use a copy of a smaller section of the
logfile, so we can actually get some results in a reasonable amount of
time.

So, say you wanted to see who was looking at your webpages:

for Ken's account, with the login name "kenao", hits for his pages will
contain the string "kenao"

# cd /opt/httpd/logs
# grep kenao access_log.small
(grep returns all the lines containing a string in a file)

...[output]

Now let's see how many page views Ken got by piping this through wc (word count):
# grep kenao access_log.small | wc -l
6

(the "-l" flag tells wc to report only the number of lines read in on input,
so Ken's pages were requested 6 times in those 5000 lines)
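
If you want to try the counting step without access to the OCF logs, here's a small sketch on made-up data (the addresses and the second login name are hypothetical). It also shows grep's own -c flag, which counts matching lines without needing wc.

```shell
#!/bin/sh
# Made-up sample data: the addresses and the second login name are hypothetical.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
192.0.2.10 - - [09/Apr/2002:01:00:00 -0700] "GET /~kenao/index.html HTTP/1.0" 200 512
192.0.2.11 - - [09/Apr/2002:01:00:05 -0700] "GET /~someuser/index.html HTTP/1.0" 200 900
192.0.2.10 - - [09/Apr/2002:01:00:09 -0700] "GET /~kenao/photo.gif HTTP/1.0" 200 2048
EOF

# Count matching lines by piping grep into wc...
hits_via_wc=$(grep kenao "$sample_log" | wc -l | tr -d ' ')

# ...or let grep do the counting itself with -c.
hits_via_grep=$(grep -c kenao "$sample_log")

echo "$hits_via_wc $hits_via_grep"
```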

Now let's see how many different people viewed his webpage by looking at how many
unique IP addresses requested it. We use the cut command, which
returns only a certain column of text in a line, where the columns are
delimited by a special character (in this case, we ask for column 1, where
the columns are separated by blank spaces, " "):

# grep kenao access_log.small | cut -f1 -d" "

66.77.73.235
66.77.73.235
66.77.73.235
66.77.73.235
66.77.73.235
66.77.73.235

Here we can see by just looking that there was only one host that
viewed his webpage.
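
When the list is too long to eyeball, you can have the shell count the distinct addresses for you. A minimal sketch, using made-up lines standing in for the grep output:

```shell
#!/bin/sh
# Made-up input standing in for the output of: grep kenao access_log.small
matches=$(mktemp)
cat > "$matches" <<'EOF'
192.0.2.10 - - [09/Apr/2002:01:00:00 -0700] "GET /~kenao/index.html HTTP/1.0" 200 512
192.0.2.44 - - [09/Apr/2002:01:02:00 -0700] "GET /~kenao/index.html HTTP/1.0" 200 512
192.0.2.10 - - [09/Apr/2002:01:03:00 -0700] "GET /~kenao/photo.gif HTTP/1.0" 200 2048
EOF

# Column 1 is the address; sort -u keeps one copy of each; wc -l counts them.
unique_hosts=$(cut -f1 -d" " "$matches" | sort -u | wc -l | tr -d ' ')
echo "$unique_hosts distinct hosts"
```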

What if there were many many more?
Let's look at the case where we're investigating who was looking
at the most pages on the OCF in general.

First, let's look at who was viewing pages on the OCF in general:

# cut -f1 -d" " access_log.small

...[long output]

This is way too much information to make any sense of.

We can summarize it by counting how many times each address appears (as we
discussed in class):

# cut -f1 -d" " access_log.small | sort | uniq -c | sort -n

...[lots of output ending in:]
87 160.81.214.230
89 169.229.112.109
95 12.232.222.68
105 12.233.255.82
120 63.196.242.143
127 171.64.75.149
148 169.229.118.101
151 64.130.184.181
167 66.81.133.155
272 64.162.91.162
344 64.167.150.48

What about those commands? You can read the manpages to get lots of juicy
information. But briefly, sort sorts the lines of a file, and the
-n option sorts in true numerical order (otherwise the sort is alphabetical, and
"10" is less than "3" because the value for the character "1" is less than the
value of the character "3"). uniq collapses adjacent identical lines
of a file into one, and the -c option prepends a count of how many adjacent
identical lines were found. That is why we sort before running uniq: sorting
puts identical addresses next to each other.
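
A tiny sketch on made-up addresses shows what each stage of that pipeline buys you. None of this data is from the OCF logs.

```shell
#!/bin/sh
# Made-up addresses, one per line, as cut -f1 -d" " would print them.
addresses=$(mktemp)
printf '%s\n' 192.0.2.7 192.0.2.9 192.0.2.7 192.0.2.7 192.0.2.9 192.0.2.1 > "$addresses"

# Without the first sort, uniq only merges lines that happen to be adjacent.
unsorted_groups=$(uniq -c "$addresses" | wc -l | tr -d ' ')

# With it, every address ends up in exactly one group.
sorted_groups=$(sort "$addresses" | uniq -c | wc -l | tr -d ' ')

# sort -n orders by the leading count, so the busiest address comes last.
busiest=$(sort "$addresses" | uniq -c | sort -n | tail -n 1 | awk '{print $2}')

# Plain sort compares character by character, so "10" lands before "3".
first_lexical=$(printf '3\n10\n' | sort | head -n 1)
first_numeric=$(printf '3\n10\n' | sort -n | head -n 1)

echo "$unsorted_groups $sorted_groups $busiest $first_lexical $first_numeric"
```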

So what do the results mean? The last entry "344 64.167.150.48" means that
the IP address 64.167.150.48 requested OCF pages 344 times.

a side note:
So what were they looking at? We can use grep to find out:
% grep 64.167.150.48 access_log.small
...[344 lines of output ending in:]
64.167.150.48 - - [09/Apr/2002:01:25:50 -0700] "GET /~rcsa/random/hcal5.gif
HTTP/1.1" 200 26765 "http://www.ocf.berkeley.edu/~rcsa/resources.html"
"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"

Looking at that output, we can see that they are looking at the pages of the
Regents and Chancellors Scholars Association, the Hindu Students Council, and
the Cal Pre-Law Association.
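
One caveat about a grep like that: grep treats its pattern as a regular expression, in which a dot matches any single character, and the pattern can match anywhere on the line. The sketch below uses made-up log lines to show the difference between the loose pattern and one that escapes the dots and anchors the address to the start of the line.

```shell
#!/bin/sh
# Made-up sample lines; only the first is a request from 64.167.150.48 itself.
sample=$(mktemp)
cat > "$sample" <<'EOF'
64.167.150.48 - - [09/Apr/2002:01:25:50 -0700] "GET /~rcsa/resources.html HTTP/1.1" 200 100
164.167.150.48 - - [09/Apr/2002:01:26:00 -0700] "GET /index.html HTTP/1.1" 200 100
192.0.2.5 - - [09/Apr/2002:01:27:00 -0700] "GET /hosts/64-167-150-48.html HTTP/1.1" 200 100
EOF

# Loose: each . matches any character and the match can start anywhere on the line.
loose_hits=$(grep -c '64.167.150.48' "$sample")

# Strict: ^ anchors to the start of the line, \. matches a literal dot,
# and the trailing space stops the match at the end of the address.
strict_hits=$(grep -c '^64\.167\.150\.48 ' "$sample")

echo "loose: $loose_hits strict: $strict_hits"
```

On a real log the loose pattern is usually close enough, but it's worth knowing why the counts can come out a little high.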

You can see how you could use this to find lots of interesting information.
Say you had a huge increase in the hits to your webserver, and you wanted to
find out which user(s) were causing the load increase.

this introduces us to the topic of regular expressions. My roommate Michael once
said, "if you ever thought unix wasn't cryptic and confusing, and actually things
are fairly straightforward, the answer is no, unix is actually cryptic and
confusing." Check out the links below in the enrichment section to learn more
about regexps.

While this mostly works, you could also do the following:

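# cat access_log.small | awk '{print $7}' | awk -F/ '{print $2}' | tr -d '~' |
sed 's/%7E//' | sort | uniq -c | sort -n

(sed is used for %7E because tr -d works on single characters: tr -d '%7E'
would delete every %, 7 and E, not the three-character string.)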

this also gives you an idea about what's going on, but breaks in certain cases.
(You can see the difference in results by running both commands and finally
piping them through "tail -10".) The correct way to do this would be to
convert the URL-escaped characters (i.e. you would convert %7E to ~) and to
process all non-user web hits separately, since hits for the OCF's
main webpage and all the pages below it do not have ~ characters in the
requests. The first method ignores those anomalies, while the second tries
to work around them.
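
Here's one possible sketch of that more careful approach, on made-up sample lines with hypothetical login names. It turns %7E back into ~ first, then counts hits per user and lumps everything without a ~ under a "(main site)" label (that label is my own invention).

```shell
#!/bin/sh
# Made-up sample lines; the login names are hypothetical.
sample=$(mktemp)
cat > "$sample" <<'EOF'
192.0.2.1 - - [09/Apr/2002:01:00:00 -0700] "GET /~alice/index.html HTTP/1.0" 200 10
192.0.2.2 - - [09/Apr/2002:01:00:01 -0700] "GET /%7Ealice/pic.gif HTTP/1.0" 200 10
192.0.2.3 - - [09/Apr/2002:01:00:02 -0700] "GET /~bob/ HTTP/1.0" 200 10
192.0.2.4 - - [09/Apr/2002:01:00:03 -0700] "GET /index.html HTTP/1.0" 200 10
EOF

# Field 7 is the requested path. Turn an escaped %7E (or %7e) back into ~,
# then label each hit with the user's name, or "(main site)" when there is no ~.
per_user=$(awk '{print $7}' "$sample" |
    sed 's/%7[Ee]/~/g' |
    awk -F/ '{ if ($2 ~ /^~/) print substr($2, 2); else print "(main site)" }' |
    sort | uniq -c | sort -n)

echo "$per_user"
top_user=$(echo "$per_user" | tail -n 1 | awk '{print $2}')
```

This only undoes the one escape mentioned above; a full solution would decode every %XX sequence.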

This is all fine and good, but for your most common questions, a program called
analog (short for "analyze log") will do the work for you and display it on a
pretty web-page.

Using redirection

A quick review:
for unix commands, you can specify where the input comes from and where the
output goes using >, <, and |. > and < are used for files,
and | is used for other programs. For example, to sort the lines in inputfile
and send the results to outputfile you could invoke either of:

< inputfile sort > outputfile

sort < inputfile > outputfile

either will send inputfile to sort on the standard input, and redirect the
standard output to the file named outputfile.

sort also takes a filename as an argument, so

sort inputfile >outputfile

will also do the same thing, and this way of processing text (taking input
from the standard input or a file, and sending it to standard output) is the
standard unix convention.
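
To convince yourself that the three forms really are equivalent, here's a quick sketch you can run in a scratch directory. The file contents and directory are made up; the pipe at the end is there to show | alongside > and <.

```shell
#!/bin/sh
# Hypothetical file names in a scratch directory.
scratch=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$scratch/inputfile"

# Three spellings of the same job.
< "$scratch/inputfile" sort > "$scratch/out1"
sort < "$scratch/inputfile" > "$scratch/out2"
sort "$scratch/inputfile" > "$scratch/out3"

# A pipe connects one program's standard output to the next one's standard input.
first_line=$(sort "$scratch/inputfile" | head -n 1)

cmp "$scratch/out1" "$scratch/out2" && cmp "$scratch/out2" "$scratch/out3" && echo "identical"
echo "$first_line"
```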

enrichment

Remember to look at the manpages for the commands that you've learned.

To learn more about redirection and your shell, read the manpage for your shell.
(most likely your shell is one of: csh, tcsh, or bash)

There are lots of ways to get interesting text out of files. If you're looking
for inspiration, do a Google search for "unix text utility".

If you're a programmer, you might want to check out indent, and nm.

The following links might be interesting:

Unix Reference Desk

Logging with the Apache Webserver

regexps:

so what's a $#!%% regular expression, anyway?!

Homework due next class:

Finish the assignment from last week. If you need help, send me an email, or ask in this public forum.

c.2002, Devin Jones - jones@csua.berkeley.edu