Sysadmin Decal: lecture 4

Comprehensive System Administration

Lecture 4 notes:

administrivia:
Enrollment
Any problems? Please email me to get on the course announcement list:
jones@csua.berkeley.edu

Lecture Notes
Last week's notes are now available.

Homework?
Last week's homework assignment was ill-defined (until now), and so is
due next week rather than today. Check last weeks notes for the
assignment details.

Since last weeks homework was a bit long, this weeks will be shorter than usual.

Getting Help
If you have trouble, please email me, or one of the other teachers.
The OCF staffers are often around the lab, and can be helpful.
Don't forget your classmates can also be helpful.

Questions?

course material:
Today Michael Constant will be leading the class.

So far we've covered basic unix commands, and how to edit files with vi.

Today we'll get into using utilities to examine text files such as the
system's web logs. System Administrators often find it useful to comb
through large amounts of data, such as logfiles, to discover properties
of that data. For instance, you could examine the logs generated by your
webserver to learn which pages are your most popular pages.

Goals:
What is a text utility?
What is a logfile?
What do the following utilities do?
cut
sort
uniq
grep

Be aware of the following utilities:
awk, sed, perl...

Some other old utilities, not often used now are:
paste
tr

There are other text utilities that are useful in specific contexts.
od, strings
comm, cmp
diff

What are pipes and redirection?
How can we use pipes with the above commands to learn things about our data?

What is a text utility?
Text utilities are some of the oldest programs in the unix canon. In
the early days of unix, the philosophy was to create lots of little commands to
do very specific things. Later, especially with the development that went on
at UC Berkeley in the 80s, programs started becoming more generalized, getting
more and more flags (options). Unix text utilities are little programs that
were created to allow users to extract useful information from large text
files.

What is a logfile, and how can we get useful info from it?
Do you have a webpage on the OCF? Have you looked at a weblog analysis before?

The "access log" is the textfile in which the webserver makes a record of every
time someone views a webpage.

each line looks like this:

169.229.76.87 - - [25/Feb/2001:00:10:37 -0800] "GET /~superb/comedy/equation.gif HTTP/1.1" 200 15081 "http://www.ocf.berkeley.edu/~superb/comedy/comedy.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"

There are several pieces of information on this line, from left to right:
169.229.76.87 - the IP address of the requestor
the first dash - the email address of the requestor (no longer used, due to spam)
the 2nd dash - the username for the requestor if they had to log in to the site
[timestamp] - the date, time and displacement from UTC
"GET stuff" - what request was made to the server, using what protocol
first number - the status code of the request. 200 means success
second number - the number of bytes transfered
"URL" - the referrer URL (the page the request came from)
"Mozilla" - information about the user's environment.

side note:
if you're curious about who was making the request, you can translate
the IP address to a hostname using nslookup:

% nslookup 169.229.76.87
Server: apocalypse.OCF.Berkeley.EDU
Address: 128.32.191.249

Name: fre-76-87.Reshall.Berkeley.EDU
Address: 169.229.76.87

So someone in the dorms with a win98 box was looking at SUPERB's website.

The access logs are normally found in:

/var/log/httpd/access.log

on the OCF, you will find it at:

/services/http-logs/access_log

if you are on another system and you can't find the logs,
you can use the locate command (locate access.log)

If you look in that directory, you see:

% ls
access_log access_log.small error_log.6.gz ssl_mutex.19614
access_log.1.gz cgiwrap.log error_log.7.gz ssl_mutex.26565
access_log.2.gz cgiwrap.log.1.gz error_log.8.gz ssl_mutex.28411
access_log.3.gz cgiwrap.log.2.gz error_log.9.gz ssl_mutex.340
access_log.4.gz error_log httpd.pid ssl_mutex.341
access_log.5.gz error_log.1.gz old_logs@ ssl_mutex.35
access_log.6.gz error_log.2.gz phf_log ssl_request_log
access_log.7.gz error_log.3.gz phf_log.1.gz ssl_scache.dir
access_log.8.gz error_log.4.gz ssl_engine_log ssl_scache.pag
access_log.9.gz error_log.5.gz ssl_mutex.19499 stats/

There are several types of logs, and you'll notice that each log has multiple
versions, ie: access_log.#.gz -those files are the logs from previous weeks.
The way we've got our system set up, every Saturday we take the current
access_log file, rename it access_log.1 and compress it with the gnu zip (gzip)
program. If there is already a access_log.1.gz, that file is first renamed
access_log.2.gz, and so on. This process is called log rotation. If
there is an access_log.9.gz, that will be thrown away because we've decided not
to keep that data around. We could keep all our old data, but we would
eventually run out of harddisk space.

Since the class is being held on Friday, and the logs are rotated on Saturdays,
the access_log is nearly as large as it could possibly be. Let's take a look:

% wc -l access_log
1098889 19148553 221709037 access_log
% ls -l access_log
-rw-r--r-- 1 root www 222091324 Mar 2 17:08 access_log

The access_log is over a million lines, and 200 megs! When we all started
looking at it over the network, the network tried to transfer that 200 meg file
20+ times over the network simultaneously. That caused everything to slow
down.

In order to actually be able to do some real work, we'll use a much smaller
version of this file, access_log.small, which has only 5000 lines.

Before we get into actually processing the log, let's look at the logfile
being updated in real time, as people check out our webpages. To do this,
do "tail -f access_log"

Remember that tail gives us the last 10 lines of a file. The "-f"
option causes tail to wait at the end of the file for additional data to
be appended. When the data is appended, tail displays it, and waits for
more again. This way, you can see the record of each new web-hit as it
happens! Oftentimes, you'll see a large number of requests for similar
pages one after another. This is because if a page has a lot of graphics,
each picture causes the browser to make a request.

While you're viewing the output of "tail -f access_log", whatever text you
enter will be ignored, so you can safely hit return repeatedly to separate
out different log entries.

To stop viewing the output of "tail -f", hit control-C.

Ok, as we discovered above, the access_log is really huge, and growing
all the time, so we're going to use a copy of a smaller section of the
logfile, so we can actually get some results in a reasonable amount of
time.

So, say you wanted to see who was looking at your webpages:

for Devin, who has the login name "jones", hits for his pages will
contain the string "jones"

# cd /services/http-logs
# grep jones access_log.small
(grep returns all the lines containing a string in a file)

...[output]

Now let's see how man page views I got by piping this through wc (wordcount)
# grep jones access_log.small | wc -l
19

(the "-l" flag tells wc to report only the number of lines read in on input
-so my pages were requested 19 times)

Now lets see how many different people viewed my webpage by looking at how many
unique IP addresses requested my webpage. We use the cut command, which
returns only a certain column of text in a line, where the column is
delimited by a special character (in this case, we ask for column 1, where
the columns are separated by blank spaces, " ").

To do this, we use cut:

# grep jones access_log.small | cut -f1 -d" "

192.58.221.223
192.58.221.223
192.58.221.223
192.58.221.223
192.58.221.223
192.58.221.223
192.58.221.223
192.58.221.223
192.58.221.223
192.58.221.236
192.58.221.236
192.58.221.236
192.58.221.236
192.58.221.236
192.58.221.236
192.58.221.236
192.58.221.236
192.58.221.236
192.58.221.236

Here we can see by just looking that there were only two different hosts that
viewed my webpage.

What if there were many many more?
Let's look at the case where we're investigating who was looking
at the most pages on the OCF in general?

First, lets look at who was viewing pages on the OCF in general:

# cut -f1 -d" " access_log.small

...[long output]

this is way too much information to get any understanding.

We can:

# cut -f1 -d" " access_log.small | sort | uniq -c | sort -n

...[lots of output ending in:]
58 144.132.32.6
58 199.233.182.11
59 209.115.94.143
69 208.226.118.187
77 169.229.92.60
91 208.199.82.216
93 64.219.68.26
116 131.243.227.16
185 192.58.221.172
875 24.176.252.198

What about those commands? You can read the manpages to get lots of juicy
information. But briefly, sort sorts the lines of a file, and the
-n option sorts in true numerical order (otherwise alphabetically, "10" is
less than "3" because the value for the character "1" is less than the
value of the character "3"). uniq compresses adjacent same lines
of a file, and the -c option preppends a count of how many adjacent same
lines were found.

So what do the results mean? The last entry "875 24.176.252.198" means that
the IP address 24.176.252.198 requested OCF pages 875 times.

a side note:
So what were they looking at? We can use grep to find out:
% grep 24.176.252.198 access_log.small
...[875 lines of output ending in:]
24.176.252.198 - - [02/Mar/2001:13:47:48 -0800] "GET /~slc/diskwave.gif
HTTP/1.0" 304 - "-" "Mozilla/4.75 [en] (Win98; U)"

So, they are requesting the diskwave image again and again, about once a
second. Why? It's hard to say, but we can see their status code is 304, which
indicates that the request was caused by a redirect, probaly in a loop.
Unfortunately the referring URL is blank: "-" so we don't know where the
redirect is coming from. Hopefully the person at 24.176.252.198 (aka
c636130-a.pinol1.sfba.home.com) will notice this and close the window.
We may get into what we could do to solve this problem in a future class.

You can see how you could use this to find lots of interesting information.
Say you had a huge increase in the hits to your webserver, and you wanted to
find out which user(s) were causing the load increase.

this introduces us to the topic of regular expressions. As Michael said, "if
you ever thought unix wasn't cryptic and confusing, and actually things are
fairly straightforward, the answer is no, unix is actually cryptic and
confusing." Check out the links below in the enrichment section to learn more
about regexps.

While this mostly works, you could also do the following:

# cat access_log.small | awk '{print $7}' | awk -F/ '{print $2}' | tr -d '~' |
tr -d '%7E' | sort | uniq -c | sort -n

this also gives you an idea about what's going on, but breaks in certain cases.
(You can see the difference in results by running both commands and finally
piping them through "tail -10". The correct way to do this would be to
convert the escaped html characters (i.e. you would convert %7E to ~) and to
process all non-user web hits separately - all hits for pages on the OCF's
main webpage and all the pages below it do not have ~ characters in the
requests. The first method ignores those anomolies, while the second tries
work around them).

Using redirection

A quick review:
for unix commands, you can specify the input comes from and where the
output goes using >,<, and |. Where > and < are used for files,
and | is used for other programs. For example to sort the lines in inputfile
and send the results to outputfile you could invoke:

< inputfile sort > outputfile

sort < inputfile > outputfile

either will send inputfile to sort on the standard input, and redirect the
standard output to the file named outputfile.

sort also takes a filename as an arguement, so

sort inputfile >outputfile

will also do the same thing, and this way of processing text (taking input
from the standard input or a file, and sending it to standard output) is the
standard unix convention.

Homework #3

Questions:
What is log rotation?

Tasks:
How many requests for files under OCF's root webpage are there in
access_log.small? What commands did you use to find this number?

How many email messages from jones@csua.berkeley.edu
are in your inbox. What commands did you use to find this number?
(note: mailspools are usually found in /var/mail/username or
/var/mail/spool/username, or /var/spool/username.)

What to turn in:

Class Feedback:

Do you have any questions about the class?
Do you feel you're learning what you want to learn? What would you like to learn?
Do you feel that the lectures are going too fast/too slow/just right?

mail the answers to the above questions to:
jones@csua.berkeley.edu
(note that's not jones@ocf, mail to jones@ocf may go unnoticed).

enrichment

Rembmer to look at the manpages for the commands that you've learned.

To learn more about redirection and your shell, read the manpage for your shell.
(most likely your shell is one of: csh, tcsh, or bash)

There are lots of ways to get interesting text out of files, if you're looking
for inspiration, do a google search for "unix text utility".

If you're a programmer, you might want to check out indent, and nm.

The following links might be interesting:

GNU text utilities

Unix Reference Desk

Logging with the Apache Webserver

regexps:

regular expression tutorial
so what's a $#!%% regular expression, anyway?!