Comprehensive System Administration


Lecture 4 notes:

administrivia:

Enrollment: Any problems? Please email me to get on the course announcement list: jones@csua.berkeley.edu

Lecture Notes: Last week's notes are now available.

Homework: Last week's homework assignment was ill-defined (until now), and so is due next week rather than today. Check last week's notes for the assignment details. Since last week's homework was a bit long, this week's will be shorter than usual.

Getting Help: If you have trouble, please email me or one of the other teachers. The OCF staffers are often around the lab, and can be helpful. Don't forget that your classmates can also be helpful.

Questions?

course material:

Today Michael Constant will be leading the class.

So far we've covered basic unix commands, and how to edit files with vi. Today we'll get into using utilities to examine text files such as the system's web logs. System Administrators often find it useful to comb through large amounts of data, such as logfiles, to discover properties of that data. For instance, you could examine the logs generated by your webserver to learn which pages are your most popular pages.

Goals:

    What is a text utility?
    What is a logfile?
    What do the following utilities do? cut, sort, uniq, grep
    Be aware of the following utilities: awk, sed, perl...
    Some other old utilities, not often used now, are: paste, tr
    There are other text utilities that are useful in specific contexts: od, strings; comm, cmp; diff
    What are pipes and redirection?
    How can we use pipes with the above commands to learn things about our data?

What is a text utility?

Text utilities are some of the oldest programs in the unix canon. In the early days of unix, the philosophy was to create lots of little commands to do very specific things. Later, especially with the development that went on at UC Berkeley in the 80s, programs started becoming more generalized, getting more and more flags (options). Unix text utilities are little programs that were created to allow users to extract useful information from large text files.

What is a logfile, and how can we get useful info from it?

Do you have a webpage on the OCF? Have you looked at a weblog analysis before?

The "access log" is the text file in which the webserver makes a record of every time someone views a webpage. Each line looks like this:

    169.229.76.87 - - [25/Feb/2001:00:10:37 -0800] "GET /~superb/comedy/equation.gif HTTP/1.1" 200 15081 "http://www.ocf.berkeley.edu/~superb/comedy/comedy.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"

There are several pieces of information on this line, from left to right:

    169.229.76.87 - the IP address of the requestor
    the first dash - the identity of the requestor as reported by identd (RFC 1413); this is hardly ever available, so it is almost always a dash
    the 2nd dash - the username for the requestor if they had to log in to the site
    [timestamp] - the date, time and displacement from UTC
    "GET stuff" - what request was made to the server, using what protocol
    first number - the status code of the request. 200 means success
    second number - the number of bytes transferred
    "URL" - the referrer URL (the page the request came from)
    "Mozilla" - information about the user's environment

The access logs are normally found in:

    /var/log/httpd/access.log

On the OCF, you will find it at:

    /services/http-logs/access_log

If you are on another system and you can't find the logs, you can use the locate command (locate access.log).

If you look in that directory, you'll see that there are several types of logs, and that each log has multiple versions, i.e. access_log.#.gz. Those files are the logs from previous weeks. The way we've got our system set up, every Saturday we take the current access_log file, rename it access_log.1 and compress it with the GNU zip (gzip) program. If there is already an access_log.1.gz, that file is first renamed access_log.2.gz, and so on. This process is called log rotation. If there is an access_log.9.gz, it is thrown away because we've decided not to keep that data around. We could keep all our old data, but we would eventually run out of hard disk space.
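The rotation scheme described above can be sketched in a few lines of shell. This is only an illustration, not the OCF's actual rotation script: the function name rotate_log and its two arguments are made up for the example, and a real setup also has to tell the webserver to reopen its log file, which is left out here.

```shell
#!/bin/sh
# Sketch of the weekly rotation. rotate_log and its arguments are
# hypothetical names for this example, not part of any real tool.
rotate_log() {
    log=$1      # name of the live log, e.g. access_log
    keep=$2     # highest-numbered archive to keep, e.g. 9

    # Throw away the oldest archive, if there is one.
    rm -f "$log.$keep.gz"

    # Shift the others up by one: .8.gz -> .9.gz, ..., .1.gz -> .2.gz
    i=$((keep - 1))
    while [ "$i" -ge 1 ]; do
        if [ -f "$log.$i.gz" ]; then
            mv "$log.$i.gz" "$log.$((i + 1)).gz"
        fi
        i=$((i - 1))
    done

    # Rename the current log, compress it, and start a fresh empty log.
    mv "$log" "$log.1"
    gzip "$log.1"
    : > "$log"
}

# Try it in a scratch directory so no real logs are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1
echo "this week" > access_log
echo "last week" > access_log.1 && gzip access_log.1

rotate_log access_log 9
ls
```

After the call, the scratch directory holds an empty access_log, this week's data in access_log.1.gz, and last week's data in access_log.2.gz.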
Since the class is being held on Friday, and the logs are rotated on Saturdays, the access_log is nearly as large as it could possibly be. Let's take a look (with no options, wc reports lines, words and bytes):

    % wc access_log
    1098889 19148553 221709037 access_log
    % ls -l access_log
    -rw-r--r-- 1 root www 222091324 Mar 2 17:08 access_log

The access_log is over a million lines, and 200 megs! When we all started looking at it over the network, that 200 meg file had to be transferred 20+ times simultaneously, which caused everything to slow down. In order to actually be able to do some real work, we'll use a much smaller version of this file, access_log.small, which has only 5000 lines.

Before we get into actually processing the log, let's look at the logfile being updated in real time, as people check out our webpages. To do this, run "tail -f access_log". Remember that tail gives us the last 10 lines of a file. The "-f" option causes tail to wait at the end of the file for additional data to be appended. When data is appended, tail displays it, and waits for more again. This way, you can see the record of each new web-hit as it happens! Oftentimes, you'll see a large number of requests for similar pages one after another. This is because if a page has a lot of graphics, each picture causes the browser to make a separate request.

While you're viewing the output of "tail -f access_log", whatever text you enter will be ignored, so you can safely hit return repeatedly to separate out different log entries. To stop viewing the output of "tail -f", hit control-C.

OK, as we discovered above, the access_log is really huge and growing all the time, so from here on we'll work with access_log.small, a copy of a smaller section of the logfile, so we can actually get some results in a reasonable amount of time.
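To get a feel for wc and tail without touching the real log, here is a small sketch on a three-line scratch file. The file contents are made up, and the last step (using head to cut a short sample) is only an assumption about one way a file like access_log.small could be produced; the notes do not say how it was made.

```shell
#!/bin/sh
# A tiny made-up stand-in for access_log (hypothetical contents).
log=$(mktemp)
printf 'one two\nthree\nfour five six\n' > "$log"

# Plain wc prints three counts (lines, words, bytes) and then the file name.
wc "$log"

# wc -l prints only the line count. Reading from standard input keeps the
# file name out of the output; tr strips the padding some systems add.
lines=$(wc -l < "$log" | tr -d ' ')
echo "lines: $lines"

# tail prints the last 10 lines by default; -n 2 asks for the last 2.
last_two=$(tail -n 2 "$log")
echo "$last_two"

# Assumption: keep only the first N lines with head to make a short sample.
small=$(mktemp)
head -n 2 "$log" > "$small"
small_lines=$(wc -l < "$small" | tr -d ' ')
echo "sample lines: $small_lines"
```

"tail -f" is left out of the sketch on purpose, since it waits for new data until you hit control-C.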
So, say you wanted to see who was looking at your webpages. For Devin, who has the login name "jones", hits for his pages will contain the string "jones":

    # cd /services/http-logs
    # grep jones access_log.small
    ...[output]

(grep returns all the lines in a file that contain a given string.)

Now let's see how many page views I got by piping this through wc (word count):

    # grep jones access_log.small | wc -l
    19

(The "-l" flag tells wc to report only the number of lines read in on input, so my pages were requested 19 times.)

Now let's see how many different people viewed my webpage by looking at how many unique IP addresses requested it. We use the cut command, which returns only a certain column of text in a line, where the columns are delimited by a special character (in this case, we ask for column 1, where the columns are separated by blank spaces, " "):

    # grep jones access_log.small | cut -f1 -d" "
    192.58.221.223
    192.58.221.223
    192.58.221.223
    192.58.221.223
    192.58.221.223
    192.58.221.223
    192.58.221.223
    192.58.221.223
    192.58.221.223
    192.58.221.236
    192.58.221.236
    192.58.221.236
    192.58.221.236
    192.58.221.236
    192.58.221.236
    192.58.221.236
    192.58.221.236
    192.58.221.236
    192.58.221.236

Here we can see just by looking that only two different hosts viewed my webpage. What if there were many, many more? Let's look at the case where we're investigating who was looking at the most pages on the OCF in general. First, let's look at who was viewing pages on the OCF at all:

    # cut -f1 -d" " access_log.small
    ...[long output]

This is way too much information to make sense of. Instead, we can count the requests from each address:

    # cut -f1 -d" " access_log.small | sort | uniq -c | sort -n
    ...[lots of output ending in:]
    58 144.132.32.6
    58 199.233.182.11
    59 209.115.94.143
    69 208.226.118.187
    77 169.229.92.60
    91 208.199.82.216
    93 64.219.68.26
    116 131.243.227.16
    185 192.58.221.172
    875 24.176.252.198

What about those commands? You can read the manpages to get lots of juicy information.
But briefly, sort sorts the lines of a file, and the -n option sorts in true numerical order (otherwise the sort is alphabetical, and "10" comes before "3" because the value of the character "1" is less than the value of the character "3"). uniq collapses adjacent identical lines of a file, and the -c option prepends a count of how many adjacent identical lines were found. Because uniq only compares adjacent lines, we run sort first so that all the lines for one IP address end up next to each other.

So what do the results mean? The last entry "875 24.176.252.198" means that the IP address 24.176.252.198 requested OCF pages 875 times.
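Here is the whole pipeline again on a five-line sample, small enough to check by hand. The addresses and paths in the sample are invented for the example. The awk step is not part of the pipeline above; it only trims the padding that uniq -c puts in front of each count so the result is easy to compare.

```shell
#!/bin/sh
# Made-up sample in the access_log layout (invented addresses and paths).
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.2 - - [02/Mar/2001:10:00:01 -0800] "GET /~jones/index.html HTTP/1.0" 200 1200
10.0.0.1 - - [02/Mar/2001:10:00:02 -0800] "GET /~smith/a.gif HTTP/1.0" 200 300
10.0.0.2 - - [02/Mar/2001:10:00:03 -0800] "GET /~jones/pic.gif HTTP/1.0" 200 5000
10.0.0.3 - - [02/Mar/2001:10:00:04 -0800] "GET /~jones/index.html HTTP/1.0" 200 1200
10.0.0.2 - - [02/Mar/2001:10:00:05 -0800] "GET /~smith/b.gif HTTP/1.0" 200 400
EOF

# How many requests mention "jones"?
hits=$(grep jones "$log" | wc -l | tr -d ' ')

# Without the first sort, the 10.0.0.2 lines are not adjacent,
# so uniq -c reports that address more than once.
unsorted=$(cut -f1 -d" " "$log" | uniq -c | wc -l | tr -d ' ')

# Sorting first puts identical addresses next to each other; the final
# sort -n orders the counts so the busiest address comes last.
ranked=$(cut -f1 -d" " "$log" | sort | uniq -c | sort -n | awk '{print $1, $2}')

# Alphabetical versus numerical order for "10" and "3".
alpha=$(printf '10\n3\n' | sort | head -n 1)
numeric=$(printf '10\n3\n' | sort -n | head -n 1)

echo "jones hits: $hits"
echo "groups without sorting first: $unsorted"
echo "$ranked"
echo "first with sort: $alpha, first with sort -n: $numeric"
```

In the sample, 10.0.0.2 makes three of the five requests, so it ends up on the last line of the ranked output, just as 24.176.252.198 does in the real log.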