System Administration for the Web - Day 5 notes

Notes:

Administrivia:

  • Homework from before spring break due next week

    Course Notes:

  • questions..
  • shell scripts
  • cron jobs and log rotation
  • text utilities and datamining
    • cut
    • sort
    • uniq
    • grep
  • pipes and redirection

    Shell Scripts

    Shell scripts are simple programs written with the commands that the shell can understand, including its own built-in commands (which are listed on the shell's manpage) as well as any other programs on the system. Shell scripts are used to perform simple system operations like starting daemons (see below) and performing accounting. In recent years, scripting languages such as Perl have risen to prominence because of their usefulness in writing programs for the web. Most shell scripts used by the system are written in sh's scripting language (but that doesn't stop people from writing scripts in csh, bash, ksh or any other shell's language). We'll talk more about shell scripting next week; for now, just take a look at what a shell script looks like. One thing to note is that the first line in the file is "magic": it tells the system to interpret the file with the program specified after the #! on the first line, in this case the basic shell, sh. Because the system must first execute the file, and the shell must then read the file in as input, the user trying to run the script must have both read and execute permissions on the script. Example: /etc/rc2.d/S99audit
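
    As a minimal sketch of what such a script looks like (this is not the contents of /etc/rc2.d/S99audit; the file name hello.sh and its contents are made up for illustration), the following writes a tiny script, gives its owner read and execute permission, and runs it:

```shell
#!/bin/sh
# Hypothetical example: create a tiny sh script, make it executable, run it.
workdir=$(mktemp -d)
script="$workdir/hello.sh"

# The first line of the generated file is the "magic" #! line.
cat > "$script" <<'EOF'
#!/bin/sh
# hello.sh - illustrative only
echo "hello from a shell script"
EOF

chmod u+rx "$script"        # read and execute permission for the owner
hello_output=$("$script")   # the system sees "#!/bin/sh" and starts sh on the file
echo "$hello_output"

rm -rf "$workdir"
```

    Without the execute bit, running the file directly should fail with a permission error, although "sh hello.sh" would still work because in that case sh only needs to read the file.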

    cron jobs

    Text Utilities

    Now we'll get into using utilities to examine text files such as the system's web logs. System administrators often find it useful to comb through large amounts of data, such as logfiles, to discover properties of that data. For instance, you could examine the logs generated by your webserver to learn which of your pages are the most popular.

    What is a text utility?

    Text utilities are some of the oldest programs in the Unix canon. In the early days of Unix, the philosophy was to create lots of little commands that each do one very specific thing. Later, especially with the development that went on at UC Berkeley in the 80s, programs became more generalized, gaining more and more flags (options). Unix text utilities are little programs that were created to allow users to extract useful information from large text files.

    What is a logfile, and how can we get useful info from it?

    Do you have a webpage on the OCF? Have you looked at a weblog analysis before? The "access log" is the text file in which the webserver records every request for a webpage. Each line looks like this:

        169.229.76.87 - - [25/Feb/2001:00:10:37 -0800] "GET /~superb/comedy/equation.gif HTTP/1.1" 200 15081 "http://www.ocf.berkeley.edu/~superb/comedy/comedy.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"

    There are several pieces of information on this line, from left to right:

    • 169.229.76.87 - the IP address of the requestor
    • the first dash - the identity of the requestor as reported by identd (almost never available, so it is normally just a dash)
    • the second dash - the username of the requestor, if they had to log in to the site
    • [timestamp] - the date, time and displacement from UTC
    • "GET stuff" - what request was made to the server, using what protocol
    • first number - the status code of the request; 200 means success
    • second number - the number of bytes transferred
    • "URL" - the referrer URL (the page the request came from)
    • "Mozilla" - information about the user's environment

    The access logs are normally found in:

        /var/log/httpd/access.log

    On the OCF, you will find them on our webserver, death, at:

        /opt/httpd/logs/

    If you are on another system and you can't find the logs, you can use the locate command (locate access.log).

    If you look in that directory, you will see that there are several types of logs, and you'll notice that each log has multiple versions, i.e. access_log.#.gz. Those files are the logs from previous weeks. The way we've got our system set up, every Saturday we take the current access_log file, rename it access_log.1 and compress it with the GNU zip (gzip) program. If there is already an access_log.1.gz, that file is first renamed access_log.2.gz, and so on. This process is called log rotation. If there is an access_log.9.gz, it is thrown away, because we've decided not to keep that data around. We could keep all our old data, but we would eventually run out of hard disk space.
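
    The rotation scheme described above can be sketched in sh roughly as follows. This is only an illustration under assumptions, not the OCF's actual rotation job: it works in a throwaway directory, the function name rotate_log is made up, and it simply creates a fresh empty log without telling any webserver to reopen its log file, which a real setup would need to do.

```shell
#!/bin/sh
# Sketch of log rotation keeping access_log.1.gz .. access_log.9.gz.
# rotate_log is a hypothetical helper, not a standard command.
rotate_log() {
    log=$1      # path to the current log, e.g. .../access_log
    keep=9      # highest numbered archive to keep

    # The oldest archive is thrown away.
    rm -f "$log.$keep.gz"

    # Shift the rest up by one: .8.gz -> .9.gz, ..., .1.gz -> .2.gz
    n=$((keep - 1))
    while [ "$n" -ge 1 ]; do
        if [ -f "$log.$n.gz" ]; then
            mv "$log.$n.gz" "$log.$((n + 1)).gz"
        fi
        n=$((n - 1))
    done

    # Rename the current log, compress it, and start a new empty log.
    mv "$log" "$log.1"
    gzip "$log.1"
    : > "$log"
}

# Demonstration in a scratch directory with made-up log contents.
demo=$(mktemp -d)
echo "week one" > "$demo/access_log"
rotate_log "$demo/access_log"
echo "week two" > "$demo/access_log"
rotate_log "$demo/access_log"

newest=$(gzip -dc "$demo/access_log.1.gz")
older=$(gzip -dc "$demo/access_log.2.gz")
echo "access_log.1.gz holds: $newest"
echo "access_log.2.gz holds: $older"
rm -rf "$demo"
```

    The archives are shifted from the highest number down so that no file is overwritten before it has been moved out of the way.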
    Let's take a look at how large a recent access_log file is:

        death [158] zcat access_log.1.gz | wc -l
        2017601
        > du -k access_log.1
        412728 access_log.1

    The access_log is over two million lines, and about 400 megabytes! Using such a large file will slow everything down. In order to actually be able to do some real work, we'll use a much smaller version of this file, access_log.small, which has only 5000 lines.

    Before we get into actually processing the log, let's look at the logfile being updated in real time, as people check out our webpages. To do this, run:

        tail -f access_log

    Remember that tail gives us the last 10 lines of a file. The "-f" option causes tail to wait at the end of the file for additional data to be appended. When data is appended, tail displays it and then waits for more. This way, you can see the record of each new web hit as it happens! Oftentimes, you'll see a large number of requests for similar pages one after another. This is because if a page has a lot of graphics, each picture causes the browser to make a separate request. While you're viewing the output of "tail -f access_log", whatever text you enter will be ignored, so you can safely hit return repeatedly to separate out different log entries. To stop viewing the output of "tail -f", hit control-C.

    From here on we'll work with access_log.small, so we can actually get some results in a reasonable amount of time.
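
    To try the counting and tail commands without touching the real logs, here is a small sketch on a made-up file (sample_log and its contents are invented for the demonstration). It uses gzip -dc, which does the same job as zcat, and tail -n to ask for a specific number of lines; tail -f is left out because it runs until interrupted.

```shell
#!/bin/sh
# Scratch demonstration; sample_log is a made-up file, not a real access log.
demo=$(mktemp -d)

# Write 25 numbered lines.
i=1
while [ "$i" -le 25 ]; do
    echo "entry $i"
    i=$((i + 1))
done > "$demo/sample_log"

# tail with no options prints the last 10 lines.
default_tail=$(tail "$demo/sample_log" | wc -l | tr -d ' ')
# tail -n 3 prints only the last 3 lines.
last_three=$(tail -n 3 "$demo/sample_log")

# Count the lines of a compressed copy without uncompressing it on disk.
gzip "$demo/sample_log"
gz_lines=$(gzip -dc "$demo/sample_log.gz" | wc -l | tr -d ' ')

echo "lines printed by plain tail: $default_tail"
echo "last three lines:"
echo "$last_three"
echo "lines in sample_log.gz: $gz_lines"

rm -rf "$demo"
```

    The tr -d ' ' step only removes the padding that some versions of wc put in front of the number.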
    So, say you wanted to see who was looking at your webpages. For Ken's account, with the login name "kenao", hits for his pages will contain the string "kenao":

        # cd /opt/httpd/logs
        # grep kenao access_log.small
        ...[output]

    (grep prints all the lines in a file that contain a given string.)

    Now let's see how many page views Ken got by piping this through wc (word count):

        # grep kenao access_log.small | wc -l
        6

    (The "-l" flag tells wc to report only the number of lines it read, so Ken's pages were requested 6 times in those 5000 lines.)

    Now let's see how many different people viewed his webpages by looking at how many unique IP addresses requested them. We use the cut command, which returns only a certain column of each line, where the columns are delimited by a special character (in this case, we ask for column 1, where the columns are separated by blank spaces, " "):

        # grep kenao access_log.small | cut -f1 -d" "
        66.77.73.235
        66.77.73.235
        66.77.73.235
        66.77.73.235
        66.77.73.235
        66.77.73.235

    Here we can see just by looking that only one host viewed his webpages. What if there were many, many more? Let's look at who was requesting the most pages on the OCF in general. First, list the address behind every request:

        # cut -f1 -d" " access_log.small
        ...[long output]

    This is way too much information to make sense of. Instead we can (as we discussed in class) sort the addresses, count the repeats, and sort by count:

        # cut -f1 -d" " access_log.small | sort | uniq -c | sort -n
        ...[lots of output ending in:]
        87 160.81.214.230
        89 169.229.112.109
        95 12.232.222.68
        105 12.233.255.82
        120 63.196.242.143
        127 171.64.75.149
        148 169.229.118.101
        151 64.130.184.181
        167 66.81.133.155
        272 64.162.91.162
        344 64.167.150.48

    What about those commands? You can read the manpages to get lots of juicy information. But briefly, sort sorts the lines of a file, and the -n option sorts in true numerical order (otherwise, alphabetically, "10" sorts before "3" because the character "1" has a lower value than the character "3"). uniq collapses adjacent identical lines of a file, and the -c option prepends a count of how many adjacent identical lines were found. That is why the addresses are sorted before being passed to uniq: sorting puts identical addresses next to each other. So what do the results mean? The last entry, "344 64.167.150.48", means that the IP address 64.167.150.48 requested OCF pages 344 times.
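
    The whole pipeline can be tried on a tiny made-up log. The addresses and paths below are invented for illustration, and the file name access_log.tiny is hypothetical. The sketch counts requests matching a login name, counts distinct requesting addresses with sort -u, ranks addresses by request count, and shows how plain sort and sort -n order numbers differently. sed is used only to trim the padding that uniq -c puts in front of each count.

```shell
#!/bin/sh
# Made-up miniature access log; the addresses are reserved example addresses.
demo=$(mktemp -d)
log="$demo/access_log.tiny"
cat > "$log" <<'EOF'
192.0.2.10 - - [25/Feb/2001:00:10:37 -0800] "GET /~kenao/index.html HTTP/1.1" 200 1200
198.51.100.7 - - [25/Feb/2001:00:10:38 -0800] "GET /index.html HTTP/1.1" 200 5100
192.0.2.10 - - [25/Feb/2001:00:10:39 -0800] "GET /~kenao/photo.gif HTTP/1.1" 200 15081
203.0.113.5 - - [25/Feb/2001:00:10:40 -0800] "GET /~kenao/index.html HTTP/1.1" 200 1200
192.0.2.10 - - [25/Feb/2001:00:10:41 -0800] "GET /about.html HTTP/1.1" 200 800
198.51.100.7 - - [25/Feb/2001:00:10:42 -0800] "GET /faq.html HTTP/1.1" 404 300
EOF

# How many requests mention kenao?
hits=$(grep kenao "$log" | wc -l | tr -d ' ')

# How many different addresses made those requests?
visitors=$(grep kenao "$log" | cut -f1 -d" " | sort -u | wc -l | tr -d ' ')

# Requests per address, busiest last.
ranking=$(cut -f1 -d" " "$log" | sort | uniq -c | sort -n)

# The busiest address is the last line of the ranking.
busiest=$(echo "$ranking" | tail -n 1 | sed 's/^ *//')

# Plain sort compares character by character; sort -n compares values.
text_order=$(printf '10\n3\n' | sort | tr '\n' ' ')
num_order=$(printf '10\n3\n' | sort -n | tr '\n' ' ')

echo "kenao hits: $hits, distinct visitors: $visitors"
echo "$ranking"
echo "busiest: $busiest"
echo "sort: $text_order / sort -n: $num_order"
rm -rf "$demo"
```

    Note that a plain grep for the login name matches it anywhere on the line, including the referrer field, so the hit count can include requests that merely came from one of that user's pages.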

    enrichment


    Homework due next class:

  • Finish the assignment from last week. If you need help, send me an email, or ask in this public forum.
    c.2002, Devin Jones - jones@csua.berkeley.edu