Advanced Lab 1

Lecture: Thurs 2/1 Released: Mon 2/5 Due: End of Feb

Advanced section, Lab 1

Grading note

Labs are graded on completion. Treat this lab as seeds of exploration instead of just a grade. If you don’t pass on the first submission, you can have it checked off in-person by a decal facilitator.

Since you know how to use unix tools (though you may be more or less familiar with certain tools), the goal of this lab is to drop you in the wilderness. You can find your way out! :D

Composability & workflows

This lab can be done on your own UNIX-like machine, or you can ssh into tsunami.ocf.berkeley.edu using your OCF account to finish the lab there. As always, man and Google will be your friends.

Shell. web fetching, parsing, and frequency analysis.

[ ] Download Alice in Wonderland by parsing out a download link from Project Gutenberg.

https://www.gutenberg.org/ebooks/11

As it turns out, Project Gutenberg doesn’t like users of curl, and will demand you enable javascript. It’s probably a resonable defense against naive abuse of curl by people scraping the site, but it’s annoying to deal with.

You can just tell the server that you’re Firefox instead:

curl -s -A "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/51.0.2704.79 Chrome/51.0.2704.79 Safari/537.36" https://www.gutenberg.org/

Credit to: gurditsbedi on medium

[ ] Isolate the body content from the headers and footers, perhaps saving to an intermediate file.

Hints:

less may be useful for inspection, and sed with no-echo option.
sed’s “#,#p” line-number/print argument (play with ranges, offset, and start/end of file chars…twist on regex line terminators).
[ ] Process the document to get word counts. You may need to clean up the data, especially punctuation.

Hints:

As with all text processing in the shell, regex will be your friend (and worst enemy). Check out regexr to help you construct proper regular expressions.
grep has options to match whole words, and to print only match-contents (instead of a whole line containing a match). You need not use “regex-grouping”, just a pattern.
sed may also be useful for filtering by whitelisting and/or blacklisting characters. In either case, character classes [_] and (defining them negatively) [^_] can help. In both cases, use, but beware the - dash character: it gets interpreted as a range and triggers cryptic errors when that range is invalid. tr, cut, sort, and uniq also may be useful.

Questions (answer on Google form)

What are the top 10 words, and their frequency?

niche tools & little support

jq. jshawn.

These tools allow you to parse, restructure, and create JSON documents on the command line. Today, we’re using jq. It should be installed already on tsunami

[ ] Get location, lat long from a json api.

 curl 'http://api.geonames.org/postalCodeSearchJSON?postalcode=12345&maxRows=10&username=ocf_decal' -o location.json

[ ] Parse out values with jq Some command snippets that may help you with jq:

cat location.json  \
        | jq '.postalCodes[] | select(.placeName=="Berkeley") | {"lat": .lat, 
"long": .lng}'

    location.json -> { key: [ {...}, {...}] }
    jq '.key'   -> [1,2]
    jq '.key[]' -> 1,2

    jq '.[] | select(.key == "value")'

    jq '. | { "k1": .key1, "k2": .key2 }'

[ ] Get ISS flyovers by location. Docs: http://open-notify.org/Open-Notify-API/ISS-Pass-Times/

Consider using the pipemill pattern with date. It can take an argument for time to display, but it requires some syntax mangling you can find in the man pages.

Questions

When is the next flyovers of Berkeley? Berlin?

(Look up at the sky.)

Other APIs

SpaceX has a beautiful API presenting information about at least launches. Something you found fun or interesting (optional)

https://github.com/toddmotto/public-apis
https://github.com/jdorfman/awesome-json-datasets
https://www.data.gov/

Submission

Fill out the Google form