Advanced Lab 1
Lecture: Thurs 2/1
Released: Mon 2/5
Due: End of Feb
Advanced section, Lab 1
Grading note
Labs are graded on completion. Treat this lab as a seed for exploration rather
than just a grade. If you don’t pass on the first submission, you can have it
checked off in person by a decal facilitator.
Since you know how to use unix tools (though you may be more or less familiar
with certain tools), the goal of this lab is to drop you in the wilderness. You
can find your way out! :D
Composability & workflows
This lab can be done on your own UNIX-like machine, or you can ssh into
tsunami.ocf.berkeley.edu using your OCF account to finish the lab there. As
always, man and Google will be your friends.
Shell: web fetching, parsing, and frequency analysis
As it turns out, Project Gutenberg doesn’t like users of curl, and will
demand you enable JavaScript. It’s probably a reasonable defense against
naive abuse of curl by people scraping the site, but it’s annoying to deal
with.
You can just tell the server that you’re a web browser instead (the user-agent string below claims to be Chromium):
curl -s -A "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/51.0.2704.79 Chrome/51.0.2704.79 Safari/537.36" https://www.gutenberg.org/
Credit to: gurditsbedi on Medium
- [ ] Isolate the body content from the headers and footers, perhaps saving to
an intermediate file.
Hints:
- less may be useful for inspection, and sed with the no-echo option.
- sed’s “#,#p” line-number/print argument (play with ranges, offset, and
start/end of file chars…twist on regex line terminators).
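As a minimal sketch of the sed hints above, the snippet below builds a tiny stand-in file and prints only the lines between two marker lines. The marker text, the file contents, and the function name extract_body are all assumptions made for illustration; inspect your actual download with less to find where its body really begins and ends.

```shell
#!/bin/sh
# Build a small stand-in document (hypothetical content, not a real book).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
License header line
*** START OF THE BOOK ***
It was the best of times,
it was the worst of times.
*** END OF THE BOOK ***
License footer line
EOF

# extract_body FILE: print the lines strictly between the two markers.
# -n turns off sed's automatic printing; the /START/,/END/ address range
# selects the block, and the inner command skips the marker lines themselves
# (along with any other line that begins with "*** ").
extract_body() {
    sed -n '/^\*\*\* START/,/^\*\*\* END/{/^\*\*\* /!p;}' "$1"
}

extract_body "$sample"

# Line-number ranges work the same way: print lines 3 through 4 only.
sed -n '3,4p' "$sample"
```

Both commands print the same two body lines here. Pattern addresses tend to survive better than hard-coded line numbers if the header length changes between downloads.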
- [ ] Process the document to get word counts. You may need to clean up the
data, especially punctuation.
Hints:
- As with all text processing in the shell, regex will be your friend (and
worst enemy). Check out regexr to help you construct proper regular
expressions.
- grep has options to match whole words, and to print only match-contents
(instead of a whole line containing a match). You need not use
“regex-grouping”, just a pattern.
- sed may also be useful for filtering by whitelisting and/or blacklisting
characters. In either case, character classes [_] and (defining them
negatively) [^_] can help. In both cases, beware the dash character (-):
inside a character class it gets interpreted as a range and triggers cryptic
errors when that range is invalid. tr, cut, sort, and uniq also may be
useful.
- What are the top 10 words, and their frequency?
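One possible shape for the counting pipeline, sketched on an inline sentence rather than the real book. The sample text and the function name top_words are assumptions for illustration, and the cleanup rules (fold to lowercase, treat everything except letters and apostrophes as a separator) are just one choice among many.

```shell
#!/bin/sh
# top_words N: read text on stdin, print the N most frequent words
# as "count word" lines, most frequent first.
top_words() {
    tr '[:upper:]' '[:lower:]' |   # fold case so "The" and "the" match
    tr -cs "[:alpha:]'" '\n' |     # each run of other characters becomes one newline
    grep -v '^$' |                 # drop the empty line a leading separator can leave
    sort |                         # uniq only merges adjacent duplicates
    uniq -c |                      # prefix each distinct word with its count
    sort -k1,1nr -k2,2 |           # highest count first, ties alphabetical
    head -n "$1" |
    awk '{ print $1, $2 }'         # normalize uniq's padded count column
}

printf '%s\n' 'The cat saw the dog. The dog, startled, saw the cat!' | top_words 3
```

On the sample sentence this prints "4 the", "2 cat", and "2 dog", one per line. The sort before uniq -c matters: without it, repeated words that are not adjacent are counted separately.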
jq. jshon.
These tools allow you to parse, restructure, and create JSON documents on the
command line. Today, we’re using jq. It should be installed already on tsunami.
- [ ] Get a location’s latitude and longitude from a JSON API.
curl 'http://api.geonames.org/postalCodeSearchJSON?postalcode=12345&maxRows=10&username=ocf_decal' -o location.json
- [ ] Parse out values with jq
Some command snippets that may help you with jq:
cat location.json \
| jq '.postalCodes[] | select(.placeName=="Berkeley") | {"lat": .lat,
"long": .lng}'
location.json -> { key: [ {...}, {...}] }
jq '.key' -> [1,2]
jq '.key[]' -> 1,2
jq '.[] | select(.key == "value")'
jq '. | { "k1": .key1, "k2": .key2 }'
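To try the snippets above without hitting the API, here is a small sketch that runs the same kind of filter on inline data. The field names (postalCodes, placeName, lat, lng) come from the snippet above; the coordinate values and the function name lat_lng_for are made up for illustration. It assumes jq is on your PATH.

```shell
#!/bin/sh
# Made-up sample data in the same shape as the snippet above;
# the coordinates are placeholders, not real API output.
sample_json='{"postalCodes":[
  {"placeName":"Berkeley","lat":1.5,"lng":-2.5},
  {"placeName":"Elsewhere","lat":3.0,"lng":4.0}
]}'

# lat_lng_for PLACE: read JSON on stdin, print "lat lng" for each match.
# --arg passes the shell value in as the jq variable $place, which avoids
# splicing shell text into the filter; -r prints the raw string, unquoted.
lat_lng_for() {
    jq -r --arg place "$1" \
        '.postalCodes[] | select(.placeName == $place) | "\(.lat) \(.lng)"'
}

printf '%s\n' "$sample_json" | lat_lng_for Berkeley
```

With the placeholder data this prints 1.5 -2.5. Plain space-separated output like this is easy to feed into a later shell step, for example with read.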
- [ ] Get ISS flyovers by location.
Docs: http://open-notify.org/Open-Notify-API/ISS-Pass-Times/
Consider using the pipemill pattern with date. It can take an argument for the
time to display, but it requires some syntax mangling you can find in the man
pages.
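The pipemill pattern means piping a producer into a while read loop that handles one line at a time. A minimal sketch, assuming the values to convert are Unix timestamps in seconds: the two timestamps below are arbitrary samples, and epoch_to_utc is a hypothetical helper name. GNU date and BSD/macOS date spell the conversion differently, so the helper tries both forms; check man date on your machine.

```shell
#!/bin/sh
# epoch_to_utc SECONDS: print a Unix timestamp as an ISO-8601 UTC time.
# GNU date spells this "-d @SECONDS"; BSD/macOS date uses "-r SECONDS",
# so try the GNU form first and fall back to the BSD one.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null ||
        date -u -r "$1" '+%Y-%m-%dT%H:%M:%SZ'
}

# The pipemill: a producer on the left, a "while read" loop on the right.
# In the lab, the producer would be whatever prints your timestamps.
printf '%s\n' 0 86400 |
while read -r ts; do
    printf '%s -> %s\n' "$ts" "$(epoch_to_utc "$ts")"
done
```

This prints 0 -> 1970-01-01T00:00:00Z and 86400 -> 1970-01-02T00:00:00Z. Drop the -u flag if you would rather see local time.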
Questions
- When are the next flyovers of Berkeley? Of Berlin?
(Look up at the sky.)
Other APIs
SpaceX has a beautiful API presenting information about launches, at the very
least.
Something you found fun or interesting (optional)
https://github.com/toddmotto/public-apis
https://github.com/jdorfman/awesome-json-datasets
https://www.data.gov/
Submission
Fill out the Google form