MSC-data-wrangling

Exercise

  1. Take this short interactive regex tutorial.

The tutorial is worth doing at any experience level. Working through its practice problems afterwards helps the material stick.

  1. Find the number of words (in /usr/share/dict/words) that contain at
    least three "a"s and don’t have a 's ending. What are the three
    most common last two letters of those words? sed’s y command, or
    the tr program, may help you with case insensitivity. How many
    of those two-letter combinations are there? And for a challenge:
    which combinations do not occur?
    $ # lowercase, keep words with at least three a's, drop those ending in 's
    $ cat /usr/share/dict/words | \
    tr "[:upper:]" "[:lower:]" | \
    grep -E "(a.*){3}" | grep -v "'s$" | \
    grep -o '.\{2\}$' | \
    sort | uniq -c | sort -nr | head -n3
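The exercise also asks how many such words there are and how many distinct two-letter endings occur, which the pipeline above does not print. A minimal sketch, assuming a helper named three_a_words (a name made up here) that reads a word list on stdin:

```shell
# Hypothetical helper: keep words with at least three a's (case-insensitive)
# that do not end in 's. Reads words on stdin, one per line.
three_a_words() {
    tr '[:upper:]' '[:lower:]' | grep -E '(a.*){3}' | grep -v "'s\$"
}

# Usage (not run here):
#   three_a_words < /usr/share/dict/words | wc -l                            # number of words
#   three_a_words < /usr/share/dict/words | grep -o '..$' | sort -u | wc -l  # distinct endings
```

Lowercasing first means the later steps never need to handle case.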

Challenge (ugly but works…)

$ # show which combinations are not included
$ diff <(echo {a..z}{a..z} | tr " " "\n") \
<(cat /usr/share/dict/words | \
tr "[:upper:]" "[:lower:]" | \
grep -E "(a.*){3}" | grep -v "'s$" | \
grep -o '.\{2\}$' | \
sort -u \
) \
| grep "<"
  1. To do in-place substitution it is quite tempting to do something like
    sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a
    bad idea, why? Is this particular to sed? Use man sed to find out
    how to accomplish this.

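It is a bad idea because the shell sets up the redirection before sed runs: > input.txt truncates the file first, so sed reads an empty file and the original contents are lost. This is not particular to sed; any command whose output is redirected onto its own input file behaves the same way. man sed documents an option for editing in place: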

-i[SUFFIX], --in-place[=SUFFIX]

edit files in place (makes backup if SUFFIX supplied)
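To make the failure concrete, here is a small sketch that reproduces it on a throwaway file and then shows two ways around it. The scratch directory, file name, and contents are made up for illustration. The -i.bak spelling is used because, as far as I know, both GNU and BSD sed accept a suffix attached to -i.

```shell
# Work in a scratch directory so no real file is touched.
dir=$(mktemp -d)
printf 'hello world\n' > "$dir/input.txt"

# Broken: the shell truncates input.txt before sed ever reads it.
sed 's/hello/goodbye/' "$dir/input.txt" > "$dir/input.txt"
broken_size=$(wc -c < "$dir/input.txt")   # the file is now empty

# Portable fix: write to a temporary file, then move it over the original.
printf 'hello world\n' > "$dir/input.txt"
sed 's/hello/goodbye/' "$dir/input.txt" > "$dir/input.txt.tmp" &&
    mv "$dir/input.txt.tmp" "$dir/input.txt"
fixed=$(cat "$dir/input.txt")

# sed's own flag: -i.bak edits in place and keeps input.txt.bak as a backup.
sed -i.bak 's/goodbye/hello/' "$dir/input.txt"
```

Keeping the backup suffix is a cheap safety net while a substitution is still being debugged.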
  1. Find your average, median, and max system boot time over the last ten
    boots. Use journalctl on Linux and log show on macOS, and look
    for log timestamps near the beginning and end of each boot. On Linux,
    they may look something like:

    Logs begin at ...

    and

    systemd[577]: Startup finished in ...

    On macOS, look for:

    === system boot:

    and

    Previous shutdown cause: 5
    $ journalctl | grep ".*\[1\]: Startup finished in" | sed -E 's/.*Startup finished in.* = (.*)\.$/\1/' | sed -E 's/([0-9]+)min /\1 /; s/s$//' | awk '{ print (NF == 2) ? $1 * 60 + $2 : $1 }' # prints each boot time in seconds, converting forms like '1min 51.581s'
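The exercise asks for the average, median, and max, which still have to be computed from the extracted durations. One possible sketch uses two helper functions, to_seconds and boot_stats; both names are made up here. It assumes durations look like "25.301s", "893ms", or "1min 51.581s", and that each input line holds one total boot time.

```shell
# Hypothetical helper: convert durations such as "25.301s" or
# "1min 51.581s" (one per line) to plain seconds.
to_seconds() {
    awk '{
        total = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /min$/)     { sub(/min$/, "", $i); total += $i * 60 }
            else if ($i ~ /ms$/) { sub(/ms$/, "", $i);  total += $i / 1000 }
            else if ($i ~ /s$/)  { sub(/s$/, "", $i);   total += $i }
        }
        print total
    }'
}

# Hypothetical helper: read one number per line, print average, median, max.
boot_stats() {
    sort -n | awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) exit
            median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "avg=%.3f median=%.3f max=%.3f\n", sum / NR, median, v[NR]
        }'
}

# Usage (not run here), taking the ten most recent matching lines:
#   journalctl | grep 'systemd\[1\]: Startup finished in' | tail -n 10 |
#       sed -E 's/.* = (.*)\.$/\1/' | to_seconds | boot_stats
```

Sorting before awk means the median is a simple index lookup and the max is the last element.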
  2. Look for boot messages that are not shared between your past three
    reboots (see journalctl’s -b flag). Break this task down into
    multiple steps. First, find a way to get just the logs from the past
    three boots. There may be an applicable flag on the tool you use to
    extract the boot logs, or you can use sed '0,/STRING/d' to remove
    all lines previous to one that matches STRING. Next, remove any
    parts of the line that always vary (like the timestamp). Then,
    de-duplicate the input lines and keep a count of each one (uniq is
    your friend). And finally, eliminate any line whose count is 3 (since
    it was shared among all the boots).
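The steps described above can be sketched as two small filters. This assumes journalctl's default output, where each line starts with "Mon DD HH:MM:SS hostname"; the function names strip_varying and not_shared are made up for illustration. Each boot is de-duplicated separately before counting, so a message repeated within one boot does not distort the count.

```shell
# Hypothetical helper: drop the "Mon DD HH:MM:SS host " prefix and any [PID]
# so identical messages from different boots compare equal.
strip_varying() {
    sed -E 's/^[A-Z][a-z]{2} +[0-9]+ [0-9:]{8} [^ ]+ //; s/\[[0-9]+\]//'
}

# Hypothetical helper: count identical lines and keep those whose count
# is not 3, i.e. messages not shared by all three boots.
not_shared() {
    sort | uniq -c | awk '$1 != 3'
}

# Usage (not run here): de-duplicate each boot on its own, then count.
#   for b in 0 -1 -2; do
#       journalctl -b "$b" | strip_varying | sort -u
#   done | not_shared
```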

  3. Find an online data set like this one, this one, or maybe one from
    here. Fetch it using curl and extract out just two columns of
    numerical data. If you’re fetching HTML data, pup might be helpful.
    For JSON data, try jq. Find the min and max of one column in a
    single command, and the sum of the difference between the two
    columns in another.
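
Once the two numeric columns are extracted, both calculations fit in a single awk command each. A minimal sketch, assuming the data arrives on stdin as two whitespace-separated numeric columns; the function names min_max and sum_diff and the URL in the usage comment are placeholders, not part of the exercise.

```shell
# Hypothetical helper: print "min max" of the first column.
min_max() {
    awk 'NR == 1 { min = max = $1 + 0 }
         $1 + 0 < min { min = $1 + 0 }
         $1 + 0 > max { max = $1 + 0 }
         END { if (NR) print min, max }'
}

# Hypothetical helper: print the sum of (column 1 - column 2) over all rows.
sum_diff() {
    awk '{ total += $1 - $2 } END { print total + 0 }'
}

# Usage (not run here; the URL and column numbers are placeholders):
#   curl -s https://example.com/data.csv | cut -d, -f2,3 | tr ',' ' ' | min_max
#   curl -s https://example.com/data.csv | cut -d, -f2,3 | tr ',' ' ' | sum_diff
```

Adding 0 forces a numeric comparison, so values such as 10 and 9 are not compared as strings.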