# Data Wrangling
## Table of Contents

- Example: checking who has logged into a server
- sed
  - Regular Expressions
- awk

Data wrangling: massaging data from one format to another, more useful, format.

- Pretty much any use of the pipe `|` counts
- Logs are a common use case
## Example: checking who has logged into a server

```shell
$ ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' > ssh.log
$ less ssh.log
```
`ssh` into myserver. Query the systemd journal, search it for ssh-related entries,
and then search those for any containing the phrase "Disconnected from". Redirect the
output to a file `ssh.log` so that we don't have to stay connected to the server.
Then open the log with pagination using `less`.
## sed

- `sed`: stream editor; builds on top of the old `ed` editor
  - you give it short commands for how to modify a file, rather than directly manipulating its contents
- `s`: substitution; written in the form `s/REGEX/SUBSTITUTION/`
  - `REGEX`: the regular expression to search for
  - `SUBSTITUTION`: the text you want to substitute matching text with
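A minimal illustration of the `s` command (the strings here are made up for the example):

```shell
# Replace the first occurrence of "world" on each line
echo 'hello world' | sed 's/world/shell/'

# By default only the first match per line is replaced;
# add the g flag to replace all of them
echo 'aaa' | sed 's/a/b/g'
```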
```shell
ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed 's/.*Disconnected from //'
```
### Regular Expressions
- regexr: an online tool to test and debug regular expressions
- used to match text against a specified pattern
- regular expressions are usually surrounded by `/`
- most ASCII characters carry their normal meaning, but some are special matching characters, with meaning varying from system to system. Typically:
  - `.`: any single character, except newline
  - `*`: 0 or more of the preceding match
  - `+`: 1 or more of the preceding match
  - `[abc]`: any one character of `a`, `b`, and `c`
  - `(RX1|RX2)`: matches `RX1` or `RX2`
  - `^`: the start of the line
  - `$`: the end of the line
NB: `sed`'s basic regular expressions require a `\` before most of these characters
to give them their special meaning (or pass `-E` for the extended syntax). In most
other tools the inverse is true: the backslash makes a special character literal.
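A quick check of a few of these metacharacters (the input strings are made up; the escaped basic-regex form is GNU `grep` behaviour):

```shell
# ^ and $ anchor the match; [0-9]+ means one or more digits
echo '12345' | grep -E '^[0-9]+$'

# In basic (non -E) regex syntax, + needs a backslash to act as a
# special matcher (GNU grep)
echo '12345' | grep '^[0-9]\+$'
```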
- capture groups: any text matched by a parenthesized part of the pattern is stored in a numbered capture group, available as `\1`, `\2`, …
```shell
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
```
This regex matches any number of any characters, followed by “Disconnected from “, followed by an optional “invalid “ or “authenticating “, followed by “user “, followed by any number of characters, followed by 1 or more characters other than space, followed by “ port “, followed by 1 or more digits, with an optional suffix of “ [preauth]”, and then the end of the line. Keep only the second capture group, which contains the username.
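To sanity-check the expression, feed it a single hand-crafted log line (the hostname and address below are made up) and confirm that only the username survives:

```shell
line='Jan 17 03:13:00 myserver sshd[2631]: Disconnected from invalid user admin 46.97.239.16 port 55920 [preauth]'
# The second capture group holds "admin"; everything else is consumed
echo "$line" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
# admin
```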
Email address matching: harder than it seems
Now sort by usernames, and collapse to distinct values, with a count:
```shell
ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c
 | sort -nk1,1       # -n: sort in numeric order;
                     # -k1,1: sort by first whitespace-separated column only
 | tail -n10         # take the 10 most common ones
 | awk '{print $2}'  # awk is a text-processing language; see below
 | paste -sd,        # combine lines (-s) with delimiter ","
```
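The counting-and-joining stages can be tried offline on a made-up stream of usernames, without any server at all:

```shell
# Count each name, keep the 2 most common, then join them into one line
printf 'root\nadmin\nroot\nguest\nroot\nadmin\n' \
  | sort | uniq -c \
  | sort -nk1,1 \
  | tail -n2 \
  | awk '{print $2}' \
  | paste -sd,
# admin,root
```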
## awk

- `awk` is a programming language; very good at processing text streams
- programs comprise an optional pattern plus a block specifying behaviour for matches on a given line. The default pattern matches all lines.
- inside the block:
  - `$0`: the entire line
  - `$1`-`$n`: the n-th field of the line, when separated by the `awk` field separator (default: whitespace; specify an alternative character with `-F`)
- `{print $2}` then means: print the second field of each line (the username)
- see class notes for more detail
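A minimal illustration of patterns, fields, and `-F`, on made-up `/etc/passwd`-style input:

```shell
# -F: sets ':' as the field separator; the pattern restricts the block
# to lines whose first field is "alice", and the block prints field 3
printf 'root:x:0\nalice:x:1000\n' | awk -F: '$1 == "alice" {print $3}'
# 1000
```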
## Data analysis

```shell
 | paste -sd+   # add the numbers on each line together
 | bc -l        # evaluate the resulting arithmetic expression
```

More elaborate expressions work too (`data` stands for any command producing numbers):

```shell
echo "2*($(data | paste -sd+))" | bc -l
```

- Stats: `st` or `R`
```shell
ssh myserver journalctl
 | grep sshd
 | grep "Disconnected from"
 | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
 | sort | uniq -c
 | awk '{print $1}' | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
```
- basic plotting: `gnuplot`

```shell
ssh myserver journalctl | grep sshd | grep "Disconnected from" | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' | sort | uniq -c | sort -nk1,1 | tail -n10 | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'
```
- binary data: these tools also work for binary data. E.g. capture an image with ffmpeg, convert it to grayscale, compress it, send it to a remote machine, decompress it, make a copy, and display it:

```shell
ffmpeg -loglevel panic -i /dev/video0 -frames 1 -f image2 - | convert - -colorspace gray - | gzip | ssh mymachine 'gzip -d | tee copy.jpg | env DISPLAY=:0 feh -'
```
## Exercises
- Take this short interactive regex tutorial.
- Find the number of words (in `/usr/share/dict/words`) that contain at least three `a`s and don't have a `'s` ending. What are the three most common last two letters of those words? `sed`'s `y` command, or the `tr` program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?
- To do in-place substitution it is quite tempting to do something like `sed s/REGEX/SUBSTITUTION/ input.txt > input.txt`. However, this is a bad idea. Why? Is this particular to `sed`? Use `man sed` to find out how to accomplish this.
- Find your average, median, and max system boot time over the last ten boots. Use `journalctl` on Linux and `log show` on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like `Logs begin at ...` and `systemd[577]: Startup finished in ...`. On macOS, look for `=== system boot:` and `Previous shutdown cause: 5`.
- Look for boot messages that are not shared between your past three reboots (see `journalctl`'s `-b` flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use `sed '0,/STRING/d'` to remove all lines previous to one that matches `STRING`. Next, remove any parts of the line that always vary (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (`uniq` is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).
- Find an online data set like this one, this one, or maybe one from here. Fetch it using `curl` and extract out just two columns of numerical data. If you're fetching HTML data, `pup` might be helpful. For JSON data, try `jq`. Find the min and max of one column in a single command, and the sum of the difference between the two columns in another.
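For the case-insensitivity hint in the dictionary exercise, `tr` can fold everything to lowercase before matching; a tiny sketch on made-up words:

```shell
# Translate every character in the set [:upper:] to its [:lower:] counterpart
printf 'Banana\nALASKA\n' | tr '[:upper:]' '[:lower:]'
# banana
# alaska
```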
## Solutions
- completed
- Number of words with $\ge$ 3 a's not finishing in 's:

```shell
grep -E "\w*[aA]\w*a\w*a\w*[^(\'s)]$" words \
  | wc -w   # word count, with the count-words flag
# 435
```

Equivalently:

```shell
grep -c -E "\w*[aA]\w*a\w*a\w*[^(\'s)]$" words
# 435
```

Three most common last two letters:

```shell
grep -E "^\w*[aA]\w*a\w*a\w*[^(\'s)]$" words \
  | sed -E 's/^\w*(\w{3})$/\1/' \
  | sort | uniq -c | sort -rnk1,1 | head -n3
# 53 ian
# 31 lly
# 20 ion
```
- As `sed` operates on file streams, you would be reading from and writing to the same file at the same time. In fact, the shell truncates `input.txt` via `>` before `sed` ever reads it, so the input is lost; this applies to any program, not just `sed`. There is an in-place flag `-i` for `sed` which allows you to do this safely.
- My `journalctl` doesn't have any entries (WSL). I `ssh`ed into one of my Raspberry Pis and downloaded its `journalctl` output, but it doesn't appear to have any startup info. There is literally nothing interesting in the file…
- <!-- todo -->
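A sketch of the in-place flag from the solution above, on a scratch file (GNU `sed` shown; BSD/macOS `sed` requires a suffix argument after `-i`, e.g. `-i ''`):

```shell
tmp=$(mktemp)                    # scratch file for the demo
echo 'hello world' > "$tmp"
sed -i 's/world/shell/' "$tmp"   # edit the file in place, no redirection needed
cat "$tmp"
# hello shell
rm "$tmp"
```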
- Fetching and wrangling an online data set:

```shell
curl https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/topic-pages/tables/table-1 > crime_rate.html
cat crime_rate.html | pup 'table .group0,.group3 text{}' | grep -E "[^\n]" | grep -oE "([0-9]{4})" > year
cat crime_rate.html | pup 'table .group0,.group3 text{}' | grep -E "[^\n]" | grep -oE "([0-9]{3}\.[0-9])" > violent_crime_rate
paste -d " " year violent_crime_rate > crime_rate
```

Note this is an instance where you could use `tee` to pipe the common output of the first `grep` to each second `grep`, and then use this as an input to the `paste`.
year | violent_crime_rate |
---|---|
1997 | 611.0 |
1998 | 567.6 |
1999 | 523.0 |
2000 | 506.5 |
2001 | 504.5 |
2002 | 494.4 |
2003 | 475.8 |
2004 | 463.2 |
2005 | 469.0 |
2006 | 479.3 |
2007 | 471.8 |
2008 | 458.6 |
2009 | 431.9 |
2010 | 404.5 |
2011 | 387.1 |
2012 | 387.8 |
2013 | 369.1 |
2014 | 361.6 |
2015 | 373.7 |
2016 | 386.3 |
Subtract column 2 from column 1:

```shell
cat crime_rate | awk '{print $1-$2}'
```
Summary stats, using `st`:

```shell
cat crime_rate | awk '{print $1}' | st
# N    min   max   sum    mean    stddev
# 20   1997  2016  40130  2006.5  5.91608
```
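If `st` isn't installed, the same min/max/sum/mean can be computed with plain `awk` (a sketch, not the author's pipeline; `seq` stands in for the year column of `crime_rate`):

```shell
# Track min, max, and running sum across lines; report in END
seq 1997 2016 | awk '
  NR == 1 { min = $1; max = $1 }
  { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
  END { printf "%d %d %d %.1f\n", min, max, sum, sum / NR }'
# 1997 2016 40130 2006.5
```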
Let’s plot it:

```
$ gnuplot -p -e "set terminal dumb; plot 'crime_rate' with points pt 2"

650 +--------------------------------------------------------------------+
    |   +    +    +    +    +    +     +    +    +    +                   |
    |                                    'crime_rate'    B                |
600 |-+B                                                                +-|
    |                                                                     |
    |      B                                                              |
    |                                                                     |
550 |-+                                                                 +-|
    |         B                                                           |
    |                                                                     |
500 |-+          B    B  B                                              +-|
    |                       B                                             |
    |                  B         B    B  B                                |
450 |-+                                     B                           +-|
    |                                                                     |
    |                                          B                          |
    |                                                             B       |
400 |-+                                           B                     +-|
    |                                                B  B                 |
    |   +    +    +    +    +    +     +    +    +         B  B           |
350 +--------------------------------------------------------------------+
   1996 1998 2000 2002 2004 2006  2008 2010 2012 2014 2016
```