-
Notifications
You must be signed in to change notification settings - Fork 17
/
01-shell.Rmd
1056 lines (713 loc) · 39.6 KB
/
01-shell.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Data Science for Economics and Finance"
subtitle: "Lecture 1: Shell"
author:
name: Dawie van Lill ([email protected]) | [Github](https://github.com/DawievLill)
date: Lecture 1 "`r format(Sys.time(), '%d %B %Y')`"
output:
html_document:
toc: true
toc_depth: 2
toc_float: true
theme: cosmo
highlight: pygments
always_allow_html: true
urlcolor: blue
mainfont: cochineal
sansfont: Fira Sans
monofont: Fira Code
---
<style type="text/css">
body{
font-size: 14pt;
}
</style>
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
library(knitr)
opts_chunk$set(
prompt = T, ## See hook below. I basically want a "$" prompt for every bash command in this lecture.
fig.align="center", #fig.width=6, fig.height=4.5,
# out.width="748px", #out.length="520.75px",
dpi=300, #fig.path='Figs/',
cache=F#, echo=F, warning=F, message=F
)
## Next hook based on this SO answer: https://stackoverflow.com/a/39025054
knit_hooks$set(
prompt = function(before, options, envir) {
options(
prompt = if (options$engine %in% c('sh','bash')) '$ ' else 'R> ',
continue = if (options$engine %in% c('sh','bash')) '$ ' else '+ '
)
})
```
# Introduction
---
Today is about working with a command line interface. There are many shell variants, and we will be working with **Bash** (**B**ourne **a**gain **sh**ell). This is the default for Linux and MacOS and needs to be installed for Windows. It is easiest to work with a Unix based operating system when coding, such as Linux or MacOS. However, I do realise that many of you will be working with Windows.
If you want to know more about Linux then you are welcome to come and speak to me about it. I do most of my work in a Linux environment and I think programming is much easier in Linux and MacOS. I use [Manjaro](https://manjaro.org/), but I think that [PopOS](https://pop.system76.com/) is a good Linux distribution to start with. It is quite similar to MacOS and you don't need to use the command line as frequently as with other Linux distributions.
In this section we will look at how to setup a project as well as how to work with the command line. You might be used to working with graphical user interfaces for most of your coding career, but it is useful to know how to work through the shell. This is not something that is often taught in economics programs, but if you are looking at a career in data science this might be very useful. [Here](https://raw.githack.com/uo-ec607/lectures/master/03-shell/03-shell.html#14) are some of the reasons why we might be interested in using the shell. We will start with looking at the basic file structures within a computer. The key points are the following,
1. Power
2. Reproducibility
3. Interacting with servers and super computers (**NB**)
4. Automating workflow and analysis pipelines
The primary references for this section are the notes from [Merely Useful](https://merely-useful.tech/py-rse/) and the slides from [Grant McDermott](https://raw.githack.com/uo-ec607/lectures/master/03-shell/03-shell.html#1)
**Note**: The output displayed will be for the file structure on my computer. For your PC things will obviously be different.
## Motivation
Let us quickly talk about some things that we can use the shell for.
- renaming and moving files **en masse**
- finding things on the computer
- combining and manipulating PDFs
- installing and updating software
- scheduling tasks
- monitoring resources on your system
- connecting to cloud environments
- running jobs on super computers
There are many more examples, but these are some of the fundamental things that make the command line useful. I think the following quite encapsulates my thoughts on the command line,
> As data scientists, we work with a lot of data. This data is often stored in files. It is important to know how to work with files (and the directories they live in) on the command line. Every action that you can do using a GUI, you can do with command-line tools (and much more).
# Hello World!
---
Once you have opened your terminal it should look something like this,
![](images/shell-1.png)
I am using Linux, so obviously my terminal is going to look different from yours. Mine is customised to look the way I like it. If you are using Windows it will probably look something more along the lines of this,
![](images/shell-2.png)
The first thing that you want to do is open up the Bash shell. You can do this through the [built-in terminal](https://support.rstudio.com/hc/en-us/articles/115010737148-Using-the-RStudio-Terminal) in **RStudio** if you prefer. I will quickly demonstrate this in class.
To start, when Bash runs it presents us with a \gref{prompt}{prompt}\index{prompt (in Unix shell)} to indicate that it is waiting for us to type something. This prompt is a simple dollar sign by default:
``` bash
$
```
For our very first command in the terminal we will enter `echo 'Hellow World!'` in the command line and see what happens.
```{bash hello}
echo 'Hello World!'
```
You will see the syntax above for the rest of the notes. The first part is the command that I have entered and below we will see the output from issuing that command.
## Exercise
1. Try and print your own name to the terminal.
2. Clear your terminal using the `clear` command.
# Listing files
---
Now that we have that pesky first command out of the way, let us continue our journey with shell commands. The following commands will let us explore our folders and files, and will also introduce us to several conventions that most Unix tools follow.
The first thing I normally do is check my current working directory and list the files located in this directory. You have encountered the idea of a current working directory in R, so this should be familiar. We can do this by running the following command, which is for **p**rint **w**orking **d**irectory. (Notice that we only run the command, there are no options and arguments).
```{bash}
pwd
```
```{bash}
type pwd
```
On my computer the **pwd** is the `01-shell` folder. We can also see that it is a built-in command for the shell.
It is important to note that all Bash commands have the same basic syntax -- command, option(s), argument(s). Below is a picture to illustrate the point,
![](images/shell_command_syntax.svg)
The options and arguments are not always needed, but you will need a command (such as `ls`). The options start with a dash and are usually one letter. Options are sometimes referred to as **switches** or **flags**. Arguments tell the command what to operate on, so this is usually a file, path or set of files and folders.
Let us see what happens if we type the command and option from above into the shell,
```{bash}
ls -F
```
The `ls` command gives the files **l**i**s**ted in the current working directory, while the `-F` option provides a classification of the types of files.
> **Note**: Directory names in a path are separated with `/` on Unix, but `\` on Windows.
We could use other options like `-a`, `-l` and `-h` to show `all` the folders and files. The `-l` is for long format and the `-h` is for human readable.
```{bash}
ls -l -a -h
```
You might notice that there is a lot of information here. Let us analyse this one piece at a time.
The first column indicates the object type. It can either be a (`d`) directory / folder, (`l`) link or (`-`) file. In the case of the first line, we see that this is a directory.
The following nine columns indicate the permissions associated with the objects user types. In this case we have `r` (read), `w` (write) or `x` (execute) access. The `-` indicates missing permissions.
The remaining columns represent the hard links to the object, the identity of the owner's of the object and then descriptive elements about the object.
In terms of the first row, the output shows a special directory called `.`, which is the current working directory. In the second row, we have `..` which means the directory that contains the current one. This is referred to as the **parent** directory.
Beyond the first two columns, the rest of the objects are files. We can see the files listed here are Jupyter notebooks, `html` files and some `RMarkdown` files.
# The manual
---
It is worth noting that options for many of the commands, such as `ls`, is contained in the manual. We can access the manual for the command in the following way,
```{bash}
man ls | head -n 30 # head lets us look at the first 30 lines here
```
# Navigating directories
---
Now let's navigate to the `research project` folder. In order to this we will use the `cd` command to **c**hange **d**irectory. In this case I know that the `research-project` folder is located higher up in the file structure (in the parent directory). In order to go back one level in the directory, we can use the following command,
```{bash}
cd ..
pwd
```
If you wanted to move up higher than the parent directory, you could use `cd ../..`, which moves up two directories.
The parent directory in our case is called `DataScience-871`. Once again, list the folders and files to get an idea of directory structure.
```{bash}
cd ..
ls -a # you can use whichever options you prefer
```
We can see the folder we are `research-project` folder we are looking for and can easily access it by using the `cd research-project` command.
```{bash}
cd ..
cd research-project ## Move into the "research-project" sub-directory of this lecture directory.
pwd
```
However, let us go to the **home** directory and navigate to the `research-project` folder from there. This is a good exercise that you can follow on your computer at home. We start with just typing `cd`. This will navigate us to the home directory for the current user.
```{bash}
cd
pwd
```
Let us quickly analyse the directory output above. The first component is the **root** directory and this holds everything. This refers to the slash character `/` on its own. Next we have the **home** directory, which contains the directory for the current user, which in this case is `dawie`. The second slash operator is a separator. In Windows this would have been two backslashes (`\\`) instead of a forward slash. Also, in Windows the **home** directory is usually called `Users`.
Another shortcut to get to the **home** directory is to use the command `cd ~`, where the `~` is a special shortcut for **home**.
From this point I am going to navigate to my `DataScience-871` folder. From above you can see the absolute path of the folder. In my case I first need to change the directory to the `Dropbox` folder.
```{bash}
cd ~/Dropbox
pwd
```
Now that I am in the Dropbox folder I want to go to my `2022` folder. I can do this as follows,
```{bash}
cd ~/Dropbox/2022
pwd
```
Now let's take a quick look at the folders that are located in my 2022 folder.
```{bash}
cd ~/Dropbox/2022
ls -a
```
If we try and move to a folder that doesn't exist in this directory we will receive an error message.
```{bash}
cd ~/Dropbox/2022/comp
pwd
```
The relevant folder seems to be te `871-data-science` folder. So we change the directory once again. We keep doing this till we get to the `DataScience-871` folder. Obviously on your system this will be saved in a different location. You should know where the folder is located on your computer. We can briefly discuss this in class. There are slight differences for different operating systems.
```{bash}
cd ~/Dropbox/2022/871-data-science/DataScience-871
ls
```
We see that one of the folders is called `research-project`. That is where we want to navigate, so we use the `cd` command along with the folder name as follows.
```{bash}
cd ~/Dropbox/2022/871-data-science/DataScience-871/research-project
pwd
```
You can now see that the **pwd** is the `research-project` folder. Let us explore this folder to see what it contains.
```{bash}
cd ../research-project
ls -l -h
```
In this folder structure you can see a broad template for organising a small project.
For this project you see that here is a `README.md` file that gives us the basic information of the project. You might often see `licence`, `conduct` and `citation` files in projects, but we won't be dealing with those in detail in this course. The only boilerplate file that you need to concern yourself with is the `README.md` document, which we will talk about a bit more when we deal with **version control**. You will notice that this has the `.md` extension, which indicates that this is a Markdown file. This entire notebook was written in Markdown and I will talk about the format briefly in the lecture.
The directories for this project are organised by purpose.
Some runnable programs are located in the `bin` folder.
One normally keeps source files in a folder named the `src` folder, which includes your shell scripts and R / Julia scripts.
`src` folders normally contain human readable code and `bin` the computer readable codes. This is not an important distinction for our purposes. We will mostly work with source code.
Our raw data goes into the `data` folder and the data in this folder is never modified. This is the original raw data.
Results are put in the `results` folder. This includes the cleaned data, figures and other components that are created from the `bin`, `src` and `data` folders.
## Exercise
Starting from `home/dawie/Dropbox`, which of the following commands could I use to navigate to my home directory, which is `/home/dawie`?
1. `cd .`
2. `cd /`
3. `cd /home/dawie`
4. `cd ../..`
5. `cd ~`
6. `cd home`
7. `cd ~/data/..`
8. `cd`
9. `cd ..`
# Creating directories
---
In the next part we will be creating some files and directories that relate to our research project. In order to do this we use the command `mkdir`, which is short for **m**ake **d**irectory.
```{bash}
cd ../research-project
rmdir new_dir # remove in case it already exists
mkdir new_dir
ls
```
You should now be able to see the `new_dir` directory that we have created with this command. This is similar to creating a new folder, as you would with a graphical file explorer in your operating system of choice.
# Naming directories
---
There are a few "rules" about naming directories that we can quickly mention.
1. Don't use spaces.
2. Don't begin the name with a dash
3. Stick to letters, digits, dashes and underscores for names.
Examples of bad names
- Data Science 871 -- Has spaces
- -DataScience871 -- Starts with a dash
- #DataScienceLife -- Don't use hashtags!
Examples of "good" names
- DataScience871
- DataScience-871
- datascience-871
- data_science_871
I generally stick to lowercase with dashes, but that is a preference. It is simply easier to type.
# Using `touch` and `rm`
---
Getting back to our example with the new directory we just created.
```{bash}
cd ../research-project
ls new_dir
```
You will see that this directory is empty. So let us create a file and put it into this new directory. The files name is going to be `draft.txt`. The `.txt` extension indicates to us that this will be a text file. in order to create an empty file we can use the `touch` command.
```{bash}
cd ../research-project/new_dir
touch draft.txt
ls
```
We can also delete the objects that we created with the `rm` command.
> **NB**: Please note that there is no **undo** option with the shell. If you remove something it is permanently deleted. So make very sure about your actions before commiting. This is one of the reasons why version control is so important, which we will discuss in the next lecture.
```{bash}
cd ../research-project/new_dir
rm draft.txt
ls
```
## Exercise
What happens when we execute `rm -i new_dir/draft.txt`? Why would we want this protection when using `rm`?
If you wanted to delete the entire directory then you would have to use `rmdir`. If there are files in the directory you will get a warning telling you that this might not be the best idea. If there are files and you want to remove the directory, then you can use the recursive option `-r`.
```{bash}
cd ../research-project
rmdir new_dir
ls
```
You will note that `new_dir` is now gone from the list.
# Copying and renaming
---
Another important command is copy. Let us make a new sub-directory with copies and then copy across files from another folder.
```{bash}
cd ../research-project
#rmdir examples -r # delete in case this already exists (just for this example)
#mkdir examples
cd examples
touch example1.txt example2.txt
```
Let us now copy `example1.txt` into the `docs` folder with a new name.
```{bash}
cd ../research-project
cp examples/example1.txt docs/doc1.txt
ls docs
```
We can also move and rename files with the `mv` command. This is similar to copying, but completely moves the file to a new location.
```{bash}
cd ../research-project
mv examples/example2.txt docs
ls docs
```
We can also move the file back to its original location.
```{bash}
cd ../research-project
mv docs/example2.txt examples
ls docs
```
If you are moving the object into the same directory but with a new object name, then you are effectively just renaming it.
```{bash}
cd ../research-project
mv docs/doc1.txt docs/doc_new.txt
ls docs
```
There is a more convenient way to do this, the `rename` function. The syntax here is `pattern`, `replacement`, `file(s)`.
```{bash}
cd ../research-project
rename txt csv docs/doc_new.txt
ls docs
```
We can also easily revert our changes.
```{bash}
cd ../research-project
rename csv txt docs/doc_new.csv
mv docs/doc_new.txt docs/doc1.txt
ls docs
```
# Wildcards `*` and `?`
---
The place where `rename` is super useful is when we can use it in combination with regular expressions and **wildcards**. You would have dealt with the concept of regular expressions in the first part of the course.
With these methods we could change all the `.txt` file extensions in the exmamples folder to `.csv` in one line.
```{bash}
cd ../research-project
rename txt csv examples/* # the star represents the wild card expression here
ls examples
```
We can then just as easily change it back, with a similar command. Make sure you understand what is happening here.
```{bash}
cd ../research-project
rename csv txt examples/*
ls examples
```
Wildcards are special characters that can be used as a replacement for other characters. The two most important ones are the `*` and `?`.
We havent mentioned the `?` wildcard yet. It matches any single character in a filename, so `?.txt` matches `a.txt` but not `any.txt`.
# Working with text files
---
Next we will take a look at some data from the Project Gutenberg website. We have some full ebooks in our data folder.
```{bash}
cd ../research-project
ls data
```
There are some really interesting books in this folder. I recommend reading all of them if you have the time (especially `Frankenstein`!). If you don't have a lot of time read `Time Machine`, it is quite short.
## The word count command `wc`
Moving on, we can use the `wc` command, which is short for **w**ord **c**ount, which will tell us how many lines, words and letters there are in a file.
```{bash}
cd ../research-project/data
wc moby_dick.txt
```
We can't actually run the `wc` command on the entire directory.
```{bash}
cd ../research-project
wc data
pwd
```
What if we wanted to get the wordcount for two books, well then we would just specify the filenames of both books as inputs,
```{bash}
cd ../research-project/data
wc moby_dick.txt frankenstein.txt
```
If we wanted to know the word counts for **all** the books in the folder we could use wildcards.
```{bash}
cd ../research-project/data
wc *.txt
```
We could also just show the number of lines in each file,
```{bash}
cd ../research-project/data
wc -l *.txt
```
The easiest way to read text is with the `cat` command, which stands for "concatenate". `cat` will read in *all* of the text. We don't want all the text, we simply want the first few lines, so we use the `head` command. This is similar to the `head` command in R.
```{bash}
cd ../research-project/data
head -n 18 frankenstein.txt
```
## Regular expressions
If we want to find specific patterns in the text, we can use regular expressions type matching with `grep`. This stands for "global regular expression print".
What if we wanted to find some famous line in `Frankenstein`. One such line is,
> I was benevolent and good; misery made me a fiend.
We can search for the first part of the sentence as follows,
```{bash}
cd ../research-project/data
grep -n "I was benevolent" frankenstein.txt
```
We can look for specific words as well, such as *fear*.
```{bash}
cd ../research-project/data
grep -n "fear" frankenstein.txt | head -n 5
```
It seems there is a lot of fear related words in the book! However, what if we are looking just for `fear` and not words like `fearful` that contain `fear`. In this case we can use the `-w` flag. It indicates we are looking to match the whole word.
```{bash}
cd ../research-project/data
grep -n -w "fear" frankenstein.txt | head -n 5
```
We can use `grep` to find patterns in a group of files as well. Say we want to find the word *benevolent* across all the books in the directory,
```{bash}
cd ../research-project
grep -R -l "benevolent" data/
```
It seems that only four books in our list contain the word. It should be noted that there are many options for `grep`. If you want to find out what functions are available you can consult the manual for the command by the `man` command.
One of the things that we could do is figure out how many lines in the books in our list contain the word *benevolent*.
```{bash}
cd ../research-project
grep -w -r "benevolent" data | wc -l
```
It looks like there are 23 lines in all of the books that contain this specific word.
One of the real powers of regular expressions is the fact that we aren't limited to characters and strings when defining our search, we can also use **metacharacters**.
Metacharacters are characters that can be used to represent other characters. As an example, the period `"."` is a metacharacter, which represents *any* character. If we wanted to search `Frankenstein` for the character `"a"` followed by any character and then followed by the character `"x"` we could use the following command,
```{bash}
cd ../research-project/data
grep -P "a.x" frankenstein.txt
```
You see some nice words here like **a**n**x**ious or **a**n**x**iety. Besides metacharacters that represent other characters, we also have quantifiers which allow you to specify the number of times a particular regex should appear in a string. The most basic quantifiers are `+` and `*`. The plus represents one or more occurrences of the preceding expression. This means an expression like `s+as` would mean, one or more `s` followed by `as`. Let us illustrate with an example,
```{bash}
cd ../research-project/data
grep -P "s+as" frankenstein.txt
```
Some examples here are di**sas**ter and a**s**s**as**in.
It is not possible to go through all the cool things you can do with regular expressions in our lecture, so you can go and read more on it [here](https://seankross.com/the-unix-workbench/working-with-unix.html) if you find this interesting.
## Finding files
While the `grep` command helps us to find things in files, we can use `find` to find files themselves. We can use `find .` to find and list everything in a directory.
```{bash}
cd ../research-project
find .
```
If we only wanted to find directories then we could do,
```{bash}
cd ../research-project
find . -type d
```
What about finding the files?
```{bash}
cd ../research-project
find . -type f
```
We could also find files of a certain type using the `*` wildcard.
```{bash}
cd ../research-project
find . -name "*.txt"
```
We can combine some of the knowledge we have accumulated thus far to see how large the files are in terms of number of lines,
```{bash}
cd ../research-project
wc -l $(find . -name "*.txt")
```
You will often have `find` and `grep` used together. The following is a nice way to show the author of each of the books in the `data/` directory,
```{bash}
cd ../research-project/data
grep "Author:" $(find . -name "*.txt")
```
## Using `awk` and `sed`
Let us continue on our route of manipulating text in the shell. We encounter two more tools, `sed` and `awk`.
```{bash}
cd ../research-project
cat examples/nursery.txt
```
Let us now change "Jack" to "Bill".
```{bash}
cd ../research-project
sed -i 's/Jack/Bill/g' examples/nursery.txt
cat examples/nursery.txt
```
And then change is back.
```{bash}
cd ../research-project
sed -i 's/Bill/Jack/g' examples/nursery.txt
cat examples/nursery.txt
```
In this next example we are going to count the 10 most commonly used words in `Frakenstein`. We are going to be using a pipe operator `|`, similar to the one that you have used in R, to chain some commands together. We will also use the redirect operator `<`.
```{bash}
cd ../research-project
sed -e 's/\s/\n/g' < data/frankenstein.txt | sort | uniq -c | sort -n -r | head -10
```
Let us talk a bit more about the pipe and redirect operator.
# Pipes and redirect
---
You can send output from the shell to a file using the redirect operator `>`. We can print a message to the shell using the `echo` command.
```{bash echo_1}
echo "The shell really isn't all that scary. Or is it?"
```
You could also save this output to a file, just send it to a particular filename.
```{bash redirect_1}
cd ../research-project/examples
echo "Some text for the example file" > example3.txt
ls -a
```
You can see that `example3.txt` has now been created and if you open the file you will see the text located in that file. You can append the text in an existing file if you use `>>`.
```{bash redirect_2}
cd ../research-project/examples
echo "Some more text for the example file! We appended!" >> example3.txt
cat example3.txt
```
A nice operation that you can do with the shell, when working with `git`, is to issue the following command, `echo "*.csv" >> .gitignore`. Can you figure out what this is going to do and why it is useful?
Next we discuss the pipe operator in a bit more detail. You should be familiar with how this works in R by now, but is also a really cool feature that can be used in the shell.
We can easily chain together a string of simple operations and make it something quite complex. See the following example,
```{bash pipe_1}
cd ../research-project
cat -n data/frankenstein.txt | head -n1002 | tail -n10
```
If we wanted to check which of the books is the longest in this list we could do the following. First, we use the redirect operator `>` to send the list of line lengths to a text document, instead of printing it to the shell. Second, we use the `sort` command to sort the lines in the file. We use the `-n` flag to sort numerically.
```{bash}
cd ../research-project
wc -l data/*.txt > examples/lengths.txt
sort -n examples/lengths.txt
```
It seems that Moby Dick is the longest book and Time Machine the shortest. See, I told you it was a quick read. Frankenstein is second shortest, also a really good book, so it is worthwhile reading.
```{bash}
cd ../research-project
rm examples/lengths.txt
```
We remove the lengths file since we aren't going to be using it further.
A much easier way to have done the same operation above uses a pipe operator.
```{bash}
cd ../research-project/data
wc -l *.txt | sort -n
```
Below is a graphical illustration of the operations we performed. The code is not exactly the same, but it performs a very similar function.
![](images/redirects-and-pipes.svg)
## Exercise
In our `research-project/data` directory, we want to find the 3 files which have the least number of lines. Which command listed below would work?
1. `wc -l *.txt > sort -n > head -n 3`
2. `wc -l *.txt | sort -n | head -n 1-3`
3. `wc -l *.txt | head -n 3 | sort -n`
4. `wc -l *.txt | sort -n | head -n 3`
Next lets move on to an interesting topic that you would have encountered in R. The usage of `for loops`.
# Using `curl`
One thing that I have not mentioned is how to download files from the internet using the command line. I often use this command to download software or some text file from a website. Say that we wanted to include a new book in our `data` folder from the Project Gutenberg website, we could do the following,
```{bash}
cd ../research-project/data
curl -s "https://www.gutenberg.org/files/11/11-0.txt" > alice_in_wonderland.txt
ls -a -h
```
We could also perform operations on the file without having to store it on our computer.
```{bash}
curl -s "https://www.gutenberg.org/files/11/11-0.txt" | grep " CHAPTER"
```
Let us clean up the directory.
```{bash}
cd ../research-project/data
rm alice_in_wonderland.txt
```
If you want to explore other ways in which data can be downloaded and transformed, then the following [link](https://datascienceatthecommandline.com/2e/chapter-3-obtaining-data.html) provides is a great read.
# Iteration (for loops)
---
These loops work similar to those in other programming languages. The basic syntax is the following,
```bash
for i in LIST
do
OPERATION $i ## the $ sign indicates a variable in bash
done
```
We can also condense things into a single line by using ";" appropriately.
```bash
for i in LIST; do OPERATION $i; done
```
The second method is more compact and perhaps better to use in the shell. The semicolons indicate line endings. Let us show an example of a for loop in the shell.
```{bash for1a}
for i in 1 2 3 4 5; do echo $i; done
```
We could have also used the brace expansion `{1..n}` instead of writing out all the numbers from `1` to `n`.
```{bash for1b}
for i in {1..5}; do echo $i; done
```
A loop might be useful if we wanted to repeat a certain set of command for each item in a list. Let us go back to our Project Gutenberg example. Suppose we wanted to take the first 8 lines out of each book whose name starts with an `s` (after the initial 9 lines of preamble). We then want code to take 8 lines after the first 9 lines. We actually did something similar to this before,
```{bash repeat_1}
cd ../research-project
head -n 17 data/frankenstein.txt | tail -n 8
```
We can use wildcards to find all the books that start with `s`,
```{bash repeat_2}
cd ../research-project
for filename in data/s*.txt; do head -n 17 $filename | tail -n 8; done
```
# Functions
We can also write functions, similar to that in R and other scripting languages. The following function computes the factorial of some integer that we input. Don't be too concerned about how this works for now, it is simply nice to know that this can be done.
```{bash}
fac() { (echo 1; seq $1) | paste -s -d\* - | bc; }
type -a fac
```
```{bash}
fac() { (echo 1; seq $1) | paste -s -d\* - | bc; }
fac 5
```
# Shell scripting
---
We can explore data and file structure with the shell, but we can also write shell scripts that combine commands. These scripts normally have a `.sh` extension.
I have written a short shell script called `hello.sh`. You can probably guess what the script is going to say / do?
```{bash shell_script_1}
cd ../research-project
cat examples/hello.sh
```
From reading this you might be able to figure out what is does, but let us unpack it.
- `#!/bin/sh` is a *shebang*. This indicates which program should be used to run the command.
- `echo -e "\nHello World!\n"` is the actual command that we would like to run. The `-e` flag tells `bash` to evaluate an expression.
If we want to run this script then we can simply type the filename.
```{bash shell_script_2}
cd ../research-project
bash examples/hello.sh
```
The directory for runnable programs is normally the `bin/` directory. We have one of these in our file structure. Within that directory we also have a file called `book_summary.sh`, which is an empty shell script that we can fill from the shell.
We start by writing to the empty shell script with the following command,
```{bash shell_script_3}
cd ../research-project/bin
echo "head -n 17 ../data/moby_dick.txt | tail -n 8 | grep Author" > book_summary.sh
```
Can you make sense of what is happening here? If you are following along with these commands then you are starting to understand shell commands.
We can run the shell script as before,
```{bash shell_script_4}
cd ../research-project/bin
bash book_summary.sh
```
Shell scripts are especially useful if you have a long list of commands that you want to pipe together. The following is quite unwieldy in the terminal and work much better in a shell script,
```bash
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort |
uniq -c |
sort -nr |
head -n 10
```
Then we can easily call the shell script (in this case `top_words.sh`) and see the result, without having to do all the typing,
```{bash shell_script_5}
cd ../research-project/bin
bash top_words.sh
```
If you want to learn more about shell scripting and best practice for bash scripts I recommend the following resource, located [here](https://earthly.dev/blog/understanding-bash/).
# Editing text files
---
I will say a few words in class about editing text files and text editors that you could potentially use. This is an important topic, but we do not have the scope to fully cover it in this lecture. I mostly use [VS Code](https://code.visualstudio.com/) to edit text documents. However, there are other options like [nano](https://www.nano-editor.org/), [vim](https://www.vim.org/) and [Emacs](https://www.gnu.org/software/emacs/) for those that are interested. You can also use `RStudio`, but I wouldn't really recommend it. There are better text editors with much more functionality.
As always, find what works best for you. Find the space that best fits your style and you will enjoy the journey so much more.
# Make
---
In the early days of computing there were not nice user interfaces that we saw today. Most of the work was done on the command line. This is exactly why the tools on the command line are so well developed. One of the primary things computer scientists wanted to do was to share software, but it was always a bit tricky to install software across different systems. This is where `make` comes in. The idea behind `make` is the following,
1. Download files required for installation into a certain directory.
2. `cd` into that directory.
3. Run the `make` command.
In order to do this one had to construct a `makefile` that contains all the instructions on how different parts of the downloaded components need to interact with each other. The nice thing about `make` is that it allows for a process whereby we can create documents **automatically**. This is amazing for reproducibility in projects! You are essentially writing down the steps whereby things should be executed in order to get to a finished product.
We will show a very easy example of constructing our own `makefile` and in the next lecture we will go into version control and project management in a bit more detail.
We created a `makefile` file in the `journal` directory of the `research-project` folder, using a text editor (in this case VS Code). In the makefile we have the following text,
```bash
draft_journal_entry.txt:
touch draft_journal_entry.txt
```
The general format for `makefile` entries are the following,
```bash
[target]: [dependencies...]
[commands...]
```
The indent in the second line is important. It should be done with the `tab` key. If this whitespace is not used then the command will not execute.
In our simple example above, the **target** is the text file `draft_journal_entry.txt`. This file is created with the `touch` **command** in the second line. We can check the `journal` folder to see the contents.
```{bash}
cd ../research-project/journal
ls
```
Let us know use the `make` command to see if the `makefile` is properly executed.
```{bash}
cd ../research-project/journal
make draft_journal_entry.txt
```
It seems like this has worked properly. We have created a new file with our command in the `makefile`. Let us try running it again and see what happens.
```{bash}
cd ../research-project/journal
make draft_journal_entry.txt
```
Interesting, it tells us that the commands are up to date. This means that there is nothing to be done.
Now let us do something more complex. Let us add a table of contents to our journal,
```{bash}
cd ../research-project/journal
echo "1. 2022-02-11-Lecture-Stellenbosch" > toc.txt
ls
```
Next, let us edit our `makefile` in VS Code to include the following lines,
```bash
readme.txt: toc.txt
echo "This journal contains the following number of entries:" > readme.txt
wc -l toc.txt | grep -P -o "[0-9]+" >> readme.txt
```
Now let us run `make` with the `readme.txt` as the target.
```{bash}
cd ../research-project/journal
make readme.txt
```
To see that this works, let us look at the contents of `readme.txt`.
```{bash}
cd ../research-project/journal
cat readme.txt
```
Now let’s modify `toc.txt` then we’ll try running `make` again.
```{bash}
cd ../research-project/journal
echo "2. 2022-02-12-Chilling-SomersetWest" >> toc.txt