Useful Command Lines
(...)
Something that makes terminal shell commands powerful is the concatenation of commands with the pipe `|`. Let's see it in action:
Let's look at the contents of a `meta.xml` file. First we move to some dr location:
cd /data/biocache-load/dr603
and we can show its contents with `cat`:
cat meta.xml
But if we want to see the contents of an `occurrence.txt` file and the file is very long, it may be more useful to just count its lines:
cat occurrence.txt | wc
3540297 134473483 1548283140
The output shows the number of lines (about 3.5M), words, and bytes in that file. Here we concatenate the output of `cat` to `wc` (the word count command) with the pipe `|`.
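If we only care about the number of lines, `wc -l` prints just that count (the value shown here is the one from the file above):
wc -l occurrence.txt
3540297 occurrence.txt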
We can chain as many commands as we need until we get the result we want. For instance, this command gets only the 10th column (the ids), sorts all the ids, and redirects the output to a file in the `/tmp` directory:
cat occurrence.txt | awk -F $'\t' '{print $10}' | sort > /tmp/dr-603-ids-load.txt
But let's explain this step by step.
Instead of using `cat`, it is often useful to work with only a part of a file. In our previous example, with a file of 3.5 million lines, this is quite handy. So if we do:
head -50 occurrence.txt
we'll see the first 50 lines of that file. This is handy for looking at the header of a file.
The same goes for `tail`:
tail -50 occurrence.txt
which shows the end of that file.
This is also useful for testing commands on a portion of a big file. In the previous long `cat` command we can test with `head` instead of `cat`:
head -5 occurrence.txt | awk -F $'\t' '{print $10}'
This will print the 10th column of the first 5 lines of `occurrence.txt`, with columns separated by TABs (`\t`). Something like:
occurrenceID
3084007342
1090938898
1090938908
3015196328
This is useful to check that we are selecting the column we are interested in, without having to process all 3.5 million lines.
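As a minimal illustration of how `awk -F` splits a line into fields (the sample line here is invented, not taken from the dataset):
printf 'col1\tcol2\tcol3\n' | awk -F $'\t' '{print $2}'  # prints the 2nd TAB-separated field of a made-up line
col2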
When we are sure this is what we want, we can keep chaining other commands with the pipe `|` and replace `head` with `cat` to process the whole file:
cat occurrence.txt | awk -F $'\t' '{print $10}' | sort -n | uniq > /tmp/dr603-sorted-ids.txt
In this case we get the `occurrenceID` of all 3.5 million records, sorted and with duplicates removed, written to the file `/tmp/dr603-sorted-ids.txt`.
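As a side note, `sort` can drop the duplicates itself with `-u`, so for purely numeric ids the following variant should give the same result as piping to `uniq`:
cat occurrence.txt | awk -F $'\t' '{print $10}' | sort -n -u > /tmp/dr603-sorted-ids.txt  # -u keeps only one copy of each line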
Let's compare this with a `biocache` download. We can download a file using `wget https://Some-URL` or `curl https://Some-URL`:
curl -o /tmp/records-2021-06-30.zip https://registros-ws.gbif.es/biocache-download/0cb552f1-6421-3df8-a8bc-7573e6a584f9/1625070232900/records-2021-06-30.zip
With `-o` we indicate where to save the download. Now we move (`cd`) to the `/tmp` directory and we can `unzip` the download:
cd /tmp/
unzip /tmp/records-2021-06-30.zip
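If we just want to check what the archive contains without extracting it, `unzip -l` lists its contents:
unzip -l /tmp/records-2021-06-30.zip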
Let's also get the `occurrenceID` of that download. As that CSV is separated by commas, we can do it like this:
cat records-2021-06-30.csv | awk -F '","' '{print $17}' | sort -n | uniq > /tmp/dr-603-ids-reg-sorted.txt
Here we use the quote-comma-quote sequence (`","`) as the separator, since all fields in the CSV are double quoted.
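A minimal sketch of how that separator splits a double-quoted CSV line (the sample line and field values are invented for the example):
printf '"id-1","Aves","Passer domesticus"\n' | awk -F '","' '{print $2}'  # made-up line; prints the 2nd field, without the surrounding quotes
Aves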
Let's compare the two files of ids we have generated:
comm -23 /tmp/dr-603-ids-reg-sorted.txt /tmp/dr603-sorted-ids.txt > /tmp/ids-only-in-reg.txt
`comm -23` compares the two files and suppresses the ids that appear in both files as well as the ids that are only in the second file, leaving only the ids that are present in the download but not in the loaded dr.
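Note that `comm` expects its inputs sorted in the default lexicographic order, so if `sort -n` and plain `sort` would order your ids differently, `comm` may complain or skip lines. A toy illustration of `comm -23` with made-up files:
printf 'a\nb\nc\n' > /tmp/first-sorted.txt   # made-up sample file
printf 'b\nc\nd\n' > /tmp/second-sorted.txt  # made-up sample file
comm -23 /tmp/first-sorted.txt /tmp/second-sorted.txt
a
Only `a` is printed, because it is the only line present in the first file and not in the second.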
We want to remove the old ids that are no longer present in the dr loaded from our IPT, but for that we need the LA uuid instead of the `occurrenceID`.
First we'll add the double quotes back around the ids:
cat /tmp/ids-only-in-reg.txt | sed 's/^/"/g' | sed 's/$/"/g' > /tmp/ids-only-in-reg-quoted.txt
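The two `sed` expressions anchor to the start (`^`) and end (`$`) of each line and insert a quote there. For example, with one of the ids we saw earlier:
echo 3084007342 | sed 's/^/"/g' | sed 's/$/"/g'
"3084007342"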
Now we can use those quoted ids to search again in our records CSV. For this we use `grep`:
grep -Ff /tmp/ids-only-in-reg-quoted.txt /tmp/records-2021-06-30.csv > /tmp/to-delete.csv
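Here `-F` makes `grep` treat each pattern as a plain fixed string instead of a regular expression, and `-f` reads one pattern per line from the given file, so every CSV line containing one of the quoted ids is kept. A toy run with invented data:
printf '"a"\n"c"\n' > /tmp/patterns.txt  # made-up pattern file
printf '"a","x"\n"b","y"\n"c","z"\n' | grep -Ff /tmp/patterns.txt
"a","x"
"c","z"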
Now we can get the LA ids (the 8th field):
cat /tmp/to-delete.csv | awk -F '","' '{print $8}' > /tmp/dr603-ids-to-delete.txt
head /tmp/dr603-ids-to-delete.txt
We obtain:
8a863029-f435-446a-821e-275f4f641165
264e6a66-9c9e-4115-9aec-29d694c68097
8a863029-f435-446a-821e-275f4f641165
8a863029-f435-446a-821e-275f4f641165
8a863029-f435-446a-821e-275f4f641165
8a863029-f435-446a-821e-275f4f641165
And now we can delete those ids with `biocache`, removing them from `solr` and `cassandra`:
biocache delete-records -f /tmp/dr603-ids-to-delete.txt
TODO