Posted: May 31, 2021 - Updated: April 10, 2024
Small recipes for manipulation with BIG files using *nix (linux and macOS) tooling.
Some of these tools may be installed on Windows using a lot of various ways, the most easy probably will be:
Correct:
grep -c ^ file.txt
note: command wc -l
that is first google result count new-line characters, and this is not working in this case:
1 This is the first line.\n
2 This is the second line.
In this case we will use sed
command. Logic behind this command is: it has two buffers. The first one is used for reading input file line by line, the second contains current line and the parametrized operation executed on that buffer. In default mode after each line the result of second buffer is printed to out before and after, but this can be disabled by -n
(longer version --quiet
), for example if we want to print only required lines.
Command is optional parameter, it can contain ReqExp or line number for identifying lines of interest. For this task we need two commands:
p
- printq
- quit processing after that linebefore the command we need to provide lines addressing. General rules are:
I
- case-insensitive and so on;Examples:
Show lines by number:
```bash
sed -n '10000000,10000020p; 10000021q' filename
sed -n '10000000,+20p' filename
There exists two very specialized tools in *nix the simply took either head
or tail
of the file. These works much faster then sed
but the capabilities are more limited.
head
parameters:
-n <number>
or -<number>
took only given amount of first lines--lines=-<number>
took all first lines of file except given amount of latest(because of minus)tail
parameters:
-n <number>
or -<number>
took only given amount of last lines-n +<number>
or +<number>
took all last lines of file except given amount of first(because of plus)So, combining these, we can print lines of interest from file. Example:
# print 10 lines staring from 2nd:
tail +2 sample.txt | head -10
# print lines 10-12:
tail +10 sample.txt | head -2
# Copy from one file to another with skipping header
cat in.csv | tail -n +2 >> out.csv
Speed comparison:
> time tail +50000000 bigfile.csv | head -2
< real 0m4,291s
< user 0m1,560s
< sys 0m2,700s
> time sed -n '50000000,+2p;50000001q' bigfile.csv
> real 0m7,080s
> user 0m4,468s
> sys 0m2,608s
So, according to this tail and head combination is as twice as faster.
sed
For sed
there two important option for this:
-i
parameter means inplace edition (otherwise instead of file edition this will print out without deleted line)d
command means delete that lineExamples:
# delete line number 10
sed -i '10d' sample.txt
# delete first line
sed -i '1d' sample.txt
example:
tar cf - bigfile.csv | pv -s $(du -b bigfile.csv | awk '{print $1}') | gzip > bigfile.csv.tar.gz