Column order and decimal point changes with awk and sed

Recently, I’ve been confronted with a simple problem that I usually solve in a spreadsheet application. In a text file, change the order of columns and shift from “,” to “.” as a decimal point separator. However, since I had many similarly formatted text files and did want to speed up the conversion, I searched for tools that could help me do theses simple tasks without too much hassle. Welcome awk, a programming language and tool for text processing and sed, a line-by-line editor, both available by default in MacOS X and Linux. As usual, stackexchange answers and question were extremely useful to quickly find a solution.

While I was awking happily around, I became aware of a problem that I did not expect – it uses line feed characters (LF) as line terminator and if the file comes from Window, it has carriage return (CR) and LF at the end of lines. Thus, the first step needed to get a clean file, was to remove those annoying CR (well visible in the following screenshot, on Geany, a fantastic text editor):

inputfile

This can be done with the following awk command that removes CR characters while leaving LF in the file:

awk '{ sub(/\r$/,""); print }' infile.txt >outfile.txt

The result is, as expected:

crremoved

Now, we can proceed with the following step, which is a change in column order. What I needed was to move 4th column in second position. Awk comes to the rescue here as well:

awk -F\t '{print $1,$4,$2,$3}' OFS=$'\t' infile.txt > outfile.txt

The result looks good, no more CRs and the order of columns is fine:

columnorderchanged

Finally, the numeric values that used “,” as decimal separator were not correctly interpreted by the clustering program. However, changing all the commas to dots was not very nice, because column 2 now contains useful text commas. Sed provided a very simple command to do that:

sed 's/\([0-9]\)\,\([0-9]\)/\1.\2/g' < infile.txt > outfile.txt

To understand how sed does its thing, one must be familiar with regular expressions.

In the end, the file looks exactly as I wanted it to be:

commadotchanged

There is no need to keep intermediate files and these commands can be chained using the “|” pipe operator. Alternatively, they can be put together in a small shell script.

Advertisements