Hackery

CSV wrangling with App::CCSV

Tue 02 Sep 2014

Well past time to get back on the blogging horse.

I'm now working on a big data web mining startup, and spending an inordinate amount of time buried in large data files, often some variant of CSV.

My favourite new tool over the last few months is is Karlheinz Zoechling's App::CCSV perl module, which lets you do some really powerful CSV processing using perl one-liners, instead of having to write a trivial/throwaway script.

If you're familiar with perl's standard autosplit functionality (perl -a) then App::CCSV will look pretty similar - it autosplits its input into an array on your CSV delimiters for further processing. It handles embedded delimiters and CSV quoting conventions correctly, though, which perl's standard autosplitting doesn't.

App::CCSV uses @f to hold the autosplit fields, and provides utility functions csay and cprint for doing say and print on the CSV-joins of your array. So for example:

# Print just the first 3 fields of your file
perl -MApp::CCSV -ne 'csay @f[0..2]' < file.csv

# Print only lines where the second field is 'Y' or 'T'
perl -MApp::CCSV -ne 'csay @f if $f[1] =~ /^[YT]$/' < file.csv

# Print the CSV header and all lines where field 3 is negative
perl -MApp::CCSV -ne 'csay @f if $. == 1 || ($f[2]||0) < 0' < file.csv

# Insert a new country code field after the first field
perl -MApp::CCSV -ne '$cc = get_country_code($f[0]); csay $f[0],$cc,@f[1..$#f]' < file.csv

App::CCSV can use a config file to handle different kinds of CSV input. Here's what I'm using, which lives in my home directory in ~/.CCSVConf:

<CCSV>
sep_char ,
quote_char """
<names>
  <comma>
    sep_char ","
    quote_char """
  </comma>
  <tabs>
    sep_char "  "
    quote_char """
  </tabs>
  <pipe>
    sep_char "|"
    quote_char """
  </pipe>
  <commanq>
    sep_char ","
    quote_char ""
  </comma>
  <tabsnq>
    sep_char "  "
    quote_char ""
  </tabs>
  <pipenq>
    sep_char "|"
    quote_char ""
  </pipe>
</names>
</CCSV>

That just defines two sets of names for different kinds of input: comma, tabs, and pipe for [,\t|] delimiters with standard CSV quote conventions; and three nq ("no-quote") variants - commanq, tabsnq, and pipenq - to handle inputs that aren't using standard CSV quoting. It also makes the comma behaviour the default.

You use one of the names by specifying it when loading the module, after an =:

perl -MApp::CCSV=comma ...
perl -MApp::CCSV=tabs ...
perl -MApp::CCSV=pipe ...

You can also convert between formats by specifying two names, in <input>,<output> format e.g.

perl -MApp::CCSV=comma,pipe ...
perl -MApp::CCSV=tabs,comma ...
perl -MApp::CCSV=pipe,tabs ...

And just to round things off, I have a few aliases defined in my bashrc file to make these even easier to use:

alias perlcsv='perl -CSAD -MApp::CCSV'
alias perlpsv='perl -CSAD -MApp::CCSV=pipe'
alias perltsv='perl -CSAD -MApp::CCSV=tabs'
alias perlcsvnq='perl -CSAD -MApp::CCSV=commanq'
alias perlpsvnq='perl -CSAD -MApp::CCSV=pipenq'
alias perltsvnq='perl -CSAD -MApp::CCSV=tabsnq'

That simplifies my standard invocation to something like:

perlcsv -ne 'csay @f[0..2]' < file.csv

Happy data wrangling!