Sat 06 Nov 2021
Tags: linux, sysadmin
GNU sort is an excellent
utility that is a mainstay of the linux command line. It has all kinds
of tricks up its sleeves, including support for uniquifying records,
stable sorts, files larger than memory, parallelisation, controlled
memory usage, etc. Go read the
man page
for all the gory details.
It also supports sorting with field separators, but unfortunately this
support has some nasty traps for the unwary. Hence this post.
First, GNU sort cannot do general sorts of CSV-style datasets, because
it doesn't understand CSV-features like quoting rules, quote-escaping,
separator-escaping, etc. If you have very simple CSV files that don't
do any escaping and you can avoid quotes altogether (or always use
them), you might be able to use GNU sort - but it can get difficult
fast.
Here I'm only interested in very simple delimited files - no quotes
or escaping at all. Even here, though, there are some nasty traps to
watch out for.
Here's a super-simple example file with just two lines and three fields,
called dsort.csv
:
$ cat dsort.csv
a,b,c
a,b+c,c
If we do a vanilla sort on this file, we get the following (I'm also
running it through md5sum
to highlight when the output changes):
$ sort dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f -
The longer line sorts before the shorter line because the '+' sign
collates before the second comma in the short line - this is sorting
on the whole line, not on the individual fields.
Okay, so if I want to do an individual field sort, I can just use the -t
option, right? You would think so, but unfortunately:
$ sort -t, dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f -
Huh? Why doesn't that sort the short line first, like we'd expect?
Maybe it's not sorting on all the fields or something? Do I need to
explicitly include all fields? Let's see:
$ sort -t, -k1,3 dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f -
Huh? What the heck is going on here?
It turns out this unintuitive behaviour is because of the way sort
interprets the -k
option - -kM,N
(where M != N
) doesn't mean
'sort by field M, then field M+1,... then by field N', it means instead
'join all fields from M to N (with the field separator?), and sort by
that'. Ugh!
So I just need to specify the fields individually? Unfortunately, even
that's not enough:
$ sort -t, -k1 -k2 -k3 dsort.csv | tee /dev/stderr | md5sum
a,b+c,c
a,b,c
5efd74fa9bef453dd477ec9acb2cef5f -
This is because the first option here - -k1
is interpreted as -k1,3
(since the last field is '3'), because the default 'end-field' is the
last. Double-ugh!
So the takeaway is: if you want an individual-field sort you have to
specify every field individually, AND you have to use -kN,N
syntax,
like so:
$ sort -t, -k1,1 -k2,2 -k3,3 dsort.csv | tee /dev/stderr | md5sum
a,b,c
a,b+c,c
493ce7ca60040fa184f1bf7db7758516 -
Yay, finally what we're after!
Also, unfortunately, there doesn't seem to be a generic way of
specifying 'all fields' or 'up to the last field' or 'M-N' fields -
you have to specify them all individually. It's verbose and ugly, but
it works.
And for some good news, you can use sort suffixes on those individual
options (like n
for numerics, r
for reverse sorts, etc.) just fine.
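For example, to sort on field 1 ascending and then field 3 as a reverse numeric
sort (assuming, hypothetically, a file with a numeric third field):
$ sort -t, -k1,1 -k3,3nr data.csv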
Happy sorting!
Sun 26 Sep 2021
Tags: golang, software engineering
So this weekend I scratched a long-standing itch and whipped up
a little utility for colourising
output. It's called ctap
and is now available on
Github.
I have a bunch of processes that spit out TAP output at various
points as tests are run on outputs, and having the output
coloured for fast scanning is a surprisingly significant UX
improvement.
I've been a happy user of the javascript
tap-colorize
utility for quite a while, and it does the job nicely.
Unfortunately, it's not packaged (afaict) on either CentOS or
Ubuntu, which means you typically have to install from source
via npm
, which is fine on a laptop, but a PITA on a server.
And though I'm pretty comfortable building from source,
I've always found node packages to be more bother than they're
worth.
Which brings us back to ctap this weekend. Given how simple the
TAP protocol is, I figured building an MVP colouriser couldn't
take more than an hour or two - and so it turned out.
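(Just to illustrate how simple the protocol is - and not how ctap itself works -
even a few lines of shell with ANSI escape codes can do a crude version of the
colourising:)
# Crude sketch only: green for 'ok' lines, red for 'not ok' lines
colour_tap() {
  sed -e $'s/^not ok.*/\e[31m&\e[0m/' -e $'s/^ok.*/\e[32m&\e[0m/'
}
# e.g. prove -v t/ | colour_tap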
But what's really nice is all the modern tooling that provides
you with impressive boosts as you go these days. On this project
that included:
These are fun times to be a software engineer!
Fri 06 Aug 2021
Tags: bash, linux
Here are a few bash functions that I find myself using all the time.
Functions are great where you have something that's slightly more
complex than an alias, or wants to parse out its arguments, but
isn't big enough to turn into a proper script. Drop these into your
~/.bashrc
file (and source ~/.bashrc
) and you're good to go!
Hope one or two of these are helpful/interesting.
1. ssht
Only came across this one fairly recently, but it's nice - you can
combine the joys of ssh
and tmux
to drop you automagically into
a given named session - I use sshgc
(with my initials), so as
not to clobber anyone else's session. (Because ssh
and then
tmux attach
is so much typing!)
ssht() {
local SESSION_NAME=sshgc
command ssh -t "$1" tmux new-session -A -s $SESSION_NAME
}
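Usage (where 'myhost' is whatever host you're connecting to):
# ssh to myhost and attach to (or create) the 'sshgc' tmux session
ssht myhost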
2. lead
lead
is a combination of ls -l
and head
, showing you the most
recent N files in the given (or current) directory:
lead() {
if [[ "$2" =~ ^[0-9]+$ ]]; then
command ls -lt "$1" | head -n $2
else
# This version works with multiple args or globs
command ls -lt "$@" | head -n 30
fi
}
Or if you're using exa instead of ls
,
you can use:
lead() {
if [[ "$2" =~ ^[0-9]+$ ]]; then
command exa -l -s newest -r --git --color always "$1" | head -n $2
else
command exa -l -s newest -r --git --color always "$@" | head -n 30
fi
}
Usage:
# Show the 30 most recent items in the current directory
lead
# Show the 30 most recent items in the given directory
lead /etc
# Show the 50 most recent items in the current directory
lead . 50
# Show the most recent items beginning with `abc`
lead abc*
3. l1
This ("lowercase L - one", in case it's hard to read) is similar
in spirit to lead
, but it just returns the filename of the most
recently modified item in the current directory.
l1() {
command ls -t | head -n1
}
This can be used in places where you'd use bash's !$
e.g. to edit
or view some file you just created:
solve_the_meaning_of_life >| meaning.txt
cat !$
42!
# OR: cat `l1`
But l1
can also be used in situations where the filename isn't
present in the previous command. For instance, I have a script that
produces a pdf invoice from a given text file, where the pdf name is
auto-derived from the text file name. With l1
, I can just do:
invoice ~/Invoices/cust/Hours.2107
evince `l1`
4. xtitle
This is a cute hack that lets you set the title of your terminal to
the first argument that doesn't begin with a '-':
function xtitle() {
if [ -n "$DISPLAY" ]; then
# Try and prune arguments that look like options
while [ "${1:0:1}" == '-' ]; do
shift
done
local TITLE=${1:-${HOSTNAME%%.*}}
echo -ne "\033]0;"$TITLE"\007"
fi
}
Usage:
# Set your terminal title to 'foo'
xtitle foo
# Set your terminal title to the first label of your hostname
xtitle
I find this nice to use with ssh
(or incorporated into ssht
above) e.g.
function sshx() {
xtitle "$@"
command ssh -t "$@"
local RC=$?
xtitle
return $RC
}
This (hopefully) sets your terminal title to the hostname you're ssh-ing
to, and then resets it when you exit.
5. line
This function lets you select a particular line or set of lines from a
text file:
function line() {
# Usage: line <line> [<window>] [<file>]
local LINE=$1
shift
local WINDOW=1
local LEN=$LINE
if [[ "$1" =~ ^[0-9]+$ ]]; then
WINDOW=$1
LEN=$(( $LINE + $WINDOW/2 ))
shift
fi
head -n "$LEN" "$@" | tail -n "$WINDOW"
}
Usage:
# Selecting from a file with numbered lines:
$ line 5 lines.txt
This is line 5
$ line 5 3 lines.txt
This is line 4
This is line 5
This is line 6
$ line 10 6 lines.txt
This is line 8
This is line 9
This is line 10
This is line 11
This is line 12
This is line 13
And a bonus alias:
alias bashrc="$EDITOR ~/.bashrc && source ~/.bashrc"
Tue 13 Apr 2021
Tags: web, serverless, gcp
Let's say you have a list of URLs you need to fetch for some reason -
perhaps to check that they still exist, perhaps to parse their content
for updates, whatever.
If the list is small - say up to 1000 urls - this is pretty easy to do
using just curl(1)
or wget(1)
e.g.
INPUT=urls.txt
wget --execute robots=off --adjust-extension --convert-links \
--force-directories --no-check-certificate --no-verbose \
--timeout=120 --tries=3 -P ./tmp --warc-file=${INPUT%.txt} \
-i "$INPUT"
This iterates over all the urls in urls.txt
and fetches them one by
one, capturing them in WARC format.
Easy.
But if your url list is long - thousands or millions of urls - this is
going to be too slow to be practical. This is a classic
Embarrassingly Parallel
problem, so to make this scalable the obvious solution is to split your
input file up and run multiple fetches in parallel, and then merge your
output files (i.e. a kind of map-reduce job).
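For instance, chunking the input locally might look something like this
(filenames are just illustrative):
# Split urls.txt into chunks of 300 urls each (input-aa, input-ab, ...)
split -l 300 urls.txt input-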
But then your problem becomes that you need to run this on multiple
machines, and setting up and managing and tearing down those machines
becomes the core of the problem. But really, you don't want to worry
about machines, you just want an operating system instance available
that you can make use of.
This is the promise of so-called
serverless
architectures such as AWS "Lambda" and Google Cloud's "Cloud Functions",
which provide a container-like environment for computing, without
actually having to worry about managing the containers. The serverless
environment spins up instances on demand, and then tears them down
after a fixed period of time or when your job completes.
So to try out this serverless paradigm on our web fetch problem, I've
written cloudfunc-geturilist,
a Google Cloud Platform "Cloud Function" written in go, that is
triggered by input files being written into an input Google Cloud
Storage bucket, and writes its output files to another GCS output
bucket.
See the README instructions if you'd like to try it out (which you can
do using a GCP free tier account).
In terms of scalability, this seems to work pretty well. The biggest
file I've run so far has been 100k URLs, split into 334 input files
each containing 300 URLs. With MAX_INSTANCES=20
, cloudfunc-geturilist
processes these 100k URLs in about 18 minutes; with MAX_INSTANCES=100
that drops to 5 minutes. All at a cost of a few cents.
That's a fair bit quicker than having to run up 100 container instances
myself, or than using wget!
Wed 17 Jun 2020
Tags: perl, golang
I started using perl back in 1996, around version 5.003, while working at UBC
in Vancouver. We were working on a student management system at the time,
written in C, interfacing to an Oracle database. We started experimenting with
this Common Gateway Interface thing (CGI) that had recently appeared, and let
you write interactive applications on the web (!). perl was the tool of
choice for CGI, and we were able to talk to Oracle using a perl module that
spoke Oracle Call Interface (OCI).
That project turned out to be pretty successful, and I've been developing in
perl ever since. Perl has a reputation for being obscure and abstruse, but I
find it a lovely language - powerful and expressive. Yes it's probably too
easy to write bad/unreadable perl, but it's also straightforward to write
elegant, readable perl. I routinely pick up code I wrote 5 years ago and have
no problems reading it again.
That said, perl is showing its age in places, and over the last few years
I've also been using other languages where different project requirements
made that make sense. I've written significant code in C, Java, python,
ruby, and javascript/nodejs, but for me none of them were sufficiently
attractive to threaten perl as my language of choice.
About a year ago I started playing with Go at $dayjob, partly interested
in the performance gains of a compiled language, and partly to try out the
concurrency features I'd heard about. Learning a new language is always
challenging, but Go's small footprint, consistency, and well-written
libraries really made picking it up pretty straightforward.
And for me the killer feature is this: Go is the only language besides
perl in which I regularly find myself writing a good chunk of code, getting
it syntactically correct, and then testing it and finding that it Just
Works, first time. The friction between thinking and coding seems low enough
(at least the way my brain works) that I can formulate what I'm thinking
with a pretty high chance of getting it right. I still get surprised when it
happens, but it's great when it does!
Other things help too - it's crazy fast, especially coming from mostly
interpreted languages recently; the concurrency stuff really is nice, and
lets you think about concurrent flows pretty intuitively; and lots of
the language decisions like formatting, tooling, and composition just seem
to sit pretty well with me.
So while I'm still very happy writing perl, especially for less
performance-intensive applications, I'm now a happy little Go developer
as well, and enjoying exploring some more advanced corners of a new
language home.
Yay Go!
If you're interested in learning Go, the online docs are pretty great: start
with the Tour of Go and
How to write Go code, and then read
Effective Go.
If you're after a book-length treatment, the standard text is
The Go Programming Language by Donovan and Kernighan.
It's excellent but pretty dense, more textbook than tutorial.
I've also read Go in Practice,
which is more accessible and cookbook-style. I thought it was okay, and learnt
a few things, but I wouldn't go out of your way for it.
Sun 05 May 2019
Tags: linux, centos, networking
Recently had to setup a few servers that needed dual upstream gateways,
and used an ancient blog post
I wrote 11 years ago (!) to get it all working. This time around I hit a
gotcha that I hadn't noted in that post, and used a simpler method to
define the rules, so this is an updated version of that post.
Situation: you have two upstream gateways (gw1
and gw2
) on separate
interfaces and subnets on your linux server. Your default route is via gw1
(so all outward traffic, and most incoming traffic goes via that), but you
want to be able to use gw2
as an alternative ingress pathway, so that
packets that have come in on gw2
go back out that interface.
(Everything below done as root, so sudo -i
first if you need to.)
1) First, define a few variables to make things easier to modify/understand:
# The device/interface on the `gw2` subnet
GW2_DEV=eth1
# The ip address of our `gw2` router
GW2_ROUTER_ADDR=172.16.2.254
# Our local ip address on the `gw2` subnet i.e. $GW2_DEV's address
GW2_LOCAL_ADDR=172.16.2.10
2) The gotcha I hit was that 'strict reverse-path filtering' in the kernel
will drop all asymmetrically routed packets entirely, which will kill our response
traffic. So the first thing to do is make sure that is either turned off
or set to 'loose' instead of 'strict':
# Check the rp_filter setting for $GW2_DEV
# A value of '0' means rp_filtering is off, '1' means 'strict', and '2' means 'loose'
$ cat /proc/sys/net/ipv4/conf/$GW2_DEV/rp_filter
1
# For our purposes values of either '0' or '2' will work. '2' is slightly
# more conservative, so we'll go with that.
echo 2 > /proc/sys/net/ipv4/conf/$GW2_DEV/rp_filter
$ cat /proc/sys/net/ipv4/conf/$GW2_DEV/rp_filter
2
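To make that setting survive a reboot, you'd also typically persist it via
sysctl, something like this (adjust the filename to taste):
echo "net.ipv4.conf.$GW2_DEV.rp_filter = 2" >> /etc/sysctl.d/90-gw2.conf
sysctl -p /etc/sysctl.d/90-gw2.conf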
3) Define an extra routing table called gw2
e.g.
$ cat /etc/iproute2/rt_tables
#
# reserved values
#
255 local
254 main
253 default
0 unspec
#
# local tables
#
102 gw2
#
4) Add a default route via gw2
(here 172.16.2.254) to the gw2
routing table:
$ echo "default table gw2 via $GW2_ROUTER_ADDR" > /etc/sysconfig/network-scripts/route-${GW2_DEV}
$ cat /etc/sysconfig/network-scripts/route-${GW2_DEV}
default table gw2 via 172.16.2.254
5) Add an iproute 'rule' saying that packets that come in on our $GW2_LOCAL_ADDR
should use routing table gw2
:
$ echo "from $GW2_LOCAL_ADDR table gw2" > /etc/sysconfig/network-scripts/rule-${GW2_DEV}
$ cat /etc/sysconfig/network-scripts/rule-${GW2_DEV}
from 172.16.2.10 table gw2
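(If you just want to test all this without the persistent network-scripts
files, the equivalent runtime commands would be something like:)
ip route add default via $GW2_ROUTER_ADDR dev $GW2_DEV table gw2
ip rule add from $GW2_LOCAL_ADDR table gw2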
6) Take $GW2_DEV down and back up again, and test:
$ ifdown $GW2_DEV
$ ifup $GW2_DEV
# Test that incoming traffic works as expected e.g. on an external server
$ ssh -v server-via-gw2
For more, see:
Wed 13 Feb 2019
Tags: sysadmin, nginx
Solved an interesting problem this week using nginx.
We have an internal nginx webserver for distributing datasets with
dated filenames, like foobar-20190213.tar.gz
. We also create a symlink
called foobar-latest.tar.gz
, that is updated to point to the latest
dataset each time a new version is released. This allows users to just
use a fixed url to grab the latest release, rather than having to scrape
the page to figure out which version is the latest.
Which generally works well. However, one wrinkle is that when you download
via the symlink you end up with a file named with the symlink filename
(foobar-latest.tar.gz
), rather than a dated one. For some use cases this
is fine, but for others you actually want to know what version of the dataset
you are using.
What would be ideal would be a way to tell nginx to handle symlinks differently
from other files. Specifically, if the requested file is a symlink, look up the
file the symlink points to and issue a redirect to request that file. So you'd
request foobar-latest.tar.gz
, but you'd then be redirected to
foobar-20190213.tar.gz
instead. This gets you the best of both worlds - a
fixed url to request, but a dated filename delivered. (If you don't need dated
filenames, of course, you just save to a fixed name of your choice.)
Nginx doesn't support this functionality directly, but it turns out it's pretty
easy to configure - at least as long as your symlinks are strictly local (i.e.
your target and your symlink both live in the same directory), and as long as you
have the nginx embedded perl module included in your nginx install (the one from
RHEL/CentOS EPEL does, for instance.)
Here's how:
1. Add a couple of helper directives in the http
context (that's outside/as
a sibling to your server
section):
# Setup a variable with the dirname of the current uri
# cf. https://serverfault.com/questions/381545/how-to-extract-only-the-file-name-from-the-request-uri
map $uri $uri_dirname {
~^(?<capture>.*)/ $capture;
}
# Use the embedded perl module to return (relative) symlink target filenames
# cf. https://serverfault.com/questions/305228/nginx-serve-the-latest-download
perl_set $symlink_target_rel '
sub {
my $r = shift;
my $filename = $r->filename;
return "" if ! -l $filename;
my $target = readlink($filename);
$target =~ s!^.*/!!; # strip path (if any)
return $target;
}
';
2. In a location
section (or similar), just add a test on $symlink_target_rel
and issue a redirect using the variables we defined previously:
location / {
autoindex on;
# Symlink redirects FTW!
if ($symlink_target_rel != "") {
# Note this assumes that your symlink and target are in the same directory
return 301 https://www.example.com$uri_dirname/$symlink_target_rel;
}
}
Now when you make a request to a symlinked resource you get redirected instead to
the target, but everything else is handled using the standard nginx pathways.
$ curl -i -X HEAD https://www.example.com/foobar/foobar-latest.tar.gz
HTTP/1.1 301 Moved Permanently
Server: nginx/1.12.2
Date: Wed, 13 Feb 2019 05:23:11 GMT
Location: https://www.example.com/foobar/foobar-20190213.tar.gz
Fri 31 Aug 2018
Tags: linux, sysadmin
(Updated April 2020: added new no. 7 after being newly bitten...)
incron
is a useful little cron-like utility that lets you run arbitrary jobs
(like cron
), but instead of being triggered at certain times, your
jobs are triggered by changes to files or directories.
It uses the linux kernel inotify
facility (hence the name), and so it isn't cross-platform, but on linux
it can be really useful for monitoring file changes or uploads, reporting
or forwarding based on status files, simple synchronisation schemes, etc.
Again like cron
, incron
supports the notion of job 'tables' where
commands are configured, and users can manage their own tables
using an incrontab
command, while root can manage multiple system
tables.
So it's a really useful linux utility, but it's also fairly old (the
last release, v0.5.10, is from 2012), doesn't appear to be under
active development any more, and it has a few frustrating quirks that
can make using it unnecessarily difficult.
So this post is intended to highlight a few of the 'gotchas' I've
experienced using incron
:
You can't monitor recursively i.e. if you create a watch on a
directory incron will only be triggered on events in that
directory itself, not in any subdirectories below it. This isn't
really an incron issue since it's a limitation of the underlying
inotify
mechanism, but it's definitely something you'll want
to be aware of going in.
The incron
interface is enough like cron
(incrontab -l
,
incrontab -e
, man 5 incrontab
, etc.) that you might think
that all your nice crontab features are available. Unfortunately
that's not the case - most significantly, you can't have comments
in incron tables (incron
will try and parse your comment lines and
fail), and you can't set environment variables to be available for
your commands. (This includes PATH, so you might need to explicitly
set a PATH inside your incron scripts if you need non-standard
locations. The default PATH is documented as
/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin
.)
That means that cron
MAILTO
support is not available, and in
general there's no easy way of getting access to the stdout or
stderr of your jobs. You can't even use shell redirects in your
command to capture the output (e.g. echo $@/$# >> /tmp/incron.log
doesn't work). If you're debugging, the best you can do is add a
layer of indirection by using a wrapper script that does the
redirection you need (e.g. echo $1 >> /tmp/incron.log 2>&1
)
and calling the wrapper script in your incrontab with the incron
arguments (e.g. debug.sh $@/$#
). This all makes debugging
misbehaving commands pretty painful. The main place to check if
your commands are running is the cron log (/var/log/cron
) on
RHEL/CentOS, and syslog (/var/log/syslog
) on Ubuntu/Debian.
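Here's a minimal sketch of the kind of debug wrapper I mean (path and filename
are just examples):
#!/bin/sh
# /usr/local/bin/incron-debug.sh - capture incron job output for debugging
# Called from an incrontab line ending in: /usr/local/bin/incron-debug.sh $@/$#
exec >> /tmp/incron.log 2>&1
echo "$(date) triggered on: $1"
# ... real command goes here ...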
incron is also very picky about whitespace in your incrontab.
If you put more than one space (or a tab) between the inotify
masks and your command, you'll get an error in your cron log
saying cannot exec process: No such file or directory
, because
incron will have included everything after the first space as part
of your command e.g. (gavin) CMD ( echo /home/gavin/tmp/foo)
(note the evil space before the echo
).
It's often difficult (and non-intuitive) to figure out what inotify
events you want to trigger on in your incrontab masks. For instance,
does 'IN_CREATE' get fired when you replace an existing file with a
new version? What events are fired when you do a mv
or a cp
?
If you're wanting to trigger on an incoming remote file copy, should
you use 'IN_CREATE' or 'IN_CLOSE_WRITE'? In general, you don't want to guess,
you actually want to test and see what events actually get fired on
the operations you're interested in. The easiest way to do this is
use inotifywait
from the inotify-tools
package, and run it using
inotifywait -m <dir>
, which will report to you all the inotify
events that get triggered on that directory (hit <Ctrl-C>
to exit).
The "If you're wanting to trigger on an incoming remote file copy,
should you use 'IN_CREATE' or 'IN_CLOSE_WRITE'?" above was a trick
question - it turns out it depends how you're doing the copy! If
you're just doing a simple copy in-place (e.g. with scp
), then
(assuming you want the completed file) you're going to want to trigger
on 'IN_CLOSE_WRITE', since that's signalling all writing is complete and
the full file will be available. If you're using a vanilla rsync
,
though, that's not going to work, as rsync does a clever
write-to-a-hidden-file trick, and then moves the hidden file to
the destination name atomically. So in that case you're going to want
to trigger on 'IN_MOVED_TO', which will give you the destination
filename once the rsync is completed. So again, make sure you test
thoroughly before you deploy.
Though cron works fine with symlinks to crontab files (in e.g.
/etc/cron.d
), incron doesn't support this in /etc/incron.d
-
symlinks just seem to be quietly ignored. (Maybe this is for
security, but it's not documented, afaict.)
Have I missed any? Any other nasties bitten you using incron
?
Sat 28 Jul 2018
Tags: sysadmin, mongodb
I've been doing a few upgrades of old standalone (not replica set)
mongodb databases lately, taking them from 2.6, 3.0, or 3.2 up to 3.6.
Here's my upgrade process on RHEL/CentOS, which has been working pretty
smoothly (cf. the mongodb notes here: https://docs.mongodb.com/manual/tutorial/change-standalone-wiredtiger/).
First, the WiredTiger storage engine (the default since mongodb 3.2)
"strongly" recommends using the xfs
filesystem on linux, rather than
ext4
(see https://docs.mongodb.com/manual/administration/production-notes/#prod-notes-linux-file-system
for details). So the first thing to do is reorganise your disk to make
sure you have an xfs filesystem available to hold your upgraded database.
If you have the disk space, this may be reasonably straightforward; if
you don't, it's a serious PITA.
Once your filesystems are sorted, here's the upgrade procedure.
1. Take a full mongodump
of all your databases
cd /data # Any path with plenty of disk available
for DB in db1 db2 db3; do
mongodump -d $DB -o mongodump-$DB-$(date +%Y%m%d)
done
2. Shut the current mongod
down
systemctl stop mongod
# Save the current mongodb.conf for reference
mv /etc/mongodb.conf /etc/mongod.conf.rpmsave
3. Hide the current /var/lib/mongo directory to avoid getting confused later.
Create your new mongo directory on the xfs filesystem you've prepared e.g. /opt
.
cd /var/lib
mv mongo mongo-old
# Create new mongo directory on your (new?) xfs filesytem
mkdir /opt/mongo
chown mongod:mongod /opt/mongo
4. Upgrade to mongo v3.6
vi /etc/yum.repos.d/mongodb.repo
# Add the following section, and disable any other mongodb repos you might have
[mongodb-org-3.6]
name=MongoDB 3.6 Repository
baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/3.6/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-3.6.asc
# Then do a yum update on the mongodb packages
yum update mongodb-org-{server,tools,shell}
5. Check/modify the new mongod.conf
. See https://docs.mongodb.com/v3.6/reference/configuration-options/
for all the details on the 3.6 config file options. In particular, dbPath
should point
to the new xfs-based mongo directory you created in (3) above.
vi /etc/mongod.conf
# Your 'storage' settings should look something like this:
storage:
dbPath: /opt/mongo
journal:
enabled: true
engine: "wiredTiger"
wiredTiger:
collectionConfig:
blockCompressor: snappy
indexConfig:
prefixCompression: true
6. Restart mongod
systemctl daemon-reload
systemctl enable mongod
systemctl start mongod
systemctl status mongod
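A quick sanity check at this point might look something like:
# Confirm the new server version and storage engine
mongo --eval 'db.version()'
mongo --eval 'printjson(db.serverStatus().storageEngine)'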
7. If all looks good, reload the mongodump
data from (1):
cd /data
for DB in db1 db2 db3; do
mongorestore --drop mongodump-$DB-$(date +%Y%m%d)
done
All done!
These are the basics anyway. This doesn't cover configuring access control on your
new database, or wrangling SELinux permissions on your database directory, but if
you're doing those currently you should be able to figure those out.
Fri 09 Feb 2018
Tags: sysadmin, linux, centos, sftp
Had to setup a new file transfer host recently, with the following requirements:
- individual login accounts required (for customers, no anonymous access)
- support for (secure) downloads, ideally via a browser (no special software required)
- support for (secure) uploads, ideally via sftp (most of our customers are familiar with ftp)
Our target was RHEL/CentOS 7, but this should transfer to other linuxes pretty
readily.
Here's the schema we ended up settling on, which seems to give us a good mix of
security and flexibility.
- use apache with HTTPS and PAM with local accounts, one per customer, and
nologin
shell accounts
- users have their own groups (group=
$USER
), and also belong to the sftp
group
- we use the
users
group for internal company accounts, but NOT for customers
- customer data directories live in /data
- we use a 3-layer hierarchy for security: user home directories are
/data/chroot_$USER/$USER, and accounts are created with a nologin shell
- the /data/chroot_$USER directory must be owned by root:$USER, with
permissions 750, and is used for an sftp chroot directory (not writeable
by the user)
- the next-level /data/chroot_$USER/$USER directory should be owned by
$USER:users, with permissions 2770 (where users is our internal company
user group, so both the customer and our internal users can write here)
- we also add an ACL to /data/chroot_$USER to allow the company-internal
users group read/search access (but not write)
We just use openssh internal-sftp
to provide sftp access, with the following config:
Subsystem sftp internal-sftp
Match Group sftp
ChrootDirectory /data/chroot_%u
X11Forwarding no
AllowTcpForwarding no
ForceCommand internal-sftp -d /%u
So we chroot sftp connections to /data/chroot_$USER
and then (via the ForceCommand
)
chdir to /data/chroot_$USER/$USER
, so they start off in the writeable part of their
tree. (If they bother to pwd
, they see that they're in /$USER
, and they can chdir
up a level, but there's nothing else there except their $USER
directory, and they
can't write to the chroot.)
Here's a slightly simplified version of the newuser
script we use:
#!/bin/bash
die() {
echo $*
exit 1
}
test -n "$1" || die "usage: $(basename $0) <username>"
USERNAME=$1
# Create the user and home directories
mkdir -p /data/chroot_$USERNAME/$USERNAME
useradd --user-group -G sftp -d /data/chroot_$USERNAME/$USERNAME -s /sbin/nologin $USERNAME
# Set home directory permissions
chown root:$USERNAME /data/chroot_$USERNAME
chmod 750 /data/chroot_$USERNAME
setfacl -m group:users:rx /data/chroot_$USERNAME
chown $USERNAME:users /data/chroot_$USERNAME/$USERNAME
chmod 2770 /data/chroot_$USERNAME/$USERNAME
# Set user password manually
passwd $USERNAME
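A quick test from a client machine should then look something like this
(hostname and username are made up):
$ sftp testcustomer@files.example.com
sftp> pwd
Remote working directory: /testcustomer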
And we add an apache config file like the following to /etc/httpd/user.d
:
Alias /CUSTOMER /data/chroot_CUSTOMER/CUSTOMER
<Directory /data/chroot_CUSTOMER/CUSTOMER>
Options +Indexes
Include "conf/auth.conf"
Require user CUSTOMER
</Directory>
(with CUSTOMER
changed to the local username), and where conf/auth.conf
has
the authentication configuration against our local PAM users and allows internal
company users access.
So far so good, but how do we restrict customers to their own /CUSTOMER
tree?
That's pretty easy too - we just disallow customers from accessing our apache document
root, and redirect them to a magic '/user' endpoint using an ErrorDocument 403
directive:
<Directory /var/www/html>
Options +Indexes +FollowSymLinks
Include "conf/auth.conf"
# Any user not in auth.conf, redirect to /user
ErrorDocument 403 "/user"
</Directory>
with /user
defined as follows:
# Magic /user endpoint, redirecting to /$USERNAME
<Location /user>
Include "conf/auth.conf"
Require valid-user
RewriteEngine On
RewriteCond %{LA-U:REMOTE_USER} ^[a-z].*
RewriteRule ^\/(.*)$ /%{LA-U:REMOTE_USER}/ [R]
</Location>
The combination of these two says that any valid user NOT in auth.conf should
be redirected to their own /CUSTOMER
endpoint, so each customer user lands
there, and can't get anywhere else.
Works well, no additional software is required over vanilla apache and openssh,
and it still feels relatively simple, while meeting our security requirements.
Mon 27 Nov 2017
Tags: linux, mdadm, lvm
Ran out of space on an old CentOS 6.8 server in the weekend, and so had
to upgrade the main data mirror from a pair of Hitachi 2TB HDDs to a pair
of 4TB WD Reds I had lying around.
The volume was using mdadm
, aka Linux Software RAID, and is a simple mirror
(RAID1), with LVM
volumes on top of the mirror. The safest upgrade path is
to build a new mirror on the new disks and sync the data across, but there
weren't any free SATA ports on the motherboard, so instead I opted to do an
in-place upgrade. I haven't done this for a good few years, and hit a couple
of wrinkles on the way, so here are the notes from the journey.
Below, the physical disk partitions are /dev/sdb1
and /dev/sdd1
, the
mirror is /dev/md1
, and the LVM volume group is extra
.
1. Backup your data (or check you have known good rock-solid backups in
place), because this is a complex process with plenty that could go wrong.
You want an exit strategy.
2. Break the mirror, failing and removing one of the old disks
mdadm --manage /dev/md1 --fail /dev/sdb1
mdadm --manage /dev/md1 --remove /dev/sdb1
3. Shutdown the server, remove the disk you've just failed, and insert your
replacement. Boot up again.
4. Since these are brand new disks, we need to partition them. And since
these are 4TB disks, we need to use parted
rather than the older fdisk
.
parted /dev/sdb
print
mklabel gpt
# Create a partition, skipping the 1st MB at beginning and end
mkpart primary 1 -1
unit s
print
# Not sure if this flag is required, but whatever
set 1 raid on
quit
5. Then add the new partition back into the mirror. Although this is much
bigger, it will just sync up at the old size, which is what we want for now.
mdadm --manage /dev/md1 --add /dev/sdb1
# This will take a few hours to resync, so let's keep an eye on progress
watch -n5 cat /proc/mdstat
6. Once all resynched, rinse and repeat with the other disk - fail and remove
/dev/sdd1
, shutdown and swap the new disk in, boot up again, partition the new
disk, and add the new partition into the mirror.
7. Once all resynched again, you'll be back where you started - a nice stable
mirror of your original size, but with shiny new hardware underneath. Now we
can grow the mirror to take advantage of all this new space we've got.
mdadm --grow /dev/md1 --size=max
mdadm: component size of /dev/md1 has been set to 2147479552K
Ooops! That size doesn't look right, that's 2TB, but these are 4TB disks?!
Turns out there's a 2TB limit on mdadm
metadata version 0.90
, which this
mirror is using, as documented on https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#The_version-0.90_Superblock_Format.
mdadm --detail /dev/md1
/dev/md1:
Version : 0.90
Creation Time : Thu Aug 26 21:03:47 2010
Raid Level : raid1
Array Size : 2147483648 (2048.00 GiB 2199.02 GB)
Used Dev Size : -1
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Mon Nov 27 11:49:44 2017
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : f76c75fb:7506bc25:dab805d9:e8e5d879
Events : 0.1438
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 49 1 active sync /dev/sdd1
Unfortunately, mdadm
doesn't support upgrading the metadata version. But
there is a workaround documented on that wiki page, so let's try that:
mdadm --detail /dev/md1
# (as above)
# Stop/remove the mirror
mdadm --stop /dev/md1
mdadm: Cannot get exclusive access to /dev/md1:Perhaps a running process, mounted filesystem or active volume group?
# Okay, deactivate our volume group first
vgchange --activate n extra
# Retry stop
mdadm --stop /dev/md1
mdadm: stopped /dev/md1
# Recreate the mirror with 1.0 metadata (you can't go to 1.1 or 1.2, because they're located differently)
# Note that you should specify all your parameters in case the defaults have changed
mdadm --create /dev/md1 -l1 -n2 --metadata=1.0 --assume-clean --size=2147483648 /dev/sdb1 /dev/sdd1
That outputs:
mdadm: /dev/sdb1 appears to be part of a raid array:
level=raid1 devices=2 ctime=Thu Aug 26 21:03:47 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid1 devices=2 ctime=Thu Aug 26 21:03:47 2010
mdadm: largest drive (/dev/sdb1) exceeds size (2147483648K) by more than 1%
Continue creating array? y
mdadm: array /dev/md1 started.
Success! Now let's reactivate that volume group again:
vgchange --activate y extra
3 logical volume(s) in volume group "extra" now active
Another wrinkle is that recreating the mirror will have changed the array UUID,
so we need to update the old UUID in /etc/mdadm.conf
:
# Double-check metadata version, and record volume UUID
mdadm --detail /dev/md1
# Update the /dev/md1 entry UUID in /etc/mdadm.conf
$EDITOR /etc/mdadm.conf
So now, let's try that mdadm --grow
command again:
mdadm --grow /dev/md1 --size=max
mdadm: component size of /dev/md1 has been set to 3907016564K
# Much better! This will take a while to synch up again now:
watch -n5 cat /proc/mdstat
8. (You can wait for this to finish resynching first, but it's optional.)
Now we need to let LVM know that the physical volume underneath it has changed size:
# Check our starting point
pvdisplay /dev/md1
--- Physical volume ---
PV Name /dev/md1
VG Name extra
PV Size 1.82 TiB / not usable 14.50 MiB
Allocatable yes
PE Size 64.00 MiB
Total PE 29808
Free PE 1072
Allocated PE 28736
PV UUID mzLeMW-USCr-WmkC-552k-FqNk-96N0-bPh8ip
# Resize the LVM physical volume
pvresize /dev/md1
Read-only locking type set. Write locks are prohibited.
Can't get lock for system
Cannot process volume group system
Read-only locking type set. Write locks are prohibited.
Can't get lock for extra
Cannot process volume group extra
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm1
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_pool
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm2
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for system
Cannot process volume group system
Read-only locking type set. Write locks are prohibited.
Can't get lock for extra
Cannot process volume group extra
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm1
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_pool
Cannot process standalone physical volumes
Read-only locking type set. Write locks are prohibited.
Can't get lock for #orphans_lvm2
Cannot process standalone physical volumes
Failed to find physical volume "/dev/md1".
0 physical volume(s) resized / 0 physical volume(s) not resized
Oops - that doesn't look good. But it turns out it's just a weird
locking type default. If we tell pvresize
it can use local filesystem
write locks we should be good (cf. /etc/lvm/lvm.conf
):
# Let's try that again...
pvresize --config 'global {locking_type=1}' /dev/md1
Physical volume "/dev/md1" changed
1 physical volume(s) resized / 0 physical volume(s) not resized
# Double-check the PV Size
pvdisplay /dev/md1
--- Physical volume ---
PV Name /dev/md1
VG Name extra
PV Size 3.64 TiB / not usable 21.68 MiB
Allocatable yes
PE Size 64.00 MiB
Total PE 59616
Free PE 30880
Allocated PE 28736
PV UUID mzLeMW-USCr-WmkC-552k-FqNk-96N0-bPh8ip
Success!
Finally, you can now resize your logical volumes using lvresize
as you
usually would.
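For example, to grow a (hypothetical) logical volume called data in the extra
volume group by 1TB, resizing its filesystem at the same time:
lvresize --resizefs --size +1T /dev/extra/data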
Fri 08 Jul 2016
Tags: linux, sysadmin
Since I got bitten by this recently, let me blog a quick warning here:
glibc iconv
- a utility for character set conversions, like iso8859-1 or
windows-1252 to utf-8 - has a nasty misfeature/bug: if you give it data on
stdin it will slurp the entire file into memory before it does a single
character conversion.
Which is fine if you're running small input files. If you're trying to
convert a 10G file on a VPS with 2G of RAM, however ... not so good!
This looks to be a
known issue, with
patches submitted to fix it in August 2015, but I'm not sure if they've
been merged, or into which version of glibc. Certainly RHEL/CentOS 7 (with
glibc 2.17) and Ubuntu 14.04 (with glibc 2.19) are both affected.
Once you know about the issue, it's easy enough to workaround - there's an
iconv-chunks wrapper on github that
breaks the input into chunks before feeding it to iconv, or you can do much
the same thing using the lovely GNU parallel
e.g.
gunzip -c monster.csv.gz | parallel --pipe -k iconv -f windows-1252 -t utf8
Nasty OOM avoided!
Tue 06 Oct 2015
Tags: hardware, linux, rhel, centos
Wow, almost a year since the last post. Definitely time to reboot the blog.
Got to replace my aging ThinkPad X201 with a lovely shiny new
ThinkPad X250
over the weekend. Specs are:
- CPU: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
- RAM: 16GB PC3-12800 DDR3L SDRAM 1600MHz SODIMM
- Disk: 256GB SSD (swapped out for existing Samsung SSD)
- Display: 12.5" 1920x1080 IPS display, 400nit, non-touch
- Graphics: Intel Graphics 5500
- Wireless: Intel 7265 AC/B/G/N Dual Band Wireless and Bluetooth 4.0
- Batteries: 1 3-cell internal, 1 6-cell hot-swappable
A very nice piece of kit!
Just wanted to document what works and what doesn't (so far) on my standard OS,
CentOS 7, with RH kernel 3.10.0-229.11.1. I had to install the following
additional packages:
- iwl7265-firmware (for wireless support)
- acpid (for the media buttons)
Working so far:
- media buttons (Fn + F1/mute, F2/softer, F3/louder - see below for configuration)
- wifi button (Fn + F8 - worked out of the box)
- keyboard backlight (Fn + space, out of the box)
- sleep/resume (out of the box)
- touchpad hard buttons (see below)
- touchpad soft buttons (out of the box)
Not working / unconfigured so far:
- brightness buttons (Fn + F5/F6)
- fingerprint reader (supposedly works with
fprintd
)
Not working / no ACPI codes:
- mute microphone button (Fn + F4)
- application buttons (Fn + F9-F12)
Uncertain/not tested yet:
- switch video mode (Fn + F7)
To get the touchpad working I needed to use the "evdev" driver rather than the
"Synaptics" one - I added the following as /etc/X11/xorg.conf.d/90-evdev.conf
:
Section "InputClass"
Identifier "Touchpad/TrackPoint"
MatchProduct "PS/2 Synaptics TouchPad"
MatchDriver "evdev"
Option "EmulateWheel" "1"
Option "EmulateWheelButton" "2"
Option "Emulate3Buttons" "0"
Option "XAxisMapping" "6 7"
Option "YAxisMapping" "4 5"
EndSection
This gives me 3 working hard buttons above the touchpad, including middle-mouse-
button for paste.
To get fonts scaling properly I needed to add a monitor section as
/etc/X11/xorg.conf.d/50-monitor.conf
, specifically for the DisplaySize
:
Section "Monitor"
Identifier "Monitor0"
VendorName "Lenovo ThinkPad"
ModelName "X250"
DisplaySize 276 155
Option "DPMS"
EndSection
and also set the dpi properly in my ~/.Xdefaults
:
*.dpi: 177
This fixes font size nicely in Firefox/Chrome and terminals for me.
I also found my mouse movement was too slow, which I fixed with:
xinput set-prop 11 "Device Accel Constant Deceleration" 0.7
(which I put in my old-school ~/.xsession
file).
Finally, getting the media keys involved installing acpid and setting up
the appropriate magic in 3 files in /etc/acpi/events
:
# /etc/acpi/events/volumedown
event=button/volumedown
action=/etc/acpi/actions/volume.sh down
# /etc/acpi/events/volumeup
event=button/volumeup
action=/etc/acpi/actions/volume.sh up
# /etc/acpi/events/volumemute
event=button/mute
action=/etc/acpi/actions/volume.sh mute
Those files capture the ACPI events and handle them via a custom script in
/etc/acpi/actions/volume.sh
, which uses amixer
from alsa-utils
. Volume
control worked just fine, but muting was a real pain to get working correctly
due to what seems like a bug in amixer - amixer -c1 sset Master playback toggle
doesn't toggle correctly - it mutes fine, but then doesn't unmute all
the channels it mutes!
I worked around it by figuring out the specific channels that sset Master
was muting, and then handling them individually, but it's definitely not as clean:
#!/bin/sh
#
# /etc/acpi/actions/volume.sh (must be executable)
#
PATH=/usr/bin
die() {
echo $*
exit 1
}
usage() {
die "usage: $(basename $0) up|down|mute"
}
test -n "$1" || usage
ACTION=$1
shift
case $ACTION in
up)
amixer -q -c1 -M sset Master 5%+ unmute
;;
down)
amixer -q -c1 -M sset Master 5%- unmute
;;
mute)
# Ideally the next command should work, but it doesn't unmute correctly
# amixer -q -c1 sset Master playback toggle
# Manual version for ThinkPad X250 channels
# If adapting for another machine, 'amixer -C$DEV contents' is your friend (NOT 'scontents'!)
SOUND_IS_OFF=$(amixer -c1 cget iface=MIXER,name='Master Playback Switch' | grep 'values=off')
if [ -n "$SOUND_IS_OFF" ]; then
amixer -q -c1 cset iface=MIXER,name='Master Playback Switch' on
amixer -q -c1 cset iface=MIXER,name='Headphone Playback Switch' on
amixer -q -c1 cset iface=MIXER,name='Speaker Playback Switch' on
else
amixer -q -c1 cset iface=MIXER,name='Master Playback Switch' off
amixer -q -c1 cset iface=MIXER,name='Headphone Playback Switch' off
amixer -q -c1 cset iface=MIXER,name='Speaker Playback Switch' off
fi
;;
*)
usage
;;
esac
So in short, really pleased with the X250 so far - the screen is lovely, battery
life seems great, I'm enjoying the keyboard, and most things have Just
Worked or have been pretty easily configurable with CentOS. Happy camper!
References:
Sun 12 Oct 2014
Tags: web, urls, personal_cloud
I wrote a really simple personal URL shortener a couple of years ago, and
have been using it happily ever since. It's called shrtn
("shorten"), and is just a simple perl script that captures (or generates) a
mapping between a URL and a code, records in a simple text db, and then generates
a static html file that uses HTML meta-redirects to point your browser towards
the URL.
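The generated pages themselves are trivial meta-refresh stubs - something like
the following sketch (the code, paths, and URL here are all made up):
# e.g. the page for code 'abc', redirecting to some long URL
mkdir -p htdocs/abc
cat > htdocs/abc/index.html <<'EOF'
<meta http-equiv="refresh" content="0; url=https://example.com/some/long/url">
EOF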
It was originally based on posts from
Dave Winer
and Phil Windley,
but was interesting enough that I felt the itch to implement my own.
I just run it on my laptop (shrtn <url> [<code>]
), and it has settings to
commit the mapping to git and push it out to a remote repo (for backup),
and to push the generated html files up to a webserver somewhere (for
serving the html).
Most people seem to like the analytics side of personal URL shorteners
(seeing who clicks your links), but I don't really track that side of it
at all (it would be easy enough to add Google Analytics to your html
files to do that, or just doing some analysis on the access logs). I
mostly wanted it initially to post nice short links when microblogging,
where post length is an issue.
Surprisingly though, the most interesting use case in practice is the
ability to give custom mnemonic codes to URLs I use reasonably often, or
cite to other people a bit. If I find myself sharing a URL with more
than a couple of people, it's easier just to create a shortened version and
use that instead - it's simpler, easier to type, and easier to remember for
next time.
So my shortener has sort of become a cross between a Level 1 URL cache
and a poor man's bookmarking service. For instance:
If you don't have a personal url shortener you should give it a try - it's
a surprisingly interesting addition to one's personal cloud. And all you
need to try it out is a domain and some static webspace somewhere to host
your html files.
Too easy.
[ Technical Note: html-based meta-redirects work just fine with browsers,
including mobile and text-only ones. They don't work with most spiders and
bots, however, which may be a bug or a feature, depending on your usage. For a
personal url shortener meta-redirects probably work just fine, and you gain
all the performance and stability advantages of static html over dynamic
content. For a corporate url shortener where you want bots to be able to
follow your links, as well as people, you probably want to use http-level
redirects instead. In which case you either go with a hosted option, or look
at something like YOURLS for a slightly more heavyweight
self-hosted option. ]
Sun 21 Sep 2014
Tags: hardware, linux, scanning, rhel, centos, usb
Just picked up a shiny new Fujitsu ScanSnap 1300i ADF scanner to get
more serious about less paper.
I chose the 1300i on the basis of the nice small form factor, and that SANE
reports
it having 'good' support with current SANE backends. I'd also been able
to find success stories of other linux users getting the similar S1300
working okay:
Here's my experience getting the S1300i up and running on CentOS 6.
I had the following packages already installed on my CentOS 6
workstation, so I didn't need to install any new software:
- sane-backends
- sane-backends-libs
- sane-frontends
- xsane
- gscan2pdf (from rpmforge)
- gocr (from rpmforge)
- tesseract (from my repo)
I plugged the S1300i in (via the dual USB cables instead of the power
supply - nice!), turned it on (by opening the top cover) and then ran
sudo sane-find-scanner
. All good:
found USB scanner (vendor=0x04c5 [FUJITSU], product=0x128d [ScanSnap S1300i]) at libusb:001:013
# Your USB scanner was (probably) detected. It may or may not be supported by
# SANE. Try scanimage -L and read the backend's manpage.
Ran sudo scanimage -L
- no scanner found.
I downloaded the S1300 firmware Luuk had provided in his post and
installed it into /usr/share/sane/epjitsu
, and then updated
/etc/sane.d/epjitsu.conf
to reference it:
# Fujitsu S1300i
firmware /usr/share/sane/epjitsu/1300_0C26.nal
usb 0x04c5 0x128d
Ran sudo scanimage -L
- still no scanner found. Hmmm.
Rebooted into windows, downloaded the Fujitsu ScanSnap Manager package
and installed it. Grubbed around in C:/Windows and found the following 4
firmware packages:
Copied the firmware onto another box, and rebooted back into linux.
Copied the 4 new firmware files into /usr/share/sane/epjitsu
and
updated /etc/sane.d/epjitsu.conf
to try the 1300i firmware:
# Fujitsu S1300i
firmware /usr/share/sane/epjitsu/1300i_0D12.nal
usb 0x04c5 0x128d
Close and re-open the S1300i (i.e. restart, just to be sure), and
retried sudo scanimage -L
. And lo, this time the scanner whirrs
briefly and ... victory!
$ sudo scanimage -L
device 'epjitsu:libusb:001:014' is a FUJITSU ScanSnap S1300i scanner
I start gscan2pdf
to try some scanning goodness. Eeerk: "No devices
found". Hmmm. How about sudo gscan2pdf
? Ahah, success - "FUJITSU
ScanSnap S1300i" shows up in the Device dropdown.
I exit, and google how to deal with the permissions problem. Looks like
the usb device gets created by udev as root:root 0664, and I need 'rw'
permissions for scanning:
$ ls -l /dev/bus/usb/001/014
crw-rw-r--. 1 root root 189, 13 Sep 20 20:50 /dev/bus/usb/001/014
The fix is to add a scanner
group and local udev rule to use that
group when creating the device path:
# Add a scanner group (analogous to the existing lp, cdrom, tape, dialout groups)
$ sudo groupadd -r scanner
# Add myself to the scanner group
$ sudo usermod -aG scanner gavin
# Add a udev local rule for the S1300i
$ sudo vim /etc/udev/rules.d/99-local.rules
# Added:
# Fujitsu ScanSnap S1300i
ATTRS{idVendor}=="04c5", ATTRS{idProduct}=="128d", MODE="0664", GROUP="scanner", ENV{libsane_matched}="yes"
Then logout and log back in to pickup the change in groups, and close
and re-open the S1300i. If all is well, I'm now in the scanner group and
can control the scanner sans sudo:
# Check I'm in the scanner group now
$ id
uid=900(gavin) gid=100(users) groups=100(users),10(wheel),487(scanner)
# Check I can scanimage without sudo
$ scanimage -L
device 'epjitsu:libusb:001:016' is a FUJITSU ScanSnap S1300i scanner
# Confirm the permissions on the udev path (adjusted to match the new libusb path)
$ ls -l /dev/bus/usb/001/016
crw-rw-r--. 1 root scanner 189, 15 Sep 20 21:30 /dev/bus/usb/001/016
# Success!
Try gscan2pdf
again, and this time it works fine without sudo!
And so far gscan2pdf 1.2.5 seems to work pretty nicely. It handles both
simplex and duplex scans, and both the cleanup phase (using unpaper
)
and the OCR phase (with either gocr
or tesseract
) work without
problems. tesseract
seems to perform markedly better than gocr
so
far, as seems pretty typical.
So thus far I'm a pretty happy purchaser. On to a paperless
searchable future!
Sat 20 Sep 2014
Tags: perl, parallel
Did a talk at the Sydney Perl Mongers group on Tuesday night,
called "Parallelising with Perl", covering AnyEvent, MCE, and
GNU Parallel.
Slides
Wed 10 Sep 2014
Tags: finance, billing, web
You'd think that 20 years into the Web we'd have billing all sorted out.
(I've got in view here primarily bill/invoice delivery, rather than
payments, and consumer-focussed billing, rather than B2B invoicing).
We don't. Our bills are probably as likely to still come on paper as in
digital versions, and the current "e-billing" options all come with
significant limitations (at least here in Australia - I'd love to hear
about awesome implementations elsewhere!)
Here, for example, are a representative set of my current vendors, and
their billing delivery options (I'm not picking on anyone here, just
grounding the discussion in some specific examples).
So that all looks pretty reasonable, you might say. All your vendors have
some kind of e-billing option. What's the problem?
The current e-billing options
Here's how I'd rate the various options available:
email: email is IMO the best current option for bill delivery - it's
decentralised, lightweight, push-rather-than-pull, and relatively easy
to integrate/automate. Unfortunately, not everyone offers it, and sometimes
(e.g. Citibank) they insist on putting passwords on the documents they send
out via email on the grounds of 'security'. (On the other hand, emails
are notoriously easy to fake, so faking a bill email is a straightforward
attack vector if you can figure out customer-vendor relationships.)
(Note too that most of the non-email e-billing options still use email
for sending alerts about a new bill, they just don't also send the bill
through as an attachment.)
web (i.e. a company portal of some kind which you log into and can
then download your bill): this is efficient for the vendor, but pretty
inefficient for the customer - it requires going to the particular
website, logging in, and navigating to the correct location before you
can view or download your bill. So it's an inefficient, pull-based
solution, requiring yet another username/password, and with few
integration/automation options (and security issues if you try).
BillPayView
/ Australia Post Digital Mailbox:
for non-Australians, these are free (for consumers) solutions for
storing and paying bills offered by a consortium of banks
(BillPayView) and Australia Post (Digital Mailbox) respectively.
These provide a pretty decent user experience in that your bills are
centralised, and they can often parse the bill payment options and
make the payment process easy and less error-prone. On the other
hand, centralisation is a two-edged sword, as it makes it harder to
change providers (can you get your data out of these providers?);
it narrows your choices in terms of bill payment (or at least makes
certain kinds of payment options easier than others); and it's
basically still a web-based solution, requiring login and navigation,
and very difficult to automate or integrate elsewhere. I'm also
suspicious of 'free' services from corporates - clearly there is value
in driving you through their preferred payment solutions and/or in the
transaction data itself, or they wouldn't be offering it to you.
Also, why are there limited providers at all? There should be a
standard in place so that vendors don't have to integrate separately
with each provider, and so that customers have maximum choice in whom
they wish to deal with. Wins all-round.
And then there's the issue of formats. I'm not aware of any Australian
vendors that bill customers in any format except PDF - are there any?
PDFs are reasonable for human consumption, but billing should really be
done (instead of, or as well as) in a format meant for computer consumption,
so they can be parsed and processed reliably. This presumably means billing
in a standardised XML or JSON format of some kind (XBRL?).
How billing should work
Here's a strawman workflow for how I think billing should work:
the customer's profile with the vendor includes a billing delivery
URL, which is a vendor-specific location supplied by the customer to
which their bills are to be HTTP POST-ed. It should be an HTTPS URL to
secure the content during transmission, and the URL should be treated
by the vendor as sensitive, since its possession would allow someone
to post fake invoices to the customer
if the vendor supports more than one bill/invoice format, the customer
should be able to select the format they'd like
the vendor posts invoices to the customer's URL and gets back a URL
referencing the customer's record of that invoice. (The vendor might,
for instance, be able to query that record for status information, or
they might supply a webhook of their own to have status updates on the
invoice pushed back to them.)
the customer's billing system should check that the posted invoice has
the correct customer details (at least, for instance, the vendor/customer
account number), and ideally should also check the bill payment methods
against an authoritative set maintained by the vendor (this provides
protection against someone injecting a fake invoice into the system with
bogus bill payment details)
the customer's billing system is then responsible for facilitating the
bill payment manually or automatically at or before the due date, using
the customer's preferred payment method. This might involve billing
calendar feeds, global or per-vendor preferred payment methods, automatic
checks on invoice size against vendor history, etc.
all billing data (ideally fully parsed, categorised, and tagged) is then
available for further automation / integration e.g. personal financial
analytics, custom graphing, etc.
This kind of solution would give the customer full control over their
billing data, the ability to choose a billing provider that's separate from
(and more agile than) their vendors and banks, as well as significant
flexibility to integrate and automate further. It should also be pretty
straightforward on the vendor side - it just requires a standard HTTP POST
and provides immediate feedback to the vendor on success or failure.
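As a sketch, the vendor side could be as simple as something like this (the
URL, account id, and filename are all hypothetical):
curl -sS -X POST "https://billing.example.com/inbox/ACCT-12345" \
  -H 'Content-Type: application/json' \
  -d @invoice-2014-09.json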
Why doesn't this exist already - it doesn't seem hard?
Tue 02 Sep 2014
Tags: perl, csv, data wrangling
Well past time to get back on the blogging horse.
I'm now working on a big data web mining startup,
and spending an inordinate amount of time buried in large data files, often
some variant of CSV.
My favourite new tool over the last few months is Karlheinz Zoechling's
App::CCSV
perl module, which lets you do some really powerful CSV processing using
perl one-liners, instead of having to write a trivial/throwaway script.
If you're familiar with perl's standard autosplit functionality (perl -a
)
then App::CCSV will look pretty similar - it autosplits its input into an
array on your CSV delimiters for further processing. It handles
embedded delimiters and CSV quoting conventions correctly, though, which
perl's standard autosplitting doesn't.
App::CCSV uses @f
to hold the autosplit fields, and provides utility
functions csay
and cprint
for doing say
and print
on the CSV-joins
of your array. So for example:
# Print just the first 3 fields of your file
perl -MApp::CCSV -ne 'csay @f[0..2]' < file.csv
# Print only lines where the second field is 'Y' or 'T'
perl -MApp::CCSV -ne 'csay @f if $f[1] =~ /^[YT]$/' < file.csv
# Print the CSV header and all lines where field 3 is negative
perl -MApp::CCSV -ne 'csay @f if $. == 1 || ($f[2]||0) < 0' < file.csv
# Insert a new country code field after the first field
perl -MApp::CCSV -ne '$cc = get_country_code($f[0]); csay $f[0],$cc,@f[1..$#f]' < file.csv
App::CCSV can use a config file to handle different kinds of CSV input.
Here's what I'm using, which lives in my home directory in ~/.CCSVConf
:
<CCSV>
sep_char ,
quote_char """
<names>
<comma>
sep_char ","
quote_char """
</comma>
<tabs>
sep_char " "
quote_char """
</tabs>
<pipe>
sep_char "|"
quote_char """
</pipe>
<commanq>
sep_char ","
quote_char ""
</commanq>
<tabsnq>
sep_char " "
quote_char ""
</tabsnq>
<pipenq>
sep_char "|"
quote_char ""
</pipenq>
</names>
</CCSV>
That just defines two sets of names for different kinds of input: comma
,
tabs
, and pipe
for [,\t|]
delimiters with standard CSV quote conventions;
and three nq
("no-quote") variants - commanq
, tabsnq
, and pipenq
- to
handle inputs that aren't using standard CSV quoting. It also makes the comma
behaviour the default.
You use one of the names by specifying it when loading the module, after an =
:
perl -MApp::CCSV=comma ...
perl -MApp::CCSV=tabs ...
perl -MApp::CCSV=pipe ...
You can also convert between formats by specifying two names, in
<input>,<output> format e.g.
perl -MApp::CCSV=comma,pipe ...
perl -MApp::CCSV=tabs,comma ...
perl -MApp::CCSV=pipe,tabs ...
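So, for example, converting a standard CSV file to pipe-separated should look
something like this (filenames illustrative):
perl -MApp::CCSV=comma,pipe -ne 'csay @f' < file.csv > file.psv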
And just to round things off, I have a few aliases defined in my bashrc
file
to make these even easier to use:
alias perlcsv='perl -CSAD -MApp::CCSV'
alias perlpsv='perl -CSAD -MApp::CCSV=pipe'
alias perltsv='perl -CSAD -MApp::CCSV=tabs'
alias perlcsvnq='perl -CSAD -MApp::CCSV=commanq'
alias perlpsvnq='perl -CSAD -MApp::CCSV=pipenq'
alias perltsvnq='perl -CSAD -MApp::CCSV=tabsnq'
That simplifies my standard invocation to something like:
perlcsv -ne 'csay @f[0..2]' < file.csv
Happy data wrangling!
Mon 20 May 2013
Tags: dell, drac, linux, sysadmin
Note to self: this seems to be the most reliable way of checking whether
a Dell machine has a DRAC card installed:
sudo ipmitool sdr elist mcloc
If there is, you'll see some kind of DRAC card:
iDRAC6 | 00h | ok | 7.1 | Dynamic MC @ 20h
If there isn't, you'll see only a base management controller:
BMC | 00h | ok | 7.1 | Dynamic MC @ 20h
You need ipmi setup for this (if you haven't already):
# on RHEL/CentOS etc.
yum install OpenIPMI
service ipmi start
Fri 22 Mar 2013
Tags: text, linux
This has bitten me a couple of times now, and each time I've had to
re-google the utility and figure out the appropriate incantation. So
note to self: to subtract text files use comm(1)
.
Input files have to be sorted, but comm
accepts a -
argument for
stdin, so you can sort on the fly if you like.
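For example, the FILE1 - FILE2 case with unsorted inputs (using bash process
substitution for the first file, and - for the second):
sort two.txt | comm -23 <(sort one.txt) -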
I also find the -1 -2 -3
options pretty counter-intuitive, as they
indicate what you want to suppress, when I seem to want to indicate
what I want to select. But whatever.
Here's the cheatsheet:
FILE1=one.txt
FILE2=two.txt
# FILE1 - FILE2 (lines unique to FILE1)
comm -23 $FILE1 $FILE2
# FILE2 - FILE1 (lines unique to FILE2)
comm -13 $FILE1 $FILE2
# intersection (common lines)
comm -12 $FILE1 $FILE2
# xor (non-common lines, either FILE)
comm -3 $FILE1 $FILE2
# or without the column delimiters:
comm -3 --output-delimiter=' ' $FILE1 $FILE2 | sed 's/^ *//'
# union (all lines)
comm $FILE1 $FILE2
# or without the column delimiters:
comm --output-delimiter=' ' $FILE1 $FILE2 | sed 's/^ *//'