Hive’s LOAD DATA fails to import many files with exception in org.apache.hadoop.hive.ql.exec.CopyTask

I came across an interesting issue recently while loading a large set of files from Localytics into Hive through Hive’s command line interface. The script that loads the data contained roughly 60k LOAD DATA statements, each of which Hive was supposed to execute to load one file into a table. This all ran smoothly on Elastic MapReduce until a seemingly random exception caused it to fail:

Failed with exception null
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.CopyTask

After some investigation I noticed that the number of open files was growing rapidly during the process until it hit the limit, which apparently was the root cause of the exception.

Some more googling around brought me to this unresolved bug report:

I guess this is not the most critical issue for the Hive folks, but it is still not a pleasant one. My workaround was to split the .q files into chunks of no more than 28000 statements and iterate over them.
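
The splitting step can be sketched in Python roughly as follows. This is only an illustration, not the script I actually used: it assumes one LOAD DATA statement per line, and the function names and the .partNNN.q naming scheme are made up for the example.

```python
def chunk_statements(statements, chunk_size=28000):
    """Yield successive lists of at most chunk_size statements."""
    for start in range(0, len(statements), chunk_size):
        yield statements[start:start + chunk_size]


def write_chunks(script_path, chunk_size=28000):
    """Split script_path into numbered .q files; return the new file names.

    Assumes one statement per line (hypothetical layout of the input file).
    """
    with open(script_path) as src:
        statements = [line for line in src if line.strip()]
    base = script_path[:-2] if script_path.endswith(".q") else script_path
    names = []
    for i, chunk in enumerate(chunk_statements(statements, chunk_size)):
        name = "%s.part%03d.q" % (base, i)
        with open(name, "w") as out:
            out.writelines(chunk)
        names.append(name)
    return names
```

Each resulting file can then be run on its own with hive -f, so that no single Hive session has to work through all 60k statements.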

Simple chrooted FTP setup on EC2 micro instance

Source environment: Ubuntu

1. Install vsftpd
apt-get install vsftpd

2. Edit default config at /etc/vsftpd.conf

Make sure you enable these:


# (default follows)

Ensure this is disabled:


and add the following to the end:


The max and min ports can be anything high enough not to overlap with other services. Those ports will also need to be open in your security group if you’re using EC2.

3. Create/edit /etc/vsftpd.chroot_list
Add usernames that you don’t want to chroot.

4. Create users for FTP access:

adduser USERNAME

5. Ensure the home folder of the user is not writable(!) This is required since vsftpd 2.3.5, I believe.

chmod a-w /home/USERNAME

6. Create folders under /home/USERNAME for the user to upload to, since the user won’t be able to upload to the root of /home/USERNAME.
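
To double-check step 5, a small Python sketch like the one below can confirm that no write bits are left on a home folder. The helper name is hypothetical, and it only looks at permission bits; it does not reproduce whatever check vsftpd itself performs.

```python
import os
import stat


def has_write_bits(path):
    """Return True if any of the user/group/other write bits are set on path."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

After chmod a-w /home/USERNAME, has_write_bits("/home/USERNAME") should return False, while the upload folders created in step 6 should still return True.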

Ubuntu 11.10 or 12.04 fails to boot after upgrade due to a degraded or failed software RAID

Since I first started using Ubuntu back in ’09 with 9.04, I have had issues with my software RAID array roughly every other time I tried to upgrade to a newer version. Almost every time the issue lies in GRUB not being able to install or update itself properly, so I end up doing that manually from the rescue disk, a process I have unintentionally learned by heart.

This time around, when I upgraded from 11.04 to 11.10, it was a different issue. The system failed to boot and dropped into initramfs/BusyBox after failing to assemble one of the software RAID arrays. Apparently an update introduced in 11.10 (I believe) prevents the system from booting if there is any software RAID array that it cannot assemble fully. This can be an issue if, for example, your drives got mixed up or, as in my case, an older RAID array was still defined that had never been properly removed but was always deactivated.

There is a pretty long, yet interesting conversation here on this matter:

The way to solve this for me was to hit Ctrl-D when it dropped into initramfs/BusyBox, select ‘root shell’ and fix the issue: properly deactivate the array I didn’t need and fix my working RAID array, which had become degraded and needed to rebuild.
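
Once the system is back up, it is easy to check for degraded arrays before the next upgrade. The sketch below parses text in the /proc/mdstat format and lists arrays whose member map (for example [U_]) shows a missing device. The function name and the sample text are made up for illustration; inactive arrays have no status line and are not reported by it.

```python
import re

ARRAY_RE = re.compile(r"^(md\d+)\s*:")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return names of arrays whose member map contains a missing device."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if current and status and "_" in status.group(3):
            degraded.append(current)
    return degraded


# Illustrative sample only, not output from my machine.
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      976630336 blocks [2/2] [UU]

md0 : active raid1 sda1[0]
      1048512 blocks [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # prints ['md0']
```

On a real system the same function could be fed open("/proc/mdstat").read().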

Oh, well… The Ubuntu upgrade process is still not there.

WordPress MU Upgrade and Permalinks 404 Errors

The WordPress MU upgrade (from 2.x to 3.2.1) was, surprisingly, a rather simple process! Having completed it in a couple of hours for a fairly large blogging network, I was a happy camper up until the moment permalinks started giving 404s.

What followed was a painstaking process in which I verified every single aspect of the configuration, from Apache’s mod_rewrite setup to the .htaccess rules to WordPress’s network site configs. Everything looked correct.

After googling around for a good hour I came across this site, which pointed to the incompatibility of some plugins with WP3. In my case the problem lay in a plugin called Top Level Categories, which I had to disable to get the permalinks working.

How to install Xapian 1.2.5 PHP bindings on Ubuntu Lucid Lynx

Starting from version 1.2.x, the Xapian repository for Ubuntu does not contain the php5-xapian package :( apparently due to a license incompatibility between the GPL and the PHP license (great…).
The issue is discussed at some length here.

In the meantime, folks suggest building the PHP bindings for Xapian manually on Ubuntu and Debian. Here is a quick command trail that shows how to install the Xapian 1.2.5 PHP bindings on Ubuntu Lucid (10.04); it was also tested on Ubuntu 10.10 and 11.04:

1. Edit /etc/apt/sources.list and add the following lines to it:

deb lucid main
deb-src lucid main

2. Get some required packages:

sudo apt-get update
sudo apt-get build-dep xapian-bindings
sudo apt-get install php5-dev php5-cli
sudo apt-get install devscripts

3. Fetch sources and build:

apt-get source xapian-bindings
cd xapian-bindings-1.2.5
rm debian/control
env PHP_VERSIONS=5 debian/rules maint
debuild -e PHP_VERSIONS=5 -us -uc

This will generate a .deb file in the folder one level up.

4. Finally, install the php5-xapian extension:

cd ..
sudo dpkg -i php5-xapian*.deb

5. Verify that you got it running:

php -i | grep Xapian

Information about this process is taken from here and here.

Auto-splitting video file in equal chunks with ffmpeg and python

UPDATE: The source code for this simple script has been moved to GitHub, get it here:

Recently I needed to upload a whole bunch of long video files. The maximum allowed length for each video was just a few minutes, while the files I was trying to upload were about an hour each. FFmpeg is really great for splitting video files, and Python is quite handy for automating the task, so I combined the two in the handy little script below. It takes a video file and a chunk size in seconds and splits the file into chunks using ffmpeg, so that each chunk is a self-contained, playable video.

Source code:

#!/usr/bin/env python

import subprocess
import re
import math
from optparse import OptionParser

# Matches the "Duration: HH:MM:SS.xx," line that ffmpeg prints for its input.
length_regexp = r'Duration: (\d{2}):(\d{2}):(\d{2})\.\d+,'
re_length = re.compile(length_regexp)

def main():

    (filename, split_length) = parse_options()
    if split_length <= 0:
        print("Split length must be greater than 0")
        raise SystemExit

    # ffmpeg writes the file info to stderr, hence the 2>&1 redirect.
    output = subprocess.Popen("ffmpeg -i '"+filename+"' 2>&1 | grep 'Duration'",
                            shell = True,
                            stdout = subprocess.PIPE,
                            universal_newlines = True
                            ).stdout.read()
    print(output)
    matches = re_length.search(output)
    if matches:
        video_length = int(matches.group(1)) * 3600 + \
                        int(matches.group(2)) * 60 + \
                        int(matches.group(3))
        print("Video length in seconds: "+str(video_length))
    else:
        print("Can't determine video length.")
        raise SystemExit

    # math.ceil may return a float, but range() needs an int.
    split_count = int(math.ceil(video_length/float(split_length)))
    if(split_count == 1):
        print("Video length is less than the target split length.")
        raise SystemExit

    split_cmd = "ffmpeg -i '"+filename+"' -vcodec copy "
    for n in range(0, split_count):
        split_str = ""
        if n == 0:
            split_start = 0
        else:
            split_start = split_length * n
        # Output name assumes a three-letter extension, e.g. sample-0.avi.
        split_str += " -ss "+str(split_start)+" -t "+str(split_length) + \
                    " '"+filename[:-4] + "-" + str(n) + "." + filename[-3:] + \
                    "'"
        print("About to run: "+split_cmd+split_str)
        output = subprocess.Popen(split_cmd+split_str, shell = True, stdout =
                               subprocess.PIPE).stdout.read()

def parse_options():
    parser = OptionParser()
    parser.add_option("-f", "--file",
                        dest = "filename",
                        help = "file to split, for example sample.avi",
                        type = "string",
                        action = "store"
                        )
    parser.add_option("-s", "--split-size",
                        dest = "split_size",
                        help = "split or chunk size in seconds, for example 10",
                        type = "int",
                        action = "store"
                        )
    (options, args) = parser.parse_args()
    if options.filename and options.split_size:

        return (options.filename, options.split_size)

    else:
        parser.print_help()
        raise SystemExit

if __name__ == '__main__':

    try:
        main()
    except Exception as e:
        print("Exception occurred running main():")
        print(str(e))

Or download it here: splitting video file script link
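
One variation worth considering: the script above builds shell command strings, which breaks on file names containing a single quote and assumes a three-letter extension. The sketch below is an alternative I have not folded into the script. It only builds the ffmpeg argument lists, using the same -ss, -t and -vcodec copy options, and leaves running them to subprocess.call. The function name is made up for the example, and the video length is passed in rather than detected.

```python
import math
import os


def build_split_commands(filename, video_length, split_length):
    """Return one ffmpeg argument list per chunk (nothing is executed)."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    # splitext copes with extensions of any length (.mp4, .webm, ...).
    base, ext = os.path.splitext(filename)
    split_count = int(math.ceil(video_length / float(split_length)))
    commands = []
    for n in range(split_count):
        commands.append([
            "ffmpeg", "-i", filename,
            "-vcodec", "copy",
            "-ss", str(split_length * n),
            "-t", str(split_length),
            "%s-%d%s" % (base, n, ext),
        ])
    return commands
```

Each list can be passed straight to subprocess.call(cmd) without shell = True, so no quoting of the file name is needed.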