<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="en-IE" xmlns="http://www.w3.org/2005/Atom">
  <title>thickpaddy.com - Home</title>
  <id>tag:www.thickpaddy.com,2012:mephisto/</id>
  <generator uri="http://mephistoblog.com" version="0.8.0">Mephisto Drax</generator>
  <link href="http://www.thickpaddy.com/feed/atom.xml" rel="self" type="application/atom+xml"/>
  <link href="http://www.thickpaddy.com/" rel="alternate" type="text/html"/>
  <updated>2010-11-17T07:51:45Z</updated>
  <entry xml:base="http://www.thickpaddy.com/">
    <author>
      <name>thickpaddy</name>
    </author>
    <id>tag:www.thickpaddy.com,2010-06-25:10</id>
    <published>2010-06-25T16:57:00Z</published>
    <updated>2010-11-17T07:51:45Z</updated>
    <category term="mysql"/>
    <category term="rails"/>
    <link href="http://www.thickpaddy.com/2010/6/25/mysql-big-table-migration-helper" rel="alternate" type="text/html"/>
    <title>MySQL big table migration helper</title>
<summary type="html">With MySQL 5, altering large tables, including adding or dropping indexes, can be painfully slow. In order to alter the table, MySQL normally creates a temporary table with the new structure and copies rows into it, one-by-one, updating the indexes as it goes. Searches for “mysql slow index creation”, “mysql copy to tmp table” and “mysql alter table slow” will reveal all the gory details, but I'm going to concentrate on a workaround for Ruby on Rails applications.</summary><content type="html">
&lt;p&gt;&lt;em&gt;Edit: this helper has now evolved into a plugin. See &lt;a href=&quot;https://github.com/thickpaddy/mysql_big_table_migration&quot;&gt;https://github.com/thickpaddy/mysql_big_table_migration&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With MySQL 5, altering large tables, including adding or dropping indexes, can be painfully slow. In order to alter the table, MySQL normally creates a temporary table with the new structure and copies rows into it, one-by-one, updating the indexes as it goes. Searches for “mysql slow index creation”, “mysql copy to tmp table” and “mysql alter table slow” will reveal all the gory details, but I'm going to concentrate on a workaround for Ruby on Rails applications.&lt;/p&gt;

&lt;p&gt;The problem with slow index creation on large tables hit me very hard when I started to review one of our production databases with a view to improving performance, primarily by adding appropriate indexes where missing, and dropping some unused indexes too. I needed to add an index to one particular table that had in excess of 50 million rows. Creating a new index on this table resulted in the system seeming to hang in state “copy to tmp table”. SHOW PROCESSLIST had one thread in this state overnight, until I eventually gave up and killed it.&lt;/p&gt;

&lt;p&gt;After some research, it seemed that the best way to work around this (that is, something that would work across various storage engines and server configurations) was to create the new table manually and copy the data into it in bulk rather than row-by-row. I wrote a one-off migration to test this and it worked well.&lt;/p&gt;
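
&lt;p&gt;As a rough sketch of that idea (not the helper's actual code), the Ruby below just builds the SQL statements involved. The table name, the index definition and the tmp_new_/tmp_old_ prefixes are all made up for illustration.&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
```ruby
# Sketch only: builds the SQL for the "copy in bulk, then swap" approach.
# The tmp_new_/tmp_old_ prefixes are hypothetical, not taken from the helper.
def bulk_alter_statements(table, alteration)
  new_table = "tmp_new_#{table}"
  old_table = "tmp_old_#{table}"
  [
    # Empty copy of the table, with the same columns and indexes.
    "CREATE TABLE #{new_table} LIKE #{table}",
    # Quick, because the new table has no rows yet.
    "ALTER TABLE #{new_table} #{alteration}",
    # One bulk copy instead of MySQL's own row-by-row copy to a tmp table.
    "INSERT INTO #{new_table} SELECT * FROM #{table}",
    # Swap the two tables in a single statement.
    "RENAME TABLE #{table} TO #{old_table}, #{new_table} TO #{table}"
  ]
end

statements = bulk_alter_statements("events", "ADD INDEX index_events_on_user_id (user_id)")
puts statements
```
&lt;/pre&gt;

&lt;p&gt;Note that INSERT ... SELECT * only lines up when the column list is unchanged, so a sketch like this would need explicit column names when adding or removing a column.&lt;/p&gt;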

&lt;p&gt;It wasn't long before I needed to add indexes on other large tables, so I ended up creating a migration helper for altering large tables, allowing indexes to be created and dropped, and columns added and removed, in minutes rather than hours (or days in some cases).&lt;/p&gt;



&lt;p&gt;I saved the code in a file called mysql_big_table_migration_helper.rb and dumped it directly into the lib directory. To use it from a migration, extend your migration with the helper module. This adds 4 new class methods, for adding and removing indexes and columns by bulk copying into a temp table rather than relying on MySQL's row-by-row copying. I did toy with the idea of creating a plugin that does all of this automatically, but I'm wary of that kind of voodoo, and decided it was wiser (for the moment at least) to have developers make their intentions explicit.&lt;/p&gt;

&lt;p&gt;An example migration that adds and removes indexes...&lt;/p&gt;



&lt;p&gt;Some notes/warnings...&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The migration helper only works with the standard MySQL adapter (i.e. not MySQL2).&lt;/li&gt;
&lt;li&gt;If you use another adapter, it will still add and remove indexes and columns, but without creating and bulk copying into the temp table. So, it won't break if, for example, you use the MySQL adapter in production, but SQLite in development.&lt;/li&gt;
&lt;li&gt;Although the code does try to handle updates that occur after the migration has started, it just locks tables at the end of the bulk copy and checks for new or updated rows using the id, or updated_at, column. This is absolutely not the same as an ACID compliant transaction! It does the job for us, but it may not be suitable for you.&lt;/li&gt;
&lt;/ul&gt;
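
&lt;p&gt;To give a feel for the catch-up step described in the last note, here is a hedged sketch of the kind of statements it implies. It assumes an auto-incrementing id column and an updated_at column, the method and argument names are hypothetical, and it is not the helper's own code. It also does nothing about rows deleted while the copy was running.&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
```ruby
# Sketch only: statements for picking up rows that changed during the bulk copy.
# Assumes an auto-incrementing id column and an updated_at column; the
# method and argument names are hypothetical.
def catch_up_statements(table, new_table, max_copied_id, copy_started_at)
  [
    # Rows inserted after the bulk copy was taken.
    "INSERT INTO #{new_table} SELECT * FROM #{table} WHERE id > #{Integer(max_copied_id)}",
    # Rows updated since the copy started; REPLACE overwrites the stale copies.
    "REPLACE INTO #{new_table} SELECT * FROM #{table} WHERE updated_at >= '#{copy_started_at}'"
  ]
end

catch_up = catch_up_statements("events", "tmp_new_events", 5000, "2010-06-25 16:57:00")
puts catch_up
```
&lt;/pre&gt;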
          </content>  </entry>
  <entry xml:base="http://www.thickpaddy.com/">
    <author>
      <name>thickpaddy</name>
    </author>
    <id>tag:www.thickpaddy.com,2010-05-14:9</id>
    <published>2010-05-14T17:14:00Z</published>
    <updated>2011-04-06T16:53:02Z</updated>
    <category term="capistrano"/>
    <category term="lighthouse"/>
    <category term="rails"/>
    <link href="http://www.thickpaddy.com/2010/5/14/capistrano-lighthouse-integration" rel="alternate" type="text/html"/>
    <title>Capistrano Lighthouse integration</title>
<summary type="html">In my previous post I explained how I added some git hooks to ensure that we have lighthouse ticket info added to our git commit and merge messages. This means that when we push to github, the lighthouse service hook updates the tickets in lighthouse and anyone watching the ticket gets updated. This is only half the story though: we also update the tickets when code is deployed to either the staging server or the production system. Doing this manually is a pain in the behind, so I took the time to automate it...</summary><content type="html">
&lt;p&gt;In my previous post I explained how I added some git hooks to ensure that we have lighthouse ticket info added to our git commit and merge messages. This means that when we push to github, the lighthouse service hook updates the tickets in lighthouse and anyone watching the ticket gets updated. This is only half the story though: we also update the tickets when code is deployed to either the staging server or the production system. Doing this manually is a pain in the behind, so I took the time to automate it...&lt;/p&gt;

&lt;p&gt;All I've done is create a script that can be executed from a capistrano deploy script. Capistrano already knows the SHAs of the previous and current revisions; my script just takes these as a commit range, looks for lighthouse ticket references in commit and merge messages, and then uses the lighthouse api to update these tickets. When deploying to the staging server, it changes the state to 'uat' and re-assigns the ticket to the creator. When deploying to production, it just changes the state to 'resolved'. It also prints out a summary and emails it, with links to the tickets, to a configured email address.&lt;/p&gt;
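
&lt;p&gt;The "looks for lighthouse ticket references" step might look something like the sketch below. It assumes references of the form [#123 state:coding-done] (or a bare [#123]), and the method name is made up. In a real script the messages would presumably come from something like git log over the commit range rather than a hard-coded array.&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
```ruby
# Sketch only: collect the Lighthouse ticket numbers mentioned in commit messages.
# Assumes references look like "[#123 state:coding-done]" (or a bare "[#123]").
def ticket_numbers(commit_messages)
  commit_messages.map { |message| message.scan(/\[#(\d+)[^\]]*\]/) }
                 .flatten
                 .map { |number| number.to_i }
                 .uniq
                 .sort
end

messages = [
  "Fix permalink bug [#7 state:coding-done]",
  "Merge branch 'LH012-atom-feed' [#12 state:coding-done]",
  "Tidy up whitespace",
  "Follow-up for permalinks [#7]"
]
p ticket_numbers(messages)
```
&lt;/pre&gt;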

&lt;p&gt;Our capistrano deploy script includes a task that calls the lighthouse integration script, providing the environment and commit range as arguments. This is the very last task that gets executed as part of our deploy.&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
  task :lighthouse do
    run &quot;cd #{release_path}; ./script/deploy_lighthouse #{rails_env} #{previous_revision}..#{current_revision}&quot;, :once =&gt; true
  end
&lt;/pre&gt;

&lt;p&gt;Combined with the git hooks mentioned in the previous post, this pretty much automates all updates to our ticketing system. No more manual updating, more time to code, or go to the pub, which I'm going to do in 5 minutes, it being Friday 'n' all.&lt;/p&gt;

&lt;p&gt;Here's the code anyway, I hope someone else finds it useful - it's good to share :-)&lt;/p&gt;


          </content>  </entry>
  <entry xml:base="http://www.thickpaddy.com/">
    <author>
      <name>thickpaddy</name>
    </author>
    <id>tag:www.thickpaddy.com,2010-04-30:8</id>
    <published>2010-04-30T20:21:00Z</published>
    <updated>2011-03-30T11:20:51Z</updated>
    <category term="git"/>
    <category term="github"/>
    <category term="lighthouse"/>
    <category term="rails"/>
    <link href="http://www.thickpaddy.com/2010/4/30/commit-and-merge-hooks-for-github-and-lighthouse" rel="alternate" type="text/html"/>
    <title>Commit and merge hooks for Github and Lighthouse</title>
<summary type="html">Github includes a service hook that lets commit messages containing Lighthouse ticket info trigger an update to the relevant tickets when you push changes. We use Lighthouse to keep track of bugs and feature requests, and were adding the ticket info to commit messages manually, but found ourselves both forgetting to do it, and making stupid typos. So, to make up for our incompetence, I automated the addition of ticket numbers and state to the commit messages using a couple of githooks…</summary><content type="html">
&lt;p&gt;Github includes a service hook that lets commit messages containing Lighthouse ticket info trigger an update to the relevant tickets when you push changes. We use Lighthouse to keep track of bugs and feature requests, and were adding the ticket info to commit messages manually, but found ourselves both forgetting to do it, and making stupid typos. So, to make up for our incompetence, I automated the addition of ticket numbers and state to the commit messages using a couple of githooks…&lt;/p&gt;

&lt;p&gt;Our code always goes to a staging environment for user acceptance testing (UAT) before being deployed to production. We settled on a custom state, coding-done, to indicate that coding has been completed, and committed, or merged, to our master branch, but has not yet been deployed to the staging server. &lt;/p&gt;

&lt;p&gt;We also use topic/feature branches when working on anything substantial, and have started including a Lighthouse ticket number in the branch name, e.g. &quot;LH007-permalink-from-lighthouse-ticket&quot;, and we always use the --no-ff option when merging to master, so we always generate a merge commit.&lt;/p&gt; 

&lt;p&gt;The hooks encourage use of the branch naming convention, and inclusion of a ticket number in commits directly to master (a number of formats are supported). They also set the state to coding-done.&lt;/p&gt;

&lt;p&gt;It's quite likely that you don't follow our conventions, and don't intend to start doing so any time soon, but the code may still be a useful starting point. My starting point was &lt;a href=&quot;http://www.robbyonrails.com/articles/2009/02/16/git-commit-msg-for-lighthouse-tickets&quot;&gt;Robby Russell's commit-msg hook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;My commit-msg hook looks for a Lighthouse ticket number in commits directly to master and both adds the new state, coding-done, and makes sure the ticket number is in a format that will be parsed by the Github service hook. For commits to topic/feature branches, it strongly encourages developers to stick with the branch naming convention.&lt;/p&gt;
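
&lt;p&gt;A minimal sketch of the message rewriting, assuming the accepted input formats are LH7, #7 and [#7] (my guess at "a number of formats", not a list taken from the hook itself). Git passes a commit-msg hook the path of the message file as its first argument, so a real hook would read and rewrite that file; the method here only deals with the string.&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
```ruby
# Sketch only: rewrite a commit message so it ends with a Lighthouse reference.
# The accepted input formats ("LH7", "#7", "[#7]") are assumptions.
def tag_with_ticket(message, state = "coding-done")
  match = message.match(/(?:LH|#)0*(\d+)/i)
  return message if match.nil?

  # Drop any existing bracketed reference before appending the normalised one.
  cleaned = message.sub(/\s*\[#\d+[^\]]*\]/, "").strip
  "#{cleaned} [##{match[1]} state:#{state}]"
end

puts tag_with_ticket("Fix typo in permalink helper LH007")
puts tag_with_ticket("Fix typo in permalink helper [#7]")
```
&lt;/pre&gt;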



&lt;p&gt;The post-merge hook looks for the ticket number in the branch name, and then amends the most recent &quot;merge branch&quot; commit message to add the ticket number and state in the format expected by Github. This is kinda nasty, but I couldn't find a better way, and it does seem to work well.&lt;/p&gt;
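
&lt;p&gt;The branch name half of that is easy enough to sketch. The pattern below is an assumption based on the LH007-... example above, and the method names are hypothetical. A real hook would then have to amend the merge commit with the new message, which is the nasty part.&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
```ruby
# Sketch only: pull the ticket number out of a topic branch name such as
# "LH007-permalink-from-lighthouse-ticket". The exact pattern is an assumption.
def ticket_from_branch(branch_name)
  match = branch_name.match(/\ALH0*(\d+)/i)
  match.nil? ? nil : match[1].to_i
end

# Append the ticket number and state to the merge commit message.
def amended_merge_message(original_message, branch_name, state = "coding-done")
  ticket = ticket_from_branch(branch_name)
  return original_message if ticket.nil?

  "#{original_message.strip} [##{ticket} state:#{state}]"
end

branch = "LH007-permalink-from-lighthouse-ticket"
puts amended_merge_message("Merge branch '#{branch}'", branch)
```
&lt;/pre&gt;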



&lt;p&gt;To use the hooks as they are, save them as commit-msg and post-merge respectively and dump them into your .git/hooks directory. We actually added them to our repository and added symlinks from .git/hooks.&lt;/p&gt;
          </content>  </entry>
  <entry xml:base="http://www.thickpaddy.com/">
    <author>
      <name>thickpaddy</name>
    </author>
    <id>tag:www.thickpaddy.com,2010-03-10:7</id>
    <published>2010-03-10T07:11:00Z</published>
    <updated>2010-12-29T13:28:06Z</updated>
    <category term="autospec"/>
    <category term="autotest"/>
    <category term="git"/>
    <category term="zentest"/>
    <link href="http://www.thickpaddy.com/2010/3/10/gittest-love-child-of-git-and-autotest" rel="alternate" type="text/html"/>
    <title>Gittest - love child of Git and Autotest</title>
<summary type="html">Continuous testing with &lt;a href=&quot;http://zentest.rubyforge.org/&quot;&gt;autotest&lt;/a&gt; is great, especially when working on an existing application. Autotest runs in the background, and when you save a file, runs any matching tests and tells you if you've broken something, fantastic! Thing is though, there are lots of reasons why you don't want tests to run every, single time you make some little, tiny change to your code.</summary><content type="html">
&lt;p&gt;&lt;em&gt;Edit: this script has now evolved into a gem. See &lt;a href=&quot;https://github.com/thickpaddy/gittest&quot;&gt;https://github.com/thickpaddy/gittest&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Continuous testing with &lt;a href=&quot;http://zentest.rubyforge.org/&quot;&gt;autotest&lt;/a&gt; is great, especially when working on an existing application. Autotest runs in the background, and when you save a file, runs any matching tests and tells you if you've broken something, fantastic! Thing is though, there are lots of reasons why you don't want tests to run every, single time you make some little, tiny change to your code.&lt;/p&gt;

&lt;p&gt;For me, the most important feature of autotest is its ability to find and run relevant tests for the modified code files, rather than run the entire test suite. This is crucial when you have an enormous application with a horrendously large test suite. It means you can have a fairly good idea whether you've broken the build or not, without having to run the entire test suite, which saves a lot of time. &lt;/p&gt;

&lt;p&gt;But what if you forget to run autotest? Then you realise that you want to run the relevant tests and you can't, unless you re-save each file you have open. Or what if you've made some minor changes that you know are going to break the tests, but you want to save the file because you need to make a cup of tea or grab a sandwich? Wouldn't it be nice if you could just run the relevant tests when you're ready, rather than every single time you make a change? &lt;/p&gt;

&lt;p&gt;Gittest is a scrappy little script that abuses autotest. You run it, it looks for new or modified files in git, then it uses the autotest mappings to determine which tests (or specs) need to run and tells autotest to run them (err, sort of). It's really, really ugly because I found it hard to figure out what the hell autotest was doing under the hood, but it works. I'll clean it up when I get a chance, but here's a working version of the script...&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
#!/usr/bin/env ruby

require 'rubygems'
require 'autotest'
require 'optparse'

# exit if we're not in a git repository
exit $?.exitstatus unless system('git diff --name-only &gt; /dev/null')

# make sure we're in the root path for the repository
loop do
  begin
    Dir.entries('.git')
    break
  rescue SystemCallError
    Dir.chdir('..')
    next   
  end
end

options = {:fast =&gt; false, :diff =&gt; 'HEAD', :trace =&gt; false}
OptionParser.new do |opts|
  opts.banner = &quot;Usage: ./script/gittest [options]&quot;
  opts.on(&quot;-f&quot;, &quot;--fast&quot;, &quot;Fast mode - skips preparation of test db&quot;) do |o|
    options[:fast] = o
  end
  opts.on(&quot;-d MANDATORY&quot;, &quot;--diff MANDATORY&quot;, &quot;Commit argument for git diff command used to check for new or modified files (defaults to HEAD)&quot;) do |o|
    options[:diff] = o
  end
  opts.on(&quot;-t&quot;, &quot;--trace&quot;, &quot;Enable trace option when calling rake tasks to prepare test db&quot;) do |o|
    options[:trace] = o
  end  
end.parse!

# prepare db if fast start not switched on
unless options[:fast]

  puts &quot;Preparing test database...&quot;
  puts &quot;(You can use the -f switch to skip this in future)&quot;
  rake_options = options[:trace] ? '--trace' : ''
  system &quot;rake db:migrate RAILS_ENV=test #{rake_options} &gt; /dev/null&quot;
  system &quot;rake db:test:prepare #{rake_options} &gt; /dev/null&quot;
 
end
      
# autotest options
$f = true # never run the entire test/spec suite on startup
$v = false
$h = false
$q = false
$DEBUG = false
$help = false

# use ansi colors to highlight test/spec passes, failures and errors
COLORS = { :red =&gt; 31, :green =&gt; 32, :yellow =&gt; 33 }

# get a list of new or modified files according to git (using terminal commands is faster than using a ruby git library)
new_or_modified_files = `git diff --name-only #{options[:diff]}`.split(&quot;\n&quot;).uniq

if new_or_modified_files.size == 0
  puts &quot;No modified files, exiting&quot;
  exit
end
  
msg = &quot;#{new_or_modified_files.size} new or modified file&quot; 
msg &amp;lt;&amp;lt; &quot;s&quot; unless new_or_modified_files.size == 1
puts msg + &quot;:&quot;

new_or_modified_files.each {|f| puts &quot;\t#{f}&quot;}

at = Autotest.new
# Note: the initialize hook is normally called within Autotest#run
at.hook :initialize
at.reset
at.find_files # must populate the known files for autotest, otherwise Autotest#files_matching will always return nil

# this isn't pretty, but it will probably be reliable enough (can't see any good reason for renaming that particular instance variable, IMO it should be accessible anyway)
test_mappings = at.instance_eval { @test_mappings }

# find files to test
files_to_test = at.new_hash_of_arrays
new_or_modified_files.each do |f|
  next if f =~ at.exceptions # skip exceptions
  result = test_mappings.find { |file_re, ignored| f =~ file_re }
  unless result.nil?
    # files_to_test is a hash of arrays, so simply reading a key is enough to register that file
    [result.last.call(f, $~)].flatten.each {|match| files_to_test[match] if File.exist?(match)}
  end
end

# exit if no files to test
if files_to_test.empty?
  puts &quot;No matching files to test, exiting&quot;
  exit
end
  
msg = &quot;#{files_to_test.size} file&quot; 
msg &amp;lt;&amp;lt; &quot;s&quot; unless files_to_test.size == 1
msg &amp;lt;&amp;lt; &quot; to test&quot;
puts msg + &quot;:&quot;
puts &quot;\t&quot; + files_to_test.map{|k,v| k}.sort.join(&quot;\n\t&quot;)

puts &quot;Press ENTER to continue, or CTRL+C to quit&quot;
begin
  $stdin.gets # note: Kernel#gets assumes that ARGV contains a list of files from which to read next line
rescue Interrupt
  exit 1
end
puts &quot;Running tests and specs, please wait...&quot; 

cmd = at.make_test_cmd(files_to_test)

at.hook :run_command

# copied from Autotest#run_tests and updated to use ansi colours in TURN enabled test output and specs run with the format option set to specdoc
old_sync = $stdout.sync
$stdout.sync = true
results = []
line = []
begin
  open(&quot;| #{cmd}&quot;, &quot;r&quot;) do |f|
    until f.eof? do
      c = f.getc or break
      # putc c
      line &amp;lt;&amp;lt; c
      if c == ?\n then
        str = if RUBY_VERSION &gt;= &quot;1.9&quot; then
                          line.join
                        else
                          line.pack &quot;c*&quot;
                        end
        results &amp;lt;&amp;lt; str
        line.clear
        if str.match(/(PASS|FAIL|ERROR)$/)
          # test output
          case $1
            when 'PASS' ; color = :green
            when 'FAIL' ; color = :red
            when 'ERROR' ; color = :yellow
          end
          print &quot;\e[#{COLORS[color]}m&quot; + str + &quot;\e[0m&quot;
        elsif str.match(/^\- /)
          # spec output 
          if str.match(/^\- .*(ERROR|FAILED) \- [0-9]+/)
            color = $1 == 'FAILED' ? :red : :yellow 
            print &quot;\e[#{COLORS[color]}m&quot; + str + &quot;\e[0m&quot;
          else
            print &quot;\e[#{COLORS[:green]}m&quot; + str + &quot;\e[0m&quot;
          end
        else
          print str
        end
      end
    end
  end
ensure
  $stdout.sync = old_sync
end

at.handle_results(results.join)
&lt;/pre&gt;
          </content>  </entry>
  <entry xml:base="http://www.thickpaddy.com/">
    <author>
      <name>thickpaddy</name>
    </author>
    <id>tag:www.thickpaddy.com,2010-02-15:6</id>
    <published>2010-02-15T07:50:00Z</published>
    <updated>2010-08-14T08:48:58Z</updated>
    <category term="campfire"/>
    <category term="hudson"/>
    <link href="http://www.thickpaddy.com/2010/2/15/forked-campfire-plugin-for-hudson" rel="alternate" type="text/html"/>
    <title>Forked Campfire plugin for Hudson</title>
<summary type="html">Sometime before Christmas, we decided we'd had enough of CruiseControl.rb and needed a better continuous integration solution. We use Campfire to keep in touch with each other during the day and we'd always had build notifications from CC.rb sent to Campfire, so a similar feature was considered essential in any continuous integration solution that could be considered as a replacement.</summary><content type="html">
&lt;p&gt;Sometime before Christmas, we decided we'd had enough of CruiseControl.rb and needed a better continuous integration solution. We use Campfire to keep in touch with each other during the day and we'd always had build notifications from CC.rb sent to Campfire, so a similar feature was considered essential in any continuous integration solution that could be considered as a replacement.&lt;/p&gt;

&lt;p&gt;After looking at numerous different options, and testing a few of them out, we settled on Hudson, a solid, well-documented and extensible CI server written in Java. Luckily, we found that someone else, Jens Lukowski, had already written a Campfire notifier plugin for Hudson. The plugin was fairly new and a little buggy, so with my limited Java knowledge I implemented a couple of workarounds for some small issues and passed suggestions on to Jens, who implemented them in an updated version of the plugin. Unfortunately, in the updated version of the plugin a number of issues remained, so I spent some time over Christmas tracking them down and sent some updated code to Jens. There hasn't been a release including these updates since, so I decided to publish my fork of the plugin on Github.&lt;/p&gt;

&lt;p&gt;If you've been trying to use the Campfire notifier available from the Hudson wiki and have been running into problems, you should give this fork a go, at least until an updated version of the &quot;official&quot; plugin is available. We've been using it for over a month now without any problems. It addresses a number of issues and also adds some new features...&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Refactored the code to fix a number of null pointer exceptions.&lt;/li&gt;
&lt;li&gt;Moved from per-job to global config.&lt;/li&gt;
&lt;li&gt;Fixed issues with configuration details being lost after a hudson restart.&lt;/li&gt;
&lt;li&gt;Tidied up jelly view for configuration form and added help files for each field.&lt;/li&gt;
&lt;li&gt;Added a link to the build in notifications sent to campfire.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can get it from &lt;a href=&quot;http://github.com/jgp/hudson_campfire_plugin&quot;&gt;http://github.com/jgp/hudson_campfire_plugin&lt;/a&gt;&lt;/p&gt;
          </content>  </entry>
  <entry xml:base="http://www.thickpaddy.com/">
    <author>
      <name>thickpaddy</name>
    </author>
    <id>tag:www.thickpaddy.com,2009-08-10:5</id>
    <published>2009-08-10T19:44:00Z</published>
    <updated>2011-09-11T17:44:20Z</updated>
    <category term="coldfusion"/>
    <category term="encoding"/>
    <category term="utf8"/>
    <category term="utf-8"/>
    <link href="http://www.thickpaddy.com/2009/8/10/coldfusion-is-not-utf-8-encoded" rel="alternate" type="text/html"/>
    <title>ColdFusion is not UTF-8 encoded</title>
<summary type="html">A lot of CF developers seem to have concluded that since ColdFusion is now effectively a java webapp and there's a lot of talk about UTF8 encoding that hey, ColdFusion is UTF8 encoded. Well, it's not that simple, not everything in ColdFusion is UTF8 encoded, but this seems to be a difficult thing to explain without some background, so I wrote me a little blog post about it. If you had assumed that you were always dealing with UTF8 encoded unicode, or you're not even quite sure what UTF-8 encoded unicode means, well, read on my friend, this post is for you...</summary><content type="html">
&lt;p&gt;A lot of CF developers seem to have concluded that since ColdFusion is now effectively a java webapp and there's a lot of talk about UTF8 encoding that hey, ColdFusion is UTF8 encoded. Well, it's not that simple, not everything in ColdFusion is UTF8 encoded, but this seems to be a difficult thing to explain without some background, so I wrote me a little blog post about it. If you had assumed that you were always dealing with UTF8 encoded unicode, or you're not even quite sure what UTF-8 encoded unicode means, well, read on my friend, this post is for you...&lt;/p&gt;
&lt;h2&gt;Character Sets and Encodings&lt;/h2&gt;
&lt;p&gt;Let's start by explaining what it means to refer to some text as UTF-8 encoded unicode. Without getting too pedantic about it, unicode is usually used to refer to the set of characters that is part of the Unicode Standard. The same set of characters is also referred to as the Universal Character Set (UCS), and has been standardized by the International Organization for Standardization (ISO) as part of the ISO-10646 standard.&lt;/p&gt;
&lt;p&gt;A character set is a set of distinct, named, characters (technically referred to as a character repertoire) and a set of numeric codes used to refer to those characters. For example, the ASCII character set includes the Latin letter “lowercase a”, which has a numeric code of 97. The numeric codes are usually referred to as either character codes or code points. A character encoding specifies an algorithm for storing the characters in a particular character set as a sequence of bytes (octets really, but we'll avoid getting into too much detail here). So, to say some text is UTF-8 encoded unicode, we mean the text contains only characters in the “unicode” character set and it has been encoded into a sequence of bytes using the UTF-8 encoding scheme.&lt;/p&gt;
&lt;p&gt;While character sets and character encodings aren't the same thing, the terms have been used interchangeably over the years and the MIME standard even uses the term charset to refer to the combination of a character set and encoding scheme. Though ISO-10646 defines a number of possible encodings for the Universal Character Set, many earlier standards, such as ISO-8859-1 (aka Latin-1) and Windows-1252, only defined a single encoding for a character set, blurring the distinction between character sets and character encodings. It's all very confusing, but the important thing to remember is that the character encoding defines how characters are encoded into a sequence of bytes, and encoding schemes are not always compatible with one another. For example, lowercase e acute, i.e. &amp;eacute;, is not stored using the same sequence of bytes by both UTF-8 and ISO-8859-1; in fact, UTF-8 uses two bytes whereas ISO-8859-1 uses just one.&lt;/p&gt;
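&lt;p&gt;If you want to see the difference for yourself, here is a quick illustration. It is in Ruby (1.9 or later) rather than ColdFusion, purely because it is short; the variable names are just for the example.&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
```ruby
e_acute = "\u00E9" # lowercase e acute

# Encode the same character with two different encoding schemes.
utf8_bytes   = e_acute.encode("UTF-8").bytes.to_a
latin1_bytes = e_acute.encode("ISO-8859-1").bytes.to_a

p utf8_bytes   # [195, 169], i.e. two bytes: 0xC3 0xA9
p latin1_bytes # [233], i.e. one byte: 0xE9
```
&lt;/pre&gt;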
&lt;h2&gt;ColdFusion uses UTF-16 internally&lt;/h2&gt;
&lt;p&gt;This isn't particularly important, but it may clear up some confusion over the role of the pageEncoding attribute of the cfprocessingdirective tag. ColdFusion runs on Java, and Java uses UTF-16 to represent text internally, so ColdFusion presumably also uses UTF-16. The pageEncoding attribute of the cfprocessingdirective tag does not tell ColdFusion how to represent data internally, it just tells ColdFusion which encoding to use when processing/compiling the source file.&lt;/p&gt;
&lt;p&gt;Although I'm pretty sure ColdFusion does use UTF-16, the ColdFusion 8 documentation states that ColdFusion uses UCS-2. Java did use UCS-2 in the distant past, so I'm guessing this is just a mistake in the docs, but even if it is correct it doesn't really matter for most people. UCS-2 is an obsolete predecessor to UTF-16 that uses 2 bytes to store each character and can represent all of the characters in the basic multilingual plane of the unicode standard. Although the unicode standard now includes over 100,000 characters, the basic multilingual plane has 65,536 code points and covers the vast majority of characters in common use around the world. UCS-2 and UTF-16 encoding of the characters in the basic multilingual plane is identical, so for most purposes, it really doesn't matter whether ColdFusion uses UCS-2 or UTF-16 internally.&lt;/p&gt;
&lt;p&gt;If the reference to UCS-2 in the documentation is not a mistake, it might be that it does use UTF-16, but that some operations aren't safe with supplementary characters outside the basic multilingual plane, which are encoded using surrogate pairs of 2 byte code units. Again, it really doesn't matter as long as you're only using characters in the basic multilingual plane, which you probably are.&lt;/p&gt;
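&lt;p&gt;To make the surrogate pair business a little more concrete, here is a small Ruby (1.9 or later) illustration. It shows how UTF-16 in general encodes a character inside and outside the basic multilingual plane; it says nothing about what ColdFusion itself does internally.&lt;/p&gt;

&lt;pre class=&quot;prettyprint&quot;&gt;
```ruby
bmp_char      = "\u00E9"    # e acute, inside the basic multilingual plane
supplementary = "\u{1D11E}" # musical G clef, outside the basic multilingual plane

# Unpack the big-endian UTF-16 encoding into 16-bit code units.
bmp_units   = bmp_char.encode("UTF-16BE").unpack("n*")
extra_units = supplementary.encode("UTF-16BE").unpack("n*")

puts bmp_units.map { |unit| format("%04X", unit) }.join(" ")   # one code unit
puts extra_units.map { |unit| format("%04X", unit) }.join(" ") # a surrogate pair
```
&lt;/pre&gt;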
&lt;h2&gt;Default Encoding for IO is Platform Dependent&lt;/h2&gt;
&lt;p&gt;While ColdFusion defaults to using UTF-8 when sending output to the browser or via email, the default encoding used for other kinds of input and output can vary depending on the operating system and configuration of the java virtual machine (JVM). This matters when you are reading and writing files, and can also matter when processing text using java libraries.&lt;/p&gt;
&lt;p&gt;Each JVM instance has a default encoding scheme (referred to in the Java documentation as the platform's default encoding), which can be set by passing an argument to the JVM when starting the instance, but by default comes from the operating system. For example, on a Windows server set to use a Western European locale, the default encoding scheme might be Windows-1252. This is the default encoding used by CFFILE when reading and writing text files (in the absence of a charset attribute or a byte order mark (BOM) at the start of the file), and will also be the default encoding used by methods of various Java objects when converting data from strings to streams and vice-versa.&lt;/p&gt;
&lt;p&gt;So, for example, if you use CFFILE to read a UTF-8 encoded configuration file, you can't just assume that CFFILE will read the file as UTF-8, because there are circumstances in which it won't and some characters won't be decoded properly. Similarly, if you are converting Java Strings to Streams in order to pass them to processing libraries (as you might do if you are using JTidy), you can't assume that those conversions will use the UTF-8 encoding scheme. The same issue applies to your source files - ColdFusion will fall back to using the platform's default encoding when compiling the source if the encoding is not set using cfprocessingdirective and the file does not include a byte order mark.&lt;/p&gt;
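&lt;p&gt;The same trap exists in most languages, so here's a rough Python sketch of what goes wrong. It is not CFFILE itself: the cp1252 read just simulates a Windows-1252 platform default, and the temporary file and variable names are made up for the example:&lt;/p&gt;
&lt;pre&gt;
```python
import os
import tempfile

text = "caf\u00e9"

# Write the file as UTF-8, without a byte order mark.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(text.encode("utf-8"))

# Being explicit about the encoding: the text round-trips.
with open(path, encoding="utf-8") as handle:
    read_as_utf8 = handle.read()

# Simulating a Windows-1252 default: the two UTF-8 bytes of e acute
# are decoded as two separate characters.
with open(path, encoding="cp1252") as handle:
    read_as_cp1252 = handle.read()

os.remove(path)
print(read_as_utf8 == text, read_as_cp1252 == text)
```
&lt;/pre&gt;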
&lt;p&gt;Note that you can attempt to detect the encoding of a file using either command line tools (e.g. file command on UNIX systems, eh, kind of) or Java libraries (e.g. &lt;a href=&quot;http://jchardet.sourceforge.net/&quot;&gt;jchardet&lt;/a&gt;), but CFFILE doesn't seem to do any character encoding detection apart from looking for a byte order mark, and character encoding detection is not 100% reliable anyway.&lt;/p&gt;
&lt;h2&gt;Content-type of Response Does Not Always Default to UTF-8&lt;/h2&gt;
&lt;p&gt;By default, ColdFusion sends HTTP responses UTF-8 encoded, and also sends a Content-type header set to text/html; charset=UTF-8. When you use the CFCONTENT tag, and set the content type without setting a charset, ColdFusion usually figures out what the encoding should be and appends the charset, but not always. With the file attribute, this seems to work well, and CFCONTENT does actually seem to do some encoding detection that goes beyond simply checking for a byte order mark and falling back to the platform's default encoding. If you don't use the file attribute, it only seems to work when you set the content type to either text/html or text/plain. Set it to anything else, and ColdFusion sends a content-type header as you enter it, without a charset element, leaving the browser/client to decide how to interpret the response. The HTTP specification names ISO-8859-1 as the default encoding when the content-type header has no charset element and, lo and behold, that's exactly how some clients interpret it. This can lead to mangled data because CF does actually send the response UTF-8 encoded, but the client interprets it as ISO-8859-1.&lt;/p&gt;
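&lt;p&gt;As a rough sketch of the client side of this, the Python below picks a charset from a content-type header and falls back to ISO-8859-1 when there isn't one. The charset_for helper and the text/xml example are my own invention, and real clients vary in what they actually do:&lt;/p&gt;
&lt;pre&gt;
```python
from email.message import Message


def charset_for(content_type, default="iso-8859-1"):
    """Return the charset named in a content-type value, or the default."""
    header = Message()
    header["Content-Type"] = content_type
    return header.get_content_charset(default)


# The server always sends UTF-8 bytes in this scenario.
body = "caf\u00e9".encode("utf-8")

# With a charset element, the client decodes the body correctly.
decoded_ok = body.decode(charset_for("text/html; charset=UTF-8"))

# Without one, a client that falls back to ISO-8859-1 mangles the text.
seen_by_client = body.decode(charset_for("text/xml"))

print(decoded_ok == "caf\u00e9", seen_by_client == "caf\u00e9")
```
&lt;/pre&gt;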
&lt;h2&gt;Verity is a Law unto Itself&lt;/h2&gt;
&lt;p&gt;The character set and encoding used by the Verity indexing engine is not directly related to the character set and encoding used by the ColdFusion server. The encoding used when indexing and searching is dependent on the language selected, which defaults to English, the version of Verity/ColdFusion and, if you are indexing documents, the default encoding for the operating system.&lt;/p&gt; 
&lt;p&gt;The version of Verity included with CF6 didn't support Unicode, so Verity could only deal with a subset of the characters that can be represented by ColdFusion, the documents on the server and probably the database. Later versions do support Unicode, but don't create multilingual indexes by default, and use either the operating system's default encoding or UTF-8 when indexing files. I'm unsure of the exact mechanism at work to decide which encoding to use when indexing files as I've only used Verity for indexing database queries, but to create multilingual indexes you need to install the separate Verity multi language pack and set the language to &quot;uni&quot;.&lt;/p&gt;
&lt;p&gt;And, I think that's it. You're probably more confused than ever now, but at least you know that you can't assume that text is UTF-8 encoded. The only advice I can give is: where you know the encoding, be explicit and tell ColdFusion what to do, and don't rely on the default behaviour.&lt;/p&gt;
          </content>  </entry>
  <entry xml:base="http://www.thickpaddy.com/">
    <author>
      <name>thickpaddy</name>
    </author>
    <id>tag:www.thickpaddy.com,2009-07-23:2</id>
    <published>2009-07-23T11:52:00Z</published>
    <updated>2011-09-06T16:29:14Z</updated>
    <category term="centos"/>
    <category term="postgresql"/>
    <category term="rhel"/>
    <category term="tsearch2"/>
    <link href="http://www.thickpaddy.com/2009/7/23/upgrading-postgres-tsearch2-on-centos-5" rel="alternate" type="text/html"/>
    <title>Upgrading Postgres/tsearch2 on CentOS 5</title>
<summary type="html">For anyone using the tsearch2 module with postgres version 8.1 included with CentOS 5, upgrading to 8.3 or later can be a little tricky and IMO, the official documentation and a number of blog posts I've read skip over the details. I documented what I did when I first upgraded one of our servers from 8.1 to 8.4, and I've posted an edited version here for anyone that might find it useful. The instructions should also be applicable to users of RHEL and Fedora.</summary><content type="html">
            For anyone using the tsearch2 module with postgres version 8.1 included with CentOS 5, upgrading to 8.3 or later can be a little tricky and IMO, the official documentation and a number of blog posts I've read skip over the details. I documented what I did when I first upgraded one of our servers from 8.1 to 8.4, and I've posted an edited version here for anyone that might find it useful. The instructions should also be applicable to users of RHEL and Fedora.
&lt;p&gt;Normally, when you upgrade postgres, you can just dump all databases in the cluster, using pg_dumpall, and then restore them once you've upgraded. Unfortunately, none of the tsearch2 stuff from pre-8.3 versions will work if you do this, so for each of the databases using tsearch2, you need to avoid restoring all that old tsearch2 stuff. A simple solution to this is to dump the databases that use tsearch2 separately, without the clean option, then when restoring, create each of these databases manually and install the tsearch2 compatibility module before restoring the data from the dump file.&lt;/p&gt;
&lt;p&gt;Doing this avoids restoring the old tsearch2 stuff, because the various tsearch functions, operators etc. will already exist, having been created by the tsearch2 compatibility module, by the time you restore from the dump. It also means you should avoid having to update your old code to work with the full text search features as they are implemented in postgres 8.3 and later.&lt;/p&gt;
&lt;p&gt;
One thing to note with this process is that it assumes that you have a lot of databases, but only a handful using tsearch2. If you have loads of databases in your cluster using the tsearch2 module, this method might be a little tedious.&lt;/p&gt;
&lt;p&gt;
I was logged in to a CentOS 5 server as root while performing the upgrade, and we allow root to connect as user postgres using ident authentication. You might have to modify the commands slightly.
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, download the latest yum setup rpm for your distro from &lt;a href=&quot;http://yum.pgsqlrpms.org/reporpms/repoview/index.html&quot;&gt;http://yum.pgsqlrpms.org/reporpms/repoview/index.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Install the package:&lt;br /&gt;&lt;code&gt;rpm -Uvh pgdg-centos-8.4-1.noarch.rpm&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Edit the base CentOS repo file:&lt;br&gt;
&lt;code&gt;vim /etc/yum.repos.d/CentOS-Base.repo&lt;/code&gt;&lt;br&gt;
Note: if you're using Red Hat or Fedora, you'll need to edit the equivalent repo file (sorry, I'm not sure what it's called).
&lt;/li&gt;
&lt;li&gt;Add a line to exclude postgres related packages to both the base and updates sections of the file. You should end up with something like this:&lt;br /&gt;
&lt;pre&gt;
[base]
name=CentOS-$releasever - Base
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&amp;arch=$basearch&amp;repo=os
#baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-5
exclude=postgresql*

#released updates
[updates]
name=CentOS-$releasever - Updates
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&amp;arch=$basearch&amp;repo=updates
#baseurl=http://mirror.centos.org/centos/$releasever/updates/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-5
exclude=postgresql*
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Dump all databases using the clean option:&lt;br&gt;
&lt;code&gt;pg_dumpall -U postgres -c &gt; /var/backup/postgres/postgres_all.sql&lt;/code&gt;&lt;br&gt;
Note: I'd also recommend making a separate backup that uses inserts rather than copy statements, to keep in reserve in case something goes pear-shaped.&lt;/li&gt;
&lt;li&gt;Dump each database that uses tsearch2 separately, without using the clean option! (read that again, DO NOT USE THE CLEAN OPTION):&lt;br&gt;
&lt;code&gt;pg_dump -U postgres my_tsearch_db &gt; /var/backup/postgres/my_tsearch_db.sql&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Update server to latest version:&lt;br&gt;
&lt;code&gt;yum update&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Stop the server and move your old data directory:&lt;br&gt;
&lt;code&gt;
service postgresql stop&lt;br&gt;
mv /var/lib/pgsql/data /var/lib/pgsql/data_8.1
&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Init a new postgres 8.4 database cluster:&lt;br&gt;
&lt;code&gt;service postgresql initdb&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Update the configuration settings as necessary in the new postgresql.conf:&lt;br&gt;
&lt;code&gt;vim /var/lib/pgsql/data/postgresql.conf&lt;/code&gt;
&lt;pre&gt;
...
log_directory = '/var/log/postgresql'
log_line_prefix = '%t %d'
...
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Edit the pg_ident.conf file if necessary to allow ident authentication (we allow root to connect as user postgres):&lt;br&gt;
&lt;code&gt;vim /var/lib/pgsql/data/pg_ident.conf&lt;/code&gt;
&lt;pre&gt;
...
# MAPNAME     IDENT-USERNAME    PG-USERNAME
localusers      root            postgres
...
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Edit the pg_hba.conf file if necessary to allow your users to connect:&lt;br&gt;
&lt;code&gt;vim /var/lib/pgsql/data/pg_hba.conf&lt;/code&gt;
&lt;pre&gt;
....
# TYPE  DATABASE    USER        CIDR-ADDRESS          METHOD
# &quot;local&quot; is for Unix domain socket connections only
local   all         all                               ident map=localusers
# IPv4 local connections:
host    all         all         127.0.0.1/32          password
....
&lt;/pre&gt;
Note that with ident authentication, the format of the options provided to each connection method has changed and now requires name value pairs. For example, if you had an ident map name of localusers in postgres 8.1, the method field in your pg_hba.conf could be set to &quot;ident localusers&quot;. With 8.4, this needs to be set to &quot;ident map=localusers&quot;.
&lt;/li&gt;
&lt;li&gt;Just to be on the safe side, make sure the permissions on the config files are correct (in our case, they should be owned by user postgres):&lt;br&gt;
&lt;code&gt;chown postgres:postgres /var/lib/pgsql/data/*.conf&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Start the database cluster:&lt;br&gt;
&lt;code&gt;service postgresql start&lt;/code&gt;&lt;br&gt;
If this fails, try running the postgres binary manually to see what's causing the problem. Open postmaster.opts to get the full command to execute, switch to user postgres and execute it.
&lt;/li&gt;
&lt;li&gt;Check that the connection to the server is working (and fix your config if it isn't):&lt;br&gt;
&lt;code&gt;psql -U postgres&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Restore the dump of all databases:&lt;br&gt;
&lt;code&gt;psql -U postgres &amp;lt; /var/backup/postgres/postgres_all.sql&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Switch to postgres user and drop the databases that use tsearch2:&lt;br&gt;
&lt;code&gt;
su postgres&lt;br&gt;
dropdb my_tsearch_db&lt;br&gt;
...&lt;br&gt;
&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Create empty databases for each database that uses tsearch2:&lt;br&gt;
&lt;code&gt;createdb -E UTF-8 my_tsearch_db&lt;br&gt;
...&lt;br&gt;
exit&lt;br&gt;
&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
Install tsearch2 compatibility module and then restore each database that uses tsearch2:&lt;br&gt;
&lt;code&gt;psql -U postgres my_tsearch_db &amp;lt; /usr/share/pgsql/contrib/tsearch2.sql&lt;br&gt;
psql -U postgres my_tsearch_db &amp;lt; /var/backup/postgres/my_tsearch_db.sql
&lt;/code&gt;&lt;br&gt;
Note: you will see lots of errors when restoring the databases. This is expected, because the various functions and operators for tsearch already exist.
&lt;/li&gt;
&lt;li&gt;Quickly test that tsearch features are working as expected:&lt;br&gt;
&lt;code&gt;psql -U postgres my_tsearch_db&lt;/code&gt;&lt;br&gt;
&lt;pre&gt;
# SELECT * FROM some_table_with_a_tsvector_column WHERE that_tsvector_column @@ to_tsquery('English','test');
&lt;/pre&gt;
Fingers crossed, you won't see any errors and everything will be working perfectly.
&lt;/li&gt;
&lt;/ul&gt;
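&lt;p&gt;If you do have a lot of tsearch2 databases, the drop, create and restore steps above could be generated rather than typed. Here's a minimal Python sketch that only prints the commands for you to review. The second database name is hypothetical, it uses the -f option of psql in place of the shell redirects shown above, and it assumes that dropdb and createdb can connect with -U postgres rather than via su:&lt;/p&gt;
&lt;pre&gt;
```python
# Paths taken from the steps above; adjust them to suit your server.
DUMP_DIR = "/var/backup/postgres"
TSEARCH2_SQL = "/usr/share/pgsql/contrib/tsearch2.sql"


def restore_commands(db_name):
    """Build the per-database commands, in the order they should run."""
    dump_file = "{}/{}.sql".format(DUMP_DIR, db_name)
    return [
        "dropdb -U postgres {}".format(db_name),
        "createdb -U postgres -E UTF-8 {}".format(db_name),
        # Install the compatibility module before restoring the data.
        "psql -U postgres -d {} -f {}".format(db_name, TSEARCH2_SQL),
        "psql -U postgres -d {} -f {}".format(db_name, dump_file),
    ]


# Hypothetical list of databases that use tsearch2.
for name in ["my_tsearch_db", "another_tsearch_db"]:
    print("\n".join(restore_commands(name)))
```
&lt;/pre&gt;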
&lt;p&gt;And that's it, upgrade complete, yay!&lt;/p&gt;
          </content>  </entry>
</feed>

