Base64 in Clojure: Once More, With Feeling

Dec 28th, 2010

Originally published at Heuristic Fencepost

In a previous post we sketched out a simple base64 implementation in Clojure and measured it’s performance against clojure-contrib and commons-codec. Our initial implementation used the partition function to split the input into blocks and then applied a sequence of interdependent functions to translate each block into base64-encoded data. This approach worked well enough but it’s not terribly efficient; it traverses the input data once to do the partitioning and then that same data is traversed again (in block form) when the encoding function is applied. There’s no reason to make multiple passes if we attack the problem recursively rather than iteratively. Time for some further investigation.

(ns fencepost.base64-recur)

; This could be placed within the defn below but rebuilding this map
; on each invocation seems a bit wasteful.
(def to-base64-recur
     (zipmap 
      (range 0 64)
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
     )

(def from-base64-recur
     (zipmap 
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
      (range 0 64)
      )
     )

(defn base64-encode-recur [#^String arg]
  "Encode the input string into the base64 alphabet.  Currently uses the platform charset for the conversion
   of the input string to bytes; this could be generalized if need be."
  (loop [bytes (vec (.getBytes arg))
         rv ""]
    (let [samplesize (count bytes)]
      (cond
       (> samplesize 3)
       (let [[byte1 byte2 byte3] (take 3 bytes)
             char1 (bit-shift-right byte1 2)
             char2 (bit-and 0x3F (bit-or (bit-shift-left byte1 4) (bit-shift-right byte2 4)))
             char3 (bit-and 0x3F (bit-or (bit-shift-left byte2 2) (bit-shift-right byte3 6)))
             char4 (bit-and 0x3F byte3)]
         (recur (subvec bytes 3) (str rv (apply str (map to-base64-recur (list char1 char2 char3 char4)))))
         )
       (= samplesize 3)
       (let [[byte1 byte2 byte3] (take 3 bytes)
             char1 (bit-shift-right byte1 2)
             char2 (bit-and 0x3F (bit-or (bit-shift-left byte1 4) (bit-shift-right byte2 4)))
             char3 (bit-and 0x3F (bit-or (bit-shift-left byte2 2) (bit-shift-right byte3 6)))
             char4 (bit-and 0x3F byte3)]
         (str rv (apply str (map to-base64-recur (list char1 char2 char3 char4))))
         )
       (= samplesize 2)
       (let [[byte1 byte2] (take 2 bytes)
             char1 (bit-shift-right byte1 2)
             char2 (bit-and 0x3F (bit-or (bit-shift-left byte1 4) (bit-shift-right byte2 4)))
             char3 (bit-and 0x3F (bit-shift-left byte2 2))]
         (str rv (apply str (map to-base64-recur (list char1 char2 char3))) "=")
         )
       (= samplesize 1)
       (let [byte1 (get bytes 0)
             char1 (bit-shift-right byte1 2)
             char2 (bit-and 0x3F (bit-shift-left byte1 4))]
         (str rv (apply str (map to-base64-recur (list char1 char2))) "==")
         )
       (= samplesize 0) ""
       )
      )
    )
  )

(defn base64-decode-recur [#^String arg]
  "Decode the string of base64-encoded content into a String"
  (loop [currstr arg
         rv ""]
    (let [currstrsize (count currstr)]
      (cond
       ; String longer than 4 characters
       (> currstrsize 4)
       (let [[char1 char2 char3 char4] (map from-base64-recur (take 4 currstr))
             byte1 (bit-or (bit-shift-left char1 2) (bit-shift-right char2 4))
             byte2 (bit-and 0xff (bit-or (bit-shift-left char2 4) (bit-shift-right char3 2)))
             byte3 (bit-and 0xff (bit-or (bit-shift-left char3 6) char4))
             newstr (String. (byte-array (map byte [byte1 byte2 byte3])))]
         (recur (.substring currstr 4) (str rv newstr))
         )
       (= currstrsize 0) ""
       (< currstrsize 4)
       (throw (IllegalArgumentException. "Malformed base64 string; must have even multiple of four characters"))
       (identity 1)
       (let [curratom (apply str (take 4 currstr))]
         (cond
          ; Third byte is the bottom two bits of byte 2 plus all six low
          ; bits of byte 3
          (re-matches #"[A-Za-z0-9+/]{4}" curratom)
          (let [[char1 char2 char3 char4] (map from-base64-recur (take 4 currstr))
                byte1 (bit-or (bit-shift-left char1 2) (bit-shift-right char2 4))
                byte2 (bit-and 0xff (bit-or (bit-shift-left char2 4) (bit-shift-right char3 2)))
                byte3 (bit-and 0xff (bit-or (bit-shift-left char3 6) char4))
                newstr (String. (byte-array (map byte [byte1 byte2 byte3])))]
            (str rv newstr)
            )
          ; Second byte is the low four bits of byte 2 plus the top four
          ; bits of byte 3
          (re-matches #"[A-Za-z0-9+/]{3}=" curratom)
          (let [[char1 char2 char3] (map from-base64-recur (take 3 currstr))
                byte1 (bit-or (bit-shift-left char1 2) (bit-shift-right char2 4))
                byte2 (bit-and 0xff (bit-or (bit-shift-left char2 4) (bit-shift-right char3 2)))
                newstr (String. (byte-array (map byte [byte1 byte2])))]
            (str rv newstr)
            )
          ; First byte is the low six bits of byte 1 plus the top two
          ; bits of byte 2
          (re-matches #"[A-Za-z0-9+/]{2}==" curratom)
          (let [[char1 char2] (map from-base64-recur (take 2 currstr))
                byte1 (bit-or (bit-shift-left char1 2) (bit-shift-right char2 4))
                newstr (String. (byte-array [(byte byte1)]))]
            (str rv newstr)
            )
          (identity 1)
          (throw (IllegalArgumentException. "Malformed base64 string"))
          )
         )
       )
      )
    )
  )

Clojure optimizes recursive calls when the loop/recur constructs are used so we deploy them here. In addition we make significant use of destructuring. Each input chunk of data is routed to the appropriate function via a sequence of cond expressions. These functions are implemented as let statements which break the chunk into bytes (via destructuring) and then create bindings for the base64 characters representing these bytes by applying the relevant bit manipulations. The body of the let doesn’t need to do much beyond combining these character bindings into an appropriate return value.

This refactoring offers the following benefits:

The code is a lot cleaner and much easier to read and understand
Using the same techniques we should be able to easily implement a decoding function
We should see some increase in performance

While we’re at it, let’s update our comparison application to include the new implementation:

(ns fencepost.test)

(import '(org.apache.commons.codec.binary Base64))
(import '(org.apache.commons.lang RandomStringUtils))
(use '[clojure.contrib.base64])
(use '[fencepost.base64_letfn])
(use '[fencepost.base64-recur])

(def sample-size 100)
(def max-string-size 256)

; Build up some sample data using commons-lang.  Sample data is built
; before any tests are run; this allows us to apply uniform test data
; to each implementation and to avoid corrupting timings with the
; generation of random data.
(def sample-data (map #(RandomStringUtils/randomAscii (* % (rand-int max-string-size))) (repeat sample-size 1)))

; Instantiate a Base64 instance from commons-codec
(def codec-base64 (new Base64 -1))

(println "Encoding")

; Time each run.  Mapping a function onto our sample data produces a
; lazy sequence so we have to take the additional step of realizing
; the sequence; thus the conversion to a vector.
(println "Commons-codec")
(def commons-codec-data (time (doall (map #(String. (.encode codec-base64 (.getBytes %))) sample-data))))
(println (first commons-codec-data))

(println "clojure-contrib")
(def clojure-contrib-data (time (doall (map #(clojure.contrib.base64/encode-str %) sample-data))))
(println (first clojure-contrib-data))

; Simple sanity check; take commons-codec output as canonical and
; verify that what our implementations generate match up with the
; canonical sample.
(assert (.equals (first commons-codec-data) (first clojure-contrib-data)))

(println "fencepost/base64_letfn")
(def fencepost-data (time (doall (map #(fencepost.base64_letfn/base64-encode-ascii %) sample-data))))
(println (first fencepost-data))
(assert (.equals (first commons-codec-data) (first fencepost-data)))

(println "fencepost-recur")
(def fencepost-recur-data (time (doall (map #(fencepost.base64-recur/base64-encode-recur %) sample-data))))
(println (first fencepost-recur-data))
(assert (.equals (first commons-codec-data) (first fencepost-recur-data)))

(println "Decoding")

(println "Commons-codec")
(def commons-codec-data2 (time (doall (map #(String. (.decode codec-base64 (.getBytes %))) commons-codec-data))))
(println (first commons-codec-data2))
(assert (.equals (first sample-data) (first commons-codec-data2)))

(println "clojure-contrib")
(def clojure-contrib-data2 (time (doall (map #(clojure.contrib.base64/decode-str %) clojure-contrib-data))))
(println (first clojure-contrib-data2))
(assert (.equals (first commons-codec-data2) (first clojure-contrib-data2)))

(println "fencepost-recur")
(def fencepost-recur-data2 (time (doall (map #(fencepost.base64-recur/base64-decode-recur %) fencepost-recur-data))))
(println (first fencepost-recur-data2))
(assert (.equals (first commons-codec-data2) (first fencepost-recur-data2)))

So, how’d we do? The following is fairly representative:

bartok ~/git/base64-clojure $ clojure fencepost/test/compare_base64.clj 
Encoding
Commons-codec
"Elapsed time: 85.219348 msecs"
Xm91U1RMbHRAK1ZCcWFqeEsmJEJrfkhIdSsxfStJdGBNc2txUTBkUyQrdCM7Wk9rInN2XFBtKGNTX002aw==
clojure-contrib
"Elapsed time: 896.538841 msecs"
Xm91U1RMbHRAK1ZCcWFqeEsmJEJrfkhIdSsxfStJdGBNc2txUTBkUyQrdCM7Wk9rInN2XFBtKGNTX002aw==
fencepost/base64_letfn
"Elapsed time: 315.582591 msecs"
Xm91U1RMbHRAK1ZCcWFqeEsmJEJrfkhIdSsxfStJdGBNc2txUTBkUyQrdCM7Wk9rInN2XFBtKGNTX002aw==
fencepost-recur
"Elapsed time: 215.516695 msecs"
Xm91U1RMbHRAK1ZCcWFqeEsmJEJrfkhIdSsxfStJdGBNc2txUTBkUyQrdCM7Wk9rInN2XFBtKGNTX002aw==

The recursive encoding operation is just north of a third faster. We still take about twice as long as commons-codec (an improvement on our previous performance) but we’re now over four times faster than clojure-contrib.

As referenced above a decoding function was also completed using the same techniques described for the recursive encoding function. The performance increase for that operation was just as stark:

Decoding
Commons-codec
"Elapsed time: 33.148648 msecs"
^ouSTLlt@+VBqajxK&$Bk~HHu+1}+It`MskqQ0dS$+t#;ZOk"sv\Pm(cS_M6k
clojure-contrib
"Elapsed time: 1494.2707 msecs"
^ouSTLlt@+VBqajxK&$Bk~HHu+1}+It`MskqQ0dS$+t#;ZOk"sv\Pm(cS_M6k
fencepost-recur
"Elapsed time: 268.112751 msecs"
^ouSTLlt@+VBqajxK&$Bk~HHu+1}+It`MskqQ0dS$+t#;ZOk"sv\Pm(cS_M6k

Still nowhere near as fast as commons-codec but we’re a bit better than five times faster than clojure-contrib.

As always code can be found on github.

A Funny Thing Happened While Writing Some Haskell….

Dec 20th, 2010

Originally published at Heuristic Fencepost

Suppose that you’ve just finished Graham Hutton’s solid introduction to Haskell and now want to take the language for a quick test drive. You decide to stick with something familiar; after noting that the Prelude doesn’t have an equivalent to Python’s range function you decide to whip one up yourself. Haskell’s syntax is still a bit unfamiliar, however, so you decide to implement this function in a different “mostly functional” language that you have some experience with. You’ve been looking for a reason to dip your toes into Scheme again (and it seems to be a good fit here [1]) so you install Gambit Scheme and Chicken Scheme and dig in.

Perhaps you begin with a straightforward recursive definition in Scheme:

(define (range start stop step)
 (if (< start stop)
  (cons start (range (+ start step) stop step))
  '()))

This works well enough when start < stop but fails completely when start > stop and step is a negative value (a use case supported by Python’s range()). To fix this problem we define two procedures via letrec and use whichever is appropriate based on the passed arguments:

(define (newrange start stop step)
  (letrec 
   ((upto (lambda (art op ep)
     (if (< art op)
      (cons art (upto (+ art ep) op ep))
      '())))
    (downto (lambda (art op ep)
     (if (> art op)
      (cons art (downto (+ art ep) op ep))
      '()))))
 (if (< start stop)
  (upto start stop step)
  (downto start stop step))))

This function now works for both start < stop (with positive step) and start > stop (with negative step). There’s still a problem, though; if we change the sign of step in either of these cases we find ourselves in an infinite loop. Python’s range() returns an empty list in this case so we probably should too:

(define (newerrange start stop step)
 (letrec
  ((upto (lambda (art op ep)
    (if (< art op)
     (cons art (upto (+ art ep) op ep))
     '())))
   (downto (lambda (art op ep)
    (if (> art op)
     (cons art (downto (+ art ep) op ep))
     '()))))
    (cond ((and (< start stop) (< step 0)) '())
          ((and (> start stop) (> step 0)) '())
          ((< start stop) (upto start stop step))
          (else (downto start stop step)))))

While working on these implementations you discover that SRFI-1 includes a function iota which looks similar to Python’s range(), although this function takes the number of elements in the return list rather than the start/stop/step values we’re looking for. Still, both Chicken Scheme and Gambit Scheme support SFRI-1 as extensions or modules and we should be able to whip up a thin wrapper around iota to get what we need. Of course we first need to figure out how to load up these modules. And after some initial experimentation we see some curious behaviour in the iota implementation on both platforms…

But wait… weren’t we supposed to be writing some Haskell code?

Oh, yeah.

Haskell’s guarded equations allow for an easy expression of the same logic we were getting from cons in Scheme. We have no need to translate the recursive logic found in the Scheme version, however, since the list-related functions in the Prelude combined with a few sections gives us everything we need:

range art op ep | art < op && ep < 0 = []
                | art > op && ep > 0 = []
                | art < op = takeWhile (<op) (iterate (+ep) art)
                | art > op = takeWhile (>op) (iterate (+ep) art)

In all honesty this is a fairly elegant expression of the algorithm. And while there’s clearly a lot to like in Haskell this exercise has rekindled a long-dormant romance with Scheme rather than encouraging exploration of something new.

[1] Yes, I realize Scheme isn’t “purely functional”. I’m not interested in taking a stand in this particular holy war, but there’s little doubt that Scheme’s bias against set! and it’s kin justify the “mostly functional” qualifier.

Base64 in Clojure: An Initial Attempt

Oct 31st, 2010

Originally published at Heuristic Fencepost

After working through most of Stuart Halloway’s excellent Programming Clojure I found myself in search of a meaty example to exercise my Clojure chops. Re-implementing the Project Euler examples in another language didn’t seem terribly appealing. Maybe a base64 implementation; I hadn’t implemented the encoding in a while and it is a reasonable exercise when learning a new language. So what kind of support does Clojure already have?

As expected there’s no base64 support in the core. clojure-contrib does include some support, although the documentation doesn’t inspire a great deal of confidence:

This is mainly here as an example.  It is much slower than the 
Apache Commons Codec implementation or sun.misc.BASE64Encoder.

Intriguing; looks like we have a winner!

The only complexity in a base64 implementation is handling the one and two byte cases. Any code must support returning the last two bits of the first byte (or the last four bits of the second byte) as distinct values in these cases as well as combining these bits with the leading bits of the second and third bytes (respectively) in other cases. My initial implementation was a fairly straightforward adaptation of this principle.

(ns fencepost.base64_letfn)

; This could be placed within the defn below but rebuilding this map
; on each invocation seems a bit wasteful.
(def to-base64
     (zipmap 
      (range 0 64)
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
     )

(defn base64-encode
  "Encode input string into base64 alphabet.  String should be encoded
  in the character set specified by encoding."
  [#^String arg #^String encoding]
  (letfn [
      (do-one-byte [somebyte]
                   {:curr [(bit-shift-right somebyte 2)] :remainder (bit-and 0x3F (bit-shift-left somebyte 4)) }
                   )
      (do-two-bytes [somebytes]
                    (let [[byteone bytetwo] somebytes
                          {curr :curr remainder :remainder} (do-one-byte byteone)]
                      {:curr (conj curr (bit-or remainder (bit-shift-right bytetwo 4))) :remainder (bit-and 0x3F (bit-shift-left bytetwo 2))}
                      )
                    )
      (do-three-bytes [somebytes]
                      (let [[byteone bytetwo bytethree] somebytes
                            {curr :curr remainder :remainder} (do-two-bytes [byteone bytetwo])]
                        (conj (conj curr (bit-or remainder (bit-shift-right bytethree 6))) (bit-and 0x3F bytethree))
                        )
                      )
      (somefn [somechars]
                (let [l (count somechars)]
                  (cond
                   (= l 3)
                   (apply str (map #(to-base64 %) (do-three-bytes somechars)))
                   (= l 2)
                   (let [{curr :curr remainder :remainder} (do-two-bytes somechars)]
                     (str (apply str (map #(to-base64 %) curr)) (to-base64 remainder) "=")
                     )
                   (= l 1)
                   (let [{curr :curr remainder :remainder} (do-one-byte (first somechars))]
                     (str (apply str (map #(to-base64 %) curr)) (to-base64 remainder) "==")
                     ))))]
    (apply str (map somefn (partition 3 3 () (.getBytes arg encoding))))))

(defn base64-encode-ascii
  "Encode input string into base64 alphabet.  Input string is assumed
  to be encoded in UTF-8.  If the input is encoded with a different
  character set base64-encode should be used instead of this method."
  [#^String arg]
  (base64-encode arg "UTF8")
  )

The core consists of a letfn definition with functions for handling the one, two and three byte cases. Each function returns two values: a list of all 6-bit elements discovered so far as well a “remainder” value representing the trailing bits in the one and two byte cases. This structure leads to better function re-use; as an example the two-byte function can build on the one-byte function. A wrapper function calls the correct value for a given set of bytes. Now we need only partition our byte array into groups of three or less and apply the wrapper function to each group. A combination of the partition and map functions do exactly that.

So how’d we do? We implement a simple test program to generate a set of random data, map a base64 implementation on to each sample and then realize the lazy sequence by converting it into a vector. We time this conversion (for a common set of sample data) for our implementation, the clojure-contrib implementation and commons-codec. We wind up with the following:

(ns fencepost.test)

(import '(org.apache.commons.codec.binary Base64))
(import '(org.apache.commons.lang RandomStringUtils))
(use '[clojure.contrib.base64])
(use '[fencepost.base64_letfn])

(def sample-size 100)
(def max-string-size 256)

; Build up some sample data using commons-lang.  Sample data is built
; before any tests are run; this allows us to apply uniform test data
; to each implementation and to avoid corrupting timings with the
; generation of random data.
(def sample-data (map #(RandomStringUtils/randomAscii (* % (rand-int max-string-size))) (repeat sample-size 1)))

; Instantiate a Base64 instance from commons-codec
(def codec-base64 (new Base64 -1))

; Time each run.  Mapping a function onto our sample data produces a
; lazy sequence so we have to take the additional step of realizing
; the sequence; thus the conversion to a vector.
(println "Commons-codec")
(def commons-codec-data (time (vec (map #(new String (.encode codec-base64 (.getBytes %))) sample-data))))
(println (get commons-codec-data 0))

(println "clojure-contrib")
(def clojure-contrib-data (time (vec (map #(clojure.contrib.base64/encode-str %) sample-data))))
(println (get clojure-contrib-data 0))

; Simple sanity check; take commons-codec output as canonical and
; verify that what our implementations generate match up with the
; canonical sample.
(assert (.equals(get commons-codec-data 0) (get clojure-contrib-data 0)))

(println "fencepost/base64_letfn")
(def fencepost-data (time (vec (map #(fencepost.base64_letfn/base64-encode-ascii %) sample-data))))
(println (get fencepost-data 0))
(assert (.equals (get commons-codec-data 0) (get fencepost-data 0)))

The following run is fairly representative:

bartok ~/git/base64-clojure $ clojure fencepost/test/compare_base64.clj 
Commons-codec
"Elapsed time: 69.074926 msecs"
RFJlYWkqIF10RCc1PXlpO3FcIDY6dkJuQkhHNWxRJkwkOFdgciFGfHwhYz5EYG15PjxWdFJ9Ml83TGFoeltHSTs+ST9mdj0rfSZrcVNIKn5oKSdTI3U5a1FqIzBvIkRIV0BkK3xCZjtrSWYuTiM1XWB9UW14W2dVLFM=
clojure-contrib
"Elapsed time: 1221.124746 msecs"
RFJlYWkqIF10RCc1PXlpO3FcIDY6dkJuQkhHNWxRJkwkOFdgciFGfHwhYz5EYG15PjxWdFJ9Ml83TGFoeltHSTs+ST9mdj0rfSZrcVNIKn5oKSdTI3U5a1FqIzBvIkRIV0BkK3xCZjtrSWYuTiM1XWB9UW14W2dVLFM=
fencepost/base64_letfn
"Elapsed time: 375.829153 msecs"
RFJlYWkqIF10RCc1PXlpO3FcIDY6dkJuQkhHNWxRJkwkOFdgciFGfHwhYz5EYG15PjxWdFJ9Ml83TGFoeltHSTs+ST9mdj0rfSZrcVNIKn5oKSdTI3U5a1FqIzBvIkRIV0BkK3xCZjtrSWYuTiM1XWB9UW14W2dVLFM=

These results do confirm that that clojure-contrib implementation is very, very slow in relation to commons-codec. We’ve reduced that runtime by roughly a third but we’re still five or six times slower than commons-codec. Surely we can do better; in the next installment we’ll see how to do so.

As usual all code can be found on github.

Concurrent Prime Factorization in Go

Oct 22nd, 2010

Originally published at Heuristic Fencepost

For a little while now I’ve been looking for a problem that would allow me to sink my teeth into Google’s Go language and see if it’s as tasty as it appears to be. After coming across Alex Tkachman’s request for concurrency benchmarks it was clear that I had what I needed. Alex is the driving force behind Groovy++ and his proposal was to implement a common benchmark (a concurrent version of prime factorization using the Sieve of Eratosthenes) in multiple JVM languages in order to compare and contrast what works and what doesn’t. An implementation in Go wouldn’t help much in benchmarking JVM languages but the highly concurrent nature of the problem does map naturally onto Go’s feature set. The language distribution also includes an elegant implementation of the Sieve of Eratosthenes in it’s tests so we already have a solid foundation to build from.

Let’s begin with sieve.go. The implementation consists of chained goroutines. Each time a candidate makes it through the sieve and is returned as a new prime number a new goroutine is created to check future candidates and reject them if they divide evenly by the new prime. The implementation cleanly demonstrates the concept but isn’t useful as a standalone application since it includes no termination condition. Let’s begin by adding that in:

// $G $F.go && $L $F.$A  # don't run it - goes forever

// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import "flag"

// Send the sequence 2, 3, 4, ... to channel 'ch'.
func Generate(max int, ch chan<- int) {

     // Slight optimization; after 2 we know there are no even primes so we only
     // need to consider odd values
     ch <- 2
     for i := 3; i<=max ; i += 2 {
       ch <- i // Send 'i' to channel 'ch'.
     }
     ch <- -1 // Use -1 as an indicator that we're done now
}

// Copy the values from channel 'in' to channel 'out',
// removing those divisible by 'prime'.
func Filter(in <-chan int, out chan<- int, prime int) {
     for i := <- in; i != -1; i = <- in {

       if i % prime != 0 {
        out <- i // Send 'i' to channel 'out'.
   }
     }
     out <- -1
}

// The prime sieve: Daisy-chain Filter processes together.
func Sieve(max int) {
     ch := make(chan int) // Create a new channel.
     go Generate(max,ch)      // Start Generate() as a subprocess.
     for prime := <-ch; prime != -1; prime = <-ch {
       print(prime, "\n")
   ch1 := make(chan int)
   go Filter(ch, ch1, prime)
   ch = ch1
     }
}

func main() {

     var max *int = flag.Int("max",1,"Maximum number we wish to return")
     flag.Parse()
     Sieve(*max)
}

[varese factorization_benchmark]$ cd go
[varese go]$ ls
factor.go  sieve.go
[varese go]$ gobuild sieve.go
Parsing go file(s)...
Compiling main (sieve)...
Linking sieve...
[varese go]$ ./sieve -max 20
2
3
5
7
11
13
17
19

That seemed to work nicely. Adding in the factorization code leads us to factor.go:

// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// Prime factorization derived from slightly modified version of
// sieve.go in Go source distribution.

package main

import "fmt"
import "flag"
import vector "container/vector"

// Send the sequence 2, 3, 4, ... to channel 'ch'.
func Generate(max int, ch chan<- int) {

     // Slight optimization; after 2 we know there are no even primes so we only
     // need to consider odd values
     ch <- 2
     for i := 3; i<=max ; i += 2 {
       ch <- i // Send 'i' to channel 'ch'.
     }
     ch <- -1 // Use -1 as an indicator that we're done now
}

// Copy the values from channel 'in' to channel 'out',
// removing those divisible by 'prime'.
func Filter(in <-chan int, out chan<- int, prime int) {
     for i := <- in; i != -1; i = <- in {

       if i % prime != 0 {
        out <- i // Send 'i' to channel 'out'.
   }
     }
     out <- -1
}

func main() {

     var target *int = flag.Int("target",1,"Number we wish to factor")
     flag.Parse()

     t := *target
     var rv vector.IntVector

     // Retrieve a prime value and see if we can divide the target evenly by
     // that prime.  If so perform the multiplication and update the current
     // value.
     ch := make(chan int) // Create a new channel.
     go Generate(t,ch)      // Start Generate() as a subprocess.
     for prime := <-ch; (prime != -1) && (t > 1); prime = <-ch {

      for ;t % prime == 0; {
          t = t / prime
      rv.Push(prime)
  }

  // Create a goroutine for each prime number whether we use it or
  // not.  This performs the daisy chaining setup that was being
  // done by the Sieve() function in sieve.go.
        ch1 := make(chan int)
        go Filter(ch, ch1, prime)
        ch = ch1
     }

     fmt.Printf("Results: %s\n",fmt.Sprint(rv))
}

Simple enough. This leaves us with something that looks fairly robust:

[varese go]$ gobuild factor.go 
Parsing go file(s)...
Compiling main (factor)...
Linking factor...
[varese go]$ ./factor -target 60
Results: [2 2 3 5]

Examples here employ gobuild. This tool has proven to be very useful when building and testing Go applications. Complete code for both examples can be found on github. There is some interest in comparing a solution based on goroutines and channels to one based on actors and messages so in the future some Scala code might join the existing Go code.

Default Arguments and Implicit Conversions in Scala

Oct 10th, 2010

Originally published at Heuristic Fencepost

In a previous post we went in search of an implicit conversion from Int to List[Int] such that each member of the list corresponds to the value at an equivalent position in the input Int (i.e. 987 = List(9,8,7)). At the time we mentioned that a properly tail recursive implementation proved to be a bit more complicated than one might expect. In this post we’ll examine these problems in some detail.

A properly tail recursive implementation of this conversion function makes use of an accumulator array to store state as we recurse.

package org.fencepost.defaults

import org.scalatest.Suite

class ImplicitDefaultTest1 extends Suite {

  def int2list(arg:Int, acc:List[Int]):List[Int] = {

    val newmember = arg % 10
    if (arg <= 9)
      List(newmember) ::: acc
    else
      int2list(arg / 10,List(newmember) ::: acc)
  }

  implicit def toList(arg:Int) = int2list(arg,List())

  def testImplicit() = {

    assert(0.length == 1)
    assert(0(0) == 0)
    assert(5.length == 1)
    assert(5(0) == 5)
    assert(12345.length == 5)
    assert(12345(0) == 1)
    assert(12345(2) == 3)
    assert(12345(4) == 5)
    assert(98765432.length == 8)
    assert(98765432(4) == 5)
  }
}

The test above passes so everything looks good so far. On a second look, however, we note that the wrapper function toList() is less than ideal. The accumulator needs to be initialized to the empty list in order for the function to work correctly but defining a second function just to pass in an extra arg looks like unnecessary cruft. Scala 2.8 introduced default arguments to address situations such as this; perhaps we can deploy default arguments here to clean up our test:

package org.fencepost.defaults

import org.scalatest.Suite

class ImplicitDefaultTest2 extends Suite {

  implicit def int2list(arg:Int, acc:List[Int] = List()):List[Int] = {

    val newmember = arg % 10
    if (arg <= 9)
      List(newmember) ::: acc
    else
      int2list(arg / 10,List(newmember) ::: acc)
  }

  def testImplicit() = {

    assert(0.length == 1)
    assert(0(0) == 0)
    assert(5.length == 1)
    assert(5(0) == 5)
    assert(12345.length == 5)
    assert(12345(0) == 1)
    assert(12345(2) == 3)
    assert(12345(4) == 5)
    assert(98765432.length == 8)
    assert(98765432(4) == 5)
  }
}

When we attempt to execute the test above we’re greeted rather rudely by sbt:

[info] Compiling test sources...
[error] .../src/test/scala/org/fencepost/defaults/ImplicitDefaultTest2.scala:21:
 value length is not a member of Int
[error]     assert(0.length == 1)
[error]              ^
[error] .../src/test/scala/org/fencepost/defaults/ImplicitDefaultTest2.scala:22:
 0 of type Int(0) does not take parameters
[error]     assert(0(0) == 0)
...

Clearly the implicit conversion of Int to List[Int] wasn’t in play when this test was executed. But why not? Logically int2list(arg:Int, acc:List[Int] = List()) will convert Ints to List[Int] everywhere int2list(arg:Int, acc:List[Int]) does. We can demonstrate the validity of this claim by fooling the compiler using a variation on the front-end function we used before:

package org.fencepost.defaults

import org.scalatest.Suite

class ImplicitDefaultTest3 extends Suite {

  def int2list(arg:Int, acc:List[Int] = List()):List[Int] = {

    val newmember = arg % 10
    if (arg <= 9)
      List(newmember) ::: acc
    else
      int2list(arg / 10,List(newmember) ::: acc)
  }

  implicit def toList(arg:Int) = int2list(arg)

  def testImplicit() = {

    assert(0.length == 1)
    assert(0(0) == 0)
    assert(5.length == 1)
    assert(5(0) == 5)
    assert(12345.length == 5)
    assert(12345(0) == 1)
    assert(12345(2) == 3)
    assert(12345(4) == 5)
    assert(98765432.length == 8)
    assert(98765432(4) == 5)
  }
}

As expected this test passes without issue.

My suspicion is that this issue is a side effect of the fact that default arguments apparently aren’t represented in the type system. It’s not surprising that int2list(arg:Int, acc:List[Int]) isn’t available as an implicit conversion; there’s no way for the runtime to supply the required “acc” argument for an input Int instance. This is not true for int2list(arg:Int, acc:List[Int] = List()); in that case the default value of “acc” could be used to perform the translation. Note, however, that these two functions are represented by the same type in the Scala runtime:

$ scala
Welcome to Scala version 2.8.0.final (OpenJDK Client VM, Java 1.6.0_18).
Type in expressions to have them evaluated.
Type :help for more information.

scala>   def int2list(arg:Int, acc:List[Int]):List[Int] = {
...
int2list: (arg: Int,acc: List[Int])List[Int]

scala>   def int2list2(arg:Int, acc:List[Int] = List()):List[Int] = {
...
int2list2: (arg: Int,acc: List[Int])List[Int]

If the type system is unaware that default arguments are available for all arguments other than the type to convert from then it’s not at all surprising that a function would be excluded from the set of valid implicit conversion functions.

Results tested and verified on both Scala 2.8.0 and 2.8.1 RC2.

Learning Scala From Dead Swiss Mathematicians : Return of Palindromes

Aug 23rd, 2010

Originally published at Heuristic Fencepost

In a recent post we implemented a predicate to identify integers which were also palindromes. Our original implementation converted the input integer to a String in order to more easily access the leading digit, something that can’t easily be done with an input integer. But is this really the case?

If we could easily convert Integers into a List[Integer] [1] we could then easily access the “head” and “tail” of this list for purposes of comparison. Ideally this conversion could be automated so that we don’t have to explicitly track these conversions. Fortunately Scala’s implicit conversions provide exactly these features. A simple implementation looks something like the following:

implicit def int2list(arg:Int):List[Int] =
  if (arg <= 9)
    List(arg)
  else
    int2list(arg / 10) ::: List(arg % 10)

It’s worth taking a moment to point out a few things about this function:

We’re assuming base 10 integers here. We could make the function more flexible by adding a parameter to specify the integers base if necessary
A better implementation would be purely tail recursive, but this turns out to be a bit trickier than expected; more on that in a future post.

With this conversion in place we can now define a properly tail recursive predicate to check for palindromes:

def byInt(arg:List[Int]):Boolean = {

  if (arg.length == 0 || arg.length == 1)
    return true
  if (arg.head != arg.last)
    false
  else
    byInt(arg.slice(1,arg.length - 1))
}

Very nice, but Scala’s pattern matching allows for an even better (by which we mean “simpler”) implementation that makes use of pattern guards:

def byIntMatch(arg:List[Int]):Boolean = {

  // Note that we don't need to check for lists of length 
  // 0 or length 1 as we do in byInt above.  The first 
  // two cases of our match operation below handle these 
  // cases.
  arg match {
    case List() => true
    case List(_) => true
    case arghead :: rest if arg.last == arghead => byIntMatch(rest.slice(0,rest.length - 1))
    case _ => false
  }
}

The full implementation can be found at github. Most of the code discussed here can be found in the org.fencepost.palindrome package; a full solution to Problem 4 using this code is also included.

[1] We use integers here only as a matter of convenience; of course we really only need a Byte as long as we’re talking about base 256 or less. We can always optimize for space later if needed.

UPDATE - Shortly after completing this post I began another chapter in the “stairway” book. That chapter (covering details of the implementation of the List class in Scala) promptly pointed out that the list concatenation operator executes in time proportional to the size of the operand on the left. In order to avoid the performance hit we shift to an approach based on iteration and mutation.

implicit def int2list(arg:Int):List[Int] = {

  val buff = new ListBuffer[Int]
  var counter = arg
  while (counter > 9) {
    buff += (counter % 10)
    counter = (counter / 10)
  }
  buff += counter
  buff toList
}

The process for choosing when to use an iterative approach over a functional approach is still somewhat opaque. Scala has an explicit bias in favor of functional methods yet in many cases (including this one) an iterative implementation is the correct one to use. Presumably identifying when to use which approach is largely a function of experience with the language.

Learning Scala From Dead Swiss Mathematicians : Palindromes

Jul 26th, 2010

Originally published at Heuristic Fencepost

As part of my effort to get up to speed with Scala I returned to an old favorite for exploring new languages; working through some of the problems at Project Euler. Posting complete answers to specific problems isn’t terribly interesting and I have no plans to do so here. That said, some facets of these problems do lead to useful digressions which allow for a deeper exploration of the language. We’ll dig into one such example here.

The problem under consideration is Problem 4 which asks for “the largest palindrome made from the product of two 3-digit numbers”. It doesn’t take much imagination to determine that part of the solution will involve a method, most likely a function of some kind, for identifying whether a number is in fact a palindrome. Okay, but what would such a method look like? The problem lends itself to a simple recursive implementation: divide the integer into a leading digit, a middle integer and a trailing digit and return false if the leading and trailing digits differ, otherwise return the result of the a recursive call with the middle integer.

Easy enough, but a moments reflection tells us we have a problem; we can access the trailing digit in an integer using the mod function and the base of the input integer but there’s no easy way to access the leading digit. It’s not as if Int (or even RichInt) extend Seq[Int] or even Seq[Byte]. Fortunately we can shift the problem domain by observing that an input integer is equivalent to an input String in which each digit of the integer maps to a Char in the String. Even better, Scala offers built-in support for regular expressions, and a fairly simple regex should give us access to both the leading and trailing characters and everything in the middle. So, how does Scala’s regex support stack up?

// Pattern to be used by the regex-based predicate.  
// Absolutely must use conservative matching and 
// the backref here to make this work.
val palindromePattern = """(\d)(\d*?)\1""".r

// Recursive helper function to check for a 
//palindrome using a regex
def byRegexHelper(n:String):Boolean = {

  // Base case; empty string and single characters 
  // are by definition palindromes.  Place this 
  // test up front so that we can handle input values
  // of a single character.
  if (n.length == 0 || n.length == 1)
    return true
  palindromePattern.unapplySeq(n) match {

    case Some(matches) => byRegexHelper(matches(1))
    case None => false
  }
}

// Regex-based predicate; convert to a string and call
// the recrusive function
def byRegex(arg:Int):Boolean = byRegexHelper(arg.toString)

Not bad, actually. It’s not Perl or Ruby but the deep support for pattern matching in the language combined with relatively easy generation of a Regex from a String makes for a fairly clean syntax. Regex literals would be a small improvement but this is still cleaner than what one finds in Java or even Python.

So we’ve solved the problem, but can we do better? Do we really need the regex? Strings (or the StringOps they’re implicitly converted to in 2.8) make use of the SeqLike[Char] trait allowing easy access to the leading and trailing characters using simple method calls. This leads to the following implementation which avoids the overhead of evaluating the regular expression:

// Recursive helper function to perform the check 
// based on comparisons of the head and last 
// characters in a string
def byStringHelper(n:String):Boolean = {

  // Base case; empty string and single characters
  // are by definition palindromes.  Place this test
  // up front so that we can handle input values of
  // a single character.
  if (n.length == 0 || n.length == 1)
    return true
  if (n.head != n.last)
    false
  else
    byStringHelper(n.substring(1,n.length - 1))
}

// String-based predicate; convert to string and call
// the recursive function
def byString(arg:Int):Boolean = byStringHelper(arg.toString)

Also not bad, but still not completely satisfying. Moving away from the use of regular expressions minimizes the benefit of mapping the input integer onto a String in order to solve the problem. Recall that the primary argument for doing so was the inability to access leading digits in an input integer. How significant is this constraint? Is it something we can work around? More on this next time.

Blog Archives Newer →

Absurdity and Farce

Searching for the sublime among the ridiculous

Base64 in Clojure: Once More, With Feeling

A Funny Thing Happened While Writing Some Haskell….

Base64 in Clojure: An Initial Attempt

Concurrent Prime Factorization in Go

Default Arguments and Implicit Conversions in Scala

Learning Scala From Dead Swiss Mathematicians : Return of Palindromes

Learning Scala From Dead Swiss Mathematicians : Palindromes