Playing with Go: Embarrassingly Parallel Scripts
http://www.publicdomainpictures.net/view-image.php?image=29784&picture=colorful-lines
I recently needed to take a list of domains and find which ones point to a specific IP address. For a small list, say fewer than 10, manually running dig in the console would work great, but this list had almost 800 domains, so I needed a script. Because a domain lookup is a network request and thus very slow, running the lookups in parallel made sense. I could easily have done this in Ruby, my language du jour, but I’ve done this type of thread work before, and frankly it can be tedious to set up and fragile, and it still won’t have access to all of my system’s resources due to the GVL[1]. I’ve been keeping an eye on Google’s Go for some time now and decided to see how it handled this problem.
I’ve been intrigued by Go since it was originally announced. Here was a compiled, fast, lightweight, low-level language with many of the features we take for granted these days, such as garbage collection, plus a very sophisticated concurrency model similar to what’s found in Erlang: very lightweight internal processes managed by the runtime. It sounded like a perfect fit for my requirements.
The code I ended up with is here: https://gist.github.com/4170926. For the sake of comparisons I built a sequential version of the script as well as the parallel version and added timings for running both scripts against the full list of domains.
Running these scripts for yourself is a one-liner: go run [script.go]. The input file domains.txt needs to be a newline-delimited list of domains. I’ll go over the more confusing parts of the two scripts to help with understanding what’s really going on here.
Objects?
Go’s object model is very close to C’s: structs with data and methods that operate on said structs. Both scripts only use a small, two-field struct, DomainMap, to keep track of the IP address found for a given domain. I use the short form to initialize new instances of the DomainMap structure. The order of values maps directly to the order of the fields defined at the top of the scripts.
type DomainMap struct {
	Domain    string
	IpMapping string
}
object := DomainMap{domain, ipAddress}
object.Domain == domain
object.IpMapping == ipAddress
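To make the short form concrete, the sketch below puts it next to Go's keyed form, which names each field so the order no longer matters. The DomainMap type is copied from the post; the sample domain and the 192.0.2.10 address (from the documentation-only range) are made-up values.

```go
package main

import "fmt"

// DomainMap mirrors the two-field struct used by both scripts.
type DomainMap struct {
	Domain    string
	IpMapping string
}

func main() {
	// Short (positional) form: values must follow the field order
	// of the struct definition.
	positional := DomainMap{"example.com", "192.0.2.10"}

	// Keyed form: each field is named, so order does not matter.
	keyed := DomainMap{IpMapping: "192.0.2.10", Domain: "example.com"}

	// Structs whose fields are all comparable can be compared with ==.
	fmt.Println(positional == keyed) // prints: true
	fmt.Println(positional.Domain, positional.IpMapping)
}
```

The positional form is compact for a tiny struct like this one, but the keyed form tends to survive better if fields are later added or reordered.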
Error handling
Go does error handling by returning multiple values from a function, where the last return value is, by convention, a value of type error. You can ignore it by assigning it to the blank identifier _.
rawIpAddresses, _ := net.LookupIP(domain)
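For comparison, here is a sketch of what checking the error could look like instead of discarding it. This is not how the scripts work. The firstIP and stubLookup functions and the resolver type are hypothetical names introduced for illustration; the stub has the same signature as net.LookupIP, so the real function could be passed in its place, and it keeps the example runnable without a network.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// resolver has the same signature as net.LookupIP, so either the real
// function or an offline stub can be supplied.
type resolver func(host string) ([]net.IP, error)

// firstIP (hypothetical helper) returns the first address found for
// domain, or an error explaining why none is available.
func firstIP(lookup resolver, domain string) (string, error) {
	ips, err := lookup(domain)
	if err != nil {
		return "", fmt.Errorf("lookup %s: %w", domain, err)
	}
	if len(ips) == 0 {
		return "", fmt.Errorf("lookup %s: no addresses returned", domain)
	}
	return ips[0].String(), nil
}

// stubLookup is a made-up offline resolver that knows one domain.
func stubLookup(host string) ([]net.IP, error) {
	if host == "example.com" {
		return []net.IP{net.ParseIP("192.0.2.10")}, nil
	}
	return nil, errors.New("no such host")
}

func main() {
	for _, domain := range []string{"example.com", "missing.invalid"} {
		ip, err := firstIP(stubLookup, domain)
		if err != nil {
			// Report the failure and move on to the next domain.
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(domain, "->", ip)
	}
}
```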
Parallelism
The parallel version of the script has some new concepts that need explaining, particularly goroutines, channels, and channel communication.
A goroutine is a very lightweight process, sort of like a Ruby Fiber. Creating one is simple:
go domainLookup(responseChannel, domain)
Go will take the function call after the go keyword and execute it concurrently with the rest of the program. However, because that call no longer runs in the main flow of the program, we can’t just return values from the function. We need a different way to get the result back. This is where channels come in.
responseChannel := make(chan DomainMap)
As Go is a statically typed language, we need to declare the element type of the channel being created. A channel can only carry values of that type. Communication through channels is done with the reverse-stabby operator <-, which should be read as “the data on the right side flows to the left side.”
// Write into a channel
returnChannel <- DomainMap{domain, ipAddress}
// Read from the channel
domainMap := <- responseChannel
And that’s all the special syntax. The only real difference between the parallel and sequential scripts is the map-reduce-esque setup to wait for all the goroutines to finish. I didn’t need to worry about thread pooling, system capabilities, or thread safety. Go makes it so easy to write truly parallel code that there’s no excuse not to anymore. I was able to run almost 800 goroutines (one per domain) all throwing out DNS queries and coming back in less than 10 seconds, in a script that doesn’t even look like it’s running in parallel.
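The wait-for-every-goroutine setup described above can be sketched roughly as follows. This is an approximation, not the gist's code. The lookupAll function and the fakeRecords table are hypothetical, and the table stands in for real DNS lookups so the example runs offline. One goroutine is started per domain, then exactly one value per domain is read back from the channel. Each read blocks until some goroutine has sent a result.

```go
package main

import (
	"fmt"
	"sort"
)

// DomainMap mirrors the struct from the scripts.
type DomainMap struct {
	Domain    string
	IpMapping string
}

// fakeRecords is a made-up lookup table standing in for real DNS queries.
var fakeRecords = map[string]string{
	"a.example": "192.0.2.1",
	"b.example": "192.0.2.2",
	"c.example": "192.0.2.3",
}

// domainLookup sends its result on the channel instead of returning it.
func domainLookup(returnChannel chan<- DomainMap, domain string) {
	returnChannel <- DomainMap{domain, fakeRecords[domain]}
}

// lookupAll (hypothetical helper) starts one goroutine per domain and
// then collects one result per domain from the shared channel.
func lookupAll(domains []string) []DomainMap {
	responseChannel := make(chan DomainMap)

	for _, domain := range domains {
		go domainLookup(responseChannel, domain)
	}

	results := make([]DomainMap, 0, len(domains))
	for range domains {
		// Blocks until the next goroutine sends its result.
		results = append(results, <-responseChannel)
	}
	return results
}

func main() {
	results := lookupAll([]string{"a.example", "b.example", "c.example"})

	// Results arrive in whatever order the goroutines finish, so sort
	// them before printing to get stable output.
	sort.Slice(results, func(i, j int) bool {
		return results[i].Domain < results[j].Domain
	})
	for _, r := range results {
		fmt.Println(r.Domain, r.IpMapping)
	}
}
```

Counting receives works here because every goroutine sends exactly one value. If a lookup could exit without sending, the collecting loop would block forever, so a different way of waiting would be needed.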
I highly recommend checking out the Tour of Go for a basic introduction to every major feature of the language, and there’s a ton of documentation on the main website, golang.org. Even with the little bit of time I’ve spent playing with Go so far, I see a very bright future for this language.
[1] Global VM Lock, more about Ruby’s concurrency here: http://www.engineyard.com/blog/2011/ruby-concurrency-and-you/
Colorful Lines is licensed under Creative Commons Zero
Comments
Can you post the full domains.txt file on your gist as well?
Celluloid
Ignoring errors is not the greatest thing to do…
@Sir: I’d rather not as it contains customer data.
@Carlos: As this is a one-off script, re-running the script is good enough error handling for me. This would definitely be far different if it was a module run inside of a bigger application.
I know you wanted to use Go for this article, but you know you could have considered JRuby for this work, right?
Please use gofmt whenever publishing Go source code.
Thanks for your blog post, it was very instructive.
One question: what does this mean?
domainMapping = append(domainMapping, on that line.
It seems that the code was mangled by an HTML tag stripper, but never mind, I think I figured it out. When you have a LEFT ARROW channel in a param, you are just using whatever that channel returns next as the param. I assume that this is a blocking call.
@kikito: http://golang.org/pkg/builtin/#append
Append is a built-in function to work on the slice data type, and it always returns the modified slice because this call might resize the one you passed in or a new slice might be allocated depending on the capacity of said slice.
Also yes [left arrow] is a blocking call.
Another nitpick: it’s not really parallel - it’s concurrent. For now, Go is single-core unless you explicitly tell it to use multiple cores[10]. As far as I can tell, your code is running single-core. Which actually makes this a nice example of how concurrency can be faster regardless!
[10] http://golang.org/pkg/runtime/#GOMAXPROCS
Here’s a list of domains I found https://raw.github.com/tarr11/Webmail-Domains/master/domains.txt
I tried this under Windows 7 and both scripts run the same…I have a Core 2 Duo. I also: set GOMAXPROCS=2
thx!
Go’s approach to parallelism reminds me of something… ah! Unix and its shells. It’s very easy to parallelize shell scripts too… And go channels look remarkably like pipes. Of course, the shells are kinda sucky and outdated, so yes, Go is better.
Echoing a previous comment - If you were itching to give Go a try, that’s one thing, but saying that you couldn’t do it in Ruby because of the GVL is fallacious and misleading.
You could easily have used JRuby and gotten an industrial-strength Ruby implementation without a GVL.
@Anthony: I never said I couldn’t do it in Ruby. What I said was that Ruby’s GVL ensures that you won’t get full use of your system when trying to build concurrent systems. Yes you can switch to JRuby but then you’re not using Ruby, you’re using JRuby, and I wanted to branch out and try something completely outside of the Ruby ecosystem.
@Job van der Zwan: Right, thanks for pointing that out! Had only glanced at some of that previously, I’ll be sure to remember that setting in the future.
@Jason: In MRI 1.9, threads blocked by IO will run in parallel. You don’t need JRuby to run a bunch of network requests on all your cores. I understand you just wanted to use Go, but please understand that the GVL doesn’t necessarily block ruby threads from running in parallel.
See Aaron Patterson’s MagmaRails 2012 talk for a simple demonstration http://www.youtube.com/watch?v=vERwKWqDC0c#t=11m00s
I wrote an example showing Ruby 1.9.3 on a Macbook Pro with a Core i7 resolving 800 random-ish hostnames: https://gist.github.com/ec353d84522531fe2bfa
As you can see, it takes about 16 seconds, but the point is that the requests run in parallel on MRI with nothing but Thread.new.
Don’t get me wrong, I think it’s great that you found a simple but practical example to introduce Go’s concurrency primitives, and I appreciate the time you took to write up this blog post. Kudos! I just find that there is a lot of confusion about concurrency when it comes to MRI’s thread implementation, and I think it’s a shame that Rubyists don’t realize they can parallelize IO-bound tasks.
@benolee: If anything I shouldn’t have mentioned Ruby’s GVL at all, as that ended up distracting from the point I was trying to make, which was to show my playing with concurrency in Go. Doing anything IO bound is of course a very easily parallelizable task for any language, which puts us back in the realm of how hard it is to put together a good example. I never meant to say “Ruby sucks. What does this better?”, but “I’ve done this in Ruby and I want to try another language now!” and talk about my experiments.