Snail in a Turtleneck
Kristina Chodorow's Blog
Kristina Chodorow's Blog
May 15th

The Kindle should have a developer API. Ereaders could be revolutionizing the way people read, but right now they’re like paperbacks without the nice book smell.
I’ve heard a lot of people say, “the Kindle isn’t powerful enough for apps.” Poppycock. I’m not talking about using it to play Angry Birds, I’m talking about stuff a calculator could zoom though and would actually improve the reading experience.
So I present 10 apps that would be super-useful, require few resources, and (in some cases) increase profits:
I’m not sure why Amazon doesn’t have a dev API, but I’d imagine that part of the reason is that most publishers would not like it. However, I think Amazon is big enough to crush them into submission. I hope that they will hurry up and do so.
If anyone has any ideas on how to get Amazon to implement a developer API, please comment!
P.S. I know about the API here, but that’s essentially for Angry-Birds-type apps. I’m looking for an API that lets you mess with the reader.
May 10th
Having readahead set too high can slow your database to a crawl. This post discusses why that is and how you can diagnose it.
The #1 sign that readahead is too high is that MongoDB isn’t using as much RAM as it should be. If you’re running Mongo Monitoring Service (MMS), take a look at the “resident” size on the “memory” chart. Resident memory can be thought of as “the amount of space MongoDB ‘owns’ in RAM.” Therefore, if MongoDB is the only thing running on a machine, we want resident size to be as high as possible. On the chart below, resident is ~3GB:

Is 3GB good or bad? Well, it depends on the machine. If the machine only has 3.5GB of RAM, I’d be pretty happy with 3GB resident. However, if the machine has, say, 15GB of RAM, then we’d like at least 15GB of the data to be in there (the “mapped” field is (sort of) data size, so I’m assuming we have 60GB of data).
Assuming we’re accessing a lot of this data, we’d expect MongoDB’s resident set size to be 15GB, but it’s only 3GB. If we try turning down readahead and the resident size jumps to 15GB and our app starts going faster. But why is this?
Let’s take an example: suppose all of our docs are 512 bytes in size (readahead is set in 512-byte increments, called sectors, so 1 doc = 1 sector makes the math easier). If we have 60GB of data then we have ~120 million documents (60GB of data/(512 bytes/doc)). The 15GB of RAM on this machine should be able to hold ~30 million documents.
Our application accesses documents randomly across our data set, so we’d expect MongoDB to eventually “own” (have resident) all 15GB of RAM, as 1) it’s the only thing running and 2) it’ll eventually fetch at least 15GB of the data.
Now, let’s set our readahead to 100 (100 512-byte sectors, aka 100 documents): blockdev --set-ra 100. What happens when we run our application?
Picture our disk as looking like this, where each o is a document:
... ooooooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ... // keep going for millions more o's
Let’s say our app requests a document. We’ll mark it with “x” to show that the OS has pulled it into memory:
... ooooooooooooooooooooooooo ooooxoooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ...
See it on the third line there? But that’s not the only doc that’s pulled into memory: readahead is set to 100 so the next 99 documents are pulled into memory, too:
... ooooooooooooooooooooooooo ooooxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxx xxxxooooooooooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooooooooo ...

Is your OS returning this with every document?
Now we have 100 docs in memory, but remember that our application is accessing documents randomly: the likelihood of the next document we access is in that block of 100 docs is almost nil. At this point, there’s 50KB of data in RAM (512 bytes * 100 docs = 51,200 bytes) and MongoDB’s resident size has only increase by 512 bytes (1 doc).
Our app will keep bouncing around the disk, reading docs from here and there and filing up memory with docs MongoDB never asked for until RAM is completely full of junk that’s never been used. Then, it’ll start evicting things to make room for new junk as our app continues to make requests.
Working this out, there’s a 25% chance of our app requesting a doc that’s already in memory, so 75% of the requests are going to go to disk. Say we’re doing 2 requests a sec. Then 1 hour of requests is 2 requests * 3600 seconds/hour = 7200 requests, 4800 of which are going to disk (.75 * 7200). If each request pulls back 50KB, that’s 240MB read from disk/hour. If we set readahead to 0, we’ll have 2MB read from disk/hour.
Which brings us to the next symptom of a too-high readahead: unexpectedly high disk IO. Because most of the data we want isn’t in memory, we keep having to go to disk, dragging shopping-carts full of junk into RAM, perpetuating the high disk io/low resident mem cycle.
The general takeaway is that a DB is not a “normal” workload for an OS. The default settings may screw you over.
May 7th
I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:
I’m going to do one subject per post, we’ll see how many I can get through. I’m punting on reconfig for now because it’s finicky to write about.
Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.
When a secondary is operating normally, it chooses a member to sync from (more on that below) and starts pulling operations from the source’s local.oplog.rs collection. When it gets an op (call it W, for “write”), it does three things:
If the db crashes between 1 & 2 and then comes back up, it’ll think it hasn’t applied W yet, so it’ll re-apply it. Luckily (i.e., due to massive amounts of hard work), oplog ops are idempotent: you can apply W once, twice, or a thousand times and you’ll end up with the same document.
For example, if you have a doc that looks like {counter:1} and you do an update like {$inc:{counter:1}} on the primary, you’ll end up with {counter:2} and the oplog will store {$set:{counter:2}}. The secondaries will replicate that instead of the $inc.
To ensure a write is present on, say, two members, you can do:
> db.foo.runCommand({getLastError:1, w:2})
Syntax varies based on language, consult your API docs, but it’s always an option for writes. The way this works is kind of cool.
Suppose you have a member called primary and another member syncing from it, called secondary. How does primary know where secondary is synced to? Well, secondary is querying primary‘s oplog for more results. So, if secondary requests an op written at 3pm, primary knows seconday has replicated all ops written before 3pm.
So, it goes like:
{getLastError:1,w:2} is called on primary. primary has done the write, so it is just waiting for one more server to get the write (w:2).
When you start up a node, it takes a look at its local.oplog.rs collection and finds the latest entry in there. This is called the lastOpTimeWritten and it’s the latest op this secondary has applied.
You can always use this shell helper to get the current last op in the shell:
> rs.debug.getLastOpWritten()
The “ts” field is the last optime written.
If a member starts up and there are no entries in the oplog, it will begin the initial sync process, which is beyond the scope of this post.
Once it has the last op time, it will chose a target to sync from.
As of 2.0, servers automatically sync from whoever is “nearest” based on average ping time. So, if you bring up a new member it starts sending out heartbeats to all the other members and averaging how long it takes to get a response. Once it has a decent picture of the world, it’ll decide who to sync from using the following algorithm:
for each member that is healthy:
if member[state] == PRIMARY
add to set of possible sync targets
if member[lastOpTimeWritten] > our[lastOpTimeWritten]
add to set of possible sync targets
sync target = member with the min ping time from the possible sync targets
The definition of “member that is healthy” has changed somewhat over the versions, but generally you can think of it as a “normal” member: a primary or secondary. In 2.0, “healthy” debatably includes slave delayed nodes.
You can see who a server is syncing from by running db.adminCommand({replSetGetStatus:1}) and looking at the “syncingTo” field (only present on secondaries). Yes, yes, it probably should have been syncingFrom. Backwards compatibility sucks.

The algorithm for chosing a sync target means that slave chaining is semi-automatic: start up a server in a data center and it’ll (probably) sync from a server in the same data center, minimizing WAN traffic. (Note that you can’t end up with a sync loop, i.e., A syncing from B and B syncing from A, because a secondary can only sync from another secondary with a strictly higher optime.)
One cool thing to implement was making w work with slave chaining. If A is syncing from B and B is syncing from C, how does C know where A is synced to? The way this works is that it builds on the existing oplog-reading protocol.
When A starts syncing from B (or any server starts syncing from another server), it sends a special “handshake” message that basically says, “Hi, I’m A and I’ll be syncing from your oplog. Please track this connection for w purposes.”
When B gets this message, it says, “Hmm, I’m not primary, so let me forward that along to the member I’m syncing from.” So it opens a new connection to C and says “Pretend I’m ‘A‘, I’ll be syncing from your oplog on A‘s behalf.” Note that B now has two connections open to C, one for itself and one for A.
Whenever A requests more ops from B, B sends the ops from its oplog and then forwards a dummy request to C along “A‘s” connection to C. A doesn’t even need to be able to connect directly to C.
A B C
<====>
<====> <---->
<====> is a “real” sync connection. The connection between B and C on A’s behalf is called a “ghost” connection (<---->).
On the plus side, this minimizes network traffic. On the negative side, the absolute time it takes a write to get to all members is higher.
In 2.2, there will be a new command, replSetSyncFrom, that lets you change who a member is syncing from, bypassing the “choosing a sync target” logic.
> db.adminCommand({replSetSyncFrom:"otherHost:27017"})
Apr 20th
MongoDB users often ask about the “killed” field in db.currentOp() output. For example, if you’ve run db.killOp(), you might see something like:
> db.currentOp() { "inprog" : [ { "opid" : 3062962, "active" : true, "lockType" : "write", "waitingForLock" : false, "secs_running" : 32267, "op" : "update", "ns" : "httpdb.servers", "query" : { "_id" : "150.237.88.189" }, "client" : "127.0.0.1:50416", "desc" : "conn", "threadId" : "0x2900c400", "connectionId" : 74, "killed" : true, "numYields" : 0 }, { "opid" : 3063051, "active" : false, "lockType" : "read", "waitingForLock" : true, "op" : "query", "ns" : "", "query" : { "count" : "servers", "query" : { "code" : { "$gte" : 200 } } }, "client" : "127.0.0.1:30736", "desc" : "conn", "threadId" : "0x29113700", "connectionId" : 191, "killed" : true, "numYields" : 0 } ] }
The operation looks dead… it has killed:true, right? But you can run db.currentOp() again and again and the op doesn’t go away, even though it’s been “killed.” So what’s up with that?
It has to do with how MongoDB handles killed operations. When you run db.killOp(3062962), MongoDB looks up operation 3062962 in a hashtable and sets its killed field to true. However, the code running that op gets to decide whether to even check that field and how deal with it appropriately.
There are basically three ways MongoDB ops handle getting killed:
killed between stages. (Many commands do this, too.)
There is no kill -9 equivalent in MongoDB (other than kill -9-ing the server itself): if an op doesn’t know how to safely kill itself, it won’t die until it’s good and ready. Which means that you can have a “killed” op in db.currentOp() output for a long time. killed might be better named killRequested.
Also, if you kill an operation before it acquires a lock, it’ll generally start executing anyway (e.g., op 3063051 above). For example, try opening up a shell and make the db hold the writelock for 10 minutes:
> db.eval("sleep(10*60*1000)")
While that’s running, in another shell, try doing an insert (which will block, as the db cannot do any writes while the db.eval() is holding the writelock).
> db.foo.insert({x:1})
Now, in a third shell, kill the insert we just did (before the 10 minutes elapse):
> db.currentOp() { "inprog" : [ { "opid" : 455937, "active" : true, "lockType" : "W", "waitingForLock" : false, "secs_running" : 25, "op" : "query", "ns" : "test", "query" : { "$eval" : "sleep(10*60*1000)" }, "client" : "127.0.0.1:51797", "desc" : "conn", "threadId" : "0x7f241c0bf700", "connectionId" : 14477, "locks" : { "." : "W" }, "numYields" : 0 }, { "opid" : 455949, "active" : false, "lockType" : "w", "waitingForLock" : true, "op" : "insert", "ns" : "", "query" : { }, "client" : "127.0.0.1:51799", "desc" : "conn", "threadId" : "0x7f24147f8700", "connectionId" : 14478, "locks" : { "." : "w", ".test" : "W" }, "numYields" : 0 } ] } > // get the opId of the insert from currentOp > db.killOp(455949) { "info" : "attempting to kill op" } > // run currentOp again to see that killed:true > db.currentOp() { "inprog" : [ { "opid" : 455937, "active" : true, "lockType" : "W", "waitingForLock" : false, "secs_running" : 221, "op" : "query", "ns" : "test", "query" : { "$eval" : "sleep(10*60*1000)" }, "client" : "127.0.0.1:51797", "desc" : "conn", "threadId" : "0x7f241c0bf700", "connectionId" : 14477, "locks" : { "." : "W" }, "numYields" : 0 }, { "opid" : 455949, "active" : false, "lockType" : "w", "waitingForLock" : true, "op" : "insert", "ns" : "", "query" : { }, "client" : "127.0.0.1:51799", "desc" : "conn", "threadId" : "0x7f24147f8700", "connectionId" : 14478, "locks" : { "." : "w", ".test" : "W" }, "killed" : true, "numYields" : 0 } ] }
If you wait 10 minutes for the db.eval() to finish, then do a find on db.foo, you’ll see that {x:1} was actually inserted anyway. This is because the op’s lifecycle looks something like:
So it’ll run a bit before dying (if it can be killed at all), which may produce unintuitive results.
Apr 5th
Disk IO is slow. You just won’t believe how vastly, hugely, mind-bogglingly slow it is. I mean, you may think your network is slow, but that’s just peanuts to disk IO.
The image below helps visualize how slow (post continues below).
(Originally found on Hacker News and inspired by Gustavo Duarte’s blog.)
The kernel knows how slow the disk is and tries to be smart about accessing it. It not only reads the data you requested, it also returns a bit more. This way, if you’re reading through a file or watching a movie (sequential access), your system doesn’t have to go to disk as frequently because you’re pulling more data back than you strictly requested each time.
You can see how far the kernel reads ahead using the blockdev tool:
$ sudo blockdev --report RO RA SSZ BSZ StartSec Size Device rw 256 512 4096 0 80026361856 /dev/sda rw 256 512 4096 2048 80025223168 /dev/sda1 rw 256 512 4096 0 2000398934016 /dev/sdb rw 256 512 1024 2048 98566144 /dev/sdb1 rw 256 512 4096 194560 7999586304 /dev/sdb2 rw 256 512 4096 15818752 19999490048 /dev/sdb3 rw 256 512 4096 54880256 1972300152832 /dev/sdb4
Readahead is listed in the “RA” column. As you can see, I have two disks (sda and sdb) with readahead set to 256 on each. But what unit is that 256? Bytes? Kilobytes? Dolphins? If we look at the man page for blockdev, it says:
$ man blockdev ... --setra N Set readahead to N 512-byte sectors. ...
This means that my readahead is 512 bytes*256=131072 or 128KB. That means that, whenever I read from disk, the disk is actually reading at least 128KB of data, even if I only requested a few bytes.
So what value should you set your readahead to? Please don’t set it to a number you find online without understanding the consequences. If you Google for “blockdev setra”, the first result uses blockdev –setra 65536, which translates to 32MB of readahead. That means that, whenever you read from disk, the disk is actually doing 32MB worth of work. Please do not set your readahead this high if you’re doing a lot of random-access reads and writes, as all of the extra IO can slow things down a lot (and if your low on memory, you’ll be forcing the kernel to fill up your RAM with data you won’t need).
Getting a good readahead value can help disk IO issues to some extent, but if you are using MongoDB (in particular), please consider your typical document size and access patterns before changing your blockdev settings. I’m not recommending any particular value because what’s perfect for one application/machine can be death for another.
I’m really enjoying these –thursday posts because every week people have commented with different/better/interesting ways of doing what I talked about (or ways of telling the difference between stalagmites and stalactites), which is really cool. So I’m throwing this out there: how would you figure out what a good readahead setting is? Next week I’m planning to do iostat for –thursday which should cover this a bit, but please leave a comment if you have any ideas.
Mar 29th
Everyone says to use git branches early and often, but I inevitably lose track of what branch I’m on. My workflow generally goes something like:
To keep track, I’ve modified my command prompt to display what I’m working on using __git_ps1, which prints a nice “you are here” string for your prompt:
$ __git_ps1 (master) $ git checkout myTag123 $ __git_ps1 ((myTag123))
(Newlines added for clarity, it actually displays as ” (master)$ git checkout…”, which is generally what you want so your prompt doesn’t contain a newline.) I’m not sure what’s up with all the ‘()’s, but it does let you distinguish between branches (‘(branchname)’) and tags (‘((tagname))’).
__git_ps1 will even show you if you’re in the middle of a bisect or a merge, which is another common workflow for me:
If you tend to fall into a similar workflow, I highly recommend modifying your prompt to something like:
$ PS1='\w$(__git_ps1)> ' ~/gitroot/mongo (v2.0)>
And suddenly life is better, which is what Linux is all about.
You can run PS1=... on a per-shell basis (or export PS1 for all shells this session) or add that line to your ~/.bashrc file to get that prompt in all future shells.
Finally, __git_ps1 isn’t a binary in your path, it’s a function (so it won’t show up if you run which __git_ps1). You can see its source by running type __git_ps1.
Note: unfortunately this doesn’t work for me on OS X, so I think it might be Linux-only.
Mar 22nd
The word of the day is sexism: Shanley Kane tweeted the CTO of Geekli.st, asking them to take down a video of a woman in a Geekli.st top and underwear dancing around. The subsequent tweets were captured in a long Storify and Geekli.st subsequently issued a public apology.
The interesting thing about this, to me, is how often well-meaning geeks react badly when someone says that they did something sexist, racist, homophobic, etc.
Let’s say someone says to you, in a public forum, “This thing you did is sexist.” What should you do?
First, if your immediate reaction is to say, “No it’s not!” don’t give in to your immediate reaction. That seems to be what most people start off with and it doesn’t help anything.
The reason you’re probably so eager to say it’s not sexist is because you’re not sexist. (Let’s assume, if you’re reading this blog, that you are correct: you are not sexist.)
So, if you aren’t sexist, why can’t you flat-out say “You are wrong”? Well, for starters, just because you’re not sexist doesn’t mean you never do anything sexist. Besides, they’re saying “This thing you did is sexist” but what they mean is “I feel this thing you did is sexist.” And even if you’re Simone de Beauvoir and Margaret Atwood and Buffy the Vampire Slayer rolled into one, you cannot tell them their feelings are wrong (well, you can, but you’ll look like an ass). What should you do, then?
You should try to make them feel better and try to avoid hurting them in the future.
If someone tells you they’re offended, you should take that seriously. If they’re speaking up, there are probably many others who are silently offended.
Mar 15th
I’m trying something new: every Thursday I’ll do a short post on how to do something with the command line.
I always seem to either create or apply patches in the wrong direction. It’s like stalagmites vs. stalactites, which I struggled with until I heard the nemonic: “Stalagmites might hang from the ceiling… but they don’t.”
Moving right along, you can use diff to get line-by-line changes between any two files. Generally I use git diff because I’m dealing with a git repo, so that’s what I’ll use here.
Let’s get a diff of MongoDB between version 2.0.2 and 2.0.3.
$ git clone git://github.com/mongodb/mongo.git $ cd mongo $ git diff r2.0.2..r2.0.3 > mongo.patch
This takes all of the changes between 2.0.2 and 2.0.3 (r2.0.2..r2.0.3) and dumps them into a file called mongo.patch (that’s the > mongo.patch part).
Now, let’s get the code from 2.0.2 and apply mongo.patch, effectively making it 2.0.3 (this is kind of a silly example but if you’re still with me after the stalagmite thing, I assume you don’t mind silly examples):
$ git checkout r2.0.2 Note: checking out 'r2.0.2'. You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout. If you want to create a new branch to retain commits you create, you may do so (now or later) by using -b with the checkout command again. Example: git checkout -b new_branch_name HEAD is now at 514b122... BUMP 2.0.2 $ $ patch -p1 < mongo.patch
What intuitive syntax!
What does the -p1 mean? How many forward slashes to remove from the path in the patch, of course.
To take an example, if you look at the last 11 lines of the patch, you can see that it is the diff for the file that changes the version number. It looks like this:
$ tail -n 11 mongo.patch
--- a/util/version.cpp
+++ b/util/version.cpp
@@ -38,7 +38,7 @@ namespace mongo {
* 1.2.3-rc4-pre-
* If you really need to do something else you'll need to fix _versionArray()
*/
- const char versionString[] = "2.0.2";
+ const char versionString[] = "2.0.3";
// See unit test for example outputs
static BSONArray _versionArray(const char* version){Note the a/util/version.cpp and b/util/version.cpp. These indicate the file the patch should be applied to, but there are no a or b directories in the MongoDB repository. The a and b prefixes indicate that one is the previous version and one is the new version. And -p says how many slashes to strip from this path. An example may make this clearer:
So, we use -p1, because that makes the patch’s paths match the actually directory structure. If someone sent you a patch and the path is something like /home/bob/bobsStuff/foo.txt and your name is not Bob, you’re just trying to patch foo.txt, you’d probably want to use -p4.
On the plus side, if you’re using patches generated by git, they’re super-easy to apply. Git chose the intuitive verb “apply” to patch a file. If you have a patch generated by git diff, you can patch your current tree with:
$ git apply mongo.patchSo, aside from the stupid choice of verbiage, that is generally easier.
Did I miss anything? Get anything wrong? Got a suggestion for next week? Leave a comment below and let me know!
Mar 8th
I’m trying something new: every Thursday I’ll go over how to do something with the command line. Let me know what you think.
If you are using a modern-ish browser, you probably use tabs to keep multiple things open at once: your email, your calendar, whatever you’re actually doing, etc. You can do the same thing with the shell using screen: in a single terminal, you can compile a program while you’re editing a file and watching another process out of the corner of your eye.
Note that screen is super handy when SSH’d into a box. SSH in once, then start screen and open up all of the windows you need.
To start up screen, run:
$ screenNow your shell will clear and screen will give you a welcome message.
Screen version 4.00.03jw4 (FAU) 2-May-06
Copyright (c) 1993-2002 Juergen Weigert, Michael Schroeder
Copyright (c) 1987 Oliver Laumann
...
[Press Space or Return to end.]As it says at the bottom, just hit Return to clear the welcome message. Now you’ll see an empty prompt and you can start working normally.
Let’s say we have three things we want to do:
Go ahead and start up top:
$ top
Well, now we need to edit a file but top‘s using the shell. What to do now? Just create a new window. While top is still running, hit ^A c (I’m using ^A as shorthand for Control-a, so this means “hit Control-a, then hit c”) to create a new window. The new window gets put right on top of the old one, so you’ll see a fresh shell and be at the prompt again. But where did top go? Not to worry, it’s still there. We can switch back to it with ^A n or ^A p (next or previous window).
Now we can start up our editor and begin editing a file. But now we want to tail a file, so we create another new window with ^A c and run our tail -f filename. We can continue to use ^A n and ^A p to switch between the three things we’re doing (and open more windows as necessary).
screen seems pretty ubiquitous, it has been on every Linux machine I’ve ever tried running it on and even OS X (although it may be part of XCode, I haven’t checked).
^A is an annoying escape key, as it is also go-to-beginning-of-line shortcut in Emacs (and the shell). To fix this, create a .screenrc file and add one line to change this to something else:
# use ^T escape ^Tt # or ^Y escape ^Yy
The escape sequence is 3 characters: carat, T, and t. (It is not using the single special character “^T”.) The traditional escape key is actually Ctrl-^, as the carat is the one character Emacs doesn’t use for anything. In a .screenrc file, this results in the rather bizarre string:
escape ^^^^
…which makes sense when you think about it, but looks a bit weird.
As long as you’re poking at the .screenrc file, you might want to turn off the welcome message, too:
startup_message off
Run ^A ? anytime for help, or check out the manual’s list of default bindings.
Did I miss anything? Get anything wrong? Got a suggestion for next week? Leave a comment below and let me know!
Mar 6th
I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:
I’m going to do one subject per post, we’ll see how many I can get through.
Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.
Suppose you are creating a new set. You start up some mongods with no data and the replSet option. When a server starts up with the replSet option, the first thing it does is check its local.system.replset collection for a replica set config it can load, but it won’t find one (because there’s no data). Since it cannot find a config, it goes into the EMPTYCONFIG state. If you look in the logs, you’ll see messages about EMPTYCONFIG.
Tue Mar 6 12:24:35 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG) Tue Mar 6 12:24:35 [rsStart] replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done Tue Mar 6 12:24:45 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG) Tue Mar 6 12:24:55 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG) Tue Mar 6 12:25:05 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG) Tue Mar 6 12:25:15 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG) Tue Mar 6 12:25:25 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Now suppose you send the replSetInitiate command to one of the members. replSetInitiate can either take a custom configuration or generate a config automatically. If you do not pass in a config, the server will try to figure out what its hostname is and generate a config using that.
Note: on EC2, the server always chooses the wrong hostname, so you have to pass in the config you want.
Once the server has a config document, either passed in or generated, it’ll make sure it can reach every node specified (all members must be up and reachable to initiate), then store the config in the local.system.replset collection. Finally, it begins sending heartbeat requests to the other members of the set.
Being able to interpret log files is critical to knowing what’s going on, so let’s walk through a sample log from running rs.initiate(). Here’s the whole log, feel free to skip over this for now. We’ll take it piece-by-piece below.
Tue Mar 6 12:45:00 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar 6 12:45:02 [initandlisten] connection accepted from 127.0.0.1:38839 #2 (2 connections now open)
Tue Mar 6 12:45:08 [conn2] replSet replSetInitiate admin command received from client
Tue Mar 6 12:45:08 [conn2] replSet info initiate : no configuration specified. Using a default configuration for the set
Tue Mar 6 12:45:08 [conn2] replSet created this configuration for initiation : { _id: "foo", members: [ { _id: 0, host: "ubuntu:30000" } ] }
Tue Mar 6 12:45:08 [conn2] replSet replSetInitiate config object parses ok, 1 members specified
Tue Mar 6 12:45:08 [conn2] replSet replSetInitiate all members seem up
Tue Mar 6 12:45:08 [conn2] ******
Tue Mar 6 12:45:08 [conn2] creating replication oplog of size: 49085MB...
Tue Mar 6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.ns, filling with zeroes...
Tue Mar 6 12:45:08 [FileAllocator] creating directory /datadir/_tmp
Tue Mar 6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.ns, size: 16MB, took 0.016 secs
Tue Mar 6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.0, filling with zeroes...
Tue Mar 6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.0, size: 2047MB, took 0.016 secs
Tue Mar 6 12:45:08 [conn2] datafileheader::init initializing /datadir/local.0 n:0
Tue Mar 6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.1, filling with zeroes...
...
Tue Mar 6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.22 n:22
Tue Mar 6 12:45:09 [FileAllocator] allocating new datafile /datadir/local.23, filling with zeroes...
Tue Mar 6 12:45:09 [FileAllocator] done allocating datafile /datadir/local.23, size: 2047MB, took 0.042 secs
Tue Mar 6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.23 n:23
Tue Mar 6 12:45:10 [conn2] ******
Tue Mar 6 12:45:10 [conn2] replSet info saving a newer config version to local.system.replset
Tue Mar 6 12:45:10 [conn2] replSet saveConfigLocally done
Tue Mar 6 12:45:10 [conn2] replSet replSetInitiate config now saved locally. Should come online in about a minute.
Tue Mar 6 12:45:10 [conn2] command admin.$cmd command: { replSetInitiate: undefined } ntoreturn:1 keyUpdates:0 reslen:196 2802ms
Tue Mar 6 12:45:10 [rsStart] replSet load config ok from self
Tue Mar 6 12:45:10 [rsStart] replSet I am ubuntu:30000
Tue Mar 6 12:45:10 [rsStart] replSet STARTUP2
Tue Mar 6 12:45:10 [rsSync] replSet SECONDARY
Tue Mar 6 12:45:25 [rsSync] waiting for 2 pings from other members before syncing
Tue Mar 6 12:45:26 [rsMgr] replSet info electSelf 0
Tue Mar 6 12:45:26 [rsMgr] replSet PRIMARYNow, to understand this line-by-line. First, we start out in the EMPTYCONFIG state, as described above.
Tue Mar 6 12:45:00 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Now the shell connects to the server and runs rs.initiate() (replSetInitiate), which apparently takes me 6 seconds to type (take that, Mavis Beacon).
Tue Mar 6 12:45:02 [initandlisten] connection accepted from 127.0.0.1:38839 #2 (2 connections now open) Tue Mar 6 12:45:08 [conn2] replSet replSetInitiate admin command received from client
Because we didn’t give a config, the server tries to generate one:
Tue Mar 6 12:45:08 [conn2] replSet info initiate : no configuration specified. Using a default configuration for the set
Tue Mar 6 12:45:08 [conn2] replSet created this configuration for initiation : { _id: "foo", members: [ { _id: 0, host: "ubuntu:30000" } ] }Now it does some sanity checks on this config (e.g., making sure all members are up and reachable). This matters more if we gave it a config, but sometimes things are so screwed up you can’t reach yourself.
Tue Mar 6 12:45:08 [conn2] replSet replSetInitiate config object parses ok, 1 members specified Tue Mar 6 12:45:08 [conn2] replSet replSetInitiate all members seem up
Now it calculates the size of the oplog and allocates it between two lines of ******.
Tue Mar 6 12:45:08 [conn2] ****** Tue Mar 6 12:45:08 [conn2] creating replication oplog of size: 49085MB... Tue Mar 6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.ns, filling with zeroes... Tue Mar 6 12:45:08 [FileAllocator] creating directory /datadir/_tmp Tue Mar 6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.ns, size: 16MB, took 0.016 secs Tue Mar 6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.0, filling with zeroes... Tue Mar 6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.0, size: 2047MB, took 0.016 secs Tue Mar 6 12:45:08 [conn2] datafileheader::init initializing /datadir/local.0 n:0 Tue Mar 6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.1, filling with zeroes... ... Tue Mar 6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.22 n:22 Tue Mar 6 12:45:09 [FileAllocator] allocating new datafile /datadir/local.23, filling with zeroes... Tue Mar 6 12:45:09 [FileAllocator] done allocating datafile /datadir/local.23, size: 2047MB, took 0.042 secs Tue Mar 6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.23 n:23 Tue Mar 6 12:45:10 [conn2] ******
It doesn’t actually print the “…” part, I just omitted 21 files being preallocated.
Now it stores the configuration in the local.system.replset collection.
Tue Mar 6 12:45:10 [conn2] replSet info saving a newer config version to local.system.replset Tue Mar 6 12:45:10 [conn2] replSet saveConfigLocally done Tue Mar 6 12:45:10 [conn2] replSet replSetInitiate config now saved locally. Should come online in about a minute.
It seldom takes a minute for it to come online, but this way you’ll be pleasantly surprised.
replSetInitiate is now done running. It will often be logged, because slow operations are logged and allocating the oplog usually takes a while.
Tue Mar 6 12:45:10 [conn2] command admin.$cmd command: { replSetInitiate: undefined } ntoreturn:1 keyUpdates:0 reslen:196 2802msNow it actually loads this configuration as the replica set config we want to use:
Tue Mar 6 12:45:10 [rsStart] replSet load config ok from self
You won’t see this next line if you’re running 2.0, I added it for 2.2 because I was sick of having to decipher which server a log was from:
Tue Mar 6 12:45:10 [rsStart] replSet I am ubuntu:30000
Yay, now config has been loaded, making our state STARTUP2 (the “config has been loaded, starting up RS threads” state):
Tue Mar 6 12:45:10 [rsStart] replSet STARTUP2
At this point, it can go straight into PRIMARY state, but it doesn’t. We could fix this, but it hasn’t been a priority, so it goes into SECONDARY state briefly before becoming PRIMARY:
Tue Mar 6 12:45:10 [rsSync] replSet SECONDARY
A “real” secondary (we’re about to become primary, so we don’t count) will gather some stats about who’s best to sync from before it chooses a target:
Tue Mar 6 12:45:25 [rsSync] waiting for 2 pings from other members before syncing
Good luck with that, buddy, you’re the only one in the config.
Aha, we’ve noticed that we can become primary:
Tue Mar 6 12:45:26 [rsMgr] replSet info electSelf 0 Tue Mar 6 12:45:26 [rsMgr] replSet PRIMARY
And now we’re all set.
If we add members or provide an initial config with multiple members, the initial member’s heartbeat requests are sent to a bunch of servers in EMPTYCONFIG state. When these other servers receive the heartbeat request, they notice “Hey, I don’t have any config and this guy seems to think I’m part of his set, so I should check in with him.”
These EMPTYCONFIG members will then request a config from the member that requested the heartbeat. Seeing themselves in the config, they’ll save it to their own local.system.replset collection, then start sending out their own heartbeats. At that point, they’ll move into RECOVERING state and, once they’re synced to the original member, they’ll turn into secondaries (unless they’re arbiters, of course, but same general idea).
One quirk of initiating a set is that every member of the set must not have any data, other than the one you’re sending replSetInitiate to. After the set is initialized you can add members containing data, but at most one member can have data when you initialize the set.
Feel free to ask questions in the comments below. This is a loving, caring bootcamp (as bootcamps go).