Chain

Replica Set Internals Bootcamp Part IV: Syncing

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

I’m going to do one subject per post, we’ll see how many I can get through. I’m punting on reconfig for now because it’s finicky to write about.

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Syncing

When a secondary is operating normally, it chooses a member to sync from (more on that below) and starts pulling operations from the source’s local.oplog.rs collection. When it gets an op (call it W, for “write”), it does three things:

  1. Applies the op
  2. Writes the op to its own oplog (also local.oplog.rs)
  3. Requests the next op

If the db crashes between 1 & 2 and then comes back up, it’ll think it hasn’t applied W yet, so it’ll re-apply it. Luckily (i.e., due to massive amounts of hard work), oplog ops are idempotent: you can apply W once, twice, or a thousand times and you’ll end up with the same document.

For example, if you have a doc that looks like {counter:1} and you do an update like {$inc:{counter:1}} on the primary, you’ll end up with {counter:2} and the oplog will store {$set:{counter:2}}. The secondaries will replicate that instead of the $inc.

w

To ensure a write is present on, say, two members, you can do:

> db.foo.runCommand({getLastError:1, w:2})

Syntax varies based on language, consult your API docs, but it’s always an option for writes. The way this works is kind of cool.

Suppose you have a member called primary and another member syncing from it, called secondary. How does primary know where secondary is synced to? Well, secondary is querying primary‘s oplog for more results. So, if secondary requests an op written at 3pm, primary knows seconday has replicated all ops written before 3pm.

So, it goes like:

  1. Do a write on primary.
  2. Write is written to the oplog on primary, with a field “ts” saying the write occurred at time t.
  3. {getLastError:1,w:2} is called on primary. primary has done the write, so it is just waiting for one more server to get the write (w:2).
  4. secondary queries the oplog on primary and gets the op
  5. secondary applies the op from time t
  6. secondary requests ops with {ts:{$gt:t}} from primary‘s oplog
  7. primary updates that secondary has applied up to t because it is requesting ops > t.
  8. getLastError notices that primary and secondary both have the write, so w:2 is satisfied and it returns.

Starting up

When you start up a node, it takes a look at its local.oplog.rs collection and finds the latest entry in there. This is called the lastOpTimeWritten and it’s the latest op this secondary has applied.

You can always use this shell helper to get the current last op in the shell:

> rs.debug.getLastOpWritten()

The “ts” field is the last optime written.

If a member starts up and there are no entries in the oplog, it will begin the initial sync process, which is beyond the scope of this post.

Once it has the last op time, it will chose a target to sync from.

Who to sync from

As of 2.0, servers automatically sync from whoever is “nearest” based on average ping time. So, if you bring up a new member it starts sending out heartbeats to all the other members and averaging how long it takes to get a response. Once it has a decent picture of the world, it’ll decide who to sync from using the following algorithm:

for each member that is healthy:
    if member[state] == PRIMARY
        add to set of possible sync targets

    if member[lastOpTimeWritten] > our[lastOpTimeWritten]
        add to set of possible sync targets

sync target = member with the min ping time from the possible sync targets

The definition of “member that is healthy” has changed somewhat over the versions, but generally you can think of it as a “normal” member: a primary or secondary. In 2.0, “healthy” debatably includes slave delayed nodes.

You can see who a server is syncing from by running db.adminCommand({replSetGetStatus:1}) and looking at the “syncingTo” field (only present on secondaries). Yes, yes, it probably should have been syncingFrom. Backwards compatibility sucks.

Chaining Slaves

The algorithm for chosing a sync target means that slave chaining is semi-automatic: start up a server in a data center and it’ll (probably) sync from a server in the same data center, minimizing WAN traffic. (Note that you can’t end up with a sync loop, i.e., A syncing from B and B syncing from A, because a secondary can only sync from another secondary with a strictly higher optime.)

One cool thing to implement was making w work with slave chaining. If A is syncing from B and B is syncing from C, how does C know where A is synced to? The way this works is that it builds on the existing oplog-reading protocol.

When A starts syncing from B (or any server starts syncing from another server), it sends a special “handshake” message that basically says, “Hi, I’m A and I’ll be syncing from your oplog. Please track this connection for w purposes.”

When B gets this message, it says, “Hmm, I’m not primary, so let me forward that along to the member I’m syncing from.” So it opens a new connection to C and says “Pretend I’m ‘A‘, I’ll be syncing from your oplog on A‘s behalf.” Note that B now has two connections open to C, one for itself and one for A.

Whenever A requests more ops from B, B sends the ops from its oplog and then forwards a dummy request to C along “A‘s” connection to C. A doesn’t even need to be able to connect directly to C.

A        B        C
           <====>
  <====>   <---->

<====> is a “real” sync connection. The connection between B and C on A’s behalf is called a “ghost” connection (<---->).

On the plus side, this minimizes network traffic. On the negative side, the absolute time it takes a write to get to all members is higher.

Coming soon to a replica set near you…

In 2.2, there will be a new command, replSetSyncFrom, that lets you change who a member is syncing from, bypassing the “choosing a sync target” logic.

> db.adminCommand({replSetSyncFrom:"otherHost:27017"})
Please let me know what thought of this post, anonymously here or in the comments below:
Funny (0) Useful (5) Too short (0) Too long (0)
Chainsaws: the kill -9 of living dead

Night of the Living Dead Ops

MongoDB users often ask about the “killed” field in db.currentOp() output. For example, if you’ve run db.killOp(), you might see something like:

> db.currentOp()
{
	"inprog" : [
		{
			"opid" : 3062962,
			"active" : true,
			"lockType" : "write",
			"waitingForLock" : false,
			"secs_running" : 32267,
			"op" : "update",
			"ns" : "httpdb.servers",
			"query" : {
				"_id" : "150.237.88.189"
			},
			"client" : "127.0.0.1:50416",
			"desc" : "conn",
			"threadId" : "0x2900c400",
			"connectionId" : 74,
			"killed" : true,			"numYields" : 0
		},
		{
			"opid" : 3063051,
			"active" : false,
			"lockType" : "read",
			"waitingForLock" : true,
			"op" : "query",
			"ns" : "",
			"query" : {
				"count" : "servers",
				"query" : {
					"code" : {
						"$gte" : 200
					}
				}
			},
			"client" : "127.0.0.1:30736",
			"desc" : "conn",
			"threadId" : "0x29113700",
			"connectionId" : 191,
			"killed" : true,			"numYields" : 0
		}
        ]
}

The operation looks dead… it has killed:true, right? But you can run db.currentOp() again and again and the op doesn’t go away, even though it’s been “killed.” So what’s up with that?

Chainsaws: the kill -9 of living dead

It has to do with how MongoDB handles killed operations. When you run db.killOp(3062962), MongoDB looks up operation 3062962 in a hashtable and sets its killed field to true. However, the code running that op gets to decide whether to even check that field and how deal with it appropriately.

There are basically three ways MongoDB ops handle getting killed:

  • Ones that die when they yield whatever lock they’re holding. This means that if the op never yields (note that numYields is 0 in the example above), it will never be killed.
  • Ones that can be killed at certain checkpoints. For example, index builds happen in multiple stages and check killed between stages. (Many commands do this, too.)
  • Ones cannot be killed at all. For example, rsSync, the name for the op applying replication, falls into this category. There are some sharding commands that can’t be killed, too.

There is no kill -9 equivalent in MongoDB (other than kill -9-ing the server itself): if an op doesn’t know how to safely kill itself, it won’t die until it’s good and ready. Which means that you can have a “killed” op in db.currentOp() output for a long time. killed might be better named killRequested.

Also, if you kill an operation before it acquires a lock, it’ll generally start executing anyway (e.g., op 3063051 above). For example, try opening up a shell and make the db hold the writelock for 10 minutes:

> db.eval("sleep(10*60*1000)")

While that’s running, in another shell, try doing an insert (which will block, as the db cannot do any writes while the db.eval() is holding the writelock).

> db.foo.insert({x:1})

Now, in a third shell, kill the insert we just did (before the 10 minutes elapse):

> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 455937,
                        "active" : true,
                        "lockType" : "W",
                        "waitingForLock" : false,
                        "secs_running" : 25,
                        "op" : "query",
                        "ns" : "test",
                        "query" : {
                                "$eval" : "sleep(10*60*1000)"
                        },
                        "client" : "127.0.0.1:51797",
                        "desc" : "conn",
                        "threadId" : "0x7f241c0bf700",
                        "connectionId" : 14477,
                        "locks" : {
                                "." : "W"
                        },
                        "numYields" : 0
                },
                {
                        "opid" : 455949,                        "active" : false,
                        "lockType" : "w",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
 
                        },
                        "client" : "127.0.0.1:51799",
                        "desc" : "conn",
                        "threadId" : "0x7f24147f8700",
                        "connectionId" : 14478,
                        "locks" : {
                                "." : "w",
                                ".test" : "W"
                        },
                        "numYields" : 0
                }
        ]
}
> // get the opId of the insert from currentOp
> db.killOp(455949)
{ "info" : "attempting to kill op" }
> // run currentOp again to see that killed:true
> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 455937,
                        "active" : true,
                        "lockType" : "W",
                        "waitingForLock" : false,
                        "secs_running" : 221,
                        "op" : "query",
                        "ns" : "test",
                        "query" : {
                                "$eval" : "sleep(10*60*1000)"
                        },
                        "client" : "127.0.0.1:51797",
                        "desc" : "conn",
                        "threadId" : "0x7f241c0bf700",
                        "connectionId" : 14477,
                        "locks" : {
                                "." : "W"
                        },
                        "numYields" : 0
                },
                {
                        "opid" : 455949,
                        "active" : false,
                        "lockType" : "w",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
 
                        },
                        "client" : "127.0.0.1:51799",
                        "desc" : "conn",
                        "threadId" : "0x7f24147f8700",
                        "connectionId" : 14478,
                        "locks" : {
                                "." : "w",
                                ".test" : "W"
                        },
                        "killed" : true,                        "numYields" : 0
                }
        ]
}

If you wait 10 minutes for the db.eval() to finish, then do a find on db.foo, you’ll see that {x:1} was actually inserted anyway. This is because the op’s lifecycle looks something like:

  • Wait for lock
  • Acquire lock!
  • Start running
  • Yield lock
  • Check for killed

So it’ll run a bit before dying (if it can be killed at all), which may produce unintuitive results.

Please let me know what thought of this post, anonymously here or in the comments below:
Funny (2) Useful (2) Too short (0) Too long (0)
Photo by  Steven Fettig, some rights reserved

Debugging Sexism

Photo by Steven Fettig, some rights reserved

The word of the day is sexism: Shanley Kane tweeted the CTO of Geekli.st, asking them to take down a video of a woman in a Geekli.st top and underwear dancing around. The subsequent tweets were captured in a long Storify and Geekli.st subsequently issued a public apology.

The interesting thing about this, to me, is how often well-meaning geeks react badly when someone says that they did something sexist, racist, homophobic, etc.

Let’s say someone says to you, in a public forum, “This thing you did is sexist.” What should you do?

First, if your immediate reaction is to say, “No it’s not!” don’t give in to your immediate reaction. That seems to be what most people start off with and it doesn’t help anything.

The reason you’re probably so eager to say it’s not sexist is because you’re not sexist. (Let’s assume, if you’re reading this blog, that you are correct: you are not sexist.)

So, if you aren’t sexist, why can’t you flat-out say “You are wrong”? Well, for starters, just because you’re not sexist doesn’t mean you never do anything sexist. Besides, they’re saying “This thing you did is sexist” but what they mean is “I feel this thing you did is sexist.” And even if you’re Simone de Beauvoir and Margaret Atwood and Buffy the Vampire Slayer rolled into one, you cannot tell them their feelings are wrong (well, you can, but you’ll look like an ass). What should you do, then?

You should try to make them feel better and try to avoid hurting them in the future.

  • Step 1: try to make them feel better. The best way to do this is through an apology. A real one, not an “I’m sorry this offended you”-type apology. A better choice: “I’m sorry that I did that.” Now follow up with, “What can I do to make this better?”
  • Step 2: try to avoid hurting them in the future. Obviously, everyone makes mistakes. You can’t never make one again, even if you and your company were publicly humiliated the first time around. The important thing is to try to learn from it. Google around and read about the issues women and minorities are concerned about in geek culture. Even if you don’t agree with everything everyone writes (I certainly don’t), you will at least know what the issues are.

If someone tells you they’re offended, you should take that seriously. If they’re speaking up, there are probably many others who are silently offended.

Please let me know what thought of this post, anonymously here or in the comments below:
Funny (0) Useful (4) Too short (0) Too long (0)