Snail in a Turtleneck
Kristina Chodorow's Blog
Jan 27th
This is a supplement to Hacking Chess with the MongoDB Pipeline. This post has instructions for rolling your own data sets from chess games.
Download a collection of chess games you like. I’m using 1132 wins in less than 10 moves, but any of them should work.
These files are in a format called portable game notation (.PGN), which is a human-readable notation for chess games. For example, the first game in TEN.PGN (helloooo 80s filenames) looks like:
    [Event "?"]
    [Site "?"]
    [Date "????.??.??"]
    [Round "?"]
    [White "Gedult D"]
    [Black "Kohn V"]
    [Result "1-0"]
    [ECO "B33/09"]

    1.e4 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6 7.Nd5 Nxd5 8.exd5 Ne7 9.c4 a6 10.Qa4 1-0
This represents a 10-turn win at an unknown event. The “ECO” field shows which opening was used (a Sicilian in the game above).
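To make the format concrete, here is a minimal JavaScript sketch that pulls the tag pairs and moves out of a single game. It is not the PHP converter used below: parsePgnGame is a made-up helper, and it assumes one game with no comments or variations.

```javascript
// Hypothetical helper: parse one simple PGN game (no comments or variations).
function parsePgnGame(pgn) {
  const tags = {};
  // Tag pairs look like [White "Gedult D"].
  const tagPattern = /\[(\w+)\s+"([^"]*)"\]/g;
  let match;
  while ((match = tagPattern.exec(pgn)) !== null) {
    tags[match[1].toLowerCase()] = match[2];
  }
  // Everything after the last tag pair is the move text.
  const moveText = pgn.slice(pgn.lastIndexOf("]") + 1).trim();
  const tokens = moveText.split(/\s+/);
  const result = tokens.pop(); // the final token is the result, e.g. "1-0"
  const moves = [];
  for (const token of tokens) {
    // "1.e4" starts a new turn; a bare token such as "c5" is black's reply.
    const numbered = token.match(/^(\d+)\.(.+)$/);
    if (numbered) {
      moves.push({ turn: Number(numbered[1]), white: numbered[2] });
    } else {
      moves[moves.length - 1].black = token;
    }
  }
  return { tags, moves, result };
}

const game = parsePgnGame(
  '[White "Gedult D"] [Black "Kohn V"] [Result "1-0"] [ECO "B33/09"] ' +
  "1.e4 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6 " +
  "7.Nd5 Nxd5 8.exd5 Ne7 9.c4 a6 10.Qa4 1-0"
);
console.log(game.tags.eco, game.moves.length, game.result);
```

The last turn has no black move because white's tenth move ended the game.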
Unfortunately for us, MongoDB doesn’t import PGNs in their native format, so we’ll need to convert them to JSON. I found a PGN->JSON converter in PHP that did the job here. Scroll down to the “download” section to get the .zip.
It’s one of those zips that vomits its contents into whatever directory you unzip it in, so create a new directory for it.
So far, we have:
    $ mkdir chess
    $ cd chess
    $
    $ ftp ftp://ftp.pitt.edu/group/student-activities/chess/PGN/Collections/ten-pg.zip ./
    $ unzip ten-pg.zip
    $
    $ wget http://www.dhtmlgoodies.com/scripts/dhtml-chess/dhtml-chess.zip
    $ unzip dhtml-chess.zip
Now, create a simple script, say parse.php, to run through the chess matches and output them in JSON, one per line:
    <?php

    require("PgnParser.class.php");

    $parser = new PgnParser("/path/to/chess/TEN.PGN");
    $total = $parser->getNumberOfGames();
    for ($i=0; $i<$total; $i++) {
        echo $parser->getGameDetailsAsJson($i)."\n";
    }

    ?>
Run parse.php and dump the results into a file:
    $ php parse.php > games.json

Now you’re ready to import games.json.
Jan 26th

MongoDB’s new aggregation framework is now available in the nightly build! This post demonstrates some of its capabilities by using it to analyze chess games.
Make sure you have the “Development Release (Unstable)” nightly running before trying out the stuff in this post. The aggregation framework will be in 2.1.0, but as of this writing it’s only in the nightly build.
First, we need some chess games to analyze. Download games.json, which contains 1132 games that were won in 10 moves or less (crush their soul and do it quick).
You can use mongoimport to import games.json into MongoDB:
    $ mongoimport --db chess --collection fast_win games.json
    connected to: 127.0.0.1
    imported 1132 objects
We can take a look at our chess games in the Mongo shell:
    > use chess
    switched to db chess
    > db.fast_win.count()
    1132
    > db.fast_win.findOne()
    {
        "_id" : ObjectId("4ed3965bf86479436d6f1cd7"),
        "event" : "?",
        "site" : "?",
        "date" : "????.??.??",
        "round" : "?",
        "white" : "Gedult D",
        "black" : "Kohn V",
        "result" : "1-0",
        "eco" : "B33/09",
        "moves" : {
            "1" : {
                "white" : { "move" : "e4" },
                "black" : { "move" : "c5" }
            },
            "2" : {
                "white" : { "move" : "Nf3" },
                "black" : { "move" : "Nc6" }
            },
            ...
            "10" : {
                "white" : { "move" : "Qa4" }
            }
        }
    }
Not exactly the greatest schema, but that’s how the chess format exporter munged it. Regardless, now we can use aggregation pipelines to analyze these games.
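If the numbered-keys layout gets in the way, one option is to flatten it into an ordered array before importing. The sketch below is plain JavaScript and not part of the converter; movesToArray is a hypothetical helper and the sample document is trimmed to three turns.

```javascript
// Hypothetical helper: turn {"1": {white: {move}, black: {move}}, ...} into an array.
function movesToArray(moves) {
  return Object.keys(moves)
    .sort((a, b) => Number(a) - Number(b)) // numeric sort, so "10" lands after "2"
    .map((turn) => ({
      turn: Number(turn),
      white: moves[turn].white ? moves[turn].white.move : null,
      black: moves[turn].black ? moves[turn].black.move : null,
    }));
}

const doc = {
  moves: {
    "1": { white: { move: "e4" }, black: { move: "c5" } },
    "2": { white: { move: "Nf3" }, black: { move: "Nc6" } },
    "10": { white: { move: "Qa4" } },
  },
};
console.log(movesToArray(doc.moves));
```

The rest of this post sticks with the exported schema, so none of the pipelines below depend on this step.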
Experiment #1: First Mover Advantage
White has a slight advantage in chess because you move first (Wikipedia says it’s a 52%-56% chance of winning). I’d hypothesize that, in a short game, going first matters even more.
Let’s find out.
The “result” field in these docs is “1-0” if white wins and “0-1” if black wins. So, we want to divide our docs into two groups based on the “result” field and count how many docs are in each group. Using the aggregation pipeline, this looks like:
    > db.runCommand({aggregate : "fast_win", pipeline : [
    ... {
    ...     $group : {
    ...         _id : "$result",      // group by 'result' field
    ...         numGames : {$sum : 1} // add 1 for every document in the group
    ...     }
    ... }]})
    {
        "result" : [
            { "_id" : "0-1", "numGames" : 435 },
            { "_id" : "1-0", "numGames" : 697 }
        ],
        "ok" : 1
    }
That gives a 62% chance white will win (697 wins/1132 total games). Pretty good (although, of course, this isn’t a very large sample set).
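For intuition, here is roughly what that $group stage computes, written as plain JavaScript over an in-memory array. This only illustrates the grouping logic, not how MongoDB implements it; groupCount is a made-up name, and the stand-in array is rebuilt from the counts above.

```javascript
// Illustration only: count documents per value of a field, like
// {$group: {_id: "$result", numGames: {$sum: 1}}}.
function groupCount(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  return [...counts].map(([_id, numGames]) => ({ _id, numGames }));
}

// Rebuild a stand-in collection from the counts reported above.
const games = [
  ...Array(697).fill({ result: "1-0" }),
  ...Array(435).fill({ result: "0-1" }),
];
const groups = groupCount(games, "result");
const whiteWins = groups.find((g) => g._id === "1-0").numGames;
const percent = (100 * whiteWins) / games.length;
console.log(groups, percent.toFixed(1)); // 697 of 1132 is about 61.6%
```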

In case you're not familiar with it, a reference chessboard with 1-8, a-h marked.
Experiment #2: Best Starting Move
Given a starting move, what percent of the time will that move lead to victory? This probably depends on whether you’re playing white or black, so we’ll just focus on white’s opening move.
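First, we’ll just determine what starting moves white uses with this series of steps: extract white’s opening move from each game (the moves.1.white.move field), then group by that move and count how many games used it.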
These steps look like:
    > db.runCommand({aggregate: "fast_win", pipeline: [
    ... // '$project' is used to extract all of white's opening moves
    ... {
    ...     $project : {
    ...         // extract moves.1.white.move into a new field, firstMove
    ...         firstMove : "$moves.1.white.move"
    ...     }
    ... },
    ... // use '$group' to calculate the number of times each move occurred
    ... {
    ...     $group : {
    ...         _id : "$firstMove",
    ...         numGames : {$sum : 1}
    ...     }
    ... }]})
    {
        "result" : [
            { "_id" : "d3", "numGames" : 2 },
            { "_id" : "e4", "numGames" : 696 },
            { "_id" : "b4", "numGames" : 17 },
            { "_id" : "g3", "numGames" : 3 },
            { "_id" : "e3", "numGames" : 2 },
            { "_id" : "c4", "numGames" : 36 },
            { "_id" : "b3", "numGames" : 4 },
            { "_id" : "g4", "numGames" : 11 },
            { "_id" : "h4", "numGames" : 1 },
            { "_id" : "Nf3", "numGames" : 37 },
            { "_id" : "f3", "numGames" : 1 },
            { "_id" : "f4", "numGames" : 25 },
            { "_id" : "Nc3", "numGames" : 14 },
            { "_id" : "d4", "numGames" : 283 }
        ],
        "ok" : 1
    }
Now let’s compare those numbers with whether white won or lost.
    > db.runCommand({aggregate: "fast_win", pipeline: [
    ... // extract the first move
    ... {
    ...     $project : {
    ...         firstMove : "$moves.1.white.move",
    ...         // create a new field, "win", which is 1 if white won and 0 if black won
    ...         win : {$cond : [
    ...             {$eq : ["$result", "1-0"]}, 1, 0
    ...         ]}
    ...     }
    ... },
    ... // group by the move and count up how many winning games used it
    ... {
    ...     $group : {
    ...         _id : "$firstMove",
    ...         numGames : {$sum : 1},
    ...         numWins : {$sum : "$win"}
    ...     }
    ... },
    ... // calculate the percent of games won with this starting move
    ... {
    ...     $project : {
    ...         _id : 1,
    ...         numGames : 1,
    ...         percentWins : {
    ...             $multiply : [100, {
    ...                 $divide : ["$numWins","$numGames"]
    ...             }]
    ...         }
    ...     }
    ... },
    ... // discard moves that were used in less than 10 games (probably not representative)
    ... {
    ...     $match : {
    ...         numGames : {$gte : 10}
    ...     }
    ... },
    ... // order from worst to best
    ... {
    ...     $sort : {
    ...         percentWins : 1
    ...     }
    ... }]})
    {
        "result" : [
            { "_id" : "f4", "numGames" : 25, "percentWins" : 24 },
            { "_id" : "b4", "numGames" : 17, "percentWins" : 35.294117647058826 },
            { "_id" : "c4", "numGames" : 36, "percentWins" : 50 },
            { "_id" : "d4", "numGames" : 283, "percentWins" : 50.53003533568905 },
            { "_id" : "g4", "numGames" : 11, "percentWins" : 63.63636363636363 },
            { "_id" : "Nf3", "numGames" : 37, "percentWins" : 67.56756756756756 },
            { "_id" : "e4", "numGames" : 696, "percentWins" : 68.24712643678161 },
            { "_id" : "Nc3", "numGames" : 14, "percentWins" : 78.57142857142857 }
        ],
        "ok" : 1
    }
Pawn to e4 seems like the most dependable winner here. Knight to c3 also seems like a good choice (at a nearly 80% win rate), but it was only used in 14 games.
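The last three stages can be mimicked with array methods, which may help when reasoning about what each stage hands to the next. This is a sketch, not the pipeline itself: the numWins values are back-computed from the percentages above (for example, 24% of 25 games is 6 wins), and d3's numWins is a placeholder because the output doesn't report it.

```javascript
// After the first $project and the $group: one entry per opening move.
// numWins is back-computed from percentWins above; d3's numWins is a placeholder.
const grouped = [
  { _id: "Nc3", numGames: 14, numWins: 11 },
  { _id: "f4", numGames: 25, numWins: 6 },
  { _id: "d3", numGames: 2, numWins: 1 },
  { _id: "g4", numGames: 11, numWins: 7 },
];

const ranked = grouped
  // second $project: add percentWins
  .map((g) => ({ ...g, percentWins: (100 * g.numWins) / g.numGames }))
  // $match: keep moves seen in at least 10 games
  .filter((g) => g.numGames >= 10)
  // $sort: ascending by percentWins (worst first)
  .sort((a, b) => a.percentWins - b.percentWins);

console.log(ranked.map((g) => g._id)); // [ 'f4', 'g4', 'Nc3' ]
```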
Experiment #3: Best and Worst Moves for Black
We basically want to do a similar pipeline to Experiment 2, but for black. At the end, we want to find the best and worst percent.
    > db.runCommand({aggregate: "fast_win", pipeline: [
    ... // extract the first move
    ... {
    ...     $project : {
    ...         firstMove : "$moves.1.black.move",
    ...         win : {$cond : [
    ...             {$eq : ["$result", "0-1"]}, 1, 0
    ...         ]}
    ...     }
    ... },
    ... // group by the move and count up how many winning games used it
    ... {
    ...     $group : {
    ...         _id : "$firstMove",
    ...         numGames : {$sum : 1},
    ...         numWins : {$sum : "$win"}
    ...     }
    ... },
    ... // calculate the percent of games won with this starting move
    ... {
    ...     $project : {
    ...         _id : 1,
    ...         numGames : 1,
    ...         percentWins : {
    ...             $multiply : [100, {
    ...                 $divide : ["$numWins","$numGames"]
    ...             }]
    ...         }
    ...     }
    ... },
    ... // discard moves that were used in less than 10 games (probably not representative)
    ... {
    ...     $match : {
    ...         numGames : {$gte : 10}
    ...     }
    ... },
    ... // get the best and worst
    ... {
    ...     $group : {
    ...         _id : 1,
    ...         best : {$max : "$_id"},
    ...         worst : {$min : "$_id"}
    ...     }
    ... }]})
    {
        "result" : [
            { "_id" : 1, "best" : "g6", "worst" : "Nc6" }
        ],
        "ok" : 1
    }
“Nc6” means “move the knight to c6.” Or, rather, don’t, because it doesn’t tend to work out that well.
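One thing worth checking in that last stage: $max and $min are given "$_id", so they compare move names rather than win rates. The plain-JavaScript sketch below shows the difference on made-up numbers (the percentages are hypothetical, not taken from the data set), using a byte-wise string ordering in which uppercase letters sort before lowercase ones.

```javascript
// Hypothetical per-move stats; the percentages are invented for illustration.
const stats = [
  { _id: "Nc6", percentWins: 40 },
  { _id: "e5", percentWins: 30 },
  { _id: "g6", percentWins: 35 },
];

// What {$max: "$_id"} and {$min: "$_id"} amount to: extremes of the names.
const names = stats.map((s) => s._id).sort(); // default sort compares code units
const byName = { best: names[names.length - 1], worst: names[0] };

// Ranking by win rate instead: sort on percentWins, then take the ends.
const byRate = [...stats].sort((a, b) => a.percentWins - b.percentWins);
const byPercent = { best: byRate[byRate.length - 1]._id, worst: byRate[0]._id };

console.log(byName);    // { best: 'g6', worst: 'Nc6' }
console.log(byPercent); // { best: 'Nc6', worst: 'e5' }
```

In the pipeline, the equivalent would be a $sort on percentWins followed by a $group that takes $first and $last of "$_id".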
I like this new aggregation functionality because it feels simpler than MapReduce. You can start with a one-operation pipeline and build it up, step-by-step, seeing exactly what a given operation does to your output. (And no JavaScript required, which is always a plus.)
There’s lots more documentation on aggregation pipelines in the docs and I’ll be doing a couple more posts on it.
Jan 17th
Probably only relevant to a limited portion of my audience, but Silicon Valley Ryan Gosling is awesome. I have never seen anything like it and I’m not sure what the point is, but I know I’m a fan.
Go forth and be sexy and supportive for the female programmers you know.
Jan 4th
I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.
There are 8 subjects I cover in my bootcamp:

I’m going to do one subject per post; we’ll see how many I can get through.
Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.
The most obvious feature of replica sets is their ability to elect a new primary, so the first thing we’ll cover is this election process.

Let’s say we have a replica set with 3 members: X, Y, and Z. Every two seconds, each server sends out a heartbeat request to the other members of the set. So, if we wait a few seconds, X sends out heartbeats to Y and Z. They respond with information about their current situation: the state they’re in (primary/secondary), if they are eligible to become primary, their current clock time, etc.
X receives this info and updates its “map” of the set: if members have come up or gone down, changed state, and how long the roundtrip took.
At this point, if X’s map changed, X will check a couple of things: if X is primary and a member went down, it will make sure it can still reach a majority of the set. If it cannot, it’ll demote itself to a secondary.
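The majority check is simple arithmetic: a primary stays primary only while it can see more than half of the set, counting itself. A minimal sketch, with a made-up data shape for X's map (the real server tracks far more than an up/down flag):

```javascript
// Hypothetical shape for X's view of the set: member name -> reachable?
function canSeeMajority(selfName, map) {
  const members = Object.keys(map);
  // X always counts itself as reachable.
  const reachable = members.filter((m) => m === selfName || map[m]).length;
  return reachable > members.length / 2;
}

console.log(canSeeMajority("X", { X: true, Y: true, Z: false }));  // true: 2 of 3
console.log(canSeeMajority("X", { X: true, Y: false, Z: false })); // false: 1 of 3
```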
There is one wrinkle with X demoting itself: in MongoDB, writes default to fire-and-forget. Thus, if people are doing fire-and-forget writes on the primary and it steps down, they might not realize X is no longer primary and keep sending writes to it. The secondary-formerly-known-as-primary will be like, “I’m a secondary, I can’t write that!” But because the writes don’t get a response on the client, the client wouldn’t know.
Technically, we could say, “well, they should use safe writes if they care,” but that seems dickish. So, when a primary is demoted, it also closes all connections to clients so that they will get a socket error when they send the next message. All of the client libraries know to re-check who is primary if they get an error. Thus, they’ll be able to find who the new primary is and not accidentally send an endless stream of writes to a secondary.
Anyway, getting back to the heartbeats: if X is a secondary, it’ll occasionally check if it should elect itself, even if its map hasn’t changed. First, it’ll do a sanity check: does another member think it’s primary? Does X think it’s already primary? Is X ineligible for election? If it fails any of the basic questions, it’ll continue puttering along as is.
If it seems as though a new primary is needed, X will proceed to the first step in election: it sends a message to Y and Z, telling them “I am considering running for primary, can you advise me on this matter?”
When Y and Z get this message, they quickly check their world view. Do they already know of a primary? Do they have more recent data than X? Does anyone they know of have more recent data than X? They run through a huge list of sanity checks and, if everything seems satisfactory, they tentatively reply “go ahead.” If they find a reason that X cannot be elected, they’ll reply “stop the election!”
If X receives any “stop the election!” messages, it cancels the election and goes back to life as a secondary.
If everyone says “go ahead,” X continues with the second (and final) phase of the election process.
For the second phase, X sends out a second message that is basically, “I am formally announcing my candidacy.” At this point, Y and Z make a final check: do all of the conditions that held true before still hold? If so, they allow X to take their election lock and send back a vote. The election lock prevents them from voting for another candidate for 30 seconds.
If one of the checks doesn’t pass the second time around (fairly unusual, at least in 2.0), they send back a veto. If anyone vetoes, the election fails.

Suppose that Y votes for X and Z vetoes X. At that point, Y’s election lock is taken: it cannot vote in another election for 30 seconds. That means that, if Z wants to run for primary, it had better be able to get X’s vote. That said, it should be able to if Z is a viable candidate: it’s not like the members hold grudges (except for Y, for 30 seconds).
If no one vetoes and the candidate member receives votes from a majority of the set, the candidate becomes primary.
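Putting the two phases together, the flow might be sketched like this. It is a toy model in JavaScript, not MongoDB's implementation: the Member class, its fields, and the single "is the candidate's data recent enough" check are stand-ins for the much longer list of sanity checks. Two things here are my assumptions rather than something stated above: the candidate counts its own vote, and a member whose lock is already taken simply abstains. Time is passed in explicitly rather than read from a clock.

```javascript
const LOCK_MS = 30 * 1000; // a voter cannot vote again for 30 seconds

class Member {
  constructor(name, optime) {
    this.name = name;
    this.optime = optime; // stand-in for "how recent is my data"
    this.lockedUntil = 0; // election lock expiry (ms)
  }
  // Stand-in for the sanity checks: approve only if the candidate's
  // data is at least as recent as ours.
  approves(candidate) {
    return candidate.optime >= this.optime;
  }
  // Phase 1: "I am considering running for primary."
  advise(candidate) {
    return this.approves(candidate) ? "go ahead" : "stop the election!";
  }
  // Phase 2: "I am formally announcing my candidacy."
  vote(candidate, now) {
    if (!this.approves(candidate)) return "veto";
    if (now < this.lockedUntil) return "abstain"; // assumption: lock already taken
    this.lockedUntil = now + LOCK_MS;
    return "vote";
  }
}

function runElection(candidate, others, now) {
  if (others.some((m) => m.advise(candidate) !== "go ahead")) return false;
  const replies = others.map((m) => m.vote(candidate, now));
  if (replies.includes("veto")) return false;
  // Assumption: the candidate votes for itself; it needs a majority of the whole set.
  const votes = 1 + replies.filter((r) => r === "vote").length;
  return votes > (others.length + 1) / 2;
}

const x = new Member("X", 5), y = new Member("Y", 5), z = new Member("Z", 7);
console.log(runElection(x, [y, z], 0)); // false: Z has newer data than X
console.log(runElection(z, [x, y], 0)); // true: X and Y both vote for Z
```

After the second election, X and Y hold their election locks until the 30 seconds are up, so neither could vote for another candidate in the meantime.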
Feel free to ask questions in the comments below. This is a loving, caring bootcamp (as bootcamps go).
Dec 9th
The aggregation pipeline code has finally been merged into the main development branch and is scheduled for release in 2.2. It lets you combine simple operations (like finding the max or min, projecting out fields, taking counts or averages) into a pipeline of operations, making a lot of things that were only possible by using MapReduce doable with a “normal” query.
In celebration of this, I thought I’d re-do the very popular MySQL to MongoDB mapping using the aggregation pipeline, instead of MapReduce.
Here is the original SQL:
    SELECT
      Dim1, Dim2,
      SUM(Measure1) AS MSum,
      COUNT(*) AS RecordCount,
      AVG(Measure2) AS MAvg,
      MIN(Measure1) AS MMin,
      MAX(CASE
          WHEN Measure2 < 100
          THEN Measure2
      END) AS MMax
    FROM DenormAggTable
    WHERE (Filter1 IN ('A','B'))
      AND (Filter2 = 'C')
      AND (Filter3 > 123)
    GROUP BY Dim1, Dim2
    HAVING (MMin > 0)
    ORDER BY RecordCount DESC
    LIMIT 4, 8
We can break up this statement and replace each piece of SQL with the new aggregation pipeline syntax:
| MongoDB Pipeline | MySQL |
|---|---|
| aggregate: "DenormAggTable" | FROM DenormAggTable |
| { $match : { Filter1 : {$in : ['A','B']}, Filter2 : 'C', Filter3 : {$gt : 123} } } | WHERE (Filter1 IN ('A','B')) AND (Filter2 = 'C') AND (Filter3 > 123) |
| { $project : { Dim1 : 1, Dim2 : 1, Measure1 : 1, Measure2 : 1, lessThanAHundred : { $cond : [ {$lt : ["$Measure2", 100]}, "$Measure2" /* if */, 0 /* else */ ] } } } | CASE WHEN Measure2 < 100 THEN Measure2 END |
| { $group : { _id : {Dim1 : 1, Dim2 : 1}, MSum : {$sum : "$Measure1"}, RecordCount : {$sum : 1}, MAvg : {$avg : "$Measure2"}, MMin : {$min : "$Measure1"}, MMax : {$max : "$lessThanAHundred"} } } | SELECT Dim1, Dim2, SUM(Measure1) AS MSum, COUNT(*) AS RecordCount, AVG(Measure2) AS MAvg, MIN(Measure1) AS MMin, MAX(CASE WHEN Measure2 < 100 THEN Measure2 END) AS MMax GROUP BY Dim1, Dim2 |
| { $match : {MMin : {$gt : 0}} } | HAVING (MMin > 0) |
| { $sort : {RecordCount : -1} } | ORDER BY RecordCount DESC |
| { $skip : 4 }, { $limit : 8 } | LIMIT 4, 8 |
Putting all of these together gives you your pipeline:
    > db.runCommand({aggregate: "DenormAggTable", pipeline: [
    {
        $match : {
            Filter1 : {$in : ['A','B']},
            Filter2 : 'C',
            Filter3 : {$gt : 123}
        }
    },
    {
        $project : {
            Dim1 : 1,
            Dim2 : 1,
            Measure1 : 1,
            Measure2 : 1,
            lessThanAHundred : {$cond: [{$lt: ["$Measure2", 100]}, "$Measure2", 0]}
        }
    },
    {
        $group : {
            _id : {Dim1 : 1, Dim2 : 1},
            MSum : {$sum : "$Measure1"},
            RecordCount : {$sum : 1},
            MAvg : {$avg : "$Measure2"},
            MMin : {$min : "$Measure1"},
            MMax : {$max : "$lessThanAHundred"}
        }
    },
    {
        $match : {MMin : {$gt : 0}}
    },
    {
        $sort : {RecordCount : -1}
    },
    // LIMIT 4, 8 means "skip 4 rows, then return 8"
    {
        $skip : 4
    },
    {
        $limit : 8
    }
    ]})
As you can see, the SQL matches the pipeline operations pretty clearly. If you want to play with it, it’ll be available soon in the development nightly build.
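One detail to double-check when translating LIMIT: in MySQL, LIMIT 4, 8 means "skip 4 rows, then return up to 8", and pipeline stages run in the order they are listed. The sketch below uses plain array slicing (not MongoDB) to show why the order of $skip and $limit matters; skip, limit, and run are illustrative names of my own.

```javascript
const rows = Array.from({ length: 20 }, (_, i) => i + 1); // rows 1..20

// Each "stage" is a function from an array of docs to an array of docs.
const skip = (n) => (docs) => docs.slice(n);
const limit = (n) => (docs) => docs.slice(0, n);
const run = (docs, stages) => stages.reduce((acc, stage) => stage(acc), docs);

console.log(run(rows, [skip(4), limit(8)])); // rows 5..12, like LIMIT 4, 8
console.log(run(rows, [limit(8), skip(4)])); // rows 5..8 only
```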
If you’re at MongoSV today (December 9th, 2011), check out Chris Westin’s talk on the new aggregation framework at 3:45 in room B4.
Oct 18th
10gen is trying to hire a gazillion people, so I’m averaging two interviews a day (bleh). A lot of people have asked what it’s like to work on MongoDB, so I thought I’d write a bit about it.
A Usual Day

Coffee: the lynchpin of my day.
There are some variations on this: as I mentioned, a lot of time lately is taken up by interviewing. Other coworkers spend a lot more time than I do at consults, trainings, speaking at conferences, etc.
Other General Workday Stuff
On Fridays, we have lunch as a team. After lunch, we have a tech talk where someone presents on what they’re working on (e.g., the inspiration for my geospatial post) or general info that’s good to know (e.g., the inspiration for my virtual memory post). This is a nice way to end the week, especially since Fridays often wrap up earlier than other days.
A couple people use OS X or Windows for development, but most people use Linux. You can use whatever you want. I’d like to encourage emacs users, in particular, to apply, as we’re falling slightly behind vi in numbers.
We sit in an open office plan, everyone at tables in a big room (including the CEO and CTO, who are both programmers). The only people in separate rooms are the people who have to be on the phone all day (sales, marketers, basketweavers… I’m not really clear on what non-technical people do).
And speaking of what people actually do, here are three examples of my job (that are more specific than “coding”):
Fixing Other People’s Bugs

Recently, a developer was using MongoDB and IBM’s DB2 with PHP. After he installed the MongoDB driver, PHP started segfaulting all over the place. I downloaded the ibm_db2 PHP extension to take a look.
PHP keeps a “storage unit” for extensions’ long-term memory use. Every extension shares the space and can store things there.
The DB2 extension was basically fire-bombing the storage unit.
It went through the storage, object by object, casting the objects into DB2 types and then freeing them. This worked fine when DB2 was the only PHP extension being used, but broke down when anyone else tried to use that storage. I gave the user a small patch that stopped the DB2 extension from destroying objects it didn’t create, and everything worked fine for them, after that.
The Game is Afoot

A user reported that they couldn’t initialize their replica set: a member wasn’t coming online. The trick with this type of bug is to get enough evidence before the user wants to beat you over the head with the 800th log you’ve requested.
I asked them to send the first round of logs. It was weird: nothing was wrong from server1’s point of view. It initialized properly and could connect to everyone in the set. I puzzled over the messages, figuring out that once server1 had created the set, server2 had accepted the connection from server1 but then somehow failed to connect back to server1 and so couldn’t pick up the set config. However, according to server1, it could connect fine to server2 and thought it was perfectly healthy!
I finally realized what must be happening: “It looks like server2 couldn’t connect to any of the others, but all of them could connect to it. Could you check your firewall?”
“Oh, that server was blocking all outgoing connections! Now it’s working fine.”
Elementary, my dear Watson.
You know you’re not at a big company when…

At least it had "handles."
Someone on Sparc complained that the Perl driver wasn’t working at all for them. My first thought was that Sparc is big-endian, so maybe the Perl driver wasn’t flipping memory correctly. I asked Eliot where our Power PC was, and he said we must have forgotten it when we moved: it was still in our old office around the corner.
“Bring someone to help carry it,” he told me. “It’s heavy.”
Pshaw, I thought. How heavy could an old desktop be?
I went around the corner and the other company graciously let me walk into their server room, choose a server, and walk out with it. Unfortunately, it weighed about 50 pounds, and I have a traditional geek physique (no muscles). The trip back to our office involved me staggering a couple steps, putting it down, shaking out my arms, and repeat.
When I got to our office, I just dragged it down the hallway to our server closet. Eliot saw me tugboating the thing down the hallway.
“You didn’t bring someone to help?”
“It’s *oof* fine!”
Unfortunately, once it was all set up, the Perl driver worked perfectly on it. So it wasn’t big-endian specific.
I was now pretty sure it was Sparc-specific (another person had reported the same problem on a Sparc), so I bought an elderly Sparc server for a couple hundred bucks off eBay. When it arrived a couple days later, Eliot showed me how to rack it and I spent a day fighting with the Solaris/Oracle package manager. However, it was all worth it: I tried running the Perl driver and it instantly failed (success!).
After some debugging, I realized that Sparc was much more persnickety than Intel about byte alignment. The Perl driver was playing fast and loose with a byte buffer, casting pieces of it into other types (which Sparc didn’t like). I changed some casts to memcpys and the Perl driver started working beautifully.
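JavaScript has a rough analogue of this problem, which makes for a compact illustration (the actual fix was in the driver's C code, and this is not it). A typed-array view must start at an aligned offset, much like a cast pointer on Sparc, while a DataView reads byte by byte from any offset, much like memcpy-ing the bytes into a properly aligned variable.

```javascript
const buffer = new ArrayBuffer(8);
const bytes = new Uint8Array(buffer);
// Write the 32-bit value 3 in little-endian order, starting at byte offset 1.
bytes.set([3, 0, 0, 0], 1);

let castFailed = false;
try {
  // Like casting (buf + 1) to an int pointer: the offset is not a multiple of 4.
  new Int32Array(buffer, 1, 1);
} catch (err) {
  castFailed = err instanceof RangeError;
}

// Like memcpy: read the four bytes individually, so alignment doesn't matter.
const value = new DataView(buffer).getInt32(1, true); // true = little-endian

console.log(castFailed, value); // true 3
```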
But every day is different
The episodes above are a very small sample of what I do: there are hundreds of other things I’ve worked on over the last few years from speaking to working on the database to writing a freakin Facebook app.
So, if this sounded interesting, please go to our jobs website and submit an application!
Sep 28th
Edit: since this was written, Sam has written some excellent documentation on using MMS. I recommend reading through it as you explore MMS.
Telling someone “You should set up monitoring” is kind of like telling someone “You should exercise 20 minutes three times a week.” Yes, you know you should, but your chair is so comfortable and you haven’t keeled over dead yet.
For years*, 10gen has been planning to do monitoring “right,” making it painless to monitor your database. Today, we released the MongoDB Monitoring Service: MMS.
MMS is free hosted monitoring for MongoDB. I’ve been using it to help out paying customers for a while, so I thought I’d do a quick post on useful stuff I’ve discovered (documentation is… uh… a little light, so far).
So, first: you sign up.

There are two options: register a company and register another account for an existing company. For example, let’s say I wanted to monitor the servers for Snail in a Turtleneck Enterprises. I’ll create a new account and company group. Then Andrew, sys admin of my heart, can create an account with Snail in a Turtleneck Enterprises and have access to all the same monitoring info.

Once you’re registered, you’ll see a page encouraging you to download the MMS agent. Click on the “download the agent” link.

This is a little Python program that collects stats from MongoDB, so you need to have pymongo installed, too. Starting from scratch on Ubuntu, do:
    $ # prereqs
    $ sudo apt-get install python python-setuptools
    $ sudo easy_install pymongo
    $
    $ # set up agent
    $ unzip name-of-agent.zip
    $ cd name-of-agent
    $ mkdir logs
    $
    $ # start agent
    $ nohup python agent.py > logs/agent.log 2>&1 &
Last step! Back to the website: see that “+” button next to the “Hosts” title?

Designed by programmers, for Vulcans
Click on that and type a hostname. If you have a sharded cluster, add a mongos. If you have a replica set, add any member.
Now go have a nice cup of coffee. This is an important part of the process.
When you get back, tada, you’ll have buttloads of graphs. They probably won’t have much on them, since MMS will have been monitoring them for all of a few minutes.
Cool stuff to poke
This is the top bar of buttons:
Of immediate interest: click “Hosts” to see a list of hosts.
You’ll see hostname, role, and the last time the MMS agent was able to reach this host. Hosts that it hasn’t reached recently will have a red ping time.

Now click on a server’s name to see all of the info about it. Let’s look at a single graph.

You can click & drag to see a smaller bit of time on the graph. See those icons in the top right? Those give you:
That’s the basics. Some other points of interest:
If you have any problems with MMS, there’s a little form at the bottom to let you complain:

This will file a bug report for you. This is a “private” bug tracker, only 10gen and people in your group will be able to see the bugs you file.
* If you ran mongod --help using MongoDB version 1.0.0 or higher, you might have noticed some options that started with --mms. In other words, we’ve been planning this for a little while.
Sep 7th
By request, a quick post on using PHP references in extensions.
To start, here’s an example of references in PHP we’ll be translating into C:
    <?php

    // just for displaying output
    function display($x) {
        echo "x is $x\n";
    }

    // pass in an argument by making a copy of it
    function not_by_ref($arg) {
        echo "called not_by_ref($arg)\n";
        $arg = 2;
    }

    // pass in an argument by reference
    function by_ref(&$arg) {
        echo "called by_ref($arg)\n";
        $arg = 3;
    }

    $x = 1;
    display($x);

    not_by_ref($x);
    display($x);

    // when x is passed by reference, the function can change the value
    by_ref($x);
    display($x);

    ?>
This will print:
    x is 1
    called not_by_ref(1)
    x is 1
    called by_ref(1)
    x is 3
If you want your C extension’s function to officially have a signature with ampersands in it, you have to declare to PHP that you want to pass in refs as arguments. Remember how we declared functions in this struct?
    zend_function_entry rlyeh_functions[] = {
      PHP_FE(cthulhu, NULL)
      { NULL, NULL, NULL }
    };

The second argument to PHP_FE, NULL, can optionally be the argument spec. For example, let’s say we’re implementing by_ref() in C. We would add this to php_rlyeh.c:
    // the 1 indicates pass-by-reference
    ZEND_BEGIN_ARG_INFO(arginfo_by_ref, 1)
    ZEND_END_ARG_INFO();

    zend_function_entry rlyeh_functions[] = {
      PHP_FE(cthulhu, NULL)
      PHP_FE(by_ref, arginfo_by_ref)
      { NULL, NULL, NULL }
    };

    PHP_FUNCTION(by_ref) {
      zval *zptr = 0;

      if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
        return;
      }

      php_printf("called (the c version of) by_ref(%d)\n", (int)Z_LVAL_P(zptr));
      ZVAL_LONG(zptr, 3);
    }
Suppose we also add not_by_ref(). This might look something like:
    ZEND_BEGIN_ARG_INFO(arginfo_not_by_ref, 0)
    ZEND_END_ARG_INFO();

    zend_function_entry rlyeh_functions[] = {
      PHP_FE(cthulhu, NULL)
      PHP_FE(by_ref, arginfo_by_ref)
      PHP_FE(not_by_ref, arginfo_not_by_ref)
      { NULL, NULL, NULL }
    };

    PHP_FUNCTION(not_by_ref) {
      zval *zptr = 0, *copy = 0;

      if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
        return;
      }

      php_printf("called (the c version of) not_by_ref(%d)\n", (int)Z_LVAL_P(zptr));
      ZVAL_LONG(zptr, 2);
    }
However, if we try running this, we’ll get:
    x is 1
    called (the c version of) not_by_ref(1)
    x is 2
    called (the c version of) by_ref(2)
    x is 3
What happened? not_by_ref used our variable like a reference!
This is really weird and annoying behavior (if anyone knows why PHP does this, please comment below).
To work around it, if you want non-reference behavior, you have to manually make a copy of the argument.
Our not_by_ref() function becomes:
    PHP_FUNCTION(not_by_ref) {
      zval *zptr = 0, *copy = 0;

      if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
        return;
      }

      // make a copy
      MAKE_STD_ZVAL(copy);
      memcpy(copy, zptr, sizeof(zval));

      // set refcount to 1, as we're only using "copy" in this function
      Z_SET_REFCOUNT_P(copy, 1);

      php_printf("called (the c version of) not_by_ref(%d)\n", (int)Z_LVAL_P(copy));
      ZVAL_LONG(copy, 2);

      zval_ptr_dtor(&copy);
    }
Note that we set the refcount of copy to 1. This is because the refcount for zptr is 2: 1 ref from the calling function + 1 ref from the not_by_ref function. However, we don’t want the copy of zptr to have a refcount of 2, because it’s only being used by the current function.
Also note that memcpy-ing the zval only works because this is a scalar: if this were an array or object, we’d have to use PHP API functions to make a deep copy of the original.
If we run our PHP program again, it gives us:
    x is 1
    called (the c version of) not_by_ref(1)
    x is 1
    called (the c version of) by_ref(1)
    x is 3
Okay, this is pretty good… but we’re actually missing a case. What happens if we pass in a reference to not_by_ref()? In PHP, this looks like:
    function not_by_ref($arg) {
        $arg = 2;
    }

    $x = 1;
    not_by_ref(&$x);
    display($x);

…which displays “x is 2”. Unfortunately, we’ve overridden this behavior in our not_by_ref() C function, so we have to special case: if this is a reference, change its value, otherwise make a copy and change the copy’s value.
    PHP_FUNCTION(not_by_ref) {
      zval *zptr = 0, *copy = 0;

      if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
        return;
      }

      // NEW CODE
      if (Z_ISREF_P(zptr)) {
        // if this is a reference, make copy point to zptr
        copy = zptr;

        // adding a reference so we can indiscriminately delete copy later
        zval_add_ref(&zptr);
      }
      // OLD CODE
      else {
        // make a copy
        MAKE_STD_ZVAL(copy);
        memcpy(copy, zptr, sizeof(zval));

        // set refcount to 1, as we're only using "copy" in this function
        Z_SET_REFCOUNT_P(copy, 1);
      }

      php_printf("called (the c version of) not_by_ref(%d)\n", (int)Z_LVAL_P(copy));
      ZVAL_LONG(copy, 2);

      zval_ptr_dtor(&copy);
    }
Now it’ll behave “properly.”
There may be a better way to do this, please leave a comment if you know of one. However, as far as I know, this is the only way to emulate the PHP reference behavior.
If you would like to read more about PHP references, Derick Rethans wrote a great article on it for PHP Architect.
Aug 30th

Linux: the developer's personal gentleman
When you run a process, it needs some memory to store things: its heap, its stack, and any libraries it’s using. Linux provides and cleans up memory for your process like an extremely conscientious butler. You can (and generally should) just let Linux do its thing, but it’s a good idea to understand the basics of what’s going on.
One easy way (I think) to understand this stuff is to actually look at what’s going on using the pmap command. pmap shows you memory information for a given process.
For example, let’s take a really simple C program that prints its own process id (PID) and pauses:
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main() {
      printf("run `pmap %d`\n", getpid());
      pause();
    }
Save this as mem_munch.c. Now compile and run it with:
    $ gcc mem_munch.c -o mem_munch
    $ ./mem_munch
    run `pmap 25681`
The PID you get will probably be different than mine (25681).
At this point, the program will “hang.” This is because of the pause() function, and it’s exactly what we want. Now we can look at the memory for this process at our leisure.
Open up a new shell and run pmap, replacing the PID below with the one mem_munch gave you:
    $ pmap 25681
    25681:   ./mem_munch
    0000000000400000      4K r-x--  /home/user/mem_munch
    0000000000600000      4K r----  /home/user/mem_munch
    0000000000601000      4K rw---  /home/user/mem_munch
    00007fcf5af88000   1576K r-x--  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fcf5b112000   2044K -----  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fcf5b311000     16K r----  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fcf5b315000      4K rw---  /lib/x86_64-linux-gnu/libc-2.13.so
    00007fcf5b316000     24K rw---    [ anon ]
    00007fcf5b31c000    132K r-x--  /lib/x86_64-linux-gnu/ld-2.13.so
    00007fcf5b512000     12K rw---    [ anon ]
    00007fcf5b539000     12K rw---    [ anon ]
    00007fcf5b53c000      4K r----  /lib/x86_64-linux-gnu/ld-2.13.so
    00007fcf5b53d000      8K rw---  /lib/x86_64-linux-gnu/ld-2.13.so
    00007fff7efd8000    132K rw---    [ stack ]
    00007fff7efff000      4K r-x--    [ anon ]
    ffffffffff600000      4K r-x--    [ anon ]
     total             3984K
This output is how memory “looks” to the mem_munch process. If mem_munch asks the operating system for 00007fcf5af88000, it will get libc. If it asks for 00007fcf5b31c000, it will get the ld library.
This output is a bit dense and abstract, so let’s look at how some more familiar memory usage shows up. Change our program to put some memory on the stack and some on the heap, then pause.
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdlib.h>

int main() {
    int on_stack, *on_heap;

    // local variables are stored on the stack
    on_stack = 42;
    printf("stack address: %p\n", &on_stack);

    // malloc allocates heap memory
    on_heap = (int*)malloc(sizeof(int));
    printf("heap address: %p\n", on_heap);

    printf("run `pmap %d`\n", getpid());
    pause();
}
Now compile and run it:
$ gcc mem_munch.c -o mem_munch
$ ./mem_munch
stack address: 0x7fff497670bc
heap address: 0x1b84010
run `pmap 11972`
Again, your exact numbers will probably be different than mine.
Before you kill mem_munch, run pmap on it:
$ pmap 11972
11972:   ./mem_munch
0000000000400000       4K r-x-- /home/user/mem_munch
0000000000600000       4K r---- /home/user/mem_munch
0000000000601000       4K rw--- /home/user/mem_munch
0000000001b84000     132K rw--- [ anon ]
00007f3ec4d98000    1576K r-x-- /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec4f22000    2044K ----- /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec5121000      16K r---- /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec5125000       4K rw--- /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec5126000      24K rw--- [ anon ]
00007f3ec512c000     132K r-x-- /lib/x86_64-linux-gnu/ld-2.13.so
00007f3ec5322000      12K rw--- [ anon ]
00007f3ec5349000      12K rw--- [ anon ]
00007f3ec534c000       4K r---- /lib/x86_64-linux-gnu/ld-2.13.so
00007f3ec534d000       8K rw--- /lib/x86_64-linux-gnu/ld-2.13.so
00007fff49747000     132K rw--- [ stack ]
00007fff497bb000       4K r-x-- [ anon ]
ffffffffff600000       4K r-x-- [ anon ]
 total              4116K
Note that there’s a new entry between the final mem_munch section and libc-2.13.so. What could that be?
# from pmap
0000000001b84000 132K rw--- [ anon ]
# from our program
heap address: 0x1b84010
The addresses are almost the same. That block ([ anon ]) is the heap. (pmap labels blocks of memory that aren’t backed by a file [ anon ]. We’ll get into what being “backed by a file” means in a sec.)
The second thing to notice:
# from pmap
00007fff49747000 132K rw--- [ stack ]
# from our program
stack address: 0x7fff497670bc
And there’s your stack!
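If you'd rather have the program do this matching itself, here is a minimal sketch, assuming Linux. pmap gets its information from the /proc filesystem, and a process can read its own listing at /proc/self/maps. The helper name find_mapping is made up for this example. One difference to expect: the kernel's own listing usually labels the malloc region [heap] rather than [ anon ].

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Find the /proc/self/maps entry containing addr and copy its label
 * (a file path, "[heap]", "[stack]", or "" for unnamed regions) into label.
 * Returns 0 on success, -1 if nothing maps addr or the file can't be read.
 * find_mapping is a hypothetical helper name, not part of any library. */
int find_mapping(const void *addr, char *label, size_t label_size) {
    assert(label != NULL && label_size > 0);

    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) return -1;

    char line[4224];  /* room for a 4096-byte path plus the fixed fields */
    uint64_t target = (uint64_t)(uintptr_t)addr;
    int result = -1;

    while (fgets(line, sizeof line, maps)) {
        uint64_t start, end;
        int label_pos = 0;
        /* Each line: start-end perms offset dev inode [label] */
        if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %*s %*s %*s %*s %n",
                   &start, &end, &label_pos) != 2 || label_pos == 0)
            continue;
        if (target >= start && target < end) {
            line[strcspn(line, "\n")] = '\0';  /* drop the trailing newline */
            snprintf(label, label_size, "%s", line + label_pos);
            result = 0;
            break;
        }
    }
    fclose(maps);
    return result;
}
```

Calling find_mapping(&on_stack, ...) and find_mapping(on_heap, ...) from the program above should name the same two regions we just picked out by eye.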
One other important thing to notice: this is how memory “looks” to your program, not how memory is actually laid out on your physical hardware. Look at how much memory mem_munch has to work with. According to pmap, mem_munch can address memory between 0x0000000000400000 and 0xffffffffff600000 (well, actually only up to 0x00007fffffffffff; beyond that is special). For those of you playing along at home, that full range is roughly 16 million terabytes of memory. That’s a lot of memory. (If your computer has that kind of memory, please leave your address and times you won’t be at home.)
So, the amount of memory the program can address is kind of ridiculous. Why does the computer do this? Well, lots of reasons, but one important one is that this means you can address more memory than you actually have on the machine and let the operating system take care of making sure the right stuff is in memory when you try to access it.
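To put numbers on "kind of ridiculous," here is a small sketch of the arithmetic. The function names are made up for illustration, and the sizes are in binary terabytes (TiB, 2^40 bytes). It assumes the x86-64 layout shown in the pmap output above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Whole tebibytes (2^40 bytes) between two addresses.
 * span_in_tib is a hypothetical helper name. */
uint64_t span_in_tib(uint64_t low, uint64_t high) {
    assert(low <= high);
    return (high - low) >> 40;
}

/* Print the two ranges discussed in the text. */
void print_address_space_sizes(void) {
    /* lowest and highest addresses that appear in the pmap output */
    printf("pmap range:  %" PRIu64 " TiB\n",
           span_in_tib(0x0000000000400000ULL, 0xffffffffff600000ULL));
    /* the non-"special" lower half: 0 up to (but excluding) 2^47 */
    printf("user space:  %" PRIu64 " TiB\n",
           span_in_tib(0, 0x0000800000000000ULL));
}
```

The first span works out to 16,777,215 TiB and the second to 128 TiB. Either way, it is far more than the RAM in any machine you are likely to run this on.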
Memory mapping a file basically tells the operating system to load the file so the program can access it as an array of bytes. Then you can treat a file like an in-memory array.
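As a rough sketch of that idea (assuming Linux/POSIX, with map_file being a name invented for this example), the helper below maps an entire file read-only and hands back a pointer you can index like any other array. It asks fstat for the file's length rather than hardcoding one.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole file at path read-only. On success, returns a pointer to
 * its bytes and stores the length in *len; on failure, returns NULL.
 * map_file is a hypothetical helper name. */
const unsigned char *map_file(const char *path, size_t *len) {
    assert(path != NULL && len != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {  /* mmap rejects length 0 */
        close(fd);
        return NULL;
    }

    void *bytes = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (bytes == MAP_FAILED) return NULL;

    *len = (size_t)st.st_size;
    return bytes;
}
```

After a successful call, bytes[0] is the first byte of the file and bytes[*len - 1] is the last. Call munmap on the pointer when you are done with it.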
For example, let’s make a (pretty stupid) random number generator by creating a file full of random numbers, then mmap-ing it and reading off random numbers.
First, we’ll create a big file called random (note that this creates a 1GB file, so make sure you have the disk space and be patient, it’ll take a little while to write):
$ dd if=/dev/urandom bs=1024 count=1000000 of=/home/user/random
1000000+0 records in
1000000+0 records out
1024000000 bytes (1.0 GB) copied, 123.293 s, 8.3 MB/s
$ ls -lh random
-rw-r--r-- 1 user user 977M 2011-08-29 16:46 random
Now we’ll mmap random and use it to generate random numbers.
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdlib.h>
#include <sys/mman.h>

int main() {
    char *random_bytes;
    FILE *f;
    int offset = 0;

    // open "random" for reading
    f = fopen("/home/user/random", "r");
    if (!f) {
        perror("couldn't open file");
        return -1;
    }

    // we want to inspect memory before mapping the file
    printf("run `pmap %d`, then press <enter>", getpid());
    getchar();

    random_bytes = mmap(0, 1000000000, PROT_READ, MAP_SHARED, fileno(f), 0);
    if (random_bytes == MAP_FAILED) {
        perror("error mapping the file");
        return -1;
    }

    while (1) {
        printf("random number: %d (press <enter> for next number)",
               *(int*)(random_bytes+offset));
        getchar();
        offset += 4;
    }
}
If we run this program, we’ll get something like:
$ ./mem_munch
run `pmap 12727`, then press <enter>
The program hasn’t done anything yet, so the output of running pmap will basically be the same as it was above (I’ll omit it for brevity). However, if we continue running mem_munch by pressing enter, our program will mmap random.
Now if we run pmap it will look something like:
$ pmap 12727
12727:   ./mem_munch
0000000000400000       4K r-x-- /home/user/mem_munch
0000000000600000       4K r---- /home/user/mem_munch
0000000000601000       4K rw--- /home/user/mem_munch
000000000147d000     132K rw--- [ anon ]
00007fe261c6f000  976564K r--s- /home/user/random
00007fe29d61c000    1576K r-x-- /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d7a6000    2044K ----- /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d9a5000      16K r---- /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d9a9000       4K rw--- /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d9aa000      24K rw--- [ anon ]
00007fe29d9b0000     132K r-x-- /lib/x86_64-linux-gnu/ld-2.13.so
00007fe29dba6000      12K rw--- [ anon ]
00007fe29dbcc000      16K rw--- [ anon ]
00007fe29dbd0000       4K r---- /lib/x86_64-linux-gnu/ld-2.13.so
00007fe29dbd1000       8K rw--- /lib/x86_64-linux-gnu/ld-2.13.so
00007ffff29b2000     132K rw--- [ stack ]
00007ffff29de000       4K r-x-- [ anon ]
ffffffffff600000       4K r-x-- [ anon ]
 total            980684K
This is very similar to before, but with an extra line (the one for /home/user/random), which kicks up virtual memory usage a bit (from 4MB to 980MB).
However, let’s re-run pmap with the -x option. This shows the resident set size (RSS): only 4KB of random are resident. Resident memory is memory that’s actually in RAM. There’s very little of random in RAM because we’ve only accessed the very start of the file, so the OS has only pulled the first bit of the file from disk into memory.
$ pmap -x 12727
12727:   ./mem_munch
Address           Kbytes     RSS   Dirty Mode  Mapping
0000000000400000       0       4       0 r-x-- mem_munch
0000000000600000       0       4       4 r---- mem_munch
0000000000601000       0       4       4 rw--- mem_munch
000000000147d000       0       4       4 rw--- [ anon ]
00007fe261c6f000       0       4       0 r--s- random
00007fe29d61c000       0     288       0 r-x-- libc-2.13.so
00007fe29d7a6000       0       0       0 ----- libc-2.13.so
00007fe29d9a5000       0      16      16 r---- libc-2.13.so
00007fe29d9a9000       0       4       4 rw--- libc-2.13.so
00007fe29d9aa000       0      16      16 rw--- [ anon ]
00007fe29d9b0000       0     108       0 r-x-- ld-2.13.so
00007fe29dba6000       0      12      12 rw--- [ anon ]
00007fe29dbcc000       0      16      16 rw--- [ anon ]
00007fe29dbd0000       0       4       4 r---- ld-2.13.so
00007fe29dbd1000       0       8       8 rw--- ld-2.13.so
00007ffff29b2000       0      12      12 rw--- [ stack ]
00007ffff29de000       0       4       0 r-x-- [ anon ]
ffffffffff600000       0       0       0 r-x-- [ anon ]
----------------  ------  ------  ------
total kB          980684     508     100
If the virtual memory size (the Kbytes column) is all 0s for you, don’t worry about it. That’s a bug in Debian/Ubuntu’s -x option. The total is correct, it just doesn’t display correctly in the breakdown.
You can see that the resident set size, the amount that’s actually in memory, is tiny compared to the virtual memory. Your program can access any memory within a billion bytes of 0x00007fe261c6f000, but if it accesses anything past 4KB, it’ll probably have to go to disk for it*.
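pmap -x reports residency from the outside. A process can also ask about its own mappings with the mincore system call, which fills in one status byte per page. The sketch below assumes Linux; count_resident_pages is a name made up for this example. Be aware that some newer kernels may deliberately limit what mincore reveals for files the caller does not own.

```c
#define _DEFAULT_SOURCE  /* exposes mincore() under strict -std=c99/c11 */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count how many pages of [addr, addr + len) are currently in RAM.
 * addr must be page-aligned, as pointers returned by mmap are.
 * Returns -1 if mincore() or the allocation fails.
 * count_resident_pages is a hypothetical helper name. */
long count_resident_pages(void *addr, size_t len) {
    assert(addr != NULL && len > 0);

    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t pages = (len + page_size - 1) / page_size;

    unsigned char *vec = malloc(pages);  /* one status byte per page */
    if (!vec) return -1;

    long resident = -1;
    if (mincore(addr, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;  /* low bit set means the page is in RAM */
    }
    free(vec);
    return resident;
}
```

In the random number program, calling count_resident_pages(random_bytes, 1000000000) right after mmap and again after reading a few numbers would be one way to watch pages arrive without switching to another shell.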
What if we modify our program so it reads the whole file/array of bytes?
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdlib.h>
#include <sys/mman.h>

int main() {
    char *random_bytes;
    FILE *f;
    int offset = 0;

    // open "random" for reading
    f = fopen("/home/user/random", "r");
    if (!f) {
        perror("couldn't open file");
        return -1;
    }

    random_bytes = mmap(0, 1000000000, PROT_READ, MAP_SHARED, fileno(f), 0);
    if (random_bytes == MAP_FAILED) {
        printf("error mapping the file\n");
        return -1;
    }

    for (offset = 0; offset < 1000000000; offset += 4) {
        // volatile keeps an optimizing compiler from dropping the read
        volatile int i = *(int*)(random_bytes+offset);
        (void)i;

        // to show we're making progress
        if (offset % 1000000 == 0) {
            printf(".");
            // stdout is line-buffered, so flush to make the dot appear now
            fflush(stdout);
        }
    }

    // at the end, wait for signal so we can check mem
    printf("\ndone, run `pmap -x %d`\n", getpid());
    pause();
}
Now the resident set size is almost the same as the virtual memory size:
$ pmap -x 5378
5378:   ./mem_munch
Address           Kbytes     RSS   Dirty Mode  Mapping
0000000000400000       0       4       4 r-x-- mem_munch
0000000000600000       0       4       4 r---- mem_munch
0000000000601000       0       4       4 rw--- mem_munch
0000000002271000       0       4       4 rw--- [ anon ]
00007fc2aa333000       0  976564       0 r--s- random
00007fc2e5ce0000       0     292       0 r-x-- libc-2.13.so
00007fc2e5e6a000       0       0       0 ----- libc-2.13.so
00007fc2e6069000       0      16      16 r---- libc-2.13.so
00007fc2e606d000       0       4       4 rw--- libc-2.13.so
00007fc2e606e000       0      16      16 rw--- [ anon ]
00007fc2e6074000       0     108       0 r-x-- ld-2.13.so
00007fc2e626a000       0      12      12 rw--- [ anon ]
00007fc2e6290000       0      16      16 rw--- [ anon ]
00007fc2e6294000       0       4       4 r---- ld-2.13.so
00007fc2e6295000       0       8       8 rw--- ld-2.13.so
00007fff037e6000       0      12      12 rw--- [ stack ]
00007fff039c9000       0       4       0 r-x-- [ anon ]
ffffffffff600000       0       0       0 r-x-- [ anon ]
----------------  ------  ------  ------
total kB          980684  977072     104
Now if we access any part of the file, it will be in RAM already. (Probably. Until something else kicks it out.) So, our program can access a gigabyte of memory, but the operating system can lazily load it into RAM as needed.
And that’s why your virtual memory is so damn high when you’re running MongoDB.
Left as an exercise to the reader: try running pmap on a mongod process before it’s done anything, once you’ve done a couple operations, and once it’s been running for a long time.
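If you want to track those two totals over time without eyeballing pmap output, one option (a sketch, assuming Linux) is to read the VmSize and VmRSS fields from /proc/<pid>/status. These are the kernel's own virtual and resident totals for a process, in kB. read_status_kb is a helper name invented here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a "Key:   12345 kB" field such as VmSize or VmRSS from
 * /proc/<pid>/status. Returns the value in kB, or -1 if the file
 * or the field is missing. read_status_kb is a hypothetical helper name. */
long read_status_kb(pid_t pid, const char *key) {
    assert(key != NULL);

    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);
    FILE *status = fopen(path, "r");
    if (!status) return -1;

    char line[256];
    size_t key_len = strlen(key);
    long value = -1;
    while (fgets(line, sizeof line, status)) {
        /* match "key:" at the start of the line */
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':') {
            if (sscanf(line + key_len + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(status);
    return value;
}
```

Passing a mongod PID and comparing read_status_kb(pid, "VmSize") against read_status_kb(pid, "VmRSS") at the three points in the exercise should show the same gap that pmap does.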
* This isn’t strictly true**. The kernel actually says, “If they want the first N bytes, they’re probably going to want some more of the file” so it’ll load, say, the first dozen KB of the file into memory but only tell the process about 4KB. When your program tries to access this memory that is in RAM, but it didn’t know was in RAM, it’s called a minor page fault (as opposed to a major page fault when it actually has to hit disk to load new info).
** This note is also not strictly true. In fact, the whole file will probably be in memory before you map anything because you just wrote the thing with dd. So you’ll just be doing minor page faults as your program “discovers” it.
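To see which kind of page fault you are actually getting, getrusage reports running totals of both for the current process. The sketch below is an illustration only: the struct and function names are made up, and touch_pages simply reads one byte per page so that every page gets visited.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical container for the two counters we care about. */
struct fault_counts {
    long minor;  /* faults served without disk I/O */
    long major;  /* faults that had to read from disk */
};

/* Snapshot this process's page-fault counters (-1 for both on failure). */
struct fault_counts get_fault_counts(void) {
    struct rusage usage;
    struct fault_counts counts = {-1, -1};
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        counts.minor = usage.ru_minflt;
        counts.major = usage.ru_majflt;
    }
    return counts;
}

/* Read one byte from each page of bytes[0..len). Returns the sum of the
 * bytes read; the volatile qualifier keeps the reads from being optimized
 * away. */
unsigned long touch_pages(const volatile unsigned char *bytes, size_t len,
                          size_t page_size) {
    assert(bytes != NULL && page_size > 0);
    unsigned long sum = 0;
    for (size_t off = 0; off < len; off += page_size)
        sum += bytes[off];
    return sum;
}
```

Take a snapshot with get_fault_counts before and after touching the mapped file. If the file is still cached from the dd run, I'd expect the minor count to climb while the major count barely moves.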