Kristina Chodorow's Blog
Programming
And now, for something completely different
Jan 17th
Probably only relevant to a limited portion of my audience, but Silicon Valley Ryan Gosling is awesome. I have never seen anything like it, and I’m not sure what the point is, but I know I’m a fan.
Go forth and be sexy and supportive for the female programmers you know.
More PHP Internals: References
Sep 7th
By request, a quick post on using PHP references in extensions.
To start, here’s an example of references in PHP we’ll be translating into C:
<?php

// just for displaying output
function display($x) {
    echo "x is $x\n";
}

// pass in an argument by making a copy of it
function not_by_ref($arg) {
    echo "called not_by_ref($arg)\n";
    $arg = 2;
}

// pass in an argument by reference
function by_ref(&$arg) {
    echo "called by_ref($arg)\n";
    $arg = 3;
}

$x = 1;
display($x);

not_by_ref($x);
display($x);

// when x is passed by reference, the function can change the value
by_ref($x);
display($x);

?>
This will print:
x is 1
called not_by_ref(1)
x is 1
called by_ref(1)
x is 3
If you want your C extension’s function to officially have a signature with ampersands in it, you have to declare to PHP that you want to pass in refs as arguments. Remember how we declared functions in this struct?
zend_function_entry rlyeh_functions[] = {
  PHP_FE(cthulhu, NULL)
  { NULL, NULL, NULL }
};
The second argument to PHP_FE, NULL, can optionally be the argument spec. For example, let’s say we’re implementing by_ref() in C. We would add this to php_rlyeh.c:
// the 1 indicates pass-by-reference
ZEND_BEGIN_ARG_INFO(arginfo_by_ref, 1)
ZEND_END_ARG_INFO();

zend_function_entry rlyeh_functions[] = {
  PHP_FE(cthulhu, NULL)
  PHP_FE(by_ref, arginfo_by_ref)
  { NULL, NULL, NULL }
};

PHP_FUNCTION(by_ref) {
  zval *zptr = 0;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
    return;
  }

  php_printf("called (the c version of) by_ref(%d)\n", (int)Z_LVAL_P(zptr));
  ZVAL_LONG(zptr, 3);
}
Suppose we also add not_by_ref(). This might look something like:
ZEND_BEGIN_ARG_INFO(arginfo_not_by_ref, 0)
ZEND_END_ARG_INFO();

zend_function_entry rlyeh_functions[] = {
  PHP_FE(cthulhu, NULL)
  PHP_FE(by_ref, arginfo_by_ref)
  PHP_FE(not_by_ref, arginfo_not_by_ref)
  { NULL, NULL, NULL }
};

PHP_FUNCTION(not_by_ref) {
  zval *zptr = 0, *copy = 0;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
    return;
  }

  php_printf("called (the c version of) not_by_ref(%d)\n", (int)Z_LVAL_P(zptr));
  ZVAL_LONG(zptr, 2);
}
However, if we try running this, we’ll get:
x is 1
called (the c version of) not_by_ref(1)
x is 2
called (the c version of) by_ref(2)
x is 3
What happened? not_by_ref used our variable like a reference!
This is really weird and annoying behavior (if anyone knows why PHP does this, please comment below).
To work around it, if you want non-reference behavior, you have to manually make a copy of the argument.
Our not_by_ref() function becomes:
PHP_FUNCTION(not_by_ref) {
  zval *zptr = 0, *copy = 0;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
    return;
  }

  // make a copy
  MAKE_STD_ZVAL(copy);
  memcpy(copy, zptr, sizeof(zval));

  // set refcount to 1, as we're only using "copy" in this function
  Z_SET_REFCOUNT_P(copy, 1);

  php_printf("called (the c version of) not_by_ref(%d)\n", (int)Z_LVAL_P(copy));
  ZVAL_LONG(copy, 2);

  zval_ptr_dtor(&copy);
}
Note that we set the refcount of copy to 1. This is because the refcount for zptr is 2: 1 ref from the calling function + 1 ref from the not_by_ref function. However, we don’t want the copy of zptr to have a refcount of 2, because it’s only being used by the current function.
Also note that memcpy-ing the zval only works because this is a scalar: if this were an array or object, we’d have to use PHP API functions to make a deep copy of the original.
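To see why memcpy stops being enough once a value owns other memory, here is a minimal sketch in plain C. It deliberately does not use the Zend API: toy_value, toy_shallow_copy, and toy_deep_copy are hypothetical names invented for illustration, standing in for a value that points at heap data the way an array or object zval does.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a value that owns heap memory; NOT the real zval struct. */
typedef struct {
    long  len;
    char *buf; /* heap-allocated, owned by this value */
} toy_value;

/* Shallow copy: duplicates the struct, so both copies share one buffer. */
toy_value toy_shallow_copy(const toy_value *src) {
    toy_value dst;
    memcpy(&dst, src, sizeof(toy_value));
    return dst;
}

/* Deep copy: duplicates the buffer too, so the copies are independent. */
toy_value toy_deep_copy(const toy_value *src) {
    toy_value dst;
    dst.len = src->len;
    dst.buf = malloc((size_t)src->len);
    if (dst.buf != NULL) {
        memcpy(dst.buf, src->buf, (size_t)src->len);
    }
    return dst;
}
```

Writing through the shallow copy's buf changes the original as well, which is exactly the behavior we were trying to avoid. For a plain long there is nothing behind a pointer to share, so the memcpy is safe.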
If we run our PHP program again, it gives us:
x is 1
called (the c version of) not_by_ref(1)
x is 1
called (the c version of) by_ref(1)
x is 3
Okay, this is pretty good… but we’re actually missing a case. What happens if we pass in a reference to not_by_ref()? In PHP, this looks like:
function not_by_ref($arg) {
    $arg = 2;
}

$x = 1;
not_by_ref(&$x);
display($x);
…which displays “x is 2”. Unfortunately, we’ve overridden this behavior in our not_by_ref() C function, so we have to special case: if this is a reference, change its value, otherwise make a copy and change the copy’s value.
PHP_FUNCTION(not_by_ref) {
  zval *zptr = 0, *copy = 0;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
    return;
  }

  // NEW CODE
  if (Z_ISREF_P(zptr)) {
    // if this is a reference, make copy point to zptr
    copy = zptr;

    // adding a reference so we can indiscriminately delete copy later
    zval_add_ref(&zptr);
  }
  // OLD CODE
  else {
    // make a copy
    MAKE_STD_ZVAL(copy);
    memcpy(copy, zptr, sizeof(zval));

    // set refcount to 1, as we're only using "copy" in this function
    Z_SET_REFCOUNT_P(copy, 1);
  }

  php_printf("called (the c version of) not_by_ref(%d)\n", (int)Z_LVAL_P(copy));
  ZVAL_LONG(copy, 2);

  zval_ptr_dtor(&copy);
}
Now it’ll behave “properly.”
There may be a better way to do this; please leave a comment if you know of one. As far as I know, though, this is the only way to emulate the PHP reference behavior.
If you would like to read more about PHP references, Derick Rethans wrote a great article on it for PHP Architect.
Playing with Virtual Memory
Aug 30th

Linux: the developer's personal gentleman
When you run a process, it needs some memory to store things: its heap, its stack, and any libraries it’s using. Linux provides and cleans up memory for your process like an extremely conscientious butler. You can (and generally should) just let Linux do its thing, but it’s a good idea to understand the basics of what’s going on.
One easy way (I think) to understand this stuff is to actually look at what’s going on using the pmap command. pmap shows you memory information for a given process.
For example, let’s take a really simple C program that prints its own process id (PID) and pauses:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main() {
  printf("run `pmap %d`\n", getpid());
  pause();
}
Save this as mem_munch.c. Now compile and run it with:
$ gcc mem_munch.c -o mem_munch
$ ./mem_munch
run `pmap 25681`
The PID you get will probably be different than mine (25681).
At this point, the program will “hang.” This is because of the pause() function, and it’s exactly what we want. Now we can look at the memory for this process at our leisure.
Open up a new shell and run pmap, replacing the PID below with the one mem_munch gave you:
$ pmap 25681
25681:   ./mem_munch
0000000000400000      4K r-x--  /home/user/mem_munch
0000000000600000      4K r----  /home/user/mem_munch
0000000000601000      4K rw---  /home/user/mem_munch
00007fcf5af88000   1576K r-x--  /lib/x86_64-linux-gnu/libc-2.13.so
00007fcf5b112000   2044K -----  /lib/x86_64-linux-gnu/libc-2.13.so
00007fcf5b311000     16K r----  /lib/x86_64-linux-gnu/libc-2.13.so
00007fcf5b315000      4K rw---  /lib/x86_64-linux-gnu/libc-2.13.so
00007fcf5b316000     24K rw---    [ anon ]
00007fcf5b31c000    132K r-x--  /lib/x86_64-linux-gnu/ld-2.13.so
00007fcf5b512000     12K rw---    [ anon ]
00007fcf5b539000     12K rw---    [ anon ]
00007fcf5b53c000      4K r----  /lib/x86_64-linux-gnu/ld-2.13.so
00007fcf5b53d000      8K rw---  /lib/x86_64-linux-gnu/ld-2.13.so
00007fff7efd8000    132K rw---    [ stack ]
00007fff7efff000      4K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total             3984K
This output is how memory “looks” to the mem_munch process. If mem_munch asks the operating system for 00007fcf5af88000, it will get libc. If it asks for 00007fcf5b31c000, it will get the ld library.
This output is a bit dense and abstract, so let’s look at how some more familiar memory usage shows up. Change our program to put some memory on the stack and some on the heap, then pause.
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdlib.h>

int main() {
  int on_stack, *on_heap;

  // local variables are stored on the stack
  on_stack = 42;
  printf("stack address: %p\n", &on_stack);

  // malloc allocates heap memory
  on_heap = (int*)malloc(sizeof(int));
  printf("heap address: %p\n", on_heap);

  printf("run `pmap %d`\n", getpid());
  pause();
}
Now compile and run it:
$ ./mem_munch
stack address: 0x7fff497670bc
heap address: 0x1b84010
run `pmap 11972`
Again, your exact numbers will probably be different than mine.
Before you kill mem_munch, run pmap on it:
$ pmap 11972
11972:   ./mem_munch
0000000000400000      4K r-x--  /home/user/mem_munch
0000000000600000      4K r----  /home/user/mem_munch
0000000000601000      4K rw---  /home/user/mem_munch
0000000001b84000    132K rw---    [ anon ]
00007f3ec4d98000   1576K r-x--  /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec4f22000   2044K -----  /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec5121000     16K r----  /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec5125000      4K rw---  /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec5126000     24K rw---    [ anon ]
00007f3ec512c000    132K r-x--  /lib/x86_64-linux-gnu/ld-2.13.so
00007f3ec5322000     12K rw---    [ anon ]
00007f3ec5349000     12K rw---    [ anon ]
00007f3ec534c000      4K r----  /lib/x86_64-linux-gnu/ld-2.13.so
00007f3ec534d000      8K rw---  /lib/x86_64-linux-gnu/ld-2.13.so
00007fff49747000    132K rw---    [ stack ]
00007fff497bb000      4K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total             4116K
Note that there’s a new entry between the final mem_munch section and libc-2.13.so. What could that be?
# from pmap
0000000001b84000 132K rw--- [ anon ]
# from our program
heap address: 0x1b84010
The addresses are almost the same. That block ([ anon ]) is the heap. (pmap labels blocks of memory that aren’t backed by a file [ anon ]. We’ll get into what being “backed by a file” means in a sec.)
The second thing to notice:
# from pmap
00007fff49747000 132K rw--- [ stack ]
# from our program
stack address: 0x7fff497670bc
And there’s your stack!
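Matching an address to a mapping by eye can also be done in code. The sketch below assumes Linux, where /proc/self/maps lists a process's own mappings in much the same form pmap prints them. find_mapping is a hypothetical helper name, not something from the program above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find the mapping that contains addr by scanning /proc/self/maps (Linux
 * only). On a match, copies the line's last field (a file path, a label
 * such as "[heap]" or "[stack]", or an empty string) into label and
 * returns 1. Returns 0 if no mapping contains addr, -1 if the file can't
 * be opened.
 */
int find_mapping(const void *addr, char *label, size_t label_size) {
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[512];
    unsigned long target = (unsigned long)addr;

    if (maps == NULL) {
        return -1;
    }
    while (fgets(line, sizeof(line), maps) != NULL) {
        unsigned long start, end;
        char name[256] = "";

        /* each line: start-end perms offset dev inode [name] */
        if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %255[^\n]",
                   &start, &end, name) < 2) {
            continue;
        }
        if (target >= start && target < end) {
            snprintf(label, label_size, "%s", name);
            fclose(maps);
            return 1;
        }
    }
    fclose(maps);
    return 0;
}
```

Calling find_mapping(&on_stack, ...) and find_mapping(on_heap, ...) from mem_munch would typically report [stack] and [heap]. The kernel labels the heap itself here, whereas this version of pmap shows it as [ anon ].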
One other important thing to notice: this is how memory “looks” to your program, not how memory is actually laid out on your physical hardware. Look at how much memory mem_munch has to work with. According to pmap, mem_munch can address memory between address 0x0000000000400000 and 0xffffffffff600000 (well, actually 0x00007fffffffffff; beyond that is special). For those of you playing along at home, that full range is about 16 million terabytes of memory. That’s a lot of memory. (If your computer has that kind of memory, please leave your address and times you won’t be at home.)
So, the amount of memory the program can address is kind of ridiculous. Why does the computer do this? Well, lots of reasons, but one important one is that this means you can address more memory than you actually have on the machine and let the operating system take care of making sure the right stuff is in memory when you try to access it.
Memory Mapped Files
Memory mapping a file basically tells the operating system to load the file so the program can access it as an array of bytes. Then you can treat a file like an in-memory array.
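As a minimal sketch of "treating a file like an in-memory array," the function below maps a file read-only and adds up its bytes with plain array indexing. The name sum_file_bytes is made up for illustration, and the error handling is deliberately simple.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map the file at path read-only and add up its bytes by indexing the
 * mapping like an ordinary array. Returns -1 on any error, including an
 * empty file (which cannot be mapped).
 */
long sum_file_bytes(const char *path) {
    struct stat st;
    unsigned char *bytes;
    long sum = 0;
    off_t i;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        return -1;
    }
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    bytes = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (bytes == MAP_FAILED) {
        return -1;
    }
    for (i = 0; i < st.st_size; i++) {
        sum += bytes[i]; /* no read() calls: the file is just memory */
    }
    munmap(bytes, (size_t)st.st_size);
    return sum;
}
```

The loop never calls read(). The operating system pulls each piece of the file in the first time the program touches it.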
For example, let’s make a (pretty stupid) random number generator by creating a file full of random numbers, then mmap-ing it and reading off random numbers.
First, we’ll create a big file called random (note that this creates a 1GB file, so make sure you have the disk space and be patient, it’ll take a little while to write):
$ dd if=/dev/urandom bs=1024 count=1000000 of=/home/user/random
1000000+0 records in
1000000+0 records out
1024000000 bytes (1.0 GB) copied, 123.293 s, 8.3 MB/s
$ ls -lh random
-rw-r--r-- 1 user user 977M 2011-08-29 16:46 random
Now we’ll mmap random and use it to generate random numbers.
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdlib.h>
#include <sys/mman.h>

int main() {
  char *random_bytes;
  FILE *f;
  int offset = 0;

  // open "random" for reading
  f = fopen("/home/user/random", "r");
  if (!f) {
    perror("couldn't open file");
    return -1;
  }

  // we want to inspect memory before mapping the file
  printf("run `pmap %d`, then press <enter>", getpid());
  getchar();

  random_bytes = mmap(0, 1000000000, PROT_READ, MAP_SHARED, fileno(f), 0);

  if (random_bytes == MAP_FAILED) {
    perror("error mapping the file");
    return -1;
  }

  while (1) {
    printf("random number: %d (press <enter> for next number)", *(int*)(random_bytes+offset));
    getchar();

    offset += 4;
  }
}
If we run this program, we’ll get something like:
$ ./mem_munch run `pmap 12727`, then press <enter>
The program hasn’t done anything yet, so the output of running pmap will basically be the same as it was above (I’ll omit it for brevity). However, if we continue running mem_munch by pressing enter, our program will mmap random.
Now if we run pmap it will look something like:
$ pmap 12727
12727:   ./mem_munch
0000000000400000      4K r-x--  /home/user/mem_munch
0000000000600000      4K r----  /home/user/mem_munch
0000000000601000      4K rw---  /home/user/mem_munch
000000000147d000    132K rw---    [ anon ]
00007fe261c6f000 976564K r--s-  /home/user/random
00007fe29d61c000   1576K r-x--  /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d7a6000   2044K -----  /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d9a5000     16K r----  /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d9a9000      4K rw---  /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d9aa000     24K rw---    [ anon ]
00007fe29d9b0000    132K r-x--  /lib/x86_64-linux-gnu/ld-2.13.so
00007fe29dba6000     12K rw---    [ anon ]
00007fe29dbcc000     16K rw---    [ anon ]
00007fe29dbd0000      4K r----  /lib/x86_64-linux-gnu/ld-2.13.so
00007fe29dbd1000      8K rw---  /lib/x86_64-linux-gnu/ld-2.13.so
00007ffff29b2000    132K rw---    [ stack ]
00007ffff29de000      4K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total           980684K
This is very similar to before, but with an extra line (the one for /home/user/random), which kicks up virtual memory usage a bit (from 4MB to 980MB).
However, let’s re-run pmap with the -x option. This shows the resident set size (RSS): only 4KB of random are resident. Resident memory is memory that’s actually in RAM. There’s very little of random in RAM because we’ve only accessed the very start of the file, so the OS has only pulled the first bit of the file from disk into memory.
$ pmap -x 12727
12727:   ./mem_munch
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000       0       4       0 r-x--  mem_munch
0000000000600000       0       4       4 r----  mem_munch
0000000000601000       0       4       4 rw---  mem_munch
000000000147d000       0       4       4 rw---    [ anon ]
00007fe261c6f000       0       4       0 r--s-  random
00007fe29d61c000       0     288       0 r-x--  libc-2.13.so
00007fe29d7a6000       0       0       0 -----  libc-2.13.so
00007fe29d9a5000       0      16      16 r----  libc-2.13.so
00007fe29d9a9000       0       4       4 rw---  libc-2.13.so
00007fe29d9aa000       0      16      16 rw---    [ anon ]
00007fe29d9b0000       0     108       0 r-x--  ld-2.13.so
00007fe29dba6000       0      12      12 rw---    [ anon ]
00007fe29dbcc000       0      16      16 rw---    [ anon ]
00007fe29dbd0000       0       4       4 r----  ld-2.13.so
00007fe29dbd1000       0       8       8 rw---  ld-2.13.so
00007ffff29b2000       0      12      12 rw---    [ stack ]
00007ffff29de000       0       4       0 r-x--    [ anon ]
ffffffffff600000       0       0       0 r-x--    [ anon ]
----------------  ------  ------  ------
total kB          980684     508     100
If the virtual memory size (the Kbytes column) is all 0s for you, don’t worry about it. That’s a bug in Debian/Ubuntu’s -x option. The total is correct, it just doesn’t display correctly in the breakdown.
You can see that the resident set size, the amount that’s actually in memory, is tiny compared to the virtual memory. Your program can access any memory within a billion bytes of 0x00007fe261c6f000, but if it accesses anything past 4KB, it’ll probably have to go to disk for it*.
What if we modify our program so it reads the whole file/array of bytes?
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <stdlib.h>
#include <sys/mman.h>

int main() {
  char *random_bytes;
  FILE *f;
  int offset = 0;

  // open "random" for reading
  f = fopen("/home/user/random", "r");
  if (!f) {
    perror("couldn't open file");
    return -1;
  }

  random_bytes = mmap(0, 1000000000, PROT_READ, MAP_SHARED, fileno(f), 0);

  if (random_bytes == MAP_FAILED) {
    printf("error mapping the file\n");
    return -1;
  }

  for (offset = 0; offset < 1000000000; offset += 4) {
    int i = *(int*)(random_bytes+offset);

    // to show we're making progress
    if (offset % 1000000 == 0) {
      printf(".");
    }
  }

  // at the end, wait for signal so we can check mem
  printf("\ndone, run `pmap -x %d`\n", getpid());
  pause();
}
Now the resident set size is almost the same as the virtual memory size:
$ pmap -x 5378
5378:   ./mem_munch
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000       0       4       4 r-x--  mem_munch
0000000000600000       0       4       4 r----  mem_munch
0000000000601000       0       4       4 rw---  mem_munch
0000000002271000       0       4       4 rw---    [ anon ]
00007fc2aa333000       0  976564       0 r--s-  random
00007fc2e5ce0000       0     292       0 r-x--  libc-2.13.so
00007fc2e5e6a000       0       0       0 -----  libc-2.13.so
00007fc2e6069000       0      16      16 r----  libc-2.13.so
00007fc2e606d000       0       4       4 rw---  libc-2.13.so
00007fc2e606e000       0      16      16 rw---    [ anon ]
00007fc2e6074000       0     108       0 r-x--  ld-2.13.so
00007fc2e626a000       0      12      12 rw---    [ anon ]
00007fc2e6290000       0      16      16 rw---    [ anon ]
00007fc2e6294000       0       4       4 r----  ld-2.13.so
00007fc2e6295000       0       8       8 rw---  ld-2.13.so
00007fff037e6000       0      12      12 rw---    [ stack ]
00007fff039c9000       0       4       0 r-x--    [ anon ]
ffffffffff600000       0       0       0 r-x--    [ anon ]
----------------  ------  ------  ------
total kB          980684  977072     104
Now if we access any part of the file, it will be in RAM already. (Probably. Until something else kicks it out.) So, our program can access a gigabyte of memory, but the operating system can lazily load it into RAM as needed.
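A program can also ask the kernel directly which pages of a mapping are in RAM, instead of reading pmap -x output. The sketch below uses mincore(), which the examples above don't use, so treat it as a side note. It assumes Linux with glibc, and resident_pages is a made-up helper name.

```c
#define _DEFAULT_SOURCE /* exposes mincore() in glibc */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many pages of the mapping [addr, addr + len) are resident in
 * RAM. addr must be page-aligned (anything returned by mmap is).
 * Returns -1 on error.
 */
long resident_pages(void *addr, size_t len) {
    long page_size = sysconf(_SC_PAGESIZE);
    size_t pages = (len + (size_t)page_size - 1) / (size_t)page_size;
    unsigned char *vec = malloc(pages);
    long resident = 0;
    size_t i;

    if (vec == NULL) {
        return -1;
    }
    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }
    for (i = 0; i < pages; i++) {
        resident += vec[i] & 1; /* low bit set means the page is in RAM */
    }
    free(vec);
    return resident;
}
```

Multiplying the result by the page size gives a figure comparable to the RSS column for that one mapping. Calling it on random_bytes before and after the read loop should show the count climbing toward the full size of the file.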
And that’s why your virtual memory is so damn high when you’re running MongoDB.
Left as an exercise to the reader: try running pmap on a mongod process before it’s done anything, once you’ve done a couple operations, and once it’s been running for a long time.
* This isn’t strictly true**. The kernel actually says, “If they want the first N bytes, they’re probably going to want some more of the file,” so it’ll load, say, the first dozen KB of the file into memory but only tell the process about 4KB. When your program touches memory that is already in RAM, but that the process didn’t know was in RAM, it’s called a minor page fault (as opposed to a major page fault, when it actually has to hit disk to load new info).
** This note is also not strictly true. In fact, the whole file will probably be in memory before you map anything because you just wrote the thing with dd. So you’ll just be doing minor page faults as your program “discovers” it.
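If you want to watch the two kinds of page fault from inside a program, the kernel keeps running counts of both. A minimal sketch, assuming a POSIX system that fills in the ru_minflt and ru_majflt fields of getrusage() (Linux does). fault_counts and read_fault_counts are hypothetical names.

```c
#include <sys/resource.h>

/* Snapshot of the page-fault counters the kernel keeps for this process. */
typedef struct {
    long minor; /* faults satisfied without disk I/O */
    long major; /* faults that had to read from disk */
} fault_counts;

/* Read the current counters; returns 0 on success, -1 on error. */
int read_fault_counts(fault_counts *out) {
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
    out->minor = usage.ru_minflt;
    out->major = usage.ru_majflt;
    return 0;
}
```

Taking one snapshot before the read loop and one after, then subtracting, would show how many of the faults were minor and how many actually went to disk.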
Setting Up Your Interview Toolbox
Dec 9th
This post covers a couple “toolbox” topics that are easy to brush up on before the technical interview.
I recently read a post that drove me nuts, written by someone looking for a job. They said:
I can’t seem to crack the on-site coding interviews… [Interviews are geared towards] those who can suavely implement a linked list code library (inserting, deleting, reversing) as well as a data structure using that linked list (i.e. a stack) on a white board, no syntax errors, compilable, all error paths covered, interfaces cleanly buttoned up. Lather, rinse, repeat for binary search trees and sorting algorithms.
These are a programmer’s multiplication tables! If someone asked me “what’s 6×15?” in an interview, I wouldn’t throw my hands up and complain that I learned it 20 years ago, I’d be fucking thrilled that they had given me such a softball question.
Believe me, if you can’t figure out my basic algorithm questions, you do not want me to ask my “fun” questions.
If you’re looking for a job, I’d recommend accepting that interviewers want to see you know your multiplication tables and spend a few hours cramming if you need to. Make sure you have a basic toolbox set up in your brain, covering a couple basic categories:
- Data structures: hashes, lists, trees – know how to implement them and common manipulations and searches.
- Algorithms: sorts, recursion, search – simple algorithm problems. “Algorithms” covers a lot of ground, but at the very least know how to do the basic sorts (merge, quick, selection), recursion, and tree searches. They come up a lot. Also, make sure you know when to apply them (or they won’t be very useful).
- Bit twiddling – this is mainly for C and C++ positions. I like to see if people know how to manipulate their bits (ooh la la). This varies by company, though: I doubt a Web 2.0 site is going to care that you know your bit shifts backwards and forwards (or, rather, left and right).
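For a sense of scale, here is roughly what two of these toolbox items look like in C: reversing a singly linked list, and counting the set bits in an integer. This is a sketch of the kind of answer that fits on a whiteboard, not a specific interview question. The names node, list_reverse, and count_set_bits are made up for illustration.

```c
#include <stddef.h>

/* Singly linked list node holding an int. */
struct node {
    int value;
    struct node *next;
};

/* Reverse a list in place and return the new head. */
struct node *list_reverse(struct node *head) {
    struct node *prev = NULL;

    while (head != NULL) {
        struct node *next = head->next; /* remember the rest of the list */
        head->next = prev;              /* point this node backwards */
        prev = head;
        head = next;
    }
    return prev;
}

/* Count set bits: each pass clears the lowest set bit. */
int count_set_bits(unsigned int x) {
    int count = 0;

    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
}
```

Both fit in about a dozen lines, which is the point: a few hours of practice is enough to write them without hesitating.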
If you are applying for a language-specific job, the interviewer will probably ask you about some specifics. A good interviewer shouldn’t try to trap you with obscure language trivia, but make sure you’re familiar with the basics. So, if you’re applying for, say, a Java position, get comfortable with java.lang, java.util, how garbage collection works, basic synchronization, and know that Strings are immutable.
Protip: when I was looking for a job, every single place I interviewed asked me about Java’s public/protected/private keywords. Nearly all of them asked about final, too.
Don’t freak out if you get up to the board and can’t remember whether it’s foo.toString() or (String)foo, or if you forget a semicolon. Any reasonable interviewer knows that it’s hard to program on a whiteboard and doesn’t expect compiler-ready code. On the other hand, if your resume says you’ve been doing C for 10 years and you allocate an array of chars as char *x[], we expect you to laugh and understand your mistake when we point it out (I know I might do something like that out of nerves, so I wouldn’t hold it against you as long as you understood the problem).
Good luck out there. Remember that, if a company brings you in for an interview, they want to hire you. Do everything you can to let them!
MapReduce – The Fanfiction
Mar 15th
MapReduce is really cool, useful, and powerful, but a lot of people find it hard to wrap their heads around. This post is a fairly silly, non-technical explanation using Star Trek.
The Enterprise found a new planet, as it tends to do.
Kirk wanted to beam down immediately and start surveying the planet but Spock told him to wait a moment. “It usually takes us one hour to survey a planet, correct, Captain? In less than 5 minutes, I can calculate whether the chance of encountering friendly alien females outweighs the risk of attack by brain-eating monsters.”
“Interesting idea, Spock,” said Kirk. “Go ahead.”
The Data
“Logically,” thought Spock, “if we can survey a whole planet in one hour, we can survey 1/16th of a planet in 3.75 minutes.” Spock divided the planet into 16 equal-size pieces and summoned 16 red shirts.

“You’ll be beamed down to the surface of the planet with this special data collection device called an ‘emitter.’ If you see a brain-eating monster, you press the ‘brain-eating monster’ button on your emitter. If you see an attractive female alien, you press the ‘hot alien chick’ button. Press either, neither, or both buttons, as your situation requires.”
The Map Step
The 16 red shirts were beamed down to the 16 parts of the planet. As they found things, they would press the buttons on their emitter.
Back on the Enterprise, Spock started getting lots of data pairs that looked like:
| type                 | location |
|----------------------|----------|
| Brain-eating monster | 2        |
| Hot alien chick      | 7        |
| Brain-eating monster | 14       |
| Brain-eating monster | 7        |
The Reduce Step
“Computer,” Spock said. “Initialize a counter to 0 for each new type you get. Then, for every subsequent data pair with the same type, increment that counter.”
“I dinnae understand,” said Scotty. “What’s that, then?”
“I basically told the computer to initialize two variables, ‘Brain-eating monster’ and ‘Hot alien chick’, setting them both to zero. Every time the computer gets a ‘Brain-eating monster’ emit, it increments the ‘Brain-eating monster’ variable. Every time it gets a ‘Hot alien chick’ emit, it increments the ‘Hot alien chick’ variable.”
“Ah, I see,” said Scotty. “But don’t you lose the location information?”
“Yes,” replied Spock. “But I don’t actually care about location for this readout. If I wanted the location, I could give the computer a slightly more complicated algorithm, but right now I just want the count.”
The Result
After 3.75 minutes, Spock beamed up the red shirts who were still alive and presented to Kirk: “There are brain-eating monsters on 7/8ths of the planet, Captain. 1/16 of the planet has hot alien chicks.”
“Excellent work, Spock,” said Kirk. “Let’s boldly go somewhere else.”
And so they did.
Captain’s log, star date 1419.7 (aka a summary of what we did)
- Goal – To generate a report on a planet.
- Data – 16 pieces of land with various attributes. Each piece of land could be represented by a JSON object such as:
{
    "location" : 5,
    "contains" : ["Brain-eating monsters", "rocks", "poison gas"]
}
- Map – Send attributes for each piece of data back to the processor. In JSON, each emit would look something like:
{ "Brain-eating monsters" : 5 }
- Reduce – Sum up the data, grouping by type
- Result – How much of each attribute is on the planet
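For readers who would rather see the reduce step as code than as Starfleet procedure, here is a minimal sketch in C. It is not how any particular MapReduce framework works. The emit and tally structs and the reduce_counts function are invented for illustration, and everything runs in one process instead of being spread across 16 red shirts.

```c
#include <string.h>

/* One emit from a red shirt: what was seen, and in which piece of land. */
struct emit {
    const char *type;
    int location;
};

/* Running count for one type. */
struct tally {
    const char *type;
    int count;
};

/*
 * Reduce step: the first time a type shows up, start a counter for it at
 * zero; every emit of that type (including the first) increments it.
 * Location is ignored, as in the story. Returns the number of distinct
 * types, or -1 if tallies has too few slots.
 */
int reduce_counts(const struct emit *emits, int n,
                  struct tally *tallies, int max_tallies) {
    int used = 0;
    int i, j;

    for (i = 0; i < n; i++) {
        /* look for an existing counter for this type */
        for (j = 0; j < used; j++) {
            if (strcmp(tallies[j].type, emits[i].type) == 0) {
                break;
            }
        }
        if (j == used) { /* new type: initialize its counter */
            if (used == max_tallies) {
                return -1;
            }
            tallies[used].type = emits[i].type;
            tallies[used].count = 0;
            used++;
        }
        tallies[j].count++;
    }
    return used;
}
```

Fed the four data pairs from Spock's readout, this produces a count of 3 for "Brain-eating monster" and 1 for "Hot alien chick".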
Further reading: Kyle Banker has an excellent (and more technical) explanation of MapReduce.
Bug Reporting: A How-To
Mar 10th
This type of bug report drives me nuts:
You have a critical bug! This crashes Java!
for (int i=0; i<10; i++) { cursor.next(); }
(I’ve never gotten this exact report and I’m not picking on anyone, it’s a composite.)
This doesn’t crash for me. It doesn’t even compile for me because the variable “cursor” isn’t defined. If you’re going to use a variable (a function, a framework, etc.) in a code snippet, you have to define it. Let’s try again:
Mongo m = new Mongo();
DB db = m.getDB("bar");
DBCollection coll = db.getCollection("foo");
DBCursor cursor = coll.find();

for (int i=0; i<10; i++) {
    cursor.next();
}
Better! But this is probably crashing because of something in your database. Unless it crashes regardless of dataset, you need to send me the data that makes it crash. The basic rule is:
The faster I can reproduce your bug, the faster I can fix it.
Some other tips for submitting bug reports:
- Make sure to include information about your environment. The more the merrier.
- If I ask for log messages, please send me the entire log. If it’s been running for days and the log’s a zillion lines long, send everything from around the time the error happened (before, during, after). Please, please, please don’t skim the logs, extract a single suspicious-looking line, and send it to me. I promise that I’m not going to be mad about having to wade through a couple hundred (or thousand) lines to find what I’m looking for. I would rather quickly skim a bunch of extra info than pry the logs, line by line, from your clutches.
- If you are running on a non-traditional setup (e.g., a Linux kernel you modified yourself on a mainframe from 1972), it would be super helpful if you can give me access to your machine. If your bug is platform-specific to ENIAC, it’s doubtful I’m going to be able to figure out what’s going wrong on my MacBook Air.
And, of course, flatter the developer’s ego when they’ve fixed the bug. Not applicable for me, of course, but other developers like it when you give them a little “Thanks, you’re awesome!” biscuit when they’ve fixed your bug.
To my users: thank you to everyone who has ever filed a bug report. I’m really, really happy that you’re using my software and that you care enough about it to submit a bug, instead of just giving up. Seriously, thank you. I have to give a shout-out to Perl developers in particular, here. More than half the time, people reporting bugs in the MongoDB Perl driver actually include a patch in the bug report that fixes it! I love you guys.

