To request addition or removal, please email sipb-www at mit.edu. Planet is updated every thirty minutes.
Adam Belay (of Dune fame) was recently wondering why Haskell’s MVars are so slow. “Slow?” I thought, “aren’t Haskell’s MVars supposed to be really fast?” So I did some digging around how MVars worked, to see if I could explain.
Let’s consider the operation of the function takeMVar in Control.Concurrent.MVar. This function is very simple, it unpacks MVar to get the underlying MVar# primitive value, and then calls the primop takeMVar#:
takeMVar :: MVar a -> IO a takeMVar (MVar mvar#) = IO $ \ s# -> takeMVar# mvar# s#
Primops result in the invocation of stg_takeMVarzh in PrimOps.cmm, which is where the magic happens. For simplicity, we consider only the multithreaded case.
The first step is to lock the closure:
("ptr" info) = ccall lockClosure(mvar "ptr");
Objects on the GHC heap have an info table header which indicates what kind of object they are, by pointing to the relevant info table for the object. These headers are also used for synchronization: since they are word-sized, they can be atomically swapped for other values. lockClosure is in fact a spin-lock on the info table header:
EXTERN_INLINE StgInfoTable *lockClosure(StgClosure *p)
{
StgWord info;
do {
nat i = 0;
do {
info = xchg((P_)(void *)&p->header.info, (W_)&stg_WHITEHOLE_info);
if (info != (W_)&stg_WHITEHOLE_info) return (StgInfoTable *)info;
} while (++i < SPIN_COUNT);
yieldThread();
} while (1);
}
lockClosure is used for some other objects, namely thread state objects (stg_TSO_info, via lockTSO) and thread messages i.e. exceptions (stg_MSG_THROWTO_info, stg_MSG_NULL_info).
The next step is to apply a GC write barrier on the MVar:
if (info == stg_MVAR_CLEAN_info) {
ccall dirty_MVAR(BaseReg "ptr", mvar "ptr");
}
As I’ve written before, as the MVar is a mutable object, it can be mutated to point to objects in generation 0; thus, when a mutation happens, it has to be added to the root set via the mutable list. Since mutable is per capability, this boils down into a bunch of pointer modifications, and does not require any synchronizations. Note that we will need to add the MVar to the mutable list, even if we end up blocking on it, because the MVar is a retainer of the thread (TSO) which is blocked on it! (However, I suspect in some cases we can get away with not doing this.)
Next, we case split depending on whether or not the MVar is full or empty. If the MVar is empty, we need to block the thread until the MVar is full:
/* If the MVar is empty, put ourselves on its blocking queue,
* and wait until we're woken up.
*/
if (StgMVar_value(mvar) == stg_END_TSO_QUEUE_closure) {
// We want to put the heap check down here in the slow path,
// but be careful to unlock the closure before returning to
// the RTS if the check fails.
ALLOC_PRIM_WITH_CUSTOM_FAILURE
(SIZEOF_StgMVarTSOQueue,
unlockClosure(mvar, stg_MVAR_DIRTY_info);
GC_PRIM_P(stg_takeMVarzh, mvar));
q = Hp - SIZEOF_StgMVarTSOQueue + WDS(1);
SET_HDR(q, stg_MVAR_TSO_QUEUE_info, CCS_SYSTEM);
StgMVarTSOQueue_link(q) = END_TSO_QUEUE;
StgMVarTSOQueue_tso(q) = CurrentTSO;
if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) {
StgMVar_head(mvar) = q;
} else {
StgMVarTSOQueue_link(StgMVar_tail(mvar)) = q;
ccall recordClosureMutated(MyCapability() "ptr",
StgMVar_tail(mvar));
}
StgTSO__link(CurrentTSO) = q;
StgTSO_block_info(CurrentTSO) = mvar;
StgTSO_why_blocked(CurrentTSO) = BlockedOnMVar::I16;
StgMVar_tail(mvar) = q;
jump stg_block_takemvar(mvar);
}
A useful thing to know when decoding C-- primop code is that StgTSO_block_info(...) and its kin are how we spell field access on objects. C-- doesn’t know anything about C struct layout, and so these “functions” are actually macros generated by utils/deriveConstants. Blocking a thread consists of three steps:
If the MVar is full, then we can go ahead and take the value from the MVar.
/* we got the value... */ val = StgMVar_value(mvar);
But that’s not all. If there are other blocked putMVars on the MVar (remember, when a thread attempts to put an MVar that is already full, it blocks until the MVar empties out), then we should immediately unblock one of these threads so that the MVar can always be left in a full state:
q = StgMVar_head(mvar);
loop:
if (q == stg_END_TSO_QUEUE_closure) {
/* No further putMVars, MVar is now empty */
StgMVar_value(mvar) = stg_END_TSO_QUEUE_closure;
unlockClosure(mvar, stg_MVAR_DIRTY_info);
return (val);
}
if (StgHeader_info(q) == stg_IND_info ||
StgHeader_info(q) == stg_MSG_NULL_info) {
q = StgInd_indirectee(q);
goto loop;
}
There is one interesting thing about the code that checks for blocked threads, and that is the check for indirectees (stg_IND_info). Under what circumstances would a queue object be stubbed out with an indirection? As it turns out, this occurs when we delete an item from the linked list. This is quite nice, because on a singly-linked list, we don't have an easy way to delete items unless we also have a pointer to the previous item. With this scheme, we just overwrite out the current item with an indirection, to be cleaned up next GC. (This, by the way, is why we can't just chain up the TSOs directly, without the extra linked list nodes. [1])
When we find some other threads, we immediately run them, so that the MVar never becomes empty:
// There are putMVar(s) waiting... wake up the first thread on the queue
tso = StgMVarTSOQueue_tso(q);
StgMVar_head(mvar) = StgMVarTSOQueue_link(q);
if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) {
StgMVar_tail(mvar) = stg_END_TSO_QUEUE_closure;
}
ASSERT(StgTSO_why_blocked(tso) == BlockedOnMVar::I16); // note: I16 means this is a 16-bit integer
ASSERT(StgTSO_block_info(tso) == mvar);
// actually perform the putMVar for the thread that we just woke up
W_ stack;
stack = StgTSO_stackobj(tso);
PerformPut(stack, StgMVar_value(mvar));
There is one detail here: PerformPut doesn’t actually run the thread, it just looks at the thread’s stack to figure out what it was going to put. Once the MVar is put, we then wake up the thread, so it can go on its merry way:
// indicate that the MVar operation has now completed. StgTSO__link(tso) = stg_END_TSO_QUEUE_closure; // no need to mark the TSO dirty, we have only written END_TSO_QUEUE. ccall tryWakeupThread(MyCapability() "ptr", tso); unlockClosure(mvar, stg_MVAR_DIRTY_info); return (val);
To sum up, when you takeMVar, you pay the costs of:
Adam and I puzzled about this a bit, and then realized the reason why the number of cycles was so large: our numbers are for roundtrips, and even with such lightweight synchronization (and lack of syscalls), you still have to go through the scheduler when all is said and done, and that blows up the number of cycles.
[1] It wasn’t always this way, see:
commit f4692220c7cbdadaa633f50eb2b30b59edb30183
Author: Simon Marlow <marlowsd@gmail.com>
Date: Thu Apr 1 09:16:05 2010 +0000
Change the representation of the MVar blocked queue
The list of threads blocked on an MVar is now represented as a list of
separately allocated objects rather than being linked through the TSOs
themselves. This lets us remove a TSO from the list in O(1) time
rather than O(n) time, by marking the list object. Removing this
linear component fixes some pathalogical performance cases where many
threads were blocked on an MVar and became unreachable simultaneously
(nofib/smp/threads007), or when sending an asynchronous exception to a
TSO in a long list of thread blocked on an MVar.
MVar performance has actually improved by a few percent as a result of
this change, slightly to my surprise.
This is the final cleanup in the sequence, which let me remove the old
way of waking up threads (unblockOne(), MSG_WAKEUP) in favour of the
new way (tryWakeupThread and MSG_TRY_WAKEUP, which is idempotent). It
is now the case that only the Capability that owns a TSO may modify
its state (well, almost), and this simplifies various things. More of
the RTS is based on message-passing between Capabilities now.
by Edward Z. Yang at May 20, 2013 12:00 AM
Ariel Rabkin has some code he'd like to verify, and at this year’s HotOS he appealed to participants of one “unconference” (informal breakout sessions to discuss various topics) to help him figure out what was really going on as far as formal verification went.
He had three questions: "What can we verify? What is impossible to verify? How can we tell that the verification is correct?" They seemed like impossibly large questions, so we drilled in a little bit, and found that Ariel had the very understandable question, "Say I have some C++ code implementing a subtle network protocol, and I'd like to prove that the protocol doesn't deadlock; how do I do that?"
I wish the formal verification community had a good answer to a question like this, but unfortunately we don't. The largest verification projects include things like verified kernels, which are written in fully specified subsets of C; which assume the translation performed by the compiler is correct, formalize C in a theorem prover, and then verify there. This is the "principled approach". It's just not feasible to take C or C++ in its entirety and try to formalize it; it's too complicated and too ill-specified. The easiest thing to do is formalize a small fragment of your algorithm and then make a hand-wavy argument that your implementation is adequate.
Martin Abadi remarked that before you embark on a verification project, you have to figure out where you'll get the most bang for your buck. Most of the time, a formalization won't get you "full correctness"; the "electrons may be faulty", as the case may be. But even a flawed verification forces you to state your assumptions explicitly, which is a good thing.
We then circled around to the subject of, well, what can be verified. Until the 90s, the formal verification community limited itself to only complete and sound analyses—and failed. The relaxation of this restriction lead to a renaissance of formal verification work. We talked about who was using formal verification, and the usual suspects showed up: safety critical software, cache coherence protocols (but one participant remarked that this was only a flash in the pan, as far as applications goes—he asserted that these companies are likely not going to use these methods any longer in the future), etc. Safety critical software is also likely to use coprocessors (since hardware failure is a very real issue), but Gernot Heiser noted that these folks are trying to get away from physical separation: it is expensive in terms of expense, weight and energy. Luckily, the costs of verification, as he recounted, are within a factor of two of normal industrial assurance, and half the cost of military assurance (though, he cautioned that this was for a specific project, and for a specific size of code.) He also remarked that as far as changes to code requiring changes to the proofs, the changes in the proofs seemed to be linear in the complexity (conceptual or implementation-wise) of the change, which is a good sign!
Well, supposing that you decide that you actually want to verify your software, how do you go about doing it? Unfortunately, it takes a completely different set of skills to build verified software versus normal software. Everyone agreed, "Yes, you need to hire a formal methods guy" if you're going to make any progress. But that's just not enough. The formal methods guy has to talk to the systems guy. Heiser recounted a very good experience hiring a formal methods person who was able to communicate with the other systems researchers working on the project; without this line of communication, he said, the project likely would have failed. And he mentioned another project, which had three times as much funding, but didn't accomplish nearly as much their team had. (Names not mentioned to protect the guilty.)
In the end, it seemed that we didn’t manage to give Ari a quite satisfactory answer. As one participant said, “You’ll probably learn the most by just sitting down and trying to formalize the thing you are interested in.” This is probably true, though I fear most will be scared off by the realization of how much work it actually takes to formalize software.
Hey guys, I’m liveblogging HotOS at my research Tumblr. The posts there are likely to be more fragmented than this, but if people are interested in any particular topics I can inflate them into full posts.
by Edward Z. Yang at May 14, 2013 04:58 AM
Christopher de Sa and I have been working on a category theoretic approach to optimizing MapReduce-like pipelines. Actually, we didn’t start with any category theory—we were originally trying to impose some structure on some of the existing loop optimizations that the Delite compiler performed, and along the way, we rediscovered the rich relationship between category theory and loop optimization.
On the one hand, I think the approach is pretty cool; but on the other hand, there’s a lot of prior work in the area, and it’s tough to figure out where one stands on the research landscape. As John Mitchell remarked to me when I was discussing the idea with him, “Loop optimization, can’t you just solve it using a table lookup?” We draw a lot of inspiration from existing work, especially the program calculation literature pioneered by Bird, Meertens, Malcom, Meijer and others in the early 90s. The purpose of this blog post is to air out some of the ideas we’ve worked out and get some feedback from you, gentle reader.
There are a few ways to think about what we are trying to do:
I’d like to illustrate some of these ideas by way of an example. Here is some sample code, written in Delite, which calculates an iteration of (1-dimensional) k-means clustering:
(0 :: numClusters, *) { j =>
val weightedPoints = sumRowsIf(0,m){i => c(i) == j}{i => x(i)};
val points = c.count(_ == j);
val d = if (points == 0) 1 else points
weightedPoints / d
}
You can read it as follows: we are computing a result array containing the position of each cluster, and the outermost block is looping over the clusters by index variable j. To compute the position of a cluster, we have to get all of the points in x which were assigned to cluster j (that’s the c(i) == j condition) and sum them together, finally dividing by the sum by the number of points in the cluster to get the true location.
The big problem with this code is that it iterates over the entire dataset numClusters times, when we’d like to only ever do one iteration. The optimized version which does just that looks like this:
val allWP = hashreduce(0,m)(i => c(i), i => x(i), _ + _)
val allP = hashreduce(0,m)(i => c(i), i => 1, _ + _)
(0::numClusters, *) { j =>
val weightedPoints = allWP(j);
val points = allP(j);
val d = if (points == 0) 1 else points
return weightedpoints / d
}
That is to say, we have to precompute the weighted points and the point count (note the two hashreduces can and should be fused together) before generating the new coordinates for each of the clusters: generating more intermediate data structures is a win, in this case.
Let us now calculate our way to the optimized version of the program. First, however, we have to define some functors:
There are a number of functions, which we will describe below:
Let us now rewrite the loop in more functional terms:
tabulate (\j ->
let weightedPoints = fold plus . filter (\i -> c[i] == j) $ x
points = fold inc . filter (\i -> c[i] == j) $ x
in divide (weightedPoints, points)
)
(Where divide is just a function which divides its arguments but checks that the divisor is not zero.) Eliminating some common sub-expressions and fusing the two folds together, we get:
tabulate (\j -> divide . fold (plus * inc) . filter (\i -> c[i] == j) $ x)
At this point, it is still not at all clear that there are any rewrites we can carry out: the filter is causing problems for us. However, because filter is testing on equality, we can rewrite it in a different way:
tabulate (\j -> divide . fold (plus * inc) . index j . bucket c $ x)
What is happening here? Rather than directly filtering for just items in cluster j, we can instead view this as bucketing x on c and then indexing out the single bucket we care about. This shift in perspective is key to the whole optimization.
Now we can apply the fundamental rule of natural transformations. Let phi = index j and f = divide . fold (plus * inc), then we can push f to the other side of phi:
tabulate (\j -> index j . fmap (divide . fold (plus * inc)) . bucket c $ x)
Now we can eliminate tabulate and index:
fmap (divide . fold (plus * inc)) . bucket c $ x
Finally, because we know how to efficiently implement fmap (fold f) . bucket c (as a hashreduce), we split up the fmap and join the fold and bucket:
fmap divide . hashreduce (plus * inc) c $ x
And we have achieved our fully optimized program.
All of this is research in progress, and there are lots of open questions which we have not resolved. Still, I hope this post has given you a flavor of the approach we are advocating. I am quite curious in your comments, from “That’s cool!” to “This was all done 20 years ago by X system.” Have at it!
by Edward Z. Yang at May 12, 2013 04:25 AM
Recursion and induction are closely related. When you were first taught recursion in an introductory computer science class, you were probably told to use induction to prove that your recursive algorithm was correct. (For the purposes of this post, let us exclude hairy recursive functions like the one in the Collatz conjecture which do not obviously terminate.) Induction suspiciously resembles recursion: the similarity comes from the fact that the inductive hypothesis looks a bit like the result of a “recursive call” to the theorem you are proving. If an ordinary recursive computation returns plain old values, you might wonder if an “induction computation” returns proof terms (which, by the Curry-Howard correspondence, could be thought of as a value).
As it turns out, however, when you look at recursion and induction categorically, they are not equivalent! Intuitively, the difference lies in the fact that when you are performing induction, the data type you are performing induction over (e.g. the numbers) appears at the type level, not the term level. In the words of a category theorist, both recursion and induction have associated initial algebras, but the carrier sets and endofunctors are different. In this blog post, I hope to elucidate precisely what the difference between recursion and induction is. Unfortunately, I need to assume some familiarity with initial algebras: if you don’t know what the relationship between a fold and an initial algebra is, check out this derivation of lists in initial algebra form.
When dealing with generalized abstract nonsense, the most important first step is to use a concrete example! So let us go with the simplest nontrivial data type one can gin up: the natural numbers (our examples are written in both Coq and Haskell, when possible):
Inductive nat : Set := (* defined in standard library *) | 0 : nat | S : nat -> nat. data Nat = Z | S Nat
Natural numbers are a pretty good example: even the Wikipedia article on F-algebras uses them. To recap, an F-algebra (or sometimes simply “algebra”) has three components: an (endo)functor f, a type a and a reduction function f a -> a. For simple recursion over natural numbers, we need to define a functor NatF which “generates” the natural numbers; then our type a is Nat and the reduction function is type NatF Nat -> Nat. The functor is defined as follows:
Inductive NatF (x : Set) : Set := | F0 : NatF x. | FS : x -> NatF x. data NatF x = FZ | FS x
Essentially, take the original definition but replace any recursive occurrence of the type with a polymorphic variable. As an exercise, show that NatF Nat -> Nat exists: it is the (co)product of () -> Nat and Nat -> Nat. The initiality of this algebra implies that any function of type NatF x -> x (for some arbitrary type x) can be used in a fold Nat -> x: this fold is the homomorphism from the initial algebra (NatF Nat -> Nat) to another algebra (NatF x -> x). The take-away point is that the initial algebra of natural numbers consists of an endofunctor over sets.

Let’s look at the F-algebra for induction now. As a first try, let’s try to use the same F-algebra and see if an appropriate homomorphism exists with the “type of induction”. (We can’t write this in Haskell, so now the examples will be Coq only.) Suppose we are trying to prove some proposition P : nat -> Prop holds for all natural numbers; then the type of the final proof term must be forall n : nat, P n. We can now write out the morphism of the algebra: NatF (forall n : nat, P n) -> forall n : nat, P n. But this “inductive principle” is both nonsense and not true:
Hint Constructors nat NatF. Goal ~ (forall (P : nat -> Prop), (NatF (forall n : nat, P n) -> forall n : nat, P n)). intro H; specialize (H (fun n => False)); auto. Qed.
(Side note: you might say that this proof fails because I’ve provided a predicate which is false over all natural numbers. But induction still “works” even when the predicate you’re trying to prove is false: you should fail when trying to provide the base case or inductive hypothesis!)
We step back and now wonder, “So, what’s the right algebra?” It should be pretty clear that our endofunctor is wrong. Fortunately, we can get a clue for what the right endofunctor might be by inspecting the type the induction principle for natural numbers:
(* Check nat_ind. *) nat_ind : forall P : nat -> Prop, P 0 -> (forall n : nat, P n -> P (S n)) -> forall n : nat, P n
P 0 is the type of the base case, and forall n : nat, P n -> P (S n) is the type of the inductive case. In much the same way that we defined NatF nat -> nat for natural numbers, which was the combination of zero : unit -> nat and succ : nat -> nat, we need to define a single function which combines the base case and the inductive case. This seems tough: the result types are not the same. But dependent types come to the rescue: the type we are looking for is:
fun (P : nat -> Prop) => forall n : nat, match n with 0 => True | S n' => P n' end -> P n
You can read this type as follows: I will give you a proof object of type P n for any n. If n is 0, I will give you this proof object with no further help (True -> P 0). However, if n is S n', I will require you to furnish me with P n' (P n' -> P (S n')).
We’re getting close. If this is the morphism of an initial algebra, then the functor IndF must be:
fun (P : nat -> Prop) => forall n : nat, match n with 0 => True | S n' => P n' end
What category is this a functor over? Unfortunately, neither this post nor my brain has the space to give a rigorous treatment, but roughly the category can be thought of as nat-indexed propositions. Objects of this category are of the form forall n : nat, P n, morphisms of the category are of the form forall n : nat, P n -> P' n. [1] As an exercise, show that identity and composition exist and obey the appropriate laws.
Something amazing is about to happen. We have defined our functor, and we are now in search of the initial algebra. As was the case for natural numbers, the initial algebra is defined by the least fixed point over the functor:
Fixpoint P (n : nat) : Prop := match n with 0 => True | S n' => P n' end.
But this is just True!
Hint Unfold P. Goal forall n, P n = True. induction n; auto. Qed.
Drawing out our diagram:

The algebras of our category (downward arrows) correspond to inductive arguments. Because our morphisms take the form of forall n, P n -> P' n, one cannot trivially conclude forall n, P' n simply given forall n, P n; however, the presence of the initial algebra means that True -> forall n, P n whenever we have an algebra forall n, IndF n -> P n. Stunning! (As a side note, Lambek’s lemma states that Mu P is isomorphic to P (Mu P), so the initial algebra is in fact really really trivial.)
In conclusion:
So, the next time someone tells asks you what the difference between induction and recursion is, tell them: Induction is just the unique homomorphism induced by an initial algebra over indexed propositions, what’s the problem?
Acknowledgements go to Conor McBride, who explained this shindig to me over ICFP. I promised to blog about it, but forgot, and ended up having to rederive it all over again.
[1] Another plausible formulation of morphisms goes (forall n : nat, P n) -> (forall n : nat, P' n). However, morphisms in this category are too strong: they require you to go and prove the result for all n... which you would do with induction, which misses the point. Plus, this category is a subcategory of the ordinary category of propositions.
by Edward Z. Yang at April 27, 2013 07:30 AM
Having attempted to read a few textbooks on my Kindle, I have solemnly concluded that the Kindle is in fact a terrible device for reading textbooks. The fundamental problem is that, due to technological limitations, the Kindle is optimized for sequential reading. This can be seen in many aspects:
A textbook cannot be read as a light novel. So, while the Kindle offers the tantalizing possibility of carrying a stack of textbooks with you everywhere, in fact, you’re better off getting the actual dead tree version if you’re planning on doing some serious studying from it. That is not to say textbook ebooks are not useful; in fact, having a searchable textbook on your laptop is seriously awesome—but this is when you’re using the textbook as a reference material, and not when you’re trying to actually learn the material.
by Edward Z. Yang at April 15, 2013 11:52 PM
I very rarely post linkspam, but given that I’ve written on the subject of anonymizing Bitcoins in the past, this link seems relevant: Zerocoin: making Bitcoin anonymous. Their essential innovation is to have a continuously operating mixing pool built into the block chain itself; they pull this off using zero-knowledge proofs. Nifty!
Here is a puzzle for the readers of this blog. Suppose that I am a user who wants to anonymize some Bitcoins, and I am willing to wait expected time N before redeeming my Zerocoins. What is the correct probability distribution for me to pick my wait time from? Furthermore, suppose a population of Zerocoin participants, all of which are using this probability distribution. Furthermore, suppose that each participant has some utility function trading off anonymity and expected wait time (feel free to make assumptions that make the analysis easy). Is this population in Nash equilibrium?
by Edward Z. Yang at April 11, 2013 10:54 PM
A few days ago, I was installing a Fedora 18 on a brand-new machine, which had two 1TB SATA hard drives which I wanted to put into a RAID1 (mirroring) configuration. mdadm makes managing software RAID on Linux easy, and the Fedora Installer (anaconda) has always had extremely powerful tools for setting up a RAID.
Anaconda also supports Logical Volume Manager (LVM), which is part of the recommended configuration for Fedora, because it adds lots of cool features, like being able to have non-contiguous partitions, moving partitions around the disk, between disks, and even spanning disks!
Combining the data redundant of mdadm with the partitioning features of LVM is a common strategy, but unfortunately one that's not longer supported in Fedora 18's Anaconda. In fact, there's an open bug on the subject, and a very heated discussion on the mailing lists.
The official solution is to forego the graphical installer entirely, and is to use a kickstart script -- Anaconda's automated install mode. After a lot of trial and error, I managed to get a kickstart that works, but only when the installer was booted with noefi. The kickstart was copied onto a FAT32-formatted flash drive as the file ks.cfg, and loaded with the boot parameter ks=hd:/sdc1/ks.cfg
This kickstart has one down side: it formats the entire raw disks, so it's not suitable for adding a Fedora installation to an existing LVM-with-mdadm configuration. I'm still investigating how to get anaconda to cooperate when the RAID and LVM is already configured.
by Alex Chernyakhovsky (noreply@blogger.com) at April 09, 2013 07:31 AM
(Selinger) Here is a fairy tale: The evil king calls the poor shepherd and gives him these orders. “You must bring me the philosophers stone, or you have to find a way to turn the philosopher’s stone to gold. If you don’t, your head will be taken off tomorrow!” What can the poor shepherd do to save his life?
Hat tip to Chris for originally telling me a different variant of this story. Unfortunately, this quote from Lectures on the Curry-Howard Isomorphism was the only reference I could find. What should the shepherd do? Is there something a little odd about this story?
by Edward Z. Yang at April 07, 2013 09:52 PM
Humbly presented for your consideration: Exhibit A, an NDSEG essay that did not get accepted; Exhibit B, an NDSEG essay that did get accepted. It’s pretty cool what making a statement more focused can do. (See also Philip Guo’s page on the topic.)
by Edward Z. Yang at April 06, 2013 02:40 AM
A few weeks ago I taught a class on Open Source: Contributing to free culture (catalog entry) for Spark, a one-day program put on by the student-run MIT Educational Studies Program. I was fortunate to have two helpful co-teachers, Tyler Hallada and Jacob Hurwitz, who assisted with the lesson plan and the in class lecture.
We ended up teaching 3 sessions of the 1hr 50min class that Saturday, with about 10 students in each session.
I was pretty impressed by the quality of the students; a number of them had used GNU/Linux before, but even those who hadn't were able to gain something from the experience. The class was broken up into three segments:
by Luke Faraone (noreply@blogger.com) at April 03, 2013 10:16 PM
I wrote a guest post on the Rackspace devops blog about some cool SSH config tips. This config has changed a lot of how I work — you should check it out!
by gdb at April 03, 2013 04:02 PM
Last week, I made my very first submission to ICFP! The topic? An old flame of mine: how to bound space usage of Haskell programs.
We describe the first iteration of a resource limits system for Haskell, taking advantage of the key observation that resource limits share semantics and implementation strategy with profiling. We pay special attention to the problem of limiting resident memory usage: we describe a simple implementation technique for carrying out incremental heap censuses and describe a novel information-flow control solution for handling forcible resource reclamation. This system is implemented as a set of patches to GHC.
You can get a copy of the submission here. I've reproduced below the background section on how profiling Haskell works; if this tickles your fancy, check out the rest of the paper!
Profiling in Haskell is performed by charging the costs of computation to the “current cost center.” A cost center is an abstract, programmer-specified entity to which costs can be charged; only one is active per thread at any given time, and the cost semantics determines how the current cost center changes as the program executes. For example, the scc cc e expression (set-cost-center) modifies the current cost center during evaluation of e to be cc. Cost centers are defined statically at compile time.
A cost semantics for Haskell was defined by Sansom et al. (1995) Previously, there had not been a formal account for how to attribute costs in the presence of lazy evaluation and higher-order functions; this paper resolved these questions. The two insights of their paper were the following: first, they articulated that cost attribution should be independent of evaluation order. For the sake of understandability, whether a thunk is evaluated immediately or later should not affect who is charged for it. Secondly, they observed that there are two ways of attributing costs for functions, in direct parallel to the difference between lexical scoping and dynamic scoping.
The principle of order-independent cost-attribution can be seen by this program:
f x = scc "f" (Just (x * x)) g x = let Just y = f x in scc "g" y
When g 4 is invoked, who is charged the cost of evaluating x * x? With strict evaluation, it is easy to see that f should be charged, since x * x is evaluated immediately inside the scc expression. Order-independence dictates, then, that even if the execution of x * x is deferred to the inside of scc "g" y, the cost should still be attributed to f. In general, scc "f" x on a variable x is a no-op. In order to implement such a scheme, the current cost-center at the time of the allocation of the thunk must be recorded with the thunk and restored when the thunk is forced.
The difference between lexical scoping and dynamic scoping for function cost attribution can be seen in this example:
f = scc "f" (\x -> x * x) g = \x -> scc "g" (x * x)
What is the difference between these two functions? We are in a situation analogous to the choice for thunks: should the current cost-center be saved along with the closure, and restored upon invocation of the function? If the answer is yes, we are using lexical scoping and the functions are equivalent; if the answer is no, we are using dynamic scoping and the scc in f is a no-op. The choice GHC has currently adopted for scc is dynamic scoping.
by Edward Z. Yang at April 02, 2013 08:36 PM
Single export refers to a design pattern where a module identifier is overloaded to also represent a function or type inside the module. As far as I can tell, the term “single export” is not particularly widely used outside the ECMAScript TC39 committee; however, the idea shows up in other contexts, so I’m hoping to popularize this particular name (since names are powerful).
The basic idea is very simple. In JavaScript, a module is frequently represented as an object:
var sayHello = require('./sayhello.js');
sayHello.run();
The methods of sayHello are the functions exported by the module. But what about sayHello itself? Because functions are objects too, we could imagine that sayHello was a function as well, and thus:
sayHello()
would be a valid fragment of code, perhaps equivalent to sayHello.run(). Only one symbol can be exported this way, but in many modules, there is an obvious choice (think of jQuery’s $ object, etc).
This pattern is also commonly employed in Haskell, by taking advantage of the fact that types and modules live in different namespaces:
import qualified Data.Map as Map import Data.Map (Map)
Map is now overloaded to be both a type and a module.
by Edward Z. Yang at April 01, 2013 12:39 AM
On Linux and other UNIX system, a domain name is translated to an IP address using the resolver(3) family of functions. Usually, these are provided by libresolv, but on some systems are available directly in libc.
Over the last couple of days, I've been working to overhaul the code for Hesiod, a system for querying users, groups, and other useful information information in a networked computer environment through DNS. I've been testing the code on Linux, FreeBSD, Solaris, and OS X, all of which were able to successfully compile the library.
To my surprise, on OS X, the configure script determined that the resolver functions exist in libc, and didn't include libresolv. When I went to run the Hesiod test suite, all I got was a linker error, that _res_9_init was not found.
What.
Let's write a simple program that calls res_init() and see what happens.
$ diff -U3 resolver-demo.c resolver-demo-resolv.c
--- resolver-demo.c 2013-03-28 18:08:07.000000000 -0400
+++ resolver-demo-resolv.c 2013-03-28 18:08:03.000000000 -0400
@@ -3,7 +3,7 @@
#include <string.h>
// Uncomment this and link with -lresolv
-//#include <resolv.h>
+#include <resolv.h>
#include <arpa/nameser_compat.h>
#include <arpa/inet.h>
$ gcc resolver-demo.c -o demo
$ gcc resolver-demo-resolv.c -o demo-resolv
Undefined symbols for architecture x86_64:
"_res_9_dn_expand", referenced from:
_main in cck8nJ1v.o
"_res_9_init", referenced from:
_main in cck8nJ1v.o
"_res_9_search", referenced from:
_main in cck8nJ1v.o
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
$ gcc resolver-demo-resolv.c -o demo-resolv -lresolv
$ ./demo mit.edu
Response: 18.9.22.69
$ ./demo-resolv mit.edu
Response: 18.9.22.69
by Alex Chernyakhovsky (noreply@blogger.com) at March 28, 2013 10:21 PM
I want to talk about an interesting duality pointed out by Mark Miller between two otherwise different language features: weak maps and private symbols. Modulo implementation differences, they are the same thing!
A weak map is an ordinary associative map, with the twist that if the key for any entry becomes unreachable, then the value becomes unreachable too (though you must remember to ignore references to the key from the value itself!) Weak maps have a variety of use-cases, including memoization, where we’d like to remember results of a computation, but only if it will ever get asked for again! A weak map supports get(key) and set(key, value) operations.
A private symbol is an unforgeable identifier of a field on an object. Symbols are useful because they can be generated “fresh”; that is, they are guaranteed not to conflict with existing fields that live on an object (whereas one might get unlucky with _private_identifier_no_really); a private symbol has the extra stipulation that one cannot discover that it exists on an object without actually having the symbol—e.g. an object will refuse to divulge the existence of a private symbol while enumerating the properties of an object. A private symbol psym might be created, and then used just like an ordinary property name to get (obj[psym]) and set (obj[psym] = value) values.
To see why these are the same, lets implement weak maps in terms of private symbols, and vice versa (warning, pseudocode ahead):
function WeakMap() {
var psym = PrivateSymbol();
return {
get: function(key) { return key[psym]; },
set: function(key, value) { key[psym] = value; }
}
}
function PrivateSymbol() {
return WeakMap();
}
// pretend that get/set are magical catch-all getters and setters
Object.prototype.get = function(key) {
if (key instanceof PrivateSymbol) { return key.get(this); }
else { return this.old_get(key); }
}
Object.prototype.set = function(key, value) {
if (key instanceof PrivateSymbol) { return key.get(this, value); }
else { return this.old_set(key, value); }
}
Notice, in particular, that it wouldn’t make sense to enumerate all of the entries of a weak map; such an enumeration would change arbitrarily depending on whether or not a GC had occurred.
If you look at this more closely, there is something rather interesting going on: the implementation strategies of weak maps and private symbols are the opposite of each other. With a weak map, you might imagine a an actual map-like data structure collecting the mappings from keys to values (plus some GC trickery); with a private symbol, you would expect the values to be stored on the object itself. That is to say, if we say “WeakMap = PrivateSymbol” and “key = this”, then the primary difference is whether or not the relationships are stored on the WeakMap/PrivateSymbol, or on the key/this. WeakMap suggests the former; PrivateSymbol suggests the latter.
Is one implementation or the other better? If objects in your system are immutable or not arbitrarily extensible, then the private symbol implementation may be impossible to carry out. But if both implementations are possible, then which one is better depends on the lifetimes of the objects in question. Garbage collecting weak references is expensive business (it is considerably less efficient than ordinary garbage collection), and so if you can arrange for your weak mappings to die through normal garbage collection, that is a win. Therefore, it’s better to store the mapping on the shorter-lived object. In the case of a memotable, the key is going to be a bit more ephemeral than the map, which results in a very strange consequence: the best implementation strategy for a weak map doesn’t involve creating a map at all!
Alas, as with many elegant results, there are some difficulties stemming from complexities in other parts of the ECMAScript specification. In particular, it is not at all clear what it means to have an “read-only weak map”, whereas an read-only private symbol has an obvious meaning. Furthermore, collapsing these two rather distinct concepts into a single language may only serve to confuse web developers; a case of a proposal being too clever for its own good. And finally, there is an ongoing debate about how to reconcile private state with proxies. This proposal was introduced to solve one particular aspect of this problem, but to our understanding, it only addresses a specific sub-problem, and only works when the proxy in question is a membrane.
by Edward Z. Yang at March 19, 2013 04:12 AM
If you hang out long enough with a certain crowd (in my case, it was the ECMAScript TC39 committee), you will probably hear the term membrane tossed around. And eventually, you will start to wonder, “Well, what is a membrane, anyway?”
As is the case with many clever but simple ideas, membranes were first introduced as a footnote [1] in a PhD thesis. Suppose that you are building distributed system, in which you pass references to objects between two separate nodes. If I want to pass a reference to foo in process A to process B, I can hardly just hand over an address—the memory spaces are not the same! So instead, I need to create a wrapper object wrappedFoo representing foo in B, which knows how to access the original object in A. So far so good.
Now here’s the catch: what if I pass a reference to wrappedFoo back to process A? If I were not very clever, I’d do the same thing as I did originally: create a new wrapper object wrappedWrappedFoo in A which knows how to access wrappedFoo in B. But this is silly; really, when I cross back over to A, I want to get back the original foo object.
This wrap-unwrap behavior is precisely what a membrane is. We consider the original object foo to be “inside” the membrane (a so-called wet object), and as it exits the membrane, it is wrapped with its own little membrane. However, when the object returns to its original membrane, the wrapper goes away. Just like in biology!
There is one last operation, called a “gate”: this occurs when you invoke a method on a wrapped object. Since the wrapper cannot actually perform the method, it has to forward the request to the original object. However, the arguments of the method need to get wrapped (or unwrapped) as they get forwarded; as you might expect.
While I used an RPC-like system to demonstrate the basic principle of membranes, a more conventional use is to enforce access control. Membranes are quite important; Mozilla relies on them extensively in order to enforce access restriction between objects from different websites which may interact with each other, but need security checks. (Actually, did you know that Mozilla is using a capability-based system for their security? Kind of neat!) It’s important to notice that when we unwrap, we are skipping security checks—the only reason this is acceptable is because the only objects that will be able to access the unwrapped object are precisely those objects in the same domain. For a more modern treatment of the subject, check out a more recent paper, Trustworthy Proxies: Virtualizing Objects with Invariants, which includes a lucid explanation of membranes.
[1] Well, actually it was a figure; figure 9.3 on page 71, to be precise!
by Edward Z. Yang at March 15, 2013 07:49 AM
At Stripe, we rely heavily on ruby and EventMachine to power various internal and external services. Over the last several months, we’ve known that one such service suffered from a gradual memory leak, that would cause its memory usage to gradually balloon from a normal ~50MB to multiple gigabytes.
It was easy enough to work around the leak by adding monitoring and restarting the process whenever memory usage grew too large, but we were determined to track down the root cause. Our exploration is a tour through a number of different debugging tools and techniques, so I thought I would share it here.
One powerful technique for tracking down tough memory leaks is
post-mortem analysis. If our program’s normal memory footprint is
50MB, and we let it leak until it’s using, say, 2GB,
1950/2000 =
97.5% of the program’s memory is leaked objects! If we look at a core
file (or, even better, a running image in gdb), signs of the leak
will be all over the place.
So, we let the program leak, and, when its memory usage got large enough, failed active users over to a secondary server, and attached gdb to the bloated image.
Our first instinct in a situation like this is that our Ruby code is leaking somehow, such as by accidentally keeping a list of every connection it has ever seen. It’s easy to investigate this possibility by using gdb, the Ruby C API, and the Ruby internal GC hooks:
(gdb) p rb_eval_string("GC.start")
$1 = 4
(gdb) p rb_eval_string("$gdb_objs = Hash.new 0")
$2 = 991401552
(gdb) p rb_eval_string("ObjectSpace.each_object {|o| $gdb_objs[o.class] += 1}")
$3 = 84435
(gdb) p rb_eval_string("$stderr.puts($gdb_objs.inspect)")
$4 = 4
Calling rb_eval_string lets us inject arbitrary ruby code into the
running ruby process, from within gdb. Using that, we first trigger a
GC — making sure that any unreferenced objects are cleaned up — and
then walk the Ruby ObjectSpace, building up a census of which Ruby
objects exist. Looking at the output and filtering for the top
objects, we find:
String => 26399
Array => 8402
Hash => 2161
Proc => 608
Just looking at those numbers, my gut instinct is that nothing looks too out of whack. Running some back-of-the-envelope numbers confirms this instinct:
It’s possible, of course, that one of those Strings has been growing
without bound, and is now billions of characters long, but that feels
unlikely. A quick survey of String lengths using ObjectSpace would
confirm that, but we didn’t even bother at this point.
So, we’ve mostly ruled out a Ruby-level leak. What now?
Well, as mentioned, 95+% of our program’s memory footprint is leaked objects. So if we just take a random sample of bits of memory, we will find leaked objects with very good probability. We generate a core file in gdb:
(gdb) gcore leak.core
Saved corefile leak.core
And then look at a random page (4k block) of the core file:
$ off=$(($RANDOM % ($(stat -c "%s" leak.core)/4096)))
$ dd if=leak.core bs=4096 skip=$off count=1 | xxd
0000000: 0000 0000 0000 0000 4590 c191 3a71 b2aa ........E...:q..
...
Repeating a few times, we notice that most of the samples include what looks to be a repeating pattern:
00000f0: b05e 9b0a 0000 0000 0000 0000 0000 0000 .^..............
0000100: 0000 0000 0000 0000 0100 0000 dcfa 1939 ...............9
0000110: 0000 0000 0000 0000 0a05 0000 0000 0000 ................
0000120: 0000 0000 0000 0000 d03b a51f 0000 0000 .........;......
0000130: 8000 0000 0000 0000 8100 0000 0000 0000 ................
0000140: 00a8 5853 1b7f 0000 0000 0000 0000 0000 ..XS............
0000150: 0000 0000 0000 0000 0100 0000 0100 0000 ................
0000160: 0000 0000 0000 0000 ffff ffff 0000 0000 ................
0000170: 40ef e145 0000 0000 0000 0000 0000 0000 @..E............
0000180: 0000 0000 0000 0000 0100 0000 0000 0000 ................
0000190: 0000 0000 0000 0000 0a05 0000 0000 0000 ................
00001a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00001b0: 8000 0000 0000 0000 8100 0000 0000 0000 ................
00001c0: 00a8 5853 1b7f 0000 0000 0000 0000 0000 ..XS............
00001d0: 0000 0000 0000 0000 0100 0000 0100 0000 ................
00001e0: 0000 0000 0000 0000 ffff ffff 0000 0000 ................
00001f0: 4062 1047 0000 0000 0000 0000 0000 0000 @b.G............
0000200: 0000 0000 0000 0000 0100 0000 1b7f 0000 ................
0000210: 0000 0000 0000 0000 0a05 0000 0000 0000 ................
0000220: 0000 0000 0000 0000 e103 0000 0000 0000 ................
0000230: 8000 0000 0000 0000 9100 0000 0000 0000 ................
0000240: 00a8 5853 1b7f 0000 0000 0000 0000 0000 ..XS............
0000250: 0000 0000 0000 0000 0100 0000 0100 0000 ................
0000260: 0000 0000 0000 0000 ffff ffff 0000 0000 ................
0000270: 50b0 350b 0000 0000 0000 0000 0000 0000 P.5.............
0000280: 0000 0000 0000 0000 0100 0000 0000 0000 ................
0000290: 0000 0000 0000 0000 0a05 0000 0000 0000 ................
00002a0: 0000 0000 0000 0000 30de 4027 0000 0000 ........0.@'....
00002b0: 0077 7108 0000 0000 0060 7b97 b4d2 f111 .wq......`{.....
00002c0: 9000 0000 0000 0000 8100 0000 0000 0000 ................
00002d0: 00a8 5853 1b7f 0000 0000 0000 0000 0000 ..XS............
00002e0: 0000 0000 0000 0000 0100 0000 0100 0000 ................
Those ffff ffff blocks, repeated every 128 bytes, leap out at me,
and 4 out of 5 samples of the core file reveal a similar pattern. It seems
probable that we’re leaking 128-byte objects of some sort, some field
of which is -1 as a signed 32-bit integer, i.e. ffff ffff in hex.
Looking further, we also notice a repeated 00a8 5853 1b7f 0000, two lines before each ffff ffff. If you’ve stared at too many Linux coredumps, as I
have, that number looks suspicious. Interpreted in little-endian, that
is 0x00007f1b5358a800, which points near the top of the userspace
portion of the address space on an amd64 Linux machine.
In fewer words: It’s most likely a pointer.
The presence of an identical pointer in every leaked object suggests that the pointer most likely points to some kind of “type” object or tag, containing information about type of the leaked object. For instance, if we were leaking Ruby String objects, every one would have an identical pointer to the Ruby object that represents the `String’ class. So, let’s take a look:
(gdb) x/16gx 0x00007f1b5358a800
0x7f1b5358a800: 0x0000000000000401 0x00007f1b53340f24
0x7f1b5358a810: 0x00007f1b532c7780 0x00007f1b532c7690
0x7f1b5358a820: 0x00007f1b532c7880 0x00007f1b532c78c0
0x7f1b5358a830: 0x00007f1b532c74f0 0x00007f1b532c74c0
0x7f1b5358a840: 0x00007f1b532c7470 0x0000000000000000
0x7f1b5358a850: 0x0000000000000000 0x0000000000000000
0x7f1b5358a860: 0x0000000000000406 0x00007f1b5332a549
0x7f1b5358a870: 0x00007f1b532c7a40 0x00007f1b532c7a30
The first field, 0x401, contains only two bits set, suggesting some
kind of flag field. After that, there are a whole bunch of
pointers. Let’s chase the first one:
(gdb) x/s 0x00007f1b53340f24
0x00007f1b53340f24: "memory buffer"
Great. So we are leaking … memory buffers. Thanks.
But this is actually fantastically informative, especially coupled
with the one other piece of information we have: /proc/<pid>/maps
for the target program, which tells us which files are mapped into our
program at which addresses. Searching that for the target address, we
find:
7f1b53206000-7f1b5336c000 r-xp 00000000 08:01 16697 /lib/libcrypto.so.0.9.8
0x7f1b53206000 ≤ 0x7f1b5358a800 < 7f1b5336c000, so this mapping
contains our “type” object. libcrypto is the library containing
OpenSSL’s cryptographic routines, so we are leaking some sort of
OpenSSL buffer object. This is real progress.
I am not overly familiar with libssl/libcrypto, so let’s go to the source to learn more:
$ apt-get source libssl0.9.8
$ cd openssl*
$ grep -r "memory buffer" .
./crypto/err/err_str.c:{ERR_PACK(ERR_LIB_BUF,0,0) ,"memory buffer routines"},
./crypto/asn1/asn1.h: * be inserted in the memory buffer
./crypto/bio/bss_mem.c: "memory buffer",
./README: sockets, socket accept, socket connect, memory buffer, buffering, SSL
./doc/ssleay.txt:- BIO_s_mem() memory buffer - a read/write byte array that
./test/times:talks both sides of the SSL protocol via a non-blocking memory buffer
Only one of those is a string constant, so we go browse
./crypto/bio/bss_mem.c and read the docs (bio(3) and
buffer(3)) a bit. Sparing you all the details, we learn:
BIO structure as a generic abstraction around any kind of
source or sink of data that can be read or written to.BIO has a pointer to a BIO_METHOD, which essentially contains
a small amount of metadata and a
vtable,
describing what specific kind of BIO this is, and how to interact
with it. The second field in a BIO_METHOD is a char * pointing
at a string holding the name of this type.BIOs is the mem BIO, backed
directly by an in-memory buffer (a BUF_MEM). The BIO_METHOD for
memory BIOs has the type tag "memory buffer".So, it appears we are leaking BIO objects. Interestingly, we don’t
actually seem to be leaking the underlying memory buffers, just the
BIO struct that contains the metadata about the buffer.
This kind of leak has to be in some C code somewhere. Clearly nothing in pure ruby code should be able to do this. The server in question contains no C extensions we wrote, so it’s probably in some third-party library we use.
A leak in OpenSSL itself is certainly possible, but OpenSSL is very widely used and quite mature, so let’s assume (hope) that our leak is not there, for now.
That leaves EventMachine as the most likely culprit. Our program uses SSL heavily via the EventMachine APIs, and we do know that EventMachine contains a bunch of C/C++, which, to be frank, does not have a sterling reputation.
So, we pull up an EventMachine checkout. There are a number of ways to
construct a new BIO, but the most basic and common seems to be
BIO_new, sensibly enough. So let’s look for that:
$ git grep BIO_new
ext/rubymain.cpp: out = BIO_new(BIO_s_mem());
ext/ssl.cpp: BIO *bio = BIO_new_mem_buf (PrivateMaterials, -1);
ext/ssl.cpp: pbioRead = BIO_new (BIO_s_mem());
ext/ssl.cpp: pbioWrite = BIO_new (BIO_s_mem());
ext/ssl.cpp: out = BIO_new(BIO_s_mem());
Great: there are some calls (so it could be one of them!), but not too many (so we can reasonably audit them all).
Starting from the top, we find, in ext/rubymain.cpp:
static VALUE t_get_peer_cert (VALUE self, VALUE signature)
{
VALUE ret = Qnil;
X509 *cert = NULL;
BUF_MEM *buf;
BIO *out;
cert = evma_get_peer_cert (NUM2ULONG (signature));
if (cert != NULL) {
out = BIO_new(BIO_s_mem());
PEM_write_bio_X509(out, cert);
BIO_get_mem_ptr(out, &buf);
ret = rb_str_new(buf->data, buf->length);
X509_free(cert);
BUF_MEM_free(buf);
}
return ret;
}
The OpenSSL APIs are less than perfectly self-descriptive, but it’s not too hard to puzzle out what’s going on here:
We first construct a new BIO backed by a memory buffer:
out = BIO_new(BIO_s_mem());
We write the certificate text into that BIO:
PEM_write_bio_X509(out, cert);
Get a pointer to the underlying BUF_MEM:
BIO_get_mem_ptr(out, &buf);
Convert it to a Ruby string:
ret = rb_str_new(buf->data, buf->length);
And then free the memory:
BUF_MEM_free(buf);
But we’ve called the wrong free function! We’re freeing buf, which
is the underlying BUF_MEM, but we’ve leaked out, which is the
BIO itself we also allocated. This is exactly the kind of leak we saw in our
core dump!
Continuing our audit, we find the exact same bug in
ssl_verify_wrapper in ssl.cpp. Reading code, we learn that
t_get_peer_cert is called by the Ruby function get_peer_cert, used
to retrieve the peer certificate from a TLS connection, and that
ssl_verify_wrapper is called if you pass :verify_peer => true to
start_tls, to convert the certificate into a Ruby
string for passing to the ssl_verify_peer hook.
So, any time you make a TLS or SSL connection with EventMachine
and verify the peer certificate, EventMachine was leaking 128
bytes! Since it’s presumably pretty rare to make a large number of SSL
connections from a single Ruby process, it’s not totally surprising
that no one had caught this before.
Having found the issue, the fix was simple, and was promptly merged upstream.
by nelhage at March 07, 2013 05:13 PM
Along with a Nexus 7, I also acquired a Kindle Paperwhite over winter break. (Wi-Fi only) I have been quite pleased by this purchase, though in an unexpected way: while I have not increased the number of books I read, the Kindle has materially changed how I read articles on the Internet. Not via their web browser, which is essentially unusable except for the simplest tasks, but via tools which take articles on the Internet and convert them into ebook form.
For blog posts I use Calibre with the Google Reader source. This has been revolutionary: I no longer read blog posts in my browser; instead, I bundle them up on my Kindle and read them at my leisure. This change has also meant that I read blog posts much more carefully (interfluidity used to only get a skim; now I can actually make it through the posts). This setup is a little nontrivial so I’ll probably describe it in a later blog post. (I used to use Klip.me, which had easy setup, but (1) it could not handle images (so equations and graphs are right out), and (2) it botched formatting of pre formatted text.)
For longform articles I use Longform, which serves up a interesting mix of in-depth reporting and nonfiction articles. They have a very convenient “Send to Kindle” button which I’m told is served by Readability; I wonder if I should add a button like that to my blog. I am also using Amazon’s Send to Kindle Firefox extension for my paywalled articles (primarily LWN.net), although its results seem a bit spottier than Readability.
For academic papers, the going has been a bit rough, but I have gotten decent results on two-column papers with cut2col, which manages to make text large enough to be readable in Kindle’s PDF reader. Landscape mode also helps a bit with reading PDFs. Generally, however, managing which PDFs are on my device is a bit of an open problem. I haven’t started using Calibre yet for this purpose, although I have used it to perform some conversions between ebook formats. These conversions don’t work particularly well, although they are usable. It’s well worth getting the latest version of Calibre, since there are some bugs in the current Quantal version; I use ppa:n-muench/calibre.
For textbooks, ebook editions are still extremely expensive. I would love to carry around all of the famous computer science textbooks in my pocket, but to date I don’t have any good way of doing so without breaking the bank. Books suffer similarly: it turns out I did most of my pre-Kindle reading borrowing books from libraries; however, I hear the Palo Alto public library does Kindle lending, and I intend to check them out at some point in time. The typesetting for public domain books put out by places like Project Gutenberg are notoriously spotty; Readability and similar services seem to do a much better job! Alas, the Stanford library system does not appear to carry ebooks. The Amazon free samples of books have also been good fun, and I’ve collected quite a few of them.
Some annoyances with the Kindle itself:
All-in-all, I am quite pleased with this device, and I would recommend it to anyone interested in reducing the amount of time they spend staring at a computer monitor.
by Edward Z. Yang at January 30, 2013 02:00 PM
I’d like to talk about some nitty-gritty details of GHC’s thread scheduling, discovered over the course of working on stride scheduling for GHC. Most of these choices are merely implementation details and are not part of any specification. While these choices shouldn’t be relied upon, they are worth knowing, since many of these details were accreted over the course of many performance bugs, benchmark tests and other battles. In this post, I’ll attempt to give some historical insight into why many choices were made. These insights should generalize to any system that would like to implement green threads, lightweight threads that use less memory than traditional operating system threads. For space reasons, I’m not going to talk about STM or sparks (though they are also quite interesting).
Update: A large portion of this material has been incorporated into the scheduler page in the GHC commentary
I’d first like to discuss some brief background about the runtime system first and point out some perhaps nonintuitive design choices. A thread is represented by a TSO (thread-state object) by GHC, i.e. the StgTSO struct in includes/rts/storage/TSO.h. [1] In Haskell, TSOs can be passed around as ThreadId objects. The Stg in front of the struct name indicates that TSOs are garbage collected, like other closures in Haskell. The TSO, along with the stack allocated with it (STACK), constitute the primary memory overhead of a thread. Default stack size, in particular, is controlled by the GC flag -ki, and is 1k by default. [2] Threads are run by Capabilities, which can be thought of virtual cores managed by GHC. Capabilities are, in turn, mapped to true operating system threads, or Tasks, though we won’t talk about them much.
Being garbage collected has two major implications for TSOs. First, TSOs are not GC roots, so they will get GC'd if there is nothing holding on to them (e.g. in the case of deadlock), and their space is not automatically reclaimed when they finish executing [3]. Usually, a TSO will be retained by a Capability’s run queue (a GC root), or in the list of waiting threads of some concurrency variable, e.g. an MVar. Second, a TSO must be considered a mutable object, and is thus subject to the conventional GC write barriers necessary for any mutable object in a generational garbage collector. [4] The dirty bit tracks whether or not a TSO has been modified; it is always set when a thread is run and also when any of the pointer fields on a TSO are modified. Two fields, set by setTSOLink and setTSOPrev, are of particular interest to the scheduler.
The run queue is at the heart of the scheduler, as any runnable thread will hit the run queue before the scheduler actually pops it off the queue and runs it. There’s one per capability rts/Capability.h (in the bad old days, there was a global run queue, but this performed badly for multithreaded processes), and it is implemented as a doubly-linked list run_queue_hd and run_queue_tl. [6] The head and tail pointers mean that the queue is actually a deque: this is important because the scheduler will often have to handle threads that were interrupted in some way, and should let the threads get back on. The links themselves are on the TSOs and modified with setTSOLink and setTSOPrev, so modifying the queue dirties the TSOs involved. [7] Otherwise, the run queue is exclusively owned by the scheduler. If there are idle capabilities and if we have more than one thread left in our run queue, threads will be pushed to other queues with schedulePushWork.
Threads are put in front (pushOnRunQueue) if:
Threads are put in back (appendToRunQueue) in the case of pre-emption, or if it’s new; particularly, if
Benchmarks like nofib are very important, even if they are synthetic, as they will often be construed as primary evidence whether or not a change to the scheduler speeds or slows things down. One reason is that it is much easier to tell why a short program that torture tests threads has slowed down than it is to tell why a large, complicated multithreaded program no longer seems very snappy. But really, the main motivation is convenience: nofib programs are easy to measure and easy to compare. Fortunately, the tests often measure something quite specific, so I’d like to describe the tests that compose the smp nofib suite here:
The GHC scheduler is pretty complicated! Much of the current behavior was created in response to specific problems: the right choices are not obvious a priori! I hope this post will serve as a valuable reference for any future GHC hackers interested in playing around with the scheduler, as well as for anyone else who needs to implement a scheduler for their runtime system. Much of the historical data was gleaned from comments (though I found some out-of-date ones), liberal use of git blame, and cross-referencing with the bug tracker—these are all useful places to figure out, “Well, why does that code do that?” In this post, I hope I’ve answered that question, to some degree.
[1] Initialization of StgTSO is handled in createThread in rts/Threads.c; this function is in turn invoked by createGenThread, createIOThread and createStrictIOThread in rts/RtsAPI.c. These functions setup the initial stack state, which controls what the thread executes when it actually gets run. These functions are the ones invoked by the fork# and other primops (entry-points for primops are located in rts/PrimOps.cmm).
[2] Actually, your usable stack will be a little smaller than that because this size also includes the size of the StgTSO struct. (This is only really for allocating lots of threads into one block, however, as once a GC occurs the TSOs and stacks will no longer be adjacent.)
[3] Here is a sample program which demonstrates how holding onto ThreadId using stable pointers (which force the object their pointing to to never be GC'd) can leak memory:
import Control.Concurrent
import Control.Monad
import Foreign.StablePtr
n = 400000
main = do
ms <- replicateM n (newEmptyMVar >>= \m -> (forkIO (putMVar m ()) >>= newStablePtr) >> return m)
mapM_ takeMVar ms
The heap profile of the run shows none of the TSO/STACK objects being deallocated, even when the MVars drain out as threads finish executing.
[4] The write barrier for generational GCs refers not to memory barrier of multithreaded execution, but rather, notification for the garbage collector when a mutable reference in the old generation changes, and may now possibly point to an object in the young generation. Write barriers are necessary because the old generation will not be traversed during a minor collection, and thus if old generations may point to an object in a young generation, we may miss the fact that a young object is still alive even though it has no references from other young objects. In GHC, a write barrier is implemented by adding an object to the mutable list (mut_list) of a Capability if it is not in the youngest generation. (Some objects, like MutArr#, are permanently on the mutable list; in such a case, a write barrier may not be necessary. But see [5] for more details.) Objects will usually track their dirty status, so that they don’t add themselves to the mutable list multiple times. (Accidentally adding an object multiple times is harmless, but means the GC has to do extra work traversing the mutable list.) Additionally, if we can guarantee that the new reference does not point to the young generation (for instance, it is a static closure like END_TSO_QUEUE), then dirtying the object is not necessary. Getting this stuff right is tricky, to say the least!
[5] There is a bit of a sordid story here. Keeping an object permanently on the mutable list is done by scavenge_mutable_list in rts/sm/Scav.c, which will unconditionally re-add such an object to the mutable list if it sees it there. How does the object get on the mutable list in the first place? It’s not placed on the list upon creation; rather, upon the first minor GC on the youngest generation, the scavenging GC notices the object and places it on the mutable list by gct->failed_to_evac = rtsTrue. How do we end up freeing the object? The mutable list is considered a set of root pointers, but it is only scavenged, not evacuated. If an item on the mutable list ends up not being evacuated, it will be blown away regardless. (This does mean, however, that its elements will not be freed until the next GC.) Isn’t it really inefficient to always be scanning these arrays? Yes, and this used to be a problem (ticket #650), nowadays mitigated by card marking. The same story applied to TSOs (ticket #1589), but the fix here was to properly apply a write barrier and not keep the objects permanently on the mutable list; this improved performance quite a bit when there were a lot of threads (even if you don’t scavenge their pointers, traversing a huge mutable list is still a pain.) Creating a lot of small mutable arrays is apt to be painful.
[6] It used to be singly linked, but fixing ticket #3838 required the ability to remove TSOs from the run queue.
[7] Since these fields are always traversed by the GC, it’s important that they do not contain NULL pointers or garbage. Instead, we set them to the static closure END_TSO_QUEUE. Because this is guaranteed not to be in the young generation, this is why you do not need to dirty the TSO after setting this field.
[8] Sometimes, a heap overflow and a context switch occur simultaneously. If the thread requested a large block, we still always push it in front (because we don’t want another thread to steal our large block); however, otherwise, the context switch takes precedence and the thread is booted to the end of the queue—the context switch is checked as late as possible. (See commit 05881ecab)
by Edward Z. Yang at January 28, 2013 08:00 AM
Today, at around 2:30 PM EST, Scripts went down. For most MIT students, this means a few class websites are inaccessible, a few friends' blogs are down, and web development is a bit annoying. For me, it means panic mode. I'm a member of the Scripts Maintainer Team, so the availability and security of the service is my responsibility.
The MIT network has been particularly unreliable recently, leading to speculation that there was a DDoS attack mounted by Anonymous. While we're still waiting for an official diagnosis from IS&T, I'm going to go ahead and say this: if Anonymous did take down the MIT network, and is continuing to attack it, please stop. You're just making students' lives worse.
The current pattern of "hacktivism" is supposed to get people's attention, and it is—just not those that can do anything. Scripts is MIT's largest web host but it's run entirely by student volunteers. Any and all attacks that are supposed to get the attention of the administration are instead being handled by SIPB members.
For those of you that want to do something in Aaron Swartz's memory, please find a more positive and productive task than upsetting MIT students that had nothing to do with the administration's decision.
by Alex Chernyakhovsky (noreply@blogger.com) at January 15, 2013 09:17 PM
So you want to make a web app. In today’s world, there is a panoply of software to assist you: you can use an all-in-one framework, or you can grab libraries to deal with the common needs of templating, database access, interactivity, etc. These libraries unify common functionality and take care of edge-cases you might otherwise not have the resources to deal with.
But there is one tool which is conspicuously absent: the natural language processing library.
“Now wait!” you may be saying, “of course there are NLP libraries, nltk and lingpipe come to mind.” Sure, but are you actually using these libraries? “Maybe not, but my application doesn’t need NLP, you see.”
The thing is, you are doing language processing in your application, even if you don’t realize it: “string concatenation” is really just a simple form of natural language generation, a subfield of NLP in its own right. [1] If you need to perform a more complicated task, such as pluralize nouns, capitalize sentences or change the grammatical form of verbs, you’ll need linguistic data. [2] This data is an essential part of many traditional NLP tasks. However, if you need to pluralize something today, you’re more likely to copy-paste a list of regexes off the Internet rather than think, “Hm, I should install an NLP library.” Part of this is because, while NLP libraries do contain this data, it is not publicized well.
It’s also worth considering if your application could benefit from any traditional NLP, including keyword generation, canonicalization (When are two things written slightly differently the same?), language identification, full-text search, autocompletion, topic detection and clustering, content summarization, parsing human-written dates and locations, etc. While it’s rare for an application to need all of these features, most would benefit from a few of them. For example, a blog application might want keyword generation to generate tags, full-text search to search posts, content summarization for non-fullpage views of posts, and date parsing for scheduling posts. These features tend to be absent, however, because they are often difficult to implement properly. Modern approaches often require models to be trained on large corpora of data—so-called data-driven models. Most of the time, this setup cost doesn’t seem worth it; if the feature is to be implemented (e.g. as an extension), a bag of heuristics is quicker.
Both of these problems hint at the trouble with current NLP frameworks: they assume that users are interested in building NLP systems, as opposed to using NLP systems. I shouldn’t need a PhD in computational linguistics to get my nouns to pluralize correctly or parse dates robustly. I shouldn’t need a PhD to get passable results on conventional, well-studied NLP applications. The default expectation should not be that users need to train a model: pre-existing models can easily be reused. Although there is an upper limit to how good an NLP algorithm can do without any tuning, the principled approach can still offer improvements over heuristics. But even more importantly, once a model is being used, developers who want to improve their results can train their own model on text from their own application, which is likely to carry domain-specific terminology and patterns. The library should be initially easy to use, and principled enough to be a gateway drug into the wonderful world of computational linguistics. Who knows what other applications could arise when developers recognize NLP as an accessible tool for their toolbox? [3]
Here is my call to arms: I want to see all of the current “baby-NLP” functionality collected together into a single place, where they get benefit from shared linguistic data and serve as easy-to-use features that initially attract developers. I would like to see more complicated but useful NLP technology become more accessible to a non-linguistic audience. And I would like all of this to be based on principled NLP foundations, so that it is possible to improve on the out-of-the-box models and algorithms. NLP practitioners are often very careful not to overstate what their systems are capable of (in contrast to the irrational exuberance of the 1980s). That’s OK: sometimes, the bar really is that low.
Thanks to Gregory Price, Eric Kow and Richard Tibbetts for helping review earlier drafts of this article.
[1] As a field, natural language generation doesn’t really consider string concatenation to be a true method; instead, it is interested in how to generate text from a functional description of intent. One neat example is referring expression generation.
[2] For example, the functionality (e.g. pluralization rules collected in the language/ folder in MediaWiki. MediaWiki is one of the most international open source projects, and I find it a fascinating source of information about linguistic oddities in foreign languages.
[3] As an example, I'd like to sketch how natural language generation can assist internationalization of applications. Suppose that you would like to let a user know that “you have three new messages.” The most obvious way to implement this would be with: printf("You have %d new message(s).", numMessages). Now, there are a number of shortcuts that have been taken here: we always print out a numeric digit, rather than AP style which uses English for numbers between zero and nine, and we’ve sidestepped whether or not “message” should be pluralized by tacking on an (s) on the end.
If we’d like to handle those cases, the next obvious thing to do is to add a few new functions: we’ll need a function apnumber to convert 3 to three, and we’ll need a function pluralize to convert message into messages when numMessages is greater than one. So you would end up with something like printf("You have %s new %s", apnumber(numMessages), pluralize("message", numMessages)). This is the ad hoc approach which will work reasonably well on English but will get you into trouble when you realize other languages have things like noun-adjective agreement (“nouveau message” versus “nouveaux messages”). Internationalization frameworks have long recognized and offered mechanisms for dealing with these cases; however, the average English-based project is unlikely to know about these problems until they internationalize.
However, there exists a representation which is agnostic to these issues. Consider the dependency grammar of this sentence, which we have extracted with a little NLP:
nsubj(have-2, You-1) root(ROOT-0, have-2) num(messages-5, three-3) amod(messages-5, new-4) dobj(have-2, messages-5)
We might ask, “Given data of this form, can we automatically generate an appropriate sentence in some language, which conveys the information and is grammatically correct?” That is a pretty hard task: it is the fundamental question of NLG. (It's not quite equivalent to machine translation, since we might require a user to add extra information about the functional intent that would otherwise be very hard to extract from text.) While it would be cool if we had a magic black box which could crank out the resulting sentences, even today, the tools developed by NLG may help reduce translator burden and increase flexibility. I think that’s well worth investigating.
by Edward Z. Yang at January 02, 2013 05:00 AM
I acquired a Google Nexus 7 (Wi-Fi only) over winter break. I don’t really like getting new devices: they invariably require a lot of work to setup to my liking. Here are some notes:
Oh yeah, and Happy New Year. :)
Update: I had my Nexus 7 inexplicably brick itself. Fortunately, once the phone is unlocked, it is very easy to reflash the image (and I didn’t lose data, which normally occurs when you first unlock a fone). I did this by fastboot update image-nakasi-jop40d.zip while the phone was in the bootloader (hold down both volume keys while powering up, and the image was downloaded from Google), and then applying the last set of steps from here to get SuperSu installed again (i.e. fastbooting into ClockworkMod and then sideloading SuperSu).
by Edward Z. Yang at January 01, 2013 02:19 AM
Metro maps are a visual metaphor for complex, interdependent story lines developed by Dafna Shahaf. Dafna’s thesis involved techniques for automatically taking a corpus of news articles and extracting a coherent narratives that covered the overall space. For our final CS448b project, we took one of the narratives Dafna had generated and created a system for displaying the maps. (The demo is best viewed on a large monitor.)

We only had enough time to get the viewer aspect polished, but we think that it would not be too difficult to extend this framework for the construction of metro maps (in case you don’t have access to Dafna’s algorithm).
This is joint work with Russell Chou and Jacob Jensen.
by Edward Z. Yang at December 13, 2012 09:44 AM
On the prompting of Steven Hum, I've put some finishing touches on my Sup patchset and am “releasing” it to the world (more on what I mean by “release” shortly.) The overall theme of this patchset is that it integrates as much Sup metadata it can with Maildir data. In particular:
There is at least a high probability the patchset will work for you, since I’ve been using it actively for a while. Sup will sometimes crash; if it doesn't happen reproduceably or cause data loss, I probably won’t investigate too hard. Some of my patches are a bit sketchy (especially those labeled HACK: I’ve attempted to document all the skeevy bits in commit messages and code comments.) So, how supported is this version of Sup? Well:
Some of the early patches are pretty uncontroversial though, and I’d like to see them get into mainline eventually. You can get the code here: http://gitorious.org/~ezyang/sup/ezyang/commits/maildir-sync/
sent-save-to
Configures where to save sent mail to. If this hook doesn't exist,
the global sent setting will be used (possibly defaulting to sup://sent)
Variables:
message: RMail::Message instance of the mail to send.
account: Account instance matching the From address
Return value:
Source to save mail to, nil to use default
compose-from
Selects a default address for the From: header of a new message
being composed.
Variables:
opts: a dictionary of ComposeMode options, including :from, :to,
:cc, :bcc, :subject, :refs and :replytos
Return value:
A Person to be used as the default for the From: header
draft-save-to
Selects a source to save a draft to.
Variables:
from_email: the email part of the From: line, or nil if empty
Return value:
A source to save the draft to.
To use this functionality, in config.yaml, you need a new option :maildir_labels:
:maildir_labels: :stanford: [[:inbox, 4], [null, 6]]
The value of this option is a dictionary of "accounts" to lists of precedences. (The account label stanford doesn’t actually mean anything; it's just for documentation.) Read it as follows:
For messages belonging in source 4 or source 6 (consult sources.yaml), if the message has the :inbox tag, move it to source 4, otherwise move it to source 6.
This will automatically start working for any new mail you change the labels of. In order to apply this to old mail, you need to run sup-sync-back-maildir. If you're going to move a lot of mail, you probably want to run this version of OfflineIMAP: https://github.com/ezyang/offlineimap
by Edward Z. Yang at December 01, 2012 11:33 PM
You can. Imagine a version of Haskell where every constructor was strict, e.g. every field had a ! prefix. The semantics of this language are well defined; and in fact, the fine folks at CMU have known about this for some time:
Up to this point we have frequently encountered arbitrary choices in the dynamics of various language constructs. For example, when specifying the dynamics of pairs, we must choose, rather arbitrarily, between the lazy dynamics, in which all pairs are values regardless of the value status of their components, and the eager dynamics, in which a pair is a value only if its components are both values. We could even consider a half-eager (or, equivalently, half-lazy) dynamics, in which a pair is a value only if, say, the first component is a value, but without regard to the second.
Similar questions arise with sums (all injections are values, or only injections of values are values), recursive types (all folds are values, or only folds of values are values), and function types (functions should be called by-name or by-value). Whole languages are built around adherence to one policy or another. For example, Haskell decrees that products, sums, and recursive types are to be lazy, and functions are to be called by name, whereas ML decrees the exact opposite policy. Not only are these choices arbitrary, but it is also unclear why they should be linked. For example, we could very sensibly decree that products, sums, and recursive types are lazy, yet impose a call-by-value discipline on functions. Or we could have eager products, sums, and recursive types, yet insist on call-by-name. It is not at all clear which of these points in the space of choices is right; each has its adherents, and each has its detractors.
Are we therefore stuck in a tarpit of subjectivity? No! The way out is to recognize that these distinctions should not be imposed by the language designer, but rather are choices that are to be made by the programmer. This may be achieved by recognizing that differences in dynamics reflect fundamental type distinctions that are being obscured by languages that impose one policy or another. We can have both eager and lazy pairs in the same language by simply distinguishing them as two distinct types, and similarly we can have both eager and lazy sums in the same language, and both by-name and by-value function spaces, by providing sufficient type distinctions as to make the choice available to the programmer.
This is from the Polarization chapter of Harper’s Practical Foundations for Programming Languages. Personally, I think call-by-name with (by default) eager data types by default is an under-appreciated point in the design space: with this combination, you still get the ability to implement your own control-flow structures like if (just not on data structures) and have lazy bindings, but you no longer have to worry about a large class of space leak. Of course, this regime doesn't eliminate all problems: for example, if you foldl instead of foldl', you will still end up with a long line of function applications and stack overflow. It’s not clear to me if there is an alternative form of fix which dodges this bullet.
by Edward Z. Yang at November 26, 2012 02:00 PM
Joe Zimmerman recently shared with me a cool new way of thinking about various encryption schemes called functional encryption. It’s expounded upon in more depth in a very accessible recent paper by Dan Boneh et al.. I’ve reproduced the first paragraph of the abstract below:
We initiate the formal study of functional encryption by giving precise definitions of the concept and its security. Roughly speaking, functional encryption supports restricted secret keys that enable a key holder to learn a specific function of encrypted data, but learn nothing else about the data. For example, given an encrypted program the secret key may enable the key holder to learn the output of the program on a specific input without learning anything else about the program.
Quite notably, functional encryption generalizes many existing encryption schemes, including public-key encryption, identity-based encryption and homomorphic encryption. Unfortunately, there are some impossibility results for functional encryption in general in certain models of security (the linked paper has an impossibility result for the simulation model.) There’s no Wikipedia page for functional encryption yet; maybe you could write it!
Apropos of nothing, a math PhD friend of mine recently asked me, “So, do you think RSA works?” I said, “No, but probably no one knows how to break it at the moment.” I then asked him why the question, and he mentioned he was taking a class on cryptography, and given all of the assumptions, he was surprised any of it worked at all. To which I replied, “Yep, that sounds about right.”
by Edward Z. Yang at November 25, 2012 08:24 PM
Functions are awesome. What if we made a PL that only had functions?
Objects are awesome. What if we made a PL where everything was an object?
Lazy evaluation is awesome. What if we made a PL where every data type was lazy?
Extremist programming (no relation to extreme programming) is the act of taking some principle, elevating it above everything else and applying it everywhere. After the dust settles, people often look at this extremism and think, “Well, that was kind of interesting, but using X in Y was clearly inappropriate. You need to use the right tool for the job!”
Here’s the catch: sometimes you should use the wrong tool for the job—because it might be the right tool, and you just don’t know it yet. If you aren’t trying to use functions everywhere, you might not realize the utility of functions that take functions as arguments [1] or cheap lambdas [2]. If you aren’t trying to use objects everywhere, you might not realize that both integers [3] and the class of an object [4] are also objects. If you aren’t trying to use laziness everywhere, you might not realize that purity is an even more important language feature [5].
This leads to two recommendations:
There are a lot of situations where extremism is inappropriate, but for fun projects, small projects and research, it can really teach you a lot. One of the most memorable interactions I had in the last year was while working with Adam Chlipala. We were working on some proofs in Coq, and I had been taking the moderate route of doing proofs step-by-step first, and then with Ltac automation once I knew the shape of the proof. Adam told me: “You should automate the proofs from the very beginning, don’t bother with the manual exploration.” [6] It was sage advice that made my life a lot better: I guess I just wasn’t extremist enough!
Files are awesome. What if we made an OS where everything was a file?
Cons cells are awesome. What if we made a PL where everything was made of cons cells?
Mathematics is awesome. What if we made a PL where everything came from math?
Arrays are awesome. What if we made a PL where everything was an array?
[1] Higher-order functions and combinators: these tend to not see very much airplay because they might be very verbose to write, or because the language doesn't have a very good vocabulary for saying what the interface of a higher-order function is. (Types help a bit here.)
[2] Cheap lambdas are necessary for the convenient use of many features, including: monads, scoped allocation (and contexts in general), callbacks, higher-order functions.
[3] Consider early versions of Java prior to the autoboxing of integer and other primitive types.
[4] Smalltalk used this to good effect, as does JavaScript.
[5] This is one of my favorite narratives about Haskell, it comes from Simon Peyton Jones’ presentation Wearing the hair shirt (in this case, laziness).
[6] This is the essence of the Chlipala school of Coq proving, in recognition of how astonishingly easy it is to trick experienced computer scientists into writing the equivalents of straight-line programs by hand, without any abstractions.
by Edward Z. Yang at November 20, 2012 09:15 PM
Most of the time when programmers write code, they're writing in userspace. This "normal" code doesn't have to worry about the underlying hardware, it interacts with the operating system's abstractions only.
by Alex Chernyakhovsky (noreply@blogger.com) at November 18, 2012 07:35 PM
“Everything is a file.” [1] This was the design philosophy taken to its logical extreme in Plan 9. Any interface you could imagine was represented as a file. Network port, pixel buffers, kernel interfaces—all were unified under a common API: the file operations (open, read, write...) Plan 9 used this to eliminate most of its system calls: it had only thirty-nine, in contrast to modern Linux's sprawling three hundred and twenty-six.
When I first heard of Plan 9, my first thought was, “But that’s cheating, right?” After all, they had reduced the number of syscalls but increased the number of custom files: complexity had merely been shifted around. But one of my labmates gave me a reason why this was still useful: per-process mountpoints. These mountpoints meant that I could give each process their own view of the filesystem—usually the same, but sometimes with some vital differences. Suppose that I wanted to tunnel the network connection of one of my applications: this application would be accessing the network through some file, so I instead could mount a network filesystem to the network files of another system, and transparently achieve proxying without any cooperation from my application. [2]
Let’s step back for a moment and put on our programming language hats. Suppose that a file is an abstract data type, and the syscall interface for manipulating files is the interface for this data type. What are mounts, in this universe? Another friend of mine pointed out the perfectly obvious analogy:
Files : Mounts :: Abstract Data Types : Dependency Injection
In particular, the mount is a mechanism for modifying some local namespace, so that when a file is requested, it may be provided by some file system completely different to what the process might have expected. Similarly, dependency injection specifies a namespace, such that when an object is requested, the concrete implementation may be completely different to what the caller may have expected.
The overall conclusion is that when developers implemented dependency injection, they were reimplementing Plan 9’s local mounts. Is your dependency injection hierarchical? Can you replace a hierarchy (MREPL), or mount your files before (MBEFORE) or after (MAFTER) an existing file system? Support runtime changes in the mount? Support lexical references (e.g. dot-dot ..) between entities in the hierarchy? I suspect that existing dependency injection frameworks could learn a bit from the design of Plan 9. And in Haskell, where it seems that people are able to get much further without having to create a dependency injection framework, do these lessons map back to the design of a mountable file system? I wonder.
[1] Functional programmers might be reminded of a similar mantra, “Everything is a function.”
[2] For the longest time, Linux did not provide per-process mount namespaces, and even today this feature is not available to unprivileged users—Plan 9, in contrast, had this feature available from the very beginning to all users. There is also the minor issue where per-process mounts are actually a big pain to work with in Linux, primarily, I dare say, due to the lack of appropriate tools to assist system administrators attempting to understand their applications.
by Edward Z. Yang at November 09, 2012 12:45 AM
I'm taking a Data Visualization course this fall, and one of our assignments was to create an interactive visualization. So I thought about the problem for a little bit, and realized, “Hey, wouldn’t it be nice if we had a version of hp2ps that was both interactive and accessible from your browser?” (hp2any fulfills this niche partially, but as a GTK application).
A week of hacking later: hp/D3.js, the interactive heap profile viewer for GHC heaps. Upload your hp files, share them with friends! Our hope is that the next time you need to share a heap profile with someone, instead of running hp2ps on it and sending your colleague the ps file, you’ll just upload the hp file here and send a colleague your link. We’ve tested it on recent Firefox and Chrome, it probably will work on any sufficiently modern browser, it definitely won’t work with Internet Explorer.

Some features:
Give it a spin, and let me know about any bugs or feature suggestions! (Some known bugs: sometimes Yesod 500s, just refresh until it comes up. Also, we lack backwards animations, axis changing is a little choppy and you can’t save annotations on the OTHER band.)
by Edward Z. Yang at November 02, 2012 06:42 AM
October has come, and with it, another Ubuntu release (12.10). I finally gave in and reinstalled my system as 64-bit land (so long 32-bit), mostly because graphics were broken on my upgraded system. As far as I could tell, lightdm was dying immediately after starting up, and I couldn't tell where in my copious configuration I had messed it up. I also started encrypting my home directory.
My labmates continue to tease me for not switching to Arch. We’ll see...
by Edward Z. Yang at October 24, 2012 03:00 PM
I was wandering through the Gates building when the latest issue of the ACM XRDS, a student written magazine, caught my eye.

“Oh, didn’t I write an article for this issue?” Yes, I had!

The online version is here, though I hear it’s behind a paywall, so I’ve copypasted a draft version of the article below. Fun fact: The first version of this article had a Jeff Dean fact, but we got rid of it because we weren’t sure if everyone knew what Jeff Dean facts were...
True fact: as a high school student, Jeff Dean wrote a statistics package that, on certain functions, was twenty-six times faster than equivalent commercial packages. These days, Jeff Dean works at Google, helping architect and optimize some of the biggest data-crunching systems Google employs on a day-to-day basis. These include the well known MapReduce (a programming model for parallelizing large computations) and BigTable (a system which stores almost all of Google's data). Jeff's current project is infrastructure for deep learning via neural networks, a system with applications for speech/image recognition and natural language processing.
While Jeff has become a public face attached to much of Google's internal infrastructure projects, Jeff stresses the fact that these projects require a mix of areas of expertise from people. Any given project might have people with backgrounds in networking, machine learning and distributed systems. Collectively, a project can achieve more than any person individually. The downsides? With all of the different backgrounds, you really need to know when to say: “Hold on, I don't understand this machine learning term.” Jeff adds, however, that working on these teams is lots of fun: you get to learn about a sub-domain you might not have known very much about.
Along with a different style of solving problems, Google also has different research goals than academia. Jeff gave a particular example of this: when an academic is working on a system, they don't have to worry about what happens if some really rare hardware failure occurs: they simply have to demo the idea. But Google has to worry about these corner cases; it is just what happens when one of your priorities is building a production system. There is also a tension with releasing results to the general public. Before the publication of the MapReduce paper, there was an internal discussion about whether or not to publish. Some were concerned that the paper could benefit Google's competitors. In the end, though, Google decided to release the paper, and you can now get any number of open source implementations of MapReduce.
While Jeff has been at Google for over a decade, the start of his career looked rather different. He recounts how he ended up getting his first job. “I moved around a lot as a kid: I went to eleven schools in twelve years in lots of different places in the world... We moved to Atlanta after my sophomore year in high school, and in this school, I had to do an internship before we could graduate... I knew I was interested in developing software. So the guidance counselor of the school said, 'Oh, great, I'll set up something', and she set up this boring sounding internship. I went to meet with them before I was going to start, and they essentially wanted me to load tapes into tape drives at this insurance company. I thought, 'That doesn't sound much like developing software to me.' So, I scrambled around a bit, and ended up getting an internship at the Center for Disease Control instead.”
This “scrambled” together internship marked the beginning of many years of work for the CDC and the World Health Organization. First working at Atlanta, and then at Geneva, Jeff spent a lot of time working on what progressively grew into a larger and larger system for tracking the spread of infectious disease. These experiences, including a year working full-time between his graduation from undergraduate and his arrival at graduate school, helped fuel is eventual choice of a thesis topic: when Jeff took an optimizing compilers course, he wondered if he could teach compilers to do the optimizations he had done at the WHO. He ended up working with Craig Chambers, a new faculty member who had started the same year he started as a grad student. “It was great, a small research group of three or four students and him. We wrote this optimizing compiler from scratch, and had fun and interesting optimization work.” When he finished his PhD thesis, he went to work at Digital Equipment Corporation and worked on low-level profiling tools for applications.
Jeff likes doing something different every few years. After working on something for a while, he'll pick an adjacent field and then learn about that next. But Jeff was careful to emphasize the fact that while this strategy worked for him, he also thought it was important to have different types of researchers, to have people who were willing to work on the same problem for decades, or the entire career—these people have a lot of in depth knowledge in this area. “There's room in the world for both kinds of people.” But, as he has moved from topic to topic, it turns out that Jeff has come back around again: his current project at Google on parallel training of neural networks was the topic of Jeff's undergraduate senior thesis. “Ironic,” says Jeff.
by Edward Z. Yang at October 22, 2012 02:00 PM
For those of you back at my university, you may have noticed I'm not there this semester.
![]() |
| Google's 2012 FEP team |
![]() |
| Myself and Obey Arthur Liu at a SIPB hackathon |
by Luke Faraone (noreply@blogger.com) at October 08, 2012 03:57 PM
We launched Stripe CTF 2.0 on Wednesday. Thus far we’ve had 14,000 signups, and over 500 people have captured the flag. Designing an architecture to handle this many users, all running their potentially malicious code on the level servers, was definitely a challenge, but it was also a ton of fun.
Our first and foremost design goal was pretty simple: don’t let people root the machines. It’s horrendously difficult to keep a shared Linux login machine secure, so the best you can do is apply all of the security countermeasures you can think of. Each level server is configured in the same way (maintaining multiple configurations is a great way to end up with an oversight due to increased complexity). All user-facing services run in a chroot with only /home, /tmp, /var/tmp, and /var/log writeable. This is implemented by mounting a filesystem (created using debootstrap) at /var/chroot and bind-mounting it read-only to /var/chroot-ro. That way a user in the chroot can’t clobber system files even with escalated privileges (thus cutting off many vectors to obtaining a root shell), but we can perform maintenance from outside the chroot. We didn’t mount /proc in the chroot (except where needed to make other software work, and even there we mounted read-only or even chmod‘d 700), and we set /var/log in the chroot to be only accessible to admins. We also went and removed the setuid bit from all binaries (find / -perm -4000 is a handy command to find them).
Our second design goal was isolation. Malicious users should not be able to affect gameplay for others. We hadn’t really expected our first CTF to take off, and hence had explicitly decided to punt on user isolation features. As a result, people kept fork-bombing and preventing others from accessing the levels. We spent the week getting very good at clearing out forkbombs (which is a skill I hope never to employ again). This time, we decided to run every end user’s code as a separate UNIX user, meaning that we could use Linux’s native per-user resource limits. (We also considered using LXC, but didn’t have enough data about its scaling properties — we were also hesitant to have a security-critical component of our system be one that we didn’t know intimately.) Since everything was running as its own user, we needed to spawn server processes on demand — mod_fcgid + suEXEC provide exactly this feature (we also threw in suPHP to make the PHP levels more intuitive to people, though I would have preferred running everything under a single setuid wrapper). I wrote a quick wrapper script around suEXEC to set resource limits and nice the child FCGI processes, which would be inherited by any children processes they might spawn. We also set low disk quotas per user and removed unneeded device nodes (e.g. /dev/ram*).
We used the following mod_fcgid config to make sure that users weren’t allocated too many processes (FcgidMaxProcessesPerClass) and that processes were eventually killed off (FcgidMinProcessesPerClass). The FcgidMaxProcesses setting turned out to be useless, as there’s a compile-time constant FCGID_MAX_APPLICATION (set to 1024 on Ubuntu) which dictates a hard limit, and we didn’t want to recompile mod_fcgid by hand.
FcgidMinProcessesPerClass 0 FcgidMaxProcessesPerClass 2 FcgidMaxProcesses 5000 FcgidProcessLifeTime 150
The initial CTF level was written in node, and hence we were stuck with the problem of trying to run node under FCGI (which as far as I can tell no one has actually done before). As a giant hack, I wrote a FCGI shim which translates the FCGI protocol back to HTTP (I’ll go over the details in another blog post). The shim then spawns the node process directly and speaks HTTP to it over a UNIX socket. We also used the shim in a later level to spawn a cluster of server processes on-demand — it turned out to be a pretty flexible approach for getting the process management benefits of mod_fcgid without actually having to speak FCGI.
A couple of levels required us to run a simulated browser so people could exploit XSS bugs. For those levels, we ran a PhantomJS script every few minutes. To make sure it only ran when needed (and that it ran as your assigned UNIX user, in case of Webkit vulnerabilities), we spawned the headless Webkit instance directly from the FCGI dispatcher, using the following code:
Thread.abort_on_exception = true
Thread.new do
while true
started = Time.now
pid = fork do
cmd = File.join(File.dirname(__FILE__), '../browser-runner.sh')
exec([cmd, cmd])
end
Process.wait(pid)
status = $?.exitstatus
$stderr.puts "Exited with status #{$?}" unless status == 0
sleep(60 * (3 * rand + 1))
end
end
Giving UNIX users random names (and using mod_userdir as a routing layer) served as a convenient access-control mechanism. We assigned CTF solvers a level URL such as https://level01-2.stripe-ctf.com/user-iqrncaxifi, which was backed by a UNIX user with name user-iqrncaxifi. To prevent people from discovering other users, we set permissions on /etc/passwd and /etc/group to 600. It turns out that this makes some software a bit sad (e.g. bash can’t fill in the username in your shell prompt, and Python can’t load the site module), but for the most part those can be worked around.
One issue with the UNIX user approach is actually creating the accounts. It turns out that adduser by default does a linear scan through all possible user IDs, and effectively grinds to a halt once you have lots of users on a system. If you pass it a uid and gid directly, it runs in about 500ms, which still isn’t fast enough for on-demand allocation. Hence we maintained a central database of UNIX users and preallocated 1000 users per box.
We added a variety of other controls. For example, we firewalled off outbound network access on all servers (don’t want people using them to send spam). We also ran our machines such that even if they were completely compromised, we wouldn’t have to care. Hence we ran the CTF on a throwaway domain (stripe-ctf.com) with a fresh SSL cert generated just for the contest.
All told, building CTF was probably about two or three weeks of my time, plus as much time from a combination of other people (props to Andy, Sidd, Ludwig, Andrew, and Colin for doing an awesome job on this). We’ll be publicly posting the levels after the contest is over, so you should check those out. There are a ton of other details on the infrastructure I could delve into; let me know if there’s anything else you’re wondering about!
by gdb at August 27, 2012 05:39 PM
Back in February we ran a Capture the Flag competition at Stripe. It was a ton of fun, and we had about 10,000 people participate. We decided we’d run another one at the end of the summer, which is pretty rapidly approaching.
I’m at a Stripe hackathon right now, which I’m using an excuse to find time for putting the finishing touches on a pair of levels I’m writing for our next CTF. It’s going to be awesome — be ready!
by gdb at August 05, 2012 12:28 AM
I’ve always loved reading academic CS papers. Even when their contents aren’t directly related to anything I’m working on, there’s a lot of general CS knowledge and mindset to be learned from reading them. As well, it’s just a lot of fun.
When I joined Stripe, I started maintaining a list of papers to read, and I resolved to read a paper each week. But I soon realized that I kept procrastinating when it came to actually reading the papers. There was always some more code to write, or some urgent issue that seemed more important than reading a paper. So I realized that I needed a forcing function to actually make me follow through on my resolution — and thus I recruited a few fellow Stripes (as well as a friend from a neighboring startup) to start the Stripe paper reading group with me.
Each week, we select a paper. We then meet on Fridays at the Stripe office to discuss the paper over lunch. You can view the list of past papers at https://github.com/gdb/stripe-prg/wiki/Papers — there are some really good ones on there.
Beyond being a simple forcing function, I’ve found the group discussion to be really helpful. Sometimes we spend the time going over tricky corner cases the paper omits, and sometimes we tie the paper back to related problems that we’ve been thinking about. In any case, it’s been great to have other people to talk these issues through with.
This week, when I went to select a paper, I found myself with an unfamiliar problem. My paper reading wishlist was empty! If that’s not the sign of success, I’m not sure what is.
Based on my experience, I’d wholeheartedly recommend instituting a paper reading group at your own workplace. Alternatively, I’ve recently started inviting people from outside Stripe to join the group. If you’re in SF and you’d be interested in attending, you should let me know.
by gdb at July 17, 2012 05:12 PM
In Thunderbird, if you read a message on a mailing list and want to reference a post in a blog or elsewhere, you may often want to access a copy of the mailing list posting on the web. While you can often accomplish this by searching for the subject or manually finding it in the specific archives of that list, if you're using full mail headers, there's a built-in way to find a message on the web.
Each email or Usenet message has a unique Message-ID. These are indexed at various providers which archive mailing lists, with Gmane being the most notable in the Free Software community.
If you right-click on the Message-ID of a message in Thunderbird, you can choose to "Open Browser with Message-ID". By default this menu item opens sensible-browser with a Google Groups page, which, while indexing of Usenet, remains ignorant about most mailing lists. Most people reading this will probably find that Gmane carries their favourite mailing lists, while Google does not.
This is configurable, but sadly you have to choose one or another. In Thunderbird, go to Edit → Preferences → Advanced → Config Editor, and set the "mailnews.messageid_browser.url" property to http://mid.gmane.org/%mid. (default of http://groups.google.com/groups?selm=%mid&rnum=1)
by Luke Faraone (noreply@blogger.com) at July 08, 2012 02:54 AM
On Tuesday, June 26th, 2012, all of the scripts.mit.edu servers will be upgraded from Fedora 15 to Fedora 17, which was released on May 29. We strongly encourage you to test your website as soon as possible, and to contact us at scripts@mit.edu or come to our office in W20-557 if you experience any problems. The easiest way to test your site is to run the following commands at an Athena workstation and then visit your website in the browser that opens, but see this page for more details and important information: athena% add scripts
athena% firefox-test
by ezyang at June 10, 2012 10:20 PM
Startups are hard.
They’re an emotional roller coaster. On any given day, it’s equally likely that you’ll feel like your company is on an inevitable path to success as it is you’ll wonder whether the company will even exist tomorrow. But most of the time you’re too busy focusing on your work to feel much of anything at all.
Figuring out what problem your startup should solve is hard on its own, but every minute of every day you’re faced with a similar decision that you have to make on the fly. Out of the infinite space of possible ways you could be spending your time, which tasks should you pick? When there are multiple known bugs to fix, complicated customer queries to answer, and new feature development that’s been on your todo list for months, you need to efficiently prioritize and decide what needs to be done now and what will have to wait until later.
You have to spend time doing things you don’t like, or complete a job you are completely unqualified to do. Maybe you have to become an expert in some arcane system, and you have to do it on a tight deadline (true story). Or maybe you just want your company to have T-shirts, and you have to figure out how this Photoshop thing works (also a true story).
So why do so many people insist on working at startups? I can’t speak for anyone else, but I can tell you why I do.
I get to spend every day working towards something great. Even when the emotional roller coaster goes through a dip, I know that the goal we’re all working towards is worthwhile. The feedback loop between my work and its impact on the world is incredibly tight.
I get to solve hard problems every day. Technical challenges are easy enough to find at any job, but I can’t imagine another environment that forces you to tackle so many challenges at once. You need to figure out how to be efficient, how to work with others effectively, and how to prioritize and manage your own work. At the end of every day, I’ve learned a ton, whether technically, personally, or professionally.
And finally, I get to work with awesome people. At a startup, you have the opportunity to build a team that conforms to whatever standards you find important. I have the incredible fortune that everyone I work with is an amazing individual who has a ton to teach me and whom I really enjoy being around.
In my book, most of the mythos around startups is exaggerated. You don’t have to work insane hours, and you don’t have to give up your social life. But I can’t imagine doing a company without having a desire to explore new things and possessing a healthy tolerance for pain. A startup is simultaneously the best and worst parts of my life, but there’s no doubt that I am far better off for it.
by gdb at May 14, 2012 04:26 PM
I gave a talk on Friday at MongoSF on how to run MongoDB for high availability. The slides are posted on slideshare, but without speaker’s notes they’re probably not all that useful. Sorry about that. But to try to make it up to you, I’m going to go through the talk’s contents in the remainder of this post. As well, the video from the talk is now (highly) available on the 10gen website.
For a company like Stripe, availability is incredibly important. If we are unable to accept an API request, that causes downtime not just for us but also for our customers. Since the limiting factor on availability for many applications is the database (if that’s down, you can’t serve any requests), when choosing a database we sought to find one that allows maximum availability.
In an ideal world, your highly-available database would be a collection of indistinguishable nodes. You could add or remove nodes from the pool without any service interruption, and your nodes can fail or become partitioned from one another without downtime. Unfortunately, there are theoretical results which show that such a database can’t exist. So any real distributed database needs to make tradeoffs.
MongoDB has one of the cleanest availability models of commercially-available databases that I’ve seen. MongoDB nodes are clustered into replica sets. Each node in a replica set contains a full copy of the data. At any given time, there is at most one primary, which is the only node capable of accepting writes. If that node fails or is asked to step down, an election occurs and a new primary is selected. Any node can accept reads, though reads to secondaries are not guaranteed to be consistent. There is no single point of failure or manual administration needed to promote a secondary. Determination of the current cluster configuration is managed by a smart client driver; the application writer needs only to provide the driver with a set of seed nodes to establish an initial connection.
The one case that you as an application writer need to worry about is in the midst of cluster reconfiguration. When the set begins reconfiguring, the servers close all open connections. Any mongo operations will fail until the cluster reconfiguration has completed. Your driver will then bubble this as an error to the application level. (In general, it doesn’t have enough information to know what to do in this case, so it bails. If for example an error is returned to a write, the client can’t tell whether that write was accepted or not.) So to achieve robustness in the case of cluster reconfigurations, you need to write code to handle these exceptions.
At Stripe, we make all of our writes idempotent. We then wrap all Mongo operations in retry logic — if the operation throws an exception, we simply retry it until it succeeds (with an appropriate backoff and timeout). Note that your concurrency model needs to be considered when evaluating whether a write is truly idempotent — make sure a conflicting write can’t sneak in between tries.
With this retry code, our application then can handle arbitrary failures or routine reconfigurations of our Mongo cluster. I come from a MySQL world where I couldn’t even reboot my master server for routine maintenance. Being able to withstand the catastrophic failure of our database primary without any service interruption was basically a dream come true.
That’s really all the code you need to write for high availability with MongoDB. However, you also need to use proper administration and deployment techniques to ensure you don’t take yourself down. You’re most vulnerable during replica set reconfigurations, and so you should minimize the number of times you have to do one. While replica set reconfigurations should be safe, it’s plausible to accidentally be in a world where no node is eligible to become primary. As well, if there’s been lots of flapping, we’ve seen primary election take a minute or more, causing pending requests to timeout. You should always make sure to test your administration procedures in your staging environment to make sure they’re safe. One gotcha we’ve hit is that a recently-booted Mongo node isn’t eligible to become a primary unless all nodes in the cluster are up. In our staging environment, we had been fairly slow while testing a MongoDB version upgrade. In production, we were much faster, and when we tried to fail over to a recently-upgraded node, suddenly we had no eligible primary node.
One operation we’ll commonly perform is a rebuild of all machines in our database cluster. We used to rebuild one node at a time, adding it to the pool, waiting for a full sync, and then removing the corresponding old node. This required a decent number of replica set reconfigurations, and it was pretty miserable to perform. These days, we instead just add all the new nodes to the pool, allow them to sync, fail over the primary, and then remove the old ones.
If you build your new node via an initial sync rather than restoring from a snapshot, you should make sure to force that node to sync from a secondary. Otherwise, you’ll end up with your primary reading through all of its data, evicting hot data from cache. This can cause severe performance issues. In the current version of MongoDB, however, there’s no supported way to force a sync from a particular server (this feature was recently added to the development version). If you read the source, however, you’ll see that the new node selects the node with lowest ping time to sync from. So we just drop in an iptables rule blackholing traffic from the new node to the primary, wait for it to select its sync target, and then remove the iptables rule. This technique is obviously a giant hack, and I’m looking forward to just use replSetSyncFrom instead
.
One nonobvious point when configuring your replica sets is how to address the machines. Mongo can use either IP addresses or hostnames. (Fun fact: hostnames are dynamically resolved and connections are frequently reestablished. So if you munge /etc/hosts, you can easily repoint a particular hostname.) When rebuilding a cluster, we like to reuse hostnames, e.g. creating a new server db1 to replace the old db1. But these nodes have to exist in the replica set at the same time. So we used to add the new node with a temporary hostname, fail over the primary, remove the old nodes, munge /etc/hosts to point to the new nodes, and then reconfigure with the correct hostnames. But as much fun as it is munging /etc/hosts in the middle of database administration, it’s also very easy to make a mistake. These days, we just configure our replica sets with IP addresses. Then the machines can have whatever hostnames they want.
Perhaps the most common reason that many people go down is for routine maintenance. With MongoDB, it’s not too bad to never go down for maintenance. Since Mongo doesn’t enforce any particular schema, a common technique is to tag objects with a schema version number. You can then lazily migrate objects as you read them, or with a batch process that migrates them in the background (you just need to make your app be able to understand all active schema version numbers). One note is that I recommend against multi-updates for large collections (i.e. don’t run a batch db.update(…)).
Why not, you ask? You see, Mongo has a global write lock. Long running operations, such as a multi-update, yield the lock many times a second. However, if you are running at sufficient scale, they still hold the lock for long enough that contention begins to be a problem and queries start stacking up. As well, a multi-update will cause all the migrated records to be read, evicting hot data and further slowing down queries. So instead, at Stripe we perform all migrations at the application level. We iterate over all relevant objects, performing a single-record update for each. This allows us to slow down the pace and ensure that production traffic is not impacted.
Note that Mongo supports background index builds. Use these. You do have to be careful because, as a long-running operation, they can have production impact. However, we’ve found that these tend to be gentler than multi-updates, and we haven’t really seen issues from them.
In general, you should always be very thorough with your MongoDB usage and deployment. It turns out distributed systems are complicated. Don’t assume that your application can actually handle a particular failure case (e.g. what happens if my current primary suddenly becomes wedged but is still accepting TCP connections) — make sure you’ve actually tested it.
As well, you should be cognizant that MongoDB is under rapid development, which is great because it means new features are being added all the time. However, it also means that sometimes API changes happen, or sometimes regressions are introduced. Always wait at least a few weeks before upgrading to the latest version of MongoDB. And before you do, go and read the issues opened on that version, and ideally all of the issues resolved between your version and the newest one. It’s not very much fun, but it’s the best way to know what changes are in store for you.
I’ll close with a final point. Mongo gets a bad rap sometimes from people who use it incorrectly. E.g. people don’t put their driver into safe mode and then are surprised when they start losing writes. Make sure you’ve read the documentation, understand what the w and j parameters are, and that they’re tuned appropriately for your application. As well, pay attention to your system configuration (such as your ulimit on number of file descriptors) to make sure you don’t end up running into surprises down the road.
MongoDB provides a rich set of primitives for high availability. Through a little bit of code and a lot of proper administration technique, you can achieve essentially 100% uptime on your database, even in the presence of catastrophic machine failure. I’d be curious to hear how people have achieved this with other database systems — let me know if you have techniques you’d like to share.
by gdb at May 07, 2012 03:59 PM
For the past N months, it seems like there is no new technology stack that is either hotter or more controversial than node.js. node.js is cancer! node.js cures cancer! node.js is bad ass rock star tech!. I myself have given node.js a lot of shit, often involving the phrase “explicit continuation-passing style.”
Most of the arguments I’ve seen seem to center around whether node.js is “scalable” or high-performance, and the relative merits of single-threaded event loops versus threading for scaling out, or other such noise. Or how to best write a Fibonacci server in node.js (wat?).
I am going to completely ignore all of that (and I think you should, too!), and argue that node.js is in fact on to something really cool, and is worth using and thinking about, but for a reason that has absolutely nothing to do with scalability or performance.
node.js is cool because it solves a problem shared by virtually every mainstream language. That problem is the fact that, as long as “ordinary” blocking code is the default, it is difficult and unnatural to write networked code in a way that it can be combined with other network code, while allowing them to interact.
In most languages/environments — virtually every other language people use today — when you write networked code, you can either make it fully blocking itself, and implement your own main loop — which is almost always easiest — or you can pick your favorite event loop library (you probably have half a dozen choices), and write your code around that. If you do the latter, not only will your code likely be more awkward than if you chose the blocking main loop approach, but your only reward for the effort is that your code is combinable with the small fraction of other code that also chose the same event loop library.
The upshot of this situation is that if you pick a couple of random networked libraries written by different people — let’s say, an HTTP server, a Twitter client, and an IRC client, for example — and want to combine them — maybe you want a Twitter < -> IRC bridge with a web-based admin panel — you will end up at best having to write some awkward glue code, and at worst doing something truly hackish in order to make them communicate at all.
(In case you’re not convinced about the existence or scope of this problem, you can detour ahead to the optional example of how this problem manifests with typical Python libraries)
node.js solves this problem, somewhat paradoxically by reducing the number of options available to developers. By having a built-in event loop, and by making that event loop the default way to accomplish virtually anything at all (e.g. all of the builtin IO functions work asynchronously on the same event loop), node.js provides a very strong pressure to write networked code in a certain way. The upshot of this pressure is that, since essentially every node.js library works this way, you can pick and choose arbitrary node.js libraries and combine them in the same program, without even having to think about the fact that you’re doing so.
node.js is cool, then, not for any performance or scalability reasons, but because it makes composable networked components the default.
An interesting point here is that there is not really any fundamental differences between what you can do in Python and node.js. The Twisted project, for example, is basically an attempt to implement the node.js ideology in Python (yes, I’m being terribly anachronistic describing it that way). Twisted, however, has a fairly steep learning curve, and “feels” unnatural to developers used to writing “normal” Python, and so relatively few libraries get written for the Twisted environment, compared to the rest of the Python ecosystem. Twisted suffers from the fact that the Python language and Python community are not set up to make Twisted the default way to do things.
The key is that node.js makes it, both technically and socially, easier to write code in this composable way than not to do so. The built-in event loop and nonblocking primitives make it technically easy, and the social culture that has grown up around it discourages libraries that don’t work this way, so libraries that attempt to just block for IO are looked down on and are unlikely to thrive and gain adoption and development resources.
I’m also not ignoring the downside of the node.js style — the potentially convoluted callback-based style, the risk of bringing the whole world to a halt with an accidental blocking call, the single-threaded model that makes it hard to effectively exploit multiple cores. node.js definitely makes you do more work than you might otherwise have to, in many circumstances. But the key is, in exchange for this work, you get something really cool — and something much more valuable, in my opinion, than nebulous performance gains.
I also don’t want to claim that node.js is the only system, or the only technical approach, that makes this property possible. But node.js is the most successful such system I know of, and that’s worth at least as much as the technical possibility — what’s the use of being able to combine third-party libraries, if no one has written anything worth using in the first place?
And similarly, while the single-threaded callback model may not be the best possible model, it seems to have hit some sweet spot for finding a sweet spot in terms of what developers are willing to put up with. Certainly, people are writing node.js code like mad — check out the npm registry for a partial list.
So, node.js is not magic, and it definitely doesn’t cure cancer. But there is something here worth looking at. The next time you need to glue some unrelated networked services together, give node.js a shot – I think you’ll like it. And if you’re still not convinced, glue a quick HTTP frontend onto whatever you’ve created. I promise you’ll be shocked by how easy it is.
As an optional addendum, here’s a step-by-step discussion of how just bad the situation is in most other languages.
Let’s imagine we want to write a trivial Jabber -> IRC bridge: A bot that lurks in an IRC channel, and signs on to Jabber, and sends all the messages it receives on Jabber into the IRC channel. This is the kind of simple problem that can be described in one sentence, and sounds like it should take all of 20 lines of code, but actually turns out to be rather a nuisance.
Python has, by this point, a great library ecosystem, so we happily
start googling, and find the plausible looking python-irclib
and xmpppy libraries. Great. So, what does code in each of
those look like? Well, in python-irclib, we construct a subclass of
SingleServerIRCBot, and call the start function, which runs the
IRC main loop:
bot = MyBot(channel, nickname, server, port)
bot.start()
And in xmpppy, we construct an xmpp.Client object, and call
Client.Process in a loop, with a timeout:
conn = xmpp.Client(server)
# connect to the server
while True:
conn.Process(1)
Ok, so, we launch a thread for each one, and a few minutes of fumbling later, we’re connected to both Jabber and IRC. So far, so good. We’re using Python’s threads, which will inevitably bring us a world of pain, but I’ll ignore that for now, since a better threading implementation could fix most of the pain.
But now, what do we do when we receive a Jabber message? We want to
send a message out the python-irclib instance, but how do we do that?
python-irclib isn’t thread safe, so we can’t just call .send() from
the Jabber thread. Ok, so we add in a Queue.Queue, and have the
Jabber thread push messages onto it.
Now we just need to make the IRC thread fetch messages from this
queue. But how do we do that? The IRC thread is blocked somewhere deep
inside python-irclib, waiting for network traffic. How do we wake it
up to read messages from the queue? The easiest way is to switch from
calling start to calling process_once in a loop with a short
timeout.
This will work, and we’ll eventually get something working, but now we’re forced into polling, with all the annoying latency/CPU tradeoffs that entails, and also half of our code so far has just been spent gluing these two libraries together.
In node.js, on the other hand, we’d just instantiate client objects for both protocols, set up some event handlers, and … well, that’s about it. Because everything’s hooked into the same main loop, they’ll Just Work together, and because we’re all running single-threaded, we can mostly just communicate directly between the two libraries without having to think too hard about race conditions or anything.
The point, of course, is not that this is impossible to write this program in Python. It is, and I’ve done it, and it’s not that terrible. But any option you take will involve some annoying tradeoffs, and will involve making lots of irrelevant plumbing decisions about how to make your pieces play well together. And compared to all that, node.js feels like a breeze.
by nelhage at March 12, 2012 03:36 PM
Slow code loading is an insidious problem. It tends to creep upon a codebase so gradually you don’t notice it happening. First you find yourself waiting a second, then a few seconds, then maybe ten or more seconds every time you want to reload your code to test out some change.
I recently wrote a tool to profile Ruby code load times. The tool, called require-prof, is implemented as a simple wrapper over Ruby’s require and load. It spits out a list of the most expensive files to load in your project (correctly subtracting out the time spent recursively requiring other files), so you can go and easily tune your project.
I’ve now used require-prof to greatly improve code load times on a few code bases. In no particular order, here are a few points that I’ve noticed:
mime-types for one reason or another, though none of them actually use its code. On my laptop, loading mime-types takes about 200 milliseconds because it spends a bunch of time loading and registering all of its hardcoded mime types. I have a patched mime-types so that, if ENV["RUBY_MIME_TYPES_LAZY_LOAD"] is set to "true", it will lazily do the parsing at runtime. This drops the mime-types load time to a few milliseconds. require. Otherwise, it will search through your entire load path on every require, which can become pretty expensive. This may be most easily implemented using a wrapper around require, for example using a simple function such as:
def base_require(relpath)
require File.expand_path(File.join(File.dirname(__FILE__), relpath))
end
in a top-level file of your project.
require-prof to find your bottlenecks, make sure to load your code once to ensure it’s all in cache before paying attention to any of the numbers you see. mail gem is incredibly slow to load (about a second or two on my laptop). A lot of the slowness is due to loading its giant autogenerated parsers. I didn’t dig enough to both figure out how to make it load quickly and still not break, but in my projects I shifted over to dynamically loading it when needed. Once your code load time has been tuned to your satisfaction, the next most important operation is to ensure that you don’t end up back where you started a few months down the time. I’ve been considering adding alerts to my code that complain loudly whenever it takes over a certain amount of time to load, but I haven’t quite decided on the best way of doing so.
by gdb at February 06, 2012 07:50 PM
One of the most beautiful aspects of Stripe is how it lowers the barrier for software developers to build revenue-creating businesses. Before Stripe, programmatically accepting payments was largely reserved for large companies willing to put in the time and drudgery required to obtain a merchant account. It was easy to write applications, easy to deploy applications, but very hard to charge for applications .
When I decided to drop out of MIT and join Stripe, I didn’t really understand just how true this was. In my case, it really hit home when I was creating domaincli, a command-line client for registering domain names. I wanted to allow a user to run domaincli register gregbrockman.com, enter his or her credit card details, and then end up the proud owner of gregbrockman.com. As you might imagine, there are two principal challenges with making a tool like work: you need to plug in an API to actually register the domain names, and you need to plug in another API to let people pay for said domains.
Not knowing anything about domain registration APIs, or even whether they existed, I started Googling around to see what solutions were available. The results were quite surprising–it turns out that loads of registrars expose APIs. Seeing that Go Daddy had an offering (everyone’s favorite company, I know), I started poking around their product page. However, you had to pay several hundred dollars even to get access to the documentation, and I found rumors in forums that their API was a complicated XML-based mess. So I moved on to others’ reseller APIs.
One company that looked quite good was a Dutch registrar. Their documentation was reasonable (dare I say even good?), and I found several forum posts lauding their API. So I signed up for an account, played around in their test environment, and even PayPal’d in some funds for my account. But when I tried to get my live credentials, I ran into a popup notifying me that I had to request access to their production environment, and the request would be processed during normal business hours. This was on a Friday night, and I was planning on writing domaincli as my weekend project. So with a heavy heart, I realized this registrar wasn’t going to work.
So I started skimming through several other registrars, but it seemed that none of them offered instantaneous setup. I finally stumbled across a registrar who did, though their API was documented only by a pair of terribly-written client libraries. Though their WSDL worked for calls such as “check if this domain is registered”, I could never get their “register this domain” call to actually work.
Disheartened, I was about ready to give up. I dejectedly tried one more Google query, not expecting anything to turn up. But to my surprise, I ended up finding Internet.bs, a registrar with a documented JSON API and, more importantly, instant setup. Their API was a bit complicated, so it took some time to grok the relevant parameters and fully integrate, but in the end in didn’t take more than an hour or two.
After the wild west that was plugging in domain registration, it felt almost like cheating to just use Stripe for payments. I created an account, plugged in a create_token call to send the user’s card details directly to Stripe, and then added server-side create_customer and create_charge calls to store a card on file and charge it, respectively.
The rest was just adding some wiring around these two APIs. You can use the resulting tool by installing it via sudo pip install domaincli, or checking out the source on Github.
by gdb at January 23, 2012 06:06 PM
The above is a new Stripe WordPress plugin. I’m just running it in test mode; feel free to enter test card details such as those from https://stripe.com/docs/testing.
by gdb at January 20, 2012 07:51 PM
While there are a host of static analysis tools for statically-typed languages like C or Java, dynamic languages like Ruby are traditionally regarded as too dynamic to be able to do any useful analysis without actually running the code. While there are some rudimentary tools you can run over your Ruby codebase (such as druby, reek, and roodi), as far as I can tell there is no tool you can just point at a main file in your Ruby codebase and have it start telling you about your bugs. Frustrated with the state of existing tools, I sat down one weekend to write a proof-of-concept Ruby static analysis tool, which I creatively dubbed ruby-static-checker.
Probably my most common cause of error while programming in Ruby is a name error–caused for example by typoing a variable name or by forgetting to update the name of a class after a refactor. My initial iteration of ruby-static-checker supports checking for typoed variable names. Just run bin/ruby-static-checker , and it will start spitting possible name errors at you. Here’s how it works.
The chief design goal of ruby-static-checker is that it needed to be runnable on unmodified Ruby projects with only the specification of a single file (rather than the entire project). Rather than resolve the import graph by hand, I load the project into the static analysis tool as a library. This has the side benefit of making any load-time class generation trivial to analyze.
The rest of ruby-static-checker is walking over abstract syntax trees for all defined classes (using the pretty cool parse_tree gem). Since Ruby does not require parentheses in order to call a method, it’s perhaps nonobvious, but in fact it is statically resolvable whether a particular name refers to a local variable (in which case no name error is possible) or a function call (in which case a name error is possible). Ruby represents these two cases differently in the AST, and so the checking problem reduces to building a list of defined methods and checking against the calls that are made.
Of course, dynamic constructs like method_missing, Object.send, runtime imports, and load-time branching can cause both false positives and false negatives for this approach. I consider these issues an acceptable cost–I wanted a tool that would generally find bugs in reasonable codebases, and missing out on the more Byzantine corner cases seemed perfectly fine.
I’ve run ruby-static-checker over a couple of codebases, including Stripe’s. Though it usually fires off a number of false positives (usually almost entirely in library code), I’ve also found actual bugs with it. I’ve also found it fairly useful to make sure that ruby-static-checker has found no new name errors in between runs.
Anyway, the tool was written in a weekend, so there’s obviously tons more to do. I’m curious to see how well the load-as-library approach works for more advanced analyses.
by gdb at January 16, 2012 06:32 PM
It’s taken me a long time to fully appreciate the value of people.
In middle and high school, I spent gobs of time studying chemistry and working on my math research. Try as I might, I was unable to find any peers who had similar interests or inclinations. The result was that I ended up spending the vast majority of my time working in isolation. My study carrel at the library became a second home, and I became resigned to thinking that I was alone. While I wasn’t overjoyed at the thought that I could not find anyone with whom to collaborate or share a common struggle, I also wasn’t too distraught about it. That was just how things were, and I accepted that fact without question.
I got my first glimpse of the potential joys of working with others when starting college. At Harvard, I began working in a study group for my math class, the infamous Math 55. Each week the professor would rain ten excruciatingly difficult problems on our heads, and it was all but impossible to keep up with the work on our own. For the first few weeks of school, I worked alone as I usually would, pounding my head against the problems until they yielded. While I would ultimately triumph, my willpower and time to do work for other classes would both be all but depleted. However, a number of us gradually started spending more and more time working together on the harder problems, until before we knew it we had a full-fledged tradition of meeting and collaboratively working on the problems each week. And correspondingly, I stopped feeling quite so beat down when handing in a problem set–in fact, I felt something that was most akin to pride. I realized that having a team of smart collaborators had turned this once harrowing experience into a positive one, with each of us providing mental and emotional support to the others.
I tried to employ this model in other classes, never with much success. A number of collaborators would gather to work on our physics problem sets. However, I found that our group was too mixed in experience, diligence, and ability. A few of the members clearly were just there to copy answers, while some came without ever having looked at the problem set. I started dreading going to physics study group almost as much as I had dreaded working on my math problem sets alone, even though most of the members were extremely smart and dedicated. I realized that the composition of a group of collaborators needs to be precisely tuned, and there is simply no room for those who won’t hold up their end of the social contract.
By the time sophomore year rolled around, I had all but despaired of finding another study group like the one from Math 55. I began working alone in my room, but I found myself consumed with loneliness. I started trying to work in the dining hall, where at least there were other people around. And while that did make me feel a lot better, the underlying problem of working alone remained.
Though I didn’t really comprehend it at the time, in my other endeavors throughout college, the most pleasant were those which involved a number of other people working towards a common goal. When the Harvard Computer Society’s LDAP server blew up right after I became the Director of Systems, a group of four of us worked nonstop for the next eighteen hours to bring our services back online. I can’t think of a more cherished memory or of a time I felt more accomplished during my college years.
These days, working on Stripe, the vast majority of my day is spent working with other people. We discuss architecture and strategy; we review each others’ code; we hang out and talk about life. We are bound together by a common goal, an overarching purpose, a reason for existence: to fix payments for the web, and in the process build an awesome company. I’ve learned that I absolutely need to be around people in order to be happy, but they must be the right sort of people. In all our hiring, we’re very conscious of this fact, and we make sure to only bring on board people who will contribute in the right way to our culture.
It’s shocking to me how far my views have evolved. Four years ago I was sure that I preferred doing everything on my own, and that having a shared goal with others was just inefficient. Now, I’m sure that I prefer working in a team and being able to share not only the burden and responsibility but also the joy that come from driving towards an goal. I’ll be interested to see what my views are in another four years, but until then, I’ve figured out how to maximize my happiness while working.
by gdb at January 09, 2012 07:35 PM
For the first six months after joining Stripe, I slept on a mattress in the entryway bedroom of our five-bedroom house. We were getting a great bargain on rent, since the landlord was under the mistaken impression it was only a two-bedroom house. But what she didn’t realize was that room with the refrigerator in it is a perfectly valid bedroom. And that space where one might be tempted to put a dining room table or something equally wasteful also made a perfect living space. My room moonlighted as both a vestibule and a living room, which meant that people were always transiting through it, but having grown up next to train tracks I’ve developed the ability to sleep through arbitrary amounts of noise so it wasn’t a big deal. Admittedly, there wasn’t much space in the house to do anything but sleep, but that was all we wanted to use it for in any case.
The vast majority of my waking hours were spent at the office. I’d typically show up at 10 in the morning and leave at 2 (also in the morning). On days when I was feeling lazy and wanted to leave early, I’d usually head out at midnight.
I was loving every moment of this. At last, I was surrounded by exactly the kinds of people I wanted to be surrounded by, all the time. I was doing the kind of work I had been dreaming of doing. And best of all, it all felt like it was building up to some higher purpose, and that every incremental hour I put into Stripe was meaningfully increasing our probability of success.
My first day at work was spent being credentialed. Patrick created user accounts for me on the various third-party services we were using and gave me access on our servers. One thing I noticed was that the shared passwords for some of these third-party services were kept in cleartext on one of these servers. The security nerd in me cried out in protest, and I spent the next day building a service, password-vault, for encrypting shared passwords to people’s GPG keys on the client side and storing the encrypted password on the server. We started using password-vault immediately, and it’s still in use today. It was amazing to me that I could go from identifying a problem needing solving to having an adopted solution in so little time.
I worked in an analogous way on a day-to-day basis. There were a number of systems that clearly needed building, our server architecture needed some serious love, and we did not have T-shirts. Each day, I would decide the order of priority of these problems, and I’d choose when and how they would be solved. Sometimes another Stripe would push a task onto my plate, if it pertained to my area of interest or expertise, and sometimes I’d push a task onto theirs. On the whole, we operated as a decentralized network of independent nodes, with loose central coordination as we pushed for a single higher-level goal.
I think this workflow and work dynamic are really easy to have in a small startup, but they are just as easy to lose over time. As a company grows, communication becomes harder, and even knowing what other people are working on becomes a challenge. A lot of companies try to solve this by pigeonholing their employees into narrower roles, with direction and oversight from a manager. We’ve opted for a different approach, instead continuing to hire people who work well as independent nodes, and to push really hard to make sure people are communicating and have a clear, shared vision of where we need to go.
We don’t all live in the same house anymore, but we try to maintain the same feeling of closeness and camaraderie. One of the standards we apply when interviewing a potential Stripe employee is whether this person is someone who, if you knew he or she was alone in the office, would make you want to go to the office just to be around him or her.
I still spend basically every waking moment at the office working on Stripe. I’ll take a weekend every so often to hack on a side project. But I’ve worked hard to make sure that Stripe remains the most enjoyable and incrementally useful application of my time, and so my days remain happily ensconced in working on my latest project for Stripe. As we continue to grow, I think we’ll have to continue to work hard to maintain our culture and all the properties that make Stripe an awesome place to work, but I also am confident that being cognizant of this challenge is half the battle.
by gdb at December 19, 2011 05:56 PM
I made the decision to drop out of MIT on a Thursday.
I spent most of that Friday meeting with my professors. I wanted to explain to them why they weren’t going to be seeing me in classes any more, and for those whom I knew well to say goodbye. Their reactions were mixed, but all were very heartfelt. My advisor thought I was making a huge mistake, that now was the time for me to be in academia, and that I could always go do a startup in the future. My AI professor was enthusiastically supportive, commenting that now is the time for me to be learning and exploring and that a startup would be a great place to do that. Another of my professors told me about his own experiences, both good and bad, with startups he’d founded; he gave me advice about what to watch out for. I also met with a dean at Student Support Services, who actually submitted my request to go on a leave of absence.
Saturday was the day I packed up my life. I grabbed some boxes from the UPS store, and within the next eight hours there was no sign I had even inhabited my room. For some reason I happen to have a picture of me with said boxes:
Sunday I spent hanging out with the friends I was going to be leaving behind.
And then on Monday, I was on a plane to California.
It struck me as insane how I was able to so completely alter the course of my life in such a short window. Within the span of four days, I had transitioned from being a MIT student, thinking of little more than how I was going to complete the next problem set, to working full time on a startup in Silicon Valley. Usually lives are much less malleable than that. Going abroad after high school, starting college, and transferring to MIT were all major life events that had required months of advance planning. But somehow I managed to pull off the transition smoothly.
Looking back, I’m still quite surprised that I left school at all. I had always pictured myself as the kind of person who wanted absolute certainty in decisions, who would try to gather all available data and spend weeks pondering it before making a decision. But the timescales here had been far too small to make a fully rational decision, and I had made my decision off of the sense that the others at /dev/payments were exactly the kinds of people I had been trying for years to find. For the first time in my life, I had explicitly made a choice based off of gut rather than rationality.
Since then, I think my views on gut instinct have changed significantly. My decision to join Stripe (which is what we renamed /dev/payments to) turned out to be the best decision I’ve made thus far in my life. Looking back at the other life choices I’ve made, I’ve realized that for the most part the outcome after all my thought and information gathering matched my initial gut reactions. I’ve also been running recruiting at Stripe for a while now, and I’ve noticed that my initial feelings about a candidate tend to translate extremely well into whether they end up being a good fit for us or not.
To some extent, this mentality is necessary in a startup. Being able to rapidly make decisions and act quickly is perhaps the chief advantage that you have by being a startup. Knowing when to tune out the desire for full rationality and instead trust one’s instincts is a skill that I think I’m still honing, but all evidence seem to indicate that it’s well worth having.
by gdb at December 12, 2011 12:04 AM
Fall 2010 marked the beginning of my junior year and second semester at MIT. At the end of the summer, I had sat down and planned my goals for the next few years. I had a breakdown for the next semester, the next year, and the next few after that as well, and I had compiled the results in a notes file. (This codification of who I am and what I am trying to accomplish turned out to be hugely beneficial, and since then I’ve made sure to do the same every six months or so.)
My short-term plan was to start working on a research project, start getting more involved in major open-source projects, and build more side projects. On the whole, I wrote, “I think the plan is still to get as good as I can at the things that I do.” I wanted to learn, but I had recognized that class was not the place that I was going to learn how to build.
In the longer term, I wanted to do a startup. I figured it’d be a long time before I was confident enough in my abilities that it would make sense to do so. And so in the meanwhile, I was going to finish up my undergraduate, do a masters at MIT, and then start grad school, where hopefully I would come up with a cool technology to build a company around.
I began my preparations for the semester. I came up with an idea for a research project that I was extremely excited about (static buffer overrun detection), and I recruited three of my friends to work on this with me. And I was going through the course catalog picking out classes for the next semester, I was overjoyed when I noticed a startup class (entitled “Founder’s Journey”). I immediately signed up for the class.
My computer science classes that semester were far better than the ones I had been taking previously, but still they left me hungering for more. I would typically work intensely for a few days finishing the assignments for the next few weeks, which left me a lot of time to work on my own projects.
My research project, unfortunately, never really got off the ground. My friends and I would read papers and meet weekly with a graduate student, but one by one my friends found that they had other obligations which ended up taking priority over the project. By the end of the first month, I was the only one left. I was disappointed, because I had really been looking forward to working with others towards building something awesome, but I continued onwards anyway.
The startup class also ended up falling short of my hopes. I had been expecting it to be a cookbook for how to start a company. I anticipated practical advice, such as here’s when you start talking to a lawyer and this is what you have to watch out for when taking investment. Instead, it was a much fuzzier class than I was looking for, talking a lot about entrepreneurial thinking and reasoning. After a few weeks, I came to the realization that even outside of computers, the things that I wanted to learn were not going to be taught to me in a classroom; I ended up dropping the course.
But I had not given up on learning the parts of doing a startup outside of writing code. I resolved to spend more time meeting people who were working on companies and to talk with them and how they were approaching problems. I also planned to try tracking how these approaches correlated with success of these people’s startups.
So when I received a Facebook message from a person doing a startup who wanted to meet up, I accepted. I had no intention of joining the guy’s company, but I wanted to get started with my practical learning about how to build a startup as soon as possible.
The next day, we met up in the MIT student center. He began to weave the story of the startup he was building. They were building a developer-focused payments company, called /dev/payments at the time. He showed me the product they had built thus far, and demonstrated running a curl request to create a sample payment. I really could not have been more underwhelmed. I didn’t really get it–why was building a curlable API an interesting company? I could build a curlable API from my dormroom.
And then he mentioned the investors who were involved–Peter Thiel, Elon Musk, Sequoia, Y Combinator, and a host others. I hadn’t heard of all the names he mentioned, but I’ll admit that the ones who I had heard of made me rescind my judgement that he was not working on anything interesting. I didn’t have any payments experience, but they had some well-reputed investors who clearly did. Presumably those investors would be able to distinguish just another API from something truly revolutionary. (Incidentally, I’ve since decided that this was wrong–investors are very good at evaluating team and market, but they don’t tend to judge based on product nearly as much as I assumed.)
When I returned to my room, I began searching for all of the information I could find on /dev/payments (which was not much; they had done a pretty good job of staying under the radar) and the founders (of which there was a considerable amount; they had already built and sold a previous company and had a lot of cool code on Github). I also pinged Corey, the only person I knew in the startup scene in Silicon Valley, and asked if he had heard of the company or the founders before. A few days later, he got back to me, saying that he had reached out into his network and heard a bunch of extremely positive things. As well, it turned out the founders and I had some mutual friends; they all confirmed the positivity.
It was at that point that I began seriously considering this startup as a potential place to work rather than just a guinea pig. I accepted an offer to fly out for a weekend. And within five minutes of stepping into their office, I was flattened by the realization that these were the exact kinds of people I wanted to work with. I can’t remember a word of the conversation we had, but I can recall very clearly the feeling of belonging it generated, like I had finally found my people.
Everything about the company seemed perfect. It was still unlaunched, though it had some beta customers who were loving the product. It had only one full-time employee (plus a part-time office manager), and I would be the first engineering hire. And best of all, rather than putting me up in a hotel for the weekend, I was invited to crash with the founders. My ideal picture of a startup had long been finding a group of people who I loved spending time around and focusing singlemindedly on solving a real problem. I wanted to spend my time living and breathing the problem, working and hanging out and existing with my coworkers, and generally letting loose the intensity had been lacking an outlet in school. And it was clear that this company was exactly that.
I went back to school in time for class on Monday and started thinking very hard. I consulted with tons of people. Many thought I was crazy to even think about leaving school for a startup. Many thought I was crazy to leave school for this startup (especially knowing so little about the product). But a few people realized what I was after and saw that I wasn’t finding it at MIT and thought I might not be completely insane.
By Thursday night, I was about ready to make a decision. I was leaning towards, but I wasn’t quite sure I was ready to make the leap. One of the founders happened to be in Boston, and we met up at the Asgard. We spent a long time chatting about the company, about life, about the universe, and really about everything. And I realized then that what really mattered to me and made me willing to take this leap was that it was clear that the people from /dev/payments were looking out for not just their best interests, but my own as well. And so, my fears allayed, I decided to put school on hold and enter the startup world.
by gdb at December 03, 2011 09:53 AM
My first semester at MIT was one of the most eye-opening experiences of my life. Having transferred from Harvard, I felt like I had just moved into either a playground or an alien planet (or possibly an alien playground). There was some amount of becoming acclimated to a new physical campus, a new way of registering for classes (which culminated in me wandering around the Sloan Business School for hours, trying to figure out why no one could find my registration), and a fresh set of faces. But those weren’t the interesting bits. For me, the truly fascinating part of being at MIT was finding all the parts of the college experience I hadn’t even imagined I’d been missing.
For example, I soon realized that each living group had its own persistent subculture, usually with tight alumni ties. One floor in my dorm was full of gamers, another populated by mathematicians. At East Campus, I found two halls with cultures that strongly appealed to me, and I started spending decent amounts of time hanging around those halls trying to passively experience what they had to offer. I avoided the West Campus dorms at all costs, remembering well the experience that had driven me away from MIT in the first place.
In the circles I frequented, geek culture was rampant, if not dominant. It was rare that a conversation did not include a computer science joke of some form. At parties people would shoot each other with nerf guns. I found that with very little effort, I could surround myself by computer scientists and spend all my time living and breathing tech.
My biggest dissatisfaction with the move was that I did not have a research project. Since first semester of freshman year, I had been planning on doing research, but for one reason or another there was always something in the way. Research had always struck me as a fascinating concept (“how could one spin knowledge out of nothing?”), and CS research all the more so (“they’ll let me publish a paper on a coding project I’m hacking on for fun?!?”), and it would also be necessary for me to get into a good graduate school. So I began speaking with professors and found one whom I really liked and who had a project that seemed quite interesting.
About this time, Jessica and I had SIPB chair and vice-chair initiation. This was nothing formal–really just an evening of hanging out with former chairs and vice-chairs and listening to their stories of the past few years. As they told the stories and dispensed advice on running the organization and its projects, I realized that these people were exactly the kinds of people I had transferred to MIT to learn from and surround myself with. I had already known from reputation that they were technically brilliant, but I hadn’t expected they’d be willing to pass on their knowledge, nor that I’d enjoy spending time with them quite so much.
As it turned out, these alumni were all working at a startup called Ksplice, which had been founded by a group of SIPB members a few years earlier. One of the Ksplicers had been trying to recruit me as an intern, but I had fended him off because between research and classes and becoming acclimated to MIT I knew I’d have no time to work at a startup. But after that evening, I knew I had to do whatever it took to work with them. Even if it meant giving up research for the semester–graduate school positioning could wait.
So I told my professor I would not in fact be able to work on the project, and I began working part-time at Ksplice. My initial project was a fancy testing framework for rebootless updates. Over the course of a month, I wrote it in a very magical way, having tons of library functions that would apply heuristics to their arguments to coerce them into the correct types, and I think I even worked in some uses of getattr().
As I submitted my work for review, I felt quite proud of myself. So it was a bit of a shock when my reviewer’s comments came back saying that my code was basically unusable. As I read his comments in detail, I was even more shocked to realize how right he was. He pointed out that my interfaces, by being so broad, made it basically impossible to tell what kinds of arguments they actually accept, much less what they are intended to do. And as well, he pointed out that I should clean up my commit history, which was something that I had not even known was possible. So I spent another week going back and cleaning up my code and revision history, polishing it until no signs of my previous design were visible.
The rest of my time at Ksplice proceeded in much the same manner. (I ended up staying through the summer.) Every time I submitted code for review, every time I interacted with a member of the team, I felt myself learning and gaining new perspective on what it means to write good code. It wasn’t always smooth sailing, but from a technical perspective it was exactly what I had been looking for.
I was at the same time working on projects, sometimes on my own and sometimes with other SIPB members. My largest project of the semester was Anygit, which attempted to reverse the Git object model and support queries mapping a SHA1 to all parent objects or repositories containing that SHA1. I spent many hours architecting and rearchitecting its components, trying to build it in a way that would scale to hold all of the Git objects in the world. I taught myself a lot of Git internals as well as researched various datastores, learned how to diagnose database performance issues, and just generally explored an ecosystem I’d never really touched before.
In comparison, my technical learning through classes was felt paltry. The classes I was taking were good, but they weren’t great. I would voraciously wolf down the material, and find myself disappointed that there was no more to consume. I was required to take the introductory computer science class, 6.01, and though I was fortunate enough to be able to convince the professor to let me get credit for being a Lab Assistant, I still found myself deeply unsatisfied with how I was spending my time academically.
On the whole though, I don’t think I had ever been happier. I had finally found a home, surrounded myself with the kinds of people I had been seeking, and was completely immersed in computer science. I was exploring a new environment, learning the physics of this new world, and at the same time learning more about myself. My one regret was that I still hadn’t picked up a research project, but that’s what next semester was for, right?
by gdb at November 26, 2011 09:12 AM
Someone wrote a Stripe plugin for WordPress. Stripe gets a lot of requests for a WordPress plugin, so I’m pretty excited about it. I’m writing this post to test it out:
by gdb at November 23, 2011 08:01 PM
Once I had made the decision to transfer to MIT, my planning turned from high level strategy to implementation details. First and foremost, I needed a place to live. The MIT housing office sent pointed me to an application form with four dropdowns for you to select your top four dorms. I put East Campus for all four of them, and I added a heartfelt note about how much I wanted to be in EC.
Secondly, I needed to decide how to spend my January. Harvard had just shifted their schedule so that finals were before winter break, adding a J-term that coincided with MIT’s IAP. I looked into doing on some CS research at Harvard, but I realized that all I wanted to with my J-term was start poking around MIT and get acclimated to my new home.
About this time, an email arrived on the hcs-jobs mailing list from an MIT professor looking for some students to build a prototype of his startup idea over IAP. It sounded perfect–I would get to spend time around MIT, I would get to know a professor, and I would do all this while working on a cool project. I sent in an extremely excited application email, and I was overjoyed when I found out that I had been hired.
Once I was fully committed to joining the physical community, I wanted to join the virtual one as well. I started hanging around Zephyr, an MIT-internal chat system. I was amazed at how people whom I had briefly met at SIPB meetings started subscribing to my personal class (kind of the analog of a Facebook wall) and there engaging in fascinating conversations, technical and otherwise.
My MIT housing didn’t kick in until the end of IAP, so my original IAP plan had been to stay in the Harvard dorms, where I had procured special permission to be allowed to remain on campus (due to budget constraints, Harvard was trying to kick everyone out of the dorms for J-term). The day before I was going to head home for Christmas break, the “J-Term Housing Committee” emailed me to say that, because I had no housing for the spring term, and because students were to stay in their spring-time housing, I was actually not allowed to stay in the dorms over J-term.
I began to panic. I thought about appealing the decision, but since I was about to head home, I had no time to do so. I brought up my conundrum on Zephyr, and I was inspired by the number of people who jumped in to say that I should swing by their living group, which would be happy to take me in over IAP. It was at that moment that I knew I had made the right decision in transferring. Harvard had chosen to kick me out even when it had agreed to let me stay, and the decision maker was the faceless “J-Term Housing Committee”. At MIT, my soon-to-be peers were the gatekeepers, and they were all so excited about their living groups that they wanted to show them off to a total stranger.
That night, I stopped by pika for dinner. The residents took me on a tour, and they told me they’d be in touch shortly about whether I could stay over IAP. And sure enough, a few days later, I received an email saying that I was welcome to do so.
My Christmas break was full of SIPB project work. Mitch, a maintainer of the scripts project, set me up with a project with which I found myself enthralled. Mitch taught me various technical skills, such as how to do Fedora packaging, and he was on the whole a font of knowledge and advice–exactly what I had been looking for at MIT. I also started hacking on XVM, trying to make its packages usable outside of MIT’s infrastructure.
Having found myself spending nearly every waking moment working on SIPB projects or Zephyring with members of the SIPB community, I knew that I had to run for an officer position in SIPB whenever the first opportunity to do so would be. I wanted to make a difference, to improve the community as much as it was already improving me, and to help other people join and become as involved as I. I started asking around and was told that officer elections were coming up at the end of January. The one problem–only full members were eligible to run, which would mean I’d have to be elected to membership before then.
Soon the beginning of January rolled around and I headed back to MIT. On weekdays, I’d work 8-10 hours at the professor’s office, go hang out in the SIPB office for another few hours, and then head back to pika. On weekends I got to spend the entire day at SIPB. I made a point of spending some time in the SIPB office every single day. I loved being there, meeting the people I had barely dared to dream existed, and the worst part of every day was when I had to leave the office and head back home.
After a few weeks, I was elected to full membership. I was thrilled–I could now run for office! The next week was officer elections. When nominations for chair were called, someone called out my name. Along with two other candidates, I went up to the front of the room and answered a barrage of questions about how I would run SIPB. We were sent to the hall while the votes were tallied, and Jessica (incidentally the person who had told me to check out pika) ended up winning. The next position was vice-chair. I was again nominated, and this time after the votes were tallied I was shocked to see that I had in fact won.
On this happy note, the next semester began. I had finally found the community I had been looking for. I was working on awesome projects, and once again there were mentors to teach and guide me along the way. And I was at the same time in a position to give back, to help cultivate and shape the community that was at the same time molding me. I had truly found paradise.
by gdb at November 21, 2011 08:41 AM
On Monday, November 21, 2011, all of the scripts.mit.edu servers will be upgraded from Fedora 13 to Fedora 15, which was released on May 25. We strongly encourage you to test your website as soon as possible, and to contact us at scripts@mit.edu or come to our office in W20-557 if you experience any problems. The easiest way to test your site is to run the following commands at an Athena workstation and then visit your website in the browser that opens, but see this page for more details and important information: athena% add scripts
athena% firefox-test
by Alexander Chernyakhovsky at November 06, 2011 02:00 AM
For those of you who do a lot of Debian upgrade path testing…
I got annoyed at equivs not supporting Breaks: lines at all (Debian bug #571638), so I wrote a small shell script to spit out the minimal debhelper 7 rules file-using package. You can then edit debian/control the way you’d edit an equivs control file and run debuild, or you can continue writing a package for actual software, since it’s a real package and not just an equivs control file.
I also got annoyed at apt not really being able to install packages files on local disk, so I wrote an even tinier shell script to add all packages (recursively) in the current directory to an apt repo and add it via the file protocol to sources.list. It turns out dpkg-scanpackages does this, it’s just easier to have the two-line shell script to remember for you what the command is and what the sources.list line should be.
newpackage and fakerepo are in my Athena home directory. See the comments at the top for usage instructions. I might find a better place for these at some point, but I don’t anticipate ever having to update them…
by geofft at August 16, 2011 01:01 AM
I’ve posted the final slides from my talk this year at DEFCON and Black Hat, on breaking out of the KVM Kernel Virtual Machine on Linux.
[Edited 2011-08-11] The code is now available. It should be fairly well-commented, and include links to everything you’ll need to get the exploit up and running in a local test environment, if you’re so inclined.
In addition, as I mentioned, this bug was found by a simple KVM fuzzer I wrote. I’m also going to clean that up and release it, but don’t expect it too soon.
I had a great time meeting lots of interesting people at BlackHat and DEFCON, some that I’d met online and others I hadn’t. If any of you are ever in Boston, drop me a note and we can grab a beer or something.
by nelhage at August 08, 2011 05:32 PM
Our whitelist of file types served directly to the web has for a long time included .doc, .xls, and .ppt. With the advent of new XML-based Microsoft Office formats, and with the popularity of LibreOffice and OpenOffice.org, there have been requests for whitelisting these additional file types. As of yesterday, the new Office XML filetypes — .docx, .xlsx, .pptx, etc. — as well as ODF file types — .odt, .ods, .odp, etc. — will also be served directly to the web.
In addition to files you place in your locker, this also affects files uploaded to your website via the standard upload feature of apps such as MediaWiki and WordPress.
The full list of whitelisted extensions is posted in our FAQ.
by geofft at June 09, 2011 04:24 AM
The answer to this was much less obvious than I expected it to be, so I figured I’d blog it. If you’re using Google’s ClientLogin API for turning a username and a password into an auth token, what do you do with the token? The result of POSTing to /accounts/ClientLogin gets you three values, and Google says to “reference” the Auth=foo line and discard the rest.
The answer is to send back a header of the form “Authorization: GoogleLogin auth=foo”. (That is to say, your authentication token doesn’t go anywhere in the POST data, which is what I tried for a while…)
Incidentally, I’m doing this because I’m playing with Google Cloud Print. I successfully created a cloud printer printed a PDF from my phone to, uh, a shell wget command. More on that later.
by geofft at May 13, 2011 06:41 AM
If you program in Python, you’re probably familiar with the
pickle serialization library, which provides for efficient
binary serialization and loading of Python datatypes. Hopefully,
you’re also familiar with the warning printed prominently near the
start of pickle‘s documentation:
Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
Recently, however, I stumbled upon a project that was accepting and unpacking untrusted pickles over the network, and a poll of some friends revealed that few of them were aware of just how easy it is to exploit a service that does this. As such, this blog post will describe exactly how trivial it is to exploit such a service, using a simplified version of the code I recently encountered as an example. Nothing in here is novel, but it’s interesting if you haven’t seen it.
The vulnerable code was a Twisted server that listened over SSL. The code looked roughly like the following:
class VulnerableProtocol(protocol.Protocol): def dataReceived(self, data): # Code to actually parse incoming data according to an # internal state machine # If we just finished receiving headers, call verifyAuth() to check authentication def verifyAuth(self, headers): try: token = cPickle.loads(base64.b64decode(headers['AuthToken'])) if not check_hmac(token['signature'], token['data'], getSecretKey()): raise AuthenticationFailed self.secure_data = token['data'] except: raise AuthenticationFailed |
So, if we just send a request that looks something like:
AuthToken: <pickle here>
The server will happily unpickle it.
So, what can we do with that? Well, pickle is supposed to allow us
to represent arbitrary objects. An obvious target is Python’s
subprocess.Popen objects — if we can trick the target
into instantiating one of those, they’ll be executing arbitrary
commands for us! To generate such a pickle, however, we can’t just
create a Popen object and pickle it; For various mostly-obvious
reasons, that won’t work. We could read up on the “pickle” format and
construct a stream by hand, but it turns out there is no need to.
pickle allows arbitrary objects to declare how they should be
pickled by defining a __reduce__ method, which should
return either a string or a tuple describing how to reconstruct this
object on unpacking. In the simplest form, that tuple should just
contain
pickle will pickle each of these pieces separately, and then on
unpickling, will call the callable on the provided arguments to
construct the new object.
And so, we can construct a pickle that, when un-pickled, will execute
/bin/sh, as follows:
import cPickle import subprocess import base64 class RunBinSh(object): def __reduce__(self): return (subprocess.Popen, (('/bin/sh',),)) print base64.b64encode(cPickle.dumps(RunBinSh())) |
At this point, we’ve basically won. We can run arbitrary shell commands on the target, and there are any number of ways we could bootstrap from here up to an interactive shell and whatever else we might want.
For completeness, I’ll explain what I did, since it’s a moderately
cute trick. subprocess.Popen lets us select which file descriptors
to attach to stdin, stdout, and stderr for the new process by passing
integers for the stdin and similarly-named arguments, so we can open
our /bin/sh on arbitrarily-numbered fd’s.
However, as mentioned above, the target server uses Twisted, and it serves all requests in the same thread, using an asynchronous event-driven model. This means we can’t necessarily predict which file descriptor on the server will correspond to our socket, since it depends on how many other clients are connected.
It also means, however, that every time we connect to the server, we’ll open a new socket inside the same server process. So, let’s guess that the server has fewer than, say, 20 concurrent connections at the moment. If we connect to the server’s socket 20 times, that will open 20 new file descriptors in the server. Since they’ll get assigned sequentially, one of them will almost certainly be fd 20. Then, we can generate a pickle like so, and send it over:
import cPickle import subprocess import base64 class Exploit(object): def __reduce__(self): fd = 20 return (subprocess.Popen, (('/bin/sh',), # args 0, # bufsize None, # executable fd, fd, fd # std{in,out,err} )) print base64.b64encode(cPickle.dumps(Exploit())) |
We’ll open a /bin/sh on fd 20, which should be one of our 20
connections, and if all goes well, we’ll see a prompt printed to one
of those. We’ll send some junk on that fd until we manage to get the
original server to error out and close the connection, and we’ll be
left talking to /bin/sh over a socket. Game over.
Again, nothing here should be novel, nor would I expect any of these
pieces to take a competent hacker more than few minutes to figure out,
given the problem. But if this blog post teaches someone not to use
pickle on untrusted data, then it will be worth it.
by nelhage at March 20, 2011 10:38 PM
reptyr (announced recently on this blog) takes a
process that is currently running in one terminal, and transplants it
to a new terminal. reptyr comes from a proud family of similar
hacks, and works in the same basic way: We use ptrace(2)
to attach to a target process and force it to execute code of our own
choosing, in order to open the new terminal, and dup2(2) it over
stdout and stderr.
The main special feature of reptyr is that it actually changes the
controlling terminal of the target process. The “controlling terminal”
is a concept maintained by UNIX operating systems that is independent
of a process’s file descriptors. The controlling terminal governs
details like where ^C gets delivered, and how applications are
notified of changes in window size.
Processes are grouped into two levels of hierarchical groups: sessions, and process groups. Each group is named by an ID, which is the PID of the initial leader (either “session leader” or “process group leader”). Even if the leader exits, that number is still the ID for the group. Sessions are used for terminal management — Every process in a session has the same controlling terminal, and each terminal belongs to at most one session. Process groups are a sub-division within sessions, and are used primarily for job control within the shell. For a more in-depth explanation, see part 3 of my earlier series on termios.
If you check out tty_ioctl(4), you’ll find that Linux has an
ioctl, TIOCSCTTY, that can be used to set the controlling terminal
of a process, and you could be forgiven for thinking that all we need
is to make the target call that ioctl, and we’re done.
However, if we read closer, we find that it has several restrictions. In particular:
The calling process must be a session leader and not have a controlling terminal already. If this terminal is already the controlling terminal of a different session group then the ioctl fails with EPERM […]
In the typical case, where I’m trying to attach a (say) mutt that
you spawned from your shell, mutt won’t be a session leader — your
shell will be the session leader, and mutt will be the process group
leader for a process group containing only itself.
So, we need to make the target a session leader. Conveniently, there’s
a system call for that: setsid(2).
However, reading that man page, we find a new caveat: setsid(2)
fails with EPERM if
The process group ID of any process equals the PID of the calling process. Thus, in particular, setsid() fails if the calling process is already a process group leader.
The shell creates a new process group for every job you launch, and so
our target mutt will be a process group leader, and unable to
setsid(). The usual solution for programs that want to setsid is to
fork(), so that the child is still in the parent’s session and
process group, and then setsid() in the child. However, fork()ing
our mutt and killing off the parent seems potentially disruptive, so
let’s see if we can avoid that.
So, we’re going to need to change mutt‘s process group ID, so that
there are no processes with process group IDs equal to its
PID. Following some trusty SEE ALSO links, we get to
setpgid(2). There’s a bunch of text in that man page, but the key
bit is:
If setpgid() is used to move a process from one process group to another, both process groups must be part of the same session (see setsid(2) and credentials(7)). In this case, the pgid specifies an existing process group to be joined and the session ID of that group must match the session ID of the joining process.
We need to find a process group in the same session as mutt to move
our mutt into, and then we’ll be able to setsid. We could try to
find one — the shell is a plausible candidate, for instance — but
there’s an alternate, more direct route: Create one.
While we have mutt captured with ptrace, we can make it fork(2)
a dummy child, and start tracing that child, too. We’ll make the child
setpgid to make it into its own process group, and then get mutt
to setpgid itself into the child’s process group. mutt can then
setsid, moving into a new session, and now, as a session leader, we
can finally ioctl(TIOCSCTTY) on the new terminal, and we win.
It turns out I didn’t invent this technique — injcode and neercs work the same way. But I did discover it independently of them, and it was a fun little hunt through unix arcana.
by nelhage at February 09, 2011 03:06 AM
Over the last week, I’ve written a nifty tool that I call reptyr. reptyr is a utility for taking an existing running program and attaching it to a new terminal. Started a long-running process over ssh, but have to leave and don’t want to interrupt it? Just start a screen, use reptyr to grab it, and then kill the ssh session and head on home.
You can grab the source, or read on for some more details.
There’s a shell script called screenify that’s been going
around the internet for nigh on 10 years now that is supposed to use
gdb to accomplish the same thing. There’s also a project called
retty that tries to do the same thing, in C using ptrace()
directly.
The difference between those programs and reptyr is that reptyr works much, much, better.
If you attach a less using screenify or retty, it will still take
input from the old terminal. If you attach an ncurses program, and
resize the window, the program probably won’t resize correctly. ^C
and ^Z will still be processed on the old terminal — typing them in
the new terminal won’t do anything useful.
reptyr fixes all of these problems and more, and is the only such tool I know of that does so. I’ve never seen a program that doesn’t behave noticeably incorrectly after attaching with retty or screenify, whereas with reptyr most programs I have tried work flawlessly.
reptyr works in the same basic way as screenify and retty — it
attaches to the target process using the ptrace API, opens the new
terminal, and dup2s it over the old file descriptors. It also copies
the termios settings from the old terminal to the new terminal.
The main thing that reptyr does that no one else does is that it
actually changes the controlling terminal of the process you are
attaching. This is the detail that makes many things Just Work,
including ^C and ^Z and window resizing.
Switching the target’s controlling terminal is not easy and involves a
fair bit of trickery with ptrace and Linux’s terminal APIs. I will
probably do another blog post some time about the dirty details of how
I make this work, but for now you can check out
attach.c if
you really want to know.
reptyr still has a number of limitations — it doesn’t generally work, for example, if the target process has any children. I know how to fix most of these problems, though, so expect it to get better with time. Please let me know if you find it useful!
(Edited to add:) Nothing is really new. A commenter on reddit pointed out that injcode
and neercs both accomplish the same thing, even using the same trick
to change the CTTY. Ah well, I had run writing it anyways, and apparently I
wasn’t the only one who didn’t know about the existing alternatives. neercs is a full screen replacement, though, and I think that reptyr should be more robust than injcode — I use a different techique for ptrace-hijacking, for example — and so hopefully this tool still has a niche as a more robust standalone utility. Certainly, judging from the amount of enthusiasm I’ve seen for this tool, this still isn’t a problem that is solved to the average user’s satisfaction.
by nelhage at January 22, 2011 01:56 AM
Powered by Planet Venus!
Last updated: May 22, 2013 06:07 PM