I have spent many years as a software engineer who was a total outsider to machine learning, but with some curiosity and occasional peripheral interactions with it. During this time, a recurring theme for me was horror (and, to be honest, disdain) every time I encountered the widespread usage of Python pickle in the Python ML ecosystem. In addition to its major security issues, the use of pickle for serialization tends to be very brittle, leading to all kinds of nightmares as you evolve your code and upgrade libraries and Python versions.
Suppose we’ve got a service. We’ll gloss over the details for now, but let’s stipulate that it accepts requests from the outside world, and takes some action in response. Maybe those requests are HTTP requests, or RPCs, or just incoming packets to be routed at the network layer. We can get more specific later. What can we say about its performance? All we know is that it receives requests, and that it acts on them.
What’s the “right” level of CPU utilization for a server? If you look at a monitoring dashboard from a well-designed and well-run service, what CPU utilization should we hope to see, averaged over a day or two? It’s a very general question, and it’s not clear it should have a single answer. That said, for a long time, I generally believed that higher is always better: we should aim for as close to 100% utilization as we can.
Ever since its introduction in the 2017 paper, Attention is All You Need, the Transformer model architecture has taken the deep-learning world by storm. Initially introduced for machine translation, it has become the tool of choice for a wide range of domains, including text, audio, video, and others. Transformers have also driven most of the massive increases in model scale and capability in the last few years. OpenAI’s GPT-3 and Codex models are Transformers, as are DeepMind’s Gopher models and many others.
In my day job at Anthropic, we run relatively large distributed systems to train large language models. One of the joys of using a lot of computing resources, especially on somewhat niche software stacks, is that you spend a lot of time running into the long-tail of bugs which only happen rarely or in very unusual configurations, which you happen to be the first to encounter. These bugs are frustrating, but I also often enjoy them.
One of the annoying things about scraping websites is bouncing back and forth between the browser where you are using Dev Tools to work out what selectors you should be using to scrape out data, and your actual scraping script, which is usually some batch program that may have to take a few steps before the step you are debugging. A batch script is fine once your scraper is up and running, but while developing, it's really handy to pause the scraping process at some page and fiddle around with the DOM to see what to do.
This interactive-style development is exactly what Jupyter notebooks shine at; when used in conjunction with a browser-based scraping library like Puppeteer, you can have exactly this workflow. Here's the setup:
There will be a live browser instance which you can poke at using Dev Tools, and you type commands into the Jupyter notebook and see how they affect the browser state.
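The post's setup uses Puppeteer from a JavaScript kernel; as a rough Python analogue of the same workflow (a sketch using Playwright's async API and Jupyter's top-level await, not the exact setup described above), the notebook side can look something like this:

# In a Jupyter cell (Jupyter allows top-level await):
from playwright.async_api import async_playwright

p = await async_playwright().start()
browser = await p.chromium.launch(headless=False)   # a live, visible browser you can inspect with Dev Tools
page = await browser.new_page()
await page.goto("https://example.com")

# Now poke at the page from the notebook while inspecting it in the browser:
title = await page.title()
links = await page.query_selector_all("a")
print(title, len(links))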
I tweeted about this and the commenters had some good suggestions about other things you could try:
by Edward Z. Yang at November 23, 2021 02:28 PM
CPU cycles are cheaper than they have ever been, and cloud computing has never been more ubiquitous. All the major cloud providers offer generous free tiers, and services like GitHub Actions offer free compute resources to open-source repositories. So why do so many developers still build software on their laptops? Despite the embarrassment of riches of cheap or even free cloud compute, most projects I know of, and most developers, still do most of their software development — building and running code — directly on their local machines.
Last week, Frederic Cambus wrote about building LLVM quickly on some very large machines, culminating in a 2m37s build on a 160-core ARM machine. I don’t have a giant ARM behemoth, but I have been working on a tool I call Llama, which lets you offload computational work – including C and C++ builds – onto Amazon Lambda. I decided to see how well it could do at a similar build.
I'm launching a new podcast, the PyTorch Developer Podcast. The idea is to be a place for the PyTorch dev team to do bite sized (10-20 min) topics about all sorts of internal development topics in PyTorch. For now, it's just me monologuing for fifteen minutes about whatever topic I decide. The plan is to release an episode daily, five days a week, until I run out of things to say (probably not for a while, I have SO MANY THINGS TO SAY). I don't edit the podcasts and do minimal planning, so they're a bit easier to do than blog posts. Check it out! There are two episodes out already, one about how we do Python bindings for our C++ objects and another about the history and constraints of the dispatcher. If there are any topics you'd like me to cover, give a shout.
by Edward Z. Yang at May 05, 2021 03:26 PM
At Facebook, we have an internal convention for tooling called "rage". When something goes wrong and you want to report a bug, the tool developer will typically ask you to give them a rage. For a command line tool, this can be done by running a rage subcommand, which will ask about which previous CLI invocation you'd like to report, and then giving you a bundle of logs to send to the developer.
A rage has an important property, compared to a conventional log level flag like -v: rage recording is always on. In other words, it is like traditional server application logs, but applied to client software. Logging is always turned on, and the rage subcommand makes it easy for a user to send only the relevant portion of logs (e.g., the logs associated with the command line invocation in question).
For some reason, rage functionality is not that common in open source tools. I can imagine any number of reasons why this might be the case:
Still, in the same way most sysadmins view logging as an invaluable tool for debugging server issues, I think rage reporting is an invaluable tool for debugging client issues. In ghstack, it didn't take very many lines of code to implement rage reporting: ghstack.logs (for writing the logs to the rage directory) and ghstack.rage (for reading it out). But it has greatly reduced my support load for the project; given a rage, I can typically figure out the root cause of a bug without setting up a reproducer first.
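As a minimal sketch of the idea (not ghstack's actual code; the directory layout and names here are made up), always-on logging can be as simple as writing every invocation's log to a per-run file that a rage subcommand can later list and pick from:

import logging, os, sys, time, uuid

RAGE_DIR = os.path.expanduser("~/.cache/mytool/rage")

def setup_always_on_logging():
    # Every invocation gets its own log file, unconditionally.
    os.makedirs(RAGE_DIR, exist_ok=True)
    log_path = os.path.join(RAGE_DIR, "%d-%s.log" % (int(time.time()), uuid.uuid4().hex[:8]))
    logging.basicConfig(filename=log_path, level=logging.DEBUG)
    logging.debug("argv: %s", sys.argv)   # record the invocation itself
    return log_path

def rage():
    # List previous invocations, newest first, so the user can pick one to send.
    for i, name in enumerate(sorted(os.listdir(RAGE_DIR), reverse=True)):
        print(i, name)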
by Edward Z. Yang at April 25, 2021 04:03 AM
People who work with me tend to realize that I have Opinions about databases, and SQL databases in particular. Last week, I wrote about a Postgres debugging story and tweeted about AWS’ policy ban on internal use of SQL databases, and had occasion to discuss and debate some of those feelings on Twitter; this article is an attempt to write up more of them into a single place I can refer to.
PyTorch is a fairly large and active open source project, and sometimes we have people come to us and ask if there are any lessons from how we run PyTorch that they could apply to their own projects. This post is an attempt to describe some of the processes as of 2021 that help PyTorch operate effectively as an open source project. I won't claim that everything we do is necessarily the best way to go about doing things, but at the very least, everything I describe here is working in practice.
Background. Not all open source projects are the same, and there are some peculiarities to PyTorch which may reduce the applicability of some of what I describe below in other contexts. Here are some defining features of PyTorch, as a project:
Alright, so how does PyTorch deal with its scale? Here are some of the things we do.
Issue triage. PyTorch receives too many bug reports a day for any one person to keep track of all of them. Largely inspired by this apenwarr post, we set up an oncall rotation amongst Facebook contributors to serve as first line triage for all of these issues. The golden rule of issue triage is that you DO NOT fix bugs in triage; the goal of triage is to (1) route bugs to the correct people via appropriate GitHub labels, and (2) look for high priority bugs and raise awareness of these bugs. Every week, we have a meeting to review high priority bugs (and other bugs marked for triage review) and talk about them. The oncall itself rotates daily, to discourage people from letting a week's worth of issues pile up in the backlog, and we use a relatively intricate search query to make sure only relevant issues show up for the oncall to handle.
The most important consequence of issue triage is that you can unwatch the PyTorch repository as a whole. Instead, by watching various labels (using our cc bot), you can trust that you will get CC'ed on issues related to the topics you care about, even if the triager doesn't know that you're interested in the issue! The weekly meeting makes sure that all maintainers collectively have an idea about what major issues are currently affecting PyTorch, and helps socialize what we as a project think of as a "high priority" issue. Finally, the high priority label is a good way to find impactful problems to work on in the project, even if you don't know much else about the project.
Pull request triage. Similarly, we receive a decent number of drive-by pull requests from one time contributors. Those people are not in a good position to find reviewers for their contributions, so we also have a triager look through these pull requests and make sure someone is assigned to review them. If the PR is particularly simple, the triager might just go ahead and merge it themselves. There's actually some good automation for doing this (e.g., homu) but we've been too lazy to set any of it up, and by-hand reviewer assignment doesn't seem to be too much of a burden on top of the existing oncall.
Tree hugging oncall. PyTorch has a huge CI system covering many different system configurations which most contributors rely on to test if their changes are safe. Sometimes people break master. Separate from the triage oncall, we have a tree hugging oncall whose job it is to revert commits that break master. This oncall involves mostly paying attention to the CI HUD and reverting commits if they result in master breakage in one of the configurations.
Importing to Facebook infrastructure. We actually run Facebook infrastructure directly off of the HEAD branch in PyTorch. The tooling that makes this possible is fbshipit, which mirrors commits between Facebook's internal monorepo and our public GitHub repository. This setup has been something of a double-edged sword for us: requiring Facebook and GitHub to be in sync means that only Facebook employees can actually land pull requests (we try to streamline the process as much as possible for external maintainers, but at the end of the day someone at Facebook has to actually push the green button), but it means we don't have to worry about doing periodic "mega-imports" into Facebook infrastructure (which we have done in the past and were quite difficult to do). We are very interested in fixing this situation and have floated some proposals on changing how we do internal releases to make it possible to let external contributors land PRs directly.
RFCs. Most feature discussion happens on GitHub issues, but sometimes, a feature is too big and complicated to adequately discuss in a GitHub issue. In those cases, they can be discussed in the rfcs repository (inspired by the Rust RFCs process). The formal process on this repository isn't too solidified yet, but generally people go there if they feel that it is too difficult to discuss the issue in GitHub issues. We don't yet have a process for shepherding unsolicited RFCs.
Conclusion. PyTorch's open source process isn't rocket science: there's an oncall, the oncall does some things. The devil is in the details: all of PyTorch's oncall responsibilities are carefully scoped so that your oncall responsibilities aren't something that will take an unbounded amount of time; they're something you can knock out in an hour or two and call it a day. You could make the argument that we rely excessively on oncalls when automation is possible, but what we have found is that oncalls require less infrastructure investment, and integrate well with existing processes and flows at Facebook. They might not be right everywhere, but at least for us they seem to be doing a good job.
by Edward Z. Yang at January 06, 2021 04:56 PM
Years ago, Nadav Rotem related to me this story about why basic block procedures in Swift are not as good as they seem. Nelson Elhage reminded me about this on Twitter and so I thought this should be put into the public record.
Basic block procedures make certain optimizations more difficult. Consider this program:
block j3 (%y1, %y2) { ... }
block j1 () { jump j3(%x1, %x2) }
block j2 () { jump j3(%x3, %x4) }
Is this program easier or more difficult to optimize than the traditional SSA with phi-nodes formulation?
L1: goto L3
L2: goto L3
L3: %y1 = phi [%x1, %L1] [%x3, %L2]
    %y2 = phi [%x2, %L1] [%x4, %L2]
Suppose that the optimizer determines that y1 is unused inside j3/L3 and can be eliminated. In basic block land, y1 can be eliminated simply by deleting "y1 = phi x1 x3". However, in join point land, you have to not only eliminate y1 but also update all the call sites of j3, since you've changed the function signature. In a mutable AST, changing function signatures is a pain; in particular, the mutations you would have to do to eliminate the argument include intermediate states that are not valid ASTs (making it easy to accidentally trigger asserts.)
When I saw this example, I wondered why GHC (which has the moral equivalent of basic block procedures in the form of join points) didn't have this problem. Well, it turns out this optimization can be done as a series of local transformations. First, we do a worker/wrapper transformation, introducing an intermediate block (the worker) that drops the dead argument:
block j3 (%y1, %y2) { jump wj3(%y2) }
block j1 () { jump j3(%x1, %x2) }
block j2 () { jump j3(%x3, %x4) }
block wj3 (%y2) { ... }
Later, we inline j3, which removes the wrapper. Worker/wrapper is a very important optimization for functional programs, but it's easy to imagine why it is less preferred in mutable compiler land.
by Edward Z. Yang at October 24, 2020 11:34 PM
One of the features I miss most in non-Haskell programming languages is algebraic data types (ADT). ADTs fulfill a similar role to objects in other languages, but with more restrictions: objects are an open universe, where clients can implement new subclasses that were not known at definition time; ADTs are a closed universe, where the definition of an ADT specifies precisely all the cases that are possible. We often think of restrictions as a bad thing, but in the case of ADTs, the restriction of being a closed universe makes programs easier to understand (a fixed set of cases to understand, as opposed to a potentially infinite set of cases) and allows for new modes of expression (pattern matching). ADTs make it really easy to accurately model your data structures; they encourage you to go for precise types that make illegal states unrepresentable. Still, it is generally not a good idea to try to manually reimplement your favorite Haskell language feature in every other programming language you use, and so for years I've suffered in Python under the impression that ADTs were a no go.
Recently, however, I have noticed that a number of new features in Python 3 have made it possible to use objects in the same style as ADTs, in idiomatic Python with virtually no boilerplate. The key features: dataclasses (which take the boilerplate out of defining constructor classes), Union types (for combining those constructors into a single type), and mypy's isinstance-based type refinement (which gives us exhaustive pattern matching).
The key idea: define each constructor as a dataclass, put the constructors together into an ADT using a Union type, and use isinstance tests to do pattern matching on the result. The result is just as good as an ADT (or better, perhaps; their structural nature bears more similarity to OCaml's polymorphic variants).
Here's how it works. Let's suppose that you want to define an algebraic data type with two results:
data Result = OK Int | Failure String

showResult :: Result -> String
showResult (OK result) = show result
showResult (Failure msg) = "Failure: " ++ msg
First, we define each constructor as a dataclass:
from dataclasses import dataclass

@dataclass(frozen=True)
class OK:
    result: int

@dataclass(frozen=True)
class Failure:
    msg: str
Using the automatically generated constructors from dataclasses, we can construct values of these dataclasses using OK(2) or Failure("something wrong"). Next, we define a type synonym for the union of these two classes:
from typing import Union

Result = Union[OK, Failure]
Finally, we can do pattern matching on Result by doing isinstance tests:
from typing import NoReturn

def assert_never(x: NoReturn) -> NoReturn:
    raise AssertionError("Unhandled type: {}".format(type(x).__name__))

def showResult(r: Result) -> str:
    if isinstance(r, OK):
        return str(r.result)
    elif isinstance(r, Failure):
        return "Failure: " + r.msg
    else:
        assert_never(r)
assert_never is a well known trick for doing exhaustiveness checking in mypy. If we haven't covered all cases with enough isinstance checks, mypy will complain that assert_never was given a type like UnhandledCtor when it expected NoReturn (which is the uninhabited type in Python).
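For instance (continuing the example above; the Timeout class here is hypothetical and not part of the original post), forgetting to handle a newly added constructor makes the exhaustiveness check fail:

# Building on the OK/Failure dataclasses and the imports defined above.
@dataclass(frozen=True)
class Timeout:
    seconds: float

Result = Union[OK, Failure, Timeout]

# Without a new isinstance branch in showResult, mypy now flags the
# assert_never(r) call, because r can still be Timeout on the final branch.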
That's all there is to it. As an extra bonus, this style of writing unions is compatible with the structured pattern matching PEP, if it actually gets accepted. I've been using this pattern to good effect in our recent rewrite of PyTorch's code generator. If you have the opportunity to work in a statically typed Python codebase, give this style of code a try!
by Edward Z. Yang at October 14, 2020 06:08 PM
If this is your first time reading about PyTorch internals, you might want to check out my PyTorch internals post first. In this post, I want to talk about one particular part of PyTorch's internals: the dispatcher. At a first glance, the dispatcher is just a glorified if statement: based on some information about the tensor inputs, decide what piece of code should be called. So why should we care about the dispatcher?
Well, in PyTorch, a lot of things go into making an operator work. There is the kernel that does the actual work, of course; but then there is support for reverse mode automatic differentiation, e.g., the bits that make loss.backward() work. Oh, and if you run your code under torch.jit.trace, you can get a trace of all the operations that were run. Did I mention that if you run these operations inside a vmap call, the batching behavior for the operators is different? There are so many different ways to interpret PyTorch operators, and if we tried to handle all of them inside a single function named add, our implementation code would quickly devolve into an unmaintainable mess. The dispatcher is not just an if statement: it is a really important abstraction for how we structure our code internally in PyTorch... and it has to do so without degrading the performance of PyTorch (too much, anyway).
At the end of this post, our goal will be to understand how all the different parts of this picture fit together. This post will proceed in three parts.
First, we'll talk about the dispatcher itself. What is the dispatcher, how does it decide what kernel to call? Second, we'll talk about the operator registration API, which is the interface by which we register kernels into the dispatcher. Finally, we'll talk about boxing and unboxing, which are a cross-cutting feature in the dispatcher that let you write code once, and then have it work on all kernels.
OK, so what is the dispatcher? For every operator, the dispatcher maintains a table of function pointers which provide implementations for each dispatch key, which corresponds roughly to one of the cross-cutting concerns in PyTorch. In the diagram above, you can see there are dispatch entries in this table for backends (CPU, CUDA, XLA) as well as higher-level concepts like autograd and tracing. The dispatcher's job is to compute a dispatch key, based on the input tensors and some other stuff (more on this shortly), and then do an indirect jump to the function pointed to by the table.
Those of you who are familiar with C++ may observe that this table of function pointers is very similar to virtual tables in C++. In C++, virtual methods on objects are implemented by associating every object with a pointer to a virtual table that contains implementations for each virtual method on the object in question. In PyTorch, we essentially reimplemented virtual tables, but with some differences:
Fun historical note: we used to use virtual methods to implement dynamic dispatch, and reimplemented them when we realized we needed more juice than virtual tables could give us.
So how exactly do we compute the dispatch key which we use to index into the dispatch table? The basic abstraction we use for computing what dispatch key to use is a dispatch key set, which is a bitset over dispatch keys. The general concept is that we union together dispatch key sets from various sources (and in some cases mask out some dispatch keys), giving us a final dispatch key set. Then, we pick the first dispatch key in the set (dispatch keys are implicitly ordered by some priority) and that is where we should dispatch to. What are these sources?
There is also a local exclude set, which is used to exclude dispatch keys from dispatch. A common pattern is for some handler to handle a dispatch key, and then mask itself off via the local exclude set, so we don't try reprocessing this dispatch key later.
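As a rough illustration of that computation (a minimal Python sketch with made-up keys and priorities, not PyTorch's actual C++), the final choice is just a set union, an exclusion, and a highest-priority pick:

# Made-up keys and priorities; higher number means dispatched first.
PRIORITY = {"Autograd": 4, "Tracer": 3, "CUDA": 2, "CPU": 1}

def compute_dispatch_key(tensor_keys, local_include, local_exclude, global_keys):
    candidates = (set(tensor_keys) | set(local_include) | set(global_keys)) - set(local_exclude)
    return max(candidates, key=PRIORITY.__getitem__)

# A CPU tensor with Autograd in the global set dispatches to Autograd first:
print(compute_dispatch_key({"CPU"}, set(), set(), {"Autograd"}))         # Autograd
# ...and once Autograd is locally excluded, the same call lands on CPU:
print(compute_dispatch_key({"CPU"}, set(), {"Autograd"}, {"Autograd"}))  # CPU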
Let's walk through the evolution of dispatch key through some examples.
(Warning: This description is out-of-date for PyTorch master. Instead of Autograd being in global, it is instead on the Tensor. Everything else proceeds as before.)
The most canonical example of the dispatch machinery in operation is how it handles autograd. Read the diagram from the top to the bottom. At the very top, Autograd is in the global set, and the local exclude set is empty. When we do dispatch, we find autograd is the highest priority key (it's higher priority than CPU), and we dispatch to the autograd handler for the operator. Inside the autograd handler, we do some autograd stuff, but more importantly, we create the RAII guard AutoNonVariableTypeMode, which adds Autograd to the local exclude set, preventing autograd from being handled for all of the operations inside of this operator. When we redispatch, we now skip the autograd key (as it is excluded) and dispatch to the next dispatch key, CPU in this example. As local TLS is maintained for the rest of the call tree, all other subsequent dispatches also bypass autograd. Finally, in the end, we return from our function, and the RAII guard removes Autograd from the local exclude set so subsequent operator calls once again trigger autograd handlers.
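Here is a small runnable Python sketch of that flow (a stand-in for the C++ RAII guards; the two-entry table and the names are illustrative, not PyTorch's real machinery):

from contextlib import contextmanager

PRIORITY = {"Autograd": 2, "CPU": 1}
local_exclude = set()

@contextmanager
def exclude_key(key):
    # Plays the role of AutoNonVariableTypeMode / ExcludeDispatchKeyGuard:
    # the key is excluded for the dynamic extent of the with-block.
    local_exclude.add(key)
    try:
        yield
    finally:
        local_exclude.discard(key)

def dispatch(table, *args):
    key = max(table.keys() - local_exclude, key=PRIORITY.__getitem__)
    return table[key](*args)

def add_cpu(x, y):
    return x + y

def add_autograd(x, y):
    # (record whatever autograd needs here, then redispatch with
    # Autograd masked out so we fall through to the CPU kernel)
    with exclude_key("Autograd"):
        return dispatch(add_table, x, y)

add_table = {"Autograd": add_autograd, "CPU": add_cpu}
print(dispatch(add_table, 1.0, 2.0))   # 3.0, via the Autograd handler and then CPU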
Another similar example is tracing, which is like autograd in that when we enter the tracing handler, we disable tracing for nested calls with ExcludeDispatchKeyGuard. However, it differs from autograd in how tracing is initially triggered: tracing is toggled by a dispatch key that is added to the local include set when you turn on tracing (with IncludeDispatchKeyGuard), as opposed to the global dispatch key used by Autograd (Update: now a dispatch key on tensors).
One final example is the BackendSelect key, which operates a little differently from normal keys. The problem backend select solves is that sometimes, the default dispatch key set calculation algorithm doesn't know how to work out what the correct dispatch key should be. One notable case of this is factory functions, which don't have any Tensor arguments (and so, naively, would not dispatch to anything). BackendSelect is in the global dispatch key set, but is only registered for a few operators (for the rest, it is a fallthrough key). The BackendSelect handler inspects the arguments and decides what the final dispatch key should be, and then does a direct dispatch to that key, bypassing dispatch key calculation.
The slide summarizes some of the most common sequences of handlers that get processed when dispatching some operation in PyTorch. Most of the time, it's autograd, and then the backend (with a backend select in-between if you are a factory function). For XLA, there is also an XLAPreAutograd key (Update: This key is now simply AutogradXLA) which can be used to override the behavior of the Autograd key. And of course, if you turn on every feature in PyTorch all at once, you can end up stopping at a lot of handlers. Notice that the order in which these handlers are processed matters, since handlers aren't necessarily commutative.
So we talked a lot about how we decide what function pointers in the dispatch table to call, but how do these pointers get in the dispatch table in the first place? This is via the operator registration API. If you have never seen this API before, you should take a look at the Dispatcher in C++ tutorial, which describes how the API works at a very high level. In this section, we'll dive into more detail about how exactly the registration API maps to the dispatch table. Below, you can see the three main ways of interacting with the operator registration API: you define schemas for operators and then register implementations at dispatch keys; finally, there is a fallback method which you can use to define a handler for all operators at some dispatch key.
To visualize the impact of these registration operators, let us imagine that the dispatch tables for all operators collectively form a grid, like this:
On one axis, we have each operator supported in PyTorch. On the other axis, we have each dispatch key we support in our system. The act of operator registration involves filling in cells with implementations under these two axes.
When we register a kernel for a single operator at a specific dispatch key, we fill in a single cell (blue below):
When you register a kernel as a "catch-all" kernel for all dispatch keys in an operator, you fill in an entire row for the operator with one kernel (red below). By the way, if this seems like a strange thing to want to do, it is! And we're working to remove this capability in favor of more specific fills for a subset of keys.
When you register a kernel as a fallback for a single dispatch key, you fill in the column for that dispatch key (green).
There's a precedence to these registrations: exact kernel registrations have the highest precedence, and catch-all kernels take precedence over fallbacks.
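A tiny sketch of that precedence (illustrative Python, not the real registration API):

exact = {}       # (op, dispatch_key) -> kernel
catch_all = {}   # op -> kernel (fills a whole row)
fallback = {}    # dispatch_key -> kernel (fills a whole column)

def lookup(op, key):
    # Exact registrations beat catch-alls, which beat backend fallbacks.
    if (op, key) in exact:
        return exact[(op, key)]
    if op in catch_all:
        return catch_all[op]
    return fallback[key]

exact[("add", "CPU")] = "add_cpu_kernel"
catch_all["add"] = "add_catch_all"
fallback["XLA"] = "xla_fallback"

print(lookup("add", "CPU"))   # add_cpu_kernel
print(lookup("add", "XLA"))   # add_catch_all
print(lookup("mul", "XLA"))   # xla_fallback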
I want to spend the last part of this post talking about the boxing and unboxing facilities in our dispatcher, which turn out to be pretty important for enabling backend fallback. When you are a programming language designer, there is a classic tradeoff you have to make in deciding whether or not you want to use a boxed or unboxed representation for data:
A boxed or homogenous representation is a data representation where every type of object in your system has the same layout. Typically, this means you have some representation that has a header describing what the object in question is, and then some regular payload after it. Homogenous representations are easy to work with in code: because you can always assume that data has some regular layout, you can write functions that work polymorphically over any type of data (think of a function in Java that takes in an arbitrary Object, for example). Most garbage-collected languages have some boxed representation for heap objects, because the garbage collector needs to be able to work over any type of heap object.
In contrast, an unboxed or heterogenous representation allows objects to have a different layout depending on the data in question. This is more efficient than a homogenous representation, as each object can tailor its internal representation to exactly what is needed for the task at hand. However, the downside is we can no longer easily write a single function that works polymorphically over many types of objects. In C++, this problem is worked around using templates: if you need a function to work on multiple types, the C++ compiler will literally create a new copy of the function specialized to each type it is used with.
C++ defaults to heterogenous layout, but we have implemented homogenous layout in PyTorch by way of the IValue struct (short for interpreter value), which implements a boxed representation that we can use in our interpreter. An IValue is a two word structure consisting of a payload word (usually a pointer, but it could also be an integer or float directly packed into the field) and a tag word which tells us what kind of value the IValue is.
This means we have two calling conventions for functions in PyTorch: the usual, C++, unboxed convention, and a boxed convention using IValues on a stack. Calls (from end users) can come from unboxed API (direct C++ call) or boxed API (from the JIT interpreter); similarly, kernels can be implemented as direct C++ functions (unboxed convention), or can be implemented as a boxed fallback (which by necessity is boxed, as they are polymorphic over all operators).
If I call from the boxed API into a boxed fallback, it's easy to see how to plug the two components together...
...but how do I get from the unboxed API to the boxed fallback?
We need some sort of adapter to take the unboxed inputs and turn them into IValues so that they can be passed via the boxed calling convention. This is done via a boxing adapter, which is automatically generated using C++ templates working off of the unboxed C++ types in the outward facing API.
There is also an inverse problem: what to do if we have inputs from the boxed API and need to call into an unboxed kernel. Similarly, we have an unboxing adapter, which performs this translation. Unlike the boxing adapter, this adapter is applied to the kernel itself, since C++ templates only work at sites where the unboxed type is statically available (at the boxed API site, these types are not known, so you literally cannot implement this). Note that we always keep the unboxed API around, so that if a user calls in from the unboxed API, we can fastpath straight to the unboxed kernel.
So here is what boxing and unboxing look like overall:
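(The original slide is not reproduced here; as a rough stand-in, here is a minimal Python sketch of the two adapters. PyTorch's real versions are generated C++ templates, and the names below are illustrative.)

from dataclasses import dataclass

@dataclass
class IValue:
    tag: str
    payload: object

def unboxing_adapter(unboxed_kernel):
    # Wraps an unboxed kernel so it can be called with a stack of IValues.
    def boxed_entry(stack):
        result = unboxed_kernel(*(iv.payload for iv in stack))
        return IValue(type(result).__name__, result)
    return boxed_entry

def boxing_adapter(boxed_kernel):
    # Wraps a boxed kernel (or fallback) so it can be called from the
    # unboxed API: box the arguments, call, unbox the result.
    def unboxed_entry(*args):
        stack = [IValue(type(a).__name__, a) for a in args]
        return boxed_kernel(stack).payload
    return unboxed_entry

def add_unboxed(x, y):   # an ordinary "unboxed" kernel
    return x + y

add_boxed = unboxing_adapter(add_unboxed)   # callable with a stack of IValues
add_roundtrip = boxing_adapter(add_boxed)   # callable from the unboxed API again
print(add_roundtrip(1.0, 2.0))              # 3.0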
Boxing and unboxing are a key feature in the implementation of boxed fallback: without them, we could not let people write single kernels which would work everywhere (and indeed, in the past, people would write code generators to generate repetitive kernels for every function). With template-based boxing and unboxing, you can write a single boxed kernel, and then have it work for operators, even if those operators are defined externally from the library.
So that's PyTorch's dispatcher in a nutshell! The dispatcher is still being continuously worked on; for example, Ailing Zhang recently landed a rework of how autograd dispatch keys are handled, which means that we actually no longer have a single Autograd key but have split autograd keys for AutogradCPU/AutogradCUDA/... We're generally interested in improving the user experience for people who register kernels to the dispatcher. Let us know if you have any questions or comments!
by Edward Z. Yang at September 10, 2020 06:29 PM
For the longest time, I thought implicit parameters and dynamic scoping were basically the same thing, since they both can be used to solve similar problems (e.g., the so-called "configuration problem" where you need to plumb some configuration deep into a nested body of function definitions without threading it through explicitly). But implicit parameters have a reputation of being something you shouldn't use (use reflection instead), whereas dynamic scoping via the reader monad is a useful and well understood construct (except for the bit where you have to monadify everything). Why the difference?
Oleg points out that implicit parameters are not really dynamic scoping, and gives an example where Lisp and Haskell disagree. And you don't even want the Lisp behavior in Haskell: if you think about the operational notion of dynamic scoping (walk up the stack until you find a binding site of the dynamic variable), it's not very compatible with laziness, since a thunk (which accesses a dynamic variable) will be forced at some unpredictable point in program execution. You really don't want to have to reason about where exactly a thunk will be executed to know how its dynamic variables will be bound, that way lies madness. But somehow, in a strict language, no one has trouble figuring out what should happen with dynamic scoping (well, mostly--more on this shortly).
It turns out that the research community has figured out the difference: implicit parameters are a coeffect. I believe this was first observed in Coeffects: Unified static analysis of context-dependence (a more modern presentation is in Coeffects: A calculus of context-dependent computation; and a more Haskelly presentation can be found in Embedding effect systems in Haskell). That said, Tomas was commenting on my blog in 2012 about similar ideas, so this had probably been in the works for a while. The key point is that for some coeffects (namely, implicit parameters), call-by-name reduction preserves types and coeffects, and so implicit parameters do not blow up in your face in the same way dynamic scoping (an effect) would. These necessarily behave differently! Type classes are coeffects too, and this is why modern use of implicit parameters in Haskell explicitly acknowledges this (e.g., in the reflection package).
At this year's ICFP, I was pointed at an interesting technical report about implicit values and functions in Koka, a new twist on dynamic scoping. I found myself wondering if Haskell implicit parameters could learn a thing or two from this work. Implicit values make the good choice of defining implicit values globally at the top level, so that they can participate in normal module namespacing, as opposed to an un-namespaced bag of dynamically scoped names (this is also an improvement that reflection makes over implicit parameters). But actually, it seems to me that implicit functions are taking a page from implicit parameters!
The big innovation of implicit functions is that they resolve all dynamic references in the function (not just lexically, but for all further dynamic calls) to the lexical scope (the dynamic scope at the time the function was defined), producing a function that has no dependence on implicit values (aka, no effect saying that the implicit value must be defined at the time the function is called). This is exactly what an implicit parameter let ?x = ... binding would have done, in effect directly filling in the dictionary for the implicit function at definition site, rather than waiting. Very contextual! (Of course, Koka implements this using algebraic effects, and gets to the right semantics with a very simple translation anyway.) The result is not exactly dynamic scoping, but as the TR says, it leads to better abstraction.
It is difficult to see how implicit values/functions could make their way back into Haskell, at least without some sequencing construct (e.g., a monad) lurking around. Though implicit functions behave much like implicit parameters, the rest of the dynamic scoping (including the binding of the implicit function itself) is just good old effectful (not coeffectful) dynamic scope. And you can't just do that in Haskell, without breaking type preservation under beta-reduction and eta-expansion. Haskell has no choice but to go all the way, and once you get beyond the obvious problems of implicit parameters (which reflection fixes), things seem to mostly work out.
by Edward Z. Yang at August 27, 2020 05:51 AM
Summary: Read about my efforts to solve the game of Ultimate Tic Tac Toe. It’s been a fun journey into interesting algorithms and high-performance parallel programming in Rust. Backstory. Starting around the beginning of the COVID-19 lockdown, I’ve gotten myself deeply nerdsniped by an attempt to solve the game of Ultimate Tic Tac Toe, a two-level Tic Tac Toe variant which is (unlike Tic Tac Toe) nontrivial and contains some interesting strategic elements.
I've recently been working on a revamp of how we specify tensor shape formulas in PyTorch. As part of this process, I classified every single operator in PyTorch by its shaping behavior; yes, that's all 1364 of them (this includes each variant of an operator; e.g., inplace and out= keyword variants). During the process, I tried to come up with categories to help classify what operators did. One of the surprises from the process was discovering that shaping behaviors that I previously thought were uncommon, actually showed up a bit more often than one might have expected.
These categories are interesting in their own right and can be used to help understand how PyTorch's API fits together. Here are all the categories I devised.
TensorIterator (505, e.g., add, sum) operators are PyTorch's bread and butter; these operators do pointwise operations and reductions and support broadcasting and type promotion. The name TensorIterator refers to an internal abstraction we have in PyTorch for implementing these operations; you can read more about it on the wiki and in this blog post. TensorIterator is a real workhorse in PyTorch: the plurality (though not majority) of operators are implemented in this way! Note that this category includes some functions that used equivalent, legacy functionality (but did not exactly use TensorIterator).
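For a quick illustration of the broadcasting and type promotion these operators provide (standard PyTorch behavior, shown here as a small example rather than anything from the original dataset):

import torch

a = torch.randn(3, 1)
b = torch.randn(1, 4)
print(torch.add(a, b).shape)   # torch.Size([3, 4]): broadcasting

i32 = torch.tensor([1, 2, 3], dtype=torch.int32)
f32 = torch.tensor([0.5, 0.5, 0.5])
print((i32 + f32).dtype)       # torch.float32: type promotion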
Fixed (273, e.g., convolution, addbmm) operators are operators which only work on a fixed number of dimensions. This assumption makes writing efficient kernels a lot easier, as indexing math is simple with fixed dimensionality. (For example, TensorAccessor is an internal class which lets you view a tensor at fixed dimensionality known at compile time). Sometimes, the first dimension is treated as a batch dimension, but not always (unfortunately, I didn't distinguish these cases in my dataset). Some fixed operators actually support multiple dimensions, but only a fixed number of them; for example, because we only support 1-3D convolutions, this counts as fixed. (Compare with this FeatureBatched, below!)
N-Dimensional (107, e.g., squeeze, index_add, tensordot) operators are operators which work generically on tensors of arbitrary dimensionality. These are the operations for which it is difficult to write generic shaping rules in symbolic form, as you need a language that can talk about list manipulations. An important subclass of N-dimensional operators are Identity (42, e.g., clone, contiguous; not included in the count above) operators, which work over arbitrary dimensionality but always return a tensor with the same size as their input. Another subclass are Flatten (11, e.g., take, bucketize) operators, which accept tensors of any dimensionality but always treat them as 1D tensors internally.
Composite (95, e.g., kl_div, isfinite) operators are implemented in other operators, and don't themselves have shape checking (instead, they rely on the operations they call to check shapes). Note this category is probably a bit underreported, as in some cases when it was obvious what the underlying behavior of an operator was, I classified the operator as that category, rather than Composite.
Batched (94, e.g., nll_loss, adaptive_avg_pool2d) operators are like fixed dimensionality operators, except they accept an arbitrary number of batch dimensions at their beginning. Many fixed operators should be batched operators; others cannot be converted into batched operators without introducing ambiguity as to where the batch dimensions end. Compare these with FeatureBatched (19, e.g., batch_norm, embedding) operators, which are like batched operators, but rather than accept batch dimensions at the beginning, they accept an arbitrary number of feature dimensions at the end.
Factory (90, e.g., empty) operators produce new tensors without having any tensor inputs.
Trivial (59, e.g., size, is_floating_point) operators aren't actual tensor operations, but ways to return non-Tensor information or access internal data structures.
Sparse (40) operators are special because their size calculations take account of both dense and sparse dimensions.
Dynamic (15, e.g., unique) operators produce outputs whose shapes depend on the data of their input tensors.
Variadic (14, e.g., cat) operators take multiple input tensors; similar to N-dimensional operations, they are difficult to capture symbolically.
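The last two categories are easy to see in action (again, just standard PyTorch behavior as a small illustration):

import torch

xs = [torch.zeros(2, 3), torch.zeros(4, 3)]
print(torch.cat(xs).shape)                           # torch.Size([6, 3]): variadic list of inputs
print(torch.unique(torch.tensor([1, 1, 2])).shape)   # torch.Size([2]): shape depends on the data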
You can take a look at the full data set at https://docs.google.com/spreadsheets/d/e/2PACX-1vQQFW0T_bucT5KZn0BHYTC1KYhkL6ZMG5ZxQWc6UmAkHUDYpqkpzXnsb59uv2TB0Jgc1Q6qO63bx6WQ/pubhtml
by Edward Z. Yang at May 06, 2020 03:56 PM
Alex Gaynor recently asked this question in an IRC channel I hang out in (a channel which contains several software engineers nearly as obsessed with software testing as I am): uhh, so I’m writing some code to handle an econnreset… how do I test this? This is a good question! Testing ECONNRESET is one of those fiddly problems that exists at the interface between systems — in his case, with S3, not even a system under his control — that can be infuriatingly tricky to reproduce and test.
Suppose we have some codebase we’re considering applying some patch to, and which has a robust and maintained test suite. Considering the patch, we may ask: is this patch acceptable to apply and deploy? By this we mean to ask if the patch breaks any important functionality, violates any key properties or invariants of the codebase, or would otherwise cause some unacceptable risk or harm. In principle, we can divide all patches into “acceptable” or “unacceptable” relative to some project-specific notion of what we’re willing to allow.
Last week, I wrote about the mindset that computer systems can be understood, and behaviors can be explained, if we’re willing to dig deep enough into the stack of abstractions our software is built atop. Some of the ensuing discussion on Twitter and elsewhere led me to write this followup, in which I want to run through a few classes of systems where I’ve found pursuing in-detail understanding of the system wasn’t the right answer.
Introduction. This post attempts to describe a mindset I’ve come to realize I bring to essentially all of my work with software. I attempt to articulate this mindset, some of its implications and strengths, and some of the ways in which it’s led me astray. Software can be understood. I approach software with a deep-seated belief that computers and software systems can be understood. This belief is, for me, not some abstruse theoretical assertion, but a deeply felt belief that essentially any question I might care to ask (about computers) has a comprehensible answer which is accessible with determined exploration and learning.
At this point in my career, I’ve worked on at least three projects where performance was a defining characteristic: Livegrep, Taktician, and Sorbet (I discussed sorbet in particular last time, and livegrep in an earlier post). I’ve also done a lot of other performance work on the tools I use, some of which ended up on my other blog, Accidentally Quadratic. In this post, I want to reflect on some of the lessons I’ve learned while writing performant software, and working with rather a lot more not-so-performant software.
vmap is an interface popularized by JAX which offers you a vectorizing map. Semantically, a vmap is exactly equivalent to a map in Haskell; the key difference is that operations run under a vmap are vectorized. If you map a convolution and a matrix multiply, you will have one big loop which repeatedly calls convolution and matrix multiply for each entry in your batch. If you vmap a convolution and matrix multiply, you'll call the batched versions of convolution and matrix multiply once. Unless you have a fuser, on most modern deep learning frameworks, calling the batched implementations of these operations will be much faster.
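For instance, here is a small example using jax.vmap and jax.numpy showing the two styles producing the same result (illustrative, not from the original post):

import jax
import jax.numpy as jnp

batch_a = jnp.ones((8, 3, 4))    # a batch of 8 matrices
batch_b = jnp.ones((8, 4, 5))

# The "map" version: one matmul per batch element.
looped = jnp.stack([a @ b for a, b in zip(batch_a, batch_b)])
# The vmap version: a single batched matmul under the hood.
batched = jax.vmap(jnp.matmul)(batch_a, batch_b)

print(batched.shape)                  # (8, 3, 5)
print(jnp.allclose(looped, batched))  # True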
JAX implements vmap in a somewhat complicated fashion; they have a "batched interpreter" which translates operations on primitives into their batched versions, and have to track metadata about what tensors are batched and in what way so that they can insert appropriate broadcasts and unsqueezes. I mentioned this to Simon Peyton Jones, and he immediately asked, couldn't Haskell's typechecker work this out automatically? The answer is, yes! All of the book-keeping JAX has to do is effectively doing runtime type inference; if you have a compiler that can do it for you at compile time, there is nearly nothing to implement.
To give away the punchline, we are going to implement a family of functions vmap that will run these two examples:
example1 :: [Float] -> [Float] -> [Float]
example1 a0 b0 = vmap0_2 (\a b -> add a b) a0 b0

example2 :: [Float] -> [Float] -> [[Float]]
example2 a0 b0 = vmap0 (\a -> vmap1 (\b -> add a b) b0) a0
When run in an interpreter, we will see:
*Test> example1 [1,2,3] [4,6,8]
[5.0,8.0,11.0]
*Test> example2 [1,2,3] [4,6,8]
[[5.0,7.0,9.0],[6.0,8.0,10.0],[7.0,9.0,11.0]]
These results are equivalent to what you would have gotten using a plain old map; however, there will be no loop in the implementation of vmap. (The fact that we can't write a single vmap that works universally is due to a limitation in Haskell; we'll discuss this more later.)
We're going to need a few language extensions, so let's get this out of the way first:
{-# LANGUAGE RankNTypes, GADTs, MultiParamTypeClasses, KindSignatures,
             TypeApplications, FunctionalDependencies, FlexibleContexts,
             FlexibleInstances, UndecidableInstances, IncoherentInstances #-}
Our plan of attack is that we want to write the definitions of vmap so that we infer a type for add which makes the necessary broadcasting clear. A trivial implementation of vmap would have the signature ([a] -> [b]) -> [a] -> [b] (aka the identity function), but the standard list type doesn't let us distinguish between dimensions we should broadcast together, and dimensions we shouldn't (this is the reason example1 and example2 give different results: in example2, we broadcast along each dimension separately, so that we end up with a cartesian product in the end; in example1, we broadcast the dimensions together and get the zippy behavior). Each distinct invocation of vmap should give us a new dimension, which ought not to be mixed up with other invocations of vmap. When you hear this in Haskell, your first instinct should be, "I know, let's use a rank 2 type!" vmap moves us from the non-type-branded world of vanilla lists [Float] to a type-branded world of size-indexed vectors Vec s Float, where the s variables are all skolem variables bound by our rank 2 type:
data Vec s a = Vec { unVec :: [a] }

instance Functor (Vec s) where
  fmap f (Vec xs) = Vec (map f xs)

vmap0 :: (forall s. Vec s a -> Vec s b) -> [a] -> [b]
vmap0 f = unVec . f . Vec
The implementation of vmap0 doesn't do anything: we just wrap the lists into their type-branded equivalent vectors. We can also provide a 2-ary version of vmap0, which takes two lists and assigns them the same type branding all at once:
vmap0_2 :: (forall s. Vec s a -> Vec s b -> Vec s c) -> [a] -> [b] -> [c]
vmap0_2 f a b = unVec (f (Vec a) (Vec b))
(In principle, some sort of applicative-y thing should make it possible to write just a vap (analogous to ap) and then get all of the n-ary versions for free, but in my brief investigation I didn't see a good way of doing this.)
When we nest vmap, it may be the case that the function doesn't directly return a Vec s b, but a functor containing Vec s b. vmap1 handles this case (we'll discuss this more shortly):
vmap1 :: Functor f => (forall s. Vec s a -> f (Vec s b)) -> [a] -> f [b]
vmap1 f = fmap unVec . f . Vec
With our implementations of vmap in hand, we can take a look at our examples and ask Haskell what the type of add ought to be, if we didn't have an implementation of it:
example1 :: [Float] -> [Float] -> [Float]
example1 a0 b0 = vmap0_2 (\a b -> _add a b) a0 b0
Gives:
• Found hole: _add :: Vec s Float -> Vec s Float -> Vec s Float
  Where: ‘s’ is a rigid type variable bound by
    a type expected by the context:
      forall s. Vec s Float -> Vec s Float -> Vec s Float
However:
example2 :: [Float] -> [Float] -> [[Float]]
example2 a0 b0 = vmap0 (\a -> vmap1 (\b -> _add a b) b0) a0
Gives:
• Found hole: _add :: Vec s Float -> Vec s1 Float -> Vec s (Vec s1 Float)
  Where: ‘s1’ is a rigid type variable bound by
    a type expected by the context:
      forall s1. Vec s1 Float -> Vec s (Vec s1 Float)
    at test.hs:41:20-44
  ‘s’ is a rigid type variable bound by
    a type expected by the context:
      forall s. Vec s Float -> Vec s [Float]
    at test.hs:41:7-48
Notice that the inferred types of _add are different in these two cases: in the first example, we infer that we have two tensors batched in the same way, and we want to "zip" them together. In the second example, we see that each tensor has a distinct batch dimension, and we end up with a 2-D result!
At this point, the job of vmap is done: our holes have types which we can use to determine what the necessary behavior is. You could use these types to select an appropriate kernel to perform vectorized addition. But I promised runnable code, so let's implement a simple version of add using old fashioned map.
The good old fashioned way to do type level computation in Haskell is with a type class, of course! Let's define a multi-parameter type class for the function add; unlike the definition of (+) in Num, we'll let the inputs and output all have different types:
class Add a b c | a b -> c where
  add :: a -> b -> c
We can easily implement addition on plain floating point:
instance Add Float Float Float where
  add = (+)
If I pass add two arguments whose outer-most vector agree in their type brand (aka, they came from the same vmap), I should zip them together, as I did in example1. I can write another instance to express this logic:
instance Add a b r => Add (Vec s a) (Vec s b) (Vec s r) where
  add (Vec a) (Vec b) = Vec (zipWith add a b)
Otherwise, I should broadcast one of the dimensions and then do an addition on the inside. This choice can't easily be made locally, so I have to define these two incoherent instances:
instance Add a b r => Add (Vec s a) b (Vec s r) where
  add (Vec a) b = Vec (map (\x -> add x b) a)

instance Add a b r => Add a (Vec s b) (Vec s r) where
  add a (Vec b) = Vec (map (\x -> add a x) b)
(GHC's type class resolution engine doesn't backtrack, so I'm not actually sure how it manages to pick the correct instance to use, but in my testing, I got the right instance no matter what order I specified the arguments to add.)
That's it! Running the two examples:
example1 :: [Float] -> [Float] -> [Float]
example1 a0 b0 = vmap0_2 (\a b -> add a b) a0 b0

example2 :: [Float] -> [Float] -> [[Float]]
example2 a0 b0 = vmap0 (\a -> vmap1 (\b -> add a b) b0) a0
I get:
*Test> example1 [1,2,3] [4,6,8]
[5.0,8.0,11.0]
*Test> example2 [1,2,3] [4,6,8]
[[5.0,7.0,9.0],[6.0,8.0,10.0],[7.0,9.0,11.0]]
So there you have it! vmap in less than a dozen lines of Haskell. One unsatisfactory thing about this implementation is the necessity to define vmap0, vmap1, etc. Can't we just define a generic vmapG :: (forall s. Vec s a -> f (Vec s b)) -> [a] -> f [b] and have f unify with, well, the identity type lambda /\a. a when we need it to have the type of vmap0? Regretfully, type inference with type lambdas is undecidable (the so-called higher-order unification problem), so it seems we have to help GHC out here, even though in our particular case the unification we need is very restricted.
by Edward Z. Yang at January 29, 2020 07:14 PM
This is the second in an indefinite series of posts about things that I think went well in the Sorbet project. The previous one covered our testing approach. Sorbet is fast. A number of our early users commented specifically on how fast it was, and how much they appreciated this speed. Our informal benchmarks on Stripe’s codebase clocked it as typechecking around 100,000 lines of code per second per core, making it one of the fastest production typecheckers we are aware of.
Testing and feedback loops. This post tries to set out one mental model I have for thinking about testing and the purpose testing serves in software engineering, and to explore some of the suggestions of this model. As mentioned in an earlier post, I think a lot about working in long-lived software projects that are undergoing a lot of development and change. The goal when working on these projects is not just to produce a useful artifact at one time, but to maintain and evolve the project over time, optimizing for some combination of the present usefulness of the software, and our ability to continue to evolve and improve it into the future.
In 2017 and 2018, I (along with Paul Tarjan and Dmitry Petrashko) was a founding member of the Sorbet project at Stripe to build a gradual static typechecking system for Ruby, with the aim of enhancing productivity on Stripe’s millions of lines of Ruby, and eventually producing a useful open-source tool. I’m very proud of the work we did (and that others continue to do!) on Sorbet; I think we were very successful, and it was one of the best teams I’ve worked on in a number of ways.
While talking about thinking about tests and testing in software engineering recently, I’ve come to the conclusion that there are (at least) two major ideas and goals that people have when they test or talk about testing. This post aims to outline what I see as these two schools, and explore some reasons engineers coming from these different perspectives can risk talking past each other. Two reasons to test. Testing for correctness. The first school of testing comprises those who see testing as a tool for validating a software artifact against some externally-defined standard of correctness.
With the ongoing move towards “infrastructure-as-code” and similar notions, there’s been an ongoing increase in the number and popularity of declarative configuration management tools. This post attempts to lay out my mental model of the conceptual architecture and internal layering of such tools, and some wishes I have for how they might work differently, based on this model. Background: declarative configuration management. Declarative configuration management refers to the class of tools that allow operators to declare a desired state of some system (be it a physical machine, an EC2 VPC, an entire Google Cloud account, or anything else), and then allow the system to automatically compare that desired state to the present state, and then automatically update the managed system to match the declared state.
Writing a Go/C polyglot. Someone on a Slack I’m on recently raised the question of how you might write a source file that’s both valid C and Go, commenting that it wasn’t immediately obvious if this was even possible. I got nerdsniped, and succeeded in producing one, which you can find here. I’ve been asked how I found that construction, so I thought it might be interesting to document the thought / discovery / exploration process that got me there.
This post is a long form essay version of a talk about PyTorch internals, that I gave at the PyTorch NYC meetup on May 14, 2019.
Hi everyone! Today I want to talk about the internals of PyTorch.
This talk is for those of you who have used PyTorch, and thought to yourself, "It would be great if I could contribute to PyTorch," but were scared by PyTorch's behemoth of a C++ codebase. I'm not going to lie: the PyTorch codebase can be a bit overwhelming at times. The purpose of this talk is to put a map in your hands: to tell you about the basic conceptual structure of a "tensor library that supports automatic differentiation", and give you some tools and tricks for finding your way around the codebase. I'm going to assume that you've written some PyTorch before, but haven't necessarily delved deeper into how a machine learning library is written.
The talk is in two parts: in the first part, I'm going to first introduce you to the conceptual universe of a tensor library. I'll start by talking about the tensor data type you know and love, and give a more detailed discussion about what exactly this data type provides, which will lead us to a better understanding of how it is actually implemented under the hood. If you're an advanced user of PyTorch, you'll be familiar with most of this material. We'll also talk about the trinity of "extension points", layout, device and dtype, which guide how we think about extensions to the tensor class. In the live talk at PyTorch NYC, I skipped the slides about autograd, but I'll talk a little bit about them in these notes as well.
The second part grapples with the nitty-gritty details involved in actually coding in PyTorch. I'll tell you how to cut your way through swaths of autograd code, what code actually matters and what is legacy, and also all of the cool tools that PyTorch gives you for writing kernels.
The tensor is the central data structure in PyTorch. You probably have a pretty good idea about what a tensor intuitively represents: it's an n-dimensional data structure containing some sort of scalar type, e.g., floats, ints, et cetera. We can think of a tensor as consisting of some data, and then some metadata describing the size of the tensor, the type of the elements it contains (dtype), and what device the tensor lives on (CPU memory? CUDA memory?).
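To make this concrete, here's a small Python snippet (my own illustration, not from the talk) that pokes at that metadata:

import torch

t = torch.tensor([[1, 2], [3, 4]], dtype=torch.int32)
print(t.size())    # torch.Size([2, 2])  -- the sizes
print(t.dtype)     # torch.int32         -- the element type
print(t.device)    # cpu                 -- where the data lives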
There's also a little piece of metadata you might be less familiar with: the stride. Strides are actually one of the distinctive features of PyTorch, so it's worth discussing them a little more.
A tensor is a mathematical concept. But to represent it on our computers, we have to define some sort of physical representation for it. The most common representation is to lay out each element of the tensor contiguously in memory (that's where the term contiguous comes from), writing out each row to memory, as you see above. In the example above, I've specified that the tensor contains 32-bit integers, so each integer occupies a physical address, each offset four bytes from the other. To remember what the actual dimensions of the tensor are, we also have to record the sizes as extra metadata.
So, what do strides have to do with this picture?
Suppose that I want to access the element at position tensor[0, 1] in my logical representation. How do I translate this logical position into a location in physical memory? Strides tell me how to do this: to find out where any element of a tensor lives, I multiply each index with the respective stride for that dimension, and sum them all together. In the picture above, I've color coded the first dimension blue and the second dimension red, so you can follow the index and stride in the stride calculation. Doing this sum, I get two (zero-indexed), and indeed, the number three lives two below the beginning of the contiguous array.
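Here's a small Python sketch of that index-times-stride arithmetic (the snippet is mine, not from the slides):

import torch

t = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])                  # contiguous: sizes (2, 3), strides (3, 1)

i, j = 1, 1                                    # some logical position
offset = i * t.stride(0) + j * t.stride(1)     # 1*3 + 1*1 == 4

# Walking four elements into the flat storage recovers the same element, 5.
assert t.flatten()[offset].item() == t[i, j].item() == 5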
(Later in the talk, I'll talk about TensorAccessor, a convenience class that handles the indexing calculation. When you use TensorAccessor, rather than raw pointers, this calculation is handled under the covers for you.)
Strides are the fundamental basis of how we provide views to PyTorch users. For example, suppose that I want to extract out a tensor that represents the second row of the tensor above:
Using advanced indexing support, I can just write tensor[1, :] to get this row. Here's the important thing: when I do this, I don't create a new tensor; instead, I just return a tensor which is a different view on the underlying data. This means that if I, for example, edit the data in that view, it will be reflected in the original tensor. In this case, it's not too hard to see how to do this: three and four live in contiguous memory, and all we need to do is record an offset saying that the data of this (logical) tensor lives two down from the top. (Every tensor records an offset, but most of the time it's zero, and I'll omit it from my diagrams when that's the case.)
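A quick Python illustration of that sharing (my own sketch, mirroring the 2x2 running example):

import torch

base = torch.tensor([[1, 2],
                     [3, 4]])
row = base[1, :]                 # a view: no data is copied

row[0] = 30                      # editing the view...
print(base[1, 0])                # ...shows up in the original tensor: tensor(30)

# The view shares storage with the base, starting two elements in.
assert row.storage_offset() == 2
assert row.data_ptr() == base.data_ptr() + 2 * base.element_size()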
Question from the talk: If I take a view on a tensor, how do I free the memory of the underlying tensor?
Answer: You have to make a copy of the view, thus disconnecting it from the original physical memory. There's really not much else you can do. By the way, if you wrote Java in the old days, taking substrings of strings had a similar problem, because by default no copy was made, so the substring retained a reference to the (possibly very large) original string. Apparently, they fixed this in Java 7u6.
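In Python, that answer looks something like this (a minimal sketch of mine):

import torch

big = torch.arange(1_000_000)
view = big[:10]           # this view keeps the whole million-element storage alive

small = view.clone()      # copy just the ten elements into fresh storage
del big, view             # now nothing references the large buffer and it can be freed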
A more interesting case is if I want to take the first column:
When we look at the physical memory, we see that the elements of the column are not contiguous: there's a gap of one element between each one. Here, strides come to the rescue: instead of specifying a stride of one, we specify a stride of two, saying that between one element and the next, you need to jump two slots. (By the way, this is why it's called a "stride": if we think of an index as walking across the layout, the stride says how many locations we stride forward every time we take a step.)
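For instance (my own snippet, not from the slides):

import torch

base = torch.tensor([[1, 2],
                     [3, 4]])
col = base[:, 0]             # the first column: elements 1 and 3

print(col.stride())          # (2,) -- jump two slots between consecutive elements
print(col.is_contiguous())   # False: there is a gap in physical memory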
The stride representation can actually let you represent all sorts of interesting views on tensors; if you want to play around with the possibilities, check out the Stride Visualizer.
Let's step back for a moment, and think about how we would actually implement this functionality (after all, this is an internals talk). If we can have views on a tensor, this means we have to decouple the notion of the tensor (the user-visible concept that you know and love) from the actual physical data that backs it (called storage):
There may be multiple tensors which share the same storage. Storage defines the dtype and physical size of the tensor, while each tensor records the sizes, strides and offset, defining the logical interpretation of the physical memory.
One thing to realize is that there is always a Tensor-Storage pair, even for "simple" cases where you don't really need a storage (e.g., when you just allocated a contiguous tensor with torch.zeros(2, 2)).
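A quick way to see the sharing from Python (my sketch; the exact storage accessor has shifted names across PyTorch releases):

import torch

x = torch.zeros(2, 2)        # even this "simple" tensor owns a Storage of four elements
y = x[:, 1]                  # a second tensor viewing the same Storage

assert x.storage().data_ptr() == y.storage().data_ptr()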
By the way, we're interested in making this picture not true; instead of having a separate concept of storage, just define a view to be a tensor that is backed by a base tensor. This is a little more complicated, but it has the benefit that contiguous tensors get a much more direct representation without the Storage indirection. A change like this would make PyTorch's internal representation a bit more like Numpy's.
We've talked quite a bit about the data layout of tensor (some might say, if you get the data representation right, everything else falls in place). But it's also worth briefly talking about how operations on the tensor are implemented. At the very most abstract level, when you call torch.mm, two dispatches happen:
The first dispatch is based on the device type and layout of a tensor: e.g., whether or not it is a CPU tensor or a CUDA tensor (and also, e.g., whether or not it is a strided tensor or a sparse one). This is a dynamic dispatch: it's a virtual function call (exactly where that virtual function call occurs will be the subject of the second half of this talk). It should make sense that you need to do a dispatch here: the implementation of CPU matrix multiply is quite different from a CUDA implementation. It is a dynamic dispatch because these kernels may live in separate libraries (e.g., libcaffe2.so versus libcaffe2_gpu.so), and so you have no choice: if you want to get into a library that you don't have a direct dependency on, you have to dynamic dispatch your way there.
The second dispatch is a dispatch on the dtype in question. This dispatch is just a simple switch-statement for whatever dtypes a kernel chooses to support. Upon reflection, it should also make sense that we need to do a dispatch here: the CPU code (or CUDA code, as the case may be) that implements multiplication on float is different from the code for int. It stands to reason you need separate kernels for each dtype.
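Here is a toy Python model of those two levels (purely illustrative; this is not how PyTorch's generated dispatch code actually looks, and the kernel names are made up):

# First level: keyed on (device, layout). In PyTorch this is a virtual call,
# since CPU and CUDA kernels may live in different shared libraries.
# Second level: keyed on dtype. In PyTorch this is a generated switch statement.

def cpu_mm_float(a, b):  return "CPU float32 matmul kernel"
def cpu_mm_int(a, b):    return "CPU int64 matmul kernel"
def cuda_mm_float(a, b): return "CUDA float32 matmul kernel"

DISPATCH = {
    ("cpu",  "strided"): {"float32": cpu_mm_float, "int64": cpu_mm_int},
    ("cuda", "strided"): {"float32": cuda_mm_float},
}

def mm(a, b, *, device, layout, dtype):
    kernel = DISPATCH[(device, layout)][dtype]   # two lookups, two dispatches
    return kernel(a, b)

print(mm(None, None, device="cpu", layout="strided", dtype="int64"))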
This is probably the most important mental picture to have in your head, if you're trying to understand the way operators in PyTorch are invoked. We'll return to this picture when it's time to look more at code.
Since we have been talking about Tensor, I also want to take a little time to talk about the world of tensor extensions. After all, there's more to life than dense, CPU float tensors. There's all sorts of interesting extensions going on, like XLA tensors, or quantized tensors, or MKL-DNN tensors, and one of the things we have to think about, as a tensor library, is how to accommodate these extensions.
Our current model for extensions offers four extension points on tensors. First, there is the trinity of three parameters which uniquely determine what a tensor is: the device, the layout, and the dtype.
If you want to add an extension to PyTorch tensors (by the way, if that's what you want to do, please talk to us! None of these things can be done out-of-tree at the moment), you should think about which of these parameters you would extend. The Cartesian product of these parameters defines all of the possible tensors you can make. Now, not all of these combinations may actually have kernels (who's got kernels for sparse, quantized tensors on FPGA?), but in principle the combination could make sense, and thus we support expressing it, at the very least.
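A rough Python-level illustration of that parameter space (my snippet; the CUDA line is commented out since it needs a GPU):

import torch

# The three coordinates that together pick out a kind of tensor.
a = torch.zeros(2, 2, device="cpu", layout=torch.strided, dtype=torch.float32)
b = torch.zeros(2, 2, layout=torch.sparse_coo, dtype=torch.float64)    # sparse CPU tensor
# c = torch.zeros(2, 2, device="cuda")    # strided CUDA tensor, if a GPU is available

print(a.device, a.layout, a.dtype)    # cpu torch.strided torch.float32
print(b.device, b.layout, b.dtype)    # cpu torch.sparse_coo torch.float64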
There's one last way you can make an "extension" to Tensor functionality, and that's to write a wrapper class around PyTorch tensors that implements your object type. This perhaps sounds obvious, but sometimes people reach for extending one of the three parameters when they should have just made a wrapper class instead. One notable merit of wrapper classes is that they can be developed entirely out of tree.
When should you write a tensor wrapper, versus extending PyTorch itself? The key test is whether or not you need to pass this tensor along during the autograd backwards pass. This test, for example, tells us that sparse tensors should be a true tensor extension, and not just a Python object that contains an indices tensor and a values tensor: when doing optimization on networks involving embeddings, we want the gradient generated by the embedding to be sparse.
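Here's a minimal sketch of the wrapper-class approach; MaskedTensor here is a made-up example type of mine, not a real PyTorch class:

import torch

class MaskedTensor:
    """A made-up wrapper: an ordinary tensor plus a validity mask.
    It lives entirely out of tree and is not itself part of autograd."""

    def __init__(self, data, mask):
        self.data = data
        self.mask = mask

    def sum(self):
        # Delegate to ordinary tensor ops on the wrapped data.
        return torch.where(self.mask, self.data, torch.zeros_like(self.data)).sum()

mt = MaskedTensor(torch.tensor([1.0, 2.0, 3.0]),
                  torch.tensor([True, False, True]))
print(mt.sum())    # tensor(4.)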
Our philosophy on extensions also has an impact on the data layout of the tensor itself. One thing we really want out of our tensor struct is for it to have a fixed layout: we don't want fundamental (and very frequently called) operations like "What's the size of a tensor?" to require virtual dispatches. So when you look at the actual layout of a Tensor (defined in the TensorImpl struct), what we see is a common prefix of fields that we consider all "tensor"-like things to universally have, plus a few fields that are only really applicable for strided tensors but are so important that we've kept them in the main struct, and then a suffix of custom fields that can be done on a per-tensor basis. Sparse tensors, for example, store their indices and values in this suffix.
I told you all about tensors, but if that was the only thing PyTorch provided, we'd basically just be a Numpy clone. The distinguishing characteristic of PyTorch when it was originally released was that it provided automatic differentiation on tensors (these days, we have other cool features like TorchScript; but back then, this was it!)
What does automatic differentiation do? It's the machinery that's responsible for taking a neural network:
...and filling in the missing code that actually computes the gradients of your network:
Take a moment to study this diagram. There's a lot to unpack; here's what to look at:
The whole point of autograd is to do the computation that is described by this diagram, but without actually ever generating this source. PyTorch autograd doesn't do a source-to-source transformation (though PyTorch JIT does know how to do symbolic differentiation).
To do this, we need to store more metadata when we carry out operations on tensors. Let's adjust our picture of the tensor data structure: instead of just a tensor which points to a storage, we now have a variable which wraps this tensor, and also stores more information (AutogradMeta), which is needed for performing autograd when a user calls loss.backward() in their PyTorch script.
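Concretely, from the Python side (my example, not from the slides):

import torch

x = torch.randn(3, requires_grad=True)   # this tensor will carry autograd metadata
y = (x * x).sum()

print(y.grad_fn)    # the graph node recorded during the forward pass
y.backward()        # autograd walks that recorded graph backwards...
print(x.grad)       # ...and deposits dL/dx = 2*x here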
This is yet another slide which will hopefully be out of date in the near future. Will Feng is working on a Variable-Tensor merge in C++, following a simple merge which happened to PyTorch's frontend interface.
We also have to update our picture about dispatch:
Before we dispatch to CPU or CUDA implementations, there is another dispatch on variables, which is responsible for unwrapping variables, calling the underlying implementation (in green), and then rewrapping the results into variables and recording the necessary autograd metadata for backwards.
Some implementations don't unwrap; they just call into other variable implementations. So you might spend a while in the Variable universe. However, once you unwrap and go into the non-Variable Tensor universe, that's it; you never go back to Variable (except by returning from your function.)
In my NY meetup talk, I skipped the following seven slides. I'm also going to delay writeup for them; you'll have to wait for the sequel for some text.
Enough about concepts, let's look at some code.
PyTorch has a lot of folders, and there is a very detailed description of what they are in the CONTRIBUTING document, but really, there are only four directories you really need to know about:
That's a lot of places to look for code; we should probably simplify the directory structure, but that's how it is. If you're trying to work on operators, you'll spend most of your time in aten.
Let's see how this separation of code breaks down in practice:
When you call a function like torch.add, what actually happens? If you remember the discussion we had about dispatching, you already have the basic picture in your head:
Each of these steps corresponds concretely to some code. Let's cut our way through the jungle.
Our initial landing point in the C++ code is the C implementation of a Python function, which we've exposed to the Python side as something like torch._C.VariableFunctions.add. THPVariable_add is one such implementation.
One important thing to know about this code is that it is auto-generated. If you search in the GitHub repository, you won't find it, because you have to actually build PyTorch to see it. Another important thing is that you don't have to deeply understand what this code is doing; the idea is to skim over it and get a sense for what it does. Above, I've annotated some of the most important bits in blue: you can see that there is a use of a class PythonArgParser to pull C++ objects out of the Python args and kwargs; we then call a dispatch_add function (which I've inlined in red); this releases the global interpreter lock and then calls a plain old method on the C++ Tensor self. On its way back, we rewrap the returned Tensor back into a PyObject.
(At this point, there's an error in the slides: I'm supposed to tell you about the Variable dispatch code. I haven't fixed it here yet. Some magic happens, then...)
When we call the add method on the Tensor class, no virtual dispatch happens yet. Instead, we have an inline method which calls a virtual method on a "Type" object. This method is the actual virtual method (this is why I say Type is just a "gadget" that gets you dynamic dispatch.) In the particular case of this example, this virtual call dispatches to an implementation of add on a class named TypeDefault. This happens to be because we have an implementation of add that is the same for every device type (both CPU and CUDA); if we had happened to have different implementations, we might have instead landed on something like CPUFloatType::add. It is this implementation of the virtual method that finally gets us to the actual kernel code.
Hopefully, this slide will be out-of-date very soon too; Roy Li is working on replacing Type dispatch with another mechanism which will help us better support PyTorch on mobile.
It's worth reemphasizing that all of the code, until we got to the kernel, is automatically generated.
It's a bit twisty and turny, so once you have some basic orientation about what's going on, I recommend just jumping straight to the kernels.
PyTorch offers a lot of useful tools for prospective kernel writers. In this section, we'll walk through a few of them. But first of all, what do you need to write a kernel?
We generally think of a kernel in PyTorch consisting of the following parts:
In the subsequent slides, we'll walk through some of the tools PyTorch has for helping you implement these steps.
To take advantage of all of the code generation which PyTorch brings, you need to write a schema for your operator. The schema gives a mypy-esque type of your function, and also controls whether or not we generate bindings for methods or functions on Tensor. You also tell the schema what implementations of your operator should be called for given device-layout combinations. Check out the README in native for more information about this format.
You also may need to define a derivative for your operation in derivatives.yaml.
Error checking can be done by way of either a low level or a high level API. The low level API is just a macro, TORCH_CHECK, which takes a boolean, and then any number of arguments to make up the error string to render if the boolean is not true. One nice thing about this macro is that you can intermix strings with non-string data; everything is formatted using their implementation of operator<<, and most important data types in PyTorch have operator<< implementations.
The high level API saves you from having to write up repetitive error messages over and over again. The way it works is you first wrap each Tensor into a TensorArg, which contains information about where the tensor came from (e.g., its argument name). It then provides a number of pre-canned functions for checking various properties; e.g., checkDim() tests if the tensor's dimensionality is a fixed number. If it's not, the function provides a user-friendly error message based on the TensorArg metadata.
One important thing to be aware of when writing operators in PyTorch is that you are often signing up to write three operators: abs_out, which operates on a preallocated output (this implements the out= keyword argument), abs_, which operates in place, and abs, which is the plain old functional version of an operator.
Most of the time, abs_out is the real workhorse, and abs and abs_ are just thin wrappers around abs_out; but sometimes writing specialized implementations for each case is warranted.
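From the Python side, the three variants look like this (usage sketch, mine):

import torch

x = torch.tensor([-1.0, 2.0, -3.0])

y = torch.abs(x)          # functional: allocates and returns a fresh tensor
x.abs_()                  # in-place: overwrites x itself

out = torch.empty(3)
torch.abs(x, out=out)     # out= variant: writes into preallocated storage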
To do dtype dispatch, you should use the AT_DISPATCH_ALL_TYPES macro. This takes in the dtype of the tensor you want to dispatch over, and a lambda which will be specialized for each dtype that is dispatchable from the macro. Usually, this lambda just calls a templated helper function.
This macro doesn't just "do dispatch", it also decides what dtypes your kernel will support. As such, there are actually quite a few versions of this macro, which let you pick different subsets of dtypes to generate specializations for. Most of the time, you'll just want AT_DISPATCH_ALL_TYPES, but keep an eye out for situations when you might want to dispatch to some more types. There's guidance in Dispatch.h for how to select the correct one for your use-case.
On CPU, you frequently want to parallelize your code. In the past, this was usually done by directly sprinkling OpenMP pragmas in your code.
At some point, we have to actually access the data. PyTorch offers quite a few options for doing this.
A lot of kernels in PyTorch are still written in the legacy TH style. (By the way, TH stands for TorcH. It's a pretty nice acronym, but unfortunately it is a bit poisoned; if you see TH in the name, assume that it's legacy.) What do I mean by the legacy TH style?
This code is pretty crazy, and we hate reviewing it, so please don't add to it. One of the more useful tasks that you can do, if you like to code but don't know too much about kernel writing, is to port some of these TH functions to ATen.
To wrap up, I want to talk a little bit about working efficiently on PyTorch. If the largeness of PyTorch's C++ codebase is the first gatekeeper that stops people from contributing to PyTorch, the efficiency of your workflow is the second gatekeeper. If you try to work on C++ with Python habits, you will have a bad time: it will take forever to recompile PyTorch, and it will take you forever to tell if your changes worked or not.
How to work efficiently could probably be a talk in and of itself, but this slide calls out some of the most common anti-patterns I've seen when someone complains: "It's hard to work on PyTorch."
So that's it for a whirlwind tour of PyTorch's internals! Many, many things have been omitted; but hopefully the descriptions and explanations here can help you get a grip on at least a substantial portion of the codebase.
Where should you go from here? What kinds of contributions can you make? A good place to start is our issue tracker. Starting earlier this year, we have been triaging issues; issues labeled triaged mean that at least one PyTorch developer has looked at them and made an initial assessment. You can use these labels to find out what issues we think are high priority, look up issues specific to some module (e.g., autograd), or find issues which we think are small (word of warning: we're sometimes wrong!)
Even if you don't want to get started with coding right away, there are many other useful activities like improving documentation (I love merging documentation PRs, they are so great), helping us reproduce bug reports from other users, and also just helping us discuss RFCs on the issue tracker. PyTorch would not be where it is today without our open source contributors; we hope you can join us too!
by Edward Z. Yang at May 17, 2019 02:11 AM
Some notes collected from a close read of Conal Elliot's Compiling to Categories and The Simple Essence of Automatic Differentiation.
A colleague of mine was trying to define a "tree structure" of tensors, with the hope of thereby generalizing the concept to also work with tensors that have "ragged dimensions." Let's take a look:
Suppose we have a (2, 3) matrix:
tensor([[1, 2, 3],
        [4, 5, 6]])
One way to think about this is that we have a "tree" of some sort, where the root of the tree branches to two subnodes, and then each subnode branches to three nodes:
        /- ROOT -\
    ROW 1        ROW 2
   /  |  \      /  |  \
  1   2   3    4   5   6
Suppose you wanted to define this data structure in Haskell. One obvious way of going about doing this is to just say that a matrix is just a bunch of nested lists, [[Float]]. This works, true, but it isn't very illuminating, and it is certainly not type safe. Type safety could be achieved with sized vectors, but we are still left wondering, "what does it mean?"
Often, inductive definitions fall out of how we compose things together, in the same way that the inductive data structure for a programming language tells us how we take smaller programs and put them together to form a larger program. With matrices, we can think of a pictorial way of composing them, by either attaching matrices together vertically or horizontally. That gives us this vocabulary for putting together matrices, which would let us (non-uniquely) represent every matrix (Compiling to Categories, Section 8):
data Matrix = Scalar Float | Horizontal Matrix Matrix | Vertical Matrix Matrix
But what does it mean? Well, every matrix represents a linear map (if A : (n, m) is your matrix, the linear map is the function R^m -> R^n, defined to be f(x) = A x). We'll call a linear map from a to b, Linear a b. So the question we ask now is, what does it mean to "paste" two matrices together? It's a way of composing two linear maps together into a new linear map:
-- A function definition does not a category make!  You have to
-- prove that the resulting functions are linear.

horizontal :: Linear a c -> Linear b c -> Linear (a, b) c
horizontal f g = \(a, b) -> f a + g b

-- In matrix form:
--
--              [ a ]
--   [ F | G ]  [ - ]  =  [ F a + G b ]
--              [ b ]

vertical :: Linear a c -> Linear a d -> Linear a (c, d)
vertical f g = \a -> (f a, g a)

-- In matrix form:
--
--   [ F ]            [ F a ]
--   [ - ]  [ a ]  =  [ --- ]
--   [ G ]            [ G a ]
Now we're cooking! Notice that the pasting shows up in the type of the linear map: if we paste horizontally, that just means that the vectors this linear map takes in have to be pasted together (with the tuple constructor); similarly, if we paste vertically, we'll produce output vectors that are the pasted results.
Cool, so we can add some type indexes, and write Linear as a GADT to refine the indices when you apply the constructor:
data Linear a b where
  Scalar :: Float -> Linear Float Float
  Horizontal :: Linear a c -> Linear b c -> Linear (a, b) c
  Vertical :: Linear a c -> Linear a d -> Linear a (c, d)
Is this the end of the story? Not quite. There are many ways you can go about combining linear maps; for example, you could (literally) compose two linear maps together (in the same sense of function composition). It's true that you can paste together any matrix you like with the data type above; how do we decide what should and shouldn't go in our language of linear maps?
To this end, Conal Elliot calls on the language of category theory to adjudicate. A category should define identity and function composition:
identity :: Linear a a
identity a = a
-- In matrix form: the identity matrix

compose :: Linear b c -> Linear a b -> Linear a c
compose g f = \a -> g (f a)
-- In matrix form: matrix multiply
We find that Horizontal and Vertical are the elimination and introduction operations of cocartesian and cartesian categories (respectively).
But should we just slap Identity and Compose constructors onto our data type? Linear map composition is a computationally interesting operation: if we just keep it around as syntax (rather than doing what is, morally, a matrix multiply), then it will be quite expensive to do operations on the final linear map. Where do representable functors come in? I'm not exactly sure how to explain this, and I've run out of time for this post; stay tuned for a follow up.
by Edward Z. Yang at May 15, 2019 09:42 PM
tl;dr In writer-priority reader/writer locks, as soon as a single writer enters the acquisition queue, all future accesses block behind any in-flight reads. If any readers hold the lock for extended periods of time, this can lead to extreme pauses and loss of throughput given even a very small number of writers. Abstract: This post describes a phenomenon that can occur in systems built on reader/writer locks, where slow readers and a small number of writers (e.
Over the last few years — perhaps not that unusually among the nerds I know — I’ve become increasingly fascinated by the Apollo program (and early space program more generally), and been reading my way through a growing number of books and documentaries written about it. At a party this weekend I got asked for my list of Apollo book recommendations, so I decided to write them in a form I can easily share and refer to, in case it’s of interest to anyone else.
Long time readers of mine may be aware that I used a ThinkPad X61T for the past decade. After the hinge broke on my second instance of the machine, I decided it was finally time to get a new laptop. And I had my eye on one particular model, after Simon Peyton Jones showed me his new laptop at the last Haskell Implementor's Workshop: the Microsoft Surface Book 2. It fits my primary requirement for a laptop: it's a laptop that converts into tablet mode with a digitizer pen. The pen is not Wacom branded, but it has an eraser end and can magnetically attach to the laptop (no enclosure for the pen, but I think that for modern hardware that constraint is unsatisfiable). Furthermore, there is a Linux enthusiast community around the device, which made me feel that it would be more likely I could get Linux to work. So a few weeks ago, I took the plunge, and laid down three grand for my own copy. It has worked out well, but in the classic Linux style, not without a little bit of elbow grease.
The good:
The bad:
I did a stock install of the latest Ubuntu LTS (18.04) dual boot with Windows (1TB hard drive helps!), and then installed jakeday's custom Linux kernel and drivers. Some notes about the process:
I spent a while scratching my head as to why I couldn't install Linux dual-boot. Some Googling suggested that the problem was that Windows hadn't really shutdown; it had just hibernated (for quick startup). I didn't manage to disable this, so I just resized the Windows partition from inside Windows and then installed Linux on that partition.
Don't forget to allocate a dedicated swap partition for Linux; you won't be able to hibernate without it.
The Surface Book 2 has secure boot enabled. You must follow the instructions in SIGNING.md to get signed kernels.
One consequence of generating signed kernels is that if you have both the unsigned and signed kernels installed, update-initramfs -u will update the initrd for your unsigned kernel, meaning that you won't see your changes unless you copy the initrd over! This flummoxed me a lot about the next step...
If you want to use the NVIDIA drivers for your shiny NVIDIA GPU, you need to blacklist nouveau. There are plenty of instructions on the internet, but I can personally vouch for remingtonlang's instructions. Make sure you are updating the correct initrd; see my bullet point above. Once this was fixed, a standard invocation of the CUDA installer got me a working nvidia-smi. Note that I manually signed the NVIDIA drivers using the instructions here, since I had already generated a private key and it seemed silly to generate another one just because NVIDIA's installer asked me to.
Once you install the NVIDIA drivers, you have to be careful about the opposite problem: Xorg deciding it wants to do all its rendering on the NVIDIA card! The usual symptom when this occurs is that your mouse input to Linux is very laggy. If you have working nvidia-smi, you can also tell because Xorg will be a running process on your GPU. In any case, this is bad: you do NOT want to use the dGPU for plain old desktop rendering; you want the integrated one. I found that uncommenting the sample Intel config in /etc/X11/xorg.conf.d fixes the problem:
Section "Device" Identifier "Intel Graphics" Driver "intel" EndSection
But this doesn't play too nicely with VMWare; more on this below.
Sound did not work (it was too soft, or the right speaker wasn't working) until I upgraded to Linux 5.0.1.
After enabling XInput on my fork of Xournal, it did not start working until I restarted Xournal. Eraser worked right out of the box.
Don't forget to make a swap partition (Ubuntu default installer didn't prompt me to make one, probably because I was installing as dual-boot); otherwise, hibernate will not work.
Sometimes, when waking up from hibernate, networking doesn't work. Mercifully, this can be fixed by manually reloading the WiFi kernel module: modprobe mwifiex_pcie and systemctl restart NetworkManager.service. More discussion on this issue.
Sometimes, when waking up from hibernate/suspend, I get a big thermometer icon. When I reboot again it goes away but I have lost my hibernate/suspend. Perplexing! I don't know why this happens.
The sad truth of life is that the Windows tablet experience is much better than the Linux experience--to the point where many would just install Windows and then boot Linux from a virtual machine (or Windows Subsystem for Linux). This was a non-starter for me: a bare metal boot of Linux was necessary to get the best pen input experience. However, why not also make it possible to boot the Linux partition from VMWare running on Windows? This setup is explicitly supported by VMWare, but it took a few days of fiddling to get it to actually work.
I haven't gotten around to setting up xmonad; this is in no small part due to the fact that Unity appears to support a very rudimentary form of tiling: Windows-left and Windows-right will move a window to fill the left/right half of the display, while Windows-up will full-screen a window. I might still try to get xmonad set up on 18.04, but for now it is nice not having to fight with trayer to get the standard icons.
My two top priorities for improving the state of Linux on the Surface Book 2:
Otherwise, I am very happy with this new laptop. One thing in particular is how much faster my mail client (still sup) runs; previously, scanning for new mail would be a crawl, but on this laptop the messages stream in like a flash. Just goes to show how much of an upgrade going from a 1.6GHz processor to a 4.2GHz processor is :3
by Edward Z. Yang at March 18, 2019 01:59 AM
This is a list of reasons why I think Python is a terrible programming language. Naturally, none of them apply to other programming languages.
It's impossible to determine at compile time whether a Python program can terminate.
While strings in Python 3 are significantly better than strings in Python 2, they consist of a series of "code points" instead of characters, so there's no way to reference the third character of a Python string. For ASCII strings, this works fine in C, so I prefer writing in C. Python claims to support Unicode but can't even get this right.
If I compile a Python module on Linux, it doesn't work on a Mac. This is a shocking oversight for a so-called "cross-platform" language.
The os.link function does not let you create a hard link to a directory.
The standard library sorting function takes at least O(n log n) time. I understand dynamic languages are slow, but why is it this slow?
The cryptography module claims to implement public-key cryptography, but mathematicians have so far been unable to prove that one-way functions, a prerequisite of public-key cryptography, even exist. It's surprising that the Python ecosystem is of such low quality and holds to dubious scientific standards.
When you compile NumPy or SciPy from source, you need to build FORTRAN code, and FORTRAN sucks, which is clearly Python's fault.
If you're running two Python programs from different users on the same machine, a speculative load in one program might be able to learn information from the other program based on cache timing side channels.
Python is unable to reverse entropy.
by Geoffrey Thomas at December 18, 2018 12:00 AM
My name is Rahul Muttineni, CTO of TypeLead, working on building services around a language named Eta. To get started, I'll give an overview of how the project started, and where it is now.
It started as a HSOC project. It was called GHCVM; back then we had plans of making it both on JVM and CLR... we don't think about CLR anymore. I was mentored by Edward Kmett. We got pretty good response on this, so Jo and I decided to take the risk and work on this full time.
Big thanks to the GHC team, really good work. We've worked with the codebase for two years, and the more and more we work with it, we see how much awesome stuff there is. I've learned a lot by working with the code.
What is Eta? Eta is a fork of GHC. During the GSOC project, it started off as a Haskell program that used the GHC API. Midway in the program, I found that there were certain things that I wanted to do that I couldn't do, and I spent 3-4 days setting up a fork. I'll talk about what those limitations are. Like Haskell, it's a ... language, but the key difference is that it runs on the JVM. That has its own set of challenges, primarily with respect to tail calls. The nice thing about Eta is that it runs on the JVM, and it can run a good chunk of projects just like that. lens... recently, in the last month, we got Yesod working... it's in good shape. The next really great feature of Eta is the strongly typed FFI. That works really well with the subtyping in the JVM. A good chunk of the talk is about how we got that working. One of the main focuses of Eta is industrial use. GHC is focused on both industrial use and research. There's a tension between the two... the nice thing we have for Eta is we don't have to face that tension; it's easy to make decisions on how to add new features, because we ask: will it help companies? If yes, we do it; otherwise we don't. (SPJ: That can be a hard question to answer!)
Haskell: Avoid success at all costs. We're not going to sacrifice core principles of language for benefit. Pursue success, at minimum cost. We want to make it successful as much as possible, but we want to make as little sacrifice as possible. That will be a little tricky...
What is Eta? What language features does it support? It started off as a fork of GHC 7.10.3. All extensions that work there work with Eta as well. The only thing was TemplateHaskell and QuasiQuotes didn't work for a long time. We got them working 3-4 months ago. The biggest change is the Java FFI; it's GHC 7.10.3 minus the C FFI. We could have supported it: Java has JNI, but we tried to avoid it because we didn't want to make platform-specific bindings to all the libraries.
Jo backported a bunch of GHC 8 features: StrictData, ApplicativeDo, OverloadedLabels. Backpack was added recently. There's a very particular reason we had to do it: it has to do with the fact that we don't have green threads by default, and we wanted to give the user a choice of threaded runtime versus blocking runtime.
The compiler? It's a fork of GHC, so all the compiler passes are the same. We just chopped off everything after STG; e.g., C-- is gone. We generate bytecode from STG. We don't do any optimizations right now, and won't need to for some time. We don't have to because on the JVM it's JIT compiled, so we don't have to optimize as much, since the JVM will remove a lot of the code that's not used anyway. And the driver: GHC generates object files... we decided to use JAR files. They're just zip files that package up a bunch of class files that store Java bytecodes. We also added one more mode for uberjars. These are JAR files that are packaged up into one giant package.
I'll talk a little bit about how we implemented the REPL and Template Haskell. They work through the external-interpreter architecture. In GHC that's called iserv: what that process does is handle running the code. So the compiler will still do the typechecking and everything, but once it's done with all that stuff, GHC will generate a specific bytecode set for interpreting Haskell efficiently. Because we already generate JVM bytecodes, we didn't need that custom bytecode set; we just compile with optimizations off; that gives us JVM bytecodes, then we send them to the external process, load them up, and execute them. Implementing the REPL was pretty easy once we had all this working together. The JVM has a mechanism called classloading, which is very flexible. You can download bytecodes from the network, and get code at runtime. Once you load the class, it's statically compiled code, it's optimized the same, etc.
The build tool we use is Etlas. We didn't want to move too far off of GHC, so we stuck with Cabal. At the point we started using it, we forked off of Cabal 2.0. The main difference is that it lets you manage Eta versions. Etlas is almost like Stack, but it's much, much closer to Cabal. We took the nice features of Stack and added them to Cabal. The other thing is that it does patch management. What we've been finding as we add more features and backport is that Eta is not exactly GHC 7.10, nor is it GHC 8.0; it's a weird intermediate state, so certain packages won't compile without small changes, and we needed some system to apply those changes before we actually run the build. So we set up a GitHub repository that stores all the patch files. What etlas will do is get you the most recent set of patches. Then if you install a package, lens or something, it will download lens, apply the patch, and then it will build. Just recently, we were using base 4.8, and we upgraded to base 4.11. But we couldn't update to the new Generics infrastructure, because it slowed down compile times. So there were a bunch of packages that would check if they were on GHC 8... and then use the new generics. So we had to do a bunch of patching for that. But that's the kind of stuff we have to deal with.
The title of this talk is "Let's go mainstream with Eta". I want to take a moment and ask: what does that mean? "The ideas, attitudes, or activities that are shared by most people and regarded as normal or conventional." So at what point does a programming language become considered normal or conventional? It has to be used by a big company, solve a big real world problem, and people have to believe it works. That's a very complicated, multifaceted question; one part of the answer is, it should make it easier to solve real world problems than the status quo. Take for example PHP. PHP came out when there was nothing better to program dynamic web applications. It had just the minimum features required to make it useful for building these. Now everyone here is asking the question: Haskell clearly solves a lot of problems better than the status quo. So why isn't it moving forward? That's a big question; I'm going to talk about how we're approaching it.
The strategy we're using internally is we put on a "Big Company Hat": we pretend we're a big company with a lot of employees and millions or billions of lines of code, and try to figure out what problems they'll face. Some problems are crazy long build times when trying to build huge software; dynamics where you have to make sure junior developers get up to speed... etc. That's a couple of examples to get this conversation started.
After thinking about this a long time, we boiled it down to three basic principles, how we will develop Eta.
1. User Experience
2. Performance
3. Safety
User Experience is mainly an emotional thing: how you feel when you use Eta technology, how you interact with it, what you feel when you get an error, psychologically. When something has good UX, we feel good. That's a very subjective thing; it can vary between different people, so we have to figure out a way to standardize it / make it universal. Something we forget as software and tool developers is that the person developing the software is human. If they get errors persistently over time, they'll get frustrated. Machines will do what you tell them over and over again.
So what have we done in Eta to address this? We've done something very recently; it's not in master yet. Jo and I spent a week refactoring the error reporting part of the typechecker. Internally in GHC, errors are stored as a list of pretty-printed documents, essentially a list of strings. The problem is we can't do postprocessing on that. So what Jo did was make a giant data type with three hundred data constructors, one for every error message in GHC. That refactor took a week (SPJ: only a week?!). Now it's decoupled: instead of storing strings in the typechecking monad, you store a data type that holds the relevant data to print out that error message. And then at the final point, you can traverse the data type; based on the presence of other errors, you can decide what to do. Now it's pattern matching on certain error patterns and reporting them nicely. This is one example. We talked about simple errors: refactoring, adding an argument, changing the type; that's one of the most common errors you'll get working with Haskell. So we focused on those first. This shows an example of a type error... 'checker', it's an IO action.
GHC would tell you: couldn't match Int -> IO () with IO (). The problem is, for people who don't know how the typechecker works, they won't be able to understand what the typechecker is doing: going argument by argument. Because of the refactor we've done, it was easy to pattern match on this particular case and say: hey, if the user forgot to put an argument, you can print out an error message of this form. You print that an argument is missing, and you highlight it. (SM: You might have been missing the first argument, in this case!) That's true. It's tricky; sometimes the suggestion you give might not be right. We don't tell people what they did exactly wrong, because we don't know. This is not a perfect thing, but we try to give the best suggestion that we can. An important input into how we decided this layout is that we studied languages like Elm and PureScript, which have done good work in this area. What PureScript and Elm both do, for a certain type of error where you're not sure what to suggest (e.g., our info is not complete), is let the user go to a particular link and see other things that could have happened. So we don't have to flood the user with every suggestion, we just have to show the user what probably is the cause. And if it's a tricky case, not what we posted, the link will cover that case as well.
(BG: There's other information that might be relevant; expanding type synonyms, etc. Do you have this info?) We're still thinking about that. Probably we'll have extra flags and stuff. Eventually, we'll have a mode that prints out JSON for IDEs, so it's easier to parse on the IDE side. (BG: Incidentally, there's a ticket about this; a student working with Richard is trying to figure out something similar.)
Another aspect of UX is the REPL we added. We tried to simplify the entry point, to make it easy. You want types, kinds, and where to find out more information. This is a statically typed language: you always have to be thinking about types. So we :set +t: always print out the types when you print things. One more thing: one of the former Scala engineers, who has been learning Haskell, made a critique of one aspect of the REPL experience. f is a function of two arguments. In a second statement of the REPL, I applied 1. Find instance, show instance, for a goes to a. He said that... instead of "no show instance found", just say that this is a function, and you can't print it. That's a change we made. This was very easy for us to do.
Performance: it can mean many things. We're talking about a fast developer feedback loop: compile time and develop time, reducing that feedback loop. Some work we've done in this direction is reproducible builds. As of now, we have bit-for-bit reproducibility in Eta. That amounted to... Nikita already did lots of work on reproducibility; he made Haskell interface files reproducible; but the last mile of bit-for-bit is hard, there are many places where nondeterminism creeps in. For our code generator, it was a lot simpler, we didn't have to do as much. It was 20 lines of code to make it deterministic. The main source of nondeterminism in GHC is the Unique data type, which changes between different runs depending on the environment. What we did was add a counter. We used to print the uniques in the Java class name; that would make it nondeterministic. So we made a counter: the order in which the bindings make it to STG is the same.
GHCi is known to take up lots of memory, esp. with IDE. Simon Marlow has a bunch of fixes to that; we also backported those.
Another aspect of performance is the actual runtime performance. We're on the JVM; that puts us at a huge disadvantage. We don't have control over many things. The runtime system... this is Java. It's OO, so the runtime system is implemented in Java. We set up a hierarchy for values that are defined in Eta. We have Closure; it's a class, the parent class of all values and thunks. The Closure class has two methods: evaluate, which evaluates to WHNF, and enter, which will actually enter... it's similar to GHC's runtime system. The initial version was modeled exactly after GHC, except for tail calls. The terminology is similar. It's primarily used when you do the body of a function. The main subclasses of Closure are Thunk and Value. Value will be the parent class of things like functions, partially applied functions, and data constructors. Thunk will be the superclass of things like CAFs, single entry thunks, and updatable thunks. CAFs don't have free variables, so there's a special case for that, and you create a blackholing entry every time, to avoid two threads evaluating the same thunk. UpdatableThunk pushes an update frame, and when it's finished evaluating, it will update the thunk to point to the newly computed value. And SingleEntryThunks are evaluated only once, so you can just evaluate them directly without pushing an update frame. This terminology is borrowed from GHC as well.
Values: DataCon, Function and PAPs. In the early days, and even now, every function call that was a tail call is just a method call. This is the only way to make it remotely efficient. (More on the stack soon.) Static tail recursive calls, singly recursive or mutually recursive, get compiled to loops. In most cases, they get a nice tight loop. In the mutual case, what will happen is we collect all of the SCC, and we make one giant method that goes into a loop. Say you're in the even/odd example: when even calls odd, there's a variable called target, an integer. Even is assigned 0, odd is assigned 1, so then you set 1 and restart. (BG: Do you always have unfoldings available for functions you compiled?) This is mutually recursive functions defined in the same module. (SPJ: They might have very different type arguments.) We concatenate all the arguments into one. The main problem with this approach is parsers generated with Happy and Alex; we hit limits. (BG: Crash?) Not stack blowup. The JVM has a method size limit, so you can only have 65000 bytecodes. That's Eta compiled with itself. That's the only thing that's preventing us from compiling Eta with Eta. But all you need to do is split the method into smaller chunks.
So how do we handle tail calls? That's when we know a call is tail recursive; but let's say you don't know. Let's say you're using CPS. It's so common in Haskell; any fast parser uses CPS. In the early days, Aeson would just blow the stack; it was pretty bad. So we explored trampolining by default, and it was just awful, it was slow, super slow. What we did is turn it off, and let the stack blow up. Then we found a better solution. On the JVM, the only way to unwind the stack is throwing an exception, or returning, and keep on returning until you return all the way down. It turns out, with exceptions, you can turn off the feature that captures the stack trace: that's the most expensive part of an exception. So we have a general exception. This trampoline mechanism is optional. What we do is have a function 'trampoline :: a -> a', a runtime primitive; what it does is activate a boolean in the context which says, I'm going to trampoline now, and it activates a codepath that bumps a counter, and once you reach a certain number, which is configurable, it will unwind the stack and then continue where it needed to go. Our limit is 400, and then we unwind. It used to be in the 1000s, but with Happy and Alex, we needed a smaller number. (BG: Inside that context, how much does it cost? But observably, it's faster. A couple months ago, we got PureScript to work in Eta, and it wasn't bad by default?) (SPJ: So you could turn it on by default: all you're doing is counting.) The counting is how we know how big the stack is. In your main function, you could call trampolineIO, and trampoline your whole program. (SPJ: Maybe it's low overhead, and you can do it all the time.) If it's low, we will do it. (How do you resume? Once you raise the exception, what do you store?) The counter happens at the entry point, and it's guarded by the boolean. So, if the limit is exceeded, it will call another function that takes the context. We store all the arguments in a context variable that gets passed to every Eta function. We stash all the arguments into a function that has the state, then when it unwinds, marked by this function, it will call that function with those arguments.
As I mentioned, it's guarded by a boolean. JVM has an optimization, where it observes the boolean is true for a lot of times, it won't even compile that branch in the native code. So if you don't use trampolining, it doesn't affect you at all; the code for the counter will just not be there.
One nice thing I like about Eta is that you actually get stack traces for exceptions. This is because, to get good performance for Eta, you have to implement most primitives on the JVM stack. This is a sample stack. You have a schedule loop, and you are evaluating some IO action. applyV/applyN, these are partial applications. Execute an IO action. And another nice part: we've tried to encode names close to the original names. So you can tell this function call happened in statistics.Regression, rnfAll. If you look, you'll notice there are line numbers. This is not perfect, and we can definitely make it better later... GHC gives you a lot of debugging info at STG time, but because the JVM doesn't have much flexibility, we can only attach one line number to the code, so we have to discard all that info. This will get better; we'll stash that debug information in the classfile itself, and then access it and render a better stack trace. (BG: This is Source Notes?) Yeah.
Concurrency: One nice part is, it's nice or not. If you're evaluating a long chain of thunks, you're going to blow the stack. This happily coincides with GHC also having a space leak. Neil Mitchell wrote a blog post about how to detect space leaks: restrict stack size and then figure out which thunk was being evaluated. If you see a stack trace like this, and you see a huge chain of evaluates, in a long chain, you probably have a space leak.
How do I do interop? The way we did interop was we made a thing called the Java monad. It's supposed to give you the experience of programming Java. The basic implementation is inspired by the IO monad. Object# c is "this", the object that is being threaded through. Because of this encoding, you get the Java experience: you can call dot on the Java object. It's almost like working with Java inside. The argument is called... that's the type constructor that forced us to fork, instead of using the API. You can't declare primitive types in the API. And we had to introduce a new low level representation. Declare wrapper types, wrapping the Iterable interface in Java. We've stolen better syntax, which was type applications... resolve it somehow. I'm declaring an Eta type that wraps a Java type, @java.lang.Iterable.
You use the java function to run the Java monad. All of these have to be imported. newArrayList, newInteger, but we brought in some combinators that let you call methods. It worked out with the monad. This is sample code that does the same thing as the Java code; it just uses standard monadic combinators. If it's a fixed c, it's an instance.
You can use Eta as a better Java, with referential transparency! Unlike Kotlin or Scala.
How do we handle subtyping? We define built-in type families. We have a typeclass named Extends. Any time you declare a function that takes a given class and any subtype of that class, you can, instead of actually subtyping, do it with constraints. Extends' takes the info from Inherits and figures it out. You can use the dot operator on anything that is a subclass of Iterator. We had to extend the typechecker just a little bit: a lot of times the type gets stuck in the form Extends' (List JString) (List a) where a is unconstrained.
Imports are tiresome, so we're setting up direct Java interop: actually use Java reflection to get info from class files, and generate imports. "import java java.lang.Math" works, but doesn't scale. The biggest priority for the rest of the year is Java interop, really good IDE support, documentation, and language extensions: UnboxedSums, TypeApplications, DerivingVia, QuantifiedConstraints. We have some new language extensions in mind, AnonymousRecords, RowTypePolymorphism... We'll see how that goes.
I was thinking about ways... we work on the same codebase, so how do we collaborate? We're interested in compile performance, and support for unboxed sums. Worker wrapper has some glitch, and no one has gotten around to fixing it. At some point, maybe not any time soon, that and mutable fields. Pretty important for us. (BG: Do unboxed sums get a lot of usage? Why unboxed sums? Does Eta code make a lot of use?) No. But a lot of people on the JVM are annoyed that Maybe is boxed all the time. But if you have unboxed sums, you can represent it as null. (SPJ: Or you can say, just box it, and you won't notice it. If it's fast enough all the time, focus on what's going to make a difference.)
Q: Did you consider using Graal (it's a new virtual machine that supports partial evaluation and partial escape analysis, good for functional languages)?
A: We have looked into it; it's not completely there yet to use, and we're not sure if it's something we can invest time in. We're keeping up with it. (BG: But you lose the JVM!) That's what's preventing us from going there. Maybe if it gets integrated into the mainline VM we might look at it. (Mainline Java is planning to integrate Graal.)
Q: (SM) Are you keeping the fork up to date with master GHC?
A: One thing that is out of bounds for us, and for a long time, is all the dependent Haskell work. Everything else, we keep up. If there's any nice bugfixes... (SM: So you're selectively backporting).
Q: (BG) Have you considered unforking?
A: Not yet, no.
by Edward Z. Yang at September 23, 2018 03:02 PM
It's been a year since I got my hood and gown and joined Facebook (where I've been working on PyTorch), but while I've been at Facebook, Backpack hasn't been sleeping; in fact, there's been plenty of activity, more than I could have ever hoped for. I wanted to summarize some of the goings-on in this blog post.
There's been some really interesting work going on in the libraries using Backpack space. Here are the two biggest ones I've seen from the last few months:
unpacked-containers. The prolific Edward Kmett wrote the unpacked-containers package, which uses the fact that you can unpack through Backpack signatures to give you generic container implementations with hypercharged performance: 15-45% better than you can get with the usual boxed representation (see the signature sketch after this list). A lot of discussion happened in this Reddit thread.
hasktorch. hasktorch, by Austin Huang and Sam Stites, is a tensor and neural network library for Haskell. It binds to the TH library (which also powers PyTorch), but it uses Backpack, giving the post Backpack for deep learning from Kaixi Ruan new legs. This is quite possibly one of the biggest instances of Backpack that I've seen thus far.
Eta supports Backpack. Eta, a JVM fork of GHC, backported Backpack support into their fork, which means that you can use Backpack in your Eta projects now. It was announced in this Twitter post and there was some more discussion about it at this post.
GSOC on multiple public libraries. Francesco Gazzetta, as part of Google Summer of Code, is working on adding support for multiple public libraries in Cabal. Multiple public libraries will make many use-cases of Backpack much easier to write, since you will no longer have to split your Backpack units into separate packages, writing distinct Cabal files for each of them.
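Returning to the unpacked-containers entry above: for readers who haven't seen Backpack, the rough shape (my own illustrative sketch, not the package's actual interface) is that the container code is written against a signature, and each client supplies a concrete key type:

-- Key.hsig, inside the generic container library (illustrative only)
signature Key where

data Key
compareKey :: Key -> Key -> Ordering

An implementing module fills in a concrete Key and compareKey, and because the library is instantiated and compiled separately against each implementation, GHC can unpack the concrete key representation into the container nodes, which is where the reported speedups come from.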
By and large, we haven't changed any of the user-facing syntax or semantics of Backpack since its initial release. However, there have been some prominent bug fixes (perhaps fewer than one might expect), both merged and coming down the pipe:
Stack support for Backpack. In Stack issue #2540 I volunteered to implement Backpack support for Stack. However, over the past year, it has become abundantly clear that I don't actually have enough spare time to implement this myself. Looking for brave souls to delve into this; and I am happy to advise about the Backpack aspects.
Pattern synonym support for Backpack. You should be able to fill a signature data T = MkT Int with an appropriate bidirectional pattern synonym, and vice versa! This is GHC issue #14478. We don't think it should be too difficult; we have to get the matchers induced by constructors and check that they match, but it requires some time to work out exactly how to do it.
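As an illustration of the feature being requested (my own example, not code from the ticket), a module ought to be able to satisfy a signature declaring data T = MkT Int using a different representation plus a bidirectional pattern synonym:

{-# LANGUAGE PatternSynonyms #-}
module Impl (T, pattern MkT) where

-- A representation other than the signature's data T = MkT Int ...
newtype T = T Int

-- ... exposed through a bidirectional pattern synonym of the same shape.
pattern MkT :: Int -> T
pattern MkT n = T n

The ticket is about teaching signature matching to accept the pattern synonym here.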
by Edward Z. Yang at July 14, 2018 11:34 PM
So, you’ve got a program that’s using more and more memory over time as it runs. Probably you can immediately identify this as a likely symptom of a memory leak. But when we say “memory leak”, what do we actually mean? In my experience, apparent memory leaks divide into three broad categories, each with somewhat different behavior, and each requiring distinct tools and approaches to debug. This post aims to describe these classes, and provide tools and techniques for figuring out both which class you’re dealing with and how to find the leak.
A run-time debugger allows you to see concrete values in a program, make changes to them, and continue running your program. A compile-time debugger allows you to see symbolic values in a program, reason about them, and write the rest of your program, e.g., filling in missing tensor size checks.
Here's an example of a compile-time debugger in action.
Let's suppose you are writing a simple program to read a pair of tensors from two files and do a matrix multiply on them. "Easy," you think, while writing the following program:
main() {
  x = load("tensor1.t")
  y = load("tensor2.t")
  return matmul(x, y)
}
However, there is a twist: this matrix multiply is an unchecked matrix multiply. If you pass it tensors which cannot be validly multiplied together, this is undefined behavior. Your compiler has cottoned on to this fact and is refusing to compile your program. You fire up your compile-time debugger, and it drops you to the point of your program right before the error:
# Ahoy there Edward!  I stopped your program, because I could not
# prove that execution of this line was definitely safe:

main() {
  x = load("tensor1.t")
  y = load("tensor2.t")
->  return matmul(x, y)
}

# Here's what's in scope:

_x_size : List(Nat)  # erases to x.size()
_y_size : List(Nat)  # erases to y.size()
x : Tensor(_x_size)
y : Tensor(_y_size)

# I don't know anything else about these values
Let's take a careful look at the variables in scope. Our compile-time debugger tells us the type of a variable x by writing x : t, meaning that x has type t. We have all sorts of ordinary types like natural numbers (Nat) and lists of natural numbers (List(Nat)). More interestingly, a tensor is parametrized by a list of natural numbers, which specify their sizes at each dimension. (For simplicity, the underlying field of the tensor is assumed to be fixed.)
Our debugger has a command line, so we can ask it questions about the types of things in our program (:t is for type):
> :t 1
# Here's the type of 1, which you requested:
Nat

> :t [1, 2, 0]
# Here's the type of [1, 2, 0], which you requested:
List(Nat)

> :t matmul
# Here's the type of matmul, which you requested:
forall (a, b, c : Nat). (Tensor([a, b]), Tensor([b, c])) -> Tensor([a, c])
The type of matrix multiplication should make sense. We say a matrix multiply takes two 2-D tensors of sizes AxB and BxC, and produces a tensor of size AxC. An equivalent way of phrasing it, as was done in the type above, is to say, “for any natural numbers A, B and C, matrix multiply will take a tensor of size AxB and a tensor of size BxC, and give you a tensor of size AxC”.
It is also instructive to see what the type of load is:
> :t load
# Here's the type of load, which you requested:
String ~> exists (size : List(Nat)). Tensor(size)
We do not know what the dimensions of a tensor loaded from a file are; all we can say is that there exists some size (list of natural numbers), which describes the tensor in question. Our compile-time debugger has helpfully given us names for the sizes of our tensors in scope, _x_size and _y_size, and has also told us how to compute them at runtime (x.size() and y.size()).
Enough of this. Let's remind ourselves why our program has failed to typecheck:
> matmul(x, y)
# I'm sorry!  I was trying to find values of a, b and c which
# would make the following equations true:
#
#     [a, b] = _x_size
#     [b, c] = _y_size
#
# But I don't know anything about _x_size or _y_size (they are skolem
# variables), so I couldn't do it.  Cowardly bailing out!
The compiler is absolutely right. We don't know anything about the size of x or y; they could be 2D, or they could be 100D, or not have matching dimensions.
As an aside: sometimes, it's OK not to know anything about the sizes. Consider the case of adding a tensor to itself:
> add
# Here's the type of add, which you requested!
add : forall (size : List(Nat)). Tensor(size) -> Tensor(size) -> Tensor(size)

> add(x, x)
Tensor(_x_size)
# This type-checked OK!  I set size = _x_size and all of the arguments
# checked out.  You're good to go.
We don't know anything about _x_size, but add doesn't care; it'll take any List(Nat), and _x_size is certainly one of those.
Back to business. We are going to insert dynamic checks that will refine our knowledge of x and y, until it is obvious that matrix multiply will succeed.
What is a dynamic check? Operationally, a dynamic check tests whether or not some condition is true, and aborts if it is not. If we successfully run the dynamic check, we now have new information about the symbolic types in our scope. So for example, after adding a runtime test that two numbers are equal, we can subsequently assume at compile time that the numbers are equal:
> :t assert_eq_nat!
(x : Nat) -> (y : Nat) ~> x = y
First things first, we'd like to assert that our tensors are 2D tensors:
> assert_eq_nat!(len(_x_size), 2)
# OK!  I added assert_eq_nat!(len(x.size()), 2) to your program, and
# here's what I know now:

_x_size : List(Nat)
_y_size : List(Nat)
x : Tensor(_x_size)
y : Tensor(_y_size)
len(_x_size) = 2

# By the way, I can profitably destruct _x_size into its constituent
# parts; would you like to do this? (Y/n)

> Y

# OK, your new context is this:

_x0, _x1 : Nat
_y_size : List(Nat)
x : Tensor([_x0, _x1])
y : Tensor(_y_size)

# I don't know anything about the new variables _x0 and _x1, but I
# learned enough about _x_size that I was able to eliminate it from
# the context (_x_size = [_x0, _x1])
List length is a very helpful property to test against, since it greatly specifies the shape of the list in question. We can do the same for _y_size:
> assert_eq_nat!(len(_y_size), 2)
# OK!  I added assert_eq_nat!(len(y.size()), 2) to your program, and
# here's what I know now:

_x0, _x1 : Nat  # erases to x.size(0), x.size(1)
_y_size : List(Nat)
x : Tensor([_x0, _x1])
y : Tensor(_y_size)
len(_y_size) = 2

# By the way, I can profitably destruct _y_size into its constituent
# parts; would you like to do this? (Y/n)

> Y

# OK, your new context is this:

_x0, _x1 : Nat  # erases to x.size(0), x.size(1)
_y0, _y1 : Nat  # erases to y.size(0), y.size(1)
x : Tensor([_x0, _x1])
y : Tensor([_y0, _y1])

# I don't know anything about the new variables _y0 and _y1, but I
# learned enough about _y_size that I was able to eliminate it from
# the context (_y_size = [_y0, _y1])
We're very close now. All we need to do is assert that the inner dimensions are equal:
> assert_eq_nat!(_x1, _y0)
# OK!  I added assert_eq_nat!(x.size(1), y.size(0)) to your program.
# After doing this, I learned _x1 = _y0, so I replaced all occurrences
# of _y0 with _x1.  Now the context looks like this.

_x0, _x1 : Nat  # erases to x.size(0), x.size(1)
_y1 : Nat  # erases to y.size(1)
x : Tensor([_x0, _x1])
y : Tensor([_x1, _y1])
Victory!
> matmul(x, y)
# This type-checked OK!  I set a = _x0, b = _x1, c = _y1 and all of the
# arguments checked out.  You're good to go.
Extracting the contents of this session back into our code, we now have:
main() {
  x = load("tensor1.t")
  y = load("tensor2.t")
  assert_eq_nat!(len(x.size()), 2)
  assert_eq_nat!(len(y.size()), 2)
  assert_eq_nat!(x.size(1), y.size(0))
  matmul(x, y)
}
At this point, I must come clean: the compile-time debugger I've described above doesn't actually exist. But it is not all that different from the proof modes of the interactive proof assistants the automated theorem proving community works with today. Unlike theorem proving, though, we have a secret weapon: when the going gets tough, the tough turns into a runtime check. Conventional wisdom says that automated theorem proving requires too idealized a setting to be useful in writing software today. Conventional wisdom is wrong.
by Edward Z. Yang at April 07, 2018 03:49 AM
Raise your hand if you've ever put one of these commands in your continuous integration scripts:
Can you tell what the problem is? These commands are not reproducible: depending on when you run them, they may give different results. More insidiously, most of the time they give you the same result (or, perhaps, a different result that still works for your use case).
I know, we need a reproducible build! The prevailing answer to this problem from tooling authors has been to seize the means of production and replace it with something that is reproducible. If you live in the npm/yarn ecosystem, lockfiles ensure all of your dependencies redownload the same way every time (except when they don't). If you live in the Stack ecosystem, Stackage distributions ensure that you get the same Hackage packages every time you build (except when they don't...). If you live in the Nix ecosystem, it means literally replacing the packaging system for everything on your system to achieve reproducibility.
So, it seems:
What if we change the question? We have entered this discussion under the assumption that reproducibility is our terminal value. But it's not: it's a mechanism by which we can achieve other goals. In the setting of continuous integration, what we really care about is a system that gives us signal about whether or not a given change set is correct or breaks things. A non-reproducible build interferes with this goal only in the sense that it's harder to tell whether a change set has broken things if some random dependency has updated itself and broken your build. If this happens, you are blocked: you won't get clean signal until you fix the dependency problem. Broken window theory demands you drop everything and fix the build.
Clearly, we don't care if our dependencies are getting silently upgraded as development proceeds; in fact, we might prefer it, because "automatic" is less friction than "manual", at least when it works. What we do care about is the ability to block the upgrade if it is known to break us or revert the upgrade if we find out later that it caused some breakage.
Online/offline continuous integration. We traditionally think of the continuous integration build as a single pipeline which, when run from beginning to end, gives us signal about whether or not our code works. But I think it is better to think of a CI pipeline as dividing into two phases:
1. An online phase, which talks to the network: resolve, download and build your dependencies into an immutable snapshot of the build environment.
2. An offline phase, which does not: build and test your change set against one of those snapshots.
The key is that you don't have to run step (1) on every build (you didn't want to anyway, for performance reasons). Instead, the series of immutable snapshots of build environments generated by step (1) gives you the ability to revert or peg to a particular version of all of your dependencies, without having to go and make the universe reproducible. You can have a weekly cron job rebuilding your environment, running the tests, and only deciding to push the active snapshot forward if everything passes. You don't have to actually turn off the Internet when you run step (2), but it might help keep you honest.
Think offline. In today's connected world, it's easy to build systems with the assumption that you are always connected to the Internet. Doing so, however, leaves your tool at the mercy of the sound and fury of the real world. By applying a simple principle: "what can I do offline; what must I do online?" we reverse-engineer a design for continuous integration that gives you something almost as good as reproducibility, without forcing you to rewrite the universe. Surely that's worth something.
by Edward Z. Yang at March 13, 2018 02:52 AM
The git bisect command helps you identify the first commit in a range that broke something. You give it a good commit and a bad one, and it will do a binary search between the two to find the first bad commit. At each step, you say either git bisect good or git bisect bad depending on whether it passes your test, and it will move you halfway through the remaining commits in the range.
There are several guides for using git bisect with the Linux kernel (e.g., upstream, Gentoo, and Ubuntu all have one). Unfortunately, they're pretty time-intensive operations; they all say something to the effect of, "now build the kernel, reboot into it, and test it, then type git bisect good or git bisect bad depending on whether it worked." For a tricky hardware compatibility bug, this might be your only option. But if you're testing something about the kernel's behavior, this is unnecessarily slow and manual, and you might be tempted to do something else, like read commit logs.
At work a few days ago, someone reported that a certain application no longer worked in a new VM. After some initial debugging with strace, we determined that the program was calling the listen system call with a backlog of 0: that is, it was saying it was willing to accept up to zero connections. By the specification, it shouldn't work—and yet it did work on their older VM. A few things were different between the new systems, but one notable one was that the new VM had kernel 4.9 and the old one kernel 4.1. (Another was that it was deployed in a new cloud environment that my team is responsible for, with some networking changes, so we wanted to ensure we had not broken anything!)
I tried reading through git log --grep listen v4.1..v4.9 net/, but there was entirely too much and I couldn't find anything. So I decided to see if bisection could help me, with the use of git bisect run, which enables fully automated bisecting. I wasn't excited about rebooting my machine to do a binary search across eight kernel releases, but if I could get it to run in some other way, I could just leave it running.
For a normal program, it's pretty easy to use git bisect run, which just wants a command that returns success (0) or failure (1): you can usually do something like git bisect run make test. For a kernel regression, though, we'll need a command to boot the kernel and run some code. We can use the qemu virtual machine software for this, which has two properties that make it particularly suitable as such a command: it can boot a Linux kernel directly, instead of emulating a bootloader on a hard disk, and it can run a temporary VM in a single command line without any additional setup.
We'll build ourselves a tiny "initrd" (initial RAM disk), which is what's commonly used to load enough drivers to access your hard drive and completely boot your system. However, our initrd will just contain our one test program, which will possibly print a success message, and shut down the system. We can't meaningfully get a return value out of qemu, so we'll just grep its output for the success message.
The first step is to check out the kernel sources, if we don't have them already, and build a kernel:
$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux
$ cd linux
$ make defconfig
$ make -j8
which prints this after a while:
Kernel: arch/x86/boot/bzImage is ready (#21)
Then make sure we can boot our new kernel with qemu, without an initrd:
$ qemu-system-x86_64 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage
That is, run it in text mode with the VM's serial console on standard input/output instead of trying to pop up a graphical window, and tell the kernel to use the serial port for console output. If your system supports it, you can add -enable-kvm to make it faster, although since we want to shut down the VM immediately once we run our test, it doesn't make a huge difference (2 seconds vs. 4 on my machine).
This will panic, because we gave the kernel neither a root filesystem nor a working initrd. (You can kill the VM by typing Ctrl-A and then X.) So let's write an initrd with a single binary, init. It needs to shut down the system, so we get back to our prompt:
$ mkdir initrd
$ cd initrd
$ cat > init.c << EOF
#include <sys/reboot.h>
#include <stdio.h>
#include <unistd.h>
int main(void) {
printf("Hello world!\n");
reboot(RB_POWER_OFF);
}
EOF
(Yes, the system call for shutting down the system is named "reboot", because the name "shutdown" was already used for the system call to close a socket. I guess early UNIX computers didn't support initiating a hardware poweroff from software, so the shutdown command would just stop all processes, sync and unmount disks, and print a message asking the operator to cut power.)
Compile this program statically, so it's a single binary, put it in the particular form required for an initrd (a compressed cpio archive, an old but very simple format with a weird command-line tool), make sure it's named init, and then we can boot it up with qemu:
$ cd initrd
$ cc -static -o init init.c
$ echo init | cpio -H newc -o | gzip > initrd.gz
1621 blocks
$ cd ..
$ qemu-system-x86_64 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz
...
[ 0.502593] ALSA device list:
[ 0.502889] No soundcards found.
[ 0.503554] Freeing unused kernel memory: 1088K (ffffffff81f2f000 - ffffffff8203f000)
[ 0.504262] Write protecting the kernel read-only data: 14336k
[ 0.505004] Freeing unused kernel memory: 1680K (ffff88000185c000 - ffff880001a00000)
[ 0.505855] Freeing unused kernel memory: 1340K (ffff880001cb1000 - ffff880001e00000)
Hello world!
[ 1.089618] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[ 1.092997] ACPI: Preparing to enter system sleep state S5
[ 1.094083] reboot: Power down
Great. We've built our own kernel, passed it a test binary to run, and got it booted in a qemu command that exits. This is turning into something we can pass to git bisect run. Now it's time to write the actual test. Here's what I ultimately ended up with to track down my bug:
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/reboot.h>
#include <sys/stat.h>   /* for mkdir() */
#include <sys/mount.h>  /* for mount() */
#include <sys/ioctl.h>
#include <net/if.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main(void) {
/* The problem I was tracing was only reproducible with syncookies
disabled. While the initrd gets unpacked into a writable temporary
filesystem, nothing exists yet, so if I need /proc, I need to create
and mount it myself. */
if (getpid() == 1) {
mkdir("/proc");
mount("proc", "/proc", "proc", 0, NULL);
char buf[] = "0\n";
int fd = open("/proc/sys/net/ipv4/tcp_syncookies", O_WRONLY);
write(fd, buf, 2);
close(fd);
}
int server = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
/* Also, while a loopback ethernet device exists, it's not
enabled, so network tests won't work. This code is equivalent
to `ifconfig lo up`. */
struct ifreq ifreq = {
.ifr_name = "lo",
};
ioctl(server, SIOCGIFFLAGS, &ifreq);
if (!(ifreq.ifr_flags & IFF_UP)) {
ifreq.ifr_flags |= IFF_UP;
ioctl(server, SIOCSIFFLAGS, &ifreq);
}
struct sockaddr_in addr = {
.sin_family = AF_INET,
.sin_port = htons(54321),
.sin_addr = {htonl(INADDR_LOOPBACK)},
};
bind(server, (struct sockaddr *)&addr, sizeof(addr));
listen(server, 0);
int client = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
struct timeval timeout = {3, 0};
setsockopt(client, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));
if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
printf("Success\n");
} else {
perror("connect");
}
if (getpid() == 1) {
reboot(RB_POWER_OFF);
}
return 0;
}
Most of it is specific to the thing I was trying to test, but you may also need the code to create and mount /proc or to enable lo. Also, I put a few things conditional on getpid() == 1 so that I could safely test the program on my host system, where it wasn't running as root and where I didn't want it powering anything off. (I ran it a few times under strace to make sure it was doing what I expected it to do, and I didn't want to bother with getting strace inside my initrd.)
So I first made sure this is reproducible on a stock kernel by itself, isolated from any config my workplace might add:
$ qemu-system-x86_64 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz | grep ^Success
$ git checkout v4.1
$ make defconfig && make -j8
$ qemu-system-x86_64 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz | grep ^Success
Success
Cool, it's definitely a regression somewhere between those versions. (The set of config options changes from kernel version to kernel version, so across this wide a range, the easiest thing is to just get the current kernel's default config - if you need custom config options, you might want to edit .config after running make defconfig or something.) Now time to let git bisect run do its thing:
$ git bisect start
$ git bisect bad v4.9
$ git bisect good v4.1
$ git bisect run sh -c 'make defconfig && make -j8 && qemu-system-x86_64 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz | grep ^Success'
It started printing a bunch of build logs and I went to work on something else. About half an hour later (I expected it to take longer!), it prints this out:
ef547f2ac16bd9d77a780a0e7c70857e69e8f23f is the first bad commit
commit ef547f2ac16bd9d77a780a0e7c70857e69e8f23f
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Oct 2 11:43:37 2015 -0700
tcp: remove max_qlen_log
This control variable was set at first listen(fd, backlog)
call, but not updated if application tried to increase or decrease
backlog. It made sense at the time listener had a non resizeable
hash table.
Also rounding to powers of two was not very friendly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
$ git describe --contains ef547f2ac16bd9d77a780a0e7c70857e69e8f23f
v4.4-rc1~141^2~238^2~2
which looks awfully relevant—it implies they were previously rounding off the backlog. Looking at the commit, we can see what happened: before kernel 4.4, the backlog argument was always capped to at least 8, and also rounded up to the next power of two. So listen(fd, 0) was turning into listen(fd, 8) on older kernels, and the program previously worked despite using listen() incorrectly. This commit was actually somewhere in the git log I was trying to read, but I must have scrolled past it.
git reflog shows that git bisect went through sixteen commits before settling on this one: it found this one on its 11th try, and then spent 5 more commits confirming that all the commits before this one were good. So I'm glad git bisect run found this commit, and I'm especially glad it found it in half an hour unattended, without me having to manually compile and test sixteen kernels by hand.
by Geoffrey Thomas at March 02, 2018 12:00 AM
The best and worst thing about semantic import versioning is that it makes BC-breaking changes hard.
In the past few days, Russ Cox has made a splash in a series of white papers describing Go and Versioning. In them, he coins a new term, Semantic Import Versioning, distilling it to the following principle:
If an old package and a new package have the same import path, the new package must be backwards compatible with the old package.
I am very happy Russ has come up with a good name for semantic import versioning, because this concept has been out there for quite a long time, but without a concise name or formulation of its design. In fact, I would even say that semantic import versioning is inevitable once you take on the premise that you will never break user code. It is so inevitable that semantic import versioning is already practiced in the wild in a variety of places. Here are a few examples:
One insight I draw from these examples is that what we call an "import version" is really a specification for some series of implementations. To someone who has spent a lot of time thinking about module systems, this is really a step in the right direction: program against interfaces, not implementations.
Another thing we can observe from these examples is the real-world consequences of semantic import versioning. One particular effect stands out: semantic import versioning is challenging for maintainers, because it pressures them to maintain multiple major release branches simultaneously (after all, who wants to use pkg/v2 only to have it immediately unmaintained when pkg/v3 comes out?). In the traditional release-branch model, where one creates a release branch for each major version, only the most well-staffed software development teams can afford to maintain multiple active release branches (backporting patches is a lot of work!). The friction involved with managing multiple implementations means that less well-staffed projects will feel strong pressure never to break backwards compatibility.
This may not sound like such a bad thing to the "don't break my stuff" grumps in the audience, but a lot of bugs and security problems have stemmed from being literally unable to outlaw harmful and dangerous APIs with BC-breaking changes. The danger of moving the calculus further towards preserving backwards compatibility is further entrenchment of bad "first try" APIs. So while I do not deny that a genius of Russ's framing is to describe semantic versioning as part of the package path, it also sets up a bad expectation for the feasibility of BC-breaking changes, when what we should be doing is improving the state of tooling so that making a BC-breaking change is "no big deal." To me, the most promising way to reduce the friction of a BC-breaking change is to organize your software development so that a single codebase, under a single build, implements multiple specifications (v1, v2 and v3), as sketched below. As we saw from the examples, compilers can manage this (GCC supports multiple C++ versions), but traditional programming languages make it hard for libraries to do the same thing.
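As a concrete toy sketch of that idea (my own, not something any existing tool does for you): a single Haskell package could keep one implementation module and expose each "import version" as a thin facade over it.

-- Foo/Internal.hs: the one shared implementation
module Foo.Internal (greetWith) where

greetWith :: String -> String -> String
greetWith greeting name = greeting ++ ", " ++ name

-- Foo/V1.hs: the v1 surface, frozen forever
module Foo.V1 (greet) where
import Foo.Internal (greetWith)

greet :: String -> String
greet = greetWith "Hello"

-- Foo/V2.hs: the v2 surface, carrying the BC-breaking signature change
module Foo.V2 (greet) where
import Foo.Internal (greetWith)

greet :: String -> String -> String
greet = greetWith

Bug fixes land once in Foo.Internal and both version surfaces pick them up, which is exactly the property that the release-branch model makes expensive.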
I don't know exactly how to solve this problem, but I do have a few ideas:
These are maybe a bit too radical to expect Go to adopt them, but perhaps the next generation of programming languages will explore this design space further.
by Edward Z. Yang at February 23, 2018 05:06 AM
M: This workshop is bringing together ML and systems. Can you put your place on that spectrum? Who is your home community?
YJ: Right in the middle. I'd like to move more towards systems side, but Berkeley Parallel Labs kicked me out. ML is my home base.
JL: ML is where I come from, and where I will be, but I'm interested in systems. My home is NIPS and ICML
DS: My area is AI and security, did computer security in the past, now moving into AI.
GG: Systems.
JG: I started out in ML, working on probabilistic methods. I basically, in middle of PhD, looked at systems. Now I'm moving to being a systems person that does ML.
M: We've seen a proliferation of deep learning / ML frameworks that require a lot of dev effort, money, and time to put in. The question is, what is the role of academia in doing research in this area? What kind of large-scale ML learning can you do?
GG: I liked YJ's answer last time.
YJ: The thing that is astonishing is that academia is the source of so many innovations. With all due respect, we did very good work at Google, but then Alex came out with 2 GPUs and nuked the field. Academia is the amazing place where we find all of the new ideas, and industry scales them out.
JL: Some examples. If you're coming from academia, maybe you don't have the resources of a big company, but it's an advantage, as you will spend time thinking about the right algorithm for solving the problem efficiently. And that's what will win in the long run. Short term, they'll brute force it with AutoML. Long run, the learning algorithms are going to be designed so they won't have parameters. A common ML paper is "we eliminate this hyperparameter". When they're more automatic, more efficient, great things will happen. There's an advantage in being resource constrained, as you will solve things in the right way.
Another example: the study of machine learning tells us that in the future we will regard any model that you just learned and deployed as inherently broken and buggy, because data collection is not part of the process of training and deploying. It will decay and become irrelevant. The overall paradigm of ML where you're interacting with the world, and learning, can be studied easily in academia, and that has huge implications for how you're going to design systems.
DS: People often talk about in a startup, the best thing is to not raise a ton of money; if you're resource constrained you're more focused and creative. ML is really broad, there's lots of problems. Right now we learn from lots of data, but lots of talks at NIPS, humans have amazing ability to learn from very few example. These are problems for academia to tackle, given unique resource constraints.
GG: I'll say, it's difficult to concentrate on top accuracy if you don't have enough data, and the data available to students is stuff like DAWNBench, which tends to lag. In academia, we build relationships with industry, send students for internships; they get the ability to work with big data, while exploring first principles in university. It's a challenge, but open publishing and open sharing of code make the world more bearable.
JG: The one thing I've struggled with is human resources. I have grad students; good students focused on a key problem can make a lot of progress. We struggle with getting a lot of data. The struggle with RL is really here; we can build simulators to work at this scale. Being able to use simulation to get data; be creative, find new and interesting problems.
M: A follow-up on process. I think a lot of you have tried to publish ML in your communities. Are they equipped to appreciate the work properly? What is a common reason they don't appreciate it?
JG: Publishing ML in systems, or vice versa, is hard. It goes both ways. These communities are not equipped to evaluate work in the other field. ML in systems, like what you saw here, can be surprising there. Or vice versa: it wouldn't have done well in a systems venue as systems. The failure mode I see is that the systems community doesn't appreciate extreme complexity. In ML, I have this very sophisticated thing, and we reduce it to its essential components; ML tries to overextend its complexity as an innovation. More broadly, each of these communities has its own biases in how it looks at research. One thing I've noticed is that it's gotten better. Systems is better at evaluating, and at this workshop, people are pushing research in an advanced way.
GG: I'm old, so I've seen the creation of conferences before. You start off with an overlap of areas. In my prior life, it was the notion of storage as a research area, rather than an application of devices. You start off, send a submission in. The PC has two people that know anything about it, and they aren't assigned, and the reviews are sloppy; you get one conference that does a little better, but other conferences don't read it. I faced this with the fault tolerance, database, and OS communities: they don't read each other's stuff. You get enough mass, you get a conference that focuses on the middle, with reviewing and a PC that have seen most of the good work in the area. That's hard, but we're on the edge of doing it in SysML. We're doing the right thing to be competitive, on top of the state of the art.
M: Is that the only solution, or can we mix up PCs?
GG: I've seen a lot of experiments to try it. You can end up with permanently fractured communities.
JL: Joey and Dawn are area chairs at ICML. I have found the ML community to be friendly to systems-type things. There's an area chair for systems. Hopefully papers get assigned appropriately.
M: We're not good about that at systems.
DS: About ML and security, we have this problem. In security, we also have a very small percentage of ML, and on the committee, if you submit ML, it's very hard to find people who can review the paper, and as a consequence the review quality varies highly. It's similar for security in ML. It's interesting to think about why this happens and how to solve the problem. In general, sometimes the most interesting work is in the interdisciplinary areas: ML and systems, security, and the examples I see, including machine learning in systems. One thing I actually can understand is, within each community, even though the review quality varies, I can see from the committee's perspective that what they really want is papers that are more meaningful to the community, that help people get exposed to this new area, fostering new exploration. That's part of a natural progression. As time goes on, there's more cross-pollination.
JG: We are launching a SysML conference. I had a few reservations: ML is getting better at systems, but now I have to decide where I'm going to send a paper. A lot of the papers we see in ML are going to have systems.
GG: When you have a new conference area, not all work is sent there. Overlapping, you have a favorite conference, your heroes, and you'll send your most exciting work to that root conference. No problem.
YJ: SysML is great, and this is how it comes out. New fields, it warrants new conferences.
M: Do you think ML expert needs to also be a systems expert? Does such a person who lies at that intersection have a different way of looking? Or you come up with a nice algorithm, and you
JL: It's not OK to have a wall.
There are many ways learning algorithms can be changed. The problem with having a wall is that if you don't understand them, you just throw them over to the engineers. But if you can bridge that and understand them, they're not artifacts; you can break them open and modify them. That can let you achieve much better solutions.
GG: Agreed, but what happens initially is you reach over to the other side, you put it into a system, and it's "my innovation that redundancy makes fault tolerance," even though it's fairly pedestrian from the other side. If it is a substantial improvement, it is worth doing. We all grow up.
JG: We need a wall, but we're going to constantly tear it down. Matlab in grad school: we made jokes about it, and the MKL community would make it fast. Then they said we are going to build ML for distributed computing algorithms, and ML would write classic algorithms for the system. That waned with the development of PyTorch, TF, etc., which leveled up the abstraction. The stack is building up again; the systems community makes it more efficient. Well, floating point could change, and that could affect the algorithm. So we're tearing it down again. But systems is about designing the wall.
YJ: It's more like a bar stool. It's a barrier, but you don't have to be both to do anything; you need it to make things efficient. A story: a training system we looked at, SGD. That person found a very nicely rounded number: 100. But people frown; you should round to 128. Understanding and improving the common core of CS and engineering helps a lot for people to have a good sense of how to design ML algorithms.
M: There's a lot of talk about democratizing AI, and all of you have helped that process. What does a truly democratic AI landscape look like, and how far are we from that world?
YJ: I plead guilty to participating in the framework wars. When reading CS history, one thing is pretty natural: when a field is starting, there are all sorts of standards and protocols. FTP, Gopher, and in the end HTTP took over, and everything runs on HTTP. Right now, there are all kinds of different abstractions; boiling it down, everyone is doing computation graphs and optimization. I look forward to when we have one really nice graph representation and protocol for optimizing graphs. It's not a rosy dream, because in compilers we have that solution, LLVM. I don't know if we'll reach that state but I think one day we'll get there.
JL: You have AI/ML democratized when anyone can use it. What does that mean? A programmer has a library, or language constructs, that they use routinely and easily, with no issues of data getting mismatched or confused or biased. All the bugs people worry about in data science are removed from the system because the system is designed right and is easy to use. The level beyond that is when somebody is using a system, that system is learning to adapt to them. There's huge room for improvement in how people interact. I don't know how often there's a rewrite rule driving me crazy; why can't it rewrite the way I want? When people can signal info to a learning algorithm, and when those signals can be used effectively to assist people, you have democratized AI.
DS: I have a very different view of democratizing AI. I think it's interesting to think about what democratization here really means. For systems people, it's about making it easier for people to do learning, to use these libraries and platforms. But that's really just providing them with tools. For me, when I give talks on democratizing AI, we are looking at it from a completely different perspective. There's a saying: whoever controls AI will control the world. So who controls AI? Even if you give everyone the tools to push a button, they don't have the data to do the training. So who controls AI today, and tomorrow? It's Facebook, Microsoft, Google... so for me, democratization means something totally different. Today, they collect data, train models, and they control who has access to the models; users can get recommendations, but not direct access to models. We have a project to actually democratize AI, where users can control their data. Combining blockchain and AI, users can donate their data to a smart contract, where the smart contract will specify the terms; e.g., if you train a model, the user can use the model, and if the model produces profits, the user can get part of the profits. The smart contract can specify various incentive terms; e.g., if the data is better than others', they can get more profits, and other mechanisms. A developer will supply the ML training algorithm and get benefits when it is trained well. We are decentralizing the power of AI; users will be able to get direct access to models and use them. In this case, I hope for an alternate future, where big companies can continue with their business, but users, by pooling their data in a decentralized fashion, will see actual true democratization of AI; they will access the power of AI, not just use tools.
(applause)
GG: I think that a lot of what's meant by democratizing AI is how you can move from a small number of people innovating to a large number. Tool development and standards. We're close to being there. There was an example in the past: VLSI paint boxes. Up until a certain point, only an EE could really develop hardware at all. It took a lot of effort and time to make sure it could make it through every part without very much crosstalk. A group came together and thought, well, there are some design rules. This lets you build hardware pretty easily. I could paint green/red boxes, get hardware months later, and it worked. It never worked as fast as what that EE guy built, so there would always be a place for him, but it let us build a RISC computer and ship it. We were in the game, we could innovate, and do it. The tools we're trying to build right now can do the same thing on a statistical foundation.
JG: When I started PhD, we did integrals and derivatives by hand. Automatic differentiation was a huge step forward. I blame that for the explosion of papers. A first year can build something far more complex than what I could do. That's moving AI forward, on algorithms side.
The data side is interesting, and that is one place where I think about systems. There are a lot of opportunities to think about how security interacts, leveraging hardware to protect data, markets to sell and buy data from sources, and protecting the data across a lot of places. I would argue we're making a substantial amount of progress in how we think about algorithms.
M: When I think about democratizing pervasive AI, there are recent questions that have been consuming our minds: interpretability, fairness, etc. Can you share... any experience where things like interpretability came up and became a problem or issue, that we have to worry about a lot more in ML, or systems-ML?
JG: My grad students come to me and say the models stopped working. I don't know how to fix that; the process is very experimental. Tracking experiments is a big part of the process. We cared a lot about interpretable models, and that meant something very particular. Now it's explainable; we don't need to know exactly what it did, but there needs to be some connection to what it did. Interpretable means explaining the computation, which could be related or unrelated to the decision. Those are two answers about explainability, and how we debug these systems.
GG: SOSP just happened, and they have ten years of good copies of everything submitted. At the end of the conference, Peter Chen took all the PDF files, trained a naive Bayes classifier, and saw how well he could predict whether a paper would be accepted. And half the things it predicted would be accepted, were accepted.
So what did it do? It made a detector for popular authors. So those who had succeeded before, it would follow behind them. I recognize this problem. You might think that you found a good way, but it's actually Nickolai Zeldovich's paper.
DS: There's a big debate. Some think it's really important; sometimes, as long as the model works, it's fine. Our brain: we can't really explain how we arrive at certain decisions, but it works fine. And it depends on the application. Some applications have stronger requirements for explainability, e.g., law and healthcare, whereas in others it's less required. Also, as a whole community, there's a lot we don't understand. We can talk about causality, transparency, all related. As a whole community, we don't really understand what explainability means. There's not a good definition. All these concepts are related, and we're trying to figure out what the real core is. That's a really good open question.
JL: There are two different interpretations. Can you explain it to a person? That's limited; there are no explainable vision models. The other definition is debuggability. If you want to create complex systems, they need to be debuggable. This is nontrivial with a distributed system, and it's nontrivial with ML. If you want to create nontrivial ML systems, you have to figure out why they're not behaving the way you want them to.
DS: Do we debug our brains?
JL: Evolution has done this the hard way for a very long time... a lot of people have bugs in their brains. I know I have bugs. I get an ocular migraine sometimes... very annoying. No, we don't debug our brains, and it's a problem.
YJ: I'm sure there are bugs in my brain; I chased chickens at my grandma's house; the chicken has one spot on its back that if you press it, it just ducks and sits there. It shuts off because of fear. We humans don't do that, but bugs like these are in our brains as well. Chasing interpretability helps us understand how things work. In the old days, deep dream: this line of work started with figuring out what the gradients do, and we propagated back, and we found that the direct gradient doesn't work; then we added L1 priors, and then we got pictures. This curiosity has led to the realization that convnets with random weights are codifying local correlation; we are hardcoding structured info into CNNs which we didn't know before. So maybe we will not achieve full interpretability, but some amount of interpretability and creativity will help.
(audience questions)
A: I'd really like to hear what Jeff said about ML for systems. As a systems person, I'm interested in it, but people have said you can get far with heuristics.
JL: I think it's exciting.
GG: The learned database index paper, when I read it for reviewing, I went, "Wow! Is that possible?" I think things like that will change the way we do systems. The novelty of the application opens a lot of people's minds. Right now we think of machine learning tools as expensive things that repeat what humans do easily but computers don't do well. But that's not what a DB index is. We can execute it, but we're not better. But to get it to half the size and twice the speed, throwing in another way of thinking about compression through a predictor is a fabulous insight.
JG: I tried to publish in this area for a while. For a while, systems didn't like the idea of complex algorithms in the middle of their systems. These days, systems people are like, "ML is cool." But where it's easier to have success is where your prediction improves the system, but a bad prediction doesn't break the system. So scheduling: that's good. Where models can boost performance but not hurt. That's where the work using ML to solve systems problems is successful.
DS: ML for systems is super exciting. I'm personally very excited about this domain, especially for people who have done systems work and are interested in AI. ML for systems is an amazing domain for ML. I wouldn't be surprised, and I would hope to see, in five years, that our systems are more ML driven. A lot of systems have a lot of knobs to tune, set by trial and error, which is exactly where ML can help. With these amazing techniques, RL, bandits: instead of using bandits to serve ads, we can try to autotune systems. Just like we are seeing AI transforming a lot of application domains, old systems, the ones we built, should be more intelligent. It's a prediction: I think we are going to see a lot of work in this domain. I think it will transform systems.
M: I work in this area quite a bit. We have some successes with bandits in some settings, but there are settings that are really tough: stateful ones, where choices and decisions influence the future, which makes it hard to apply RL, or the RL techniques take a lot of data. There are challenges, but there are successes. There are a lot of papers that apply RL to caching, resource allocation. The real question is why it's not used in production. I don't know if we have an answer to that; papers do it, it seems to be really good, but it's not mainstream, especially having RL all over the place. Why isn't it pervasive? That I don't see.
A: Isn't it because it's not verifiable? You want some kind of verification analysis.
GG: It's called a regression sweep. If you deploy on a lot of systems, there's a lot of money; it has to work. If it falls over, that's a lawsuit. I hired a VP of software: "OK, now that I'm in charge, things are going to slow down. Every line of code is bugs; if I want low bugs, I stop programmers from writing code by making the bar very high." This is the thing Joy was talking about: they need a really compelling reason with no downsides, and then they have to pass tests before they pass. So anything stochastic has a high bar.
SB: Another thing that is happening: there aren't that many people who have understanding in both areas. It's really hard to do ML in systems without deep expertise in systems. You really need to understand it to explain it.
GG: It wasn't that long since we didn't have hosted services.
M: Guardrails: you constrain the ML system so it can't suggest something bad. We have a scenario at MS where machines are unresponsive. How long do you wait? You can do it with ML. The choices are reasonable; they're never more than the max you'd want to wait.
A: On democratization. There's been a lot of talk about optimizing the models so they can bear the cost. Another is decentralizing data... but there are two very big constraints for systems and models: they cost a lot of money, and there's big variance. Because of cost, if some guy gets into programming and does research, he won't have the resources to do it. So they won't go into engineering; they'll intern at Amazon instead. So if there is some community working on lowering the barrier, democratizing, what solution is there to bring people in more easily? Because there are huge economic costs. People are trying to make huge amounts of money, startups, but there's no... systems have problems with decentralization... there's just a big problem of economics colliding with ML.
JG: We teach data science; I teach data science at Berkeley. The summary is: what about the costs of getting into DL? There's the cost to train models, GPUs, data; how do I get a freshman in college who is excited about this, on a Chromebook, to do research and explore opportunities? At Berkeley we have exactly this problem. I teach 200 students; a lot of them are freshmen, with a Chromebook or iPad as their primary computer. We've built tools using Azure... we run a cloud in Azure, and on these devices they can experiment with models. They get to use pretrained models and appreciate how to... Someone built a Russian Twitter-bot detector, and saw value and opportunity in that. And then they got involved in research projects where they had more funds and tools.
JL: The right interfaces make a huge difference, because they prevent you from having bugs that prevent you from doing things. Also, DL is all the rage, but framing the problem is more important than the representation you use. If you have the right problem and a dumb representation, you'll still do something interesting; otherwise, it's just not going to work very well at all.
YJ: As for industry, don't be afraid of industry; try it out. Back at Berkeley, when Berkeley AI was using GPUs, the requirement was that you have one project per GPU. We students framed ten different projects, and we just asked for ten GPUs. NVIDIA came to us and asked, what are you doing? We'll just give you 40 GPUs and you can do research on that. Nowadays, FAIR has a residency, and Google AI has a residency; all of these things are creating very nice collaborations between industry and academia, and I want to encourage people to try it out. Industry has funds, academia has talent; marrying those together is an everlasting theme.
A: Going back to where do we go forward in terms of conferences, the future of this workshop; has any decision been made, where we go?
SB: This is a work in progress. We're interested in feedback and what you think. We've had this workshop evolving for 10 years, with NIPS and ICML. Then we did one with SOSP, which was exciting. We are now doing a separate conference at Stanford in February. We think there's really an important role to play for workshops colocated with NIPS and ICML, so we're still planning to continue this series of workshops. There's also a growing amount of systems work in ICML and NIPS, a natural expansion to accept that work. The field is growing, and we're going to try several venues and form a community. If people have ideas, let us know.
JG: More people should get involved.
M: We plan to continue this; audience is great, participation is great.
It's a panel, so I have to ask you to predict the future. Tell me something you're really excited about... 50-100 years from now. If you're alive then, I will find you and see if your prediction panned out. Or say what you hope will happen...
YJ: Today we write in Python. Hopefully, we'll write every ML model in one line. Classifier, get a cat.
JL: Right now, people are in a phase where they're getting more and more knobs in learning. ML is all about having fewer knobs. I believe the ML vision of fewer knobs. I also believe in democratizing AI. You are constantly turning ... around you, and devs can incorporate learning algorithms into systems. It will be part of tech. It's part of the hype cycle. NIPS went through a phase transition. At some point it's gotta go down. When it becomes routine, we're democratizing things.
DS: It's hard to give predictions... I guess, right now, if we take ML as an example, we see the waves. Not so long ago, there was the wave of NNs, then graphical models, now we're back to NNs. I hope that there's a plateauing. Even this year, I have been talking to a lot of great ML researchers, and even though one can say there have been more papers written this year, when you hear what people talk about in terms of milestones, many people mention milestones from past years: AlexNet, ResNet... I do hope that we will see new innovation beyond deep learning. I do teach a DL class, but I hope that we see something beyond DL that can bring us... we need something more, to bring us to the next level.
GG: I'm tempted to point out DL is five years old, and the dotcom era was not more than five years... I'm looking forward to a change in the way CS, and science in general, does business, having learned from statistical AI. My favorite one is overfitting. I poorly understood overfitting, in vague stories, until ML hammered home what it means. I look forward to the time when students tell me they stopped writing code because they were just adding parameters... and they added a decent random, iid process for testing code. We're nowhere near there, but I think it's coming.
JG: I'm looking forward to the return of graphical models... actually not. We talk about democratizing AI, but what ultimately happens is we're democratizing technology. I can walk up to Alexa and teach it. Or I can teach my Tesla how to park more appropriately. Tech that can adapt to us because it can learn; when I can explain to a computer what I want. (Star Trek, but without a transporter.)
by Edward Z. Yang at December 09, 2017 02:17 AM
The below is a transcript of a talk by Daniel Lo on BrainWave, at the ML Systems Workshop at NIPS'17.
Deploy and serve accelerated DNNs at cloud scale. As we've seen, DNNs have enabled amazing applications. Architectures achieve state-of-the-art results on computer vision, language translation and speech recognition. But this is challenging to serve in large-scale interactive services, because there are latency, cost and power constraints. Also, DNNs are growing larger in size and complexity.
We've seen a Cambrian explosion of startups to solve this problem. Research groups have produced DNN processing units, DPUs: custom hardware solutions to provide high-throughput, efficient serving of DNNs. We categorize them into two categories: fast DPUs, where the algorithms and applications have to be fixed at design time because they're fabbing an ASIC, and soft DPUs, based on FPGAs. But soft DPUs haven't been deployed at scale.
To address this, we've been working on Project BrainWave: a solution to deploy large-scale DNNs with FPGA acceleration. We've designed it to be fast, flexible and friendly. High-throughput, low-latency acceleration using FPGAs. Flexibility with adaptive numerical precision and updates to the latest AI algorithms via reconfigurable FPGAs. And it's user friendly, because we have a full-stack solution: we take CNTK/Caffe/TF models and compile them down. This is deployed on our configurable cloud: an outer layer of CPUs, a datacenter network that puts everything together, and a layer of reconfigurable FPGAs.
We've deployed DNN models, for example an LSTM model that takes tens to hundreds of milliseconds on CPU. What we show is the 99th percentile latency; even at the 99th percentile we are able to achieve sub-millisecond latencies. When you get to these levels of acceleration, the DNN is negligible in the end-to-end pipeline.
Next I'll dive into details. It's a full stack solution, starting with a compiler and runtime that takes models in high level frameworks and compiles them down to our architecture; a flexible ISA for serving DNNs; and high throughput, low latency serving. We do this all with persistency at scale, keeping models pinned in FPGA memories. It's deployed on our wide deployment of Intel FPGAs using hardware microservices.
To begin with, let's talk about hardware microservices. This is something we presented at MICRO. The architecture of the reconfigurable cloud is that FPGAs sit between the CPU and the network. A CPU can use its FPGA locally for acceleration, but because the FPGAs are connected over the network, work can be distributed between them. We have a proprietary network protocol for low latency compute.
We've disaggregated the FPGA compute plane from the CPUs. So we can aggregate FPGAs together to form larger accelerators, and you don't have to match the ratio of FPGAs to CPUs. You can serve a large number of CPUs with a small cluster of FPGAs, or vice versa.
Next I'll talk about the compiler and runtime. The goal is to make it very easy for ML specialists to do this; the typical ML specialist doesn't know how to program these devices. Models are developed in high level frameworks, and we compile them down to our architecture. We compile them first into an intermediate graph based representation, then split them into portions that run on FPGAs and portions that run on the CPU. When we execute, we also have a runtime that handles orchestration and scheduling between the parts.
There are two main categories of DNNs we have to optimize for. DNNs with a very high compute to data ratio, like convnets, are well studied. I'm going to focus on the other class of DNNs, those with a lower compute to data ratio, e.g. dense layers and RNNs.
In the conventional approach to accelerating DNNs on FPGAs, you keep all the model parameters in DRAM. When a request comes in, you stream the model parameters out of DRAM and return a result. The issue is that when you have DNN layers that are memory bandwidth bound, memory bandwidth limits how fast you can run; you're not getting the full compute capabilities of the FPGA. Typically the way to solve this is with batching: you gather a number of requests and reuse the model parameters across all of them. While you may achieve good throughput, latency will increase. For realtime services, this violates your SLA. What we want to do is provide high performance at low or no batching.
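To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch in Python; every number in it is an illustrative assumption of mine, not a figure from the talk:

    # Why streaming weights from DRAM is latency-bound at batch size 1,
    # and why pinning them in on-chip memory helps. Numbers are assumptions.

    params = 4096 * 4096            # weights in one dense/RNN layer (assumed)
    weight_bytes = params * 2       # fp16

    dram_bw   = 50e9                # bytes/s off-chip (assumed)
    onchip_bw = 5e12                # aggregate on-chip BRAM bandwidth (assumed)
    peak_macs = 5e12                # multiply-accumulates/s (assumed)

    def layer_time(batch, bw):
        compute = params * batch / peak_macs   # MACs needed for the whole batch
        memory  = weight_bytes / bw            # weights are read once per batch
        return max(compute, memory)

    for b in (1, 16, 128):
        print(f"batch={b:4d}  DRAM: {layer_time(b, dram_bw)*1e6:8.1f} us   "
              f"on-chip: {layer_time(b, onchip_bw)*1e6:8.1f} us")

With these made-up numbers, a batch-1 request is stuck behind the ~670us it takes to stream the weights from DRAM, while keeping the weights on chip brings it down to single-digit microseconds; batching only amortizes the DRAM read at the cost of per-request latency.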
The way we do this is with persisted DNNs. FPGAs have lots of memory on chip, around 10MB, and since it's on chip, it's high bandwidth. So we keep the model parameters on the chip, and when a request comes in, we distribute it across the entire FPGA chip.
The obvious question is: what happens if your model doesn't fit on chip? We take advantage of the hardware microservices. We'll distribute a single model over multiple FPGAs in the datacenter.
Let's look at the architecture and microarchitecture of the processing unit we developed. The BrainWave DPU is a software programmable processor, programmed in single-threaded C, but we've added a number of instructions for serving DNNs, e.g., matrix multiply, convolution, nonlinear activations, embeddings. The processor is designed to use narrow precision formats (float16) and is easily extensible to newer algorithms.
In the microarchitecture of the processor, the main portion is dedicated to the matrix-vector unit: a matrix-vector multiply consisting of a number of kernels operating on tiles of a larger matrix. Tiling gives us flexibility while maintaining performance. The other compute units are multifunction units for vector-vector operations, such as element-wise multiply, add and activation functions. Tying it all together is an on-chip network that connects all the compute units.
Most of the chip is dedicated to the matrix-vector unit. It's composed of hundreds of multilane dot product units. Each of these dot product units consists of tens of adders and multipliers. To keep them fed with data, each dot product unit is fed by a set of dedicated block RAMs.
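As a software analogy of that dataflow (purely illustrative; the tile width is an assumption and this is obviously not the hardware itself):

    import numpy as np

    # The matrix is cut into column tiles; each tile is handled by one
    # "dot product unit" (lanes of multiplies feeding an adder tree),
    # and partial sums are accumulated per output row.

    TILE = 128  # columns handled by one dot-product unit (assumed)

    def tiled_matvec(W, x):
        rows, cols = W.shape
        y = np.zeros(rows)
        for r in range(rows):
            acc = 0.0
            for c0 in range(0, cols, TILE):   # one tile -> one dot-product unit
                acc += np.dot(W[r, c0:c0 + TILE], x[c0:c0 + TILE])
            y[r] = acc
        return y

    W = np.random.rand(256, 512)
    x = np.random.rand(512)
    assert np.allclose(tiled_matvec(W, x), W @ x)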
Next, I'd like to show performance results for this architecture. Two years ago, we had a deployment of Stratix V FPGAs. This shows the effective teraflops for different formats: 16-bit integer, and our own format, Microsoft Floating Point; 4.5 Tflops at MSFP5.8. These Stratix parts are pretty old.
(Demo for latest generation of FPGAs)
Looking at a throughput oriented DPU, the latency is 65.81ms. With BrainWave, the latency is 0.98ms: under 1 millisecond.
This was done on initial engineering silicon. For production silicon, we're expecting to get 12 TOps at 16-bit integer, and 90 TOps for MSFP8. One question is how numeric precision affects accuracy. Here is the normalized accuracy for three in-house text models, using GRUs and LSTMs. The orange bar shows what happens when you go to MSFP9, but we've developed a way to fine tune networks for this precision, and you see we recover the accuracy. We're working with MSFP8 and see similar results.
Project BrainWave is our project for accelerating DNNs at cloud scale. We hope it will be fast, friendly and cloud-scale, and will expand the capabilities of AI in the cloud, providing a way to run higher dimensional RNNs for NLP and other great applications. We're planning to release it to third parties; stay tuned.
Q: When you decrease batch size, what happens to hardware utilization? How is the hardware utilized as batch size decreases?
A: We stay highly utilized even as we decrease batch size; even at high batch size, we're still sending requests one by one. (So only one step will be processed at a time?) Right.
Q: Regarding the FP9 and FP8, nine and eight being the number of bits used? (Yes) Is it in any way related to Flexpoint at Intel?
A: We developed this independently of flexpoint, and I'm not able to talk about our numeric format.
Q: At MS, do you really write Verilog for your FPGAs, or do you use a high level synthesis tool?
A: For this, we are writing SystemVerilog.
Q: Batchnorm layers, which require batch computation; how do you put that onto the FPGA?
A: Part of the work of the compiler is to do splitting between CPU and FPGA. So things that are not amenable to FPGA, including batchnorm, we're still running them on CPU.
by Edward Z. Yang at December 08, 2017 08:08 PM
The below is a transcript of a talk by Virginia Smith on MOCHA, at the ML Systems Workshop at NIPS'17.
The motivation for this work is that the way we think about solving ML problems in practice is changing. The typical ML workflow looks like this. You start with a dataset and a problem to solve. Say you want to build a classifier to identify high quality news articles. The next step is to select an ML model to solve the problem. Under the hood, to fit the model to your data, you have to select an optimization algorithm. The goal is to find an optimal model that minimizes some function over your data.
In practice, there's a very important part of the workflow that is missing. For new datasets and new systems of interest, the system and its properties play a large role in the optimization algorithm we select to fit the model. To give an example, in the past several years data has become so large that it must be distributed over multiple machines, in a datacenter environment. I've been thinking about how to perform fast distributed optimization in this setting, when data is so large.
But more and more frequently, data is not coming nicely packaged in a datacenter. It's coming from mobile phones and devices, distributed across the country and the globe. Training ML in this setting is challenging. For one, whereas in the datacenter you have hundreds to thousands of machines, here you have millions to billions of devices. Also, in the datacenter, devices have similar capability; here, you have phones that are old, low on battery, or not connected to wifi. This can change the ability to perform computation at any given iteration.
Additionally, there's heterogeneity in the data itself. For privacy and computation reasons, data can become very unbalanced in the network. And it can be non-IID, so much so that there can be interesting underlying structure to the data at hand. I'm excited because these challenges break down into both systems and statistical challenges. The one second summary of this work, thinking about both systems and statistics in this federated setting: the punchline is that the systems setting plays a role not only in the optimization algorithm but also in the model we select to fit. It plays a more important role in this overall workflow.
I'm going to go through how we holistically tackle systems and statistical challenges.
Starting with the statistical side. The setup is that we have a bunch of devices generating data, which could be unbalanced; some devices have more data than others. One approach used in the past is to fit a single model across all of this data. All of the data is aggregated, and you find one model that best achieves accuracy across all of the data simultaneously. The other extreme is to find a model for each device, and not share information. From a systems point of view this is great, but statistically, you might have devices that have only ... that are poor in practice. What we're proposing is something between these two extremes. We want to find local models for each device, but share information in a structured way. This can be captured in a framework called multitask learning.
The goal is to fit a separate model, with its own loss function, for each device. These models can be aggregated into a matrix W, and the job of the regularizer is to force some structure Omega onto it. This Omega is a task relationship matrix, capturing interesting relationships, e.g., all the tasks are related and you want to learn shared weights, or most of the tasks are related and there are a few outliers, or there are clusters and groups, or there are more sophisticated relationships like asymmetric relationships. These can all be captured in multitask learning.
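The objective being described is, roughly, the standard federated multi-task learning formulation; the notation below is mine rather than a quote from the slides:

    \min_{W,\,\Omega} \; \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t\big(w_t^\top x_t^i,\, y_t^i\big) \;+\; \mathcal{R}(W, \Omega)

Here W = [w_1, ..., w_m] collects the per-device models, and the regularizer R (for example \lambda_1 \operatorname{tr}(W \Omega W^\top) + \lambda_2 \lVert W \rVert_F^2) imposes structure through the task relationship matrix Omega.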
We developed a benchmarking set of real federated datasets. This includes trying to predict human activity from mobile phone sensors, predicting whether someone is eating or drinking, land mine detection, and a vehicle sensor task: distributed sensors determining whether a vehicle is passing by.
For these various datasets, we compared global, local and MTL approaches. The goal is to fit an SVM model. For each dataset, we looked at the average error across tasks, where each device is a task. What you can see is that the average error for MTL is significantly lower than for the global and local approaches. This makes sense because MTL is much more expressive; it lets you go between the two extremes. What's interesting is that on these real datasets it really helps: the error is reduced by half. This is a significant improvement in practice.
Given that we'd like to use multitask learning to model data in the federated environment, the next problem is to figure out how to train it in a distributed setting, thinking about massively distributed data. In particular, the goal is to solve the following optimization objective. In looking at how to solve this objective, we note that it's common to solve for W and Omega in an alternating fashion. When you solve for Omega, it's done centrally; you just need access to the models. But the W update must be distributed, because the data is stored across the devices. The key component of how to solve this in practice is the W update. The challenge is that communication is extremely expensive, and because of heterogeneity, you may have massive problems with stragglers and fault tolerance; e.g., someone turns their phone off.
The high level idea for how we do this: take a communication efficient method that works well in the datacenter, and modify it to work in the federated setting. It will handle MTL as well as stragglers and fault tolerance.
What is the method we're using? The method is CoCoA, a state of the art method for empirical risk minimization problems. The thing that's nice about CoCoA is that it spans the prior work of mini-batch and one-shot communication by making communication a first class parameter of the method, making it as flexible as possible. It does this by solving not the primal formulation but the dual. The dual is nice because we can easily approximate it by forming a quadratic approximation to the objective, and this decomposes more easily across machines.
To bring this to the federated setting, a key challenge is figuring out how to generalize it to the MTL framework. A second challenge: in CoCoA, the subproblems are assumed to be solved to some accuracy theta, which varies from 0 to 1, where 0 is an exact solve and 1 is inexact. This can be thought of as how much local computation you do versus communication. However, this is not as flexible as it should be in the federated setting: there is only one theta, set for all iterations and all nodes, and because theta cannot be set exactly to 1, it cannot handle fault tolerance, where no work is performed at some iteration. We want to make this communication parameter much more flexible in practice.
How are we doing this? We developed MOCHA. The goal is to solve the multitask learning framework, for W and Omega, in an alternating fashion. In particular, we're able to form a dual formulation, similar to CoCoA, so it decomposes. In comparison, we make a much more flexible assumption on the subproblem accuracy parameter. This is important because of stragglers: for statistical reasons (unbalance, different distributions), it can vary a lot how difficult the subproblems are to solve. Additionally, there can be stragglers due to systems issues, and issues of fault tolerance. So this looks like a simple fix: we make the accuracy parameter more flexible, allow it to vary by node and by iteration t, and let it be exactly 1. The hard thing is showing that it still converges to the optimal solution.
Under this new assumption, plus the condition that a device can't go down every single round, we show the following convergence guarantee: for L-Lipschitz losses, we get convergence at a rate of O(1/epsilon); for smooth losses (e.g. logistic regression) we get a linear rate.
How does this perform in practice? The method is quite simple. The assumption is we have data stored at m different devices. We alternate between solving for Omega centrally and solving for W, which is stored across the devices. For the W update, we define local subproblems for the devices and call a solver that produces an approximate solution. This is flexible because the accuracy can vary by node and iteration.
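Here is a toy, runnable sketch of this alternating structure, not the actual MOCHA code: the Omega update below uses a standard closed form from the MTL literature, and the variable number of local gradient steps stands in for the flexible per-device, per-round accuracy (zero steps mimicking a dropout).

    import numpy as np

    rng = np.random.default_rng(0)
    m, d, n = 5, 10, 40                      # devices, features, samples per device
    X = [rng.normal(size=(n, d)) for _ in range(m)]
    w_true = rng.normal(size=d)
    Y = [x @ (w_true + 0.3 * rng.normal(size=d)) for x in X]   # related but distinct tasks

    W = np.zeros((d, m))
    lam, eps = 0.1, 1e-6 * np.eye(m)

    for outer in range(10):
        # Central Omega update (needs only W): Omega ~ (W^T W)^(1/2), unit trace.
        vals, vecs = np.linalg.eigh(W.T @ W + eps)
        S = vecs @ np.diag(np.sqrt(np.maximum(vals, 0))) @ vecs.T
        Omega = S / np.trace(S)
        Omega_inv = np.linalg.inv(Omega + eps)

        # "Distributed" W update: each device does a variable amount of local work.
        for t in range(m):
            steps = rng.integers(0, 4)       # heterogeneous local work; 0 = dropped out
            for _ in range(steps):
                grad = X[t].T @ (X[t] @ W[:, t] - Y[t]) / n + lam * (W @ Omega_inv)[:, t]
                W[:, t] -= 0.05 * grad

    print("avg task error:",
          np.mean([np.mean((X[t] @ W[:, t] - Y[t])**2) for t in range(m)]))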
In terms of comparing this to other methods, what we've seen is the following. We compared MOCHA to CoCoA, Mb-SDCA and Mb-SGD. We ran simulations with real data to see what would happen if we ran over wifi, plotting simulated time against how close we are to optimal. What you can see is that MOCHA converges much more quickly to the optimal solution, because MOCHA doesn't have the problem of statistical heterogeneity, and it's not bogged down by stragglers. This is true for all of the different types of networks, LTE and 3G. The blue lines, MOCHA and CoCoA, work well in high communication settings, because they are more flexible. But compared to CoCoA, MOCHA is much more robust to statistical heterogeneity.
What's interesting is what happens if we impose some systems heterogeneity, where some devices are slower than others. We looked at imposing low and high systems heterogeneity, and for MOCHA with this additional heterogeneity, there is a two orders of magnitude speedup to reach the optimal solution.
And for MOCHA in particular, we looked at the issue of fault tolerance. What we're showing here is what happens as we increase the probability that a device drops out at any iteration. Going up until half the devices may drop out, MOCHA is still fairly robust, converging in almost the same amount of time. But what we see with the green dotted line is that if the same device drops out every iteration, it doesn't converge. This shows the assumption we made makes sense in practice.
The punchline is that when thinking about this new setting, training ML on these massive networks of devices, it is both a statistical and a systems issue, and we've addressed it in a holistic manner. Code at http://cs.berkeley.edu/~vsmith. I also want to reiterate the announcement of the SysML conference in February.
Q: When you compare global and local? Why is it always better than global?
A: The motivation for using a local model over a global model is that if you have a lot of local data, the local model might perform better; the global model boosts the overall sample size. I have some additional experiments where we took the original data and skewed it even further than it already was. When there is less data locally, the global approaches can help. It's just a function of the data on the devices.
Q: I really like how your method has guarantees, but I'm wondering about an approach where you create a metalearning algorithm and have it work locally?
A: That's worth looking into empirically, since you can do fine tuning locally. What we were trying to do first was converge to the exact optimal solution, but something that just works well empirically would be good to compare against in this setting.
by Edward Z. Yang at December 08, 2017 06:15 PM
The below is a transcript of a talk by Alex Beutel on machine learning database indexes, at the ML Systems Workshop at NIPS'17.
DB researchers think about their research differently. You have a system that needs to work for all cases. Whereas in ML, we have a unique circumstance: I'll build a model that works well for my data. In DB, one size has to fit all.
To give an example of this, consider a B-tree. A B-tree works for range queries: we have records with keys, and we want to find all records for a range of keys, say 0-1000. You build a tree on top of a sorted array to quickly look up the starting point of the range. But what if all my data, all of the keys, are the integers from zero to a million? It becomes clear you don't need the whole tree above: you can use the key itself as an offset into the array. Your lookup is O(1), with O(1) memory, and no need for an extra data structure.
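A tiny illustration of that point, assuming the keys really are the dense integers 0..N-1:

    import bisect

    # If key k lives at slot k, the "index" is just arithmetic: no tree needed.
    records = [f"record-{k}" for k in range(1_000_000)]
    keys = list(range(1_000_000))

    def btree_ish_lookup(key):
        # stand-in for a tree / binary search over the sorted keys: O(log n)
        return records[bisect.bisect_left(keys, key)]

    def learned_lookup(key):
        # the "model" for this distribution is the identity function: O(1)
        return records[key]

    assert btree_ish_lookup(123_456) == learned_lookup(123_456) == "record-123456"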
Now, we can't make a custom implementation for each application to take advantage of some pattern. DBs scale to any application; we don't want to rebuild them every time.
But ML excels in this situation. It works well for a wide variety of distributions, and can learn them and make use of them effectively.
This is the key insight we came to. Traditional data structures make no assumptions about your data: they work under any distribution, and generally scale with the size of the data. Interestingly, learning these data distributions can offer a huge win. What we're trying to get to is that instead of scaling with the size of the data, we scale with the complexity of it. With linear data, it's O(1). For other distributions, can we leverage this?
There are three data structures underlying databases. There are B-trees, for range queries and similarity search: the main index. Hash maps for point lookups of individual records; these are common throughout CS. And Bloom filters, which are really common for set-inclusion queries: do I have this key? If your records are stored on disk, checking first whether there's a record with that key is worthwhile. We're going to focus entirely on B-trees.
B-trees are a tree like structure with a high branching factor. What makes them really effective is that they're cache efficient. You can store the top level nodes in your cache where they're fast to look up, maybe the next levels in main memory, and the actual data on disk. By caching the hierarchy appropriately, it's efficient. At a high level, a B-tree maps a key to a page, some given place in memory. Once it finds that page, it does some local search to find the particular key: that could be a scan or a binary search; we know the record will be between the start of the page and the start plus the page size.
At an abstract level, the B-tree is just a model. It takes a key and tries to estimate the position. What we need in this case is to search within some error range to find the actual record, so you might think we can't use just any model: we need err_min and err_max. But we have all the data. At index construction time, you know all the data you'll be executing against, and you can calculate what the model's min and max error is.
One interesting observation is that this is just a regression problem. What you're really modeling is the CDF. On this plot, the X axis is your keys and the Y axis is position. This is modeling where your probability mass is located, i.e., where your data is in the keyspace. CDFs are studied somewhat, but not a ton, in the literature. This is a nice new implication for research.
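Here is a minimal sketch of that idea, my own toy version rather than the paper's implementation: fit a model from key to position, record its worst-case errors over the known keys at build time, and search only inside that window at lookup time.

    import numpy as np

    np.random.seed(0)
    keys = np.sort(np.random.lognormal(mean=0.0, sigma=1.0, size=100_000))
    positions = np.arange(len(keys))

    # "Model": a simple linear fit of the empirical CDF (position vs. key).
    a, b = np.polyfit(keys, positions, deg=1)
    pred = np.clip(a * keys + b, 0, len(keys) - 1).astype(int)
    errs = positions - pred
    err_min, err_max = errs.min(), errs.max()   # known exactly at build time

    def lookup(key):
        p = int(np.clip(a * key + b, 0, len(keys) - 1))
        lo = max(p + err_min, 0)
        hi = min(p + err_max + 1, len(keys))
        # bounded local search: binary search within the guaranteed window
        return lo + int(np.searchsorted(keys[lo:hi], key))

    i = 54_321
    assert keys[lookup(keys[i])] == keys[i]
    print("search window size:", err_max - err_min + 1)

The query-time cost is just one model evaluation plus a binary search over a window whose size is fixed at build time.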
We thought, OK, let's try this out straightaway. Train a model, see how fast it is. We looked at 200M server log records, with a timestamp key, and a 2-layer, 32-wide NN, relatively small by ML standards. We train to predict position with squared error. A B-tree executes in 300ns. Unfortunately, with the model, it takes 80000ns. By ML model standards, this speed is great; if you're executing on a server, great. But this doesn't work for a database.
There are a bunch of problems baked into this. TF is really designed for large models; think about translation or super-resolution images; these are hefty tasks. We need to make this fast at database speeds. Second, B-trees are great at overfitting: there's no risk of over-fitting in this context. They're also cache efficient, which is not something ML usually looks at. The last thing is the local search at the end: is that really the most effective way of ultimately finding the key? I'm skipping that part because it's fairly detailed; I'll focus on the first three.
The first part is just the raw speed of execution of the ML model. This was built mostly by Tim: the Learning Index Framework (LIF). What it does is let you create different indexes under different configurations. For one thing, it does code compilation for TF, using ideas from Tupleware, so you can take a linear model and execute it extremely quickly. We can also train simple models directly, or use TF for more complex gradient descent based learning, extract the weights, and have the inference graph be codegenned. And we can do a lot of autotuning to find the best model architecture. Since we know the data ahead of time, we can make pretty smart decisions about what works best.
The next problem is accuracy and speed. If I have 100M records, I narrow down quickly from 1.5M to 24K and so on with each step down the tree. Each one of those steps is 50-60 cycles to look through the page and find the right branch. So we have to get to an accuracy of about 12000 within 500 multiply/adds to beat these levels of the hierarchy, which are in cache. This is a steep task. The question is, what is the right model? A really wide network? A single hidden layer? That scales nicely; we can fit a 256-wide layer reasonably. We could go deeper... but then we have width^2 work, which needs to be parallelized somehow. The challenge is how we effectively scale this: we want to add capacity to the model, making it more and more accurate with increased size, without becoming too slow to execute.
We took a different approach, based on a mixture of experts. We have a key and a really simple classifier; we get an estimate. Then we use that estimate to pick the model at the next stage, narrowing down the CDF range and trying to be more accurate in that subset of the space. Each stage still gets the key as input: given a key, produce a position, but over a narrower space of keys. We build this downward and walk down the hierarchy. This decouples model size and complexity: we can have a huge, happily overfitting model, but we don't have to execute all of the sparsity you would have from a pure ML view. We can decouple it usefully. The nice thing is we can fall back to B-trees for subsets that are difficult to learn with a model; the LIF framework lets us substitute that in easily. In the worst case, a B-tree; in the best case, something more efficient.
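A sketch of that staged structure, again my own toy version with a simple linear model per stage; the number of experts is an arbitrary assumption:

    import numpy as np

    np.random.seed(0)
    keys = np.sort(np.random.lognormal(size=200_000))
    pos = np.arange(len(keys))
    N_EXPERTS = 64

    # Stage 1: one coarse linear model, used only to route a key to an expert.
    a1, b1 = np.polyfit(keys, pos, 1)
    route = np.clip(((a1 * keys + b1) / len(keys) * N_EXPERTS).astype(int),
                    0, N_EXPERTS - 1)

    # Stage 2: one tiny linear model per slice of the key space.
    experts = []
    for e in range(N_EXPERTS):
        mask = route == e
        if mask.sum() >= 2:
            a2, b2 = np.polyfit(keys[mask], pos[mask], 1)
        elif mask.sum() == 1:
            a2, b2 = 0.0, float(pos[mask][0])
        else:
            a2, b2 = 0.0, 0.0
        experts.append((a2, b2))

    def predict(key):
        e = int(np.clip((a1 * key + b1) / len(keys) * N_EXPERTS, 0, N_EXPERTS - 1))
        a2, b2 = experts[e]
        return int(np.clip(a2 * key + b2, 0, len(keys) - 1))

    sample = slice(None, None, 1000)
    errors = np.abs(np.array([predict(k) for k in keys[sample]]) - pos[sample])
    print("max abs error on sampled keys:", errors.max())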
The quick version of the results: we have four different datasets. Most are integer datasets; the last one is a string dataset. We're trying to save memory and speed. We save memory hugely, because these are really simple models: linear with a simple layer, with possibly two stages. We're able to get a significant speedup in these cases. The server logs one is interesting: it looks very linear at a high level, but there are actually daily patterns in how this data is accessed. Maps is more linear; it's longitudes of locations. We created synthetic data that's log-normal, and here we see we can model it effectively. Strings are an interesting challenge going forward: the data is larger and more complicated, building models that are efficient over a really long string is different, and the overall patterns are harder to have intuition about. One thing really worth noting here: this is not using GPUs or TPUs; it's a purely CPU comparison, apples-to-apples.
This talk mostly went into the B-tree part, which is a regression model looking at the CDF of the data. We can use these exact same models for hash maps, and with Bloom filters you can use binary classifiers. I have a bunch of results in the poster in the back.
A few minutes to talk about room for improvement. There are a bunch of directions we're excited to explore. The obvious one is GPUs/TPUs. It's CPUs here because that's where B-trees are most effective, but scaling is what ML is all about; improving throughput and latency for these models with GPUs is exciting going forward. Also the models themselves: there's no reason to believe a hierarchy of models is the right or best choice; it's interesting to build model structures that match your hardware, that are memory efficient and fit the underlying architecture of GPUs, at the nanosecond scale we need for databases. Multidimensional indexes: ML excels at high numbers of dimensions, and most workloads are not looking at a single integer feature, so there's an interesting question about how you map this to multidimensional indexes, which are difficult to scale. Also, if you have a CDF, you can approximately sort with it right there. And inserts and updates: we assumed read-only databases, which covers a large class of systems, but often we get more data. How do we balance overfitting with accuracy; can we add some extra auxiliary data structures to balance this out?
Q: One thing is that... this problem, we solved pretty well without ML. When we introduce ML, we should introduce new metrics. We shouldn't make our system more fragile because the distribution changes. What would be the worst case when the distribution changes?
A: As the data becomes updated... in the case of inserts and updates, there's a question about generalization. I think you could look at it from the ML point of view: statistically, test the model trained today on tomorrow's inserts. (It's a method. If I use this method, and then train it with data that I don't yet have... and so on.) That's the typical extrapolation-to-the-future generalization question in ML. Guarantees are hard; there will be a worst case that is awful... but that's the ML side, generalization. There's also the other point of view: we couple the modeling with classic data structures, the local search and the Bloom filter case, so you don't lose that machinery. You catch the worst case.
Let me add to that. If you assume that the inserts follow the same distribution as the trained model, then inserts become an O(1) operation; they're even better. Suppose they don't follow the same distribution? You can still do delta indexing, and most systems do delta indexing anyway. So inserts are not a big problem.
Q: (Robert) Most of the inputs were one or two real numbers, and the output is a single real number. How does it work if you use a low degree polynomial, or a piecewise linear classifier on the different digits?
A: In the case of strings, it's not a single input. (Treat it as an integer?) Well, it's possibly a thousand characters long; that's not the best representation. Different representations work really well. The last thing I want to say: piecewise linear could work, but when you have 10k or 100k submodels, it's slow to run; the hierarchy helps. Polynomials are interesting; it depends on the data source.
Q: Can you comment how bad your worst case is? Average numbers?
A: We specifically always have a spillover. The worst case is defaulting to a typical database. We haven't had a case where you do worse, because we'll default to a B-tree. (Deterministic execution?) Not at inference time.
by Edward Z. Yang at December 08, 2017 06:11 PM
The below is a transcript of a talk by Ion Stoica on Ray, at the ML Systems Workshop at NIPS'17.
We've been working on it at Berkeley for more than a year. Over the past years, there's been tremendous progress in AI: ad targeting, image and speech recognition, and many more. Many applications are based on supervised learning with DNNs; supervised plus unsupervised are the two dominant approaches.
However, the next generation of AI applications will be very different. They're deployed in mission critical scenarios, need to continually learn from a rapidly changing env. Robotics, self driving cars, unmanned drones, dialogue systems. Implementing this new generation of AI applications requires a broader range of techniques. Stochastic optimization, parallel simulations, many more.
Ray provides a unified platform for implementing these approaches. To motivate Ray, I'll use reinforcement learning. RL learns by interacting with the environment: a policy mapping from state/observation to action that maximizes a certain reward. What are the requirements of RL? Many applications exhibit nested parallelism: a search, which uses data parallel SGD, which then calls a component that does policy evaluation with a model to simulate, which runs in parallel on multiple CPUs. Second, these workloads can be highly heterogeneous in hardware and time. Many of these computations require not only CPUs, but GPUs, TPUs and FPGAs. And the computations can take wildly different times: simulate a chess game, and it may take 3 moves to lose, or 50 moves to win or draw. Finally, in robotics, we need to process in real time, processing the data from sensors in parallel, in tens of ms.
Meeting these requirements is not easy. You need a system that is flexible and performant. Flexible: it should create and schedule tasks dynamically, and support arbitrary dependencies. Performant: it should scale to hundreds of nodes, with sub-millisecond latency, millions of tasks per second, and efficient sharing of numeric data.
Next, I'm going to say how we address these challenges. Flexibility? We provide a very flexible model: dynamic task graphs. On top of this, we provide two abstractions: parallel tasks and actors.
To talk about parallel tasks, here is some Python code: one function reads an array from a file, and the other adds two arrays. The code is simple: it creates two arrays a and b from file1 and file2, and sums them up. Parallelizing this program is quite easy. To parallelize a function, we add a ray.remote decorator to it. When we invoke these functions, we call their remote method. Remote doesn't return the object itself, just an object id; this is very similar to the futures abstraction. To get the actual object, you invoke ray.get on the object id.
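The example being described looks roughly like this (the file names are placeholders; ray.remote, .remote() and ray.get are the actual Ray API):

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    def read_array(path):
        return np.load(path)            # stand-in for reading a 1-D array from a file

    @ray.remote
    def add(a, b):
        return a + b

    id1 = read_array.remote("file1.npy")   # returns an object id immediately
    id2 = read_array.remote("file2.npy")
    id3 = add.remote(id1, id2)             # dependencies expressed by passing ids
    result = ray.get(id3)                  # blocks until the whole graph has executed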
To get a better idea of how Ray executes, let's walk through this simple program, assuming the files are stored on different nodes. When we call read_array on file1, Ray schedules read_array on the appropriate node. The remote call returns immediately, before the actual read finishes. This allows the driver to run the second task in parallel, on the node holding file2, and to launch the add remote function. At this point all functions have been scheduled remotely, but none of them may have finished. To actually get the result, you call ray.get on the result id; this is a blocking call that waits for the entire computation graph to be executed.
Tasks are very general, but they are not enough. Consider that you want to run a simulator, and this simulator is closed source, so you do not have access to its state. You have state, actions and simulations, and you need to set up the state in the simulator between calls, which you cannot do with stateless tasks. There is another use case, where the state is too expensive to create: for example, DNNs on GPUs, where you want to initialize once and reuse the state for each simulation.
To address these use cases, we add the actor abstraction. An actor is just a remote class. If you have a Counter class, you mark it ray.remote, and when you create the class or invoke its methods, you use the remote keyword. Here is the computation graph for this very simple example. Notice the method invocations also return object identifiers; to get the results, you call ray.get on those identifiers. Ray also allows you to specify resource requirements, for actors and tasks.
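The Counter actor example looks roughly like this (assuming ray.init() was already called, as in the task example above):

    import ray

    @ray.remote              # resources can also be requested, e.g. @ray.remote(num_gpus=1)
    class Counter:
        def __init__(self):
            self.value = 0

        def increment(self):
            self.value += 1
            return self.value

    counter = Counter.remote()                             # creates the actor
    ids = [counter.increment.remote() for _ in range(3)]   # methods return object ids
    print(ray.get(ids))                                    # [1, 2, 3]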
To put things together in a more realistic example: evolution strategies, a scalable form of RL, by Salimans et al. at OpenAI. In a nutshell, evolution strategies try lots of policies and see which runs best. This is highly parallel. So here is pseudocode for parallel evolution strategies: a worker does a simulation and returns the reward; we create twenty workers, then for 200 iterations do 200 simulations and update the policy. Again, to parallelize this code we add a bunch of remote annotations; on the right hand side I'm also showing the computation graph. When you invoke Worker.remote, you create 20 remote workers to do the work in parallel, and you invoke their methods with the remote keyword. Notice that the results are not the rewards themselves, but ids pointing to the reward objects. To get the rewards and update the policy, you call ray.get.
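A sketch of that pattern; the Worker body and the policy update below are toy placeholders of mine, not OpenAI's or the speaker's code:

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    class Worker:
        def __init__(self, seed):
            self.rng = np.random.default_rng(seed)

        def do_simulation(self, policy):
            noise = self.rng.normal(size=policy.shape)
            reward = -np.sum((policy + 0.1 * noise - 1.0) ** 2)   # toy "environment"
            return reward, noise

    def update_policy(policy, results, lr=0.01):
        # move the policy toward perturbations that earned higher reward
        rewards = np.array([r for r, _ in results])
        noises = np.stack([n for _, n in results])
        weights = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        return policy + lr * (weights[:, None] * noises).mean(axis=0)

    policy = np.zeros(10)
    workers = [Worker.remote(seed=i) for i in range(20)]
    for _ in range(200):
        ids = [w.do_simulation.remote(policy) for w in workers]   # object ids, not rewards
        results = ray.get(ids)                                    # block for this round
        policy = update_policy(policy, results)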
This hopefully gives you a flavor of how to program in Ray. Next, I'll switch gears and present the system design of Ray: how Ray gets high performance and scalability.
Like many classic computing frameworks, Ray has a driver and a bunch of workers; the driver runs a program, and workers run tasks remotely. You can also create a bunch of actors. The driver, workers and actors on the same node share data through shared memory, while workers and actors across nodes share data through a distributed object store we built. Each node has a local scheduler, so when a driver or worker wants to run a task, the local scheduler tries to schedule it locally. If it cannot, it invokes a global scheduler, which schedules the task on another node that has resources. (Actor remote method calls are scheduled the same way.) Finally, one essential part of the design is the Global Control State (GCS). It takes all of the state of the system, the metadata for objects, tasks and functions, and centralizes it. This allows the other components to be stateless: they can fail, you bring them back up, and they fetch the most recent data from the GCS. It also allows us to parallelize the global scheduler, because the replicas share the same state in the GCS.
Another nice effect of having a GCS is that it makes it easy to build a bunch of profiling and debugging tools.
This design is highly scalable. Let me try to convince you why. To make the GCS scalable, we just shard it; the keys are pseudorandom, so it's easy to shard and load balance. The scheduler, as you saw, is distributed: each node has a local scheduler, and Ray tries to schedule tasks spawned by a worker or driver locally. The global scheduler could become a bottleneck, but we can replicate it. Finally, in some systems, even if the scheduler is super scalable, there's another bottleneck: in Spark, only the driver can launch new tasks. In Ray we get around that by allowing workers and actors to launch tasks, so there is no single bottleneck point.
A few words about implementation. The GCS is implemented with Redis. For object store, we leverage Apache Arrow. For fault tolerance, we use lineage based fault tolerance like Spark. Actors are part of task graph; methods are treated as tasks, so we have a uniform model for providing fault tolerance.
Now some evaluation results. This plot shows the number of tasks per second against the number of nodes; it scales linearly, and you can schedule over 1.8M tasks per second. The latency of local task execution is 300us; the latency of a remote task is 1ms. The next plot illustrates fault tolerance. You may ask why you care about fault tolerance? The problem is that otherwise your program has to account for simulations that may not finish, which makes the program far more complicated, even if you're willing to ignore some results. Here, on the X axis you have time in seconds, and there are two y axes: the number of nodes in the system, and the throughput. As you can see, the number of nodes starts at 50, drops to 25, then to 10, and goes back to 50. The red area shows the number of tasks per second; it follows, as you might expect, the number of nodes in the system. If you look closely, there are some dips: every time nodes go away, you lose the objects on those nodes, so you have to reconstruct them, and Ray, like Spark, reconstructs them transparently. In blue, you can see the re-executed tasks; if you add them in, you get a nicely filled curve.
Finally, for evolution strategies, we compared with the reference ES from OpenAI, following their setup. On the X axis you have the number of CPUs, and on the Y axis the mean time to solve a particular problem (a "learning to run" simulator). There are three things to notice. One is, as expected, as you add more CPUs, the time to solve goes down. The second is that Ray is actually better than the reference ES, even though the reference implementation is specialized for this task. Third, for a very large number of CPUs, the reference couldn't keep up, but Ray could do better and better. I should add that the Ray version takes half the amount of code, and was implemented in a couple of hours.
Related work: in this area there are a huge number of systems; that's why you are here. Ray is complementary to TF, MXNet, PyTorch, etc.; we use these systems to implement DNNs, and we integrate with TF and PyTorch. There are more general systems, like MPI and Spark; these have limited support for nested parallelism and dynamic computation models, and they have much coarser grained tasks.
To conclude, Ray is a system aiming for high performance, flexibility and scalability. We have two libraries on top of Ray: RLlib and Ray Tune. It's open source, please try it, we'd love your feedback. Thanks to Robert, Philip, Alex, Stephanie, Richard, Eric, Heng, William, and many thanks to my colleague Michael Jordan.
Q: In your system, you also use actors, and actors are built on shared memory. Do you have a separate mailbox for actors? How do you do that?
A: No, the actors communicate by passing arguments through the shared object store.
Q: What is the granularity of parallelism? Is a task atomic, or do you split tasks?
A: The task granularity is determined by the overhead of launching and scheduling a task. We are targeting tasks in the low milliseconds and up. A task should not be implementing something like a single activation function; we leave that job to much better frameworks. A task executes atomically, and methods on an actor are serialized.
Q: A question about fault tolerance: in Spark, when you don't have a response for some time, it decides the node died. Here, a task can take much longer, because it's a NN or something like that, so you don't have the same timing.
A: We do not do speculation; there is no implicit speculation in Ray, for the reason you mentioned.
Q: Can you give me more details on the reference implementation that doesn't scale?
A: The reference implementation is the OpenAI one; Robert here can give you a much more detailed answer to that question.
by Edward Z. Yang at December 08, 2017 06:07 PM
In my last post, I argued that property-based testing and fuzzing are essentially the same practice, or at least share a lot of commonality. In this followup post, I want to explore that idea a bit more: I’ll first detour into some of my frustrations and hesitations around typical property-based testing tools, and then propose a hypothetical UX to resolve these concerns, which takes heavy inspiration from modern fuzzing tools, specifically AFL and Google’s OSS-Fuzz.
“Property-based testing” refers to the idea of writing statements that should be true of your code (“properties”), and then using automated tooling to generate test inputs (typically, randomly-generated inputs of an appropriate type), and observe whether the properties hold for that input. If an input violates a property, you’ve demonstrated a bug, as well as a convenient example that demonstrates it. A classic example of property-based testing is testing a sort function:
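A minimal version of such a test, written here with Python's Hypothesis library as an illustration (not the post's original example), might look like this:

    from collections import Counter
    from hypothesis import given, strategies as st

    def my_sort(xs):
        return sorted(xs)   # stand-in for the function under test

    # Hypothesis generates many random input lists and checks the properties for each.
    @given(st.lists(st.integers()))
    def test_sort_is_ordered_and_is_a_permutation(xs):
        out = my_sort(xs)
        assert all(a <= b for a, b in zip(out, out[1:]))   # output is ordered
        assert Counter(out) == Counter(xs)                 # same multiset of elements

    if __name__ == "__main__":
        test_sort_is_ordered_and_is_a_permutation()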
This is a guest post by Kaixi Ruan.
Backpack is a module system for Haskell, released recently in GHC 8.2.1. As this is a new feature, I wanted to know how people use it. So I searched Twitter every day, and the other day I saw this tweet:
Are there other examples than String/Bytestring/Text? So far I haven’t seen any; it seems like backpack is just for glorified string holes.
There were a number of good responses, but I want to give another use case from deep learning.
In deep learning, people are interested in doing computations on tensors. Tensors can have different value types: int, float, double, etc. Additionally, tensor computations can be done on the CPU or GPU. Although there are many different types of tensor, the computations for each type of tensor are the same, i.e., they share the same interface. Since Backpack lets you program against one interface which can have multiple implementations, it is the perfect tool for implementing a tensor library.
Torch is a widely used library, implemented in C, for deep learning. Adam Paszke has a nice article about Torch. We can write some Haskell bindings for Torch, and then use Backpack to switch between implementations of float and int tensors. Here is a program that uses tensors via a Backpack signature:
unit torch-indef where
  signature Tensor where
    import Data.Int
    data Tensor
    data AccReal
    instance Show AccReal
    instance Num AccReal
    read1dFile :: FilePath -> Int64 -> IO Tensor
    dot :: Tensor -> Tensor -> IO AccReal
    sumall :: Tensor -> IO AccReal
  module App where
    import Tensor
    app = do
      x <- read1dFile "x" 10
      y <- read1dFile "y" 10
      d <- dot x y
      s <- sumall x
      print (d + s)
      return ()
We have a simple main function which reads two 1D tensors from files, takes the dot product of the two, sums all entries of the first tensor, and then finally prints the sum of these two values. (This program is transcribed from Adam’s article; the difference is that Adam’s program uses a float tensor, while we keep the Tensor type abstract so that with Backpack we can do both float and int.) The program uses functions like dot, which are defined in the signature.
Here is an implementation of dot and types for float tensors. The C functions are called using Haskell’s FFI:
import Foreign
import Foreign.C.Types
import Foreign.C.String
import Foreign.ForeignPtr

foreign import ccall "THTensorMath.h THFloatTensor_dot"
  c_THFloatTensor_dot :: (Ptr CTHFloatTensor) -> (Ptr CTHFloatTensor) -> IO CDouble

type Tensor = FloatTensor
type AccReal = Double

dot :: Tensor -> Tensor -> IO AccReal
dot (FT f) (FT g) =
  withForeignPtr f $ \x ->
    withForeignPtr g $ \y -> do
      d <- c_THFloatTensor_dot x y
      return (realToFrac d)
As you can see, Backpack can be used to structure a deep learning library which has multiple implementations of operations for different types. If you wrote bindings for all of the functions in Torch, you would have a deep learning library for Haskell; with Backpack, you could easily write models that were agnostic to the types of tensors they operate on and the processing unit (CPU or GPU) they run on.
You can find the full sample code on GitHub.
by Edward Z. Yang at August 18, 2017 02:05 AM
tl;dr If you use a Foldable function like length or null, where instance selection is solely determined by the input argument, you should make your code more robust by introducing an explicit type application specifying which instance you want. This isn't necessary for a function like fold, where the return type can cross-check if you've gotten it right or not. If you don't provide this type application, GHC should give a warning suggesting you annotate it explicitly, in much the same way it suggests adding explicit type signatures to top-level functions.
Recently, there has been some dust kicked up about Foldable instances causing "bad" code to compile. The prototypical example is this: you've written length (f x), where f is a function that returns a list [Int]. At some future point in time, a colleague refactors f to return (Warnings, [Int]). After the refactoring, will length (f x) continue to type check? Yes: length (f x) will always return 1, no matter how long the inner list is, because it is using the Foldable instance for (,) Warnings.
The solution proposed in the mailing list was to remove Foldable for Either, a cure which is, quite arguably, worse than the disease. But I think there is definitely merit to the complaint that the Foldable instances for tuples and Either enable you to write code that typechecks, but is totally wrong.
Richard Eisenberg described this problem as the tension between the goals of "if it compiles, it works!" (Haskell must exclude programs which don't work) and general, polymorphic code, which should be applicable in as many situations as possible. I think there is some more nuance here, however. Why is it that Functor polymorphic code never causes problems for being "too general", but Foldable does? We can construct an analogous situation: I've written fmap (+2) (f x), where f once again returns [Int]. When my colleague refactors f to return (Warnings, [Int]), fmap now makes use of the Functor instance for (,) Warnings, but the code fails to compile anyway, because the type of (+2) doesn't line up with [Int]. Yes, we can still construct situations with fmap where code continues to work after a type change, but these cases are far more rare.
There is a clear difference between these two programs: the fmap program is redundant, in the sense that the type is constrained by the input container, the function mapping over it, and the context which uses the result. Just as with error-correcting codes, redundancy allows us to detect when an error has occurred; when you reduce redundancy, errors become harder to detect. With length, the only constraint on the selected instance is the input argument; if you get it wrong, we have no way to tell.
Thus, the right thing to do is reintroduce redundancy where it is needed. Functions like fold and toList don't need extra redundancy, because they are cross-checked by the use of their return arguments. But functions like length and null (and arguably maximum, which only weakly constrains its argument to have an Ord instance) don't have any redundancy: we should introduce redundancy in these places!
Fortunately, GHC 8.0 provides a very easy way of introducing this redundancy: an explicit type application. (This was also independently suggested by Faucelme.) In this regime, rather than write length (f x), you write length @[] (f x), saying that you want length for lists. If you want length for maps, you write length @(Map _) (f x). Now, if someone changes the type of f, you will get a type error since the explicit type application no longer matches.
Now, you can write this with your FTP code today. So there is just one more small change I propose we add to GHC: let users specify the type parameter of a function as "suggested to be explicit". At the call-site, if this function is used without giving a type application, GHC will emit a warning (which can be disabled with the usual mechanism) saying, "Hey, I'm using the function at this type, maybe you should add a type application." If you really want to suppress the warning, you could just type apply a type hole, e.g., length @_ (f x). As a minor refinement, you could also specify a "default" type argument, so that if we infer this argument, no warning gets emitted (this would let you use the list functions on lists without needing to explicitly specify type arguments).
That's it! No BC-breaking flag days, no poisoning functions, no getting rid of FTP, no dropping instances: just a new pragma, and an opt-in warning that will help people who want to avoid these bugs avoid them. It won't solve all Foldable bugs, but it should squash the most flagrant ones.
What do people think?
by Edward Z. Yang at March 21, 2017 11:50 PM
If you're setting up a service where people can register their own usernames to be used as a hostname (username.example.com), email address (username@example.com), or URL path (example.com/username) within your domain, there are some common names you should avoid letting the general public register.
Many Internet protocols make the assumption that a domain is manually managed by its owners, and in particular assume that a name like admin must have been registered or approved by the actual owners. Automatic registration breaks this assumption, and has been the source of some attacks. Microsoft Live has fallen victim to this multiple times: in 2008, a researcher signed up for sslcertificates@live.com and used it to get a login.live.com certificate, and as late as this March, the same problem happened to live.fi, the Finnish version of the service, when an IT professional tried registering the email account hostmaster@live.fi as his personal Live account, and then found he could receive a certificate for that domain.
This is a list of all the names I know that should be restricted from registration in automated systems. If you know of others, please let me know and I'll update this page.
tl;dr: Regardless of how you're currently using usernames, restrict them to lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen (that is, /^[a-z]([a-z0-9-]*[a-z0-9])?$/ as an extended regex). Ban all the names in this file (last updated 2015-11-21). Get yourself listed as a public suffix: see below for directions and implications.
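As a concrete sketch of that policy in Python (the blocklist here is only a tiny excerpt, and the file name is a placeholder for the full list linked above):

    import re

    USERNAME_RE = re.compile(r'^[a-z]([a-z0-9-]*[a-z0-9])?$')

    def load_blocklist(path="reserved-usernames.txt"):
        # placeholder path: one reserved name per line
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}

    def is_acceptable(username, blocklist):
        return bool(USERNAME_RE.match(username)) and username not in blocklist

    blocklist = {"admin", "hostmaster", "webmaster", "postmaster", "www", "wpad"}  # excerpt
    for name in ("geofft", "Admin", "wpad", "-foo", "a.b", "valid-name"):
        print(name, is_acceptable(name.lower(), blocklist))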
Most of these problems involve a computer on the domain doing an unqualified lookup: when a computer named a.example.com looks for b, it will usually find b.example.com. If you're running a simple hosting service, or similar, you may not need to block all of these, but these names are extremely unlikely to be used by legitimate users anyway. So you may as well block all of them to allow expanding in the future.
- localhost, localdomain, and broadcasthost: these are usually present in /etc/hosts, and applications or scripts might hard-code an assumption about them having their usual value (especially for localhost).
- www: Browsers will often prepend this if the domain itself does not resolve as a hostname.
- wpad: Web Proxy Auto-Discovery in several browsers; someone who owns this (unqualified) name can act as a proxy for all web traffic.
- isatap: IPv6 tunnel autodiscovery, primarily on Windows. Similarly to WPAD, someone who owns this (unqualified) name can act as a proxy for all IPv6-capable traffic. Windows Server has a built-in blacklist of domain names that defaults to WPAD and ISATAP.
- autoconfig: Thunderbird's spec for autoconfiguration. Thunderbird will query the website at autoconfig.example.com for settings when attempting to set up example.com email. Good way to harvest passwords.
- imap, pop, pop3, smtp, mail: for email clients that make guesses about what your email servers are. (This includes Thunderbird but also many others.)

Note that valid hostnames are restricted in syntax: they must only contain letters, digits, or hyphens, and cannot start or end with a hyphen. DNS is case-insensitive, so make sure there are no case collisions. An older standard prevents hostnames from starting with a digit, which is a straightforward way to prevent all-numeric usernames (which can cause problems with tools that accept either names or UIDs). Dots separate portions of a domain name and cause various problems (wildcard certificates only apply to one level, a.b.example.com can read and write cookies for b.example.com, etc.), so they're usually more trouble than they're worth. DNS records are much more liberal, but names that don't follow these rules will generally not resolve as hostnames: you can look them up with dig/host/etc., but you can't use them in applications. Checking hostname syntax also prevents you from worrying about names like _tcp or _udp, which are used in SRV records.
Most parts of the web platform consider two pages with different origins, that is, scheme (http / https), hostname, and port number, to be unrelated websites that cannot interact with each other by default. However, there are a few exceptions, most notably cookies. Web pages at www.example.com and login.example.com are allowed to set cookies with a scope of example.com, despite not sharing the same hostname / origin. The simple rule of allowing parent domains created the problem of supercookies: example.com could set a cookie scoped to .com, which would then be sent to all sites ending in .com. There are two big problems with this: the first is privacy (being tracked across websites), and the second is session-fixation attacks, where an attacker can overwrite your session cookie with their own, and have your actions (including logging in or sending private data) happen within the attacker's session.
The immediate fix was to ban top-level domains, but this still allowed setting cookies for publicly-registrable suffixes like .co.uk that weren't at the top level. So browser vendors created the public suffix list to track which suffixes are open for public registration. The public suffix list now includes not only "ICANN" entries, such as .com and .co.uk, but also "private" entries, such as .herokuapp.com and .github.io, since the same problems exist with allowing users to set cookies for all Heroku or GitHub Pages users.
So, if you are letting users register hostnames in your domain, you should get it listed as a public suffix, which requires just sending a pull request or an email. It takes some time for the update to reach browsers (the list is compiled into browsers, so it's only updated by a browser version update), so you should try to do this as far in advance as possible before launching.
Note that by making example.com a public suffix, nobody, not even code on example.com itself, can set a cookie for example.com. If you have a website of your own that needs cookies (analytics, registration, etc.), you'll need to run it at e.g. www.example.com, and make example.com just a redirect. Alternatively, you can use a completely separate domain for your own site vs. your users' sites, as with the Heroku and GitHub examples: their own websites are heroku.com and github.com.
The CA/Browser Forum Baseline Requirements, section 3.2.2.4 item 4, requires that if a CA is going to validate a domain by coming up with an administrative email address on its own, it may only use admin, administrator, webmaster, hostmaster, or postmaster. Reserve all of those names, regardless of whether they go somewhere useful.
All CAs are supposed to be compliant with that these days, but for safety's sake, also reserve root, info, ssladmin, ssladministrator, sslwebmaster, sysadmin, is, it, and mis (see this 2009 comment on Mozilla's bug tracker).
RFC 2142 defines the names info, marketing, sales, support, abuse, noc, security, postmaster, hostmaster, usenet, news, webmaster, www, uucp, and ftp. You won't need most of these to actually reach a useful mailbox, though you should reserve all of them.
You may want to reserve mailer-daemon, nobody (a default UNIX user account), noreply, no-reply, etc. for automated processes that send email.
Again, as these names are unlikely to be used by legitimate users, it's usually worth blocking them now and keeping your options open, even if you're not currently offering email service. You may add an email service in the future (Amazon launched Send to Kindle by email over a decade after introducing user accounts). As always, you can manually register these names to trusted or internal users.
For many websites with user-provided content, like Twitter, Facebook, or GitHub, user-chosen usernames become part of the URL at top level (https://twitter.com/geofft, https://github.com/geofft). If you're building a website like this, the easiest approach is to restrict these usernames as if they were hostnames. This has two advantages: the first is that it's easy to launch a hostname-based system later (e.g. GitHub Pages now supports geofft.github.io) if you know that all your usernames are valid hostnames.
The second is that there are several URL paths you need to reserve at top level, and all of them happen to contain dots and are therefore invalid hostnames. If you do permit dots, you need to block the following names:
- robots.txt, for the Robots Exclusion Protocol, used to tell well-behaved crawlers how to well-behave.
- favicon.ico, for the shortcut icon displayed in the tab bar and other places.
- crossdomain.xml, which allows the Flash plugin to make cross-origin requests. Java and Silverlight also look for and trust crossdomain.xml.
- clientaccesspolicy.xml, a Silverlight-specific version of crossdomain.xml.
- .well-known, specified in RFC 5785 as a place for these sorts of things so they don't keep cluttering the root level. Thunderbird autoconfiguration looks in here, as do ACME, the automatic certificate enrollment spec from Let's Encrypt; BrowserID / Mozilla Persona; and RFC 7711, a new standard for providing certificates for third-party non-HTTP services. So there are a number of security issues with an unauthorized user being able to create files under /.well-known/.
(These are URLs, not filenames. You should of course also disallow users from creating files named e.g. .htaccess if your web server respects those.)
All of these are invalid hostnames, so simply requiring usernames to be valid hostnames avoids having to check for these specific cases. If you're only allowing users to choose some portion of the URL, and inserting other text (e.g., example.com/user/geofft, example.edu/~geofft), then you don't have to worry about this, but again it may still be useful to keep your options open for other URL, hostname, or email schemes in the future.
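To make the "valid hostname" restriction concrete, here's a minimal sketch in C (the function name is mine) of the sort of check this implies: accept a username only if it would also be a valid single DNS label, which rejects dots, and therefore all of the names above, for free.

#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Sketch: letters, digits, and hyphens only; no leading or trailing
 * hyphen; at most 63 characters. Dots are rejected outright, so names
 * like robots.txt, favicon.ico, and .well-known can never be claimed. */
static bool username_is_valid_hostname_label(const char *name)
{
    size_t len = strlen(name);
    if (len == 0 || len > 63)
        return false;
    if (name[0] == '-' || name[len - 1] == '-')
        return false;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        if (!isalnum(c) && c != '-')
            return false;
    }
    return true;
}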
Do not allow users to publish custom HTML, especially not custom scripts, at these sorts of URLs. https://example.com/user1, https://example.com/user2, and https://example.com/login all share the same origin, so by the same-origin policy, these web pages can freely interact with each other and mess with each other's content. A few JavaScript interfaces, including service workers, make it very easy to attack another site on the same origin. If you want users to be able to publish custom HTML and JS, use separate hostnames within a public suffix. https://user1.example.com and https://user2.example.com are separate origins, and if you have made example.com a public suffix as mentioned earlier, you can safely let them publish custom scripts, since the sites are no more able to interact with each other than two separate .com websites could.
This post was inspired by a GitHub issue for Sandstorm's sandcats.io dynamic DNS service; thanks to Asheesh Laroia for pointing me at that thread and reviewing a draft of this article.
by Geoffrey Thomas at November 26, 2015 12:00 AM
I ran across something strange while learning about Rust's stack overflow and segmentation fault handling.
First, some backstory: in the past, Rust (and Go) used segmented stacks, also known as split stacks. This is a scheme that allows you to start each thread with a small amount of stack space, and dynamically allocate a new, discontiguous chunk of stack space when you run out. Each function starts with a call to __morestack
to see if it has enough stack space for that function's variables. If not, __morestack
is supposed to allocate more space, quietly switch out the stack pointer, and call the function. When the function returns, __morestack
frees the additional space and restores the original stack pointer. For a system like pre-1.0 Rust's tasks or Go's goroutines, allocating lots of tiny, growable stacks makes sense.
However, Rust's __morestack
function currently only serves to trigger stack-overflow handling: the only thing that it does, besides aligning the stack properly, is call the rust_stack_exhausted
function, which prints an error message and exits. Rust gives each thread plenty of stack space, so if it overflows, it probably means there's unbounded recursion and the thread should be terminated.
I thought this was a little odd. When you overflow the stack in C, or in any other language without __morestack, the next instruction tries to access unallocated memory. This causes a standard segmentation fault, which terminates the program. What's the need for Rust to catch this and terminate the program on its own?
Part of the answer is to provide a better error message. A straightforward stack overflow is not a memory safety violation in the same way an access to out-of-bounds memory is, so a segmentation fault is the wrong way to report a stack overflow. But the real answer turns out to be subtler than that. While accessing an invalid page of memory cannot cause data corruption, it's not guaranteed that the page is in fact invalid! With enough luck, you can overflow the stack far enough and reach into a completely unrelated memory region.
It turns out that this is possible in well-defined C, in a worrisomely straightforward way. The following program generates no warnings:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
void clobber(uintptr_t target_address) {
int dummy;
uintptr_t dummy_address = (uintptr_t)&dummy;
char array[dummy_address - target_address];
array[50] = 'x';
(void)array; // suppress warning about unused variable
}
int main(void)
{
int i;
int *x = malloc(20 * sizeof(int));
for (i = 0; i < 20; i++)
x[i] = 3;
clobber((uintptr_t)x);
for (i = 0; i < 20; i++)
printf("%d ", x[i]);
printf("\n");
return 0;
}
and produces this output on my computer (an x86-64 machine running Debian GNU/Linux 8):
geofft@titan:/tmp$ gcc -Wall -Wextra -Wpedantic --std=c99 -o lol-stacks lol-stacks.c
geofft@titan:/tmp$ ./lol-stacks
3 3 3 3 7864323 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
What's going on? We create a variable on the stack called dummy, to figure out where our stack is. Then we figure out how far we are from the target address and create a giant array of exactly that size. Now the end of the stack lines up, more or less, with the target address. The process of "creating an array" doesn't involve any actual changes to memory, just changes to the interpretation of memory (and usually updating the stack pointer register), so while there's a wide swath of invalid memory between &dummy and target, none of it is accessed. Therefore, because the only access is to a valid memory location, there's no segmentation fault generated, and the change to memory goes through. When the program writes to array, it finds valid, writable memory at that location—namely, one of the elements in the target array x. (Remember that on x86 machines, the stack grows downwards, towards lower-valued addresses, so array[0] is the end of the stack. On other architectures, if the stack grows upwards, we'd need to write to the other end of array.)
As you might expect with memory corruption, this is exploitable. In 2010, Rafal Wojtczuk, a researcher at Invisible Things Lab, discovered that this sort of confusion could be used in a privilege-escalation exploit against the X server (PDF). A client can cause the X server to exhaust most of its memory, request a shared-memory region that gets mapped right past the stack, and then trigger a recursive function in the X server. As the associated blog post puts it, it "exploits a bug in... well, actually it doesn't exploit any concrete bug, which makes it so much more interesting."
How can we defend against this? The X server vulnerability (CVE-2010-2240) relied on the stack running right up against mapped pages, so the straightforward solution was to ensure that an unmapped guard page is always just past the end of the stack. But this doesn't help for more complicated cases that span more than a single page—like my example above, or even a less-contrived function with a 5000-byte string buffer.
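For instance, here's a sketch (the function name is mine) of the sort of frame that can step right over a single 4096-byte guard page:

/* Illustrative only: buffer spans more than a page, and on x86 buffer[0]
 * is the lowest address in the frame, so the very first access can land
 * on the far side of a single guard page rather than on it. */
void skip_the_guard_page(void)
{
    volatile char buffer[5000];
    buffer[0] = 'x';
}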
It turns out that GCC has an option, -fstack-check, that inserts code to check for this problem. If I recompile my program with -fstack-check, I get a segmentation fault. It appears that gcc has inserted code to step the stack down one page at a time, running a logical-OR with zero at each point, which doesn't affect any value on the stack but forces a memory access. Here's the disassembly of that section of code, from GDB:
0x0000000000400621 <+139>: mov %rsp,%rcx
0x0000000000400624 <+142>: sub %rdx,%rcx
0x0000000000400627 <+145>: cmp %rcx,%rsp
0x000000000040062a <+148>: je 0x40063a <clobber+164>
0x000000000040062c <+150>: sub $0x1000,%rsp
=> 0x0000000000400633 <+157>: orq $0x0,(%rsp)
0x0000000000400638 <+162>: jmp 0x400627 <clobber+145>
The size of my variable-length array has already been computed in %rdx. GCC has inserted code to step through the stack one page ($0x1000 bytes) at a time, in a loop. At each step, it ORs the value at the stack pointer with zero, which doesn't change any data but forces a read and write of memory.
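Rendered as C, the inserted loop behaves roughly like this sketch (my own rendering, not the actual generated code; the size is assumed to already be rounded up to a whole number of pages, as the compiler arranges):

#include <stddef.h>

/* Walk the stack down one page at a time and touch each page before the
 * real allocation is used, so any guard page in between is hit rather
 * than skipped. The |= 0 mirrors "orq $0x0,(%rsp)": it changes nothing
 * but forces a read and a write of that page. */
static void probe_stack_pages(volatile char *stack_top, size_t size)
{
    volatile char *p = stack_top;
    volatile char *end = stack_top - size;
    while (p != end) {
        p -= 4096;   /* one page ($0x1000 bytes) at a time */
        *p |= 0;
    }
}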
Why isn't this the default, like other safety mechanisms such as -fstack-protector? I don't know for sure. Part of the answer could be performance, since each somewhat large stack allocation will result in this code being run—with overhead linear in the size of the allocation, instead of just a single change to the stack pointer—but I'm not sure if anyone expects large stack allocations to be performant. Another answer could be lack of interest, since GCC doesn't even enable it by default when compiling Ada code, even though the Ada language requires that stack overflows and segmentation faults be distinct errors.
This technique is common on Windows. Microsoft's rationale for stack checking is less about preventing unexpected changes to memory and more about making sure the stack is actually in place. Windows places a single guard page past the end of the stack, which is reserved in the memory map, but causes a fault when accessed. The kernel takes this as an indication that the process needs more stack space, and will allocate more physical pages, provided that resource limits allow it and there's enough free RAM. Then it moves the guard page past the new end of the stack. So, if your program has an object that is even slightly more than the page size (e.g., a char buffer[5000]), it's important that the guard page get touched, instead of skipped over. (Linux also uses a similar mechanism, but as far as I can tell, the entire possible stack region is treated as guard pages, so the problem of skipping over a single guard page doesn't arise.)
Rust is looking at moving away from the __morestack approach to this approach, under the name of stack probes (the term "probe" is also used in GCC documentation). Apparently the right name for this technique is a bit of a question. "Stack checking," as Microsoft and GCC call it, is a somewhat generic name, and I imagine many developers would interpret it as referring to the more typical direction of stack protection. Indeed, I was very surprised to see that the first C program above "worked", clobbering memory, even with -fstack-protector. It's common to see stack-smashing attacks, where an out-of-bounds pointer overwrites something valuable on the stack, and that's what -fstack-protector (StackGuard, which uses stack canaries) protects against. But it's not as common to see the inverse problem, where an out-of-bounds stack pointer overwrites something elsewhere in memory, which can be just as unsafe and just as much of a security risk.
Thanks to Nelson Elhage, Liz Denys, Alex Dehnert, and Asheesh Laroia for feedback on drafts of this post.
by Geoffrey Thomas at July 06, 2015 12:00 AM
UNIX (well, POSIX) signals are probably one of the worst parts of the UNIX API, and that’s a relatively high bar. A signal can be sent to your program at any point when it’s running, and you have to deal with it somehow. Traditionally, there are three ways of responding to signals (“dispositions”): you can let the default behavior run, which is often to kill the program; you can ignore it; or you can set up a function to be a signal handler.
If you register a signal handler, it’s called in the middle of whatever code you happen to be running. This sets up some very onerous restrictions on what a signal handler can do: it can’t assume that any locks are unlocked, any complex data structures are in a reliable state, etc. The restrictions are stronger than the restrictions on thread-safe code, since the signal handler interrupts and stops the original code from running. So, for instance, it can’t even wait on a lock, because the code that’s holding the lock is paused until the signal handler completes. This means that a lot of convenient functions, including the stdio functions, malloc, etc., are unusable from a signal handler, because they take locks internally. POSIX defines a set of “Async-Signal-Safe” functions that are guaranteed not to rely on locks or non-atomic data structures (because a signal handler could be called between two instructions that update a non-atomic data type). Linux’s signal(7) manual page has a list of these functions. But because of the constrained environment, one common approach is to defer the actual work until you’re no longer running in a signal handler. A standard solution is the “self-pipe trick”: open an unnamed pipe, write a byte to it in your signal handler and return, and read from it in the main loop. This makes signal handling resemble the handling of anything else that’s a file descriptor, like network sockets, which normalizes their handling a bit.
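In code, the trick is small. Here's a minimal sketch (names are mine, error handling omitted):

#include <errno.h>
#include <signal.h>
#include <unistd.h>

static int self_pipe[2];

/* The handler does only one async-signal-safe thing: write a byte to a
 * pipe that the main event loop is already poll()ing or select()ing on. */
static void on_signal(int signo)
{
    int saved_errno = errno;            /* write() may clobber errno */
    unsigned char b = (unsigned char)signo;
    (void)write(self_pipe[1], &b, 1);   /* write() is async-signal-safe */
    errno = saved_errno;
}

static void install_self_pipe_handler(int signo)
{
    pipe(self_pipe);                    /* ideally with O_NONBLOCK and O_CLOEXEC */
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = on_signal;
    sigaction(signo, &sa, NULL);
    /* The main loop then adds self_pipe[0] to its poll()/select() set and
     * read()s signal numbers from it like any other file descriptor. */
}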
There’s another problem with signal handlers: in addition to interrupting user code, they also need to interrupt kernel code so that it gets delivered. So, if you’re in the middle of reading from a network socket, waiting for a timeout, or even closing a file descriptor, that system call gets aborted so that the signal handler can run. By default, once you exit the signal handler, your system call will appear to return early, with an error exit of errno == EINTR (“Interrupted system call”). The intention is that you can restart the call yourself if you want, but in practice, very few libraries do, which causes confusing errors for application developers. (Extra credit: explain why the linked solution is wrong.) To avoid these sorts of problems, BSD added a mechanism to mark signal handlers as restartable, which was standardized as the SA_RESTART flag to the sigaction system call. But that does not reliably cause system calls to continue instead of returning EINTR: as documented later in signal(7), Linux will return EINTR in some cases even when a system call is interrupted by an SA_RESTART handler. (The rationale for many of these makes sense—e.g., timeouts where restarting the call would start a timer from zero—but understanding the problem doesn’t make it any less of a problem.)
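For reference, here's a sketch of what opting into SA_RESTART looks like (the helper name is mine):

#include <signal.h>
#include <string.h>

/* Install a handler with SA_RESTART so that (most) interrupted system
 * calls are transparently restarted instead of failing with EINTR. As
 * noted above, Linux still returns EINTR from some calls regardless. */
static void install_with_restart(int signo, void (*fn)(int))
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(signo, &sa, NULL);
}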
So, in 2007, Linux gained an API called signalfd, which lets you create a file descriptor that notifies on signals. The idea is that you can avoid the complexity of the self-pipe trick, as well as any problems with EINTR, by just asking the kernel to send you signals via a file descriptor in the first place. You don’t need to register a signal handler, so everything works perfectly… except that signalfd doesn’t actually change how signal dispositions work. If you only create a signalfd, signals still get delivered via the signal-handling mechanism, so in order to actually get the signal delivered over the file descriptor, you need to suspend normal processing of the signal. There’s another part of the signal API that lets you set a mask to block the signal: the intention is that you can remove the mask later, and then pending signals will get delivered. But this also means that pending signals are readable from the signalfd, and they’re removed from the queue once read.
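Here's a minimal sketch of the API (Linux-specific, error handling omitted); note the sigprocmask() call, which is the masking step that causes the trouble described next:

#include <signal.h>
#include <stdio.h>
#include <sys/signalfd.h>
#include <unistd.h>

int main(void)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    /* Without blocking the signal first, it would still be delivered the
     * ordinary way and never show up on the file descriptor. This is also
     * the mask that child processes will inherit. */
    sigprocmask(SIG_BLOCK, &mask, NULL);

    int fd = signalfd(-1, &mask, 0);
    struct signalfd_siginfo si;
    if (read(fd, &si, sizeof si) == sizeof si)
        printf("got signal %u from pid %u\n", si.ssi_signo, si.ssi_pid);
    return 0;
}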
This would work fine if signal masks applied only to your current process, but they also get inherited by children. So if you’re masking a signal like SIGINT in order to receive it via signalfd, and you run a subprocess, then that child process will start up with SIGINT masked. And while programs could reset their own signal masks when they start up, in practice, no software does. So you have to be very careful to reset any masked signals before starting a child process, and unfortunately you need to do this yourself: most ways of starting child processes, including the standard libc system(3) function, do not reset handlers or masks.
There’s another problem with masking signals, which is that standard UNIX signals are permitted to coalesce when they queue. Namely, if you have a mask that’s blocking a signal, and that signal gets delivered twice before you process the first one from the queue, it only gets reported to you once. For something like SIGINT, this might be okay: probably you don’t need to handle multiple Ctrl-C keystrokes differently from a single one. For something like SIGCHLD, which notifies you that a child process has terminated, this is quite a bit more unfortunate. It now means that you know that at least one child process has terminated. And while signals can have associated info (via the SA_SIGINFO flag to sigaction, or by default for signalfd), and SIGCHLD’s siginfo tells you which process has terminated, if it gets coalesced, you only get one set of info. That is, you know that the specified child process has terminated, but also one or more other unspecified child processes could have terminated and the information about which one has been lost. Not only does this apply to unmasking signals and letting them get delivered to a normal signal handler, it also applies to reading signals from a signalfd, since signalfd works by dequeuing signals.
It turns out that there is, at least in theory, a better way to handle this. Just go back to the old self-pipe trick, but install the handler with the SA_NODEFER flag, which allows signal handlers to interrupt themselves. This way you reliably get one handler called per signal. [EDIT: This is wrong, see below.] This brings back the EINTR problem, but there’s a simple solution to that: simply dedicate a separate thread to signal handling, and make sure all signals get routed there. Then no system calls on any other thread will get interrupted. The only way to ensure this is to mask signals on every other thread, but that’s no worse than the status quo with signalfd, since we already had to mask signals in order to get them delivered to signalfd alone. As a bonus, this is portable to other operating systems that don’t have signalfd, so this is the solution I’m planning on implementing for the MIO cross-platform event-handling library for Rust.
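A sketch of that arrangement (names are mine, error handling omitted):

#include <pthread.h>
#include <signal.h>
#include <unistd.h>

/* Block the interesting signals before spawning worker threads (the
 * signal mask is inherited by new threads), so workers' system calls are
 * never interrupted. Only the dedicated thread unblocks them, and the
 * handlers, e.g. the self-pipe handler above installed with SA_NODEFER,
 * run there. */
static void *signal_thread(void *arg)
{
    sigset_t *set = arg;
    pthread_sigmask(SIG_UNBLOCK, set, NULL);  /* only this thread receives them */
    for (;;)
        pause();                              /* handlers interrupt this call */
}

static void route_signals_to_one_thread(void)
{
    static sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGCHLD);
    pthread_sigmask(SIG_BLOCK, &set, NULL);   /* inherited by threads created later */

    pthread_t tid;
    pthread_create(&tid, NULL, signal_thread, &set);
}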
Can signalfd be fixed? I think it can, if we explicitly tie signalfd into the signal-disposition mechanism. If signalfd could claim responsibility for signal delivery, instead of requiring that signals be masked or ignored in addition to using signalfd, this would solve both problems. For inheritance, just set the close-on-exec flag on the signalfd, and signals will go back to the default behavior in child processes, once the fd no longer exists. For multiple delivery, because signalfd no longer interacts with signal queueing, it can just be specified to send one message per signal. Alternatively, a mechanism to mask or ignore a signal that applies to the current process only would also be an improvement. These proposals would be a change to POSIX signal semantics, but fundamentally so is signalfd, and the only way to get clean handling of signals is to avoid the POSIX API, not work within it.
EDIT: Greg Hudson pointed out to me that even traditional handlers can coalesce signals. If a signal is sent twice before the signal-handling thread is scheduled, it still gets coalesced, even if it's destined for a signal handler with SA_NODEFER. So while a signal-handling thread can get you separate siginfo most of the time, it still can’t do so reliably. I’ve written an example of this behavior that uses vfork to prevent the process from being scheduled: it starts multiple children, but only prints one notification. So my claim that there’s a better alternative to signalfd is wrong, but it’s still true that signalfd doesn’t solve any of the other troubles with signal handling: it just saves you a separate thread.
Thanks to Nelson Elhage for pointing me at the LWN article, and Alex Chernyakhovsky, David Benjamin, and Yehuda Katz for discussions about all the unfortunate parts of signal handling.
by Geoffrey Thomas at May 17, 2015 12:00 AM
The Name Service Switch (NSS) is the feature of many C libraries,
including the standard Linux one (GNU libc) and the standard Solaris
one, that allows name lookup routines to be implemented via plugins.
"Name lookup" here refers to things like usernames, host names, etc.
When you run ls -l, and the ls command maps numeric user IDs to
usernames, the names are provided by any module listed in the Name
Service Switch's configuration. This could be the files module that
looks at local files like /etc/passwd, the ldap module that talks to
a corporate LDAP server, etc.
One of my half-forgotten side projects involves writing a new NSS module.
NSS is configured via the file /etc/nsswitch.conf
, which is systemwide
configuration. If I want to test my NSS module, I could install it
globally and reconfigure my system. But I started wondering if there was
a way to load an NSS module just for a single process under my control.
Since the process is running in my user account, I should be able to
reconfigure it, without affecting the rest of the system.
Turns out that it's possible in a somewhat hackish way. There's an
internal glibc function called __nss_configure_lookup that overrides
NSS settings. If you're writing your own test program, you can just
call, e.g., __nss_configure_lookup("passwd", "files") to force all
user lookups to go through libnss_files. If you're using an existing
program, you can shoehorn this in by use of an LD_PRELOAD:
#include <nss.h>
#include <stdlib.h>
static void __attribute__((constructor))
nsstest_ctor(void)
{
const char *db = getenv("NSSTEST_DB"), *config = getenv("NSSTEST_CONFIG");
if (db && config)
__nss_configure_lookup(db, config);
}
Compile with gcc -fPIC -shared -o nsstest.so nsstest.c. Then you can
do things like this:
howe-and-ser-moving:/tmp geofft$ ls -ld ~
drwxr-xr-x 355 geofft root 34816 Apr 17 21:25 /afs/athena.mit.edu/user/g/e/geofft
howe-and-ser-moving:/tmp geofft$ LD_PRELOAD=./nsstest.so NSSTEST_DB=passwd NSSTEST_CONFIG=files ls -ld ~
drwxr-xr-x 355 40490 root 34816 Apr 17 21:25 /afs/athena.mit.edu/user/g/e/geofft
Since my account isn't in the files database on this machine (it's in
hesiod), my user ID can no longer be looked up if I restrict passwd
lookups to the files database. A more straightforward way of testing
is using the getent command that ships with glibc, which lets you ask
for a specific entry in a specific NSS database. For instance, both
files and hesiod have entries for the root user:
howe-and-ser-moving:/tmp geofft$ LD_PRELOAD=./nsstest.so NSSTEST_DB=passwd NSSTEST_CONFIG=files getent passwd root
root:x:0:0:root:/root:/bin/bash
howe-and-ser-moving:/tmp geofft$ LD_PRELOAD=./nsstest.so NSSTEST_DB=passwd NSSTEST_CONFIG=hesiod getent passwd root
root:*:0:101:Wizard A Root,,,:/mit/root:/bin/csh
If you're writing your own NSS library, you'll also need to set
LD_LIBRARY_PATH to point to the directory where it lives, since NSS
configuration just takes names, not full paths.
by Geoffrey Thomas at April 18, 2015 12:00 AM
I just got back from PyCon, where a few debian-python team members met to discuss some goals for Python packaging in Debian over the next release cycle. One item of interest is moving away from installing Python 2 by default (expecting it to be desupported in 2020), which raises questions about what /usr/bin/python
should mean. At the moment, it very strongly means Python 2 specifically, so there's a question about what it should be on a system with only Python 3 installed—should it become Python 3? Should it be nonexistent?
PEP 0394 recommends that Python 2 should be installed as python2 and Python 3 as python3, and that python should mean Python 2 for now. I think it's important that python continue to mean Python 2, for the reasons described in the "Migration Notes" section of that PEP. In particular, I think it's very important that you should be able to install Python 2 (for the near future), even if your system did not ship with Python 2, and that the system version of Python 3 should not prevent you from using python to run Python 2 applications.
However, not shipping /usr/bin/python
by default also has its downsides. I made a suggestion on the debian-python list for handling this: this blog post is a more detailed version of that proposal.
The basic motivation for this proposal is that third-party script authors love Python in part because it's just about universally available. #!/usr/bin/env python, today, is sort of like #!/bin/sh: you don't necessarily get nice things (Python 3-only features and bash-only features, respectively), but it works. If the python command stops existing, this is inconvenient for end users, who will need to manually edit scripts or install a version of Python themselves, and authors, who may just decide to use a worse language like #!/bin/sh. It's also bad for the Python language community, since it removes an incentive to use Python.
So it would be nice to keep the python
command working usefully on both Python 2-only systems and Python 3-only systems. Fortunately, it's pretty doable to port code to work on both Python 2 and 3 with the same source. Especially for third-party scripts with the goal of working out-of-the-box in as many places as possible, writing code to these restrictions isn't particularly onerous, since they needed to stay Python 2-compatible anyway. I've been writing a bunch of Python code recently as polyglot Python 2/3, and the biggest problem has been limiting myself to features in Python 2.7, not getting things to work in both versions of the language.
So here's the proposal: we install a wrapper binary as python, that defaults to launching Python 2 (or reporting an error if Python 2 is not installed). However, Python 2/3-compatible scripts can include a specific marker indicating they can be run on Python 3, which makes these scripts able to run on both Python 2-only systems and Python 3-only systems.
This marker is based on the existing coding: syntax from PEP 0263: it's a "magic comment" on the first or second line of the form pyversions=2.7+,3.3+, indicating which Python major and minor versions are supported. A script compatible with both Python 2 and 3 should include one of these magic comments, and also include a shebang line launching just python. Here's an example based on one from PEP 0263:
#!/usr/bin/env python
# -*- coding=utf-8 -*- pyversions=2.6+,3.3+
from __future__ import print_function
print(u"Hello world!")
On new systems, the python
command itself is a wrapper binary that knows what versions of the Python interpreter are installed. If it detects a pyversions
comment, it will exec the newest Python interpreter compatible with the comment. For this example, if Python 3.3 is installed, it will launch that: if only Python 3.2 and 2.7 are installed, it will use Python 2.7, since Python 3.2 does not support the u""
literal syntax. Otherwise, it will assume that code is Python 2-only, and exec the newest Python 2 version installed. In either case, if a compatible interpreter cannot be found, it will print an error and exit.
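To give a sense of the shape of that dispatch, here's a very rough sketch in C (my own illustration, not the proof-of-concept mentioned at the end of this post); real version parsing, interactive mode, and PYVERSIONS handling are all omitted:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int want_python3 = 0;

    /* Peek at the first two lines of the script for a pyversions marker
     * and, if it appears to list a 3.x version, prefer Python 3;
     * otherwise fall back to Python 2. */
    if (argc > 1) {
        FILE *f = fopen(argv[1], "r");
        if (f) {
            char line[1024];
            for (int i = 0; i < 2 && fgets(line, sizeof line, f); i++) {
                const char *m = strstr(line, "pyversions=");
                if (m && strstr(m, ",3"))   /* crude check for a 3.x entry */
                    want_python3 = 1;
            }
            fclose(f);
        }
    }

    argv[0] = want_python3 ? "python3" : "python2";
    execvp(argv[0], argv);
    perror("exec");
    return 127;
}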
On legacy systems, the python
command refers to Python 2. So Python 2/3-compatible scripts will still be able to find an interpreter they are compatible with.
For legacy scripts that use a shebang stating just python, on both new and legacy systems, they will only ever run on Python 2, or not at all. This preserves the existing API, and avoids the poor user experience of running Python 2 scripts with the Python 3 interpreter, as PEP 0394 warns against doing.
However, the python
wrapper supports running Python 2/3-compatible scripts with the Python 3 interpreter, which is useful for Python 3-only systems. A future version of Debian can choose to ship the Python 3 interpreter only, and remain compatible with third-party scripts that include the pyversions
magic comment. Meanwhile, these third-party scripts can state #!/usr/bin/env python
in their shebang, and remain compatible with legacy Python 2-only systems, including those (like Mac OS X) that do not include the python2
symbolic link.
The python
wrapper can run in three possible modes:
- Running a script, either via a shebang line or explicitly as python script.py. This case is handled as described above.
- Interactive use, where someone can simply type python, without the pedagogical speed bump of having to explain versions of the language. Running the latest major version of Python is safe, since the interpreter will print out its own version at startup.
- One-liners with python -c are considered scripted use, since in this context, the python command is also serving as an API, and existing users of the API expect Python 2 just as much as scripts do. (Imagine, for instance, a shell script with KEY=$(python -c "import local_settings; print SECRET_KEY"), which will fail if python means Python 3.) In this mode, the wrapper will instead look for an environment variable named PYVERSIONS. If this is set, it will be parsed as the value of the pyversions magic comment. So shell scripts that include Python 2/3-compatible python -c commands can e.g. export PYVERSIONS=2.7+,3.3+ at the top of the script, and work on systems with either just Python 2 or just Python 3.
The end result here is that third-party script authors can continue writing polyglot Python 2/3 code for the indefinite future, and be compatible with both existing Python 2-only distributions and newer Python 3-only distributions. Meanwhile, distributions can move towards installing Python 3 only by default and avoid installing Python 2 as far as possible, without closing off the ability to install Python 2 when needed, or breaking legacy Python 2-only code.
This is a good transition story for distributions like Debian, Ubuntu, and Fedora that are currently trying to move to Python 3 only, but don't want to break existing code. It's also a good transition story for distributions like Arch that have already started moving /usr/bin/python
to Python 3: interactive and scripted use of the python
command can continue to launch Python 3, but compatibility with third-party Python 2 scripts is regained. And most importantly, it's good for third-party script authors, who want to write one script that works on Debian, Ubuntu, Fedora, Arch, and Mac OS X alike, and for their end users.
I'm interested in initial feedback on this idea: if it generally seems like a good plan, I'll firm up the details and write it up as a formal PEP. There are a few details to work out, like how this interacts with the py.exe
Windows launcher described in PEP 0397 (if at all), but my focus is making /usr/bin/python
useful and backwards-compatible on UNIX-like platforms.
Edit 3 May 2015: I've uploaded a proof-of-concept implementation of this launcher to GitHub to see what the overhead and complexity of this approach looks like.
by Geoffrey Thomas at April 17, 2015 12:00 AM
We have changed the MySQL default storage engine on sql.mit.edu from MyISAM to InnoDB. This change only affects newly created tables. Existing tables are unaffected, with the exception noted below of Trac databases. No user action is required.
InnoDB offers many improvements over MyISAM and has become the default engine upstream as of MySQL 5.5. Even though sql.mit.edu still runs MySQL 5.1, we made this change because it is required by Trac 1.0.2. We have also converted all Trac databases to InnoDB to let them continue working with Trac 1.0.2.
If you wish to take advantage of the new features provided by InnoDB, you can convert existing tables using the ALTER TABLE command. You can still create new MyISAM tables by passing the ENGINE=MyISAM parameter to CREATE TABLE. You can also convert InnoDB tables back to MyISAM with ALTER TABLE.
Note that the version of InnoDB running on sql.mit.edu does not yet support FULLTEXT indexes. We expect this to be eventually fixed when we upgrade to MySQL 5.6 or later. Until then, tables that use FULLTEXT indexes should be left on the MyISAM engine.
by Anders Kaseorg at November 21, 2014 12:07 AM
On Wednesday, August 27th, 2014, all of the scripts.mit.edu servers will be upgraded from Fedora 17 to Fedora 20, which was released on December 17. We strongly encourage you to test your website as soon as possible, and to contact us at scripts@mit.edu or come to our office in W20-557 if you experience any problems. The easiest way to test your site is to run the following commands at an Athena workstation and then visit your website in the browser that opens, but see this page for more details and important information:
athena% add scripts
athena% firefox-test
by adehnert at August 26, 2014 06:25 PM
One of my favorite SSH tricks is using the ControlMaster and
ControlPath options for speeding up connections. The idea is simple:
if you're already logged into a server and you need to log in again,
just reuse the connection you already have instead of redoing the entire
connection handshake. In addition to making connections much faster, if
you're logging in with a password, you don't have to re-type your
password for each connection. This can be really handy if you're doing a
lot of copies with scp, checking out files with Subversion's
svn+ssh://, etc., and don't want to retype your password each time.
(SSH keys can also solve this problem in many cases, but they require
you to set up keys on the server, whereas this just requires
client-side configuration.)
Setting this up is easy. We'll designate one ssh
process as a master, and
then when another ssh
process wants to connect to the same host, it will
connect to the master process instead and request a shared connection. In order
to make this work, we first have to set up a place for this master process to
listen for requests.
To do this, add the line
ControlPath ~/.ssh/control-%h-%p-%r
to your ~/.ssh/config file. You can read more about the syntax in the
ssh_config man page; this particular setting puts the shared sockets
in your ~/.ssh directory, which is usually a good place for it, and
makes sure that the path is unique for each hostname, port, and
remote username, as recommended by the man page.
Then create the master ssh process by running it with the -M option.
You'll get a normal login, which you can use or just keep around in a
minimized window. You have to keep this connection around until you're
done with all your other connections. If for some reason you want to
avoid opening a shell — for instance, you're using this with
GitHub, which doesn't let you open a shell — you'll also want the
-N option.
Once you've done this, every new ssh
process that's connecting to the
same host (with the same username and port) will reuse the master
connection instead of making its own. You don't need to do anything
special; as long as ControlPath
is set, each new ssh
process will
check for a master connection at that path.
As mentioned, this can make SSH significantly faster for repeated
operations. For instance, git fetch
against GitHub becomes over twice
as fast for me, compared to re-doing SSH key authentication:
geofft@cactuar:~/src/vireo$ time git fetch
real 0m0.305s
user 0m0.005s
sys 0m0.011s
geofft@cactuar:~/src/vireo$ ssh -M -N -f git@github.com
geofft@cactuar:~/src/vireo$ time git fetch
real 0m0.140s
user 0m0.000s
sys 0m0.010s
This example also used the -f
option, which puts the connection into
the background as soon as authentication is complete. To close the
connection, you can use the -O exit option to ssh, which, instead of
opening a new connection, signals the ControlMaster process to exit.
geofft@cactuar:~/src/vireo$ ssh -O exit git@github.com
Exit request sent.
geofft@cactuar:~/src/vireo$ time git fetch
real 0m0.303s
user 0m0.010s
sys 0m0.007s
There's a good article on Wikibooks about connection multiplexing that covers more advanced use of this feature.
by Geoffrey Thomas at April 07, 2014 12:00 AM
Powered by Planet Venus!
Last updated: November 28, 2023 02:07 PM