(Offscreen) So welcome to the talk from Niels about dependency resolution in polynomial time.

(Niels) I'm Niels, and I'm doing my second talk today, this time about something slightly more technical: dependency resolution in deterministic polynomial time, which is what this is all about. I didn't get as far with this as I wanted. I wanted to have a tool ready and have some results, which I didn't quite finish, so these will be my findings, my work-in-progress notes. There are six topics I have to cover: a quick introduction, then something about the hard problem, then the part where it actually turns out to be highly tractable in practice, if you don't think too much about it. And then there is installing, upgrading, and installing again, which is about how to improve your solvers.

Starting with the introduction. I'm hoping mostly to debunk the myth that dependency resolution is a very hard problem that we cannot solve, and that we should remove packages in the hope that this will somehow keep the archive from exploding, the dependency resolvers from breaking, apt from dying, and god knows what. I've been working on Britney, which is a different problem to solve, but it involves, or uses, some of the same techniques. Some will be directly applicable to apt, some will not. There's no one-size-fits-all solution, sorry. And even if there were, I haven't written it yet.

So, defining the problem. When you try to install a package on your system, there are actually two problems being solved. One is the part where apt figures out that if I want to install Eclipse, I'll be needing a lot of Java packages and a lot of other stuff. So it figures out which packages are needed for this to make sense. The second, separate problem is that once you have this list, the packages have to be unpacked and configured in a certain order for this all to make sense. They are, in theory, two distinct problems. I'll be talking about the first problem, because that's actually dependency resolution. The other thing is just ordering. The ordering is certainly also a very important part, and I don't want to dismiss it; it's just that it is in fact, theoretically, a polynomial-time problem.

To solve the ordering problem, you basically compute all the actions: given a set of actions that need to be done, there are constraints between some of them, like you unpack a thing before you configure it, and maybe you deconfigure something else before that. So you end up with a list of partial ?? constraints, and from that you build a graph with the ordering. It's fairly simple to do in practice, if you have the right tools for it. Without cycles, this is a trivial thing where it just orders the actions and tells you "this is the order, go do it". When there are cycles, you get a single action consisting of multiple things to do at the same time, which is really impossible to do. It turns out that, the way we defined this whole thing, if a package does not have a post-install script, it does not need to be configured separately; it is just unpacked or not unpacked. That tends to break most cycles. So if you want to help with the problems around ordering constraints, help us find a way to get rid of most of the post-install scripts, such as the ones just running ldconfig and nothing else. That could solve some of the cycles, and otherwise it's just a matter of feeding it to dpkg and it works. If you do have a cycle, fixing that is a separate problem, and I won't be covering it.
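To make that ordering step concrete, here is a minimal sketch in Python, not dpkg's or apt's actual code: each action is a node, each "A must happen before B" constraint is an edge, and any cycle is collapsed into a group that would have to be handled as one combined action. The action names in the toy example are made up.

```python
from collections import defaultdict

def order_actions(actions, before):
    """actions: iterable of action names.
    before: (a, b) pairs meaning 'a must happen before b'.
    Returns groups in execution order; a group with more than one
    member is a cycle that would have to be handled as one combined
    action (or that we break by dropping a post-install script)."""
    graph = defaultdict(list)
    for a, b in before:
        graph[a].append(b)

    # Tarjan's algorithm: find strongly connected components (cycles).
    counter = [0]
    index, low, on_stack, stack, groups = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            group = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                group.append(w)
                if w == v:
                    break
            groups.append(group)

    for v in actions:
        if v not in index:
            visit(v)
    # Tarjan emits components in reverse topological order, so reverse.
    return list(reversed(groups))

# Toy example: unpack before configure, and libfoo configured before foo.
constraints = [("unpack libfoo", "configure libfoo"),
               ("unpack foo", "configure foo"),
               ("configure libfoo", "configure foo")]
acts = {a for pair in constraints for a in pair}
print(order_actions(acts, constraints))
```

With an acyclic input like this, every group has exactly one action and the list is just a valid ordering; introduce a configure-time cycle and the involved actions come back as one multi-member group.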
As I said, the runtime is polynomial. It's something like the size of the package relation graph, a regular graph traversal, which is fairly polynomial. This even covers finding cycles; again, breaking cycles is just removing post-install scripts. And of course, if you think this is too simple, you can always implement more features on top of it, such as trying to optimize for minimal downtime of services and all that. You can make any trivial problem very hard very quickly that way.

So, the hard problem. First of all, the players. We've got apt, aptitude, cupt, whatever. We've got Britney. We've got DOSE, edos, whatever its name is today. They solve the hard problem, or at least they should be trying to. A notable tool not affected by it: dpkg. dpkg surely figures out if you are missing dependencies. It says, "I cannot resolve this, and I don't know where to fetch it, so I'm just giving up", and that is the only thing it can do. Which means it only verifies the solution, which is known to be polynomial, and it is therefore not a player in this game. dak rm is also doing a polynomial-time check; it just happens to be so for other reasons. Those are basically the known players. There are probably others, but we're moving on to the hard problem itself.

So the problem we actually have is: we've got versioned dependencies, we've got alternative choices, we've got virtual packages. All three make this problem hard. You basically have to remove all possible alternative choices, guesses, whatever, to make this simple. It just so happens that if you remove versioned dependencies, among other things, things become really fun. ?? to solve the problem, but upgrading gets really, really spicy. How do we ensure that the new dpkg is configured before your new package, which uses a feature of the new dpkg, is installed? That's sort of impossible. The problem becomes simple, but we end up with a broken dependency ??, so it's not really an option. Technically there's also a way to make it simple by having multiple versions of things and removing negative dependencies, but that's not necessarily easy to do either. Presumably file conflicts will get in our way or something; we would have to redefine where we place all of the files, and that's not something we're going to do anytime soon. So yeah, it's not really an option.

To give you a better understanding of the problem, please consider this example. We have coreutils at some version, depending on either this version of libc or a newer one. From this, can we immediately conclude that if libc 2.19 is known to be installable, then coreutils will also be installable? How many people think we can immediately conclude something from this? How many people think we cannot conclude anything from this? Well, that's good. It turns out that with negative dependencies or without negative dependencies, it doesn't really matter. With negative dependencies, libc or whatever it depends on could just conflict with coreutils, and we're broken, it's out. And without negative dependencies, something further down could depend on a version that's newer or older, and then no. It's not really trivial: you can't conclude something locally, in theory. That's something that can become a problem. Anyway, it is highly tractable in practice, because if you do have a Breaks, a Conflicts, or some other negative dependency, it tends to be an upper bound. So if the previous version of coreutils was installable with that version, the new version will probably be too. Likewise, most dependencies, circular or otherwise, tend to be bounded on one side only.
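To see why alternatives, virtual packages, and negative dependencies turn this into a satisfiability-style search, here is a hedged Python sketch with a deliberately brute-force check. It is not how apt models packages, and the version numbers and the Breaks relation below are invented purely to mirror the coreutils/libc example.

```python
from itertools import product

def installable(target, depends, conflicts, universe):
    """Brute-force installability check (exponential on purpose):
    try every subset of 'universe' containing 'target' and see if it
    satisfies every dependency clause and violates no conflict."""
    universe = sorted(universe)
    for bits in product([False, True], repeat=len(universe)):
        selected = {p for p, bit in zip(universe, bits) if bit}
        if target not in selected:
            continue
        deps_ok = all(any(alt in selected for alt in clause)
                      for pkg in selected
                      for clause in depends.get(pkg, []))
        no_conflict = not any(a in selected and b in selected
                              for a, b in conflicts)
        if deps_ok and no_conflict:
            return selected
    return None

# The coreutils example: in this tiny universe only libc 2.19 exists, and a
# purely hypothetical negative dependency breaks coreutils even though its
# own dependency on libc (>= 2.19) looks perfectly satisfiable locally.
universe = ["libc:2.19", "coreutils:8.23"]
depends = {"coreutils:8.23": [["libc:2.19"]]}     # Depends: libc (>= 2.19)
conflicts = [("libc:2.19", "coreutils:8.23")]     # hypothetical Breaks
print(installable("libc:2.19", depends, conflicts, universe))       # works
print(installable("coreutils:8.23", depends, conflicts, universe))  # None
```

The point of the sketch is only that installability of one package tells you nothing, on its own, about a package depending on it: the whole selected set has to be checked.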
There are cases of version ranges, which are a lower bound and an upper bound at the same time. They exist, they are just not so common yet, and that's a good thing. The number of possible solutions to any clause actually tends to be fairly limited. Of course there are exceptions. Packages from unstable: they might just be missing, in which case it's really hard to solve the dependency. You've got mutually exclusive packages, all the sendmail stuff, the MTAs. The version ranges I mentioned before. And then of course the strictly equal versions, which we see between packages built from the same source package. The redeeming feature here: same source. It is actually very simple to figure out whether they are from the same source, and they tend to be upgraded ?? anyhow; the new versions tend to be available at the same time, sort of thing.

The problem made hard, in a nutshell, is this contrived example. You have a package you want to try, and it depends on any number of foos. Each of the foos can be solved by picking either the bar package or the good one, which might be a hint as to which one we would choose if we want to solve it. The bar packages depend on a bad package that does not work with the starting package, and we are broken. The good package just doesn't have any dependencies and is therefore good. In the perfect case, you try foo, pick good, try foo1, pick good again, and be done. In practice, if you don't have some way of ordering these so you know one of them is better than the other, you now have an n-times-m trace where you have to guess up and down, which one am I doing, backtrack back and forth, and it's going to be horrible. This contrived example can of course be handled by some solvers that can see the pattern, but you can always make a more contrived pattern that's harder to see through; I tried to keep it somewhat simple.

The good thing is, very few people write software like this, and that is the most redeeming feature of the package ??. We do have exceptions: aspell-dictionaries, ispell-dictionaries. Basically, you depend on ispell-dictionary, which is a virtual package provided by 50 different ispell packages, or aspell packages, and then they depend on something else after that. So in theory they can be a part of creating this problem. Fortunately, they themselves are fairly simple once you get past them. You also have Multi-Arch: foreign, or Multi-Arch: allowed, with package interdependencies. If you have multiple architectures, you can in theory have any number of things satisfying the same clause. This gives extra, unnecessary work for most resolvers if you enable them to do multiarch and cross things. We have 12 architectures, but I suspect most users of multiarch are somewhat limited to maybe 3. A possible exception being people who like writing resolvers, trying them out and torture testing them.
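Here is a small Python sketch of the contrived worst case just described, with made-up package names, plus a deliberately naive backtracking resolver. Whether it tries the bar alternative or the good alternative first decides whether it finishes immediately or backtracks an exploding number of times, which is exactly the point of the example.

```python
def contrived_universe(n):
    """'start' depends on foo0..fooN-1; each fooI is a virtual package
    provided by barI (which drags in 'bad', breaking 'start') and by
    goodI (which has no dependencies at all)."""
    depends = {"start": [[f"foo{i}"] for i in range(n)], "bad": []}
    provides, conflicts = {}, {("bad", "start")}
    for i in range(n):
        provides[f"foo{i}"] = [f"bar{i}", f"good{i}"]   # bar listed first!
        depends[f"bar{i}"] = [["bad"]]
        depends[f"good{i}"] = []
    return depends, provides, conflicts

def resolve(pending, selected, depends, provides, conflicts, stats):
    """Naive resolver: satisfy clauses left to right, try alternatives
    in the order they are listed, count how often we backtrack."""
    if any(a in selected and b in selected for a, b in conflicts):
        return None
    if not pending:
        return selected
    clause, rest = pending[0], pending[1:]
    # Expand virtual packages into their concrete providers.
    candidates = [c for alt in clause
                  for c in ([alt] if alt in depends else provides.get(alt, []))]
    if any(c in selected for c in candidates):
        return resolve(rest, selected, depends, provides, conflicts, stats)
    for cand in candidates:
        found = resolve(rest + depends.get(cand, []), selected | {cand},
                        depends, provides, conflicts, stats)
        if found is not None:
            return found
        stats["backtracks"] += 1
    return None

depends, provides, conflicts = contrived_universe(8)
stats = {"backtracks": 0}
resolve(list(depends["start"]), {"start"}, depends, provides, conflicts, stats)
print(stats["backtracks"])  # grows exponentially with n; list good first and it is 0
```

A resolver that can see the pattern, or that simply prefers the dependency-free alternative, never has to backtrack here; one that guesses blindly pays for every bar it tries.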
The real problem: to recreate this in the archive, you would need to, for example, write N different awk implementations. We have three. Then afterwards, for example, have M different, distinct, but equally valid libc implementations. It's not something we're likely to do in the near future. And you would have to do this on multiple levels. Because you can just presolve the essential set most of the time, that particular part is not very interesting. The data packages can in theory blow up, when you have truly interchangeable data, like we do with the aspell packages. But they tend to either not have any dependencies after the aspell package, after the data package, or they have a simple loop back into what you came from. So the blowup tends to be limited to one level. If you have enough of those, it's still going to be a problem, but it keeps it simple.

On to installing. The easier case, as mentioned, is when you have a single suite and a single architecture. All things in the graph collapse into what looks like a regular graph, and therefore become simple, trivial. This actually happens so much that if you take Lintian, it has 26 distinct dependency clauses. If you look at each of them, when you only have a single suite and architecture, there's at most one package directly solving each clause locally. Then further down the graph somewhere you depend on, for example, debconf, or debconf-2.0, which is one of these alternatives, giving rise to its ?? Lintian itself is not subject to becoming a backtrack point. There's no guesswork when you reach Lintian, no "which of its dependencies should I pick, which version" and all that. That's not there; you have to go further down to see that. Actually, by the time you reach the debconf thing, it has already been solved by something else, as I recall. Oh, no, wait: the dependencies of debconf are already covered by the time you're forced to resolve this choice, so it's not really a big issue either. You just pick the debconf one, cdebconf or debconf, so it's trivial that way.

The question is, when do we have these conditions? We do, a lot, actually. If you do builds in unstable, for example: ??, only one suite, only one architecture, piece of cake. If you install packages on a pure Jessie system with multiarch, you have the single-suite part, which limits a lot of it, but not the single architecture. That's what it is; but in pure Squeeze you won't have multiarch, because it didn't work back then, so that's even simpler. You can also do unstable plus experimental, or stable plus backports, if you fiddle a bit with the resolver and say that it is not allowed to pick experimental unless it is explicitly requested to, and from there it only picks what it absolutely needs to solve it. You basically get the same result as with a single-suite restriction, but that requires you to actually code something for it. And then there's Britney, with the testing migration thing, because she basically takes things, moves them into a new version of testing, and tries to check that this is still a working solution. So she forces it into a single suite, currently. These are common cases where it happens, but not the only ones, so stable upgrades are, funnily enough, still not a single suite and all that. It works because there is only one instance of a package: you only have one version and one architecture that solves that particular dependency. Whereas with multiarch, you could have the i386 and the amd64 version.
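As a hedged illustration of the single-suite, single-architecture claim, here is a small helper in the same toy format as the earlier sketches: it lists the dependency clauses that leave the resolver more than one concrete candidate, which are the only places where guesswork and backtracking can even start. The Lintian-like data below is invented; only debconf-2.0 being provided by both debconf and cdebconf reflects the real archive.

```python
def choice_points(depends, provides):
    """Return every (package, clause, candidates) where a clause has
    more than one concrete candidate, i.e. a potential backtrack point."""
    ambiguous = []
    for pkg, clauses in depends.items():
        for clause in clauses:
            candidates = {c for alt in clause
                          for c in ([alt] if alt in depends
                                    else provides.get(alt, []))}
            if len(candidates) > 1:
                ambiguous.append((pkg, clause, sorted(candidates)))
    return ambiguous

# In a single-suite, single-architecture universe almost every clause
# comes back with exactly one candidate; something like
# "debconf | debconf-2.0" is one of the few entries this would print.
toy_depends = {"lintian": [["perl"], ["debconf", "debconf-2.0"]],
               "perl": [], "debconf": [], "cdebconf": []}
toy_provides = {"debconf-2.0": ["debconf", "cdebconf"]}
print(choice_points(toy_depends, toy_provides))
```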
And if you do an upgrade, you could have the old version and the new version, which may or may not satisfy it. Also, we have this thing where we don't like libraries to have alternative implementations; it sort of breaks things horribly, especially when they don't actually agree on the interface they provide. And there's the social aspect, that we don't really bother having 200 interchangeable implementations of everything. I think the record is something like five or six different sendmail implementations. ?? I don't think so, but ?? might have. And even when you do have one of these explosions, the breakage actually has to hide beneath one of these alternative explosions for it to be a problem. You might have aspell over here on the right-hand side, and a breakage over here, which ?? to find out. So if the resolver solves the other part before diving into the explosion, it realizes this is not working and bails out. There are a couple of things we do to make this easier. The most interesting question is, of course: can we still make this simple without the single-suite and single-architecture restriction? We can, to some extent. And that's where we move on to upgrading, in deterministic polynomial time. This is where I've spent most of my time working.

To start with, when we do an upgrade from stable to stable, and here I'm taking pure stable, no backports and all, the generally recommended way to do it is to replace Wheezy with Jessie in your sources, hit apt-get update, and upgrade afterwards, and apt replaces all the old packages with the new version, if there is a new version. We also want all the packages from Wheezy that do not have any version in Jessie to be removed, and there might be a new essential package somewhere. So, long story short: upgrade if it's present, remove if it is not present, and then install the new essential packages, if any; they don't tend to change that often. I'll be ignoring the installing part for now. It is of course valid and interesting, but we'll skip it. So let's take a simple example, somewhat contrived, with somewhat made-up numbers. They are not too far from reality, but they are not exact either.
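Before that example, the rule just stated, upgrade if present, remove if absent, install any new essential packages, can be sketched like this; the suite contents and version numbers are purely illustrative, not real archive data.

```python
def plan_upgrade(installed, target_suite, target_essential):
    """installed / target_suite: dicts of package -> version.
    target_essential: package names that are essential in the new stable."""
    plan = {"upgrade": {}, "remove": [], "install": {}}
    for pkg, version in installed.items():
        if pkg in target_suite:
            if target_suite[pkg] != version:
                plan["upgrade"][pkg] = target_suite[pkg]
        else:
            plan["remove"].append(pkg)          # no version in the new stable
    for pkg in sorted(target_essential - set(installed)):
        plan["install"][pkg] = target_suite[pkg]  # new essential packages
    return plan

# Toy wheezy -> jessie example with made-up version numbers.
wheezy_system = {"dpkg": "1.16.18", "coreutils": "8.13-3.5", "oldlib": "1.0"}
jessie = {"dpkg": "1.17.27", "coreutils": "8.23-4", "init": "1.22"}
print(plan_upgrade(wheezy_system, jessie, {"dpkg", "init"}))
# {'upgrade': {'dpkg': '1.17.27', 'coreutils': '8.23-4'},
#  'remove': ['oldlib'], 'install': {'init': '1.22'}}
```

The interesting part, of course, is not computing this plan but showing that the resulting set stays installable, which is what the rest of the talk is about.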