

In the previous section I walked you through a whole set of mathematical derivations designed to produce a maximum entropy model of, essentially, a toy problem. The toy problem was: how can we model the distribution of arrival times of New York City taxicabs based on a small set of data? That is a problem that might well be of interest to you personally, but it is certainly not of profound scientific interest.

In this next section, what I am going to do is present a reasonably interesting scientific problem and show how the maximum entropy approach can illuminate some interesting features of that system.

The particular system we have in mind is what people have tended to call the open source ecosystem. It is the large community of people devoted to writing code in such a way that it is open, accessible, and debuggable, and in general produced not by an individual, and not by a corporation with copyright protections and control over the source, but rather by a community of people sharing and editing each other's code.

The open source ecosystem is something that dominates a large fraction of the computer code run for us today, including not only Mac OS X but, of course, Linux. It is a great success story, and we would like to study it scientifically.

I used the word ecosystem advisedly, in part because a lot of what I am going to tell you now is a set of tools and derivations that I learned from John Harte and the people who have worked with him on the maximum entropy approach, applied not to social systems but to biological systems, and in particular to ecosystems.

I recommend John Harte's book "Maximum Entropy and Ecology" to you as a source of a lot more information on the kinds of tools that I am going to show you now. My goal here is really to show you that even simple arguments based on maximum entropy can provide some really deep scientific insight.

What I am going to take as my source of data, because I am going to study the empirical world now, is drawn from SourceForge. SourceForge is no longer the most popular repository of open source software (perhaps GitHub has now eclipsed it), but over a long period, roughly from 1999 up to 2011, when we gathered this data, it accumulated an enormous archive of projects that range from different kinds of computer games to text editors to business and mathematical software. Some of the code that I have used in my own research is up on SourceForge.

It is a great place to study, in particular, the use of computer languages. Here, what I have plotted is the distribution of languages used in the open source community and found on SourceForge.

On the x-axis is the log of the number of projects written in a particular language. You can see that at log zero, that is, one project, there are about twelve languages in the database that have only of order one project each. These languages are, in other words, extremely rare in the open source movement. Conversely, at the other end of this logarithmic scale, at four (ten to the four, that is ten thousand), we see that there is a small number of extremely popular languages. These are the most common languages you will find on SourceForge; they have a runaway popularity.

And if you know anything about computer programming, it will not surprise you to learn that these are mostly languages in the C family: C itself, C++, Java.

Somewhere in the middle, between those extremely rare birds and these incredibly common languages (these are sort of like the bacteria of the open source movement), you have a larger number of moderately popular languages. So this distribution of languages is what we are going to try to explain using maximum entropy methods: there is a small number of rare languages, a larger number of moderately popular ones, and then again a very small number of wildly popular languages.

So I have plotted that as a probability distribution, P(n), where n is the number of projects in the open source community that use your language, and P(n) is the probability that your language has n projects in the open source community. What we would like to do is build a maximum entropy model of this distribution. Here I have represented the same data in a slightly different way, the way people tend to represent it: as a rank abundance distribution, what ecologists call a species abundance distribution. The top-ranked language, rank one, is the language with the greatest number of projects, and it will not surprise you to learn that it turns out to be Java: there are about twenty thousand projects written in Java. You can see the second-ranked language, C++, then C, then PHP, and these far rarer languages down here have much lower ranks; higher numbers mean lower ranks, as in third place, fourth place. Down here in hundredth place is the very unfortunate language called Turing, which, as far as I am aware, has only two projects written in it in the archive. And you can see some of my favorite languages, like Ruby, somewhere here in this kind of moderate-popularity zone.

So that is the same data represented another way: what I have plotted here is log abundance on this axis, against language rank on a linear axis, as opposed to the previous plot, where I showed you the log. That was a log-log plot; this is now a log-linear plot. So this is the actual data.
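The rank abundance construction just described is simple enough to sketch in a few lines of code. The counts below are hypothetical stand-ins, except that Java's roughly twenty thousand projects and Turing's two are the figures quoted above; the point is only to show how raw abundances become a ranked curve.

```python
# Hypothetical project counts per language (only Java's ~20,000 and
# Turing's 2 come from the lecture; the rest are made-up examples).
counts = {
    "Java": 20000, "C++": 15000, "C": 12000, "PHP": 9000,
    "Ruby": 800, "Turing": 2,
}

# Sort abundances from most to least popular; rank 1 = most projects.
abundances = sorted(counts.values(), reverse=True)

# Pair each abundance with its rank: [(1, 20000), (2, 15000), ...]
rank_abundance = list(enumerate(abundances, start=1))

for rank, n in rank_abundance:
    print(rank, n)
```

Plotting log(n) against rank for data like this produces exactly the log-linear rank abundance curve shown in the figure.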

So the first thing we will try is a maximum entropy distribution for language abundance, in other words, for the probability of finding a language with n projects. And what we are going to do is constrain only one thing: the average popularity of a language. That is going to be our one constraint for the maximum entropy problem, and we are going to pick the probability distribution p(n) that maximizes the entropy, the negative sum from zero to infinity of p log p. We are going to maximize this quantity subject to that constraint and, of course, always subject to the normalization constraint, that the sum of p(n) is equal to unity. So that is my other constraint. And of course, we know how to do this problem already; we know the functional form. It is exactly the same problem as the one you learned to do when you modeled the waiting time for a New York City taxicab. You modeled that problem in exactly the same way: there we only constrained the average waiting time; here we are only going to constrain the average popularity of a language, where popularity means the number of projects written in that language in the archive. And so we know what the functional form will look like: it looks like e to the negative lambda n, all over Z. Then all we have to do is fit lambda and Z so that we reproduce the correct average abundance that we see in the data. So this is the maximum entropy distribution; it is also, of course, an exponential model, it has an exponential form. And if we actually find the lambda and Z that best reproduce the data, in other words, that satisfy this constraint and are otherwise maximum entropy.
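Written out, the optimization just described is:

```latex
\max_{p}\ S[p] = -\sum_{n=0}^{\infty} p(n)\,\ln p(n)
\quad\text{subject to}\quad
\sum_{n=0}^{\infty} p(n) = 1,
\qquad
\sum_{n=0}^{\infty} n\,p(n) = \langle n \rangle .
```

Introducing a Lagrange multiplier for each constraint, exactly as in the taxicab problem, gives the exponential solution

```latex
p(n) = \frac{e^{-\lambda n}}{Z},
\qquad
Z = \sum_{n=0}^{\infty} e^{-\lambda n} = \frac{1}{1 - e^{-\lambda}} ,
```

with lambda chosen so that the model's mean matches the observed average popularity.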

Here is what we find.

This red band here shows the one- and two-sigma contours for the model's rank abundance distribution, and all I want you to see on this graph is what an incredibly bad fit it is.

The maximum entropy distribution with this constraint does not reproduce the data; it is not capable of explaining, or modeling, the data in any reasonable way. It radically underpredicts the really extremely popular languages: it is unable, in other words, to reproduce the fact that there are languages like C and Python that are extremely popular. It overpredicts this kind of mesoscopic regime, the moderately popular languages, and it also overpredicts those really rare birds, those really low-rank languages with very few examples in the archive. So, that is right: this is science fail.
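The failure is easy to see concretely. For p(n) = e^(-lambda n)/Z on n = 0, 1, 2, ..., both the normalization and the mean have closed forms (Z = 1/(1 - e^(-lambda)) and mean = e^(-lambda)/(1 - e^(-lambda))), so the fit reduces to lambda = ln(1 + 1/mean). Here is a minimal sketch, with a made-up average popularity rather than the real SourceForge value:

```python
import math

def maxent_exponential(mean_n):
    """Fit p(n) = exp(-lam * n) / Z on n = 0, 1, 2, ... so that the
    model's mean equals mean_n, the observed average popularity."""
    # From <n> = e^-lam / (1 - e^-lam), solve for lam in closed form.
    lam = math.log(1.0 + 1.0 / mean_n)
    # Geometric series: Z = sum_n e^(-lam n) = 1 / (1 - e^-lam).
    Z = 1.0 / (1.0 - math.exp(-lam))
    return lam, Z

def p(n, lam, Z):
    return math.exp(-lam * n) / Z

# Hypothetical average popularity (the real value would come from
# the SourceForge project counts).
lam, Z = maxent_exponential(mean_n=250.0)

# The exponential tail decays far too fast to permit a language with
# tens of thousands of projects, like Java's ~20,000.
print(p(0, lam, Z), p(20000, lam, Z))
```

With a mean of 250, the probability of a language reaching 20,000 projects is astronomically small under this model, which is exactly the underprediction of the wildly popular languages visible in the figure.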