
In the previous unit, what I showed you

was a way to rescue MaxEnt models to

describe the language abundance P(n)

by reference to a hidden variable, epsilon,

which we referred to as "programmer time".

What we assumed was that the system

was constrained in two ways: not only

to a certain average number of projects

per language, fixing the average

popularity of languages, but also to a

certain average programmer time devoted

to projects in a particular language.

And so this distribution, here, looks

like this in functional form. And when

we integrate out this variable, epsilon,

we get something that looks like this.

So we get a different prediction for

the language distribution. And we get

a prediction that, I argue, looks
reasonably good.

It certainly looks better than the
exponential distribution.

I feel honor-bound to tell you
about the controversy

that arises when we try to build
these models... and in particular,

there is a very different mechanistic
model that looks quite similar.

So this is the Fisher log series.

And the argument behind the Fisher log
series, to explain this distribution,

involves the idea of a "hidden" additional
constraint.

In the open source question, what I've
done is describe that additional

constraint as "programmer time", just
because it seems like it might be a

constraint in the system. OK?
That the average programmer time

for languages is fixed, not just the
average number of projects.

And so that means that languages can
vary in their popularity, but also

in their efficiency. In the ecological
modelling, languages are the "species",

and this is the abundance of a species.

So that means the number of particular
instances of the species in the wild.

And also metabolic rate: how much energy
a particular species consumes.

And so in that case, the system is
constrained to a certain average

species abundance, and a certain average

species energy consumption.
Here languages are constrained to a certain
abundance, and a certain consumption of

programmer energy. That's the analogy.
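For concreteness, the Fisher log series has a simple closed form: P(n) = -x^n / (n ln(1-x)) for n = 1, 2, ... and 0 < x < 1. A minimal sketch (the value of x here is illustrative only, not fit to anything):

```python
import math

def fisher_log_series(n, x):
    """P(n) under the Fisher log series, for n = 1, 2, ... and 0 < x < 1.

    P(n) = -x**n / (n * ln(1 - x)); the prefactor -1/ln(1-x) works
    because sum_{n>=1} x**n / n equals -ln(1-x) exactly.
    """
    return -(x ** n) / (n * math.log(1 - x))

x = 0.9  # illustrative parameter
total = sum(fisher_log_series(n, x) for n in range(1, 2000))
# total is ~1: the probabilities sum to one over n = 1, 2, ...
```

The extra 1/n relative to a plain geometric distribution is what gives the log series its characteristic shape, with lots of mass on rare species (or rare languages).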

So, we can also build a mechanistic model

of programming language popularity.

And previously, when we studied the taxicab
problem, we were able to find

a very simple mechanistic model

that made the same predictions as the
MaxEnt model did.

Here, by contrast, what we're going to find
is that the mechanistic model

is going to produce similar behavior, but
the functional form will actually be

slightly distinct. So, here's the mechanistic
model... we imagine that

languages all start out with a baseline

popularity. Whoever invents the language,
for example, has to write

at least one project. There is at least one
programmer at the beginning

of a language's invention, who knows how

to program in that language, somewhat

by definition. And so there are two ways

that popularity can grow.

It can grow, for example, linearly.

So, on day 1 there's 1 programmer.

And on day 2, that one programmer is
joined by another programmer.

And on day 3, those two programmers
are joined by a third.

And so, over time, what you have is

a growth rate that's linear...

in time.

But, perhaps a more plausible model

for how languages accrue popularity

is multiplicative.

At time 1, there's 1 programmer, and he has

some efficiency of converting other

programmers to his cause. So maybe he's

able to double the number of programmers.

And he's able to double the number of
programmers, because his language

is particularly good, and perhaps
people who like to program in that language

happen to be particularly persuasive.

And so on the second day, those two
programmers each themselves

go out and convert two people,
because they are the same as the

original programmer in their effectiveness

and the language itself is just as
convincing as it was before.

So each of those two programmers goes out

and gathers two more, and we go to 4.

And by a similar argument, we go to 8,

and so this would be the
exponential growth model...

where the number of programmers
as a function of time increases

multiplicatively, as opposed to additively.
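The two growth rules can be sketched as toy functions (the names and the day-indexing here are my own):

```python
def linear_popularity(day):
    """Linear model: one programmer on day 1, one new recruit per day."""
    return day

def doubling_popularity(day):
    """Multiplicative model: one programmer on day 1, doubling each day."""
    return 2 ** (day - 1)

# Days 1..4: linear gives 1, 2, 3, 4; doubling gives 1, 2, 4, 8.
```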

So, let's make this model a little more

realistic. In particular, the
multiplicative factor,

which in this case we set to 2, is

going to be allowed to vary: we're going
to draw this multiplicative factor,

alpha, from some distribution.

And in fact, it doesn't really
matter what that distribution is,

as long as alpha is always greater than

zero, so it's not possible for all
programmers to suddenly disappear,

and it's bounded at some point,

so it's impossible for a language to become

infinitely popular after a finite number

of steps. So, each day, we're going to
draw a number, alpha, from this
distribution, here.

So, after one day, there are alpha(1)
programmers. (This is the

draw on the first day.) After two days
there are alpha(2) times alpha(1)

programmers, and so on: after three days,
alpha(3) times alpha(2) times alpha(1).

So this is now growth that occurs through

a random multiplicative process.
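A minimal simulation of this random multiplicative process. The uniform [0.5, 2.0] choice for alpha is purely illustrative; the argument only needs alpha strictly positive and bounded:

```python
import random

def multiplicative_growth(days, rng):
    """Simulate n(t): start with one programmer at the language's
    invention and multiply by a random factor alpha each day.

    Alpha is drawn uniformly from [0.5, 2.0], an arbitrary choice
    for illustration; any bounded, strictly positive distribution
    gives the same qualitative behavior.
    """
    n = 1.0
    for _ in range(days):
        n *= rng.uniform(0.5, 2.0)  # alpha(t): today's random kick
    return n

final = multiplicative_growth(100, random.Random(0))
```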

It's similar to growth that would happen

through a random additive process,

except now, instead of adding a random
number of programmers each day,

you multiply the total number of

programmers each day by some factor alpha

drawn from this distribution.

So, you can always convert this multiplicative

process into an additive process,
by a very simple

trick: taking the logarithm.

Over time, if we count programmer numbers,

we're multiplying; but if we work in
log space, we're just adding.

Each day we're adding a random number,
log(alpha), and as long as alpha is

always strictly greater than zero, these
logarithms will always be well defined.

Now, all of a sudden, it looks like the
additive model in log space.

And what we know from the central limit
theorem is that if you add together

lots of random numbers, the distribution
of the sum tends towards a Gaussian

distribution, with some particular mean (mu)
and some particular standard deviation (sigma).

Let's not worry about what mu and sigma

are in particular, but rather note that
the growth happens in log space:

the distribution of these sums, over long

time scales, will end up looking like a
Gaussian distribution.
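We can check this numerically: the sum of log-alpha kicks has a mean and variance that grow linearly in time, matching the central-limit-theorem prediction. A sketch, again with the illustrative uniform [0.5, 2.0] choice for alpha (the closed-form moments below are worked out for that particular choice):

```python
import math, random

def log_popularity(days, rng):
    """log n(t) = sum of log(alpha) kicks: additive in log space."""
    return sum(math.log(rng.uniform(0.5, 2.0)) for _ in range(days))

rng = random.Random(42)
days, trials = 200, 4000
samples = [log_popularity(days, rng) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials

# Closed-form per-day moments for alpha ~ U(0.5, 2.0),
# using the antiderivatives of ln(a) and (ln a)^2:
e1 = ((2 * math.log(2) - 2) - (0.5 * math.log(0.5) - 0.5)) / 1.5
e2 = (2 * (math.log(2) ** 2 - 2 * math.log(2) + 2)
      - 0.5 * (math.log(0.5) ** 2 - 2 * math.log(0.5) + 2)) / 1.5

# CLT prediction: both the mean and the variance of log n(t)
# grow linearly in the number of days.
mu_pred, var_pred = days * e1, days * (e2 - e1 ** 2)
```

With 4000 trials the sample mean and variance land close to mu_pred and var_pred, which is the sense in which log-popularity looks Gaussian at long times.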

The accumulated boosts to a language

look, in log space, like a Gaussian
distribution. What that means

is that the exponential growth model

with random kicks, random multiplicative
kicks, actually looks like

a Gaussian in log space, or what we call
a lognormal

in actual number space

So, instead of looking at the logarithm of
the popularity of the language,

just look at the total popularity of

the language, and what that means is

that the distribution looks like the
exponential of minus (log(n) minus

some mean) squared over two sigma
squared... and then you just have to be

careful to normalize things properly
here: there's an extra factor of one

over n from the change of variables.

So, this is the lognormal distribution.
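Concretely, if log(n) is Gaussian with mean mu and standard deviation sigma, the lognormal density is exp(-(log n - mu)^2 / (2 sigma^2)) / (n sigma sqrt(2 pi)); the 1/n is the normalization care just mentioned. A quick numerical check that it integrates to one:

```python
import math

def lognormal_pdf(n, mu, sigma):
    """Density of n when log(n) is Gaussian(mu, sigma).

    The leading 1/n is the Jacobian of the change of variables
    from log(n) back to n, i.e. the "normalize properly" step.
    """
    return (math.exp(-(math.log(n) - mu) ** 2 / (2 * sigma ** 2))
            / (n * sigma * math.sqrt(2 * math.pi)))

# Check the density integrates to ~1: substitute u = log(n) and
# apply the midpoint rule over mu +/- 8 sigma.
mu, sigma = 1.0, 0.8
lo, hi, steps = mu - 8 * sigma, mu + 8 * sigma, 20000
du = (hi - lo) / steps
total = sum(lognormal_pdf(math.exp(lo + (i + 0.5) * du), mu, sigma)
            * math.exp(lo + (i + 0.5) * du) * du
            for i in range(steps))
```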

And in a mechanistic model where language
growth happens multiplicatively,

a language gains new adherents

in proportion to the number of adherents

it already has, and gains new projects

in proportion to the number of projects
it already has,

depending on the environment. That's
where the multiplicative randomness

comes from: alpha is a random number,

not a constant. It's not 2; it's not that
the language always necessarily doubles.

But the fact that it grows through

a multiplicative random process,

as opposed to an additive process,

means that you get lognormal growth.

And so now you can say, "OK, let's imagine

that languages grow through this
lognormal process.

And let's find the best fit parameters

for mu and sigma." And if you do that, you find

that the mechanistic lognormal model looks

pretty good as well.
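For a lognormal, finding the best-fit parameters is easy: the maximum-likelihood estimates of mu and sigma are just the sample mean and standard deviation of the log-counts. A sketch with made-up counts (the data here is purely illustrative):

```python
import math

def fit_lognormal(counts):
    """Maximum-likelihood lognormal fit: mu and sigma are the
    sample mean and standard deviation of the log-counts."""
    logs = [math.log(c) for c in counts]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

# Hypothetical projects-per-language counts, for illustration only.
counts = [1, 1, 2, 3, 5, 8, 20, 100, 1500]
mu_hat, sigma_hat = fit_lognormal(counts)
```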

We were impressed by how well
the blue line fit this distribution,

compared to the red exponential model,

the MaxEnt model constraining only N.

That was the red model. Here the blue

model does well... this is the Fisher
log series.

Unfortunately... and I've given you a
short account

of the mechanistic model here,

where what's happening is that you're

adding together lots of small
multiplicative random kicks...

the mechanistic model also works.

But I will tell you which one fits better.

If you do a statistical analysis (both of
these models have two parameters),

the Fisher log series actually fits better.

In particular, it's able to explain these

really high-popularity languages better.

These deviations here seem larger

than the deviations here, but you have

to remember that this is on a log scale.

So this gets much closer up here

than this does here.

So the mechanistic model, at least
visually, looks extremely

competitive with the Fisher log
series model

derived from a MaxEnt argument.

Statistically speaking, if you look at
the two, the mechanistic model

is actually slightly dispreferred.
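Since both models have two parameters, an AIC-style comparison reduces to comparing log-likelihoods directly. A sketch with hypothetical counts and parameters (and ignoring the subtlety that the lognormal is continuous while project counts are discrete):

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def loglik_fisher(counts, x):
    """Total log-likelihood under the Fisher log series,
    P(n) = -x**n / (n * ln(1 - x))."""
    return sum(math.log(-(x ** n) / (n * math.log(1 - x))) for n in counts)

def loglik_lognormal(counts, mu, sigma):
    """Total log-likelihood under the lognormal density, treating the
    density as a probability mass (a simplification for discrete counts)."""
    return sum(-math.log(n * sigma * SQRT2PI)
               - (math.log(n) - mu) ** 2 / (2 * sigma ** 2)
               for n in counts)

counts = [1, 1, 2, 3, 5, 8, 20, 100]  # hypothetical data
# With equal parameter counts, the AIC difference is just twice the
# log-likelihood difference: the higher log-likelihood wins.
delta = loglik_fisher(counts, 0.9) - loglik_lognormal(counts, 1.2, 1.5)
```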

But like many people, what you want is some

ironclad evidence for one versus the other.

And I think the best way to look for

that kind of evidence is to figure out

what, if anything, this epsilon really is
in the real world.

If we were able to build a solid theory

about what epsilon was, and how we could

measure it in the data, then we could see if

this here, this joint distribution, was
well reproduced.

If we could find evidence, for example,

for the fact that these two covary.

That here we have a term that boosts

the popularity of a language if it becomes

more efficient. So, if this goes down,

this can get higher, and the language

can still have the same probability

of being found with those properties.

And of course, the problem is that we don't

know how to measure this, sort of,
mysterious programmer time...

programmer efficiency.

The ecologists have a much better time
with this,

because the ecologists know what
their epsilon is.

They know that their epsilon is
metabolic energy intake.

So this is "how much a particular instance
of this species consumes in energy

over the course of a day, or
over the course of its lifetime."

And they're able to measure that, and

in fact, they're able to measure this

joint distribution. If we come to study

the open source ecosystem, so far

we don't really have a way to measure

this, and so we're unable to measure the

joint distribution. And so we're left

with, on one side, the mechanistic

popularity-accrual model, and over here,

this model that says there are two

constraints on the system: the average

number of projects and the average
programmer time.