English subtitles

← MaxEnt 9 Modeling the Open Source Ecosystem, Part 3

Showing Revision 5 created 03/17/2017 by seonaid.

  1. In the previous unit, what I showed you
  2. was a way to rescue MaxEnt models to
  3. describe the language abundance P(n)
  4. by reference to a hidden variable, epsilon
  5. which we referred to as "programmer time"
  6. and what we assumed was that the system
  7. was constrained in two ways:
  8. it was constrained not only to having
  9. a certain average number of projects per
  10. language. We fixed the average popularity
  11. of languages, but we also fixed the average
  12. programmer time devoted to projects
  13. in a particular language. And so this
    distribution, here, looks like this
  14. in functional form. And when we integrate
  15. out this variable, epsilon, we get
    something that looks like this.
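The integration step can be sketched as follows. This is a reconstruction, not the slide itself: it assumes the second constraint fixes the average of the product n times epsilon (as in the ecological derivation), that epsilon ranges over all positive values, and it introduces lambda_1, lambda_2 and Z as names for the two Lagrange multipliers and the normalization.

```latex
P(n,\epsilon) = \frac{1}{Z}\, e^{-\lambda_1 n - \lambda_2 n \epsilon}
\qquad\Longrightarrow\qquad
P(n) = \int_0^{\infty} P(n,\epsilon)\, d\epsilon
     = \frac{1}{Z}\, e^{-\lambda_1 n} \int_0^{\infty} e^{-\lambda_2 n \epsilon}\, d\epsilon
     = \frac{1}{Z \lambda_2}\, \frac{e^{-\lambda_1 n}}{n}
```

Under those assumptions the result is an exponential divided by n, which is the log series form, rather than the pure exponential obtained when only the average of n is constrained.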
  16. So we get a different prediction for
  17. the language distribution. And we get
  18. a prediction that, I argue, looks
    reasonably good.
  19. It certainly looks better than the
    exponential distribution.
  20. I feel honor-bound to tell you
    about the controversy
  21. that happens when we try to make
    these models... and in particular
  22. there is a very different mechanistic
    model that looks quite similar.
  23. So this is the Fisher log series.
  24. And the argument behind the Fisher log
    series, to explain this distribution,
  25. involves the idea of a "hidden" additional constraint.
  26. In the open source question, what I've
    done is describe that additional
  27. constraint as "programmer time", just
    because it seems like it might be a
  28. constraint in the system. OK?
    That the average programmer time
  29. for languages is fixed, not just the
    average number of projects.
  30. And so that means that languages
    can vary in their popularity, but also
  31. in their efficiency. In the ecological
    modelling, this is "species":
  32. languages are species, and this is the
    abundance of species.
  33. So that means the number of particular
    instances of the species in the wild.
  34. And also metabolic rate: how much energy
    a particular species consumes.
  35. And so in that case, the system is
    constrained to a certain average
  36. species abundance, and a certain average
    species energy consumption.
  37. Here languages are constrained to a certain
    abundance, and a certain consumption of
  38. programmer energy. That's the analogy.
  39. So, we can also build a mechanistic model
  40. of programming language popularity.
  41. And previously, when we studied the taxi-cab
    problem, what we did was,
  42. when we produced the mechanistic model,
    we were able to find a very simple one
  43. that had the same predictions as the
    MaxEnt model did.
  44. Here, by contrast, what we're going to find
    is that the mechanistic model
  45. is going to produce similar behavior, but
    the functional form will actually be
  46. slightly distinct. So, here's the mechanistic
    model... we imagine that
  47. languages all start out with a baseline
  48. popularity. Whoever invents the language,
    for example, has to write
  49. at least one project. There is at least one
    programmer at the beginning
  50. of a language's invention, who knows how
  51. to program in that language, somewhat
  52. by definition. And so there's two ways
  53. that that popularity can grow.
  54. It can grow, for example, linearly.
  55. So, on day 1 there's 1 programmer.
  56. And on day 2, that one programmer is
    joined by another programmer.
  57. And on day 3, those two programmers
    are joined by a third.
  58. And so, over time, what you have is
  59. a growth rate that's linear...
  60. in time.
  61. But, perhaps a more plausible model
  62. for how languages accrue popularity
  63. is multiplicative.
  64. At time 1, there's 1 programmer, and he has
  65. some efficiency of converting other
  66. programmers to his cause. So maybe he's
  67. able to double the number of programmers.
  68. And he's able to double the number of
    programmers, because his language
  69. is particularly good, and perhaps
    people who like to program in that language
  70. happen to be particularly persuasive.
  71. And so on the second day, those two
    programmers each themselves
  72. go out and convert two people,
    because they are the same as the
  73. original programmer in their effectiveness
  74. and the language itself is just as
    convincing as it was before.
  75. So each of those two programmers goes out
  76. and gathers two for each and we go to 4.
  77. And by a similar argument, we go to 8,
  78. and so this would be the
    exponential growth model...
  79. where the number of programmers
    as a function of time increases
  80. multiplicatively, as opposed to additively.
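The two growth rules described above can be compared in a few lines. This is a minimal sketch: the function names are made up for illustration, and day 1 is taken as the day with a single programmer.

```python
def linear_growth(days: int) -> list[int]:
    """One new programmer joins each day: 1, 2, 3, ..."""
    return [1 + t for t in range(days)]


def multiplicative_growth(days: int, factor: int = 2) -> list[int]:
    """The whole community is multiplied by `factor` each day: 1, 2, 4, ..."""
    return [factor ** t for t in range(days)]


linear = linear_growth(4)
doubling = multiplicative_growth(4)
print(linear)    # [1, 2, 3, 4]
print(doubling)  # [1, 2, 4, 8]
```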
  81. So, let's make this model a little more
  82. realistic, and in particular, let's allow
    the multiplicative factor,
  83. which in this case we set to 2, we're
  84. going to allow this multiplicative factor
    to vary. And in fact, we're going to draw
  85. this multiplicative factor, alpha, from some
  86. distribution. And in fact, it doesn't really
    matter what that distribution is
  87. as long as alpha is always greater than 0,
  88. so it's not possible for all programmers
    to suddenly disappear.
  89. So it's always greater than zero,
  90. and it's bounded at some point,
  91. so it's impossible for a language to become
  92. infinitely popular after a finite number
  93. of steps. So we're going to draw...
  94. each day we're going to draw a number,
    alpha, from this distribution, here.
  95. So, after one day, there are alpha programmers.
  96. After two days, there is alpha... or rather
    alpha(1) programmers. (This is the
  97. draw on the first day.) On the second day
    there's alpha(2) times alpha(1)
  98. programmers, and so on:
    alpha(3) times alpha(2) times alpha(1).
  99. So this is now growth that occurs through
  100. a random multiplicative process.
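A minimal sketch of that random multiplicative process, under an assumption the lecture does not make: alpha is drawn from a uniform distribution between 0.5 and 2.0. Any distribution that stays strictly positive and bounded would do; the bounds and the function name here are hypothetical.

```python
import random


def simulate_language(days: int, rng: random.Random,
                      low: float = 0.5, high: float = 2.0) -> float:
    """Return the popularity after `days` random multiplicative kicks.

    Assumption: alpha ~ Uniform(low, high) with 0 < low <= high,
    so alpha is strictly positive and bounded.
    """
    popularity = 1.0  # baseline: the inventor of the language
    for _ in range(days):
        alpha = rng.uniform(low, high)  # a fresh draw every day
        popularity *= alpha
    return popularity


rng = random.Random(0)  # fixed seed so the run is repeatable
final = simulate_language(days=30, rng=rng)
print(final)
```

Because every kick is strictly positive, the popularity can shrink but never reaches zero, and because every kick is bounded, it stays finite after any finite number of days.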
  101. It's similar to growth that would happen
  102. through a random additive process
  103. except now, instead of adding a random
    number of programmers each day
  104. you multiply the total number of
  105. programmers each day, by some factor alpha
  106. drawn from this distribution.
  107. So, you can always convert this multiplicative
  108. process into an additive process,
    by a very simple
  109. trick of taking the logarithm.
  110. Over time, if we count programmer numbers
  111. we're multiplying, but if you're working in
    log space, we're just adding.
  112. We're adding a random number to the
  113. distribution. As long as alpha is always
  114. strictly greater than zero, these logarithms
    will always be well defined.
  115. Now, all of a sudden, it looks like the
    additive model in log space.
  116. And what we know from the central limit
    theorem, is that if you add together
  117. lots of random numbers, that distribution
    tends towards a Gaussian distribution
  118. with some particular mean (mu) and some
    particular variance (sigma squared).
  119. Let's not worry about what mu and sigma
  120. are in particular, but rather note that
    that growth happens in log space.
  121. The distribution of these sums over long
  122. time scales will end up looking like a
    Gaussian distribution.
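The central limit argument can be checked numerically. The sketch below again assumes alpha is uniform between 0.5 and 2.0 (an illustrative choice, not something fixed in the lecture), sums the log-kicks for many simulated languages, and checks two things: the average log-popularity matches the number of days times the mean log-kick, and roughly 68% of languages fall within one standard deviation, as a Gaussian would predict.

```python
import math
import random
import statistics

LOW, HIGH = 0.5, 2.0          # hypothetical bounds on alpha (an assumption)
DAYS, LANGUAGES = 200, 2000   # illustrative sizes

rng = random.Random(1)


def log_popularity(days: int) -> float:
    """log(alpha_1 * ... * alpha_T), computed as a sum of logs."""
    return sum(math.log(rng.uniform(LOW, HIGH)) for _ in range(days))


logs = [log_popularity(DAYS) for _ in range(LANGUAGES)]
mean_log = statistics.fmean(logs)
sd_log = statistics.stdev(logs)

# Mean of one log-kick for alpha ~ Uniform(LOW, HIGH), from the integral of log(x).
step_mean = (HIGH * math.log(HIGH) - LOW * math.log(LOW)) / (HIGH - LOW) - 1.0

# For a Gaussian, about 68% of samples fall within one standard deviation.
within_one_sd = sum(abs(x - mean_log) <= sd_log for x in logs) / LANGUAGES

print(f"mean log-popularity: {mean_log:.2f} (theory: {DAYS * step_mean:.2f})")
print(f"fraction within one sd: {within_one_sd:.3f}")
```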
  123. The average boost per day to a language
  124. looks in log space like a Gaussian
    distribution. What that means
  125. is that the exponential growth model
  126. with random kicks, random multiplicative
    kicks, actually looks like
  127. a Gaussian in log space, or what we call
    a log-normal
  128. in actual number space.
  129. So, instead of looking at the logarithm of
    the popularity of the language,
  130. just look at the total popularity of
  131. the language, and what that means is
  132. that it looks like the exponential of
    minus (log(n) minus some mean) squared
  133. over two sigma squared... and then you
  134. just have to be careful to normalize
    things properly here.
  135. So, this is the log-normal distribution.
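Written out with its normalization, the standard log-normal density for a popularity n greater than zero is shown below. Here mu and sigma are the mean and standard deviation of log(n); the factor of 1/n in front comes from changing variables from log(n) back to n.

```latex
P(n) = \frac{1}{n\,\sigma\sqrt{2\pi}}\,
       \exp\!\left( -\frac{(\ln n - \mu)^2}{2\sigma^2} \right),
\qquad n > 0
```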
  136. And a mechanistic model where language
    growth happens multiplicatively
  137. where a language gains new adherents
  138. in proportion to the number of adherents
  139. it already has, where a language gains
  140. new projects in proportion to the number
    of projects it already has
  141. dependent upon the environment -
    that's where the multiplicative randomness
  142. comes from. Alpha is a random number;
  143. it's not a constant. It's not 2. It's not
    that the language always necessarily doubles.
  144. But the fact that it grows through
  145. a multiplicative random process
  146. as opposed to an additive process
  147. means that you have a log-normal growth.
  148. And so now you can say, "OK, let's imagine
  149. that languages grow through this
    log-normal process.
  150. And let's find the best fit parameters
  151. for mu and sigma." And if you do that, you find
  152. that the mechanistic log-normal model looks
  153. pretty good as well.
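One simple way to find best-fit values for mu and sigma is maximum likelihood: for a log-normal, the estimates are the mean and standard deviation of log(n). The sketch below is not necessarily the fitting procedure used in the lecture. It runs on synthetic data with made-up parameters (not the open source data set) and treats popularity as a continuous quantity.

```python
import math
import random
import statistics

rng = random.Random(2)
TRUE_MU, TRUE_SIGMA = 3.0, 1.5   # made-up values, for illustration only

# Synthetic popularity values drawn from a log-normal distribution.
popularity = [rng.lognormvariate(TRUE_MU, TRUE_SIGMA) for _ in range(5000)]

# Maximum-likelihood estimates: mean and population standard deviation of log(n).
log_n = [math.log(n) for n in popularity]
mu_hat = statistics.fmean(log_n)
sigma_hat = statistics.pstdev(log_n)

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```

With real data, the same two lines applied to the logarithm of each language's project count would give the fitted curve to compare against the log series.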
  154. We were impressed by how well
    the blue line fit this distribution
  155. compared to the red exponential model,
  156. the MaxEnt model constraining only N.
  157. That was the red model. Here the blue
  158. model does well... this is the Fisher log series.
  159. Unfortunately, a mechanistic model... and
    I've given you a short account
  160. of the mechanistic model, here.
  161. Where what's happening is you're adding
  162. together lots of small multiplicative
    random kicks
  163. The mechanistic model also works [?]
  164. I will tell you that this fits better.
  165. Both of these models have two parameters.
  166. If you do a statistical analysis, the
    Fisher log series actually fits better.
  167. In particular, it's able to explain these
  168. really high-popularity languages better.
  169. These deviations here seem larger
  170. than the deviations here, but you have
  171. to remember that this is on a log scale.
  172. So this gets much closer up here
  173. than this does here.
  174. So the mechanistic model, at least
    visually, looks like it's extremely
  175. competitive with the Fisher log
    series model
  176. derived from a MaxEnt argument.
  177. Statistically speaking, if you look at
    these two, this one
  178. is actually slightly dispreferred.
  179. But like many people, what you want is some
  180. ironclad evidence for one versus the other.
  181. And I think the best way to look for
  182. that kind of evidence is to figure out
  183. what, if anything, this epsilon really is
    in the real world.
  184. If we were able to build a solid theory
  185. about what epsilon was, and how we could
  186. measure it in the data, then we could see if
  187. this here, this joint distribution, was
    well reproduced.
  188. If we could find evidence, for example,
  189. for the fact that these two co-vary.
  190. That here we have a term that boosts
  191. the popularity of a language if it becomes
  192. more efficient. So, if this goes down,
  193. this can get higher, and the language
  194. can still have the same probability
  195. of being found with those properties.
  196. And of course, the problem is that we don't
  197. know how to measure this, sort of,
    mysterious programmer time...
  198. programmer efficiency.
  199. The ecologists have a much better time
    with this.
  200. Because the ecologists, they know what
    their epsilon is.
  201. They know that their epsilon is metabolic
    energy units intake.
  202. So this is "how much a particular instance
    of this species consumes in energy
  203. over the course of a day, or
    over the course of its lifetime."
  204. And they're able to measure that, and
  205. in fact, they're able to measure this
  206. joint distribution. If we come to study
  207. the open source ecosystem, so far
  208. we don't really have a way to measure this,
  209. and so we're unable to measure the joint distribution.
  210. And so now we're left with one model that's
  211. a mechanistic popularity accrual model,
  212. and over here, this model that talks about
  213. there being two constraints on the system:
  214. average number and average
    programmer time.