MaxEnt 7 A Real-World Example: Modeling the Open Source Ecosystem, Part 1

Showing Revision 9 created 07/17/2017 by Matias Agelvis.

  1. In the previous section I walked you
  2. through a whole set of mathematical
  3. derivations designed to produce a maximum
  4. entropy model of
  5. essentially a toy problem,
  6. the toy problem is how can we model
  7. the distribution of arrival times of
  8. New York City taxicabs based on a small
  9. set of data, and that's a problem that
  10. well might be of interest
    to you personally,
  11. but is certainly not of
    profound scientific interest.
  12. In this next section, what I am
  13. going to do is present to you a reasonably
  14. interesting scientific problem and show
  15. how the maximum entropy approach can
  16. illuminate some interesting features
  17. of that system.
  18. The particular system we have in mind is
  19. what people have tended to call the open
  20. source ecosystem. It is the large
  21. community of people devoted to writing
  22. code in such a way that it is open,
  23. accessible, and de-buggable,
  24. and in general,
  25. produced not by an individual,
  26. and not by a corporation, with
  27. copyright protections and control over the
  28. source, but rather by a community of
  29. people sharing and editing
    each others' code.
  30. The open source ecosystem is
  31. something that dominates a large fraction
  32. of the computer code run for us today,
  33. including not only Mac OS X, but of course
  34. Linux. It is a great success story and we
  35. would like to study it scientifically.
  36. I used the word ecosystem advisedly
  37. in part because a lot of what I am going
  38. to tell you now is a set of tools and
  39. derivations that I learned from John Harte
  40. and people who have worked with John
  41. Harte on the maximum entropy approach
  42. to not social systems, but to biological
  43. systems and in particular ecosystems.
  44. John Harte's book "Maximum Entropy
  45. and Ecology" - I recommend it to you as a
  46. source of a lot more information on the
  47. kinds of tools that I am going to show
  48. you now. My goal here is really to show
  49. you that even simple arguments based on
  50. maximum entropy can provide some really
  51. deep scientific insight.
  52. What I am gonna
  53. take as my source of data, because I am
  54. going to study the empirical world now,
  55. is drawn from SourceForge. SourceForge
  56. is no longer the most popular repository
  57. of open source software - perhaps GitHub
  58. has now eclipsed it - but for a long
  59. period, perhaps from 1999, and we gathered
  60. data up to 2011 on this, it has had an
  61. enormous archive of projects that range
  62. from different kinds of computer games to
  63. text editors to business and mathematical
  64. software, some of the code that I've used
  65. in my own research is put up on SourceForge.
  66. It is a great place to study in
  67. particular the use of computer languages.
  68. Here, what I have plotted is a
  69. distribution of languages used in the open
  70. source community and found on Sourceforge.
  71. On the x-axis is the log of the
  72. number of projects written in a particular
  73. language. You can see that log zero, that
  74. is one. In the database there are about
  75. twelve languages that have only on the
  76. order of one project. These languages are
  77. extremely rare, in other words, in the
  78. open source movement. Conversely,
  79. on the other end of this logarithmic
  80. scale here at four, so, ten to the four
  81. that's ten thousand, we see there is a
  82. small number of extremely popular
  83. languages, these are sort of the most
  84. common languages you will find on SourceForge,
  85. they have a runaway popularity, ok?
  86. And if you know anything about computer
  87. programming it will not surprise you to
  88. learn that these are languages mostly
  89. derived from C, such as C, C++, Java, ok?
  90. Somewhere in the middle between these
  91. extremely rare birds and these incredibly
  92. common, you know, because these are sort
  93. of like the bacteria of the open source
  94. movement, somewhere in the middle you have
  95. a larger number of moderately popular
  96. languages, ok? So this distribution
  97. of languages is what we are gonna try to
  98. explain using maximum entropy methods, ok?
  99. There is a small number of rare languages,
  100. a larger number of moderately popular ones,
  101. and then again a very small number of
  102. wildly popular languages, ok?
  103. So I plotted that as a probability
  104. distribution, in fact, P of n where n
  105. is the number of projects in the open
  106. source community that use your language,
  107. and this is the probability that your
  108. language has n projects in the open source
  109. community. What we would like to do is
  110. build a maximum entropy model of this
  111. distribution. Here, I represented the same
  112. data in a slightly different way, this is
  113. how people tend to represent it, this is a
  114. rank abundance distribution,
  115. what ecologists call a species abundance
  116. distribution. So the top rank language,
  117. rank one, here, is the language with the
  118. largest number of projects, and it won't
  119. surprise you to learn that it actually
  120. turned out to be Java, there are
  121. 20 thousand projects written in Java,
  122. you can see the second rank language C++,
  123. then C, then PHP, and these far rarer
  124. languages down here have much lower ranks,
  125. so higher numbers mean lower ranks, like
  126. in 3rd place, 4th place, so 100th place,
  127. down right here in 100th place is the
  128. very unfortunate language called Turing,
  129. which, as far as I'm aware, has only
  130. two projects in the archive written in
  131. Turing, and you can see some of my
  132. favorite languages like Ruby are somewhere
  133. here in this kind of moderate popularity.
  134. So we represent that data, it's the
  135. same data, now what I've just plotted here is
  136. log abundance on this axis and here is the
  137. language rank, but it's linear, as opposed
  138. to here, where I showed you the log, ok?
  139. This is a log-log plot, this is now a
  140. log-linear plot, so this is the
    actual data.
  141. So, the first thing we'll try
  142. is a
  143. maximum entropy distribution for language
  144. abundance, in other words, for the
  145. probability of finding a language with n
  146. projects, and what we are gonna do is
  147. we are gonna constrain only one thing,
  148. the average popularity of a language,
  149. this is gonna be our one constraint for
  150. the maximum entropy problem, and we are
  151. gonna pick the probability distribution
  152. p of n that maximizes the entropy,
  153. negative sum over n from zero to infinity of
  154. p log p, we're gonna maximize this
  155. quantity subject to this constraint, and
  156. of course, always subject to the
  157. normalization constraint, that the sum of p(n) is
  158. equal to unity, ok? So that's my other
  159. constraint, and of course, we know how to
  160. do this problem already, we know the
  161. functional form, it's exactly the same
  162. problem as the one you learnt to do when
  163. you modeled the waiting time for a
  164. New York City taxi cab, ok? You modeled
  165. that problem exactly the same way, there we
  166. only constrained the average of the
  167. waiting time, here we're only gonna
  168. constrain the average popularity of a
  169. language, popularity meaning the number of
  170. projects written in that language on
  171. the archive, and so, we know what the
  172. functional form will look like, it looks
  173. something like e to the negative
  174. lambda n, ok? all over Z, and then all we
  175. have to do is fit lambda and Z, ok? So
  176. that we reproduce the correct abundance
  177. that we see in the data, so this is the
  178. maximum entropy distribution, it's also,
  179. of course, an exponential model, it has an
  180. exponential form, and if we actually find the
  181. lambda and Z that best reproduce the data,
  182. in other words, that best satisfy this
  183. constraint, and are otherwise
    maximum entropy,
  184. here is what we find, ok?
  185. This red band here is the 1 and 2 sigma
  186. contours for the rank abundance
  187. distribution, and all I want you to see on
  188. this graph is an incredibly
    bad fit to this.
  189. The maximum entropy distribution
  190. with this constraint,
  191. does not reproduce the data,
  192. is not capable of explaining,
  193. or modeling, the data in any reasonable
  194. way, it radically underpredicts
  195. these really extremely popular languages,
  196. it's unable to, in other words, reproduce
  197. the fact that there are languages like
  198. C and Python, that are extremely popular,
  199. it overpredicts this kind of mesoscopic
  200. regime, it overpredicts the moderately
  201. popular languages, right, and it also
  202. overpredicts those really rare
  203. birds, those really low rank languages,
  204. with very few examples in the archive,
  205. so this is right, this is science fail.