## MaxEnt 7: A Real-World Example: Modeling the Open Source Ecosystem, Part 1


Showing Revision 9 created 07/17/2017 by Matias Agelvis.

In the previous section I walked you through a whole set of mathematical derivations designed to produce a maximum entropy model of what was essentially a toy problem: how can we model the distribution of arrival times of New York City taxicabs based on a small set of data? That is a problem that might well be of interest to you personally, but it is certainly not of profound scientific interest. In this section, what I am going to do is present a reasonably interesting scientific problem and show how the maximum entropy approach can illuminate some interesting features of that system.
The particular system we have in mind is what people have tended to call the open source ecosystem. It is the large community of people devoted to writing code in such a way that it is open, accessible, and debuggable, and in general produced not by an individual, and not by a corporation with copyright protections and control over the source, but rather by a community of people sharing and editing each other's code. The open source ecosystem is something that dominates a large fraction of the computer code running for us today, including not only Mac OS X but of course Linux. It is a great success story, and we would like to study it scientifically.
I use the word ecosystem advisedly, in part because a lot of what I am going to tell you now is a set of tools and derivations that I learned from John Harte, and from people who have worked with John Harte on the maximum entropy approach not to social systems but to biological systems, and in particular to ecosystems. I recommend John Harte's book "Maximum Entropy and Ecology" to you as a guide to the kinds of tools that I am going to show you now. My goal here is really to show you that even simple arguments based on maximum entropy can provide some really deep scientific insight.
What I am going to take as my source of data, because I am going to study the empirical world now, is drawn from SourceForge. SourceForge is no longer the most popular repository of open source software; GitHub has perhaps now eclipsed it. But for a long period, from perhaps 1999 up to 2011, when we gathered this data, it accumulated an enormous archive of projects ranging from different kinds of computer games to text editors to business and mathematical software; some of the code that I have used in my own research is up on SourceForge. It is a great place to study, in particular, the use of computer languages.
Here, what I have plotted is the distribution of languages used in the open source community and found on SourceForge. On the x-axis is the log of the number of projects written in a particular language. You can see that at log zero, that is, one project, there are about twelve languages in the database that have only on the order of one project. These languages are, in other words, extremely rare in the open source movement. Conversely, at the other end of this logarithmic scale, at four, so ten to the four, that is ten thousand, we see a small number of extremely popular languages. These are the most common languages you will find on SourceForge; they have a runaway popularity. And if you know anything about computer programming, it will not surprise you to learn that these are mostly languages derived from C, such as C itself, C++, and Java. Somewhere in the middle, between the extremely rare birds and the incredibly common languages (which are sort of like the bacteria of the open source movement), you have a larger number of moderately popular languages. So this distribution of languages is what we are going to try to explain using maximum entropy methods: a small number of rare languages, a larger number of moderately popular ones, and then again a very small number of wildly popular languages.
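The shape of that histogram is easy to sketch. This is a minimal illustration with made-up project counts (not the real SourceForge numbers): languages are binned by the integer part of log10 of their project count, just as the plot's x-axis does.

```python
import math
from collections import Counter

# Hypothetical per-language project counts (illustrative, not the real
# SourceForge data): many rare languages, a middle class, a few giants
counts = [1, 1, 2, 3, 5, 8, 12, 40, 150, 600, 2500, 9000, 15000, 20000]

# Bin languages by the integer part of log10(number of projects):
# bin 0 holds the ~1-project languages, bin 4 the ~10^4-project giants
bins = Counter(int(math.log10(n)) for n in counts)

for b in sorted(bins):
    print(f"~10^{b} projects: {bins[b]} languages")
```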
I have plotted that as a probability distribution, P(n), where n is the number of projects in the open source community that use your language, so P(n) is the probability that your language has n projects in the open source community. What we would like to do is build a maximum entropy model of this distribution. Here I have represented the same data in a slightly different way, which is how people tend to represent it: a rank abundance distribution, analogous to what ecologists call a species abundance distribution. The top-ranked language, rank one, is the language with the largest number of projects, and it will not surprise you to learn that it turns out to be Java; there are about twenty thousand projects written in Java. You can see the second-ranked language is C++, then C, then PHP, and the far rarer languages have much lower ranks; higher numbers mean lower ranks, as in third place, fourth place, hundredth place. Down in hundredth place is the very unfortunate language called Turing, which, as far as I am aware, has only two projects written in it in the archive. And you can see some of my favorite languages, like Ruby, somewhere in this moderate popularity zone. So this representation is the same data, but now I have plotted log abundance on one axis against language rank on the other, which is linear, as opposed to the earlier plot, where I showed you the log. That was a log-log plot; this is now a log-linear plot. So this is the actual data.
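Building a rank-abundance distribution from raw counts is just a sort. A minimal sketch, using hypothetical project counts (only the rough Java and Turing figures come from the lecture; the rest are made up for illustration):

```python
import math

# Hypothetical project counts per language (illustrative; the lecture only
# quotes Java at roughly 20,000 projects and Turing at 2)
counts = {"Java": 20000, "C++": 16000, "C": 14000, "PHP": 11000,
          "Ruby": 2500, "Haskell": 300, "Turing": 2}

# Rank-abundance: sort languages from most to fewest projects;
# rank 1 is the most popular language
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

for rank, (lang, n) in enumerate(ranked, start=1):
    # log10 abundance is the y-axis of the log-linear rank-abundance plot
    print(f"rank {rank}: {lang:8s} n={n:6d} log10(n)={math.log10(n):.2f}")
```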
So the first thing we will try is a maximum entropy distribution for language abundance, in other words, for the probability of finding a language with n projects. What we are going to do is constrain only one thing: the average popularity of a language. That is going to be our one constraint for the maximum entropy problem, and we are going to pick the probability distribution p(n) that maximizes the entropy, the negative sum from zero to infinity of p log p. We will maximize this quantity subject to that constraint, and of course always subject to the normalization constraint, that the sum of p(n) is equal to unity. So that is my other constraint. And of course we know how to do this problem already; we know the functional form. It is exactly the same problem as the one you learned to do when you modeled the waiting time for a New York City taxicab. You modeled that problem in exactly the same way: there I constrained only the average waiting time; here we constrain only the average popularity of a language, where popularity means the number of projects written in that language in the archive. And so we know what the functional form will look like: it looks like e to the negative lambda n, all over Z. Then all we have to do is fit lambda and Z so that we reproduce the correct average abundance that we see in the data. So this is the maximum entropy distribution; it is also, of course, an exponential model, it has an exponential form. And if we actually find the lambda and Z that best reproduce the data, in other words, that best satisfy this constraint and are otherwise maximum entropy,
here is what we find. The red band is the one- and two-sigma contour for the model's rank abundance distribution, and all I want you to see in this graph is an incredible mismatch. The maximum entropy distribution with this constraint does not reproduce the data; it is not capable of explaining, or modeling, the data in any reasonable way. It radically underpredicts the extremely popular languages; it is unable, in other words, to reproduce the fact that there are languages like C and Python that are extremely popular. It overpredicts the mesoscopic regime, the moderately popular languages, and it also overpredicts those really rare birds, the really low-ranked languages with very few examples in the archive. So this is, that's right, a science fail.
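The failed fit is easy to reproduce in outline. Maximizing the entropy subject to normalization and a fixed mean gives p(n) = e^(-λn)/Z, which on n = 0, 1, 2, ... is a geometric distribution with Z = 1/(1 - e^(-λ)); the mean constraint then fixes λ in closed form as λ = ln(1 + 1/⟨n⟩). A minimal sketch, assuming an illustrative mean popularity of 500 projects (the actual SourceForge mean is not quoted in the lecture):

```python
import math

mean_n = 500.0  # assumed mean popularity; illustrative, not from the data

# Closed-form fit: <n> = e^(-lam) / (1 - e^(-lam))  =>  lam = ln(1 + 1/<n>)
lam = math.log(1.0 + 1.0 / mean_n)
Z = 1.0 / (1.0 - math.exp(-lam))  # normalization over n = 0, 1, 2, ...

def p(n):
    """Maximum entropy (exponential) probability of a language with n projects."""
    return math.exp(-lam * n) / Z

# Both constraints are satisfied (sums truncated deep in the tail)...
total = sum(p(n) for n in range(200_000))
avg = sum(n * p(n) for n in range(200_000))
print(f"sum p(n) = {total:.6f}, <n> = {avg:.1f}")

# ...but the exponential tail is far too thin: a Java-sized language with
# 20,000 projects is essentially impossible under this model, which is
# exactly the failure visible in the rank-abundance comparison
print(f"p(20000) = {p(20000):.3e}")
```

The point of the sketch is the last line: with only the mean constrained, the exponential decay assigns vanishing probability to the runaway-popular languages the data actually contains.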