1 00:00:17,242 --> 00:00:20,037 Six thousand miles of road, 2 00:00:20,037 --> 00:00:22,018 600 miles of subway track, 3 00:00:22,018 --> 00:00:24,037 400 miles of bike lanes, 4 00:00:24,037 --> 00:00:25,721 and half a mile of tram track, 5 00:00:25,721 --> 00:00:27,739 if you've ever been to Roosevelt Island. 6 00:00:27,739 --> 00:00:30,576 These are the numbers that make up the infrastructure of NYC, 7 00:00:30,576 --> 00:00:32,744 these are the statistics of our infrastructure. 8 00:00:32,744 --> 00:00:35,780 They're the kind of numbers released in reports by city agencies. 9 00:00:35,780 --> 00:00:38,953 For example, the Department of Transportation will probably tell you 10 00:00:38,953 --> 00:00:40,700 how many miles of road they maintain. 11 00:00:40,700 --> 00:00:43,501 The MTA will boast how many miles of subway track there are. 12 00:00:43,501 --> 00:00:45,565 But most city agencies give us statistics. 13 00:00:45,565 --> 00:00:48,802 This is from a report this year from the Taxi & Limousine Commission, 14 00:00:48,802 --> 00:00:53,019 where we've learned that there are about 13,500 taxis here in NYC. 15 00:00:53,019 --> 00:00:54,346 Pretty interesting, right? 16 00:00:54,346 --> 00:00:57,135 But did you ever think about where these numbers came from? 17 00:00:57,135 --> 00:01:00,061 Because for these numbers to exist, somebody at the city agency 18 00:01:00,061 --> 00:01:03,671 has to stop and say, "Hmm, here's a number that somebody might want to know. 19 00:01:03,671 --> 00:01:05,900 Here's a number that our citizens want to know." 20 00:01:05,900 --> 00:01:07,640 So they go back to their raw data, 21 00:01:07,640 --> 00:01:09,424 they count, they add, they calculate, 22 00:01:09,424 --> 00:01:11,002 and then they put out reports. 23 00:01:11,002 --> 00:01:13,501 And those reports will have numbers like this. 24 00:01:13,501 --> 00:01:16,043 The problem is, how do they know all of our questions? 25 00:01:16,043 --> 00:01:17,485 We have lots of questions. 
26 00:01:17,485 --> 00:01:20,764 In fact, in some ways there's literally an infinite number of questions 27 00:01:20,764 --> 00:01:22,260 that we can ask about our city. 28 00:01:22,260 --> 00:01:23,893 So the agencies can never keep up. 29 00:01:23,893 --> 00:01:25,660 So the paradigm isn't exactly working 30 00:01:25,660 --> 00:01:28,261 and I think our policy makers realize that 31 00:01:28,261 --> 00:01:31,632 because in 2012, Mayor Bloomberg signed into law what he called 32 00:01:31,632 --> 00:01:35,791 the most ambitious and comprehensive open data legislation in the country. 33 00:01:35,791 --> 00:01:37,568 In a lot of ways he's right. 34 00:01:37,568 --> 00:01:42,390 In the last two years the city's released 1,000 data sets on our open data portal 35 00:01:42,390 --> 00:01:44,123 and it's pretty awesome. 36 00:01:44,123 --> 00:01:45,559 You look at data like this, 37 00:01:45,559 --> 00:01:47,623 and instead of counting the number of cabs, 38 00:01:47,623 --> 00:01:49,600 we can start to ask different questions. 39 00:01:49,600 --> 00:01:52,373 So I had a question: When is rush hour in NYC? 40 00:01:52,373 --> 00:01:55,092 It can be pretty bothersome. When is rush hour exactly? 41 00:01:55,092 --> 00:01:57,820 And I thought to myself, these cabs aren't just numbers, 42 00:01:57,820 --> 00:02:01,134 these are GPS recorders driving around in our city's streets recording 43 00:02:01,134 --> 00:02:02,684 each and every ride they take. 44 00:02:02,684 --> 00:02:03,810 There's data there. 45 00:02:03,810 --> 00:02:05,845 And I looked at that data and I made a plot 46 00:02:05,845 --> 00:02:08,573 of the average speed of taxis in NYC throughout the day. 47 00:02:08,573 --> 00:02:12,983 You can see that from around midnight to around 5:18 AM, speed increases, 48 00:02:12,983 --> 00:02:15,860 and at that point, things turn around. 
49 00:02:15,860 --> 00:02:19,834 They get slower, slower and slower until about 8:35 AM 50 00:02:19,834 --> 00:02:22,591 when they end up at 11.5 mph. 51 00:02:22,591 --> 00:02:25,741 The average taxi is going at 11.5 mph in our city streets, 52 00:02:25,741 --> 00:02:28,219 and it turns out it stays that way 53 00:02:28,219 --> 00:02:30,697 for the entire day. 54 00:02:30,697 --> 00:02:33,175 (Laughter) 55 00:02:33,175 --> 00:02:36,011 So I said to myself, I guess there's no rush hour in NYC, 56 00:02:36,011 --> 00:02:37,376 there's just a "rush day." 57 00:02:37,376 --> 00:02:38,241 (Laughter) 58 00:02:38,241 --> 00:02:39,158 Makes sense. 59 00:02:39,158 --> 00:02:41,138 This is important for a couple of reasons. 60 00:02:41,138 --> 00:02:44,803 If you are a transportation planner, this might be pretty interesting to know. 61 00:02:44,803 --> 00:02:46,732 But if you want to get somewhere quickly 62 00:02:46,732 --> 00:02:49,699 you now know to set your alarm for 4:45 AM and you're all set. 63 00:02:49,699 --> 00:02:50,497 New York, right? 64 00:02:50,497 --> 00:02:52,181 But there's a story behind this data, 65 00:02:52,181 --> 00:02:54,143 it wasn't just available, as it turns out. 66 00:02:54,143 --> 00:02:57,733 It actually came from something called a Freedom of Information Law Request, 67 00:02:57,733 --> 00:02:58,658 or a FOIL Request. 68 00:02:58,658 --> 00:03:01,632 This is a form you can find on the Taxi & Limousine Commission website. 69 00:03:01,632 --> 00:03:04,453 In order to access this data, you need to go get this form, 70 00:03:04,453 --> 00:03:06,475 fill it out, and they will notify you. 71 00:03:06,475 --> 00:03:09,282 And a guy named Chris Whong did exactly that. 72 00:03:09,282 --> 00:03:10,973 Chris went down and they told him, 73 00:03:10,973 --> 00:03:13,750 "Just bring a brand new hard drive to our office, 74 00:03:13,750 --> 00:03:17,040 leave it here for 5 hours, we'll copy the data and you take it back." 
75 00:03:17,040 --> 00:03:19,305 And that's where this data came from. 76 00:03:19,305 --> 00:03:22,349 Now, Chris is the kind of guy that wants to make the data public, 77 00:03:22,349 --> 00:03:25,975 so it ended up online for all to use and that's where this graph came from. 78 00:03:25,975 --> 00:03:27,859 And the fact that it exists is amazing. 79 00:03:27,859 --> 00:03:29,804 These GPS recorders - really cool! 80 00:03:29,804 --> 00:03:32,826 But the fact that we have citizens walking around with hard drives 81 00:03:32,826 --> 00:03:35,353 picking up data from city agencies to make it public - 82 00:03:35,353 --> 00:03:37,691 it was already kind of public, you could get to it, 83 00:03:37,691 --> 00:03:39,517 but it was "public", it wasn't public. 84 00:03:39,517 --> 00:03:41,572 And we can do better than that as a city, 85 00:03:41,572 --> 00:03:44,349 we don't need our citizens walking around with hard drives. 86 00:03:44,349 --> 00:03:47,252 Now, not every dataset is behind a FOIL request. 87 00:03:47,252 --> 00:03:50,809 Here's a map I made with the most dangerous intersections in NYC 88 00:03:50,809 --> 00:03:53,086 based on cyclist accidents. 89 00:03:53,086 --> 00:03:54,875 So the red areas are more dangerous. 90 00:03:54,875 --> 00:03:57,240 What it shows is first the East side of Manhattan, 91 00:03:57,240 --> 00:04:01,058 especially in the lower area of Manhattan, has more cycle accidents. 92 00:04:01,058 --> 00:04:02,244 That might make sense 93 00:04:02,244 --> 00:04:05,320 because there are more cyclists coming off the bridges over there. 94 00:04:05,320 --> 00:04:07,386 But there are other hotspots worth studying. 95 00:04:07,386 --> 00:04:10,047 There's Williamsburg. There's Roosevelt Avenue in Queens. 96 00:04:10,047 --> 00:04:12,754 This is exactly the type of data we need for Vision Zero. 97 00:04:12,754 --> 00:04:14,728 This is exactly what we're looking for. 98 00:04:14,728 --> 00:04:16,778 But there's a story behind this data as well. 
99 00:04:16,778 --> 00:04:18,304 This data didn't just appear. 100 00:04:18,304 --> 00:04:20,825 How many of you guys know this logo? 101 00:04:20,825 --> 00:04:22,069 Yeah, I see some shakes. 102 00:04:22,069 --> 00:04:24,753 Have you ever tried to copy and paste data out of a PDF 103 00:04:24,753 --> 00:04:25,950 and make sense of it? 104 00:04:25,950 --> 00:04:27,295 I see more shakes. 105 00:04:27,295 --> 00:04:30,683 More of you tried copying and pasting than knew the logo. I like that. 106 00:04:30,683 --> 00:04:33,731 What happened is, the data that you just saw was actually in a PDF. 107 00:04:33,731 --> 00:04:39,474 In fact, hundreds and hundreds of pages of PDF put out by our own NYPD, 108 00:04:39,474 --> 00:04:40,772 and in order to access it, 109 00:04:40,772 --> 00:04:44,075 you either have to copy and paste for hundreds and hundreds of hours, 110 00:04:44,075 --> 00:04:45,590 or you could be John Krauss. 111 00:04:45,590 --> 00:04:46,861 John Krauss is like, 112 00:04:46,861 --> 00:04:50,232 I'm not going to copy and paste this data, I'm going to write a program. 113 00:04:50,232 --> 00:04:52,384 It's called the NYPD Crash Data Band-Aid. 114 00:04:52,384 --> 00:04:55,227 And it would go to the NYPD's website and it would download PDFs. 115 00:04:55,227 --> 00:04:56,722 Every day it would search; 116 00:04:56,722 --> 00:04:58,642 if it found a PDF, it would download it, 117 00:04:58,642 --> 00:05:00,912 and it would run some PDF-scraping program, 118 00:05:00,912 --> 00:05:02,296 and out would come the text 119 00:05:02,296 --> 00:05:05,616 and it would go on the Internet, and people could make maps like that. 120 00:05:05,616 --> 00:05:08,831 And the fact that the data is here, that we can have access to it - 121 00:05:08,831 --> 00:05:11,452 every accident, by the way, is a row on this table. 122 00:05:11,452 --> 00:05:13,280 You can imagine how many PDFs that is. 123 00:05:13,280 --> 00:05:15,463 The fact that we have access to that is great. 
124 00:05:15,463 --> 00:05:17,836 But let's not release it in PDF form. 125 00:05:17,836 --> 00:05:20,570 Because then we're having our citizens write PDF scrapers. 126 00:05:20,570 --> 00:05:22,706 It's not the best use of our citizens' time, 127 00:05:22,706 --> 00:05:24,724 and we, as a city, can do better than that. 128 00:05:24,724 --> 00:05:27,096 The good news is that the de Blasio Administration 129 00:05:27,096 --> 00:05:30,077 actually released this data a few months ago, 130 00:05:30,077 --> 00:05:31,756 so now, we can have access to it. 131 00:05:31,756 --> 00:05:34,353 But there's a lot of data still entombed in PDF. 132 00:05:34,353 --> 00:05:37,831 For example, our crime data is still only available in PDF. 133 00:05:37,831 --> 00:05:39,412 And not just our crime data, 134 00:05:39,412 --> 00:05:41,638 our own city budget. 135 00:05:41,638 --> 00:05:45,406 Our city budget is only readable right now in PDF form. 136 00:05:45,406 --> 00:05:47,441 And it's not just us that can't analyze it - 137 00:05:47,441 --> 00:05:50,152 our own legislators, who vote for the budget, 138 00:05:50,152 --> 00:05:52,085 also only get it in PDF. 139 00:05:52,085 --> 00:05:55,892 So our legislators cannot analyze the budget that they are voting for. 140 00:05:55,892 --> 00:05:59,597 And I think as a city we can do a little better than that as well. 141 00:05:59,597 --> 00:06:02,082 Now, there's a lot of data that's not hidden in PDFs. 142 00:06:02,082 --> 00:06:03,839 This is an example of a map I made. 143 00:06:03,839 --> 00:06:07,003 And these are the dirtiest waterways in NYC. 144 00:06:07,003 --> 00:06:08,319 How do I measure dirty? 145 00:06:08,319 --> 00:06:09,945 Well, it's kind of a little weird, 146 00:06:09,945 --> 00:06:12,418 but I looked at the level of fecal coliform, 147 00:06:12,418 --> 00:06:15,627 which is a measurement of fecal matter in each of our waterways. 148 00:06:15,627 --> 00:06:19,068 The larger the circle, the dirtier the water. 
149 00:06:19,068 --> 00:06:22,273 The large circles are dirty waters, the smaller circles are cleaner. 150 00:06:22,273 --> 00:06:24,211 What you see is inland waterways. 151 00:06:24,211 --> 00:06:27,460 This is all data that was sampled by the city over the last 5 years. 152 00:06:27,460 --> 00:06:29,716 And inland waterways are, in general, dirtier. 153 00:06:29,716 --> 00:06:31,132 That makes sense, right? 154 00:06:31,132 --> 00:06:32,970 And I learned a few things from this. 155 00:06:32,970 --> 00:06:39,277 Number 1: never swim in anything that ends in creek or canal. 156 00:06:39,277 --> 00:06:42,351 Number 2: I also found the dirtiest waterway in New York City 157 00:06:42,351 --> 00:06:44,047 by this measure, one measure. 158 00:06:44,047 --> 00:06:45,120 It's Coney Island Creek, 159 00:06:45,120 --> 00:06:48,476 which is not the Coney Island you swim in, luckily, it's on the other side. 160 00:06:48,476 --> 00:06:52,685 But Coney Island Creek, 94% of samples taken over the last 5 years 161 00:06:52,685 --> 00:06:55,220 have had fecal levels so high 162 00:06:55,220 --> 00:06:58,471 that it would be against state law to swim in the water. 163 00:06:58,471 --> 00:07:01,099 And this is not the kind of fact that you're going to see 164 00:07:01,099 --> 00:07:03,767 boasted in a city report or on the front page of nyc.gov. 165 00:07:03,767 --> 00:07:05,313 You're not going to see it there, 166 00:07:05,313 --> 00:07:08,125 but the fact that we can get to that data is awesome. 167 00:07:08,125 --> 00:07:09,925 Once again, it wasn't super easy, 168 00:07:09,925 --> 00:07:12,251 because this data was not on the open data portal. 169 00:07:12,251 --> 00:07:14,255 If you were to go to the open data portal, 170 00:07:14,255 --> 00:07:16,774 you'd see just a snippet of it, a year or a few months. 171 00:07:16,774 --> 00:07:20,078 It was actually on the Department of Environmental Protection's website.
172 00:07:20,078 --> 00:07:24,023 Each one of these links is an Excel sheet, and each Excel sheet is different. 173 00:07:24,023 --> 00:07:26,667 Every heading is different: you copy, paste, reorganize. 174 00:07:26,667 --> 00:07:29,592 When you do, you can make maps and that's great, but once again, 175 00:07:29,592 --> 00:07:32,473 we can do better than that as a city, we can normalize things. 176 00:07:32,473 --> 00:07:35,653 We're getting there because there's this website that Socrata makes, 177 00:07:35,653 --> 00:07:37,131 called the Open Data Portal NYC. 178 00:07:37,131 --> 00:07:39,275 This is where 1,100 data sets that don't suffer 179 00:07:39,275 --> 00:07:40,829 from the things I told you about live, 180 00:07:40,829 --> 00:07:42,899 and that number is growing, and that's great. 181 00:07:42,899 --> 00:07:46,695 You can download data in any format, be it CSV or PDF or Excel document. 182 00:07:46,695 --> 00:07:49,700 Whatever you want, you can download the data that way. 183 00:07:49,700 --> 00:07:51,156 The problem is, once you do, 184 00:07:51,156 --> 00:07:55,625 you'll find that each agency codes their addresses differently. 185 00:07:55,625 --> 00:07:57,670 So, one is street name, intersection street, 186 00:07:57,670 --> 00:08:00,155 street, borough, address building, building, address. 187 00:08:00,155 --> 00:08:03,343 So, once again, you're spending time, even when we have this portal, 188 00:08:03,343 --> 00:08:05,862 you're spending time normalizing our address field. 189 00:08:05,862 --> 00:08:08,380 I think that's not the best use of our citizens' time, 190 00:08:08,380 --> 00:08:10,227 we can do better than that as a city. 191 00:08:10,227 --> 00:08:11,863 We can standardize our addresses. 192 00:08:11,863 --> 00:08:13,811 If we do, we can get more maps like this. 193 00:08:13,811 --> 00:08:16,062 This is a map of fire hydrants in New York City. 194 00:08:16,062 --> 00:08:17,645 But not just any fire hydrant.
195 00:08:17,645 --> 00:08:20,171 These are the top 250 grossing fire hydrants 196 00:08:20,171 --> 00:08:22,862 in terms of parking tickets. 197 00:08:22,862 --> 00:08:24,988 (Laughter) 198 00:08:24,988 --> 00:08:27,109 So I learned a few things from this map. 199 00:08:27,109 --> 00:08:30,239 Number 1: just don't park on the Upper East side. 200 00:08:30,239 --> 00:08:33,516 Just don't. No matter where you park, you will get a hydrant ticket. 201 00:08:33,516 --> 00:08:37,952 Number 2: I found the two highest grossing hydrants in all of New York City. 202 00:08:37,952 --> 00:08:39,475 They are on the Lower East side, 203 00:08:39,475 --> 00:08:44,597 and they are bringing in over 55,000 dollars a year in parking tickets. 204 00:08:44,597 --> 00:08:47,262 And that seemed a little strange to me when I noticed it, 205 00:08:47,262 --> 00:08:49,374 so I did a little digging, and it turns out 206 00:08:49,374 --> 00:08:52,610 what you had is a hydrant and something called a curb extension, 207 00:08:52,610 --> 00:08:54,663 which is like a seven-foot space to walk on, 208 00:08:54,663 --> 00:08:55,846 and then a parking spot. 209 00:08:55,846 --> 00:08:57,940 So these cars came along and the hydrant - 210 00:08:57,940 --> 00:08:59,785 "It's all the way over there, I'm fine," 211 00:08:59,785 --> 00:09:03,254 and there was actually a parking spot painted there beautifully for them. 212 00:09:03,254 --> 00:09:06,269 They would park there and the NYPD disagreed with the designation, 213 00:09:06,269 --> 00:09:07,562 and would ticket them. 214 00:09:07,562 --> 00:09:09,905 And it wasn't just me who found a parking ticket. 215 00:09:09,905 --> 00:09:13,612 This is the Google street view car driving by, finding the same parking ticket.
216 00:09:13,612 --> 00:09:16,076 So I wrote about this on my blog, on I Quant NY, 217 00:09:16,076 --> 00:09:18,465 and the DOT responded and they said, 218 00:09:18,465 --> 00:09:22,792 "While the DOT has not received any complaints about this location, 219 00:09:22,792 --> 00:09:27,341 we will review the roadway markings and make any appropriate alterations." 220 00:09:27,341 --> 00:09:30,329 I thought to myself, you know, typical government response, 221 00:09:30,329 --> 00:09:31,944 all right, moved on with my life. 222 00:09:31,944 --> 00:09:36,519 But then, a few weeks later, something incredible happened. 223 00:09:36,519 --> 00:09:38,638 They repainted the spot. 224 00:09:38,638 --> 00:09:41,299 And for a second I thought I saw the future of open data 225 00:09:41,299 --> 00:09:43,216 because think about what happened here. 226 00:09:43,216 --> 00:09:48,423 For five years, this spot was being ticketed, and it was confusing. 227 00:09:48,423 --> 00:09:52,804 And then a citizen found something, they told the city and within a few weeks, 228 00:09:52,804 --> 00:09:55,286 the problem was fixed. It's amazing. 229 00:09:55,286 --> 00:09:58,092 A lot of people see open data as being a watchdog; it's not. 230 00:09:58,092 --> 00:09:59,375 It's about being a partner. 231 00:09:59,375 --> 00:10:02,504 We can empower our citizens to be better partners for government, 232 00:10:02,504 --> 00:10:04,081 and it's not that hard. 233 00:10:04,081 --> 00:10:05,530 All we need are a few changes. 234 00:10:05,530 --> 00:10:06,663 If you're FOILing data, 235 00:10:06,663 --> 00:10:09,305 if you're seeing your data being FOILed over and over again, 236 00:10:09,305 --> 00:10:12,109 let's release it to the public, that's a sign that it should be made public.
237 00:10:12,109 --> 00:10:15,015 And if you're a government agency releasing a PDF, 238 00:10:15,015 --> 00:10:18,909 let's pass legislation that requires you to post it with your underlying data, 239 00:10:18,909 --> 00:10:20,944 because that data is coming from somewhere. 240 00:10:20,944 --> 00:10:23,649 I don't know where, but you can release it with the PDF. 241 00:10:23,649 --> 00:10:25,981 And let's adopt and share some open data standards. 242 00:10:25,981 --> 00:10:28,680 Let's start with our addresses here in New York City. 243 00:10:28,680 --> 00:10:30,707 Let's just start normalizing our addresses. 244 00:10:30,707 --> 00:10:32,691 Because New York is a leader in open data. 245 00:10:32,691 --> 00:10:35,318 Despite all this, we're absolutely a leader in open data, 246 00:10:35,318 --> 00:10:38,375 and if we start normalizing things, and set an open data standard, 247 00:10:38,375 --> 00:10:39,332 others will follow. 248 00:10:39,332 --> 00:10:41,834 The state will follow, maybe the federal government, 249 00:10:41,834 --> 00:10:43,393 other countries could follow, 250 00:10:43,393 --> 00:10:47,112 and we're not that far off from a time where you can write one program 251 00:10:47,112 --> 00:10:49,176 and map information from 100 countries. 252 00:10:49,176 --> 00:10:51,750 It's not science fiction, we're actually quite close. 253 00:10:51,750 --> 00:10:54,037 And by the way, who are we empowering with this? 254 00:10:54,037 --> 00:10:57,614 Because it's not just John Krauss, it's not just Chris Whong. 255 00:10:57,614 --> 00:11:00,905 There are hundreds of meetups going on in New York City right now, 256 00:11:00,905 --> 00:11:02,027 active meetups. 257 00:11:02,027 --> 00:11:04,606 There are thousands of people attending these meetups.
258 00:11:04,606 --> 00:11:06,973 These people are going after work and on weekends, 259 00:11:06,973 --> 00:11:09,758 and they're attending these meetups to look at open data, 260 00:11:09,758 --> 00:11:11,384 and make our city a better place. 261 00:11:11,384 --> 00:11:15,929 Groups like BetaNYC, who just last week released something called citygram.nyc 262 00:11:15,929 --> 00:11:18,161 that allows you to subscribe to 311 complaints 263 00:11:18,161 --> 00:11:20,305 around your own home, or around your office. 264 00:11:20,305 --> 00:11:22,686 You put in your address, you get local complaints. 265 00:11:22,686 --> 00:11:25,774 And it's not just the tech community that is after these things. 266 00:11:25,774 --> 00:11:28,399 It's urban planners like the students I teach at Pratt. 267 00:11:28,399 --> 00:11:30,363 It's policy advocates, it's everyone, 268 00:11:30,363 --> 00:11:32,919 it's citizens from a diverse set of backgrounds. 269 00:11:32,919 --> 00:11:35,680 And with some small incremental changes, 270 00:11:35,680 --> 00:11:38,788 we can unlock the passion and the ability of our citizens 271 00:11:38,788 --> 00:11:41,897 to harness open data and make our city even better, 272 00:11:41,897 --> 00:11:45,958 whether it's one data set or one parking spot at a time. 273 00:11:45,958 --> 00:11:47,213 Thank you. 274 00:11:47,213 --> 00:11:50,274 (Applause)