1 00:00:06,075 --> 00:00:09,777 (chairman) So, let's welcome Egon, 2 00:00:09,777 --> 00:00:13,479 who will describe how he is trying 3 00:00:13,479 --> 00:00:17,579 to improve the coverage of chemical compounds in Wikidata. 4 00:00:18,289 --> 00:00:19,679 (Egon) Yeah, thank you. 5 00:00:20,689 --> 00:00:22,309 So, let's see... 6 00:00:22,684 --> 00:00:24,920 Oh, this is not right. 7 00:00:25,880 --> 00:00:26,900 Sorry. 8 00:00:31,754 --> 00:00:33,323 They put the wrong slide deck. 9 00:00:40,183 --> 00:00:42,645 (person 1) The one was better than-- (person 2) [inaudible] 10 00:00:42,645 --> 00:00:44,142 (person 1) (laughs) 11 00:00:49,061 --> 00:00:50,281 How about this? 12 00:00:52,633 --> 00:00:54,765 This one actually says WikidataCon. 13 00:00:56,540 --> 00:00:58,476 Slightly different slides. 14 00:01:00,169 --> 00:01:05,559 Okay, so yeah, coverage and correctness, 15 00:01:05,559 --> 00:01:08,091 accuracy, quality, if you like. 16 00:01:09,422 --> 00:01:11,590 And the other thing here 17 00:01:11,590 --> 00:01:14,628 is what makes it different from some of the other things 18 00:01:14,628 --> 00:01:17,276 that we've seen at the WikidataCon 19 00:01:17,276 --> 00:01:20,160 is how I do this quality, 20 00:01:20,160 --> 00:01:21,910 and the coverage, actually. 21 00:01:21,910 --> 00:01:25,220 So I'm actually taking advantage here of my background, 22 00:01:25,220 --> 00:01:27,166 which is in cheminformatics, 23 00:01:27,166 --> 00:01:30,112 which is something that we use in our research. 24 00:01:30,329 --> 00:01:32,989 And cheminformatics 25 00:01:32,989 --> 00:01:35,119 is a way to understanding the chemical structures, 26 00:01:35,119 --> 00:01:36,578 what we will see in a moment. 27 00:01:36,578 --> 00:01:39,840 We can do things that we cannot do with the regular toolsets 28 00:01:39,840 --> 00:01:43,524 that we have, like Shape Expressions, quality constraints and sorts. 
29 00:01:45,857 --> 00:01:49,203 Now, one of the interesting things of chemistry 30 00:01:49,496 --> 00:01:51,732 is that chemical structures, 31 00:01:52,852 --> 00:01:55,472 are sometimes the same, sometimes not the same, 32 00:01:55,472 --> 00:01:57,676 depending on what you want to know. 33 00:01:58,096 --> 00:02:00,636 And this slide reflects that a bit, 34 00:02:00,636 --> 00:02:05,006 and what we see here is biologically the same compounds 35 00:02:05,006 --> 00:02:07,943 but chemically two different compounds. 36 00:02:08,496 --> 00:02:11,911 But at biological levels with the [inaudible], 37 00:02:11,911 --> 00:02:13,576 they are in equilibrium, 38 00:02:13,576 --> 00:02:16,485 and you will not be able to really distinguish between them, 39 00:02:16,735 --> 00:02:19,906 unless you're looking for a particular type of biology 40 00:02:19,906 --> 00:02:21,911 like reaction mechanisms. 41 00:02:23,209 --> 00:02:26,979 Another interesting thing about Wikidata and Wikipedia 42 00:02:26,979 --> 00:02:31,546 is that we have things like long-chain fatty acids, 43 00:02:31,546 --> 00:02:34,539 chemical concepts which are not a specific compound 44 00:02:34,539 --> 00:02:36,754 but actually a class of compounds. 45 00:02:37,347 --> 00:02:42,867 Now, this class can be based on similar features in the molecules, 46 00:02:42,867 --> 00:02:46,137 so like in the case of the long-chain fatty acids-- 47 00:02:46,137 --> 00:02:51,777 they all have a long-chain fatty and an acid group. 48 00:02:51,777 --> 00:02:54,101 In other cases, there are the classes 49 00:02:54,101 --> 00:02:57,118 based more on the biological functionality, 50 00:02:57,118 --> 00:03:00,443 like a certain type of inhibitor, like an ACE inhibitor. 51 00:03:01,762 --> 00:03:05,632 And this introduces a lot of interesting things, 52 00:03:05,632 --> 00:03:09,650 partly because of this close link with Wikipedias. 
53 00:03:10,270 --> 00:03:12,602 And one of the things that we see 54 00:03:12,602 --> 00:03:17,982 is that Wikipedia may have a chembox for a particular compound 55 00:03:17,982 --> 00:03:20,467 but actually be about a compound class, 56 00:03:20,467 --> 00:03:25,317 resulting in a slightly different concept 57 00:03:25,317 --> 00:03:27,419 of what the two things are actually meaning 58 00:03:27,419 --> 00:03:29,990 and the sitelink being more complicated. 59 00:03:33,189 --> 00:03:38,899 We need this for understanding the biology. 60 00:03:38,899 --> 00:03:42,921 So the research in our group is understanding the living cell. 61 00:03:43,440 --> 00:03:47,030 The systems biology here described in the biological process 62 00:03:47,030 --> 00:03:50,545 is we have a pathway database for that--WikiPathways, 63 00:03:50,545 --> 00:03:54,022 and if we look at the chemistry in there of the small molecules, 64 00:03:55,659 --> 00:03:59,179 the chemistry is sometimes described in a lot of detail, 65 00:03:59,179 --> 00:04:00,739 sometimes in less detail, 66 00:04:00,739 --> 00:04:04,219 pretty much like this Wikipedia-Wikidata link that we just had 67 00:04:04,219 --> 00:04:08,429 resulting in basically links to a lot of different databases 68 00:04:08,429 --> 00:04:10,424 with slightly different focuses. 69 00:04:10,889 --> 00:04:15,349 Some databases like LIPID MAPS and the Human Metabolome Database, 70 00:04:15,349 --> 00:04:17,581 they are very much focused on the biology, 71 00:04:17,581 --> 00:04:20,743 whereas a database like ChEBI, 72 00:04:20,743 --> 00:04:24,615 that's very much focused on the chemical entities.
73 00:04:26,235 --> 00:04:27,905 So we try to bridge that, 74 00:04:27,905 --> 00:04:33,452 and that two or three years ago gave us a very interesting insight 75 00:04:33,452 --> 00:04:35,871 that if you look at the lines here 76 00:04:35,871 --> 00:04:40,152 where in blue we have the total number of the small molecules 77 00:04:40,152 --> 00:04:41,882 we have in these Pathways 78 00:04:41,882 --> 00:04:44,902 and the numbers in red that we can match to that, 79 00:04:44,902 --> 00:04:48,402 there is this gap, and this gap is complicated chemistry. 80 00:04:49,272 --> 00:04:52,428 Also, poignantly, things missing in Wikidata. 81 00:04:52,779 --> 00:04:55,327 So therefore the need for data completeness 82 00:04:55,327 --> 00:04:57,717 and the data quality. 83 00:05:00,486 --> 00:05:02,184 And here we have an example. 84 00:05:02,184 --> 00:05:05,642 This is actually a curation report of yesterday, 85 00:05:05,642 --> 00:05:08,392 and these are still things that we have in Pathways 86 00:05:08,392 --> 00:05:13,672 but that we do not know what the equivalent thing is in Wikidata. 87 00:05:14,081 --> 00:05:17,527 And one of the things here that I'm picking out here 88 00:05:17,527 --> 00:05:18,932 is strigolactone. 89 00:05:19,117 --> 00:05:21,024 And this is a class of compounds. 90 00:05:22,031 --> 00:05:27,075 So we have that in one of our Pathways, this particular Pathway over there. 91 00:05:29,212 --> 00:05:32,172 So you start matching this to Wikidata and Wikipedia, 92 00:05:32,172 --> 00:05:35,390 and to actually use for this compound Wikipedia page 93 00:05:35,976 --> 00:05:38,866 with these six structures-- 94 00:05:38,866 --> 00:05:42,101 images, name, no links, nothing. 95 00:05:42,101 --> 00:05:43,169 Nothing in Wikipedia, 96 00:05:43,169 --> 00:05:45,561 just this information, not machine-readable.
97 00:05:47,942 --> 00:05:53,022 So based on the name, I can actually find three out of the six 98 00:05:53,022 --> 00:05:54,867 of these compounds in Wikidata, 99 00:05:54,867 --> 00:05:56,642 not linked, not classified. 100 00:05:57,253 --> 00:06:00,433 So if we look at the class of strigolactones, 101 00:06:00,433 --> 00:06:03,015 of which these six are examples, 102 00:06:03,015 --> 00:06:05,305 Wikidata did not give us anything. 103 00:06:05,305 --> 00:06:08,070 So that's the kind of curation that I'm interested in. 104 00:06:09,170 --> 00:06:12,900 On the right here-- that page is actually pretty much empty, 105 00:06:12,900 --> 00:06:17,178 but it's exactly what Scholia is showing for this class of chemical compounds. 106 00:06:18,315 --> 00:06:22,757 So Scholia is one of the tools that I've been using to do this curation. 107 00:06:26,339 --> 00:06:30,548 This missing classification is a bit of information 108 00:06:30,548 --> 00:06:32,057 missing in Wikidata, 109 00:06:32,057 --> 00:06:34,797 but we can add this classification, 110 00:06:34,797 --> 00:06:36,957 and we can retrieve that from some sources. 111 00:06:36,957 --> 00:06:38,687 We will see with LIPID MAPS later, 112 00:06:38,687 --> 00:06:41,157 we can automate adding these missing links, 113 00:06:41,660 --> 00:06:44,371 if we understand the chemistry. 114 00:06:46,274 --> 00:06:50,554 So this diagram over here-- we have fatty acid over there again 115 00:06:50,554 --> 00:06:54,052 and the long-chain fatty acid over here 116 00:06:54,052 --> 00:06:56,672 that we saw on one of the previous slides-- 117 00:06:56,672 --> 00:07:00,998 very long-chain fatty acids and a number of other fatty acids. 118 00:07:01,336 --> 00:07:04,495 This kind of information helps us see 119 00:07:04,495 --> 00:07:09,325 a [inaudible] of the chemistry in Wikidata. 
120 00:07:10,351 --> 00:07:13,545 Scholia can visualize the 2D structure, 121 00:07:13,545 --> 00:07:15,998 and this thing is actually automatically generated 122 00:07:15,998 --> 00:07:18,701 from the chemical structure in Wikidata 123 00:07:18,948 --> 00:07:22,198 on the fly creating in the Scalable Vector Graphics. 124 00:07:22,734 --> 00:07:23,864 (coughing) 125 00:07:24,530 --> 00:07:25,563 Sorry. 126 00:07:25,563 --> 00:07:28,150 With the stereochemistry annotation there 127 00:07:28,150 --> 00:07:32,600 to help the chemist see the completeness of the data 128 00:07:32,600 --> 00:07:35,819 because also the stereochemistry might be missing. 129 00:07:36,452 --> 00:07:40,402 We also get an overview on Scholia of related compounds 130 00:07:40,402 --> 00:07:41,872 based on the InChIKey, 131 00:07:41,872 --> 00:07:45,662 where the first block basically indicates how the atoms are connected 132 00:07:45,905 --> 00:07:50,325 and the second column indicates things like stereochemistry 133 00:07:50,325 --> 00:07:53,838 and things like which isotopes are in there, 134 00:07:54,334 --> 00:07:57,334 for example C11 instead of C12 135 00:07:57,334 --> 00:08:01,974 or C13 instead of C12. 136 00:08:03,441 --> 00:08:06,459 The last number of the last letter over here 137 00:08:06,459 --> 00:08:08,671 actually indicates the charge, 138 00:08:08,671 --> 00:08:13,868 so that's the example that we saw earlier between the citric acid and the citrate, 139 00:08:14,493 --> 00:08:16,641 or was it the acetic acid and the acetate? 140 00:08:20,496 --> 00:08:23,086 By putting in a bit of domain knowledge, 141 00:08:23,086 --> 00:08:27,708 we can do a lot more... making sense of what we have in Wikidata. 142 00:08:29,463 --> 00:08:32,643 A bit more about Scholia 143 00:08:32,643 --> 00:08:36,373 is that about data completeness with the physical and chemical properties, 144 00:08:36,373 --> 00:08:40,070 the literature, those are all things that we want to have access to.
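[Editorial note] The InChIKey layout described in this part of the talk can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `split_inchikey` and `same_skeleton` are hypothetical, and the layout used is the standard one (a 14-letter connectivity block, an 8-letter block for the remaining layers such as stereochemistry and isotopes, a flag letter, a version letter, and a final protonation letter). The first key below is the one for acetic acid; the second is the same key with `M` as its last letter, as expected for the singly deprotonated form.

```python
import re
from typing import NamedTuple

# Standard InChIKey layout: 14 letters, hyphen, 8 + 1 + 1 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    connectivity: str   # hash of how the atoms are connected
    other_layers: str   # hash of stereochemistry, isotopes, and similar layers
    standard_flag: str  # "S" for a standard InChIKey, "N" for a non-standard one
    version: str        # InChI version letter
    protonation: str    # "N" when neutral; other letters mark (de)protonation


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks, or raise ValueError if malformed."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return InChIKeyParts(*match.groups())


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True when two keys share the first (connectivity) block."""
    return split_inchikey(key_a).connectivity == split_inchikey(key_b).connectivity


if __name__ == "__main__":
    acid = "QTBSBXVTEAMEQO-UHFFFAOYSA-N"     # acetic acid
    acetate = "QTBSBXVTEAMEQO-UHFFFAOYSA-M"  # same key, different protonation letter
    print(split_inchikey(acid))
    print(same_skeleton(acid, acetate))
```

Grouping items by the first block is one plausible way to build a "related compounds" overview like the one shown on Scholia; whether Scholia does exactly this is not stated in the talk.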
145 00:08:40,070 --> 00:08:45,317 But it only works if we can find the right chemical in Wikidata. 146 00:08:48,897 --> 00:08:52,467 We started using Wikidata in a number of our projects, 147 00:08:52,467 --> 00:08:54,683 so WikiPathways was one of them. 148 00:08:54,683 --> 00:08:56,043 This is another project 149 00:08:56,043 --> 00:08:59,573 in the area of the nanosafety, risk assessment, 150 00:08:59,573 --> 00:09:02,728 where they use OECD testing guidelines 151 00:09:02,728 --> 00:09:07,662 and using Wikidata here to make an overview of the experiments. 152 00:09:07,662 --> 00:09:11,762 And this means that we can now actually start annotating articles 153 00:09:11,762 --> 00:09:14,443 where these protocols have been used. 154 00:09:15,663 --> 00:09:20,389 And in this way, we get a better insight in the quality of literature as well. 155 00:09:20,895 --> 00:09:23,742 We get to see which DDTs 156 00:09:23,742 --> 00:09:27,268 are well tested, established experimental methods 157 00:09:27,268 --> 00:09:32,193 and an indication of how good the data is that came out of that. 158 00:09:35,299 --> 00:09:38,199 Another example--this is nanomaterials, 159 00:09:38,199 --> 00:09:39,922 specific nanomaterials, 160 00:09:40,702 --> 00:09:43,402 where there is a unique code-- we've added that-- 161 00:09:43,402 --> 00:09:45,277 with the same purpose of being able 162 00:09:45,277 --> 00:09:48,006 to track down literature about these nanomaterials. 163 00:09:48,655 --> 00:09:51,913 But, again, we need exact descriptions. 164 00:09:53,101 --> 00:09:56,241 Now, this is the LIPID MAPS classification, 165 00:09:56,241 --> 00:09:58,011 and here we see an interesting thing, 166 00:09:58,011 --> 00:10:01,986 and this has shown up in some of the presentations, 167 00:10:02,876 --> 00:10:04,236 elsewhere as well. 168 00:10:04,236 --> 00:10:08,109 This idea that some of the things that we have in Wikidata 169 00:10:08,109 --> 00:10:10,309 is not always matching the sources. 
170 00:10:10,309 --> 00:10:12,055 So different ontological models, 171 00:10:12,055 --> 00:10:16,038 different ideas of what a particular thing means. 172 00:10:16,225 --> 00:10:20,955 And so, if we look at the LIPID MAPS, we have a lipid in the middle 173 00:10:20,955 --> 00:10:22,745 and then a number of classes, 174 00:10:22,745 --> 00:10:25,022 and many of these are in Wikidata. 175 00:10:25,454 --> 00:10:30,534 But here, around actually fatty acids or fatty acyls, 176 00:10:30,864 --> 00:10:33,190 that's where there is a mismatch 177 00:10:33,468 --> 00:10:37,288 causing something that should be actually purely hierarchical, 178 00:10:37,288 --> 00:10:41,070 actually it started to show some loops over there, 179 00:10:41,496 --> 00:10:46,189 the mismatch of two representations of a lipid chemistry. 180 00:10:48,636 --> 00:10:51,546 Now, the goal of this work is not so much to reconcile this 181 00:10:51,546 --> 00:10:53,239 but to visualize it 182 00:10:53,239 --> 00:10:55,809 so that we can understand what is going on 183 00:10:55,809 --> 00:10:58,877 and correct things that are actually clearly wrong. 184 00:11:06,444 --> 00:11:09,619 The interesting thing about LIPID MAPS is actually that the classification 185 00:11:10,105 --> 00:11:12,848 is indicated in the external identifier. 186 00:11:12,848 --> 00:11:14,605 So one of the things that we've been using 187 00:11:14,605 --> 00:11:18,845 is these external numbers to make this automatic classification 188 00:11:18,845 --> 00:11:22,420 because everything that starts with an LMFA05 189 00:11:22,600 --> 00:11:25,909 is actually a fatty alcohol. 190 00:11:26,242 --> 00:11:28,424 So I can translate that into Quickstatements, 191 00:11:28,424 --> 00:11:29,982 push that into Quickstatements 192 00:11:29,982 --> 00:11:32,992 and get that annotated in Wikidata.
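[Editorial note] The prefix-based classification described above could look roughly like the following Python sketch. It is not the speaker's Bioclipse script. The function name `classification_statements`, the placeholder class item `Q_FATTY_ALCOHOL`, the choice of `P31` as the classifying property, and the example LIPID MAPS identifiers are all assumptions made for illustration; `Q4115189` and `Q13406268` are the Wikidata sandbox items. The output uses the tab-separated QuickStatements version 1 line format.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: maps a LIPID MAPS identifier prefix to the Wikidata item of the class.
# "Q_FATTY_ALCOHOL" is a placeholder and must be replaced by the real Q-code.
PREFIX_TO_CLASS: Dict[str, str] = {
    "LMFA05": "Q_FATTY_ALCOHOL",
}


def classification_statements(
    items: Iterable[Tuple[str, str]],
    prefix_to_class: Dict[str, str] = PREFIX_TO_CLASS,
    prop: str = "P31",  # assumption: "instance of" is used for the classification
) -> Iterator[str]:
    """Yield one QuickStatements line per item whose LIPID MAPS ID has a known prefix.

    `items` holds (Wikidata Q-code, LIPID MAPS ID) pairs, for example as
    returned by a query for items that carry a LIPID MAPS identifier.
    """
    for qid, lipid_maps_id in items:
        for prefix, class_qid in prefix_to_class.items():
            if lipid_maps_id.startswith(prefix):
                yield f"{qid}\t{prop}\t{class_qid}"
                break  # at most one class per item in this sketch


if __name__ == "__main__":
    example = [
        ("Q4115189", "LMFA05000001"),   # illustrative ID with the fatty alcohol prefix
        ("Q13406268", "LMFA01010001"),  # illustrative ID with a prefix not in the mapping
    ]
    for line in classification_statements(example):
        print(line)
```

Writing the statements to a file first, instead of pushing them directly, keeps the manual check step that the speaker mentions later in the talk.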
193 00:11:40,133 --> 00:11:44,223 This slide is just reflecting the advantage for LIPID MAPS here 194 00:11:44,223 --> 00:11:46,357 we've been collaborating with them 195 00:11:46,357 --> 00:11:50,106 because they get a lot of data out of Wikidata as well, 196 00:11:50,284 --> 00:11:54,786 which we can cross-reference, which we can compare if it's correct. 197 00:11:55,066 --> 00:11:57,973 LIPID MAPS is a quite curated database 198 00:11:58,436 --> 00:12:02,276 but like everyone actually having trouble 199 00:12:02,276 --> 00:12:04,748 with access to literature, 200 00:12:04,748 --> 00:12:07,856 the demand of literature and filtering the literature, 201 00:12:07,856 --> 00:12:10,102 getting to the right articles. 202 00:12:12,626 --> 00:12:15,126 Shape Expressions is probably something that you've seen. 203 00:12:15,126 --> 00:12:18,546 We have a few of them for chemistry now. 204 00:12:18,546 --> 00:12:21,116 This is the example for racemic mixture. 205 00:12:21,116 --> 00:12:26,236 In a case of racemic mixture, you want to have two parts in there. 206 00:12:26,236 --> 00:12:27,358 It's a mixture, 207 00:12:27,358 --> 00:12:30,657 so at least two chemical entities need to be in there. 208 00:12:31,008 --> 00:12:35,561 Moreover, each of the [inaudible] parts has to be a chemical compound. 209 00:12:35,561 --> 00:12:41,311 This is another level of a way we can curate the content. 210 00:12:41,811 --> 00:12:43,755 There have to be more of them. 211 00:12:43,755 --> 00:12:47,775 We have quite a few different concepts in Wikidata 212 00:12:47,775 --> 00:12:49,922 like groups of co-compounds. 213 00:12:50,506 --> 00:12:54,710 There is a class that is of structurally similar compounds, etc.
214 00:12:57,320 --> 00:13:00,532 If you run a query like this, this case for the other one, 215 00:13:00,532 --> 00:13:04,245 other schema that we have for chemical elements, 216 00:13:04,245 --> 00:13:07,059 you can do the same thing-- you can run it on a single item 217 00:13:07,059 --> 00:13:10,400 or you can run that on everything that is a chemical element. 218 00:13:10,400 --> 00:13:13,800 This is something that I can very much recommend 219 00:13:13,800 --> 00:13:16,749 having a look at if you have not done so already. 220 00:13:17,877 --> 00:13:21,550 Now, if we go to the automation of things, 221 00:13:22,172 --> 00:13:26,032 here I'm using a tool called Bioclipse. 222 00:13:26,032 --> 00:13:29,222 This is something that we worked on some time ten years ago. 223 00:13:29,222 --> 00:13:32,246 It's a platform for chemistry and biology, 224 00:13:32,246 --> 00:13:34,646 or cheminformatics and bioinformatics, 225 00:13:34,646 --> 00:13:36,746 aimed at automating things, 226 00:13:39,513 --> 00:13:41,453 including visualizations and sorts. 227 00:13:41,453 --> 00:13:45,385 But I've taken that now and developed a number of scripts 228 00:13:45,385 --> 00:13:47,345 that I can actually run on the command line, 229 00:13:47,345 --> 00:13:50,759 which makes it easier to automate things, as we will see in a moment, 230 00:13:50,759 --> 00:13:52,193 and doing all sorts of things, 231 00:13:52,193 --> 00:13:55,957 for example, the classification according to the LIPID MAPS identifiers, 232 00:13:56,193 --> 00:14:01,717 that's the scripts all available from the GitHub repository here. 233 00:14:02,673 --> 00:14:05,703 And typically, I have them create Quickstatements 234 00:14:05,703 --> 00:14:08,993 because that gets me an additional check step 235 00:14:08,993 --> 00:14:13,872 after I created the Quickstatements and see what the data actually looks like. 236 00:14:15,979 --> 00:14:18,709 Annotation of main subjects.
237 00:14:18,709 --> 00:14:23,144 This one is my script too, starting from SMILES 238 00:14:23,144 --> 00:14:27,473 to actually add chemical compounds that are not in Wikidata yet, 239 00:14:27,473 --> 00:14:28,818 which happens a lot. 240 00:14:29,678 --> 00:14:34,388 So three or four weeks ago I added something like 500 compounds 241 00:14:34,388 --> 00:14:36,358 which our project was looking into 242 00:14:36,358 --> 00:14:41,512 because these are volatile compounds in oils. 243 00:14:44,610 --> 00:14:47,130 This script adds the compounds. 244 00:14:47,130 --> 00:14:51,370 They will later on add the annotation of which species that compound comes from 245 00:14:51,370 --> 00:14:52,870 and what the properties are. 246 00:14:55,914 --> 00:14:59,010 Bioclipse itself is based on the Chemistry Development Kit 247 00:14:59,878 --> 00:15:01,868 and a few other libraries. 248 00:15:01,868 --> 00:15:04,723 This allows me to do the chemistry. 249 00:15:04,723 --> 00:15:07,973 And this is a very well-validated toolkit. 250 00:15:07,973 --> 00:15:11,653 The SMILES part has been done by John Mayfield. 251 00:15:11,653 --> 00:15:17,105 I have done a lot of validation against other tools. 252 00:15:17,522 --> 00:15:20,668 And the quality is actually really high now, 253 00:15:21,163 --> 00:15:26,328 comparable or in some cases even better than commercial cheminformatics tools. 254 00:15:26,328 --> 00:15:29,283 It has given me a lot of reassurance 255 00:15:29,283 --> 00:15:34,923 that the quality checking that we do with this tool on Wikidata 256 00:15:34,923 --> 00:15:37,312 is giving interesting results. 257 00:15:38,197 --> 00:15:40,431 This is the Quickstatements. 258 00:15:40,431 --> 00:15:43,275 Quickstatements is Magnus' work, of course.
259 00:15:44,379 --> 00:15:48,089 What happens if we take the SMILES, it calculates the InChI, 260 00:15:48,089 --> 00:15:52,239 and the InChIKey, it even looks up based on the InChIKey, 261 00:15:52,239 --> 00:15:54,199 if there is a PubChem identifier 262 00:15:54,199 --> 00:15:58,609 that uses the InChIKey, the PubChem identifier, 263 00:15:58,609 --> 00:16:01,255 to see if this compound already is in Wikidata. 264 00:16:01,255 --> 00:16:03,780 And only if it's not already there, 265 00:16:03,780 --> 00:16:06,945 then it will actually create a CREATE statement. 266 00:16:09,330 --> 00:16:12,822 A bit of automatic classification here is an option. 267 00:16:12,822 --> 00:16:15,162 So if I'm adding a class of compounds, 268 00:16:15,162 --> 00:16:17,980 I can automatically indicate what these are all... 269 00:16:17,980 --> 00:16:19,970 this type of compounds, 270 00:16:19,970 --> 00:16:22,599 and I can also indicate, if needed, 271 00:16:22,599 --> 00:16:25,214 if there is a particular article where I got this information 272 00:16:25,214 --> 00:16:27,709 from automatically adding references. 273 00:16:29,836 --> 00:16:32,796 Well, this is what the Quickstatements output looks like 274 00:16:32,796 --> 00:16:35,508 for the annotation of main subjects. 275 00:16:36,277 --> 00:16:38,149 You've probably seen that as well. 276 00:16:40,650 --> 00:16:42,820 A newer thing that I started doing 277 00:16:42,820 --> 00:16:46,003 is actually doing reasoning on the data in Wikidata. 278 00:16:46,003 --> 00:16:51,457 So if I have the SMILES, then I can check the molecular formula, for example. 279 00:16:51,457 --> 00:16:53,427 I can check the InChIKey. 280 00:16:55,242 --> 00:17:00,498 At some point, what we are going to do is calculate physicochemical properties 281 00:17:00,498 --> 00:17:03,865 and see if that matches what is in Wikidata. 282 00:17:04,520 --> 00:17:07,151 This will highlight typos 283 00:17:07,846 --> 00:17:10,386 or wrong units, for example. 
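[Editorial note] The "only create it if it is not already there" step described earlier in this section can be sketched as follows. This is a simplified Python illustration, not the speaker's Bioclipse script: the `Compound` class and `create_statements` function are hypothetical names, the InChIKey is assumed to have been computed beforehand by a cheminformatics toolkit (the talk uses the Chemistry Development Kit), and the set of InChIKeys already present in Wikidata is assumed to come from a separate lookup. The property and item numbers used are P31 (instance of), Q11173 (chemical compound), P233 (canonical SMILES), and P235 (InChIKey).

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Compound:
    label: str     # English label for the new item
    smiles: str    # SMILES string the item is created from
    inchikey: str  # assumed to be precomputed from the SMILES by a toolkit


def create_statements(compound: Compound, known_inchikeys: Set[str]) -> List[str]:
    """Return QuickStatements (version 1) lines that create the compound.

    Returns an empty list when the InChIKey is already known, so that no
    duplicate item gets created.
    """
    if compound.inchikey in known_inchikeys:
        return []
    return [
        "CREATE",
        f'LAST\tLen\t"{compound.label}"',
        "LAST\tP31\tQ11173",
        f'LAST\tP233\t"{compound.smiles}"',
        f'LAST\tP235\t"{compound.inchikey}"',
    ]


if __name__ == "__main__":
    acetic_acid = Compound("acetic acid", "CC(=O)O", "QTBSBXVTEAMEQO-UHFFFAOYSA-N")
    # With an empty set of known keys, the CREATE block is printed.
    print("\n".join(create_statements(acetic_acid, set())))
    # When the key is already known, nothing is emitted.
    print(create_statements(acetic_acid, {"QTBSBXVTEAMEQO-UHFFFAOYSA-N"}))
```

The same precomputed InChIKey could also be compared with the value stored on an existing item, which is the kind of consistency check the speaker describes as reasoning on the data.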
284 00:17:11,301 --> 00:17:14,591 At this moment... so this is a run of this morning. 285 00:17:14,591 --> 00:17:17,989 What we see here is two tests actually failing, 286 00:17:17,989 --> 00:17:19,539 and this is an example of it. 287 00:17:19,539 --> 00:17:23,300 This is the InChIKey that is computed from the isomeric SMILES 288 00:17:23,300 --> 00:17:26,957 is different from the InChIKey given in the entry. 289 00:17:27,700 --> 00:17:32,198 This can result from data being pulled in from different resources. 290 00:17:33,011 --> 00:17:35,889 So these are entries, about 300 of them, 291 00:17:35,889 --> 00:17:39,479 in the 160,000 chemicals that we have in Wikidata. 292 00:17:39,479 --> 00:17:41,632 So it's a very small amount, really, 293 00:17:42,392 --> 00:17:45,672 where there is information, and someone needs to look at it. 294 00:17:47,315 --> 00:17:51,008 Now, these are all organic compounds 295 00:17:51,008 --> 00:17:54,168 and also quite a few inorganic compounds 296 00:17:54,168 --> 00:17:57,281 where these things just work less well. 297 00:17:57,618 --> 00:18:01,105 But I found in the other test that is failing 298 00:18:01,105 --> 00:18:04,105 immediately a couple of things that are very clearly wrong. 299 00:18:09,186 --> 00:18:11,759 PubChem is a huge database. 300 00:18:11,996 --> 00:18:13,846 They do validation as well. 301 00:18:13,846 --> 00:18:16,622 We are in the process of submitting Wikidata there, 302 00:18:16,622 --> 00:18:18,812 which I'm really happy about. 303 00:18:19,350 --> 00:18:22,530 It's in the last validation step at this moment. 304 00:18:23,323 --> 00:18:25,643 And this will also mean that PubChem, 305 00:18:25,643 --> 00:18:29,143 which has something like 100 million compounds 306 00:18:29,143 --> 00:18:31,149 will actually link back to Wikidata. 307 00:18:31,488 --> 00:18:35,117 It already does this, but via Wikipedia. 308 00:18:35,387 --> 00:18:36,947 (laughing) 309 00:18:36,947 --> 00:18:38,434 Do you recognize it? 
310 00:18:38,434 --> 00:18:43,067 With the aforementioned issues there of concept mismatches. 311 00:18:43,591 --> 00:18:45,691 So this will give us a second thing. 312 00:18:45,691 --> 00:18:49,679 And there, also, using the same Bioclipse scripts 313 00:18:49,679 --> 00:18:51,681 or similar Bioclipse scripts, 314 00:18:51,681 --> 00:18:53,493 we get validation reports, 315 00:18:53,493 --> 00:18:57,111 again indicating things that chemists should look at. 316 00:18:59,813 --> 00:19:01,540 That basically wraps it up. 317 00:19:01,803 --> 00:19:04,955 This is still a work in progress, the article is in preparation. 318 00:19:04,955 --> 00:19:08,048 I've been working with Finn here in Scholia 319 00:19:08,048 --> 00:19:10,705 to support this validation. 320 00:19:11,058 --> 00:19:15,538 We're writing up the full work, but for now you can look up this poster. 321 00:19:15,538 --> 00:19:19,352 The slides are on the program of this session, 322 00:19:19,352 --> 00:19:21,825 so you can look at the slides and look at the details. 323 00:19:22,560 --> 00:19:24,939 And a quick acknowledgment: 324 00:19:24,939 --> 00:19:28,370 some of this work has been done by a number of grants that I received. 325 00:19:28,370 --> 00:19:29,836 And thank you very much. 326 00:19:30,282 --> 00:19:32,475 (applause) 327 00:19:35,922 --> 00:19:37,902 (chairman) Are there any questions? 328 00:19:41,142 --> 00:19:42,669 (person 3) Thank you so much for this. 329 00:19:42,669 --> 00:19:44,196 I am [inaudible], 330 00:19:44,196 --> 00:19:47,142 and so far, I've been reading articles 331 00:19:47,142 --> 00:19:49,722 on the [inaudible] Quickipedia on different compounds. 332 00:19:49,722 --> 00:19:53,770 I have a little bit more than 70 articles with different compounds-- 333 00:19:53,770 --> 00:19:55,464 just things I come across. 
334 00:19:56,000 --> 00:19:58,021 And my question to you is 335 00:19:58,021 --> 00:20:02,371 if I want to move my chemistry activity from Wikipedia to Wikidata, 336 00:20:02,371 --> 00:20:05,300 how can I help in a way that is very friendly 337 00:20:05,300 --> 00:20:10,031 to somebody who is a beginner in that field on Wikidata? 338 00:20:12,262 --> 00:20:15,832 So, if that compound is in Wikipedia and... 339 00:20:15,832 --> 00:20:18,092 Sometimes there is actually a Wikidata page. 340 00:20:18,092 --> 00:20:19,972 I occasionally run into this as well, 341 00:20:19,972 --> 00:20:21,902 in the last couple of months not so much anymore 342 00:20:21,902 --> 00:20:23,791 but this morning, actually. 343 00:20:25,522 --> 00:20:27,433 And what I typically do then 344 00:20:27,433 --> 00:20:31,097 is I take the SMILES from [inaudible] infobox 345 00:20:31,097 --> 00:20:32,262 from that compound 346 00:20:32,262 --> 00:20:37,182 or use PubChem to look up the SMILES, check if the information is complete, 347 00:20:37,182 --> 00:20:38,817 particularly the stereochemistry, 348 00:20:39,282 --> 00:20:43,422 and then I use that that creates Wikidata item scripts 349 00:20:43,422 --> 00:20:46,320 to create Quickstatements for that compound. 350 00:20:47,506 --> 00:20:50,016 If there already is a Wikidata item, 351 00:20:50,016 --> 00:20:55,916 I basically just update these scripts, 352 00:20:55,916 --> 00:21:00,065 but rather than say, "CREATE, LAST," I replace the LAST with the Q-code 353 00:21:00,065 --> 00:21:01,710 that that item already has. 354 00:21:01,710 --> 00:21:04,650 And then it complements or it adds this information 355 00:21:04,650 --> 00:21:06,986 based on the information we had. 356 00:21:07,922 --> 00:21:10,422 This is [manuable], 357 00:21:10,422 --> 00:21:13,677 so you can copy-paste a number of SMILES, put it in a file, 358 00:21:13,677 --> 00:21:15,999 and take that.
359 00:21:18,351 --> 00:21:21,671 Extracting that information to Wikidata is not something I've automated yet, 360 00:21:21,671 --> 00:21:25,088 but this helps me... it's a pretty fast process. 361 00:21:25,782 --> 00:21:28,427 I can show you later how to use that software. 362 00:21:30,728 --> 00:21:32,478 (chairman) Are there other questions? 363 00:21:33,515 --> 00:21:34,784 So, I have one. 364 00:21:35,097 --> 00:21:39,825 Do you make an effort to, in fact, make this more visible 365 00:21:39,825 --> 00:21:42,445 in this bioinformatics community 366 00:21:42,445 --> 00:21:46,671 so that they can start using this structured data? 367 00:21:47,566 --> 00:21:49,326 Yeah, I'm actively doing that. 368 00:21:49,326 --> 00:21:52,256 So what I did not mention in this presentation so much, 369 00:21:52,256 --> 00:21:58,456 but we saw that in... I'd have somewhere to start here-- 370 00:21:58,456 --> 00:22:00,669 this is an overview of different databases. 371 00:22:01,106 --> 00:22:04,706 A similar plot, which actually I do not have on this slide deck 372 00:22:04,706 --> 00:22:09,387 is the number of different identifiers that chemical compounds have, 373 00:22:09,387 --> 00:22:11,347 and I've been working with a number of databases, 374 00:22:11,347 --> 00:22:15,622 like MassBank, the Environmental Protection Agency, 375 00:22:17,184 --> 00:22:18,987 CompTox Dashboard. 376 00:22:19,507 --> 00:22:21,704 I've added links to the BDB database. 377 00:22:21,704 --> 00:22:24,159 So I'm working with a number of projects 378 00:22:24,159 --> 00:22:27,469 for pulling in additional information, 379 00:22:27,739 --> 00:22:30,030 identifiers, or links out to other databases. 380 00:22:31,192 --> 00:22:33,371 Regarding outreach, yes, 381 00:22:33,371 --> 00:22:36,961 so that wrong slide deck that I was showing at the start, 382 00:22:36,961 --> 00:22:39,283 that was actually a presentation two weeks ago 383 00:22:39,283 --> 00:22:42,148 at an Open Science Meeting around chemistry.
384 00:22:42,503 --> 00:22:45,843 I'm very much pushing this and... 385 00:22:47,410 --> 00:22:48,951 I see a big future here. 386 00:22:48,951 --> 00:22:50,780 There's a lot of interest. 387 00:22:50,780 --> 00:22:54,590 And making people aware of the CC0 license, 388 00:22:54,590 --> 00:22:57,616 that's typically the larger problem. 389 00:22:58,200 --> 00:23:02,967 So we have to pull in the information carefully. 390 00:23:05,303 --> 00:23:06,713 (chairman) Other questions? 391 00:23:08,643 --> 00:23:10,373 - Okay. - Thank you very much. 392 00:23:10,373 --> 00:23:12,036 (chairman) Can we thank the speaker. 393 00:23:12,036 --> 00:23:15,257 (applause)