0:00:06.075,0:00:09.777 (chairman) So, let's welcome Egon, 0:00:09.777,0:00:13.479 who will describe how he is trying 0:00:13.479,0:00:17.579 to improve the coverage [br]of chemical compounds in Wikidata. 0:00:18.289,0:00:19.679 (Egon) Yeah, thank you. 0:00:20.689,0:00:22.309 So, let's see... 0:00:22.684,0:00:24.920 Oh, this is not right. 0:00:25.880,0:00:26.900 Sorry. 0:00:31.754,0:00:33.323 They put the wrong slide deck. 0:00:40.183,0:00:42.645 (person 1) The one was better than--[br](person 2) [inaudible] 0:00:42.645,0:00:44.142 (person 1) (laughs) 0:00:49.061,0:00:50.281 How about this? 0:00:52.633,0:00:54.765 This one actually says WikidataCon. 0:00:56.540,0:00:58.476 Slightly different slides. 0:01:00.169,0:01:05.559 Okay, so yeah, coverage and correctness, 0:01:05.559,0:01:08.091 accuracy, quality, if you like. 0:01:09.422,0:01:11.590 And the other thing here 0:01:11.590,0:01:14.628 is what makes it different [br]from some of the other things 0:01:14.628,0:01:17.276 that we've seen at the WikidataCon 0:01:17.276,0:01:20.160 is how I do this quality, 0:01:20.160,0:01:21.910 and the coverage, actually. 0:01:21.910,0:01:25.220 So I'm actually taking advantage here[br]of my background, 0:01:25.220,0:01:27.166 which is in cheminformatics, 0:01:27.166,0:01:30.112 which is something [br]that we use in our research. 0:01:30.329,0:01:32.989 And cheminformatics 0:01:32.989,0:01:35.119 is a way to understanding [br]the chemical structures, 0:01:35.119,0:01:36.578 what we will see in a moment. 0:01:36.578,0:01:39.840 We can do things that we cannot do [br]with the regular toolsets 0:01:39.840,0:01:43.524 that we have, like Shape Expressions,[br]quality constraints and sorts. 0:01:45.857,0:01:49.203 Now, one of the interesting[br]things of chemistry 0:01:49.496,0:01:51.732 is that chemical structures, 0:01:52.852,0:01:55.472 are sometimes the same,[br]sometimes not the same, 0:01:55.472,0:01:57.676 depending on what you want to know. 
0:01:58.096,0:02:00.636 And this slide reflects that a bit, 0:02:00.636,0:02:05.006 and what we see here [br]is biologically the same compounds 0:02:05.006,0:02:07.943 but chemically two different compounds. 0:02:08.496,0:02:11.911 But at biological levels [br]with the [inaudible], 0:02:11.911,0:02:13.576 they are in equilibrium, 0:02:13.576,0:02:16.485 and you will not be able [br]to really distinguish between them, 0:02:16.735,0:02:19.906 unless you're looking [br]for a particular type of biology 0:02:19.906,0:02:21.911 like reaction mechanisms. 0:02:23.209,0:02:26.979 Another interesting thing [br]about Wikidata and Wikipedia 0:02:26.979,0:02:31.546 is that we have things [br]like long-chain fatty acids, 0:02:31.546,0:02:34.539 chemical concepts[br]which are not a specific compound 0:02:34.539,0:02:36.754 but actually a class of compounds. 0:02:37.347,0:02:42.867 Now, this class can be based [br]on similar features in the molecules, 0:02:42.867,0:02:46.137 so like in the case [br]of the long-chain fatty acids-- 0:02:46.137,0:02:51.777 they all have a long-chain fatty[br]and an acid group. 0:02:51.777,0:02:54.101 In other cases, there are the classes 0:02:54.101,0:02:57.118 based more[br]on the biological functionality, 0:02:57.118,0:03:00.443 like a certain type of inhibitor,[br]like an ACE inhibitor. 0:03:01.762,0:03:05.632 And this introduces [br]a lot of interesting things, 0:03:05.632,0:03:09.650 partly because of this close link [br]with Wikipedias. 0:03:10.270,0:03:12.602 And one of the things that we see 0:03:12.602,0:03:17.982 is that Wikipedia may have a chembox [br]for a particular compound 0:03:17.982,0:03:20.467 but actually be about a compound class, 0:03:20.467,0:03:25.317 resulting in a slightly different concept 0:03:25.317,0:03:27.419 of what the two things [br]are actually meaning 0:03:27.419,0:03:29.990 and the sitelink being more complicated. 0:03:33.189,0:03:38.899 We need this [br]for understanding the biology. 
0:03:38.899,0:03:42.921 So the research in our group [br]is understanding the living cell. 0:03:43.440,0:03:47.030 The systems biology here [br]described in the biological process 0:03:47.030,0:03:50.545 is we have a pathway database [br]for that--WikiPathways, 0:03:50.545,0:03:54.022 and if we look at the chemistry in there[br]of the small molecules, 0:03:55.659,0:03:59.179 the chemistry is sometimes [br]described in a lot of details, 0:03:59.179,0:04:00.739 sometimes in less detail, 0:04:00.739,0:04:04.219 pretty much like this Wikipedia[br]Wikidata link that we just had 0:04:04.219,0:04:08.429 resulting in basically links[br]to a lot of different databases 0:04:08.429,0:04:10.424 with slightly different focuses. 0:04:10.889,0:04:15.349 Some databases like LIPID MAPS[br]and the Human Metabolome Database, 0:04:15.349,0:04:17.581 they are very much focused on the biology, 0:04:17.581,0:04:20.743 whereas a database like ChEBI, 0:04:20.743,0:04:24.615 that's very much focused [br]on the chemical entities. 0:04:26.235,0:04:27.905 So we try to bridge that, 0:04:27.905,0:04:33.452 and that two-three years ago gave us[br]a very interesting insight 0:04:33.452,0:04:35.871 that if you look at the lines here 0:04:35.871,0:04:40.152 where in blue we have the total number [br]of the small molecules 0:04:40.152,0:04:41.882 we have in these Pathways 0:04:41.882,0:04:44.902 and the numbers in red [br]that we can match to that, 0:04:44.902,0:04:48.402 there is this gap, and this gap [br]is complicated chemistry. 0:04:49.272,0:04:52.428 Also, poignantly, things[br]missing in Wikidata. 0:04:52.779,0:04:55.327 So therefore the need [br]for data completeness 0:04:55.327,0:04:57.717 and the data quality. 0:05:00.486,0:05:02.184 And here we have an example. 
0:05:02.184,0:05:05.642 This is actually [br]a curation report of yesterday, 0:05:05.642,0:05:08.392 and these are still things [br]that we have in Pathways 0:05:08.392,0:05:13.672 but that we do not know [br]what the equivalent thing is in Wikidata. 0:05:14.081,0:05:17.527 And one of the things here[br]that I'm picking out here 0:05:17.527,0:05:18.932 is strigolactone. 0:05:19.117,0:05:21.024 And this is a class of compounds. 0:05:22.031,0:05:27.075 So we have that in one of our Pathways,[br]this particular Pathway over there. 0:05:29.212,0:05:32.172 So you start matching this [br]to Wikidata and Wikipedia, 0:05:32.172,0:05:35.390 and to actually use[br]for this compound Wikipedia page 0:05:35.976,0:05:38.866 with these six structures-- 0:05:38.866,0:05:42.101 images, name, no links, nothing. 0:05:42.101,0:05:43.169 Nothing in Wikipedia, 0:05:43.169,0:05:45.561 just this information, [br]not machine-readable. 0:05:47.942,0:05:53.022 So based on the name, [br]I can actually find three out of the six 0:05:53.022,0:05:54.867 of these compounds in Wikidata, 0:05:54.867,0:05:56.642 not linked, not classified. 0:05:57.253,0:06:00.433 So if we look at the class [br]of strigolactones, 0:06:00.433,0:06:03.015 of which these six are examples, 0:06:03.015,0:06:05.305 Wikidata did not give us anything. 0:06:05.305,0:06:08.070 So that's the kind of curation[br]that I'm interested in. 0:06:09.170,0:06:12.900 On the right here--[br]that page is actually pretty much empty, 0:06:12.900,0:06:17.178 but it's exactly what Scholia is showing[br]for this class of chemical compounds. 0:06:18.315,0:06:22.757 So Scholia is one of the tools [br]that I've been using to do this curation. 0:06:26.339,0:06:30.548 This missing classification[br]is a bit of information 0:06:30.548,0:06:32.057 missing in Wikidata, 0:06:32.057,0:06:34.797 but we can add this classification, 0:06:34.797,0:06:36.957 and we can retrieve that [br]from some sources. 
0:06:36.957,0:06:38.687 We will see with LIPID MAPS later, 0:06:38.687,0:06:41.157 we can automate [br]adding these missing links, 0:06:41.660,0:06:44.371 if we understand the chemistry. 0:06:46.274,0:06:50.554 So this diagram over here--[br]we have fatty acid over there again 0:06:50.554,0:06:54.052 and the long-chain fatty acid over here 0:06:54.052,0:06:56.672 that we saw on one[br]of the previous slides-- 0:06:56.672,0:07:00.998 very long-chain fatty acids [br]and a number of other fatty acids. 0:07:01.336,0:07:04.495 This kind of information helps us see 0:07:04.495,0:07:09.325 a [inaudible] of the chemistry [br]in Wikidata. 0:07:10.351,0:07:13.545 Scholia can visualize the 2D structure, 0:07:13.545,0:07:15.998 and this thing is actually [br]automatically generated 0:07:15.998,0:07:18.701 from the chemical structure in Wikidata 0:07:18.948,0:07:22.198 on the fly creating [br]in the Scalable Vector Graphics. 0:07:22.734,0:07:23.864 (coughing) 0:07:24.530,0:07:25.563 Sorry. 0:07:25.563,0:07:28.150 With the stereochemistry annotation there 0:07:28.150,0:07:32.600 to help the chemist see [br]the completeness of the data 0:07:32.600,0:07:35.819 because also the stereochemistry [br]might be missing. 0:07:36.452,0:07:40.402 We also get an overview on Scholia [br]of related compounds 0:07:40.402,0:07:41.872 based on the InChIKey, 0:07:41.872,0:07:45.662 where the first block basically indicates[br]how the atoms are connected 0:07:45.905,0:07:50.325 and the second column [br]indicates things like stereochemistry 0:07:50.325,0:07:53.838 and things like [br]which isotopes are in there, 0:07:54.334,0:07:57.334 for example C11 instead of C12 0:07:57.334,0:08:01.974 or C13 instead of C12. 
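[Editor's sketch] The block structure of an InChIKey described above can be illustrated with a few lines of Python. This is only a minimal sketch: the names `split_inchikey` and `same_connectivity` and the example keys are invented for illustration and are not taken from Scholia or the speaker's scripts. It relies on the standard InChIKey layout of three hyphen-separated blocks of 14, 10, and 1 uppercase letters.

```python
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    connectivity: str  # first block: hash of how the atoms are connected
    layers: str        # second block: stereochemistry, isotopes, plus flag letters
    protonation: str   # final letter: protonation (charge) indicator


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its three hyphen-separated blocks."""
    parts = key.strip().split("-")
    if [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    if not all(p.isalpha() and p.isupper() for p in parts):
        raise ValueError(f"InChIKey blocks must be uppercase letters: {key!r}")
    return InChIKeyParts(*parts)


def same_connectivity(key_a: str, key_b: str) -> bool:
    """True when two keys share the first block, i.e. the same atom connectivity."""
    return split_inchikey(key_a).connectivity == split_inchikey(key_b).connectivity


# Made-up placeholder keys (not real compounds), built to have the right shape.
EXAMPLE_A = "A" * 14 + "-" + "B" * 8 + "SA-N"
EXAMPLE_B = "A" * 14 + "-" + "C" * 8 + "SA-N"  # same skeleton, different second block
EXAMPLE_C = "D" * 14 + "-" + "B" * 8 + "SA-N"  # different skeleton
```

Grouping items by the first block is one plausible way to list "related compounds" that differ only in stereochemistry or isotopes; whether Scholia does exactly this is not stated in the talk.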
0:08:03.441,0:08:06.459 The last number [br]of the last letter over here 0:08:06.459,0:08:08.671 actually indicates the charge, 0:08:08.671,0:08:13.868 so that's the example that we saw earlier[br]between the citric acid and the citrate, 0:08:14.493,0:08:16.641 or was it the acetic acid and the acetate? 0:08:20.496,0:08:23.086 By putting in a bit of domain knowledge, 0:08:23.086,0:08:27.708 we can do a lot more... making sense [br]of what we have in Wikidata. 0:08:29.463,0:08:32.643 A bit more about Scholia 0:08:32.643,0:08:36.373 is that about data completeness [br]with the physical and chemical properties, 0:08:36.373,0:08:40.070 the literature, those are whole things [br]that we want to have access to. 0:08:40.070,0:08:45.317 But it only works if we can find [br]the right chemical in Wikidata. 0:08:48.897,0:08:52.467 We started using Wikidata [br]in a number of our projects, 0:08:52.467,0:08:54.683 so WikiPathways was one of them. 0:08:54.683,0:08:56.043 This is another project 0:08:56.043,0:08:59.573 in the area of the nanosafety, [br]risk assessment, 0:08:59.573,0:09:02.728 where they use OECD testing guidelines 0:09:02.728,0:09:07.662 and using Wikidata here [br]to make an overview of the experiments. 0:09:07.662,0:09:11.762 And this means that we can now[br]actually start annotating articles 0:09:11.762,0:09:14.443 where these protocols have been used. 0:09:15.663,0:09:20.389 And in this way, we get a better insight [br]in the quality of literature as well. 0:09:20.895,0:09:23.742 We get to see which DDTs 0:09:23.742,0:09:27.268 are well tested, established [br]experimental methods 0:09:27.268,0:09:32.193 and an indication of how good the data is [br]that came out of that. 
0:09:35.299,0:09:38.199 Another example--this is nanomaterials, 0:09:38.199,0:09:39.922 specific nanomaterials, 0:09:40.702,0:09:43.402 where there is a unique code-- [br]we've added that-- 0:09:43.402,0:09:45.277 with the same purpose of being able 0:09:45.277,0:09:48.006 to track down literature [br]about these nanomaterials. 0:09:48.655,0:09:51.913 But, again, we need exact descriptions. 0:09:53.101,0:09:56.241 Now, this is [br]the LIPID MAPS classification, 0:09:56.241,0:09:58.011 and here we see an interesting thing, 0:09:58.011,0:10:01.986 and this has shown up [br]in some of the presentations, 0:10:02.876,0:10:04.236 elsewhere as well. 0:10:04.236,0:10:08.109 This idea that some of the things [br]that we have in Wikidata 0:10:08.109,0:10:10.309 is not always matching the sources. 0:10:10.309,0:10:12.055 So different ontological models, 0:10:12.055,0:10:16.038 different ideas [br]of what a particular thing means. 0:10:16.225,0:10:20.955 And so, if we look at the LIPID MAPS,[br]we have a lipid in the middle 0:10:20.955,0:10:22.745 and then a number of classes, 0:10:22.745,0:10:25.022 and many of these are in Wikidata. 0:10:25.454,0:10:30.534 But here, around actually [br]fatty acids or fatty acyls, 0:10:30.864,0:10:33.190 that's where there is a mismatch 0:10:33.468,0:10:37.288 causing something that should be [br]actually purely hierarchical, 0:10:37.288,0:10:41.070 actually it started to show [br]some loops over there, 0:10:41.496,0:10:46.189 the mismatch of two representations [br]of a lipid chemistry. 0:10:48.636,0:10:51.546 Now, the goal of this work[br]is not so much to reconcile this 0:10:51.546,0:10:53.239 but to visualize it 0:10:53.239,0:10:55.809 so that we can understand [br]what is going on 0:10:55.809,0:10:58.877 and correct things [br]that are actually clearly wrong. 0:11:06.444,0:11:09.619 The interesting about LIPID MAPS [br]is actually that the classification 0:11:10.105,0:11:12.848 is indicated in the external identifier. 
0:11:12.848,0:11:14.605 So one of the things [br]that we've been using 0:11:14.605,0:11:18.845 is these external numbers[br]to make this automatic classification 0:11:18.845,0:11:22.420 because everything [br]that starts with an LMFA05 0:11:22.600,0:11:25.909 is actually a fatty alcohol. 0:11:26.242,0:11:28.424 So I can translate that [br]into Quickstatements, 0:11:28.424,0:11:29.982 push that into Quickstatements 0:11:29.982,0:11:32.992 and get that annotated in Wikidata. 0:11:40.133,0:11:44.223 This slide is just reflecting [br]the advantage for LIPID MAPS here 0:11:44.223,0:11:46.357 we've been collaborating with them 0:11:46.357,0:11:50.106 because they get a lot of data[br]out of Wikidata as well, 0:11:50.284,0:11:54.786 which we can cross-reference, [br]which we can compare if it's correct. 0:11:55.066,0:11:57.973 LIPID MAPS is a quite curated database 0:11:58.436,0:12:02.276 but like everyone actually having trouble 0:12:02.276,0:12:04.748 with access to literature, 0:12:04.748,0:12:07.856 the demand of literature [br]and filtering the literature, 0:12:07.856,0:12:10.102 getting to the right articles. 0:12:12.626,0:12:15.126 Shape Expressions is probably [br]something that you've seen. 0:12:15.126,0:12:18.546 We have a few of them for chemistry now. 0:12:18.546,0:12:21.116 This is the example for racemic mixture. 0:12:21.116,0:12:26.236 In a case of racemic mixture,[br]you want to have two parts in there. 0:12:26.236,0:12:27.358 It's a mixture, 0:12:27.358,0:12:30.657 so at least two chemical entities [br]need to be in there. 0:12:31.008,0:12:35.561 Moreover, each of the [inaudible] parts[br]has to be a chemical compound. 0:12:35.561,0:12:41.311 This is another level of a way[br]we can curate the content. 0:12:41.811,0:12:43.755 There have to be more of them. 0:12:43.755,0:12:47.775 We have quite a few [br]different concepts in Wikidata 0:12:47.775,0:12:49.922 like groups of co-compounds. 
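[Editor's sketch] The LIPID MAPS prefix rule mentioned earlier in this passage (identifiers starting with LMFA05 are fatty alcohols) could be turned into Quickstatements lines roughly as follows. This is a hedged illustration, not the speaker's Bioclipse script: the function name `classify_by_prefix`, the placeholder item ids, the made-up LIPID MAPS identifiers, and the choice of P279 (subclass of) as the linking property are all assumptions. The tab-separated `item, property, value` layout is the usual Quickstatements V1 form.

```python
# Map LIPID MAPS identifier prefixes to Wikidata class items.
# The Q-id here is a placeholder string, NOT a real Wikidata item.
PREFIX_TO_CLASS = {
    "LMFA05": "Q_FATTY_ALCOHOL_PLACEHOLDER",
}


def classify_by_prefix(items, prefix_to_class=PREFIX_TO_CLASS, prop="P279"):
    """Build Quickstatements V1 lines for items whose LIPID MAPS id matches a prefix.

    items: iterable of (wikidata_qid, lipid_maps_id) pairs.
    prop:  linking property; P279 (subclass of) is an assumption made here.
    """
    lines = []
    for qid, lm_id in items:
        for prefix, class_qid in prefix_to_class.items():
            if lm_id.startswith(prefix):
                lines.append(f"{qid}\t{prop}\t{class_qid}")
                break  # one class per item in this simple sketch
    return lines


# Made-up example input: placeholder item ids and invented LIPID MAPS ids.
EXAMPLE_ITEMS = [
    ("Q_ITEM_A", "LMFA05000001"),  # matches the fatty alcohol prefix
    ("Q_ITEM_B", "LMGP01010001"),  # no rule for this prefix, so it is skipped
]
```

Emitting text for Quickstatements rather than editing Wikidata directly keeps a manual review step between the script and the actual edits, which matches the workflow the speaker describes.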
0:12:50.506,0:12:54.710 There is a class that is [br]of structurally similar compounds, etc. 0:12:57.320,0:13:00.532 If you run a query like this,[br]this case for the other one, 0:13:00.532,0:13:04.245 other schema that we have[br]for chemical elements, 0:13:04.245,0:13:07.059 you can do the same thing-- [br]you can run it on a single item 0:13:07.059,0:13:10.400 or you can run that on everything [br]that is a chemical element. 0:13:10.400,0:13:13.800 This is something [br]that I can very much recommend 0:13:13.800,0:13:16.749 having a look at [br]if you have not done so already. 0:13:17.877,0:13:21.550 Now, if we go to the automation of things, 0:13:22.172,0:13:26.032 here I'm using a tool called Bioclipse. 0:13:26.032,0:13:29.222 This is something that we worked on[br]some time ten years ago. 0:13:29.222,0:13:32.246 It's a platform [br]for chemistry and biology, 0:13:32.246,0:13:34.646 or cheminformatics and bioinformatics. 0:13:34.646,0:13:36.746 aimed at automating things, 0:13:39.513,0:13:41.453 including visualizations and sorts. 0:13:41.453,0:13:45.385 But I've taken that now [br]and developed a number of scripts 0:13:45.385,0:13:47.345 that I can actually run [br]on the command line, 0:13:47.345,0:13:50.759 which makes it easier to automate things,[br]as we will see in a moment, 0:13:50.759,0:13:52.193 and doing all sort of things, 0:13:52.193,0:13:55.957 for example, the classification [br]according to the LIPID MAP identifiers, 0:13:56.193,0:14:01.717 that's the scripts all available [br]from the GitHub repository here. 0:14:02.673,0:14:05.703 And typically, I have them[br]create Quickstatements 0:14:05.703,0:14:08.993 because that gets me [br]an additional check step 0:14:08.993,0:14:13.872 after I created the Quickstatements [br]and see what does data actually look like. 0:14:15.979,0:14:18.709 Annotation of main subjects. 
0:14:18.709,0:14:23.144 This one is my script too, [br]starting from SMILES 0:14:23.144,0:14:27.473 to actually add chemical compounds[br]that are not in Wikidata yet, 0:14:27.473,0:14:28.818 which happens a lot. 0:14:29.678,0:14:34.388 So three or four weeks ago [br]I added something like 500 compounds 0:14:34.388,0:14:36.358 which our project was looking into 0:14:36.358,0:14:41.512 because these are [br]volatile compounds in oils. 0:14:44.610,0:14:47.130 This script adds the compounds. 0:14:47.130,0:14:51.370 They will later on add the annotation [br]of which pieces that compound comes from 0:14:51.370,0:14:52.870 and what the properties are. 0:14:55.914,0:14:59.010 Bioclipse itself is based [br]on the Chemistry Development Kit 0:14:59.878,0:15:01.868 and a few other libraries. 0:15:01.868,0:15:04.723 This allows me to do the chemistry. 0:15:04.723,0:15:07.973 And this is a very well-validated toolkit. 0:15:07.973,0:15:11.653 The SMILES part has been done [br]by John Mayfield. 0:15:11.653,0:15:17.105 I have done a lot of validation [br]against other tools. 0:15:17.522,0:15:20.668 And the quality [br]is actually really high now, 0:15:21.163,0:15:26.328 comparable or in some cases even better [br]of commercial cheminformatics tools. 0:15:26.328,0:15:29.283 It has given me a lot of reassurance 0:15:29.283,0:15:34.923 that the quality checking that we do [br]with this tool on Wikidata 0:15:34.923,0:15:37.312 is giving interesting results. 0:15:38.197,0:15:40.431 This is the Quickstatements. 0:15:40.431,0:15:43.275 Quickstatements[br]is Magnus' work, of course. 0:15:44.379,0:15:48.089 What happens if we take the SMILES,[br]it calculates the InChI, 0:15:48.089,0:15:52.239 and the InChIKey, it even looks up [br]based on the InChIKey, 0:15:52.239,0:15:54.199 if there is a PubChem identifier 0:15:54.199,0:15:58.609 that uses the InChIKey, [br]the PubChem identifier, 0:15:58.609,0:16:01.255 to see if this compound [br]already is in Wikidata. 
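[Editor's sketch] The lookup-before-create step described here can be outlined as below. This is only a sketch under stated assumptions: the real scripts run in Bioclipse on the Chemistry Development Kit, whereas this version takes the InChIKey calculation and the Wikidata lookup as injected callables (`compute_inchikey` and `find_item`, both hypothetical names) so that it stays self-contained. The property numbers used (P233 for canonical SMILES, P235 for InChIKey) are the editor's understanding of Wikidata and should be verified before use.

```python
from typing import Callable, List, Optional


def quickstatements_for_smiles(
    smiles: str,
    compute_inchikey: Callable[[str], str],
    find_item: Callable[[str], Optional[str]],
) -> List[str]:
    """Return Quickstatements V1 lines for a compound, or [] if it already exists.

    compute_inchikey: hypothetical hook wrapping a cheminformatics toolkit.
    find_item:        hypothetical hook returning the Q-id for an InChIKey, or None.
    """
    key = compute_inchikey(smiles)
    if find_item(key) is not None:
        return []  # compound is already in Wikidata, so nothing is created
    return [
        "CREATE",
        f'LAST\tP233\t"{smiles}"',  # canonical SMILES (property number assumed)
        f'LAST\tP235\t"{key}"',     # InChIKey (property number assumed)
    ]
```

A caller would plug in a real toolkit and a SPARQL or API lookup for the two hooks; to update an existing item instead, the `LAST` keyword would be replaced by that item's Q-id.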
0:16:01.255,0:16:03.780 And only if it's not already there, 0:16:03.780,0:16:06.945 then it will actually create[br]a CREATE statement. 0:16:09.330,0:16:12.822 A bit of automatic classification[br]here is an option. 0:16:12.822,0:16:15.162 So if I'm adding a class of compounds, 0:16:15.162,0:16:17.980 I can automatically indicate [br]what these are all... 0:16:17.980,0:16:19.970 this type of compounds, 0:16:19.970,0:16:22.599 and I can also indicate, if needed, 0:16:22.599,0:16:25.214 if there is a particular article[br]where I got this information 0:16:25.214,0:16:27.709 from automatically adding references. 0:16:29.836,0:16:32.796 Well, this is what [br]the Quickstatements output looks like 0:16:32.796,0:16:35.508 for the annotation of main subjects. 0:16:36.277,0:16:38.149 You've probably seen that as well. 0:16:40.650,0:16:42.820 A newer thing that I started doing 0:16:42.820,0:16:46.003 is actually doing reasoning [br]on the data in Wikidata. 0:16:46.003,0:16:51.457 So if I have the SMILES, then I can check [br]the molecular formula, for example. 0:16:51.457,0:16:53.427 I can check the InChIKey. 0:16:55.242,0:17:00.498 At some point, what we are going to do[br]is calculate physicochemical properties 0:17:00.498,0:17:03.865 and see if that matches [br]what is in Wikidata. 0:17:04.520,0:17:07.151 This will highlight typos 0:17:07.846,0:17:10.386 or wrong units, for example. 0:17:11.301,0:17:14.591 At this moment... [br]so this is a run of this morning. 0:17:14.591,0:17:17.989 What we see here [br]is two tests actually failing, 0:17:17.989,0:17:19.539 and this is an example of it. 0:17:19.539,0:17:23.300 This is the InChIKey[br]that is computed from the isomeric SMILES 0:17:23.300,0:17:26.957 is different from the InChIKey[br]given in the entry. 0:17:27.700,0:17:32.198 This can result from data [br]being pulled in from different resources. 
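[Editor's sketch] The InChIKey consistency test described above (the key computed from the isomeric SMILES differs from the key stored in the entry) can be expressed as a small function. It is a minimal illustration, not the speaker's test suite: `find_inchikey_mismatches` is an invented name, and `compute_inchikey` is a hypothetical hook standing in for whatever toolkit performs the calculation.

```python
from typing import Callable, Iterable, List, Tuple


def find_inchikey_mismatches(
    entries: Iterable[Tuple[str, str, str]],
    compute_inchikey: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Report entries whose stored InChIKey disagrees with the computed one.

    entries: (qid, isomeric_smiles, stored_inchikey) triples.
    Returns (qid, stored_key, computed_key) for every disagreement.
    """
    mismatches = []
    for qid, smiles, stored_key in entries:
        computed_key = compute_inchikey(smiles)
        if computed_key != stored_key:
            mismatches.append((qid, stored_key, computed_key))
    return mismatches
```

A mismatch does not say which value is wrong, only that the SMILES and the InChIKey on the item do not describe the same structure, so each hit still needs a chemist to look at it.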
0:17:33.011,0:17:35.889 So these are entries, about 300 of them, 0:17:35.889,0:17:39.479 in the 160,000 chemicals [br]that we have in Wikidata. 0:17:39.479,0:17:41.632 So it's a very small amount, really, 0:17:42.392,0:17:45.672 where there is information, [br]and someone needs to look at it. 0:17:47.315,0:17:51.008 Now, these are all organic compounds 0:17:51.008,0:17:54.168 and also quite a few inorganic compounds 0:17:54.168,0:17:57.281 where these things just work less well. 0:17:57.618,0:18:01.105 But I found in the other test [br]that is failing 0:18:01.105,0:18:04.105 immediately a couple of things [br]that are very clearly wrong. 0:18:09.186,0:18:11.759 PubChem is a huge database. 0:18:11.996,0:18:13.846 They do validation as well. 0:18:13.846,0:18:16.622 We are in the process [br]of submitting Wikidata there, 0:18:16.622,0:18:18.812 which I'm really happy about. 0:18:19.350,0:18:22.530 It's in the last validation step [br]at this moment. 0:18:23.323,0:18:25.643 And this will also mean that PubChem, 0:18:25.643,0:18:29.143 which has something [br]like 100 million compounds 0:18:29.143,0:18:31.149 will actually link back to Wikidata. 0:18:31.488,0:18:35.117 It already does this, but via Wikipedia. 0:18:35.387,0:18:36.947 (laughing) 0:18:36.947,0:18:38.434 Do you recognize it? 0:18:38.434,0:18:43.067 With the aforementioned issues there [br]of concept mismatches. 0:18:43.591,0:18:45.691 So this will give us a second thing. 0:18:45.691,0:18:49.679 And there, also, [br]using the same Bioclipse scripts 0:18:49.679,0:18:51.681 or similar Bioclipse scripts, 0:18:51.681,0:18:53.493 we get validation reports, 0:18:53.493,0:18:57.111 again indicating things [br]that chemists should look at. 0:18:59.813,0:19:01.540 That basically wraps it up. 0:19:01.803,0:19:04.955 This is still a work in progress,[br]the article is in preparation. 0:19:04.955,0:19:08.048 I've been working [br]with Finn here in Scholia 0:19:08.048,0:19:10.705 to support this validation. 
0:19:11.058,0:19:15.538 We're writing up the full work, [br]but for now you can look up this poster. 0:19:15.538,0:19:19.352 The slides are on the program[br]of this session, 0:19:19.352,0:19:21.825 so you can look at the slides[br]and look at the details. 0:19:22.560,0:19:24.939 And a quick acknowledgment: 0:19:24.939,0:19:28.370 some of this work has been done [br]by a number of grants that I received. 0:19:28.370,0:19:29.836 And thank you very much. 0:19:30.282,0:19:32.475 (applause) 0:19:35.922,0:19:37.902 (chairman) Are there any questions? 0:19:41.142,0:19:42.669 (person 3) Thank you so much for this. 0:19:42.669,0:19:44.196 I am [inaudible], 0:19:44.196,0:19:47.142 and so far, I've been reading articles 0:19:47.142,0:19:49.722 on the [inaudible] Quickipedia [br]on different compounds. 0:19:49.722,0:19:53.770 I have a little bit more than 70 articles [br]with different compounds-- 0:19:53.770,0:19:55.464 just things I come across. 0:19:56.000,0:19:58.021 And my question to you is 0:19:58.021,0:20:02.371 if I want to move my chemistry activity [br]from Wikipedia to Wikidata, 0:20:02.371,0:20:05.300 how can I help[br]in a way that is very friendly 0:20:05.300,0:20:10.031 to somebody who is a beginner [br]in that field on Wikidata? 0:20:12.262,0:20:15.832 So, if that compound is in Wikipedia and.. 0:20:15.832,0:20:18.092 Sometimes there is [br]actually a Wikidata page. 0:20:18.092,0:20:19.972 I occasionally run into this as well, 0:20:19.972,0:20:21.902 in the last couple of months [br]not so much anymore 0:20:21.902,0:20:23.791 but this morning, actually. 
0:20:25.522,0:20:27.433 And what I typically do then 0:20:27.433,0:20:31.097 is I take the SMILES[br]from [inaudible] infobox 0:20:31.097,0:20:32.262 from that compound 0:20:32.262,0:20:37.182 or use PubChem to look up the SMILES,[br]check if the information is complete, 0:20:37.182,0:20:38.817 particularly the stereochemistry, 0:20:39.282,0:20:43.422 and then I use that [br]create-Wikidata-item script 0:20:43.422,0:20:46.320 to create Quickstatements [br]for that compound. 0:20:47.506,0:20:50.016 If there already is a Wikidata item, 0:20:50.016,0:20:55.916 I basically just update these scripts, 0:20:55.916,0:21:00.065 but rather than say "CREATE" and "LAST,"[br]I replace the LAST with the Q-code 0:21:00.065,0:21:01.710 that that item already has. 0:21:01.710,0:21:04.650 And then it complements [br]or it adds this information 0:21:04.650,0:21:06.986 based on the information we had. 0:21:07.922,0:21:10.422 This is [manuable], 0:21:10.422,0:21:13.677 so you can copy-paste [br]a number of SMILES, put it in a file, 0:21:13.677,0:21:15.999 and take that. 0:21:18.351,0:21:21.671 Extracting that information to Wikidata [br]is not something I've automated yet, 0:21:21.671,0:21:25.088 but this helps me... [br]it's a pretty fast process. 0:21:25.782,0:21:28.427 I can show you later [br]how to use that software. 0:21:30.728,0:21:32.478 (chairman) Are there other questions? 0:21:33.515,0:21:34.784 So, I have one. 0:21:35.097,0:21:39.825 Do you make an effort [br]to, in fact, make this more visible 0:21:39.825,0:21:42.445 in this bioinformatics community 0:21:42.445,0:21:46.671 so that they can start using[br]this structured data? 0:21:47.566,0:21:49.326 Yeah, I'm actively doing that. 0:21:49.326,0:21:52.256 So what I did not mention [br]in this presentation so much, 0:21:52.256,0:21:58.456 but we saw that in...[br]I'd have somewhere to start here-- 0:21:58.456,0:22:00.669 this is an overview [br]of different databases. 
0:22:01.106,0:22:04.706 A similar plot, which actually [br]I do not have on this slide deck 0:22:04.706,0:22:09.387 is the number of different identifiers[br]that chemical compounds have, 0:22:09.387,0:22:11.347 and I've been working [br]with a number of databases, 0:22:11.347,0:22:15.622 like MassBank, [br]the Environmental Protection Agency, 0:22:17.184,0:22:18.987 CompTox Dashboard. 0:22:19.507,0:22:21.704 I've added links to the BDB database. 0:22:21.704,0:22:24.159 So I'm working with a number of projects 0:22:24.159,0:22:27.469 for pulling in additional information, 0:22:27.739,0:22:30.030 identifiers or links out[br]to other databases. 0:22:31.192,0:22:33.371 Regarding outreach, yes, 0:22:33.371,0:22:36.961 so that wrong slide deck[br]that I was showing at the start, 0:22:36.961,0:22:39.283 there was actually [br]a presentation two weeks ago 0:22:39.283,0:22:42.148 at an Open Science Meeting [br]around chemistry. 0:22:42.503,0:22:45.843 I'm very much pushing this and... 0:22:47.410,0:22:48.951 I see a big future here. 0:22:48.951,0:22:50.780 There's a lot of interest. 0:22:50.780,0:22:54.590 And making people aware [br]of the CC0 license, 0:22:54.590,0:22:57.616 that's typically the larger problem. 0:22:58.200,0:23:02.967 So we have to pull in[br]the information carefully. 0:23:05.303,0:23:06.713 (chairman) Other questions? 0:23:08.643,0:23:10.373 - Okay. [br]- Thank you very much. 0:23:10.373,0:23:12.036 (chairman) Can we thank the speaker. 0:23:12.036,0:23:15.257 (applause)