WEBVTT 00:00:06.075 --> 00:00:09.777 (chairman) So, let's welcome Egon, 00:00:09.777 --> 00:00:13.479 who will describe how he is trying 00:00:13.479 --> 00:00:17.579 to improve the coverage of chemical compounds in Wikidata. 00:00:18.289 --> 00:00:19.679 (Egon) Yeah, thank you. 00:00:20.689 --> 00:00:22.309 So, let's see... 00:00:22.684 --> 00:00:24.920 Oh, this is not right. 00:00:25.880 --> 00:00:26.900 Sorry. 00:00:31.754 --> 00:00:33.323 They put the wrong slide deck. 00:00:40.183 --> 00:00:42.645 (person 1) The one was better than-- (person 2) [inaudible] 00:00:42.645 --> 00:00:44.142 (person 1) (laughs) 00:00:49.061 --> 00:00:50.281 How about this? 00:00:52.633 --> 00:00:54.765 This one actually says WikidataCon. 00:00:56.540 --> 00:00:58.476 Slightly different slides. 00:01:00.169 --> 00:01:05.559 Okay, so yeah, coverage and correctness, 00:01:05.559 --> 00:01:08.091 accuracy, quality, if you like. 00:01:09.422 --> 00:01:11.590 And the other thing here 00:01:11.590 --> 00:01:14.628 is what makes it different from some of the other things 00:01:14.628 --> 00:01:17.276 that we've seen at the WikidataCon 00:01:17.276 --> 00:01:20.160 is how I do this quality, 00:01:20.160 --> 00:01:21.910 and the coverage, actually. 00:01:21.910 --> 00:01:25.220 So I'm actually taking advantage here of my background, 00:01:25.220 --> 00:01:27.166 which is in cheminformatics, 00:01:27.166 --> 00:01:30.112 which is something that we use in our research. 00:01:30.329 --> 00:01:32.989 And cheminformatics 00:01:32.989 --> 00:01:35.119 is a way to understanding the chemical structures, 00:01:35.119 --> 00:01:36.578 what we will see in a moment. 00:01:36.578 --> 00:01:39.840 We can do things that we cannot do with the regular toolsets 00:01:39.840 --> 00:01:43.524 that we have, like Shape Expressions, quality constraints and sorts. 
00:01:45.857 --> 00:01:49.203 Now, one of the interesting things about chemistry 00:01:49.496 --> 00:01:51.732 is that chemical structures 00:01:52.852 --> 00:01:55.472 are sometimes the same, sometimes not the same, 00:01:55.472 --> 00:01:57.676 depending on what you want to know. 00:01:58.096 --> 00:02:00.636 And this slide reflects that a bit, 00:02:00.636 --> 00:02:05.006 and what we see here is biologically the same compound 00:02:05.006 --> 00:02:07.943 but chemically two different compounds. 00:02:08.496 --> 00:02:11.911 But at biological levels with the [inaudible], 00:02:11.911 --> 00:02:13.576 they are in equilibrium, 00:02:13.576 --> 00:02:16.485 and you will not be able to really distinguish between them, 00:02:16.735 --> 00:02:19.906 unless you're looking for a particular type of biology 00:02:19.906 --> 00:02:21.911 like reaction mechanisms. 00:02:23.209 --> 00:02:26.979 Another interesting thing about Wikidata and Wikipedia 00:02:26.979 --> 00:02:31.546 is that we have things like long-chain fatty acids, 00:02:31.546 --> 00:02:34.539 chemical concepts which are not a specific compound 00:02:34.539 --> 00:02:36.754 but actually a class of compounds. 00:02:37.347 --> 00:02:42.867 Now, this class can be based on similar features in the molecules, 00:02:42.867 --> 00:02:46.137 so like in the case of the long-chain fatty acids-- 00:02:46.137 --> 00:02:51.777 they all have a long fatty chain and an acid group. 00:02:51.777 --> 00:02:54.101 In other cases, the classes are 00:02:54.101 --> 00:02:57.118 based more on the biological functionality, 00:02:57.118 --> 00:03:00.443 like a certain type of inhibitor, like an ACE inhibitor. 00:03:01.762 --> 00:03:05.632 And this introduces a lot of interesting things, 00:03:05.632 --> 00:03:09.650 partly because of this close link with Wikipedias.
00:03:10.270 --> 00:03:12.602 And one of the things that we see 00:03:12.602 --> 00:03:17.982 is that Wikipedia may have a chembox for a particular compound 00:03:17.982 --> 00:03:20.467 but actually be about a compound class, 00:03:20.467 --> 00:03:25.317 resulting in a slightly different concept 00:03:25.317 --> 00:03:27.419 of what the two things actually mean 00:03:27.419 --> 00:03:29.990 and the sitelink being more complicated. 00:03:33.189 --> 00:03:38.899 We need this for understanding the biology. 00:03:38.899 --> 00:03:42.921 So the research in our group is understanding the living cell. 00:03:43.440 --> 00:03:47.030 The systems biology here is described in biological processes, 00:03:47.030 --> 00:03:50.545 and we have a pathway database for that--WikiPathways, 00:03:50.545 --> 00:03:54.022 and if we look at the chemistry in there of the small molecules, 00:03:55.659 --> 00:03:59.179 the chemistry is sometimes described in a lot of detail, 00:03:59.179 --> 00:04:00.739 sometimes in less detail, 00:04:00.739 --> 00:04:04.219 pretty much like this Wikipedia-Wikidata link that we just had, 00:04:04.219 --> 00:04:08.429 resulting in basically links to a lot of different databases 00:04:08.429 --> 00:04:10.424 with slightly different focuses. 00:04:10.889 --> 00:04:15.349 Some databases like LIPID MAPS and the Human Metabolome Database, 00:04:15.349 --> 00:04:17.581 they are very much focused on the biology, 00:04:17.581 --> 00:04:20.743 whereas a database like ChEBI, 00:04:20.743 --> 00:04:24.615 that's very much focused on the chemical entities.
00:04:26.235 --> 00:04:27.905 So we try to bridge that, 00:04:27.905 --> 00:04:33.452 and that two or three years ago gave us a very interesting insight 00:04:33.452 --> 00:04:35.871 that if you look at the lines here 00:04:35.871 --> 00:04:40.152 where in blue we have the total number of the small molecules 00:04:40.152 --> 00:04:41.882 we have in these Pathways 00:04:41.882 --> 00:04:44.902 and the numbers in red that we can match to that, 00:04:44.902 --> 00:04:48.402 there is this gap, and this gap is complicated chemistry. 00:04:49.272 --> 00:04:52.428 Also, poignantly, things missing in Wikidata. 00:04:52.779 --> 00:04:55.327 So therefore the need for data completeness 00:04:55.327 --> 00:04:57.717 and the data quality. 00:05:00.486 --> 00:05:02.184 And here we have an example. 00:05:02.184 --> 00:05:05.642 This is actually a curation report of yesterday, 00:05:05.642 --> 00:05:08.392 and these are still things that we have in Pathways 00:05:08.392 --> 00:05:13.672 but that we do not know what the equivalent thing is in Wikidata. 00:05:14.081 --> 00:05:17.527 And one of the things that I'm picking out here 00:05:17.527 --> 00:05:18.932 is strigolactone. 00:05:19.117 --> 00:05:21.024 And this is a class of compounds. 00:05:22.031 --> 00:05:27.075 So we have that in one of our Pathways, this particular Pathway over there. 00:05:29.212 --> 00:05:32.172 So you start matching this to Wikidata and Wikipedia, 00:05:32.172 --> 00:05:35.390 and there is actually a Wikipedia page for this compound class 00:05:35.976 --> 00:05:38.866 with these six structures-- 00:05:38.866 --> 00:05:42.101 images, name, no links, nothing. 00:05:42.101 --> 00:05:43.169 Nothing in Wikipedia, 00:05:43.169 --> 00:05:45.561 just this information, not machine-readable. 00:05:47.942 --> 00:05:53.022 So based on the name, I can actually find three out of the six 00:05:53.022 --> 00:05:54.867 of these compounds in Wikidata, 00:05:54.867 --> 00:05:56.642 not linked, not classified.
00:05:57.253 --> 00:06:00.433 So if we look at the class of strigolactones, 00:06:00.433 --> 00:06:03.015 of which these six are examples, 00:06:03.015 --> 00:06:05.305 Wikidata did not give us anything. 00:06:05.305 --> 00:06:08.070 So that's the kind of curation that I'm interested in. 00:06:09.170 --> 00:06:12.900 On the right here-- that page is actually pretty much empty, 00:06:12.900 --> 00:06:17.178 but it's exactly what Scholia is showing for this class of chemical compounds. 00:06:18.315 --> 00:06:22.757 So Scholia is one of the tools that I've been using to do this curation. 00:06:26.339 --> 00:06:30.548 This missing classification is a bit of information 00:06:30.548 --> 00:06:32.057 missing in Wikidata, 00:06:32.057 --> 00:06:34.797 but we can add this classification, 00:06:34.797 --> 00:06:36.957 and we can retrieve that from some sources. 00:06:36.957 --> 00:06:38.687 We will see with LIPID MAPS later, 00:06:38.687 --> 00:06:41.157 we can automate adding these missing links, 00:06:41.660 --> 00:06:44.371 if we understand the chemistry. 00:06:46.274 --> 00:06:50.554 So this diagram over here-- we have fatty acid over there again 00:06:50.554 --> 00:06:54.052 and the long-chain fatty acid over here 00:06:54.052 --> 00:06:56.672 that we saw on one of the previous slides-- 00:06:56.672 --> 00:07:00.998 very long-chain fatty acids and a number of other fatty acids. 00:07:01.336 --> 00:07:04.495 This kind of information helps us see 00:07:04.495 --> 00:07:09.325 a [inaudible] of the chemistry in Wikidata. 00:07:10.351 --> 00:07:13.545 Scholia can visualize the 2D structure, 00:07:13.545 --> 00:07:15.998 and this thing is actually automatically generated 00:07:15.998 --> 00:07:18.701 from the chemical structure in Wikidata 00:07:18.948 --> 00:07:22.198 on the fly creating in the Scalable Vector Graphics. 00:07:22.734 --> 00:07:23.864 (coughing) 00:07:24.530 --> 00:07:25.563 Sorry. 
00:07:25.563 --> 00:07:28.150 With the stereochemistry annotation there 00:07:28.150 --> 00:07:32.600 to help the chemist see the completeness of the data 00:07:32.600 --> 00:07:35.819 because also the stereochemistry might be missing. 00:07:36.452 --> 00:07:40.402 We also get an overview on Scholia of related compounds 00:07:40.402 --> 00:07:41.872 based on the InChIKey, 00:07:41.872 --> 00:07:45.662 where the first block basically indicates how the atoms are connected 00:07:45.905 --> 00:07:50.325 and the second block indicates things like stereochemistry 00:07:50.325 --> 00:07:53.838 and things like which isotopes are in there, 00:07:54.334 --> 00:07:57.334 for example C11 instead of C12 00:07:57.334 --> 00:08:01.974 or C13 instead of C12. 00:08:03.441 --> 00:08:06.459 The last letter over here 00:08:06.459 --> 00:08:08.671 actually indicates the charge, 00:08:08.671 --> 00:08:13.868 so that's the example that we saw earlier between the citric acid and the citrate, 00:08:14.493 --> 00:08:16.641 or was it the acetic acid and the acetate? 00:08:20.496 --> 00:08:23.086 By putting in a bit of domain knowledge, 00:08:23.086 --> 00:08:27.708 we can do a lot more... making sense of what we have in Wikidata. 00:08:29.463 --> 00:08:32.643 A bit more about Scholia 00:08:32.643 --> 00:08:36.373 is that it is about data completeness with the physical and chemical properties, 00:08:36.373 --> 00:08:40.070 the literature, those are all things that we want to have access to. 00:08:40.070 --> 00:08:45.317 But it only works if we can find the right chemical in Wikidata. 00:08:48.897 --> 00:08:52.467 We started using Wikidata in a number of our projects, 00:08:52.467 --> 00:08:54.683 so WikiPathways was one of them.
00:08:54.683 --> 00:08:56.043 This is another project 00:08:56.043 --> 00:08:59.573 in the area of the nanosafety, risk assessment, 00:08:59.573 --> 00:09:02.728 where they use OECD testing guidelines 00:09:02.728 --> 00:09:07.662 and using Wikidata here to make an overview of the experiments. 00:09:07.662 --> 00:09:11.762 And this means that we can now actually start annotating articles 00:09:11.762 --> 00:09:14.443 where these protocols have been used. 00:09:15.663 --> 00:09:20.389 And in this way, we get a better insight in the quality of literature as well. 00:09:20.895 --> 00:09:23.742 We get to see which DDTs 00:09:23.742 --> 00:09:27.268 are well tested, established experimental methods 00:09:27.268 --> 00:09:32.193 and an indication of how good the data is that came out of that. 00:09:35.299 --> 00:09:38.199 Another example--this is nanomaterials, 00:09:38.199 --> 00:09:39.922 specific nanomaterials, 00:09:40.702 --> 00:09:43.402 where there is a unique code-- we've added that-- 00:09:43.402 --> 00:09:45.277 with the same purpose of being able 00:09:45.277 --> 00:09:48.006 to track down literature about these nanomaterials. 00:09:48.655 --> 00:09:51.913 But, again, we need exact descriptions. 00:09:53.101 --> 00:09:56.241 Now, this is the LIPID MAPS classification, 00:09:56.241 --> 00:09:58.011 and here we see an interesting thing, 00:09:58.011 --> 00:10:01.986 and this has shown up in some of the presentations, 00:10:02.876 --> 00:10:04.236 elsewhere as well. 00:10:04.236 --> 00:10:08.109 This idea that some of the things that we have in Wikidata 00:10:08.109 --> 00:10:10.309 is not always matching the sources. 00:10:10.309 --> 00:10:12.055 So different ontological models, 00:10:12.055 --> 00:10:16.038 different ideas of what a particular thing means. 
00:10:16.225 --> 00:10:20.955 And so, if we look at LIPID MAPS, we have a lipid in the middle 00:10:20.955 --> 00:10:22.745 and then a number of classes, 00:10:22.745 --> 00:10:25.022 and many of these are in Wikidata. 00:10:25.454 --> 00:10:30.534 But here, around actually fatty acids or fatty acyls, 00:10:30.864 --> 00:10:33.190 that's where there is a mismatch 00:10:33.468 --> 00:10:37.288 causing something that should be actually purely hierarchical, 00:10:37.288 --> 00:10:41.070 actually it started to show some loops over there, 00:10:41.496 --> 00:10:46.189 the mismatch of two representations of lipid chemistry. 00:10:48.636 --> 00:10:51.546 Now, the goal of this work is not so much to reconcile this 00:10:51.546 --> 00:10:53.239 but to visualize it 00:10:53.239 --> 00:10:55.809 so that we can understand what is going on 00:10:55.809 --> 00:10:58.877 and correct things that are actually clearly wrong. 00:11:06.444 --> 00:11:09.619 The interesting thing about LIPID MAPS is actually that the classification 00:11:10.105 --> 00:11:12.848 is indicated in the external identifier. 00:11:12.848 --> 00:11:14.605 So one of the things that we've been using 00:11:14.605 --> 00:11:18.845 is these external numbers to make this automatic classification 00:11:18.845 --> 00:11:22.420 because everything that starts with LMFA05 00:11:22.600 --> 00:11:25.909 is actually a fatty alcohol. 00:11:26.242 --> 00:11:28.424 So I can translate that into Quickstatements, 00:11:28.424 --> 00:11:29.982 push that into Quickstatements 00:11:29.982 --> 00:11:32.992 and get that annotated in Wikidata. 00:11:40.133 --> 00:11:44.223 This slide is just reflecting the advantage for LIPID MAPS here-- 00:11:44.223 --> 00:11:46.357 we've been collaborating with them-- 00:11:46.357 --> 00:11:50.106 because they get a lot of data out of Wikidata as well, 00:11:50.284 --> 00:11:54.786 which we can cross-reference, which we can compare if it's correct.
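The prefix-based classification described for LIPID MAPS can be sketched as follows. This is a minimal illustration, not the speaker's actual script: only the rule "identifiers starting with LMFA05 are fatty alcohols" comes from the talk. The Q-ids, the example identifiers, and the choice of P31 ("instance of") as the linking property are assumptions made for the example.

```python
# Sketch only: map LIPID MAPS identifier prefixes to Wikidata class items.
# The LMFA05 -> fatty alcohol rule is from the talk; the Q-ids are placeholders.
PREFIX_TO_CLASS = {
    "LMFA05": "Q_FATTY_ALCOHOL",  # placeholder, not a real Wikidata Q-id
}


def classify(items, prop="P31"):
    """Return Quickstatements lines for (item, LIPID MAPS ID) pairs.

    `prop` is an assumption: P31 is "instance of"; P279 ("subclass of")
    may be the better modelling choice for some items.
    """
    lines = []
    for qid, lipidmaps_id in items:
        for prefix, class_qid in PREFIX_TO_CLASS.items():
            if lipidmaps_id.startswith(prefix):
                # Quickstatements commands are tab-separated: item, property, value
                lines.append(f"{qid}\t{prop}\t{class_qid}")
                break
    return lines


# Illustrative input: item ids and identifiers are made up for the example.
example = [("Q_ITEM_A", "LMFA05000001"), ("Q_ITEM_B", "LMFA01010001")]
for line in classify(example):
    print(line)
```

Only the first example item matches a known prefix, so only one statement is printed. Writing the statements to a file first keeps the manual review step before anything is pushed to Wikidata.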
00:11:55.066 --> 00:11:57.973 LIPID MAPS is a quite curated database 00:11:58.436 --> 00:12:02.276 but like everyone actually having trouble 00:12:02.276 --> 00:12:04.748 with access to literature, 00:12:04.748 --> 00:12:07.856 the demand of literature and filtering the literature, 00:12:07.856 --> 00:12:10.102 getting to the right articles. 00:12:12.626 --> 00:12:15.126 Shape Expressions is probably something that you've seen. 00:12:15.126 --> 00:12:18.546 We have a few of them for chemistry now. 00:12:18.546 --> 00:12:21.116 This is the example for racemic mixture. 00:12:21.116 --> 00:12:26.236 In a case of racemic mixture, you want to have two parts in there. 00:12:26.236 --> 00:12:27.358 It's a mixture, 00:12:27.358 --> 00:12:30.657 so at least two chemical entities need to be in there. 00:12:31.008 --> 00:12:35.561 Moreover, each of the [inaudible] parts has to be a chemical compound. 00:12:35.561 --> 00:12:41.311 This is another level of a way we can curate the content. 00:12:41.811 --> 00:12:43.755 There have to be more of them. 00:12:43.755 --> 00:12:47.775 We have quite a few different concepts in Wikidata 00:12:47.775 --> 00:12:49.922 like groups of co-compounds. 00:12:50.506 --> 00:12:54.710 There is a class that is of structurally similar compounds, etc. 00:12:57.320 --> 00:13:00.532 If you run a query like this, this case for the other one, 00:13:00.532 --> 00:13:04.245 other schema that we have for chemical elements, 00:13:04.245 --> 00:13:07.059 you can do the same thing-- you can run it on a single item 00:13:07.059 --> 00:13:10.400 or you can run that on everything that is a chemical element. 00:13:10.400 --> 00:13:13.800 This is something that I can very much recommend 00:13:13.800 --> 00:13:16.749 having a look at if you have not done so already. 00:13:17.877 --> 00:13:21.550 Now, if we go to the automation of things, 00:13:22.172 --> 00:13:26.032 here I'm using a tool called Bioclipse. 
00:13:26.032 --> 00:13:29.222 This is something that we worked on some ten years ago. 00:13:29.222 --> 00:13:32.246 It's a platform for chemistry and biology, 00:13:32.246 --> 00:13:34.646 or cheminformatics and bioinformatics, 00:13:34.646 --> 00:13:36.746 aimed at automating things, 00:13:39.513 --> 00:13:41.453 including visualizations and sorts. 00:13:41.453 --> 00:13:45.385 But I've taken that now and developed a number of scripts 00:13:45.385 --> 00:13:47.345 that I can actually run on the command line, 00:13:47.345 --> 00:13:50.759 which makes it easier to automate things, as we will see in a moment, 00:13:50.759 --> 00:13:52.193 and doing all sorts of things, 00:13:52.193 --> 00:13:55.957 for example, the classification according to the LIPID MAPS identifiers, 00:13:56.193 --> 00:14:01.717 those scripts are all available from the GitHub repository here. 00:14:02.673 --> 00:14:05.703 And typically, I have them create Quickstatements 00:14:05.703 --> 00:14:08.993 because that gets me an additional check step 00:14:08.993 --> 00:14:13.872 after I created the Quickstatements to see what the data actually looks like. 00:14:15.979 --> 00:14:18.709 Annotation of main subjects. 00:14:18.709 --> 00:14:23.144 This one is my script too, starting from SMILES 00:14:23.144 --> 00:14:27.473 to actually add chemical compounds that are not in Wikidata yet, 00:14:27.473 --> 00:14:28.818 which happens a lot. 00:14:29.678 --> 00:14:34.388 So three or four weeks ago I added something like 500 compounds 00:14:34.388 --> 00:14:36.358 which our project was looking into 00:14:36.358 --> 00:14:41.512 because these are volatile compounds in oils. 00:14:44.610 --> 00:14:47.130 This script adds the compounds. 00:14:47.130 --> 00:14:51.370 They will later on add the annotation of which species that compound comes from 00:14:51.370 --> 00:14:52.870 and what the properties are.
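A script that starts from SMILES and only adds compounds not yet in Wikidata might emit Quickstatements along these lines. This is a hedged sketch, not the Bioclipse script from the talk: the function name and the `known_inchikeys` set are hypothetical, the InChIKey is assumed to have been computed beforehand by a cheminformatics toolkit, and the set of known keys is assumed to come from a prior Wikidata query. P233 (canonical SMILES), P235 (InChIKey) and P31 (instance of) are existing Wikidata properties.

```python
def create_statements(smiles, inchikey, known_inchikeys, class_qid=None):
    """Return Quickstatements lines for a new compound item.

    Returns an empty list when the InChIKey is already known, so that
    no duplicate item is created. All names here are illustrative.
    """
    if inchikey in known_inchikeys:
        return []
    lines = [
        "CREATE",                        # start a new item
        f'LAST\tP233\t"{smiles}"',       # canonical SMILES on the new item
        f'LAST\tP235\t"{inchikey}"',     # InChIKey on the new item
    ]
    if class_qid:
        # Optional classification; the Q-id is supplied by the caller.
        lines.append(f"LAST\tP31\t{class_qid}")
    return lines


# Ethanol as a small worked input; the "already known" set is empty here.
ethanol_key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
for line in create_statements("CCO", ethanol_key, known_inchikeys=set()):
    print(line)
```

Generating text rather than editing Wikidata directly preserves the manual check step the speaker mentions: the output can be read before it is pasted into Quickstatements.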
00:14:55.914 --> 00:14:59.010 Bioclipse itself is based on the Chemistry Development Kit 00:14:59.878 --> 00:15:01.868 and a few other libraries. 00:15:01.868 --> 00:15:04.723 This allows me to do the chemistry. 00:15:04.723 --> 00:15:07.973 And this is a very well-validated toolkit. 00:15:07.973 --> 00:15:11.653 The SMILES part has been done by John Mayfield. 00:15:11.653 --> 00:15:17.105 I have done a lot of validation against other tools. 00:15:17.522 --> 00:15:20.668 And the quality is actually really high now, 00:15:21.163 --> 00:15:26.328 comparable or in some cases even better than commercial cheminformatics tools. 00:15:26.328 --> 00:15:29.283 It has given me a lot of reassurance 00:15:29.283 --> 00:15:34.923 that the quality checking that we do with this tool on Wikidata 00:15:34.923 --> 00:15:37.312 is giving interesting results. 00:15:38.197 --> 00:15:40.431 This is the Quickstatements. 00:15:40.431 --> 00:15:43.275 Quickstatements is Magnus' work, of course. 00:15:44.379 --> 00:15:48.089 What happens is we take the SMILES, it calculates the InChI, 00:15:48.089 --> 00:15:52.239 and the InChIKey, it even looks up based on the InChIKey, 00:15:52.239 --> 00:15:54.199 if there is a PubChem identifier, 00:15:54.199 --> 00:15:58.609 then uses the InChIKey, the PubChem identifier, 00:15:58.609 --> 00:16:01.255 to see if this compound already is in Wikidata. 00:16:01.255 --> 00:16:03.780 And only if it's not already there, 00:16:03.780 --> 00:16:06.945 then it will actually create a CREATE statement. 00:16:09.330 --> 00:16:12.822 A bit of automatic classification here is an option. 00:16:12.822 --> 00:16:15.162 So if I'm adding a class of compounds, 00:16:15.162 --> 00:16:17.980 I can automatically indicate what these are all...
00:16:17.980 --> 00:16:19.970 this type of compounds, 00:16:19.970 --> 00:16:22.599 and I can also indicate, if needed, 00:16:22.599 --> 00:16:25.214 if there is a particular article where I got this information 00:16:25.214 --> 00:16:27.709 from automatically adding references. 00:16:29.836 --> 00:16:32.796 Well, this is what the Quickstatements output looks like 00:16:32.796 --> 00:16:35.508 for the annotation of main subjects. 00:16:36.277 --> 00:16:38.149 You've probably seen that as well. 00:16:40.650 --> 00:16:42.820 A newer thing that I started doing 00:16:42.820 --> 00:16:46.003 is actually doing reasoning on the data in Wikidata. 00:16:46.003 --> 00:16:51.457 So if I have the SMILES, then I can check the molecular formula, for example. 00:16:51.457 --> 00:16:53.427 I can check the InChIKey. 00:16:55.242 --> 00:17:00.498 At some point, what we are going to do is calculate physicochemical properties 00:17:00.498 --> 00:17:03.865 and see if that matches what is in Wikidata. 00:17:04.520 --> 00:17:07.151 This will highlight typos 00:17:07.846 --> 00:17:10.386 or wrong units, for example. 00:17:11.301 --> 00:17:14.591 At this moment... so this is a run of this morning. 00:17:14.591 --> 00:17:17.989 What we see here is two tests actually failing, 00:17:17.989 --> 00:17:19.539 and this is an example of it. 00:17:19.539 --> 00:17:23.300 This is the InChIKey that is computed from the isomeric SMILES 00:17:23.300 --> 00:17:26.957 is different from the InChIKey given in the entry. 00:17:27.700 --> 00:17:32.198 This can result from data being pulled in from different resources. 00:17:33.011 --> 00:17:35.889 So these are entries, about 300 of them, 00:17:35.889 --> 00:17:39.479 in the 160,000 chemicals that we have in Wikidata. 00:17:39.479 --> 00:17:41.632 So it's a very small amount, really, 00:17:42.392 --> 00:17:45.672 where there is information, and someone needs to look at it. 
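The failing check described above, where the InChIKey computed from the isomeric SMILES differs from the one stored in the entry, can be narrowed down by comparing the three hyphen-separated blocks of the key. The sketch below assumes the computed key is supplied by a cheminformatics toolkit; the function and block names are illustrative and not taken from the speaker's scripts.

```python
def differing_blocks(stored, computed):
    """Return the names of the InChIKey blocks in which two keys differ.

    An InChIKey has three hyphen-separated blocks: a hash of the
    connectivity, a hash of the remaining layers (stereochemistry,
    isotopes) plus flag characters, and a protonation character.
    """
    names = ("connectivity", "stereo_isotope", "protonation")
    stored_parts = stored.split("-")
    computed_parts = computed.split("-")
    if len(stored_parts) != 3 or len(computed_parts) != 3:
        raise ValueError("expected three hyphen-separated blocks")
    return [
        name
        for name, a, b in zip(names, stored_parts, computed_parts)
        if a != b
    ]


# Acetic acid versus acetate: same skeleton, different protonation.
acetic_acid = "QTBSBXVTEAMEQO-UHFFFAOYSA-N"
acetate = "QTBSBXVTEAMEQO-UHFFFAOYSA-M"
print(differing_blocks(acetic_acid, acetate))
```

A mismatch in only the last block points at a charge or protonation difference, while a mismatch in the first block means the stored key describes a differently connected molecule, which is usually the more serious curation problem.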
00:17:47.315 --> 00:17:51.008 Now, these are all organic compounds 00:17:51.008 --> 00:17:54.168 and also quite a few inorganic compounds 00:17:54.168 --> 00:17:57.281 where these things just work less well. 00:17:57.618 --> 00:18:01.105 But I found in the other test that is failing 00:18:01.105 --> 00:18:04.105 immediately a couple of things that are very clearly wrong. 00:18:09.186 --> 00:18:11.759 PubChem is a huge database. 00:18:11.996 --> 00:18:13.846 They do validation as well. 00:18:13.846 --> 00:18:16.622 We are in the process of submitting Wikidata there, 00:18:16.622 --> 00:18:18.812 which I'm really happy about. 00:18:19.350 --> 00:18:22.530 It's in the last validation step at this moment. 00:18:23.323 --> 00:18:25.643 And this will also mean that PubChem, 00:18:25.643 --> 00:18:29.143 which has something like 100 million compounds 00:18:29.143 --> 00:18:31.149 will actually link back to Wikidata. 00:18:31.488 --> 00:18:35.117 It already does this, but via Wikipedia. 00:18:35.387 --> 00:18:36.947 (laughing) 00:18:36.947 --> 00:18:38.434 Do you recognize it? 00:18:38.434 --> 00:18:43.067 With the aforementioned issues there of concept mismatches. 00:18:43.591 --> 00:18:45.691 So this will give us a second thing. 00:18:45.691 --> 00:18:49.679 And there, also, using the same Bioclipse scripts 00:18:49.679 --> 00:18:51.681 or similar Bioclipse scripts, 00:18:51.681 --> 00:18:53.493 we get validation reports, 00:18:53.493 --> 00:18:57.111 again indicating things that chemists should look at. 00:18:59.813 --> 00:19:01.540 That basically wraps it up. 00:19:01.803 --> 00:19:04.955 This is still a work in progress, the article is in preparation. 00:19:04.955 --> 00:19:08.048 I've been working with Finn here in Scholia 00:19:08.048 --> 00:19:10.705 to support this validation. 00:19:11.058 --> 00:19:15.538 We're writing up the full work, but for now you can look up this poster. 
00:19:15.538 --> 00:19:19.352 The slides are on the program of this session, 00:19:19.352 --> 00:19:21.825 so you can look at the slides and look at the details. 00:19:22.560 --> 00:19:24.939 And a quick acknowledgment: 00:19:24.939 --> 00:19:28.370 some of this work has been done by a number of grants that I received. 00:19:28.370 --> 00:19:29.836 And thank you very much. 00:19:30.282 --> 00:19:32.475 (applause) 00:19:35.922 --> 00:19:37.902 (chairman) Are there any questions? 00:19:41.142 --> 00:19:42.669 (person 3) Thank you so much for this. 00:19:42.669 --> 00:19:44.196 I am [inaudible], 00:19:44.196 --> 00:19:47.142 and so far, I've been reading articles 00:19:47.142 --> 00:19:49.722 on the [inaudible] Quickipedia on different compounds. 00:19:49.722 --> 00:19:53.770 I have a little bit more than 70 articles with different compounds-- 00:19:53.770 --> 00:19:55.464 just things I come across. 00:19:56.000 --> 00:19:58.021 And my question to you is 00:19:58.021 --> 00:20:02.371 if I want to move my chemistry activity from Wikipedia to Wikidata, 00:20:02.371 --> 00:20:05.300 how can I help in a way that is very friendly 00:20:05.300 --> 00:20:10.031 to somebody who is a beginner in that field on Wikidata? 00:20:12.262 --> 00:20:15.832 So, if that compound is in Wikipedia and.. 00:20:15.832 --> 00:20:18.092 Sometimes there is actually a Wikidata page. 00:20:18.092 --> 00:20:19.972 I occasionally run into this as well, 00:20:19.972 --> 00:20:21.902 in the last couple of months not so much anymore 00:20:21.902 --> 00:20:23.791 but this morning, actually. 
00:20:25.522 --> 00:20:27.433 And what I typically do then 00:20:27.433 --> 00:20:31.097 is I take the SMILES from [inaudible] infobox 00:20:31.097 --> 00:20:32.262 from that compound 00:20:32.262 --> 00:20:37.182 or use PubChem to look up the SMILES, check if the information is complete, 00:20:37.182 --> 00:20:38.817 particularly the stereochemistry, 00:20:39.282 --> 00:20:43.422 and then I use that create-Wikidata-items script 00:20:43.422 --> 00:20:46.320 to create Quickstatements for that compound. 00:20:47.506 --> 00:20:50.016 If there already is a Wikidata item, 00:20:50.016 --> 00:20:55.916 I basically just update these scripts, 00:20:55.916 --> 00:21:00.065 but rather than say, "CREATE, LAST," I replace the LAST with the Q-code 00:21:00.065 --> 00:21:01.710 that that item already has. 00:21:01.710 --> 00:21:04.650 And then it complements or it adds this information 00:21:04.650 --> 00:21:06.986 based on the information we had. 00:21:07.922 --> 00:21:10.422 This is [manuable], 00:21:10.422 --> 00:21:13.677 so you can copy-paste a number of SMILES, put it in a file, 00:21:13.677 --> 00:21:15.999 and take that. 00:21:18.351 --> 00:21:21.671 Extracting that information to Wikidata is not something I've automated yet, 00:21:21.671 --> 00:21:25.088 but this helps me... it's a pretty fast process. 00:21:25.782 --> 00:21:28.427 I can show you later how to use that software. 00:21:30.728 --> 00:21:32.478 (chairman) Are there other questions? 00:21:33.515 --> 00:21:34.784 So, I have one. 00:21:35.097 --> 00:21:39.825 Do you make an effort to, in fact, make this more visible 00:21:39.825 --> 00:21:42.445 in this bioinformatics community 00:21:42.445 --> 00:21:46.671 so that they can start using this structured data? 00:21:47.566 --> 00:21:49.326 Yeah, I'm actively doing that. 00:21:49.326 --> 00:21:52.256 So what I did not mention in this presentation so much, 00:21:52.256 --> 00:21:58.456 but we saw that in...
I'd have somewhere to start here-- 00:21:58.456 --> 00:22:00.669 this is an overview of different databases. 00:22:01.106 --> 00:22:04.706 A similar plot, which actually I do not have on this slide deck 00:22:04.706 --> 00:22:09.387 is the number of different identifiers that chemical compounds have, 00:22:09.387 --> 00:22:11.347 and I've been working with a number of databases, 00:22:11.347 --> 00:22:15.622 like MassBank, the Environmental Protection Agency, 00:22:17.184 --> 00:22:18.987 CompTox Dashboard. 00:22:19.507 --> 00:22:21.704 I've added links to the BDB database. 00:22:21.704 --> 00:22:24.159 So I'm working with a number of projects 00:22:24.159 --> 00:22:27.469 for pulling in additional information, 00:22:27.739 --> 00:22:30.030 identifiers, or links out to other databases. 00:22:31.192 --> 00:22:33.371 Regarding outreach, yes, 00:22:33.371 --> 00:22:36.961 so that wrong slide deck that I was showing at the start, 00:22:36.961 --> 00:22:39.283 that was actually a presentation two weeks ago 00:22:39.283 --> 00:22:42.148 at an Open Science Meeting around chemistry. 00:22:42.503 --> 00:22:45.843 I'm very much pushing this and... 00:22:47.410 --> 00:22:48.951 I see a big future here. 00:22:48.951 --> 00:22:50.780 There's a lot of interest. 00:22:50.780 --> 00:22:54.590 And making people aware of the CC0 license, 00:22:54.590 --> 00:22:57.616 that's typically the larger problem. 00:22:58.200 --> 00:23:02.967 So we have to pull in the information carefully. 00:23:05.303 --> 00:23:06.713 (chairman) Other questions? 00:23:08.643 --> 00:23:10.373 - Okay. - Thank you very much. 00:23:10.373 --> 00:23:12.036 (chairman) Can we thank the speaker. 00:23:12.036 --> 00:23:15.257 (applause)