(chairman) So, let's welcome Egon, who will describe how he is trying to improve the coverage of chemical compounds in Wikidata.

(Egon) Yeah, thank you. So, let's see... Oh, this is not right. Sorry. They put the wrong slide deck.

(person 1) The one was better than--

(person 2) [inaudible]

(person 1) (laughs)

(Egon) How about this? This one actually says WikidataCon. Slightly different slides. Okay, so yeah, coverage and correctness, accuracy, quality, if you like. And what makes this different from some of the other things that we've seen at WikidataCon is how I do this quality, and the coverage, actually. So I'm taking advantage here of my background, which is in cheminformatics, which is something that we use in our research. And cheminformatics is a way of understanding the chemical structures, which we will see in a moment. We can do things that we cannot do with the regular toolsets that we have, like Shape Expressions, quality constraints and the like.

Now, one of the interesting things about chemistry is that chemical structures are sometimes the same, sometimes not the same, depending on what you want to know. And this slide reflects that a bit: what we see here is biologically the same compound but chemically two different compounds. At the biological level, with the [inaudible], they are in equilibrium, and you will not really be able to distinguish between them, unless you're looking at a particular type of biology, like reaction mechanisms.

Another interesting thing about Wikidata and Wikipedia is that we have things like long-chain fatty acids, chemical concepts which are not a specific compound but actually a class of compounds. Now, this class can be based on similar features in the molecules, as in the case of the long-chain fatty acids: they all have a long fatty chain and an acid group.
In other cases, the classes are based more on the biological functionality, like a certain type of inhibitor, such as an ACE inhibitor. And this introduces a lot of interesting things, partly because of this close link with the Wikipedias. One of the things that we see is that a Wikipedia article may have a chembox for a particular compound but actually be about a compound class, resulting in a slightly different concept of what the two things actually mean, and in the sitelink being more complicated.

We need this for understanding the biology. The research in our group is about understanding the living cell. The systems biology here is described in biological processes, and we have a pathway database for that, WikiPathways. If we look at the chemistry in there, of the small molecules, the chemistry is sometimes described in a lot of detail, sometimes in less detail, pretty much like this Wikipedia-Wikidata link that we just had, resulting basically in links to a lot of different databases with slightly different focuses. Some databases, like LIPID MAPS and the Human Metabolome Database, are very much focused on the biology, whereas a database like ChEBI is very much focused on the chemical entities.

So we try to bridge that, and two or three years ago that gave us a very interesting insight. If you look at the lines here, where in blue we have the total number of small molecules we have in these Pathways and in red the numbers that we can match, there is this gap, and this gap is complicated chemistry. Also, poignantly, things missing in Wikidata. Hence the need for data completeness and data quality.

And here we have an example. This is actually a curation report from yesterday, and these are still things that we have in Pathways but for which we do not know what the equivalent thing is in Wikidata. And one of the things that I'm picking out here is strigolactone. And this is a class of compounds.
So we have that in one of our Pathways, this particular Pathway over there. So you start matching this to Wikidata and Wikipedia, and what you actually get for this compound class is a Wikipedia page with these six structures: images, names, no links, nothing. Nothing in Wikipedia beyond this information, and it is not machine-readable. Based on the name, I can actually find three out of these six compounds in Wikidata, not linked, not classified. And if we look at the class of strigolactones, of which these six are examples, Wikidata did not give us anything. So that's the kind of curation that I'm interested in.

On the right here, that page is actually pretty much empty, but it's exactly what Scholia is showing for this class of chemical compounds. Scholia is one of the tools that I've been using to do this curation. This missing classification is a bit of information missing in Wikidata, but we can add this classification, and we can retrieve it from some sources. As we will see with LIPID MAPS later, we can automate adding these missing links if we understand the chemistry.

So, this diagram over here: we have fatty acid over there again, and the long-chain fatty acid over here that we saw on one of the previous slides, very long-chain fatty acids, and a number of other fatty acids. This kind of information helps us see a [inaudible] of the chemistry in Wikidata.

Scholia can visualize the 2D structure, and this thing is actually automatically generated on the fly from the chemical structure in Wikidata, created as Scalable Vector Graphics. (coughing) Sorry. It comes with the stereochemistry annotation, to help the chemist see the completeness of the data, because the stereochemistry might also be missing.
We also get an overview on Scholia of related compounds based on the InChIKey, where the first block basically indicates how the atoms are connected, and the second block indicates things like stereochemistry and which isotopes are in there, for example C11 instead of C12, or C13 instead of C12. The last letter over here actually indicates the charge, so that's the example that we saw earlier between the citric acid and the citrate, or was it the acetic acid and the acetate? By putting in a bit of domain knowledge, we can do a lot more in making sense of what we have in Wikidata.

A bit more about Scholia: it is also about data completeness. The physical and chemical properties, the literature, those are all things that we want to have access to. But it only works if we can find the right chemical in Wikidata.

We started using Wikidata in a number of our projects, and WikiPathways was one of them. This is another project, in the area of nanosafety and risk assessment, where they use OECD testing guidelines, and we are using Wikidata here to make an overview of the experiments. This means that we can now actually start annotating articles where these protocols have been used. And in this way, we get a better insight into the quality of the literature as well. We get to see which DDTs are well tested, established experimental methods, and an indication of how good the data is that came out of them.

Another example: this is nanomaterials, specific nanomaterials, for which there is a unique code. We've added that, with the same purpose of being able to track down literature about these nanomaterials. But, again, we need exact descriptions.

Now, this is the LIPID MAPS classification, and here we see an interesting thing, which has shown up in some of the presentations elsewhere as well: this idea that some of the things that we have in Wikidata are not always matching the sources.
So, different ontological models, different ideas of what a particular thing means. If we look at LIPID MAPS, we have a lipid in the middle and then a number of classes, and many of these are in Wikidata. But here, around fatty acids or fatty acyls, that's where there is a mismatch, causing something that should be purely hierarchical to actually start showing some loops over there: the mismatch of two representations of lipid chemistry. Now, the goal of this work is not so much to reconcile this but to visualize it, so that we can understand what is going on and correct things that are clearly wrong.

The interesting thing about LIPID MAPS is that the classification is indicated in the external identifier. So one of the things that we've been using is these external identifiers to make this automatic classification, because everything that starts with LMFA05 is actually a fatty alcohol. So I can translate that into Quickstatements, push that into Quickstatements, and get that annotated in Wikidata.

This slide just reflects the advantage for LIPID MAPS here. We have been collaborating with them, because they get a lot of data out of Wikidata as well, which we can cross-reference and compare to see if it's correct. LIPID MAPS is quite a curated database, but like everyone they are having trouble with access to literature, the demand of literature, and filtering the literature, getting to the right articles.

Shape Expressions are probably something that you've seen. We have a few of them for chemistry now. This is the example for a racemic mixture. In the case of a racemic mixture, you want to have two parts in there. It's a mixture, so at least two chemical entities need to be in there. Moreover, each of the [inaudible] parts has to be a chemical compound. This is another level at which we can curate the content. There have to be more of them. We have quite a few different concepts in Wikidata, like groups of co-compounds.
There is a class of structurally similar compounds, etc. If you run a query like this, in this case for the other schema that we have, for chemical elements, you can do the same thing: you can run it on a single item, or you can run it on everything that is a chemical element. This is something that I can very much recommend having a look at if you have not done so already.

Now, if we go to the automation of things, here I'm using a tool called Bioclipse. This is something that we worked on some ten years ago. It's a platform for chemistry and biology, or cheminformatics and bioinformatics, aimed at automating things, including visualizations and the like. But I've taken that now and developed a number of scripts that I can actually run on the command line, which makes it easier to automate things, as we will see in a moment, and to do all sorts of things, for example the classification according to the LIPID MAPS identifiers. Those scripts are all available from the GitHub repository here. And typically, I have them create Quickstatements, because that gets me an additional check step after I've created the Quickstatements, to see what the data actually looks like.

Annotation of main subjects. This one is my script too, starting from SMILES, to actually add chemical compounds that are not in Wikidata yet, which happens a lot. So three or four weeks ago I added something like 500 compounds which our project was looking into, because these are volatile compounds in oils. This script adds the compounds. They will later on add the annotation of which species that compound comes from and what the properties are.

Bioclipse itself is based on the Chemistry Development Kit and a few other libraries. This allows me to do the chemistry. And this is a very well-validated toolkit. The SMILES part has been done by John Mayfield. I have done a lot of validation against other tools.
And the quality is actually really high now, comparable to or in some cases even better than commercial cheminformatics tools. It has given me a lot of reassurance that the quality checking that we do with this tool on Wikidata is giving interesting results.

This is the Quickstatements. Quickstatements is Magnus' work, of course. What happens is: we take the SMILES, it calculates the InChI and the InChIKey, and it even looks up, based on the InChIKey, whether there is a PubChem identifier. It then uses the InChIKey and the PubChem identifier to see if this compound is already in Wikidata. And only if it's not already there will it actually generate a CREATE statement. A bit of automatic classification is an option here. So if I'm adding a class of compounds, I can automatically indicate that these are all this type of compound, and I can also indicate, if needed, a particular article where I got this information from, automatically adding references. Well, this is what the Quickstatements output looks like for the annotation of main subjects. You've probably seen that as well.

A newer thing that I started doing is actually doing reasoning on the data in Wikidata. So if I have the SMILES, then I can check the molecular formula, for example. I can check the InChIKey. At some point, what we are going to do is calculate physicochemical properties and see if they match what is in Wikidata. This will highlight typos or wrong units, for example.

At this moment... so this is a run from this morning. What we see here is two tests actually failing, and this is an example of one. The InChIKey that is computed from the isomeric SMILES is different from the InChIKey given in the entry. This can result from data being pulled in from different resources. These are entries, about 300 of them, out of the 160,000 chemicals that we have in Wikidata. So it's a very small amount, really, where there is information that someone needs to look at.
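
The InChIKey comparison described in the talk can be illustrated with a minimal sketch. This is not the Bioclipse script used by the speaker. It assumes that the InChIKey computed from the SMILES is already available (for example from a cheminformatics toolkit) and only shows how a mismatch with the stored key could be narrowed down to one of the three blocks mentioned earlier in the talk. The function names `split_inchikey` and `compare_inchikeys` are hypothetical, and the example keys for acetic acid and acetate should be verified against a reference database before being relied on.

```python
import re

# A standard InChIKey has three blocks: 14 letters, 10 letters, 1 letter.
# Following the description in the talk: block 1 encodes the connectivity,
# block 2 covers things like stereochemistry and isotopes, and the final
# letter reflects the charge (protonation) state.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


def split_inchikey(key):
    """Return the three blocks of an InChIKey, or raise ValueError."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return match.groups()


def compare_inchikeys(computed, stored):
    """Report the first block in which two InChIKeys differ (hypothetical helper)."""
    computed_blocks = split_inchikey(computed)
    stored_blocks = split_inchikey(stored)
    if computed_blocks == stored_blocks:
        return "match"
    if computed_blocks[0] != stored_blocks[0]:
        return "first block differs (connectivity)"
    if computed_blocks[1] != stored_blocks[1]:
        return "second block differs (stereochemistry, isotopes)"
    return "last letter differs (charge)"


if __name__ == "__main__":
    # Example values (assumed): acetic acid versus acetate.
    acetic_acid = "QTBSBXVTEAMEQO-UHFFFAOYSA-N"
    acetate = "QTBSBXVTEAMEQO-UHFFFAOYSA-M"
    print(compare_inchikeys(acetic_acid, acetate))
```

Reporting which block differs, rather than only that the keys differ, is one possible way to help a curator decide whether an entry has the wrong structure, missing stereochemistry, or just a different charge state.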
Now, these are all organic compounds, and also quite a few inorganic compounds, where these things just work less well. But in the other test that is failing I immediately found a couple of things that are very clearly wrong.

PubChem is a huge database. They do validation as well. We are in the process of submitting Wikidata there, which I'm really happy about. It's in the last validation step at this moment. And this will also mean that PubChem, which has something like 100 million compounds, will actually link back to Wikidata. It already does this, but via Wikipedia. (laughing) Do you recognize it? With the aforementioned issues there of concept mismatches. So this will give us a second thing. And there, also, using the same or similar Bioclipse scripts, we get validation reports, again indicating things that chemists should look at.

That basically wraps it up. This is still a work in progress; the article is in preparation. I've been working with Finn here on Scholia to support this validation. We're writing up the full work, but for now you can look up this poster. The slides are on the program of this session, so you can look at the slides and look at the details. And a quick acknowledgment: some of this work has been supported by a number of grants that I received. And thank you very much.

(applause)

(chairman) Are there any questions?

(person 3) Thank you so much for this. I am [inaudible], and so far, I've been reading articles on the [inaudible] Wikipedia on different compounds. I have a little bit more than 70 articles with different compounds, just things I come across. And my question to you is: if I want to move my chemistry activity from Wikipedia to Wikidata, how can I help in a way that is very friendly to somebody who is a beginner in that field on Wikidata?

(Egon) So, if that compound is in Wikipedia and... Sometimes there is actually a Wikidata page.
I occasionally run into this as well, in the last couple of months not so much anymore, but this morning, actually. And what I typically do then is take the SMILES from [inaudible] infobox for that compound, or use PubChem to look up the SMILES, check if the information is complete, particularly the stereochemistry, and then I use that script that creates Wikidata items to create Quickstatements for that compound. If there already is a Wikidata item, I basically just update these scripts, but rather than saying "CREATE" and "LAST", I replace the LAST with the Q-code that the item already has. And then it complements, or adds, this information based on the information we had. This is [manuable], so you can copy-paste a number of SMILES, put them in a file, and take that. Extracting that information to Wikidata is not something I've automated yet, but this helps me... it's a pretty fast process. I can show you later how to use that software.

(chairman) Are there other questions? So, I have one. Do you make an effort to, in fact, make this more visible in the bioinformatics community, so that they can start using this structured data?

(Egon) Yeah, I'm actively doing that. So what I did not mention in this presentation so much, but we saw that in... I'd have somewhere to start here-- this is an overview of different databases. A similar plot, which I actually do not have in this slide deck, is the number of different identifiers that chemical compounds have, and I've been working with a number of databases, like MassBank and the Environmental Protection Agency's CompTox Dashboard. I've added links to the BDB database. So I'm working with a number of projects on pulling in additional information, identifiers or links out to other databases.

Regarding outreach, yes. That wrong slide deck that I was showing at the start was actually a presentation from two weeks ago at an Open Science meeting around chemistry. I'm very much pushing this and... I see a big future here.
There's a lot of interest. And making people aware of the CC0 license, that's typically the larger problem. So we have to pull in the information carefully.

(chairman) Other questions?

- Okay. - Thank you very much.

(chairman) Can we thank the speaker?

(applause)