(chairman) So, let's welcome Egon,
who will describe how he is trying
to improve the coverage
of chemical compounds in Wikidata.
(Egon) Yeah, thank you.
So, let's see...
Oh, this is not right.
Sorry.
They put the wrong slide deck.
(person 1) The one was better than--
(person 2) [inaudible]
(person 1) (laughs)
How about this?
This one actually says WikidataCon.
Slightly different slides.
Okay, so yeah, coverage and correctness,
accuracy, quality, if you like.
And the other thing here
is what makes it different
from some of the other things
that we've seen at the WikidataCon
is how I do this quality,
and the coverage, actually.
So I'm actually taking advantage here
of my background,
which is in cheminformatics,
which is something
that we use in our research.
And cheminformatics
is a way of understanding
chemical structures,
as we will see in a moment.
We can do things that we cannot do
with the regular toolsets
that we have, like Shape Expressions,
quality constraints and the like.
Now, one of the interesting
things about chemistry
is that chemical structures
are sometimes the same,
sometimes not the same,
depending on what you want to know.
And this slide reflects that a bit,
and what we see here
is biologically the same compounds
but chemically two different compounds.
But at biological levels
with the [inaudible],
they are in equilibrium,
and you will not be able
to really distinguish between them,
unless you're looking
for a particular type of biology
like reaction mechanisms.
Another interesting thing
about Wikidata and Wikipedia
is that we have things
like long-chain fatty acids,
chemical concepts
which are not a specific compound
but actually a class of compounds.
Now, this class can be based
on similar features in the molecules,
so like in the case
of the long-chain fatty acids--
they all have a long fatty chain
and an acid group.
In other cases, there are the classes
based more
on the biological functionality,
like a certain type of inhibitor,
like an ACE inhibitor.
And this introduces
a lot of interesting things,
partly because of this close link
with Wikipedias.
And one of the things that we see
is that Wikipedia may have a chembox
for a particular compound
but actually be about a compound class,
resulting in a slightly different concept
of what the two things
actually mean
and making the sitelink more complicated.
We need this
for understanding the biology.
So the research in our group
is understanding the living cell.
This is systems biology,
and to describe the biological processes
we have a pathway database
for that--WikiPathways,
and if we look at the chemistry in there
of the small molecules,
the chemistry is sometimes
described in a lot of detail,
sometimes in less detail,
pretty much like this Wikipedia-Wikidata
link that we just had,
resulting basically in links
to a lot of different databases
with slightly different focuses.
Some databases like LIPID MAPS
and the Human Metabolome Database,
they are very much focused on the biology,
whereas a database like ChEBI,
that's very much focused
on the chemical entities.
So we try to bridge that,
and that, two or three years ago, gave us
a very interesting insight
that if you look at the lines here
where in blue we have the total number
of the small molecules
we have in these Pathways
and the numbers in red
that we can match to that,
there is this gap, and this gap
is complicated chemistry.
Also, poignantly, things
missing in Wikidata.
So therefore the need
for data completeness
and data quality.
And here we have an example.
This is actually
a curation report of yesterday,
and these are still things
that we have in Pathways
but that we do not know
what the equivalent thing is in Wikidata.
And one of the things
that I'm picking out here
is strigolactone.
And this is a class of compounds.
So we have that in one of our Pathways,
this particular Pathway over there.
So you start matching this
to Wikidata and Wikipedia,
and you actually have
for this compound a Wikipedia page
with these six structures--
images, name, no links, nothing.
Nothing in Wikipedia,
just this information,
not machine-readable.
So based on the name,
I can actually find three out of the six
of these compounds in Wikidata,
not linked, not classified.
So if we look at the class
of strigolactones,
of which these six are examples,
Wikidata did not give us anything.
So that's the kind of curation
that I'm interested in.
On the right here--
that page is actually pretty much empty,
but it's exactly what Scholia is showing
for this class of chemical compounds.
So Scholia is one of the tools
that I've been using to do this curation.
This missing classification
is a bit of information
missing in Wikidata,
but we can add this classification,
and we can retrieve that
from some sources.
We will see with LIPID MAPS later,
we can automate
adding these missing links,
if we understand the chemistry.
So this diagram over here--
we have fatty acid over there again
and the long-chain fatty acid over here
that we saw on one
of the previous slides--
very long-chain fatty acids
and a number of other fatty acids.
This kind of information helps us see
a [inaudible] of the chemistry
in Wikidata.
Scholia can visualize the 2D structure,
and this thing is actually
automatically generated
from the chemical structure in Wikidata
on the fly, created
as Scalable Vector Graphics.
(coughing)
Sorry.
With the stereochemistry annotation there
to help the chemist see
the completeness of the data
because also the stereochemistry
might be missing.
We also get an overview on Scholia
of related compounds
based on the InChIKey,
where the first block basically indicates
how the atoms are connected
and the second block
indicates things like stereochemistry
and things like
which isotopes are in there,
for example C11 instead of C12
or C13 instead of C12.
The last character,
the last letter over here,
actually indicates the charge,
so that's the example that we saw earlier
between the citric acid and the citrate,
or was it the acetic acid and the acetate?
By putting in a bit of domain knowledge,
we can do a lot more... making sense
of what we have in Wikidata.
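The block structure of the InChIKey described here can be sketched in a few lines of Python. This is only an illustration, not the code Scholia uses: the function names `split_inchikey` and `same_skeleton` are made up for this example, and the two keys are the ones commonly listed for acetic acid and acetate in public databases.

```python
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    connectivity: str  # first block: 14 letters hashing how the atoms are connected
    remainder: str     # first 8 letters of the second block: stereochemistry, isotopes, ...
    flags: str         # last 2 letters of the second block: standard flag and version
    protonation: str   # final letter: "N" for neutral, other letters for charged forms


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey of the form XXXXXXXXXXXXXX-YYYYYYYYYY-Z into its parts."""
    blocks = key.strip().upper().split("-")
    if [len(b) for b in blocks] != [14, 10, 1] or not all(b.isalpha() for b in blocks):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    first, second, last = blocks
    return InChIKeyParts(first, second[:8], second[8:], last)


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True if two keys share the first block, i.e. the same atom connectivity."""
    return split_inchikey(key_a).connectivity == split_inchikey(key_b).connectivity


# Acetic acid and acetate: same first block, different final letter.
acetic_acid = "QTBSBXVTEAMEQO-UHFFFAOYSA-N"
acetate = "QTBSBXVTEAMEQO-UHFFFAOYSA-M"
print(same_skeleton(acetic_acid, acetate))
print(split_inchikey(acetic_acid).protonation, split_inchikey(acetate).protonation)
```

Grouping on the first block alone is what gives an overview of related compounds: the acid and its anion land in the same group, and only the final letter tells them apart.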
A bit more about Scholia:
it is also about data completeness,
with the physical and chemical properties,
the literature--those are all things
that we want to have access to.
But it only works if we can find
the right chemical in Wikidata.
We started using Wikidata
in a number of our projects,
so WikiPathways was one of them.
This is another project
in the area of nanosafety,
risk assessment,
where they use OECD testing guidelines,
and we are using Wikidata here
to make an overview of the experiments.
And this means that we can now
actually start annotating articles
where these protocols have been used.
And in this way, we get a better insight
into the quality of the literature as well.
We get to see which TGs
are well tested, established
experimental methods
and an indication of how good the data is
that came out of that.
Another example--this is nanomaterials,
specific nanomaterials,
where there is a unique code--
we've added that--
with the same purpose of being able
to track down literature
about these nanomaterials.
But, again, we need exact descriptions.
Now, this is
the LIPID MAPS classification,
and here we see an interesting thing,
and this has shown up
in some of the presentations,
elsewhere as well.
This idea that some of the things
that we have in Wikidata
are not always matching the sources.
So different ontological models,
different ideas
of what a particular thing means.
And so, if we look at LIPID MAPS,
we have a lipid in the middle
and then a number of classes,
and many of these are in Wikidata.
But here, around
fatty acids or fatty acyls,
that's where there is a mismatch
causing something that should be
purely hierarchical
to actually start showing
some loops over there:
the mismatch of two representations
of lipid chemistry.
Now, the goal of this work
is not so much to reconcile this
but to visualize it
so that we can understand
what is going on
and correct things
that are actually clearly wrong.
The interesting thing about LIPID MAPS
is actually that the classification
is indicated in the external identifier.
So one of the things
that we've been using
is these external numbers
to make this automatic classification
because everything
that starts with an LMFA05
is actually a fatty alcohol.
So I can translate that
into Quickstatements,
push that into Quickstatements
and get that annotated in Wikidata.
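A minimal sketch of that prefix-based classification, assuming the Wikidata items and their LIPID MAPS identifiers have already been retrieved. The Q-codes below are placeholders rather than real items, the example identifier is made up, and the choice of P279 (subclass of) as the classifying property is an assumption; the actual scripts may do this differently.

```python
from typing import Iterable, Iterator

# Hypothetical mapping from LIPID MAPS identifier prefixes to Wikidata class items.
# "Q_FATTY_ALCOHOL" is a placeholder, not a real Q-code.
PREFIX_TO_CLASS = {
    "LMFA05": "Q_FATTY_ALCOHOL",
}


def classify(
    items: Iterable[tuple[str, str]],
    prefix_to_class: dict[str, str] = PREFIX_TO_CLASS,
    prop: str = "P279",  # assumption: classify with "subclass of"
) -> Iterator[str]:
    """Yield one tab-separated Quickstatements line per item whose ID matches a prefix."""
    # Try longer prefixes first so that a more specific class wins.
    prefixes = sorted(prefix_to_class, key=len, reverse=True)
    for qid, lipidmaps_id in items:
        for prefix in prefixes:
            if lipidmaps_id.startswith(prefix):
                yield f"{qid}\t{prop}\t{prefix_to_class[prefix]}"
                break


# Illustrative input: (item, LIPID MAPS ID) pairs; both values are made up.
statements = list(classify([("Q_ITEM_1", "LMFA05000001"), ("Q_ITEM_2", "LMGP01010001")]))
print("\n".join(statements))
```

Only the first pair produces a line, because the second identifier matches no known prefix. The output can be reviewed by eye before it is pasted into Quickstatements.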
This slide is just reflecting
the advantage for LIPID MAPS here.
We've been collaborating with them
because they get a lot of data
out of Wikidata as well,
which we can cross-reference,
which we can compare to see if it's correct.
LIPID MAPS is a well-curated database,
but like everyone, they actually have trouble
with access to literature,
the amount of literature
and filtering the literature,
getting to the right articles.
Shape Expressions is probably
something that you've seen.
We have a few of them for chemistry now.
This is the example for a racemic mixture.
In the case of a racemic mixture,
you want to have two parts in there.
It's a mixture,
so at least two chemical entities
need to be in there.
Moreover, each of the [inaudible] parts
has to be a chemical compound.
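The two rules just stated can be written down as a plain check. This is a sketch in Python rather than an actual Shape Expression, and it assumes the parts of the mixture and the set of items known to be chemical compounds have already been fetched; the function name and the item identifiers are hypothetical.

```python
def check_racemic_mixture(parts: list[str], chemical_compounds: set[str]) -> list[str]:
    """Return a list of problems; an empty list means both rules are satisfied."""
    problems = []
    # Rule 1: it is a mixture, so at least two chemical entities need to be in there.
    if len(parts) < 2:
        problems.append(f"expected at least 2 parts, found {len(parts)}")
    # Rule 2: each part has to be a chemical compound.
    for part in parts:
        if part not in chemical_compounds:
            problems.append(f"{part} is not a chemical compound")
    return problems


# Placeholder identifiers, not real Q-codes.
compounds = {"Q_R_FORM", "Q_S_FORM"}
print(check_racemic_mixture(["Q_R_FORM", "Q_S_FORM"], compounds))
print(check_racemic_mixture(["Q_R_FORM"], compounds))
```

A schema expresses the same constraints declaratively, which is what makes it possible to run them against a single item or against every item of a class.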
This is another way
we can curate the content.
There have to be more of them.
We have quite a few
different concepts in Wikidata
like groups of co-compounds.
There is a class that is
of structurally similar compounds, etc.
If you run a query like this,
in this case for the other
schema that we have
for chemical elements,
you can do the same thing--
you can run it on a single item
or you can run that on everything
that is a chemical element.
This is something
that I can very much recommend
having a look at
if you have not done so already.
Now, if we go to the automation of things,
here I'm using a tool called Bioclipse.
This is something that we worked on
some ten years ago.
It's a platform
for chemistry and biology,
or cheminformatics and bioinformatics,
aimed at automating things,
including visualizations and the like.
But I've taken that now
and developed a number of scripts
that I can actually run
on the command line,
which makes it easier to automate things,
as we will see in a moment,
and doing all sorts of things,
for example, the classification
according to the LIPID MAPS identifiers.
The scripts are all available
from the GitHub repository here.
And typically, I have them
create Quickstatements
because that gets me
an additional check step
after I have created the Quickstatements
to see what the data actually looks like.
Annotation of main subjects.
This one is my script too,
starting from SMILES
to actually add chemical compounds
that are not in Wikidata yet,
which happens a lot.
So three or four weeks ago
I added something like 500 compounds
which our project was looking into
because these are
volatile compounds in oils.
This script adds the compounds.
We will later on add the annotation
of which species that compound comes from
and what the properties are.
Bioclipse itself is based
on the Chemistry Development Kit
and a few other libraries.
This allows me to do the chemistry.
And this is a very well-validated toolkit.
The SMILES part has been done
by John Mayfield.
I have done a lot of validation
against other tools.
And the quality
is actually really high now,
comparable to, or in some cases even better
than, commercial cheminformatics tools.
It has given me a lot of reassurance
that the quality checking that we do
with this tool on Wikidata
is giving interesting results.
This is the Quickstatements.
Quickstatements
is Magnus' work, of course.
What happens is: if we take the SMILES,
it calculates the InChI
and the InChIKey. It even looks up,
based on the InChIKey,
whether there is a PubChem identifier.
It uses the InChIKey
and the PubChem identifier
to see if this compound
already is in Wikidata.
And only if it's not already there
will it actually create
a CREATE statement.
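The "only create if it is not there yet" step can be sketched as follows. This assumes the InChIKey has already been computed by a cheminformatics toolkit and that the set of InChIKeys already present in Wikidata has been fetched beforehand, so no toolkit or network call appears here. The function name and the input dictionary are hypothetical; P233 (canonical SMILES) and P235 (InChIKey) are the Wikidata properties used for the values.

```python
def quickstatements_for(compound: dict[str, str], known_inchikeys: set[str]) -> list[str]:
    """Return Quickstatements lines for a new compound, or [] if it already exists."""
    if compound["inchikey"] in known_inchikeys:
        return []  # already in Wikidata: do not create a duplicate item
    return [
        "CREATE",
        f'LAST\tLen\t"{compound["label"]}"',       # English label
        f'LAST\tP233\t"{compound["smiles"]}"',     # canonical SMILES
        f'LAST\tP235\t"{compound["inchikey"]}"',   # InChIKey
    ]


acetic_acid = {
    "label": "acetic acid",
    "smiles": "CC(=O)O",
    "inchikey": "QTBSBXVTEAMEQO-UHFFFAOYSA-N",
}

# Nothing known yet: a CREATE block is produced.
print("\n".join(quickstatements_for(acetic_acid, known_inchikeys=set())))
# Already known: no statements at all.
print(quickstatements_for(acetic_acid, known_inchikeys={"QTBSBXVTEAMEQO-UHFFFAOYSA-N"}))
```

Producing text rather than editing Wikidata directly keeps the manual check step: the statements can be read before they are submitted.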
A bit of automatic classification
here is an option.
So if I'm adding a class of compounds,
I can automatically indicate
that these are all...
this type of compound,
and I can also indicate, if needed,
if there is a particular article
where I got this information from,
automatically adding references.
Well, this is what
the Quickstatements output looks like
for the annotation of main subjects.
You've probably seen that as well.
A newer thing that I started doing
is actually doing reasoning
on the data in Wikidata.
So if I have the SMILES, then I can check
the molecular formula, for example.
I can check the InChIKey.
At some point, what we are going to do
is calculate physicochemical properties
and see if that matches
what is in Wikidata.
This will highlight typos
or wrong units, for example.
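One such check, comparing a computed molecular formula with the recorded one, might look like the sketch below. It assumes the formula has already been computed from the SMILES by a toolkit, and it only handles flat formulas (no brackets, charges or isotope labels). The function names are made up for this example. Formulas in Wikidata are typically written with subscript digits, so those are normalized first.

```python
import re
from collections import Counter

# Map Unicode subscript digits to plain ASCII digits.
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
FLAT_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def element_counts(formula: str) -> Counter:
    """Count atoms per element in a flat formula such as 'C2H4O2' or 'C₂H₄O₂'."""
    plain = formula.translate(SUBSCRIPTS).replace(" ", "")
    if not FLAT_FORMULA.fullmatch(plain):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = Counter()
    for symbol, number in TOKEN.findall(plain):
        counts[symbol] += int(number) if number else 1
    return counts


def formulas_agree(computed: str, recorded: str) -> bool:
    """True if both formulas describe the same atoms, regardless of order or notation."""
    return element_counts(computed) == element_counts(recorded)


print(formulas_agree("C2H4O2", "C₂H₄O₂"))  # same atoms, different notation
print(formulas_agree("C2H4O2", "C₂H₆O"))   # a genuine mismatch to look at
```

Comparing element counts instead of raw strings avoids false alarms from element order or subscript notation, so only real differences are flagged for someone to review.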
At this moment...
so this is a run of this morning.
What we see here
is two tests actually failing,
and this is an example of it.
Here, the InChIKey
that is computed from the isomeric SMILES
is different from the InChIKey
given in the entry.
This can result from data
being pulled in from different resources.
So these are entries, about 300 of them,
in the 160,000 chemicals
that we have in Wikidata.
So it's a very small amount, really,
where there is information,
and someone needs to look at it.
Now, these are all organic compounds
and also quite a few inorganic compounds
where these things just work less well.
But in the other test
that is failing,
I immediately found a couple of things
that are very clearly wrong.
PubChem is a huge database.
They do validation as well.
We are in the process
of submitting Wikidata there,
which I'm really happy about.
It's in the last validation step
at this moment.
And this will also mean that PubChem,
which has something
like 100 million compounds
will actually link back to Wikidata.
It already does this, but via Wikipedia.
(laughing)
Do you recognize it?
With the aforementioned issues there
of concept mismatches.
So this will give us a second thing.
And there, also,
using the same Bioclipse scripts
or similar Bioclipse scripts,
we get validation reports,
again indicating things
that chemists should look at.
That basically wraps it up.
This is still a work in progress,
the article is in preparation.
I've been working
with Finn here in Scholia
to support this validation.
We're writing up the full work,
but for now you can look up this poster.
The slides are on the program
of this session,
so you can look at the slides
and look at the details.
And a quick acknowledgment:
some of this work has been supported
by a number of grants that I received.
And thank you very much.
(applause)
(chairman) Are there any questions?
(person 3) Thank you so much for this.
I am [inaudible],
and so far, I've been writing articles
on the [inaudible] Wikipedia
on different compounds.
I have a little bit more than 70 articles
with different compounds--
just things I come across.
And my question to you is
if I want to move my chemistry activity
from Wikipedia to Wikidata,
how can I help
in a way that is very friendly
to somebody who is a beginner
in that field on Wikidata?
(Egon) So, if that compound is in Wikipedia and...
Sometimes there is
actually a Wikidata page.
I occasionally run into this as well,
in the last couple of months
not so much anymore
but this morning, actually.
And what I typically do then
is I take the SMILES
from [inaudible] infobox
from that compound
or use PubChem to look up the SMILES,
check if the information is complete,
particularly the stereochemistry,
and then I use that
"create Wikidata item" script
to create Quickstatements
for that compound.
If there already is a Wikidata item,
I basically just update these scripts,
but rather than say CREATE and LAST,
I replace the LAST with the Q-code
that that item already has.
And then it complements
or it adds this information
based on the information we had.
This is [manuable],
so you can copy-paste
a number of SMILES, put it in a file,
and take that.
Extracting that information to Wikidata
is not something I've automated yet,
but this helps me...
it's a pretty fast process.
I can show you later
how to use that software.
(chairman) Are there other questions?
So, I have one.
Do you make an effort
to, in fact, make this more visible
in this bioinformatics community
so that they can start using
this structured data?
(Egon) Yeah, I'm actively doing that.
So what I did not mention
in this presentation so much,
but we saw that in...
I'd have somewhere to start here--
this is an overview
of different databases.
A similar plot, which actually
I do not have on this slide deck
is the number of different identifiers
that chemical compounds have,
and I've been working
with a number of databases,
like MassBank,
the Environmental Protection Agency's
CompTox Dashboard.
I've added links to the BDB database.
So I'm working with a number of projects
for pulling in additional information,
identifiers or links out
to other databases.
Regarding outreach, yes,
so that wrong slide deck
that I was showing at the start,
that was actually
a presentation two weeks ago
at an Open Science Meeting
around chemistry.
I'm very much pushing this and...
I see a big future here.
There's a lot of interest.
And making people aware
of the CC0 license,
that's typically the larger problem.
So we have to pull in
the information carefully.
(chairman) Other questions?
- Okay.
- Thank you very much.
(chairman) Can we thank the speaker.
(applause)