-
(chairman) So, let's welcome Egon,
-
who will describe how he is trying
-
to improve the coverage
of chemical compounds in Wikidata.
-
(Egon) Yeah, thank you.
-
So, let's see...
-
Oh, this is not right.
-
Sorry.
-
They put the wrong slide deck.
-
(person 1) The one was better than--
(person 2) [inaudible]
-
(person 1) (laughs)
-
How about this?
-
This one actually says WikidataCon.
-
Slightly different slides.
-
Okay, so yeah, coverage and correctness,
-
accuracy, quality, if you like.
-
And the other thing here
-
is what makes it different
from some of the other things
-
that we've seen at the WikidataCon
-
is how I do this quality,
-
and the coverage, actually.
-
So I'm actually taking advantage here
of my background,
-
which is in cheminformatics,
-
which is something
that we use in our research.
-
And cheminformatics
-
is a way of understanding
chemical structures,
-
as we will see in a moment.
-
We can do things that we cannot do
with the regular toolsets
-
that we have, like Shape Expressions,
quality constraints and the like.
-
Now, one of the interesting
things of chemistry
-
is that chemical structures
-
are sometimes the same,
sometimes not the same,
-
depending on what you want to know.
-
And this slide reflects that a bit,
-
and what we see here
is biologically the same compounds
-
but chemically two different compounds.
-
But at biological levels
with the [inaudible],
-
they are in equilibrium,
-
and you will not be able
to really distinguish between them,
-
unless you're looking
for a particular type of biology
-
like reaction mechanisms.
-
Another interesting thing
about Wikidata and Wikipedia
-
is that we have things
like long-chain fatty acids,
-
chemical concepts
which are not a specific compound
-
but actually a class of compounds.
-
Now, this class can be based
on similar features in the molecules,
-
so like in the case
of the long-chain fatty acids--
-
they all have a long fatty chain
and an acid group.
-
In other cases, there are the classes
-
based more
on the biological functionality,
-
like a certain type of inhibitor,
like an ACE inhibitor.
-
And this introduces
a lot of interesting things,
-
partly because of this close link
with Wikipedias.
-
And one of the things that we see
-
is that Wikipedia may have a chembox
for a particular compound
-
but actually be about a compound class,
-
resulting in a slightly different concept
-
of what the two things
are actually meaning
-
and the sitelink being more complicated.
-
We need this
for understanding the biology.
-
So the research in our group
is understanding the living cell.
-
The systems biology here
describes the biological processes;
-
we have a pathway database
for that--WikiPathways,
-
and if we look at the chemistry in there
of the small molecules,
-
the chemistry is sometimes
described in a lot of detail,
-
sometimes in less detail,
-
pretty much like this
Wikipedia-Wikidata link that we just had
-
resulting in basically links
to a lot of different databases
-
with slightly different focuses.
-
Some databases like LIPID MAPS
and the Human Metabolome Database,
-
they are very much focused on the biology,
-
whereas a database like ChEBI,
-
that's very much focused
on the chemical entities.
-
So we try to bridge that,
-
and that two-three years ago gave us
a very interesting insight
-
that if you look at the lines here
-
where in blue we have the total number
of the small molecules
-
we have in these Pathways
-
and the numbers in red
that we can match to that,
-
there is this gap, and this gap
is complicated chemistry.
-
Also, poignantly, things
missing in Wikidata.
-
So therefore the need
for data completeness
-
and the data quality.
-
And here we have an example.
-
This is actually
a curation report of yesterday,
-
and these are still things
that we have in Pathways
-
but that we do not know
what the equivalent thing is in Wikidata.
-
And one of the things here
that I'm picking out here
-
is strigolactone.
-
And this is a class of compounds.
-
So we have that in one of our Pathways,
this particular Pathway over there.
-
So you start matching this
to Wikidata and Wikipedia,
-
and you actually find
for this compound a Wikipedia page
-
with these six structures--
-
images, name, no links, nothing.
-
Nothing in Wikipedia,
-
just this information,
not machine-readable.
-
So based on the name,
I can actually find three out of the six
-
of these compounds in Wikidata,
-
not linked, not classified.
-
So if we look at the class
of strigolactones,
-
of which these six are examples,
-
Wikidata did not give us anything.
-
So that's the kind of curation
that I'm interested in.
-
On the right here--
that page is actually pretty much empty,
-
but it's exactly what Scholia is showing
for this class of chemical compounds.
-
So Scholia is one of the tools
that I've been using to do this curation.
-
This missing classification
is a bit of information
-
missing in Wikidata,
-
but we can add this classification,
-
and we can retrieve that
from some sources.
-
We will see with LIPID MAPS later,
-
we can automate
adding these missing links,
-
if we understand the chemistry.
-
So this diagram over here--
we have fatty acid over there again
-
and the long-chain fatty acid over here
-
that we saw on one
of the previous slides--
-
very long-chain fatty acids
and a number of other fatty acids.
-
This kind of information helps us see
-
a [inaudible] of the chemistry
in Wikidata.
-
Scholia can visualize the 2D structure,
-
and this thing is actually
automatically generated
-
from the chemical structure in Wikidata
-
created on the fly
as Scalable Vector Graphics.
-
(coughing)
-
Sorry.
-
With the stereochemistry annotation there
-
to help the chemist see
the completeness of the data
-
because also the stereochemistry
might be missing.
-
We also get an overview on Scholia
of related compounds
-
based on the InChIKey,
-
where the first block basically indicates
how the atoms are connected
-
and the second column
indicates things like stereochemistry
-
and things like
which isotopes are in there,
-
for example C11 instead of C12
-
or C13 instead of C12.
-
The last number
or the last letter over here
-
actually indicates the charge,
-
so that's the example that we saw earlier
between the citric acid and the citrate,
-
or was it the acetic acid and the acetate?
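The block structure of the InChIKey described here can be sketched in a few lines of Python. This is only an illustration, not how Scholia does it: the names `InChIKeyParts`, `split_inchikey` and `same_connectivity` are made up for this example, and it assumes standard InChIKeys of the form 14 letters, 10 letters, 1 letter.

```python
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    connectivity: str  # first block: how the atoms are connected
    detail: str        # second block: stereochemistry, isotopes and the like
    protonation: str   # final letter: charge (protonation) state


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its three hyphen-separated blocks."""
    blocks = key.strip().upper().split("-")
    if [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return InChIKeyParts(*blocks)


def same_connectivity(key_a: str, key_b: str) -> bool:
    """True if two keys share the first block, i.e. the same atom connectivity."""
    return split_inchikey(key_a).connectivity == split_inchikey(key_b).connectivity


if __name__ == "__main__":
    acetic_acid = "QTBSBXVTEAMEQO-UHFFFAOYSA-N"
    acetate = "QTBSBXVTEAMEQO-UHFFFAOYSA-M"
    print(split_inchikey(acetic_acid))
    print(same_connectivity(acetic_acid, acetate))
```

Grouping items by the first block is one way to list related compounds that differ only in stereochemistry, isotopes or charge.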
-
By putting in a bit of domain knowledge,
-
we can do a lot more... making sense
of what we have in Wikidata.
-
A bit more about Scholia:
-
it is about data completeness
with the physical and chemical properties,
-
the literature; those are all things
that we want to have access to.
-
But it only works if we can find
the right chemical in Wikidata.
-
We started using Wikidata
in a number of our projects,
-
so WikiPathways was one of them.
-
This is another project
-
in the area of the nanosafety,
risk assessment,
-
where they use OECD testing guidelines
-
and using Wikidata here
to make an overview of the experiments.
-
And this means that we can now
actually start annotating articles
-
where these protocols have been used.
-
And in this way, we get a better insight
in the quality of literature as well.
-
We get to see which guidelines
-
are well tested, established
experimental methods
-
and an indication of how good the data is
that came out of that.
-
Another example--this is nanomaterials,
-
specific nanomaterials,
-
where there is a unique code--
we've added that--
-
with the same purpose of being able
-
to track down literature
about these nanomaterials.
-
But, again, we need exact descriptions.
-
Now, this is
the LIPID MAPS classification,
-
and here we see an interesting thing,
-
and this has shown up
in some of the presentations,
-
elsewhere as well.
-
This idea that some of the things
that we have in Wikidata
-
are not always matching the sources.
-
So different ontological models,
-
different ideas
of what a particular thing means.
-
And so, if we look at the LIPID MAPS,
we have a lipid in the middle
-
and then a number of classes,
-
and many of these are in Wikidata.
-
But here, around actually
fatty acids or fatty acyls,
-
that's where there is a mismatch
-
causing something that should be
actually purely hierarchical
-
to start showing
some loops over there,
-
the mismatch of two representations
of lipid chemistry.
-
Now, the goal of this work
is not so much to reconcile this
-
but to visualize it
-
so that we can understand
what is going on
-
and correct things
that are actually clearly wrong.
-
The interesting thing about LIPID MAPS
is actually that the classification
-
is indicated in the external identifier.
-
So one of the things
that we've been using
-
is these external numbers
to make this automatic classification
-
because everything
that starts with an LMFA05
-
is actually a fatty alcohol.
-
So I can translate that
into Quickstatements,
-
push that into Quickstatements
-
and get that annotated in Wikidata.
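The prefix-based classification just described could look roughly like this. It is a minimal sketch, not the actual Bioclipse script: the function name `classify_by_prefix` is invented, the class QID `Q_FATTY_ALCOHOL` is a placeholder rather than a real Wikidata item, and the choice of `P279` (subclass of) as the linking property is an assumption. Only the pairing of `LMFA05` with fatty alcohols comes from the talk.

```python
from typing import Dict, Iterable, List, Tuple

# Placeholder table: prefix of the LIPID MAPS identifier -> class item.
# The QID below is NOT a real Wikidata identifier; fill in the real one.
PREFIX_TO_CLASS: Dict[str, str] = {
    "LMFA05": "Q_FATTY_ALCOHOL",
}


def classify_by_prefix(
    items: Iterable[Tuple[str, str]],
    prefix_to_class: Dict[str, str] = PREFIX_TO_CLASS,
    prop: str = "P279",  # assumption: classes are linked with "subclass of"
) -> List[str]:
    """Turn (item QID, LIPID MAPS ID) pairs into QuickStatements lines."""
    lines = []
    for qid, lipidmaps_id in items:
        # Pick the longest matching prefix so deeper classes win.
        matches = [p for p in prefix_to_class if lipidmaps_id.startswith(p)]
        if not matches:
            continue
        best = max(matches, key=len)
        lines.append(f"{qid}\t{prop}\t{prefix_to_class[best]}")
    return lines


if __name__ == "__main__":
    pairs = [("Q1", "LMFA05000001"), ("Q2", "LMGP01010001")]
    print("\n".join(classify_by_prefix(pairs)))
```

The output is plain tab-separated text, so it can be read and checked by eye before it is pasted into QuickStatements.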
-
This slide is just reflecting
the advantage for LIPID MAPS here
-
with whom we have been collaborating,
-
because they get a lot of data
out of Wikidata as well,
-
which we can cross-reference,
which we can compare if it's correct.
-
LIPID MAPS is quite a curated database
-
but, like everyone, is actually having trouble
-
with access to literature,
-
the demand of literature
and filtering the literature,
-
getting to the right articles.
-
Shape Expressions is probably
something that you've seen.
-
We have a few of them for chemistry now.
-
This is the example for racemic mixture.
-
In a case of racemic mixture,
you want to have two parts in there.
-
It's a mixture,
-
so at least two chemical entities
need to be in there.
-
Moreover, each of the [inaudible] parts
has to be a chemical compound.
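The two constraints mentioned for a racemic mixture can be written out as a plain check. This is not Shape Expressions syntax and not the schema shown on the slide; it is a Python analogue under stated assumptions: the item is a simple dictionary with hypothetical keys `has_parts` and `instance_of`, and `Q11173` is taken to be the Wikidata item for "chemical compound".

```python
from typing import Dict, List

CHEMICAL_COMPOUND = "Q11173"  # assumed QID for "chemical compound"


def check_racemic_mixture(item: Dict) -> List[str]:
    """Return a list of problems; an empty list means the item conforms."""
    problems = []
    parts = item.get("has_parts", [])
    # Constraint 1: a mixture needs at least two chemical entities.
    if len(parts) < 2:
        problems.append("fewer than two parts")
    # Constraint 2: every part has to be typed as a chemical compound.
    for part in parts:
        if CHEMICAL_COMPOUND not in part.get("instance_of", []):
            problems.append(f"{part.get('id', '?')} is not a chemical compound")
    return problems


if __name__ == "__main__":
    mixture = {
        "id": "Q_MIXTURE",  # placeholder identifiers throughout
        "has_parts": [
            {"id": "Q_R_FORM", "instance_of": ["Q11173"]},
            {"id": "Q_S_FORM", "instance_of": []},
        ],
    }
    print(check_racemic_mixture(mixture))
```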
-
This is another level of a way
we can curate the content.
-
There have to be more of them.
-
We have quite a few
different concepts in Wikidata
-
like groups of co-compounds.
-
There is a class that is
of structurally similar compounds, etc.
-
If you run a query like this,
in this case for the other
-
schema that we have
for chemical elements,
-
you can do the same thing--
you can run it on a single item
-
or you can run that on everything
that is a chemical element.
-
This is something
that I can very much recommend
-
having a look at
if you have not done so already.
-
Now, if we go to the automation of things,
-
here I'm using a tool called Bioclipse.
-
This is something that we worked on
some ten years ago.
-
It's a platform
for chemistry and biology,
-
or cheminformatics and bioinformatics,
-
aimed at automating things,
-
including visualizations and the like.
-
But I've taken that now
and developed a number of scripts
-
that I can actually run
on the command line,
-
which makes it easier to automate things,
as we will see in a moment,
-
and doing all sorts of things,
-
for example, the classification
according to the LIPID MAPS identifiers.
-
The scripts are all available
from the GitHub repository here.
-
And typically, I have them
create Quickstatements
-
because that gets me
an additional check step
-
after I created the Quickstatements
and can see what the data actually looks like.
-
Annotation of main subjects.
-
This one is my script too,
starting from SMILES
-
to actually add chemical compounds
that are not in Wikidata yet,
-
which happens a lot.
-
So three or four weeks ago
I added something like 500 compounds
-
which our project was looking into
-
because these are
volatile compounds in oils.
-
This script adds the compounds.
-
They will later on add the annotation
of which species that compound comes from
-
and what the properties are.
-
Bioclipse itself is based
on the Chemistry Development Kit
-
and a few other libraries.
-
This allows me to do the chemistry.
-
And this is a very well-validated toolkit.
-
The SMILES part has been done
by John Mayfield.
-
I have done a lot of validation
against other tools.
-
And the quality
is actually really high now,
-
comparable to, or in some cases even better
than, commercial cheminformatics tools.
-
It has given me a lot of reassurance
-
that the quality checking that we do
with this tool on Wikidata
-
is giving interesting results.
-
This is the Quickstatements.
-
Quickstatements
is Magnus' work, of course.
-
What happens is: we take the SMILES,
it calculates the InChI
-
and the InChIKey, and it even looks up,
based on the InChIKey,
-
if there is a PubChem identifier.
-
Then it uses the InChIKey
and the PubChem identifier
-
to see if this compound
already is in Wikidata.
-
And only if it's not already there,
-
then it will actually create
a CREATE statement.
-
A bit of automatic classification
here is an option.
-
So if I'm adding a class of compounds,
-
I can automatically indicate
what these are all...
-
this type of compounds,
-
and I can also indicate, if needed,
-
if there is a particular article
where I got this information from,
-
automatically adding references.
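The flow described here (check whether the compound is already known, and only then emit a CREATE block, optionally with a class and a reference) might be sketched as follows. This is not the Bioclipse script itself. The function name and arguments are made up; the InChI, InChIKey and PubChem CID are passed in already computed, because calculating them needs a cheminformatics toolkit such as the CDK; and the property choices (`P233`, `P234`, `P235`, `P662`, `P31`, and `S248` for "stated in") are assumptions about what such a script would write.

```python
from typing import List, Optional, Set


def create_statements(
    smiles: str,
    inchi: str,
    inchikey: str,
    pubchem_cid: Optional[str],
    existing_inchikeys: Set[str],
    existing_cids: Set[str],
    class_qid: Optional[str] = None,   # optional automatic classification
    source_qid: Optional[str] = None,  # optional article used as reference
) -> List[str]:
    """Return QuickStatements lines for a new compound, or [] if it exists."""
    known = inchikey in existing_inchikeys or (
        pubchem_cid is not None and pubchem_cid in existing_cids
    )
    if known:
        return []  # already in Wikidata: do not create a duplicate item

    ref = f"\tS248\t{source_qid}" if source_qid else ""
    lines = [
        "CREATE",
        f'LAST\tP233\t"{smiles}"{ref}',
        f'LAST\tP234\t"{inchi}"{ref}',
        f'LAST\tP235\t"{inchikey}"{ref}',
    ]
    if pubchem_cid is not None:
        lines.append(f'LAST\tP662\t"{pubchem_cid}"')
    if class_qid is not None:
        lines.append(f"LAST\tP31\t{class_qid}{ref}")
    return lines


if __name__ == "__main__":
    out = create_statements(
        "CCO",
        "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
        "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
        None,
        existing_inchikeys=set(),
        existing_cids=set(),
    )
    print("\n".join(out))
```

In practice the two "existing" sets would come from a query against Wikidata; here they are plain sets so the sketch stays self-contained.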
-
Well, this is what
the Quickstatements output looks like
-
for the annotation of main subjects.
-
You've probably seen that as well.
-
A newer thing that I started doing
-
is actually doing reasoning
on the data in Wikidata.
-
So if I have the SMILES, then I can check
the molecular formula, for example.
-
I can check the InChIKey.
-
At some point, what we are going to do
is calculate physicochemical properties
-
and see if that matches
what is in Wikidata.
-
This will highlight typos
-
or wrong units, for example.
-
At this moment...
so this is a run of this morning.
-
What we see here
is two tests actually failing,
-
and this is an example of it.
-
Here, the InChIKey
that is computed from the isomeric SMILES
-
is different from the InChIKey
given in the entry.
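A consistency test of this kind can be sketched as a comparison between a stored value and a recomputed one. This is an illustration only: the entry layout (`qid`, `isomeric_smiles`, `inchikey`) and the function name are invented, and the actual InChIKey calculation is passed in as `compute_inchikey`, since it has to come from a cheminformatics toolkit and is not reimplemented here.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def find_inchikey_mismatches(
    entries: Iterable[Dict[str, str]],
    compute_inchikey: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (qid, stored key, computed key) for every inconsistent entry."""
    mismatches = []
    for entry in entries:
        smiles = entry.get("isomeric_smiles")
        stored = entry.get("inchikey")
        if not smiles or not stored:
            continue  # nothing to compare; completeness is a separate check
        computed = compute_inchikey(smiles)
        if computed != stored:
            mismatches.append((entry["qid"], stored, computed))
    return mismatches


if __name__ == "__main__":
    # Stand-in for a real toolkit call: a lookup table for one SMILES.
    table = {"CCO": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"}
    entries = [
        {"qid": "Q_OK", "isomeric_smiles": "CCO",
         "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
        {"qid": "Q_BAD", "isomeric_smiles": "CCO",
         "inchikey": "QTBSBXVTEAMEQO-UHFFFAOYSA-N"},
    ]
    for row in find_inchikey_mismatches(entries, table.__getitem__):
        print(row)
```

The report only flags entries for a chemist to look at; it does not decide which of the two values is the wrong one.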
-
This can result from data
being pulled in from different resources.
-
So these are entries, about 300 of them,
-
in the 160,000 chemicals
that we have in Wikidata.
-
So it's a very small amount, really,
-
where there is information,
and someone needs to look at it.
-
Now, these are all organic compounds
-
and also quite a few inorganic compounds
-
where these things just work less well.
-
But I found in the other test
that is failing
-
immediately a couple of things
that are very clearly wrong.
-
PubChem is a huge database.
-
They do validation as well.
-
We are in the process
of submitting Wikidata there,
-
which I'm really happy about.
-
It's in the last validation step
at this moment.
-
And this will also mean that PubChem,
-
which has something
like 100 million compounds
-
will actually link back to Wikidata.
-
It already does this, but via Wikipedia.
-
(laughing)
-
Do you recognize it?
-
With the aforementioned issues there
of concept mismatches.
-
So this will give us a second thing.
-
And there, also,
using the same Bioclipse scripts
-
or similar Bioclipse scripts,
-
we get validation reports,
-
again indicating things
that chemists should look at.
-
That basically wraps it up.
-
This is still a work in progress,
the article is in preparation.
-
I've been working
with Finn here in Scholia
-
to support this validation.
-
We're writing up the full work,
but for now you can look up this poster.
-
The slides are on the program
of this session,
-
so you can look at the slides
and look at the details.
-
And a quick acknowledgment:
-
some of this work has been funded
by a number of grants that I received.
-
And thank you very much.
-
(applause)
-
(chairman) Are there any questions?
-
(person 3) Thank you so much for this.
-
I am [inaudible],
-
and so far, I've been reading articles
-
on the [inaudible] Wikipedia
on different compounds.
-
I have a little bit more than 70 articles
with different compounds--
-
just things I come across.
-
And my question to you is
-
if I want to move my chemistry activity
from Wikipedia to Wikidata,
-
how can I help
in a way that is very friendly
-
to somebody who is a beginner
in that field on Wikidata?
-
(Egon) So, if that compound is in Wikipedia and...
-
Sometimes there is
actually a Wikidata page.
-
I occasionally run into this as well,
-
in the last couple of months
not so much anymore
-
but this morning, actually.
-
And what I typically do then
-
is I take the SMILES
from [inaudible] infobox
-
from that compound
-
or use PubChem to look up the SMILES,
check if the information is complete,
-
particularly the stereochemistry,
-
and then I use
that create-Wikidata-item script
-
to create Quickstatements
for that compound.
-
If there already is a Wikidata item,
-
I basically just update these scripts,
-
but rather than say, "CREATE" and "LAST,"
I replace the LAST with the Q-code
-
that that item already has.
-
And then it complements
or it adds this information
-
based on the information we had.
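The edit described in this answer, dropping the CREATE line and pointing the LAST statements at an item that already exists, can be sketched as a small text transformation. The function name `retarget_to_existing_item` is made up, and the input is assumed to be tab-separated QuickStatements lines.

```python
import re
from typing import Iterable, List


def retarget_to_existing_item(lines: Iterable[str], qid: str) -> List[str]:
    """Rewrite a CREATE/LAST block so it adds statements to an existing item."""
    if not re.fullmatch(r"Q\d+", qid):
        raise ValueError(f"not a Q-code: {qid!r}")
    result = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "CREATE":
            continue  # the item already exists, so nothing is created
        if fields[0] == "LAST":
            fields[0] = qid  # attach the statement to the existing item
        result.append("\t".join(fields))
    return result


if __name__ == "__main__":
    block = [
        "CREATE",
        'LAST\tP233\t"CCO"',
        'LAST\tP235\t"LFQSCWFLJHTTHZ-UHFFFAOYSA-N"',
    ]
    print("\n".join(retarget_to_existing_item(block, "Q123")))
```

The Q-code in the usage lines is a placeholder; it has to be the code of the item that was found for the compound.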
-
This is [manuable],
-
so you can copy-paste
a number of SMILES, put it in a file,
-
and take that.
-
Extracting that information to Wikidata
is not something I've automated yet,
-
but this helps me...
it's a pretty fast process.
-
I can show you later
how to use that software.
-
(chairman) Are there other questions?
-
So, I have one.
-
Do you make an effort
to, in fact, make this more visible
-
in this bioinformatics community
-
so that they can start using
this structured data?
-
(Egon) Yeah, I'm actively doing that.
-
So what I did not mention
in this presentation so much,
-
but we saw that in...
I'd have somewhere to start here--
-
this is an overview
of different databases.
-
A similar plot, which actually
I do not have on this slide deck
-
is the number of different identifiers
that chemical compounds have,
-
and I've been working
with a number of databases,
-
like MassBank,
the Environmental Protection Agency's
-
CompTox Dashboard.
-
I've added links to the BDB database.
-
So I'm working with a number of projects
-
for pulling in additional information,
-
identifiers, or links out
to other databases.
-
Regarding outreach, yes,
-
so that wrong slide deck
that I was showing at the start,
-
there was actually
a presentation two weeks ago
-
at an Open Science Meeting
around chemistry.
-
I'm very much pushing this and...
-
I see a big future here.
-
There's a lot of interest.
-
And making people aware
of the CC0 license,
-
that's typically the larger problem.
-
So we have to pull in
the information carefully.
-
(chairman) Other questions?
-
- Okay.
- Thank you very much.
-
(chairman) Can we thank the speaker.
-
(applause)