Advanced Data Management and Utilization: Big Data and IoT/CPS 00:00:06— Hello, everyone. I am Toshiyuki Amagasa of the University of Tsukuba Center for Computational Sciences and Artificial Intelligence Research Center. I would like to begin my data science video lecture (Advanced Data Management) on big data and IoT/CPS. Thank you for your time.

I will explain the aims of this lecture. First of all, you will learn about the concept of big data. Next, you will learn about the IoT/CPS systems which support a digitalized society. You will also develop an understanding of the data systems which support these systems. Finally, you will learn about the issues in utilizing big data. With that said, I would like to dive in right away.

First, since it is in the title of the lecture, I will explain what big data is. True to its name, big data refers to massive amounts of data, but just how massive are we talking about? The standard definition is shown here: big data is data so large that it is quite difficult to process using the typical computers and software of the time. Such massive amounts of data are referred to as big data.

So, where is data like this found? While we are usually not very aware of it, the world is overflowing with data. To give some examples of big data, the amount of data generated every minute is shown here. For example, the Google search engine performs 6 million searches per minute. On YouTube, a video streaming site you probably use regularly, 500 hours of video are posted per minute. Similarly, on the well-known social media site Facebook, 150,000 messages are posted per minute, and likewise, there are 350,000 messages posted per minute on the social media site X (formerly known as Twitter). As for the email we use every day, statistics indicate that there are 240 million emails sent per minute around the world. If such a massive amount of data is generated in one minute, then you can imagine what a huge amount of data is generated per day or per year.

This diagram shows the growth in digital data worldwide. This diagram is a little out of date, but the horizontal axis represents time, and the vertical axis represents the amount of data. As you can see, the amount of data generated is growing rapidly. The units on the vertical axis may be unfamiliar, but ZB stands for zettabytes, with 1 ZB equal to 1 trillion GB, so clearly this is an extremely large amount. Under the current circumstances, this huge amount of data is generated daily, and it is increasing at an accelerating rate.

One of the catalysts for generating such an enormous amount of data is the increasing digitalization in every sector. Digitalization refers to loading information from the real world onto computers so that it can be processed and analyzed on computers. That is to say, digitalization has made things which were never treated as data before into things that can be loaded onto a computer and processed. There are various important factors in the advance of digitalization, but the first is the reduction in the size and cost of things like computers and sensors. Another is the ability to transmit data via high-speed wide-area network infrastructure. This makes it possible, for example, to measure previously unmeasured temperature and humidity readings using the sensors in smartphones and the like, so there are now measurements everywhere there are people.
There are security cameras on every street corner, and if you include images taken with the cameras in the smartphones people carry, it is now possible to record videos and images anywhere. Smartphones are equipped with GPS sensors as well, so it is possible to accurately record their location. Also, while you usually may not give it much thought, the act of viewing the web on commonly used web browsers is itself subject to digitalization. We will look at these aspects in more detail later.

Now, let's look at some specific examples of utilizing data. The first is air conditioning in buildings and factories. Large numbers of environmental sensors are installed on the roofs of factories, and they are able to measure factors like the internal temperature and humidity accurately and comprehensively. This makes it possible to optimize the air conditioning layout and output, the airflow, and the placement of people and things inside. As a result, it improves the efficiency of air conditioning and reduces energy costs, while also helping to boost workplace productivity.

I am sure you regularly use smartphones, and this next example is about map apps: Google Maps, for example, or the iOS Maps app if you use an iPhone. You probably know that map apps can superimpose traffic congestion information on the map. Google and Apple are able to collect and analyze smartphone location information in real time to predict congestion conditions in real time, which makes these services possible. Specifically, if devices are moving above a certain speed on the map, you can tell that area is not congested, and conversely, if the devices are moving very slowly or nearly stopped, you can tell that congestion is occurring there. The older VICS was a similar system, which predicted congestion using probe cars and a relatively small number of sensors at intersections and on street corners. In comparison, the services by Google and Apple are able to use the very large amount of smartphone location information, giving them a higher density of information and an overwhelmingly larger coverage area, which enables them to make more accurate congestion predictions.

The next may be a somewhat surprising example, but it relates to viewing various websites with the web browsers we commonly use. Many websites are digitalizing the browsing behavior of users: for example, collecting data on what links they click, what pages they view, whether they move on to the next page and the course of actions leading up to it, their dwell time, and so on. If a page is displayed for a long time, you can tell the user is interested in that page, and if they quickly move on to the next page, you can tell they are not very interested. This data can be obtained by analyzing the web server access logs. The results are important clues in selecting content that will attract more viewers and fine-tuning the layout. This is called web content optimization.

Lastly, I will explain an example of data recommendation. Many of you have seen that shopping sites like Amazon will recommend products with a message saying that people who bought this product also bought X. These product recommendations are known to increase the site's sales significantly if the recommendations are pertinent. For this reason, it is extremely important to give accurate recommendations. In recent years, a technique called collaborative filtering has been used for this. In collaborative filtering, products purchased by a user are recommended to other users with similar purchasing behavior. Alternatively, items that tend to be purchased together with items a user has already bought are recommended to that user. Interestingly, it is possible to give accurate recommendations based solely on the site's purchasing records, without looking up the details of the product or the user at all. Because the recommendations are made by multiple users and products in collaboration, it is called collaborative filtering.
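To make the idea more concrete, here is a minimal sketch of item-based collaborative filtering, assuming a small, made-up table of purchase records; the product names and the choice of cosine similarity over purchase vectors are illustrative assumptions, not a description of any particular site's system.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase records: user -> set of purchased items
purchases = {
    "user1": {"coffee", "mug", "filter"},
    "user2": {"coffee", "filter"},
    "user3": {"coffee", "mug"},
    "user4": {"tea", "mug"},
}

# Build item -> set of users who bought it
buyers = defaultdict(set)
for user, items in purchases.items():
    for item in items:
        buyers[item].add(user)

def item_similarity(a, b):
    """Cosine similarity between two items, based on who bought them."""
    common = len(buyers[a] & buyers[b])
    if common == 0:
        return 0.0
    return common / (sqrt(len(buyers[a])) * sqrt(len(buyers[b])))

def recommend(user, top_n=3):
    """Score items the user has not bought by their similarity to items
    the user has bought ("people who bought this also bought ...")."""
    owned = purchases[user]
    scores = defaultdict(float)
    for candidate in buyers:
        if candidate in owned:
            continue
        for item in owned:
            scores[candidate] += item_similarity(candidate, item)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(recommend("user2"))  # suggests "mug", because coffee and filter buyers also tend to buy mugs
```

Note that the sketch never looks at what the products actually are; the recommendations come purely from the purchase records, which is exactly the point made above.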
There are many more examples of data utilization like these. In many sports, teams analyze the performance of the team and its athletes to devise strategies, make decisions on player positions, and design training plans. In literature, statistical analysis of the frequency with which terms occur in literary works and the like is used to evaluate things such as an author's style or the authenticity of a work. Even in the arts, details like the shape, coloring, and three-dimensional structure of a work of art are analyzed in the same way to characterize the artist's style or to determine the authenticity of the piece. In the field of biology, researchers analyze DNA data obtained from organisms (bioinformatics), as well as behavioral data from GPS sensors and camera sensors attached to animals, which is called biologging. There are many such cases in which data analysis utilizing this kind of digitalization is employed.

One thing that is extremely important when using big data is the loop shown here. Big data analysis is not something you do just once and are finished with. To start with, collecting big data requires systems for continuous acquisition and storage, so first you have to establish these. You then perform analysis on the collected data, but you need to understand that the big data analysis step itself is very difficult and takes a long time. As shown here, the big data analysis step is broken down into a series of processes. You decide which data to actually analyze from the collected big data, and then perform preprocessing before doing the actual analysis. Specifically, you clean up the formatting of the data, convert values, correct errors, and fill in any missing values. Here, the data is shaped through much time and effort. In fact, this preprocessing is said to make up as much as 80% of the total cost of analysis; that is how long it takes. Once the data has gone through preprocessing, you are finally ready to perform the analysis with machine learning and analytical algorithms. Once the results of the analytical algorithms are obtained, you may also need to visualize them, and to examine and interpret the results of the analysis. Then you determine whether the results are valid. In many cases, you cannot get the intended results in one go, so you have to go back to the data selection or preprocessing step and repeat this process, reworking things several times until you obtain valuable results. Only after going through this process can you discover new knowledge or value. Although you would expect that feeding the knowledge or value obtained back to the target organization or community would lead to improvements in that community or organization, in reality, you can only assess whether the community or organization was improved by continuously repeating this series of processes. In some cases, it is important to make continual adjustments to the overall process as appropriate, modifying the data acquired and the methods of evaluation as you go.
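As an illustration of the preprocessing step described above, here is a minimal sketch that cleans up a hypothetical CSV file of sensor readings; the file name, column names, and specific cleaning rules (deduplication, range checks, median imputation, unit conversion) are assumptions for illustration, and the pandas library is assumed to be available.

```python
import pandas as pd

# Hypothetical raw data: readings.csv with columns "timestamp", "temperature", "humidity"
df = pd.read_csv("readings.csv")

# Clean up formatting: parse timestamps, coerce non-numeric values to NaN
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df["temperature"] = pd.to_numeric(df["temperature"], errors="coerce")
df["humidity"] = pd.to_numeric(df["humidity"], errors="coerce")

# Correct obvious errors: drop duplicate rows and physically impossible humidity values
df = df.drop_duplicates()
df = df[df["humidity"].isna() | df["humidity"].between(0, 100)]

# Fill in missing values, here with the median of each column
df["temperature"] = df["temperature"].fillna(df["temperature"].median())
df["humidity"] = df["humidity"].fillna(df["humidity"].median())

# Convert values if the analysis requires it, e.g. Celsius to Fahrenheit
df["temperature_f"] = df["temperature"] * 9 / 5 + 32

print(df.describe())
```

Even in this toy example, most of the code is cleaning and shaping rather than analysis, which reflects the point that preprocessing dominates the cost of real projects.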
Finally, I will sum up this section. Big data is often characterized by the “5 Vs.” The first of these is “Volume,” which refers to the quantity of data and represents the huge amount of data involved. The next is “Velocity,” which refers to speed, and represents the high speed at which data is generated and transmitted over networks. The third V is “Variety,” which refers to diversity: it represents the wide variety of text, numbers, images, videos, and other media data that is generated, transmitted, and stored. After that is “Veracity,” which means accuracy. Big data sometimes includes inaccurate or ambiguous information, so this indicates how important it is to obtain accurate information. The last is “Value,” which represents how much new value the data can provide based on the previous four components.

So far, I have been explaining the concept of big data. Next, I will cover the data systems important for using big data. Data systems are necessary in order to collect, manage, and utilize large amounts of data. Here, I will explain the concepts of IoT (the Internet of Things) and CPS (Cyber-Physical Systems).

As the name says, IoT is the Internet of Things. This is a system in which the things surrounding us are given communication capabilities to enable them to transmit data obtained from sensors and so on over the Internet. Recently, in addition to home appliances like air conditioners and refrigerators, all kinds of things, even athletic shoes, have been equipped with this functionality. As of 2024, the total number of such devices is said to be 17 billion. This is a huge number, and it is growing steadily.

This diagram shows the system IoT uses to process data. Let's start from the bottom up. The lowest layer is IoT devices. IoT devices have sensors and I/O modules, so they can send and receive the data they acquire over networks. On the Internet, there are relay servers that collect the massive amount of data gathered by IoT and other devices. They run applications to process the data depending on its purpose of use. The data stored on the relay servers is then transmitted to big data analysis systems via networks; these days, these systems mostly exist on the cloud. After the final processing is performed there, the results are provided to various applications.

In recent years, there has been discussion of systems called Cyber-Physical Systems (CPS) in contexts similar to IoT systems. So, what are they? CPS are a further evolution of the IoT. They digitalize real-world phenomena using sensors and so on, and transmit the collected data via networks. The transmitted data is analyzed in cyberspace using big data processing, and the information and knowledge obtained as a result are fed back into the real world. By doing so, we seek to solve issues in various industries and communities. Cyber refers to the world of computers, while physical refers to the real world: data gathered from the physical world is analyzed in cyberspace, and the results obtained are then fed back into the physical world, hence the name. Cyberspace can be described as a mirror image reflecting a model of the real world, and because of this, it is also sometimes called a digital twin.
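As a rough illustration of the lowest two layers of this architecture, here is a minimal sketch of an IoT-style device periodically sending a sensor reading to a relay server. The endpoint URL, the JSON field names, and the use of the requests library are all hypothetical choices made for this example; real devices may use lighter protocols such as MQTT.

```python
import time
import random
import requests  # assumed to be installed; any HTTP client would do

RELAY_SERVER_URL = "http://relay.example.com/api/readings"  # hypothetical relay-server endpoint

def read_sensor():
    """Stand-in for reading a real temperature/humidity sensor."""
    return {"temperature": 20.0 + random.random() * 5, "humidity": 40.0 + random.random() * 20}

def main():
    while True:
        payload = {"device_id": "device-001", "timestamp": time.time(), **read_sensor()}
        try:
            # Send one reading to the relay server, which aggregates data from many
            # devices before it is forwarded to the analysis system on the cloud.
            requests.post(RELAY_SERVER_URL, json=payload, timeout=5)
        except requests.RequestException as err:
            print("send failed, will retry next cycle:", err)
        time.sleep(60)  # one reading per minute

if __name__ == "__main__":
    main()
```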
The figure here shows a prototype of a Cyber-Physical System. As you can see, the real world is at the very bottom, and from there, real-world information is digitalized by various sensors. In this system, the data obtained is stored in the system, and analysis is performed in order to understand real-world problems. The results are fed back into the real world, helping to improve government administration and the lives of citizens. As I said earlier, CPS are an evolution of IoT, so they can be understood as a type of IoT. While IoT focuses on data acquisition, systems which also consider data analysis and feedback to the real world can be called Cyber-Physical Systems (CPS).

Next, we will look at the data systems that support IoT and CPS. Let's start with some familiar devices. As I have already explained several times in this lecture, compact devices, sensors, and the like are the entry points for collecting data. We will start with smartphones. Smartphones are equipped with cameras and various sensors for GPS, temperature, acceleration, and so on. Because they run many applications, it is also possible to process data on the smartphone itself, and you can transmit the results to servers over the Internet as well. In addition to smartphones, smartwatches have also become popular recently. Smartwatches are also equipped with GPS and acceleration sensors, but the biggest difference from smartphones is that they can be worn at all times. This enables them to collect bio-signals like heart rate and to monitor this information continuously. As a result, they can monitor lifestyle habits and detect sudden illnesses and the like. Of course, big data analysis is also possible by transmitting the observed results to a server via the network. In addition to these, many compact embedded devices have been developed in recent years. For example, using programmable devices like the Raspberry Pi or Arduino microcontrollers makes it possible to collect data with sensors and process it. The data collected by these devices can be transmitted to servers via the network.

Networks, which we commonly use without thinking about them, also come in several varieties. The first is Bluetooth, a network accessed directly by the devices we use. It supports short-range communication between devices. Because there are various types of devices, it supports various profiles, and its features include low power consumption. For example, it is used to connect things like headphones and smartwatches, so many of you have probably heard of it. There are also networks set up in local environments like homes, workplaces, and schools. These are called local area networks, or LANs. For example, when you connect a PC or smartphone to a network via a LAN, you often use a wireless network called Wi-Fi. By connecting through a Wi-Fi router, you can connect to a broadband network, which is a wide area network. You can also directly connect a smartphone to a wireless broadband network. In this case, you are communicating with the Internet through a mobile communication network like 4G, LTE, or 5G. Among these, 5G in particular is the newest mobile communication standard, and its features include high bandwidth and low latency. Currently, its coverage area is expanding, and it can be accessed in many places.

Now, how is the big data collected via the Internet managed? Database systems are essential for managing and utilizing big data. Of the types of database systems, relational database systems are currently the most commonly used. In relational database systems, data is stored in tabular structures called relations. As you can see in this image, here are two examples of relations: user and message. Relations have multiple attributes; in the case of user, it has three attributes: ID, name, and E-mail. The same applies to message. Users can search the data stored in this way using SQL, a standard query language. For example, a query to show the messages posted by the user named Tsukuba Taro can be processed by an SQL query like the one in this diagram. In a relational database system, even if there is a huge amount of data, applying a special data structure called an index makes it possible to process queries quickly. Another very important point is that relational database systems support transactions. As an example of a transaction, think about a shopping site: when an order is received, you have to create the purchase history and update the inventory simultaneously. If there is a purchase history but the inventory has not decreased, there will be an inconsistency between the inventory and the orders. In a relational database system, such changes to multiple parts of the database which need to be performed together can be grouped into units called transactions. The system guarantees that no matter how complicated a transaction is, it will always be either executed or not executed in its entirety; there is no way that it can be only partially executed and cause an inconsistency. This makes it possible to manage complicated real-world data.
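Since the slide itself is not reproduced here, the following is a minimal sketch of what such a query and such a transaction might look like, using Python's built-in sqlite3 module. The table definitions, column names, and the exact SQL are assumptions modeled on the user/message and shopping-site examples above, not the query shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for illustration
cur = conn.cursor()

# Two relations modeled on the slide: user(id, name, email) and message(id, user_id, text)
cur.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
cur.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT)")
cur.execute("INSERT INTO user VALUES (1, 'Tsukuba Taro', 'taro@example.com')")
cur.execute("INSERT INTO message VALUES (1, 1, 'Hello, big data!')")
conn.commit()

# A query like the one described: show the messages posted by the user named Tsukuba Taro
cur.execute("""
    SELECT message.text
    FROM message JOIN user ON message.user_id = user.id
    WHERE user.name = 'Tsukuba Taro'
""")
print(cur.fetchall())

# A transaction: recording a purchase and decreasing the inventory must happen together.
cur.execute("CREATE TABLE inventory (product TEXT PRIMARY KEY, stock INTEGER)")
cur.execute("CREATE TABLE purchase (id INTEGER PRIMARY KEY, product TEXT)")
cur.execute("INSERT INTO inventory VALUES ('book', 10)")
conn.commit()

try:
    cur.execute("INSERT INTO purchase VALUES (1, 'book')")
    cur.execute("UPDATE inventory SET stock = stock - 1 WHERE product = 'book'")
    conn.commit()    # both changes become visible at once
except sqlite3.Error:
    conn.rollback()  # if anything fails, neither change is applied
```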
Relational databases are incredibly useful, but there are many kinds of big data that are ill-suited to them: for example, data which has a simple structure but must be read and written at very high speed. Relational databases are also not very good at storing data that does not have a fixed structure. To meet these diverse needs, other types of databases have been developed besides relational databases. These are collectively known as NoSQL databases, because they are non-SQL databases. For example, the one shown on the right is a key-value store, which is one of the main types of NoSQL database. A key-value store handles only pairs composed of a key, which is the search key, and the associated value, and it only supports searches for exact matches to a key. In exchange, this not only allows ultra-fast searching of massive data sets, it also supports writing large amounts of data. In addition, document stores for JSON, a type of loosely organized semi-structured data, are also widely used. One point to note about these NoSQL databases is that they either do not support transactions, or the support is quite limited. If complicated transaction processing like the kind I explained on the previous slide is necessary, then you need to be careful.

So, how is the big data stored in a database processed? You may have realized that an ordinary PC is not sufficient for processing big data. For example, consider 1 PB (petabyte) of data. Even if you read it with a storage device capable of reading 100 MB per second, it would take about 115 days to read all of the data. Practical processing is not possible with this kind of hardware. Therefore, parallel distributed processing is used to process big data. Specifically, large systems composed of many PCs, called cluster computers, are prepared. The data is broken down into smaller chunks in advance, and when the data is processed, those small chunks are read by the individual PCs and partially processed. The overall processing is performed by aggregating these partial results. Because the individual pieces of data are processed in parallel, it becomes possible to perform analysis of big data in a realistic time frame.
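Here is a minimal sketch of this chunk-and-aggregate style of processing on a single machine, using Python's multiprocessing module to stand in for the PCs of a cluster. The word-count task and the chunking scheme are illustrative assumptions, not a description of any specific framework.

```python
from multiprocessing import Pool
from collections import Counter

def process_chunk(lines):
    """Partial processing done by one worker (one 'PC' in the cluster):
    count word occurrences in its own chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def split_into_chunks(lines, n_chunks):
    """Break the data into smaller chunks in advance."""
    size = max(1, len(lines) // n_chunks)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

if __name__ == "__main__":
    # Stand-in for a huge dataset; in reality each chunk would live on a different machine.
    data = ["big data needs parallel processing", "parallel processing needs many machines"] * 1000
    chunks = split_into_chunks(data, n_chunks=4)

    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)  # each chunk is processed in parallel

    # Aggregate the partial results into the overall result.
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    print(total.most_common(3))
```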
Finally, let's look at the issues in utilizing big data. While there are various issues in the utilization of big data, these five are the main ones. The first is that data is monopolized by certain companies and organizations. The second is that the storage and analysis costs are immense. The third is that there is a shortage of staff able to utilize big data. The fourth is that data is collected for a particular purpose, which makes it difficult to use for other purposes. And the fifth covers issues of security, privacy, and so on. I will explain these issues in the following slides.

There are a limited number of organizations and companies able to collect and store big data, and the monopolization of data by these companies is a problem. This is why the concept of open data has attracted interest in recent years. In 2013, the G8 nations adopted the Open Data Charter. It holds that the current monopolization of data by governments and businesses represents a loss of opportunity for the people, and that promoting open data can give rise to general-purpose innovations which fulfill the needs of citizens. Open data is described in terms of the following five steps. It starts with releasing data publicly under an open license that allows anyone to use it, regardless of its format. The next step is to release the data in a machine-processable format: for example, releasing documents not only as scanned images, but also in formats like Word or Excel. The third step is to release data in openly accessible formats. An Excel file, for example, requires a program like Excel to open it, but releasing the data in an openly accessible format like CSV instead enables people who do not have Excel to access the data as well. The fourth step is to release the data in RDF format. The fifth and final step is to link the data in RDF format to other data. I will explain RDF data in detail on the next slide.

RDF, which I mentioned on the previous slide, is short for Resource Description Framework, and it refers both to the general-purpose data representation model established by the World Wide Web Consortium (W3C) and to the associated format. In RDF, all kinds of data are represented by triples made up of a subject, a predicate, and an object. For example, the fact that Bob is interested in the Mona Lisa can be represented by a triple in which "Bob" is the subject, "is interested in" is the predicate, and "the Mona Lisa" is the object. In the bottom diagram, the arrow between Bob and the Mona Lisa corresponds to this triple. In the same way, the facts that Bob is a friend of Alice, or that Bob's date of birth is July 14, 1990, are described by other triples. In this way, any information can be described with a graph model; this is sometimes called a knowledge graph. One of the key features of RDF is that, as you can see from this diagram, you can freely link to other data sets, so you can build a web of data. Statistical data from national and local governments, as well as various other information, is already available in RDF format.
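To make the triple representation concrete, here is a minimal sketch that stores the Bob facts above as subject-predicate-object triples in plain Python and answers a simple question by pattern matching; the extra Mona Lisa triple is an added illustrative fact, and real RDF data would normally use IRIs and a dedicated library or triple store rather than plain strings.

```python
# Each fact is one (subject, predicate, object) triple, as in the Bob example.
triples = [
    ("Bob", "is interested in", "the Mona Lisa"),
    ("Bob", "is a friend of", "Alice"),
    ("Bob", "was born on", "1990-07-14"),
    ("the Mona Lisa", "was created by", "Leonardo da Vinci"),
]

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# What is Bob interested in?
print(match(subject="Bob", predicate="is interested in"))

# Because triples can refer to the same resources, facts link together into a graph:
# following "the Mona Lisa" from the first triple leads to another fact about it.
for _, _, interest in match(subject="Bob", predicate="is interested in"):
    print(match(subject=interest, predicate="was created by"))
```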
The next issue is the cost required for storage and analysis. As we know from what we have learned so far, collecting big data requires large-scale data systems, and there are high costs for the storage and the cluster computers used to analyze it. Furthermore, personnel with specialist knowledge are required to maintain, manage, and analyze this data, and their numbers are limited. Only a limited number of organizations and companies have both of these things, so the monopolization of big data has become a problem.

In connection with this, the third point, the shortage of personnel capable of utilizing big data, is also a major issue. Discovering useful information by analyzing big data requires experience and an understanding of the problem domain, in addition to deep knowledge of statistical analysis, data mining, and AI. Personnel who possess this expertise are called data scientists. In addition, experts knowledgeable about computers, networks, and storage are required to operate the large-scale parallel computing systems used to maintain, manage, and analyze big data. These personnel are called infrastructure engineers. The number of people knowledgeable about infrastructure who also have training as data scientists is extremely limited, and they are incredibly valuable right now. How to train this kind of staff is a very important societal issue.

The utilization of data collected for one specific purpose is also a serious issue. Big data is intrinsically collected and stored for a specific purpose. At the same time, using it for applications other than its original purpose and linking or combining it with data it was not originally intended to be combined with is growing more important from the perspective of utilizing data resources. That said, caution is necessary in such cases. For example, using units or a level of granularity which differ from those of the originally collected data requires conversion. If something was originally tallied weekly in dozens, for example, and you want to use daily counts of single units instead, you will need to convert both the units and the time granularity. The accuracy of data may vary as well; in these cases, you discard data that does not meet the required level of accuracy, or a process to supplement the accuracy becomes necessary. There may also be data which lacks required entries, and in such cases, it is handled by supplementing the data with data from other sources, inferring the missing entries, and so on. In other cases as well, it is necessary to take appropriate measures on a case-by-case basis. A variety of research is being conducted on using data outside of its original purpose, and applying the results is expected to enable better utilization of data.

It should go without saying that security and privacy are the highest priorities in utilizing big data. At the same time, how to utilize data which contains personal information while maintaining the anonymity of the individuals is also a very important question. Research and development is being pursued on techniques for this purpose, such as homomorphic encryption and differential privacy. The development of legislation is also progressing. The laws associated with the utilization of big data vary from region to region, so it is important to understand the legal system of the target region before using the data. For example, there is the Amended Act on the Protection of Personal Information in Japan, while in the EU, the General Data Protection Regulation (GDPR) applies. It is important to develop a deeper understanding of these regulations.
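As a small taste of the techniques just mentioned, here is a minimal sketch of the basic idea behind differential privacy for a simple counting query: random noise drawn from a Laplace distribution is added to the true count, so that the published answer reveals very little about any single individual. The records, the query, and the epsilon value are illustrative assumptions.

```python
import random

def laplace_noise(scale):
    """Laplace(0, scale) noise, generated as the difference of two exponential samples."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon=1.0):
    """Answer 'how many records satisfy predicate?' with differential privacy.
    A counting query changes by at most 1 when one person is added or removed
    (sensitivity 1), so Laplace noise with scale 1/epsilon is sufficient."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon)

# Hypothetical personal records
people = [{"age": 34, "smoker": True}, {"age": 51, "smoker": False}, {"age": 29, "smoker": True}]
print(private_count(people, lambda p: p["smoker"], epsilon=0.5))  # noisy, privacy-preserving answer
```

Smaller epsilon values add more noise and give stronger privacy at the cost of accuracy; choosing this trade-off appropriately is part of what the techniques above study.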
In conclusion, I would like to sum up the contents of this lecture. In this lecture, we learned about the basic concept of big data, about the IoT and CPS systems which support a digitalized society, and about the data systems which in turn support big data and IoT/CPS. Lastly, we learned about the issues in utilizing big data. This concludes the lecture. Thank you for watching. —00:42:27