
Big Data and IoT/CPS

    Advanced Data Management and Utilization: Big Data and IoT/CPS
    00:00:06—
    Hello, everyone. I am Toshiyuki Amagasa of the University of Tsukuba Center for
    Computational Sciences and Artificial Intelligence Research Center. I would like
    to begin my data science video lecture (Advanced Data Management) on big data and
    IoT/CPS. Thank you for your time.
    I will explain the aims of this lecture. First of all, you will learn about the
    concept of big data. Next, you will learn about the IoT/CPS systems which support
    a digitalized society. You will also develop an understanding of the data systems
    which support these systems. Finally, you will learn about the issues in utilizing
    big data.
    With that said, I would like to dive in right away.
    First, since it is in the title of the lecture, I will explain what big data is.
    True to its name, big data refers to massive amounts of data, but just how massive
    are we talking about? A standard definition is shown here: it is an amount of data
    that is quite difficult to process using the typical computers and software of its
    time. Such massive amounts of data are referred to as big data.
    So, where is data like this found?
    While we are usually not very aware of it, the world is overflowing with data. To
    give some examples of big data, the amount of data generated every minute is shown
    here. For example, the Google search engine performs 6 million search processes
    per minute. On YouTube, a video streaming site you probably use regularly, 500
    hours of video are posted per minute. Similarly, on the well-known social media
    site Facebook, 150,000 messages are posted per minute, and likewise, there are
    350,000 messages posted per minute on the social media site X (formerly known as
    Twitter). As for the email we use every day, statistics indicate that there are
    240 million emails sent per minute around the world. If such a massive amount of
    data as this is generated in one minute, then you can imagine what a huge amount
    of data is generated per day or per year.
    This diagram shows the growth in digital data worldwide. This diagram is a little
    out of date, but the horizontal axis represents time, and the vertical axis
    represents the amount of data. As you can see, the amount of data generated is
    growing rapidly. The units on the vertical axis may be unfamiliar, but ZB stands
    for zettabytes, with 1ZB equal to 1 trillion GB, so clearly this is an extremely
    large amount. Under the current circumstances, this huge amount of data is
    generated daily, and it is increasing at an accelerating rate.
    One of the catalysts for generating such an enormous amount of data is the
    increasing digitalization in every sector. Digitalization refers to loading
    information from the real world onto computers so that it can be processed and
    analyzed on computers. That is to say, digitalization has made things which were
    never treated as data before into things that can be loaded onto a computer and
    processed. There are various important factors behind the advance of digitalization,
    but the first is the reduction in the size and cost of devices such as computers
    and sensors. Another is the ability to transmit data over high-speed, wide-area
    network infrastructure.
    This makes it possible, for example, to measure temperature and humidity that
    previously went unmeasured using the sensors in smartphones and the like, so
    measurements can now be taken wherever there are people. There are security cameras
    on every street corner, and if you include the images taken with the cameras in the
    smartphones people carry, it is now possible to record videos and images anywhere.
    Smartphones are equipped with GPS sensors as well, so it is possible to accurately
    record their location.
    Also, while you usually may not give it much thought, the act of viewing the web
    on commonly used web browsers itself is also subject to digitalization. We will
    look at these aspects in more detail later.
    Now, let’s look at some specific examples of utilizing data.
    The first is air conditioning in buildings and factories. Large numbers of
    environmental sensors are installed on the roofs of factories, and they are able
    to measure factors like the internal temperature and humidity accurately and
    comprehensively. This makes it possible to optimize the air conditioning layout
    and output, the airflow, and the placement of people and things inside, and as a
    result, it improves the efficiency of air conditioning and reduces energy costs,
    while also helping to boost workplace productivity.
    I am sure you regularly use smartphones, and this next example is about map
    apps: Google Maps, for example, or the iOS Maps app if you use an iPhone. You
    probably know that these map apps can superimpose traffic congestion information
    on the map. Google and Apple collect and analyze smartphone location information
    in real time to predict congestion conditions, which is what makes these services
    possible. Specifically, if devices are moving above a certain speed
    on the map, you can tell that area is not congested, and conversely, if the devices
    are moving very slowly or nearly stopped, you can tell that congestion is occurring
    there.
    The older VICS was a similar system, but it relied on a relatively small number of
    probe cars and sensors installed at intersections to predict congestion. In
    comparison, the services by Google and Apple can use the location information of a
    very large number of smartphones, giving them a higher density of information and
    an overwhelmingly larger coverage area, which enables them to make more accurate
    congestion predictions.
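    To make the speed-based idea concrete, here is a minimal sketch in Python of how
    per-segment average speeds could be turned into a congestion flag. The sample
    records and the 20 km/h threshold are illustrative assumptions for this lecture,
    not parameters of any actual service.

        from collections import defaultdict

        # Hypothetical (road_segment, speed_km_h) readings derived from
        # anonymized smartphone location updates.
        readings = [
            ("route1_north", 45.0), ("route1_north", 52.0),
            ("station_front", 8.0), ("station_front", 3.5), ("station_front", 5.2),
        ]

        speeds = defaultdict(list)
        for segment, speed in readings:
            speeds[segment].append(speed)

        CONGESTION_THRESHOLD_KMH = 20.0  # illustrative cut-off

        for segment, values in speeds.items():
            avg = sum(values) / len(values)
            status = "congested" if avg < CONGESTION_THRESHOLD_KMH else "flowing"
            print(f"{segment}: average {avg:.1f} km/h -> {status}")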
    The next may be a somewhat surprising example, but it relates to viewing various
    websites with the web browsers we commonly use. Many websites are digitalizing the
    browsing behavior of users. For example, collecting data on what links they click,
    what pages they view, whether they move on to the next page and the course of
    actions leading up to it, their dwell time, and so on. If a page is displayed for
    a long time, you can tell they are interested in that page, and if they quickly
    move on to the next page, you can tell they are not very interested. This data can
    be obtained by analyzing the web server access logs. The results are important
    clues in selecting content that will attract more viewers and fine tuning the
    layout. This is called web content optimization.
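    As a rough illustration of how dwell time can be estimated from an access log,
    the sketch below pairs consecutive page views by the same user and treats the time
    between them as the dwell time on the earlier page. The log records and their
    format are assumptions made up for this example; real logs would also need to be
    sorted per user.

        from datetime import datetime

        # Hypothetical access-log records: (user_id, timestamp, page)
        log = [
            ("u1", "2024-05-01 10:00:00", "/top"),
            ("u1", "2024-05-01 10:00:05", "/products"),
            ("u1", "2024-05-01 10:03:40", "/products/42"),
        ]

        fmt = "%Y-%m-%d %H:%M:%S"
        for (user, t1, page), (_, t2, _) in zip(log, log[1:]):
            dwell = (datetime.strptime(t2, fmt) - datetime.strptime(t1, fmt)).total_seconds()
            # A long dwell time suggests interest in the page; a very short one
            # suggests the visitor moved on without reading it.
            print(f"{user} stayed on {page} for about {dwell:.0f} s")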
    Lastly, I will explain an example of data recommendation. Many of you have seen
    that most shopping sites like Amazon will recommend products with a message saying
    that people who bought this product also bought X. These product recommendations
    are known to increase the site’s sales significantly if the recommendations are
    pertinent. For this reason, it is extremely important to give accurate
    recommendations. In recent years, they have been using a technique called
    collaborative filtering. In collaborative filtering, products purchased by a user
    are recommended to other users who show similar purchasing behavior. Alternatively,
    an item is recommended to people who have bought other items that are frequently
    purchased together with it. Interestingly, it is possible to give accurate
    recommendations based solely on the site’s purchase records, without looking up the
    details of the products or the users at all. Because the recommendations are derived
    from many users and products acting, in effect, in collaboration, the technique is
    called collaborative filtering.
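    The following is a minimal item-based collaborative-filtering sketch along the
    lines described above: only purchase histories are used, with no information about
    the products or the users themselves, and items that co-occur in the same users’
    histories are recommended. The purchase data is invented for the example.

        from collections import Counter
        from itertools import combinations

        # Hypothetical purchase histories: user -> set of purchased item IDs.
        purchases = {
            "alice": {"book_a", "book_b", "pen"},
            "bob":   {"book_a", "book_b"},
            "carol": {"book_b", "pen"},
        }

        # Count how often each ordered pair of items is bought by the same user.
        co_counts = Counter()
        for items in purchases.values():
            for a, b in combinations(sorted(items), 2):
                co_counts[(a, b)] += 1
                co_counts[(b, a)] += 1

        def recommend(item, k=2):
            """Return the items most frequently bought together with `item`."""
            scored = [(other, n) for (i, other), n in co_counts.items() if i == item]
            return [other for other, _ in sorted(scored, key=lambda x: -x[1])[:k]]

        print(recommend("book_a"))  # ['book_b', 'pen']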
    There are many more examples of this. In many sports, they analyze the performance
    of the team and athletes to devise strategies, make decisions on player positions,
    and design training plans. In literature, the frequency with which terms occur in
    literary works and the like is used as the basis for statistical analysis to
    evaluate things like an author’s style or the authenticity of a work. Even in the
    arts, details like the shape, coloring, and three-dimensional structure of a work
    of art are analyzed in the same way to characterize the artist’s style or determine
    the authenticity of the piece. In the field of biology, DNA data obtained from
    organisms is analyzed through bioinformatics, along with behavioral data from GPS
    sensors and cameras attached to animals, an approach known as biologging. There are
    many such cases in which data analysis utilizing this kind of digitalization is
    employed.
    One thing that is extremely important when using big data is this loop shown here.
    Big data analysis is not something you do just once and are finished. To start
    with, collecting big data requires systems for continuous acquisition and storage.
    First, you have to establish these. You then perform analysis on the collected
    data, but you need to understand that this big data analysis step is itself
    difficult and time-consuming. As shown here, it breaks down into a series of
    processes. You decide which of the collected data to actually analyze, and then
    perform preprocessing before doing the actual analysis. Specifically, you clean up
    the formatting of the data, convert values, correct errors, and fill in any missing
    values. Shaping the data in this way takes a great deal of time and effort. In
    fact, this preprocessing is said to account for as much as 80% of the total cost of
    analysis. That is how long the processing takes.
    Once it has gone through preprocessing, you are finally ready to perform the
    analysis with machine learning and analytical algorithms. Once the results of the
    algorithms are obtained, you may also need to visualize them and then examine and
    interpret them. Then you determine whether the results are valid. In many cases,
    you cannot get the intended results in one go, so you have to go back to the data
    selection or preprocessing step and repeat the process, reworking things several
    times until you obtain valuable results.
    Only after going through this process can you discover new knowledge or valuable
    data.
    Feeding the knowledge or value obtained back to the target organization or
    community should lead to improvements there, but in reality, you can only assess
    whether it actually did improve by continuously repeating this series of processes.
    In some cases, it is also important to keep adjusting the overall process as you
    go, for example by modifying which data is acquired and how the results are
    evaluated.
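    As a small illustration of the preprocessing step described above, the following
    sketch uses pandas on an invented sensor table to normalize the time format,
    convert units, and fill in a missing value. Real preprocessing pipelines are far
    larger, which is precisely why this step dominates the cost of analysis.

        import pandas as pd

        # Hypothetical raw records: temperatures in Fahrenheit with one missing reading.
        raw = pd.DataFrame({
            "timestamp": ["2024-05-01 10:00", "2024-05-01 10:10", "2024-05-01 10:20"],
            "temp_f": [68.0, None, 71.6],
        })

        df = raw.copy()
        df["timestamp"] = pd.to_datetime(df["timestamp"])   # normalize the time format
        df["temp_c"] = (df["temp_f"] - 32) * 5 / 9           # convert units (F -> C)
        df["temp_c"] = df["temp_c"].interpolate()            # fill in the missing value
        print(df[["timestamp", "temp_c"]])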
    Finally, I will sum up this section.
    Big data is often described in terms of the “5 Vs.” It is characterized by five
    words beginning with V, the first of which is “Volume.” Volume refers to the sheer
    quantity of data. The next is “Velocity,” which refers to the speed of transfer,
    representing how quickly data is generated and transmitted over networks. The third
    V is “Variety,” which refers to diversity: the wide variety of text, numbers,
    images, videos, and other media data that is generated, transmitted, and stored.
    After that is “Veracity,” which means accuracy. Big data sometimes includes
    inaccurate or ambiguous information, so this indicates how important it is to obtain
    accurate information. The last is “Value,” which represents how much new value the
    data can provide based on the previous four components.
    So far, I have been explaining the concept of big data. Next, I will cover the
    data systems important for using big data.
    Data systems are necessary in order to collect, manage, and utilize large amounts
    of data. Here, I will explain the concepts of IoT (the Internet of Things) and CPS
    (Cyber-Physical Systems). As the name says, IoT is the Internet of things. This is
    a system in which the things surrounding us are given communication capabilities
    to enable them to transmit data obtained from sensors and so on over the Internet.
    Recently, in addition to home appliances like air conditioners and refrigerators,
    all kinds of things like athletic shoes have been equipped with this functionality.
    As of 2024, the total number is said to be 17 billion devices. This is a huge
    amount, and it is growing steadily.
    This diagram shows the system IoT uses to process data. Let’s start from the
    bottom up. The lowest layer is the IoT devices. IoT devices have sensors and I/O
    modules, so they can send and receive the data they acquire over networks. On the
    Internet, there are relay servers that collect the massive amounts of data gathered
    by IoT and other devices, and these servers run applications that process the data
    according to its intended use. The data stored on the relay servers is then
    transmitted over networks to big data analysis systems, which these days mostly
    run in the cloud. After the final processing is performed there, the results are
    provided to various applications.
    In recent years, there has been discussion of systems called Cyber-Physical Systems
    (CPS) in similar contexts to IoT systems. So, what are they? CPS are a further
    evolution of the IoT. They digitalize real world phenomena using sensors and so
    on, and transmit the collected data via networks. The transmitted data is analyzed
    in cyberspace using big data processing, and the information and knowledge obtained
    as a result are fed back into the real world. By doing so, we seek to solve issues
    in various industries and communities. Cyber refers to the world of computers,
    while physical refers to the real world, and data gathered from the physical world
    is analyzed in cyberspace. The results obtained are then fed back into the physical
    world, hence the name. Cyberspace can be described as a mirror that reflects a
    model of the real world, which is why it is also sometimes called a digital twin.
    This is a prototype of a Cyber-Physical System. As you can see, the real world
    is at the very bottom, and from there, real-world information is digitalized by
    various sensors. In this system, the data obtained is stored and analyzed in order
    to understand real-world problems. The results are fed back into the real world,
    helping to improve government administration and the lives of citizens.
    Like I said earlier, CPS are an evolution of IoT, so they can be understood as a
    type of IoT. While IoT focuses on data acquisition, those systems which also
    consider data analysis and feedback to the real world can be called Cyber-Physical
    Systems (CPS).
    Next, we will look at the data systems that support IoT and CPS. Let’s start with
    some familiar devices.
    As I have already explained several times in this lecture, compact devices and
    sensors, etc., are entry points for collecting data. We will start with smartphones.
    Smartphones are equipped with cameras and various sensors for GPS, temperature,
    acceleration, and so on. Running many applications also makes it possible to
    process data on smartphones. You can transmit the results to servers over the
    Internet as well.
    In addition to smartphones, smartwatches have also become popular recently.
    Smartwatches are also equipped with GPS and acceleration sensors, but their biggest
    difference from smartphones is that they can be worn at all times. This enables
    them to collect bio-signals like heart rate, and they can continuously monitor
    this information. As a result, they can monitor lifestyle habits and detect sudden
    illnesses and the like. Of course, big data analysis is also possible by
    transmitting the observed results to a server via network.
    In addition to these, many compact embedded devices have been developed in recent
    years. For example, using programmable devices like Raspberry Pi or Arduino
    microcontrollers makes it possible to collect data with sensors and process it.
    The data collected by the devices can be transmitted to servers via network.
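    As an illustration, here is a hedged sketch of the kind of script such a device
    might run: it reads a (simulated) sensor value and posts it to a collection server
    once a minute. The endpoint URL and the sensor-reading function are placeholders;
    a real Raspberry Pi or Arduino setup would use its own sensor driver and transport.

        import json
        import random
        import time
        import urllib.request

        SERVER_URL = "http://example.com/api/readings"  # placeholder endpoint

        def read_temperature():
            """Stand-in for a real sensor driver (e.g. a DHT22 library on a Raspberry Pi)."""
            return 20.0 + random.uniform(-2.0, 2.0)

        while True:
            payload = json.dumps({"device": "sensor-01", "temp_c": read_temperature()}).encode()
            req = urllib.request.Request(SERVER_URL, data=payload,
                                         headers={"Content-Type": "application/json"})
            try:
                urllib.request.urlopen(req, timeout=5)
            except OSError:
                pass  # in practice: buffer locally and retry later
            time.sleep(60)  # one reading per minute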
    Networks, which we commonly use without thinking about it, also come in several
    varieties. The first is Bluetooth, a network accessed directly by the devices we
    use. It supports short-range communication between devices. Because there are many
    kinds of devices, it defines a variety of profiles, and one of its key features is
    low power consumption. For example, it is used to connect things like headphones and
    smartwatches, so many of you have probably heard of it.
    There are also networks set up in local environments like homes, workplaces, and
    schools. These are called local area networks, or LAN. For example, when you
    connect a PC or smartphone, etc., to a network via LAN, you often use a wireless
    network called Wi-Fi. By connecting through a Wi-Fi router, you can connect to a
    broadband network, which is a wide area network.
    You can also directly connect a smartphone to a wireless broadband network. In
    this case, you are communicating with the Internet through a mobile communication
    network like 4G, LTE, or 5G. Among these, 5G in particular is the newest mobile
    communication network, and its features include high bandwidth and low latency.
    Currently, the coverage area is expanding, and it can be accessed in many places.
    Now, how is the big data collected via the Internet managed? Database systems are
    essential for managing and utilizing big data. Of the various types of database
    systems, relational database systems are currently the most commonly used.
    In relational database systems, data is stored in a tabular form called relations.
    As you can see in this image, there are two example relations here, user and
    message. Relations have multiple attributes; in the case of user, it has three
    attributes: ID, name, and E-mail. The same is true for message. Users can search
    data stored in this way using SQL, a standard query language. For example, a query
    to show the messages posted by the user named Tsukuba Taro can be processed with an
    SQL query like the one in this diagram.
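    I cannot reproduce the exact query from the slide here, but a query in that spirit,
    run with Python’s built-in sqlite3 module, might look like the sketch below. The
    column names of the message relation are assumptions made for the example.

        import sqlite3

        con = sqlite3.connect(":memory:")
        con.executescript("""
            CREATE TABLE user(id INTEGER PRIMARY KEY, name TEXT, email TEXT);
            CREATE TABLE message(id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
            INSERT INTO user VALUES (1, 'Tsukuba Taro', 'taro@example.com');
            INSERT INTO message VALUES (10, 1, 'Hello from Tsukuba!');
        """)

        # Messages posted by the user named Tsukuba Taro.
        rows = con.execute("""
            SELECT message.body
            FROM message JOIN user ON message.user_id = user.id
            WHERE user.name = 'Tsukuba Taro'
        """).fetchall()
        print(rows)  # [('Hello from Tsukuba!',)]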
    In a relational database system, even if there is a huge amount of data, applying
    a special data structure called an index makes it possible to process queries
    quickly.
    Another very important point is that relational databases support transactions. As
    an example of a transaction, if you think about a shopping site, you probably know
    that when an order is received, the purchase history must be created and the
    inventory updated at the same time. If there is a purchase history but the
    inventory has not decreased, there will be an inconsistency between the inventory
    and the orders. In a relational database system, such changes to multiple pieces of
    data which need to be performed together can be grouped into units called transactions.
    The system guarantees that no matter how complicated a transaction is, it will
    always be either executed or not executed in its entirety. There is no way that it
    can only be partially executed and cause an inconsistency. This makes it possible
    to manage complicated real-world data.
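    To make the all-or-nothing property concrete, here is a small sketch with sqlite3:
    the order record and the inventory update are wrapped in one transaction, so if
    either statement fails, both are rolled back. Table and column names are invented
    for the example.

        import sqlite3

        con = sqlite3.connect(":memory:")
        con.executescript("""
            CREATE TABLE inventory(item TEXT PRIMARY KEY, stock INTEGER);
            CREATE TABLE orders(id INTEGER PRIMARY KEY, item TEXT, qty INTEGER);
            INSERT INTO inventory VALUES ('widget', 10);
        """)

        try:
            with con:  # one transaction: commits on success, rolls back on any error
                con.execute("INSERT INTO orders(item, qty) VALUES ('widget', 3)")
                con.execute("UPDATE inventory SET stock = stock - 3 WHERE item = 'widget'")
        except sqlite3.Error:
            print("order failed; neither the order nor the stock change was applied")

        print(con.execute("SELECT stock FROM inventory").fetchone())  # (7,)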
    Relational databases are incredibly useful, but there are many kinds of big data
    that are ill-suited to them. One example is data which has a simple structure but
    must be read and written at very high speed. Relational databases are also not very
    good at storing data that does not have a fixed structure. To meet these diverse
    needs, other types of databases were developed besides relational databases. These
    are collectively known as NoSQL (non-SQL) databases.
    For example, the one shown on the right is a key-value store, one of the main
    types of NoSQL database. A key-value store handles only pairs composed of a key,
    which serves as the search key, and its associated value, and it supports only
    exact-match searches on the key. In exchange, this not only allows ultra-fast
    searching of massive data sets, it also supports writing large amounts of data.
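    The interface of a key-value store is essentially that of a very large dictionary,
    as in the sketch below, where an in-memory Python dict stands in for a real
    distributed store such as Redis or DynamoDB: only exact-match get and put, with no
    joins or complex queries.

        # A plain dict standing in for a distributed key-value store.
        store = {}

        def put(key, value):
            store[key] = value

        def get(key):
            # Only exact-match lookups are supported -- no joins, secondary indexes,
            # or complex queries as in a relational database.
            return store.get(key)

        put("user:1001", {"name": "Tsukuba Taro", "last_login": "2024-05-01"})
        print(get("user:1001"))
        print(get("user:9999"))  # None: the key does not exist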
    In addition, document stores for JSON, a type of semi-structured data that is more
    loosely organized, are also widely used. One point to note about these NoSQL
    databases is that they either do not support transactions, or the support is quite
    limited. If complicated transaction processing like the kind I explained in the
    previous slide is necessary, then you need to be careful.
    So, how is the big data stored in a database processed? You may have realized that
    an ordinary PC is not sufficient for processing big data. For example, consider
    1PB (petabyte) of data. Even if you read it with a storage device capable of
    reading 100MB per second, it would end up taking 115 days to read all of the data.
    Practical processing is not possible with this kind of hardware. Therefore,
    parallel distributed processing is used to process big data. Specifically, large
    systems composed of many PCs, called cluster computers, are prepared. The data is
    broken down into smaller chunks in advance, and when the data is processed, those
    small chunks are read by the individual PCs and partially processed. The overall
    result is obtained by aggregating these partial results. Because the individual
    pieces of data are processed in parallel, it becomes possible to analyze big data
    in a realistic time frame.
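    A minimal single-machine analogue of this split, process, and aggregate pattern is
    sketched below with Python’s multiprocessing module: the data is divided into
    chunks, each worker computes a partial result, and the partial results are combined.
    A real cluster framework distributes the chunks across many machines, but the shape
    of the computation is the same.

        from multiprocessing import Pool

        def partial_sum(chunk):
            """Each worker processes only its own chunk of the data."""
            return sum(chunk)

        if __name__ == "__main__":
            data = list(range(1_000_000))                # stand-in for a huge data set
            chunks = [data[i::4] for i in range(4)]      # split into 4 chunks in advance
            with Pool(processes=4) as pool:
                partials = pool.map(partial_sum, chunks) # partial processing in parallel
            total = sum(partials)                        # aggregate the partial results
            print(total)  # 499999500000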
    Finally, let’s look at the issues in utilizing big data.
    While there are various issues in the utilization of big data, these five are the
    main ones. The first is that data is monopolized by certain companies and
    organizations. Second, the storage and analysis costs are immense. Third, there is
    a shortage of personnel able to utilize big data. Fourth, there is the difficulty
    of utilizing data that was collected for one specialized purpose in other contexts.
    And finally, there are issues of security, privacy, and so on. I will explain these
    issues in the following slides.
    There are a limited number of organizations and companies able to collect and store
    big data, and the monopolization by these companies is a problem. This is why the
    concept of open data has attracted interest in recent years. In 2013, the G8
    nations adopted the Open Data Charter. It holds that the current monopolization of
    data by governments and businesses presents a loss of opportunity for the people,
    and promoting open data can give rise to general purpose innovations which fulfill
    the needs of citizens.
    Open data has the five following steps. It starts with releasing data publicly
    with an open license that allows anyone to use it regardless of its format. The
    next step is to release the data in a machine-processable format: for example,
    releasing documents not only as scanned images, but also in formats like Word
    or Excel. The third step is to release data in formats that can be accessed openly.
    An Excel file, for example, requires a program like Excel to access the file, but
    releasing the data in an openly accessible format like CSV instead enables people
    who do not have Excel to access the data as well. The fourth step is to release
    the data in RDF format. The fifth and final step is to link the data in RDF format
    to other data. I will explain RDF data in detail with the next slide.
    RDF, which I mentioned on the previous slide, is short for Resource Description
    Framework, and it refers both to the general-purpose data representation model
    established by the World Wide Web Consortium (W3C) and to the associated formats.
    In RDF, all kinds of data are represented by triples made up of a subject, a
    predicate, and an object. For example, the fact that Bob is interested in the Mona
    Lisa can be represented by a triple in which “Bob” is the subject, “is interested
    in” is the predicate, and “the Mona Lisa” is the object. In the bottom diagram,
    the arrow between Bob and the Mona Lisa corresponds to this triple. In the same
    way, the fact that Bob is friends with Alice, or that Bob’s date of birth is July
    14, 1990, is described by another triple. In this way, any information can be
    described with a graph model, which is sometimes called a knowledge graph.
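    As a sketch of what such triples look like in practice, the following uses the
    rdflib Python library (which must be installed separately) to build the three
    statements just described. The URIs and predicate names are invented example
    identifiers, not ones taken from an actual data set.

        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import XSD

        EX = Namespace("http://example.org/")
        g = Graph()

        bob = URIRef("http://example.org/bob")
        alice = URIRef("http://example.org/alice")
        mona_lisa = URIRef("http://example.org/mona_lisa")

        # Each statement is one (subject, predicate, object) triple.
        g.add((bob, EX.isInterestedIn, mona_lisa))   # Bob is interested in the Mona Lisa
        g.add((bob, EX.isFriendsWith, alice))        # Bob is friends with Alice
        g.add((bob, EX.birthDate, Literal("1990-07-14", datatype=XSD.date)))

        print(g.serialize(format="turtle"))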
    One of the key features of RDF is that, as you can see from this diagram, you can
    freely create links to other data sets, which makes it possible to build a web of
    data. Statistical data from national and local governments, as well as various
    other information, is already published in RDF format.
    The next issue is the cost required for storage and analysis. As we know from what
    we have learned so far, collecting big data requires large-scale data systems, and
    there are high costs for storage and the cluster computers, etc., used to analyze
    it. Furthermore, personnel with specialist knowledge are required to maintain,
    manage, and analyze this data, and their numbers are limited. Only a limited number
    of organizations and companies have both of these things, so the monopolization of
    big data has become a problem.
    In connection with this, the third point, the shortage of personnel capable of
    utilizing big data, is also a major issue. Discovering useful information by
    analyzing big data requires experience and understanding of the problem space in
    addition to deep knowledge of statistical analysis, data mining, and AI. Personnel
    who possess this expertise are called data scientists, and experts knowledgeable
    about computers, networks, and storage are required to operate the large-scale
    parallel computing systems to maintain, manage, and analyze big data. This type of
    personnel is called infrastructure engineers. The number of people knowledgeable
    about infrastructure who also have training as a data scientist is extremely
    limited, and they are incredibly valuable right now. How to train this kind of
    staff is a very important societal issue.
    The utilization of data specialized for a particular purpose is also a serious
    issue. Big data is intrinsically collected and stored for a specific purpose. At
    the same time, using it for applications other than its original purpose and
    linking it or combining it with unintended data is also growing more important
    from the perspective of utilizing data resources.
    That said, caution is necessary in such cases. For example, using units or a level
    of granularity which differs from that of the original collected data requires
    conversion. If something was originally tallied weekly in dozens, for example, and
    you want to use daily increments and single units instead, you will need to
    convert both the units and the time granularity. The accuracy of data may vary as
    well. In such cases, you either discard data that does not meet the required level
    of accuracy, or a process to supplement or correct its accuracy becomes necessary.
    There may also be data that does not contain required entries; in such cases, it is
    handled by supplementing the data with data from other sources, inferring the
    missing entries, and so on. In these and other cases, it is necessary to take
    appropriate measures on a case-by-case basis. A variety of research is being
    conducted on using data
    outside of its intended purpose, and applying the results is expected to enable
    better utilization of data.
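    As a small illustration of the kind of conversion just described, the sketch below
    takes invented weekly totals counted in dozens and converts them to approximate
    daily counts in single units using pandas. Spreading a weekly total evenly over
    seven days is itself an assumption that would have to be justified in a real
    analysis.

        import pandas as pd

        # Hypothetical source data: weekly totals, counted in dozens.
        weekly = pd.DataFrame(
            {"dozens": [10, 12]},
            index=pd.to_datetime(["2024-05-06", "2024-05-13"]),  # week start dates
        )

        units_per_week = weekly["dozens"] * 12        # dozens -> single units
        units_per_day = units_per_week / 7            # naive even split over 7 days
        daily = units_per_day.resample("D").ffill()   # expand to daily granularity
        print(daily.head(8))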
    It should go without saying that security and privacy are the highest priorities
    in utilizing big data. At the same time, how to utilize data which contains personal
    information while maintaining the anonymity of the individuals is also a very
    important question. Research and development is being pursued on techniques for
    this purpose, such as homomorphic encryption and differential privacy.
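    Of the techniques mentioned, differential privacy is perhaps the easiest to
    illustrate: noise calibrated to a query’s sensitivity is added to an aggregate
    result so that no single individual’s contribution can be identified. The sketch
    below is a minimal Laplace-mechanism example for a counting query; the data and
    the epsilon value are purely illustrative.

        import random

        def dp_count(values, predicate, epsilon=1.0):
            """Counting query with Laplace noise; the sensitivity of a count is 1."""
            true_count = sum(1 for v in values if predicate(v))
            # The difference of two exponential samples follows a Laplace(0, 1/epsilon) distribution.
            noise = random.expovariate(epsilon) - random.expovariate(epsilon)
            return true_count + noise

        ages = [23, 45, 31, 52, 67, 29, 41]  # invented personal data
        print(dp_count(ages, lambda a: a >= 40))  # noisy count of people aged 40 or over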
    The development of legislation is also progressing. The laws associated with the
    utilization of big data may vary from region to region, so it is important to
    understand the legal system of the target region before using the data. For example,
    there is the Amended Act on the Protection of Personal Information in Japan, while
    in the EU, the General Data Protection Regulation (GDPR) is applied. It is important
    to develop a deeper understanding of these regulations.
    In conclusion, I would like to sum up the contents of this lecture.
    In this lecture, we learned about the basic concept of big data, the IoT and CPS
    systems that support a digitalized society, and the data systems that in turn
    support big data and IoT/CPS. Lastly, we learned about the issues in utilizing
    big data. This concludes the lecture. Thank you for watching.
    —00:42:27