Advanced Data Management and Utilization: Big Data and IoT/CPS

Hello, everyone. I am Toshiyuki Amagasa of the University of Tsukuba Center for Computational Sciences and Artificial Intelligence Research Center. I would like to begin my data science video lecture (Advanced Data Management) on big data and IoT/CPS. Thank you for your time.

I will explain the aims of this lecture. First, you will learn about the concept of big data. Next, you will learn about the IoT/CPS systems which support a digitalized society. You will also develop an understanding of the data systems which support these systems. Finally, you will learn about the issues in utilizing big data.

With that said, I would like to dive in right away.

First, since it is in the title of the lecture, I will explain what big data is. True to its name, big data refers to massive amounts of data, but just how massive are we talking about? The standard definition is shown here: an amount of data so large that it is quite difficult to process using the typical computers and software of the time. Such massive amounts of data are referred to as big data. So, where is data like this found?

While we are usually not very aware of it, the world is overflowing with data. To give some examples of big data, the amount of data generated every minute is shown here. For example, the Google search engine performs 6 million searches per minute. On YouTube, a video streaming site you probably use regularly, 500 hours of video are posted per minute. Similarly, on the well-known social media site Facebook, 150,000 messages are posted per minute, and likewise, there are 350,000 messages posted per minute on the social media site X (formerly known as Twitter). As for the email we use every day, statistics indicate that there are 240 million emails sent per minute around the world. If such a massive amount of data is generated in one minute, then you can imagine what a huge amount of data is generated per day or per year.

This diagram shows the growth in digital data worldwide. The diagram is a little out of date, but the horizontal axis represents time, and the vertical axis represents the amount of data. As you can see, the amount of data generated is growing rapidly. The units on the vertical axis may be unfamiliar: ZB stands for zettabytes, with 1 ZB equal to 1 trillion GB, so clearly this is an extremely large amount. Under the current circumstances, this huge amount of data is generated daily, and it is increasing at an accelerating rate.
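Since the unit may be hard to picture, the conversion behind that claim works out as follows:

1 ZB = 10^21 bytes = 10^12 × 10^9 bytes = 10^12 GB = 1 trillion GB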
One of the catalysts for generating such an enormous amount of data is the increasing digitalization in every sector. Digitalization refers to loading information from the real world onto computers so that it can be processed and analyzed there. That is to say, digitalization has made things which were never treated as data before into things that can be loaded onto a computer and processed. There are various important factors in the advance of digitalization. The first is the reduction in the size and cost of things like computers and sensors. Another is the ability to transmit data via high-speed, wide-area network infrastructure.

This enables, for example, measuring previously unmeasured temperature and humidity readings using the sensors in smartphones and the like, so there are now measurements everywhere there are people. There are security cameras on every street corner, and if you include images taken with the cameras in the smartphones people carry, it is now possible to record videos and images anywhere. Smartphones are equipped with GPS sensors as well, so it is possible to accurately record their location.

Also, while you usually may not give it much thought, the act of viewing the web on commonly used web browsers is itself also subject to digitalization. We will look at this in more detail later.

Now, let's look at some specific examples of utilizing data.

The first is air conditioning in buildings and factories. Large numbers of environmental sensors are installed on the roofs of factories, and they are able to measure factors like the internal temperature and humidity accurately and comprehensively. This makes it possible to optimize the air conditioning layout and output, the airflow, and the placement of people and things inside. As a result, it improves the efficiency of air conditioning and reduces energy costs, while also helping to boost workplace productivity.

I am sure you regularly use smartphones, and this next example is about map apps: Google Maps, for example, or the iOS Maps app if you use an iPhone. You probably know that map apps can superimpose traffic congestion information on the map. Google and Apple are able to collect and analyze smartphone location information in real time to estimate congestion conditions, which is what makes these services possible. Specifically, if devices are moving above a certain speed on the map, you can tell that area is not congested; conversely, if the devices are moving very slowly or are nearly stopped, you can tell that congestion is occurring there.

The older VICS was a similar system: it used probe cars and a relatively small number of sensors installed at street corners and intersections to estimate congestion. In comparison, the services by Google and Apple are able to use the very large amount of smartphone location information, giving them a higher density of information and an overwhelmingly larger coverage area, which enables them to make more accurate congestion predictions.

The next may be a somewhat surprising example, but it relates to viewing various websites with the web browsers we commonly use. Many websites digitalize the browsing behavior of users: what links they click, what pages they view, whether they move on to the next page and the course of actions leading up to it, their dwell time, and so on. If a page is displayed for a long time, you can tell the user is interested in that page; if they quickly move on to the next page, you can tell they are not very interested. This data can be obtained by analyzing web server access logs. The results are important clues for selecting content that will attract more viewers and for fine-tuning the layout. This is called web content optimization.

Lastly, I will explain an example of recommendation. Many of you have seen that most shopping sites like Amazon will recommend products with a message saying that people who bought this product also bought X. These product recommendations are known to increase the site's sales significantly if the recommendations are pertinent. For this reason, it is extremely important to give accurate recommendations. In recent years, sites have been using a technique called collaborative filtering. In collaborative filtering, products purchased by users are recommended to other users with similar purchasing behavior. Alternatively, items are recommended to people who buy other items associated with the same purchasing behavior. Interestingly, it is possible to give accurate recommendations based on the site's purchasing records alone, without looking up the details of the product or the user at all. Because the recommendations arise from multiple users and products in combination, it is called collaborative filtering.
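To make the idea concrete, here is a minimal sketch of the item-to-item variant in Python; the purchase histories, item names, and simple co-purchase counting are toy data and assumptions made for illustration, not taken from any real site.

```python
from collections import defaultdict

# Each user's purchase history (toy data invented for this example).
purchases = {
    "user1": {"camera", "tripod", "sd_card"},
    "user2": {"camera", "sd_card"},
    "user3": {"camera", "tripod"},
    "user4": {"laptop", "mouse"},
}

# Count how often each pair of items is bought by the same user.
co_counts = defaultdict(int)
for items in purchases.values():
    for a in items:
        for b in items:
            if a != b:
                co_counts[(a, b)] += 1

def recommend(item, top_n=3):
    """Items most often co-purchased with `item`:
    'people who bought this product also bought...'."""
    scored = [(b, n) for (a, b), n in co_counts.items() if a == item]
    return sorted(scored, key=lambda x: -x[1])[:top_n]

print(recommend("camera"))  # e.g. [('sd_card', 2), ('tripod', 2)]
```

Real systems weight and normalize these counts and scale the computation to millions of users, but the principle is the same: the recommendations emerge purely from purchasing records, with no knowledge of what the products actually are.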
There are many more examples. In many sports, teams analyze the performance of the team and its athletes to devise strategies, make decisions on player positions, and design training plans. In literature, researchers use the frequency with which terms occur in literary works as an approach to statistical data analysis, evaluating things like an author's style or the authenticity of a work. Even in the arts, details like the shape, coloring, and three-dimensional structure of a work of art are analyzed in the same way to characterize an artist's style or to determine the authenticity of a piece. In the field of biology, researchers analyze DNA data obtained from organisms (the basis of bioinformatics), as well as behavioral data from GPS sensors and camera sensors attached to animals, which is called biologging. There are many cases like these in which data analysis utilizing digitalization is employed.

One thing that is extremely important when using big data is the loop shown here. Big data analysis is not something you do just once and are finished. To start with, collecting big data requires systems for continuous acquisition and storage, so first you have to establish these. Then you perform analysis on the collected data, but you need to understand that the analysis itself is very difficult and takes a long time. As shown here, the big data analysis step breaks down into a series of processes. You decide which data to actually analyze from the collected big data, and then perform preprocessing before doing the actual analysis. Specifically, you clean up the formatting of the data, convert values, correct errors, and fill in any missing values. Here, the data is shaped through much time and effort. In fact, this preprocessing is said to make up as much as 80% of the total cost of analysis. That is how time-consuming this step is.
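As a concrete illustration of what this shaping involves, here is a minimal pandas sketch; the column names, the plausible-range rule, and the mean-filling strategy are all assumptions made for the example.

```python
import pandas as pd

# Toy sensor readings with the kinds of problems described above.
df = pd.DataFrame({
    "temp_c":   [21.4, None, 22.1, 250.0],    # one missing, one implausible
    "humidity": ["45%", "47%", None, "46%"],   # stored as text
})

# Clean up formatting: strip the '%' sign and convert to numbers.
df["humidity"] = df["humidity"].str.rstrip("%").astype(float)

# Correct obvious errors: treat out-of-range temperatures as missing.
df.loc[~df["temp_c"].between(-40, 60), "temp_c"] = None

# Fill in missing values, here simply with each column's mean.
df = df.fillna(df.mean())

print(df)
```

Every one of these rules has to be decided, checked, and revised by a person who understands the data, which is why this step dominates the cost of real analyses.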
Once the data has gone through preprocessing, you are finally ready to perform the analysis with machine learning and analytical algorithms. Once the results of the analytical algorithms are obtained, you may also need to visualize them, as well as examine and interpret the results of the analysis. Then you determine whether the results are valid. In many cases, you cannot get the intended results in one go, so you have to go back to the analysis or preprocessing step and repeat the process, reworking things several times until you get valuable data.

Only after going through this process can you discover new knowledge or valuable data.

Although you would expect that feeding the knowledge or value obtained back to the target organization or community would lead to improvements there, in reality you can only assess whether the community or organization was improved by continuously repeating this series of processes. In some cases, it is important to make continual adjustments to the overall process as appropriate, through an ongoing cycle of modifying the data acquired and the methods of evaluation.

Finally, I will sum up this section.

Big data is often characterized by five Vs, the "5 Vs." The first of these is "Volume." Volume refers to the quantity of data, and this represents the huge amount of data. The next is "Velocity," which refers to the speed of transfer, representing the high speed at which data is generated and transmitted over networks. The third V is "Variety," which refers to diversity: the wide variety of text, numbers, images, videos, and other media data that is generated, transmitted, and stored. After that is "Veracity," which means accuracy. Big data sometimes includes inaccurate or ambiguous information, so this indicates how important it is to obtain accurate information. The last is "Value," which represents how much new value the data can provide based on the previous four components.

So far, I have been explaining the concept of big data. Next, I will cover the data systems important for using big data.

Data systems are necessary in order to collect, manage, and utilize large amounts of data. Here, I will explain the concepts of IoT (the Internet of Things) and CPS (Cyber-Physical Systems). As the name says, IoT is the Internet of things: a system in which the things surrounding us are given communication capabilities to enable them to transmit data obtained from sensors and so on over the Internet. Recently, in addition to home appliances like air conditioners and refrigerators, all kinds of things, even athletic shoes, have been equipped with this functionality. As of 2024, the total number is said to be 17 billion devices. This is a huge number, and it is growing steadily.

This diagram shows the system IoT uses to process data. Let's start from the bottom up. The lowest layer is IoT devices. IoT devices have sensors and I/O modules, so they can send and receive the data they acquire over networks. On the Internet, there are relay servers that collect the massive amount of data gathered by IoT and other devices; they run applications to process the data depending on its purpose of use. The data stored on the relay servers is then transmitted to big data analysis systems via networks. These days, these systems mostly exist on the cloud. After the final processing is performed there, the results are provided to various applications.

In recent years, there has been discussion of systems called Cyber-Physical Systems (CPS) in contexts similar to IoT systems. So, what are they? CPS are a further evolution of the IoT. They digitalize real-world phenomena using sensors and so on, and transmit the collected data via networks. The transmitted data is analyzed in cyberspace using big data processing, and the information and knowledge obtained as a result are fed back into the real world. By doing so, we seek to solve issues in various industries and communities. Cyber refers to the world of computers, while physical refers to the real world: data gathered from the physical world is analyzed in cyberspace, and the results obtained are fed back into the physical world, hence the name. Cyberspace can be described as a mirror image reflecting models of the real world, and because of this, it is also sometimes called a digital twin.

This is a prototype Cyber-Physical System. As you can see, the real world is at the very bottom, and from there, real-world information is digitalized by various sensors. In this system, the data obtained is stored, and analysis is performed in order to understand real-world problems. The results are fed back into the real world, helping to improve government administration and the lives of citizens.

As I said earlier, CPS are an evolution of IoT, so they can be understood as a type of IoT. While IoT focuses on data acquisition, systems which also encompass data analysis and feedback to the real world can be called Cyber-Physical Systems (CPS).

Next, we will look at the data systems that support IoT and CPS. Let's start with some familiar devices.

As I have already explained several times in this lecture, compact devices, sensors, and the like are entry points for collecting data. We will start with smartphones. Smartphones are equipped with cameras and various sensors for GPS, temperature, acceleration, and so on. Running many applications also makes it possible to process data on smartphones, and you can transmit the results to servers over the Internet as well.

In addition to smartphones, smartwatches have also become popular recently. Smartwatches are also equipped with GPS and acceleration sensors, but the biggest difference from smartphones is that they can be worn at all times. This enables them to collect bio-signals like heart rate and to monitor this information continuously. As a result, they can track lifestyle habits and detect sudden illnesses and the like. Of course, big data analysis is also possible by transmitting the observed results to a server via the network.

In addition to these, many compact embedded devices have been developed in recent years. For example, using programmable devices like the Raspberry Pi or Arduino microcontrollers makes it possible to collect data with sensors and process it. The data collected by these devices can then be transmitted to servers via the network.
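As an illustration of this device-side pattern, here is a minimal sketch of a sensing loop that posts readings to a collection server; the sensor function and the server URL are hypothetical stand-ins, not a real device driver or service.

```python
import json
import time
import urllib.request

SERVER_URL = "http://example.com/ingest"  # hypothetical collection endpoint

def read_temperature():
    # Placeholder for a real sensor driver (e.g. a temperature sensor
    # wired to a Raspberry Pi); fixed value so the sketch is runnable.
    return 22.5

while True:
    reading = {"sensor": "temp-01", "celsius": read_temperature(),
               "ts": time.time()}
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(reading).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=5)  # send one reading upstream
    except OSError:
        pass  # a real device would buffer the reading and retry later
    time.sleep(60)  # one reading per minute
```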
Networks, which we commonly use without thinking about them, also come in several varieties. The first is Bluetooth, a network accessed directly by the devices we use. It supports short-range communication between devices. Because there are various types of devices, it supports various profiles, and its features include low power consumption. For example, it is used to connect things like headphones and smartwatches, so many of you have probably heard of it.

There are also networks set up in local environments like homes, workplaces, and schools. These are called local area networks, or LANs. For example, when you connect a PC or smartphone to a LAN, you often use a wireless technology called Wi-Fi. By connecting through a Wi-Fi router, you can connect to a broadband network, which is a wide-area network.

You can also directly connect a smartphone to a wireless broadband network. In this case, you are communicating with the Internet through a mobile communication network like 4G, LTE, or 5G. Among these, 5G in particular is the newest mobile communication network, and its features include high bandwidth and low latency. Currently, its coverage area is expanding, and it can be accessed in many places.

Now, how is the big data collected via the Internet managed? Database systems are essential for managing and utilizing big data. Of the types of database systems, relational database systems are currently the most commonly used.

In relational database systems, data is stored in tabular structures called relations. As you can see in this image, there are two example relations here: user and message. Relations have multiple attributes; in the case of user, there are three attributes: ID, name, and e-mail. The same goes for message. Users can search data stored in this way using SQL, a standard query language. For example, a query to show messages by the user named Tsukuba Taro can be processed by an SQL query like the one in this diagram.

In a relational database system, even if there is a huge amount of data, applying a special data structure called an index makes it possible to process queries quickly.

Another very important point is that relational systems support transactions. As an example of a transaction, think about a shopping site: when an order is received, you have to create the purchase history and update the inventory simultaneously. If there is a purchase history but the inventory has not decreased, there will be an inconsistency between the inventory and the orders. In a relational database system, these changes to multiple parts of the database which need to be performed together can be handled in batch units called transactions. The system guarantees that no matter how complicated a transaction is, it will always be either executed or not executed in its entirety; it can never be only partially executed and cause an inconsistency. This makes it possible to manage complicated real-world data.
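Here is a minimal, self-contained sketch of both points using Python's built-in sqlite3 module; the table and column names are inferred from the slide's example, and the orders/stock tables are invented for the transaction illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The two relations from the slide (schema inferred from the lecture).
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
conn.execute("INSERT INTO user VALUES (1, 'Tsukuba Taro', 'taro@example.com')")
conn.execute("INSERT INTO message VALUES (1, 1, 'Hello!')")

# An index keeps this lookup fast even when the table is huge.
conn.execute("CREATE INDEX idx_message_user ON message(user_id)")

# 'Show the messages posted by the user named Tsukuba Taro.'
rows = conn.execute("""
    SELECT m.body
    FROM message AS m JOIN user AS u ON m.user_id = u.id
    WHERE u.name = 'Tsukuba Taro'
""").fetchall()
print(rows)  # [('Hello!',)]

# A transaction: record an order and update stock as a single unit.
conn.execute("CREATE TABLE orders (item TEXT)")
conn.execute("CREATE TABLE stock (item TEXT, qty INTEGER)")
conn.execute("INSERT INTO stock VALUES ('camera', 10)")
with conn:  # commits both changes on success, rolls back both on error
    conn.execute("INSERT INTO orders VALUES ('camera')")
    conn.execute("UPDATE stock SET qty = qty - 1 WHERE item = 'camera'")
```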
Relational databases are incredibly useful, but there are many kinds of big data that are ill-suited to them: for example, data which has a simple structure and must be read and written at very high rates. They are also not very good at storing data that does not have a fixed structure. To meet these diverse needs, other types of databases were developed besides relational databases. These are collectively known as NoSQL databases, because they are non-SQL databases.

For example, shown on the right is a key-value store, one of the main types of NoSQL database. A key-value store handles only pairs composed of a key, which is the search key, and its associated value, and it supports only searches for exact matches to a key. In exchange, this not only allows ultra-fast searching of massive data sets, it also supports writing large amounts of data.
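Reduced to its essence, the interface is as small as the following sketch; a production key-value store distributes these same two operations across many machines, but nothing richer than exact-match lookup is offered.

```python
# A key-value store reduced to its essence: the only operations are
# put(key, value) and get(key), i.e. exact-match lookup on the key.
store = {}

def put(key, value):
    store[key] = value

def get(key):
    return store.get(key)  # None if the key is absent

put("user:1001", '{"name": "Tsukuba Taro"}')
print(get("user:1001"))  # exact match: found
print(get("user:10"))    # no prefix, range, or join queries: None
```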
In addition, document stores for JSON, a type of semi-structured data that is more loosely organized, are also widely used. One point to note about these NoSQL databases is that they either do not support transactions, or the support is quite limited. If complicated transaction processing like the kind I explained in the previous slide is necessary, then you need to be careful.

So, how is the big data stored in a database processed? You may have realized that an ordinary PC is not sufficient for processing big data. For example, consider 1 PB (petabyte) of data. Even with a storage device capable of reading 100 MB per second, it would take about 115 days to read all of the data. Practical processing is not possible with this kind of hardware. Therefore, parallel distributed processing is used to process big data. Specifically, large computing systems composed of many PCs, called cluster computers, are prepared. The data is broken down into smaller chunks in advance, and when the data is processed, those small chunks are read by each PC and partially processed. The overall result is obtained by aggregating these partial results. Because the individual pieces of data are processed in parallel, it becomes possible to analyze big data in a realistic time frame.
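Here is a minimal single-machine sketch of that split-process-aggregate pattern, using Python's multiprocessing module in place of a real cluster; the data and the word-counting task are toy stand-ins.

```python
from multiprocessing import Pool

def partial_count(chunk):
    """Process one chunk: count the words in this piece of the data."""
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    # Stand-in for a huge dataset already split into chunks across machines.
    data = [f"line number {i} of the big data set" for i in range(100_000)]
    chunks = [data[i::4] for i in range(4)]  # 4 chunks for 4 workers

    with Pool(4) as pool:
        partials = pool.map(partial_count, chunks)  # partial results in parallel

    print(sum(partials))  # aggregate the partial results
```

On a real cluster the chunks live on different machines and the aggregation step may itself be distributed, but the structure of the computation is the same.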
Finally, let's look at the issues in utilizing big data.

While there are various issues in the utilization of big data, these five are the main ones. The first is that data is monopolized by certain companies and organizations. Second, the storage and analysis costs are immense. Third, there is a shortage of staff able to utilize big data. Fourth, there is the difficulty of utilizing data that was collected for a specialized purpose. And fifth, there are issues of security, privacy, and so on. I will explain these issues in the following slides.

There are a limited number of organizations and companies able to collect and store big data, and monopolization by these players is a problem. This is why the concept of open data has attracted interest in recent years. In 2013, the G8 nations adopted the Open Data Charter. It holds that the current monopolization of data by governments and businesses represents a loss of opportunity for the people, and that promoting open data can give rise to general-purpose innovations which fulfill the needs of citizens.

Open data has the five following steps. It starts with releasing data publicly under an open license that allows anyone to use it, regardless of its format. The next step is to release the data in a machine-processable format: for example, releasing documents not only as scanned images, but also in formats like Word or Excel. The third step is to release data in openly accessible formats. An Excel file, for example, requires a program like Excel to access it, but releasing the data in an openly accessible format like CSV enables people who do not have Excel to access the data as well. The fourth step is to release the data in RDF format. The fifth and final step is to link the data in RDF format to other data. I will explain RDF data in detail with the next slide.

RDF, mentioned in the previous slide, is short for Resource Description Framework, and it refers both to the general-purpose data representation model established by the World Wide Web Consortium (W3C) and to the associated format. In RDF, all kinds of data are represented by triples made up of a subject, a predicate, and an object. For example, the fact that Bob is interested in the Mona Lisa can be represented by a triple in which "Bob" is the subject, "is interested in" is the predicate, and "the Mona Lisa" is the object. In the bottom diagram, the arrow between Bob and the Mona Lisa corresponds to this triple. In the same way, the fact that Bob is friends with Alice, or that Bob's date of birth is July 14, 1990, is described by other triples. In this way, any information can be described with a graph model. Such a graph is sometimes called a knowledge graph.

One of the key features of RDF is that, as you can see from this diagram, you can freely post links to other data sets, so you can build a web of data. Statistical data from national and local governments, as well as various other information, is already published in RDF format.
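As a rough illustration of the graph model, here is a sketch that represents the slide's facts as plain (subject, predicate, object) tuples; real RDF identifies resources with URIs and is queried with SPARQL, and the triple linking the Mona Lisa to another data set is invented for the example.

```python
# The slide's facts as (subject, predicate, object) triples.
triples = [
    ("Bob", "is_interested_in", "Mona_Lisa"),
    ("Bob", "is_friends_with", "Alice"),
    ("Bob", "date_of_birth", "1990-07-14"),
    # A hypothetical link into another data set (the "web of data" idea):
    ("Mona_Lisa", "created_by", "Leonardo_da_Vinci"),
]

def about(subject):
    """All facts whose subject matches: a one-hop graph query."""
    return [(p, o) for s, p, o in triples if s == subject]

print(about("Bob"))
# Following a link from one triple's object into another data set:
print(about("Mona_Lisa"))
```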
The next issue is the cost required for storage and analysis. As we know from what we have learned so far, collecting big data requires large-scale data systems, and the storage and cluster computers used to analyze it are expensive. Furthermore, personnel with specialist knowledge are required to maintain, manage, and analyze this data, and their numbers are limited. Only a limited number of organizations and companies have both of these things, which is why the monopolization of big data has become a problem.

In connection with this, the third point, the shortage of personnel capable of utilizing big data, is also a major issue. Discovering useful information by analyzing big data requires experience and an understanding of the problem domain in addition to deep knowledge of statistical analysis, data mining, and AI. Personnel who possess this expertise are called data scientists. In addition, experts knowledgeable about computers, networks, and storage are required to operate the large-scale parallel computing systems that maintain, manage, and analyze big data. Such personnel are called infrastructure engineers. The number of people knowledgeable about infrastructure who also have training as data scientists is extremely limited, and they are incredibly valuable right now. How to train this kind of staff is a very important societal issue.

The utilization of data specialized for a particular purpose is also a serious issue. Big data is intrinsically collected and stored for a specific purpose. At the same time, using it for applications other than its original purpose, and linking or combining it with data it was never intended to be combined with, is growing more important from the perspective of utilizing data resources.

That said, caution is necessary in such cases. For example, using units or a level of granularity which differs from that of the originally collected data requires conversion. If something was originally tallied weekly in dozens, for example, and you want daily counts in single units instead, you will need to convert both the units and the time granularity. The accuracy of data may vary as well. In these cases, you discard data that does not meet the required level of accuracy, or a process of supplementing the accuracy becomes necessary. There may also be data which lacks required entries; in such cases, it is handled by supplementing the data with data from other sources, inferring the missing entries, and so on. In still other cases, it is necessary to take appropriate measures on a case-by-case basis. A variety of research is being conducted on using data outside its original purpose, and applying the results is expected to enable better utilization of data.

It should go without saying that security and privacy are the highest priorities in utilizing big data. At the same time, how to utilize data which contains personal information while maintaining the anonymity of the individuals is also a very important question. Research and development is being pursued on techniques for this purpose, known as homomorphic encryption and differential privacy.

The development of legislation is also progressing. The laws associated with the utilization of big data vary from region to region, so it is important to understand the legal system of the target region before using the data. For example, there is the Amended Act on the Protection of Personal Information in Japan, while in the EU, the General Data Protection Regulation (GDPR) applies. It is important to develop a deeper understanding of these regulations.

In conclusion, I would like to sum up the contents of this lecture.

In this lecture, we learned about the basic concepts of big data, the IoT/CPS systems which support a digitalized society, and the data systems which support big data and IoT/CPS in turn. Lastly, we learned about the issues in utilizing big data. This concludes the lecture. Thank you for watching.