Advanced Data Management and Utilization: Big Data and IoT/CPS 00:00:06— Hello, everyone. I am Toshiyuki Amagasa of the University of Tsukuba Center for Computational Sciences and Artificial Intelligence Research Center. I would like to begin my data science video lecture (Advanced Data Management) on big data and IoT/CPS. Thank you for your time.

I will explain the aims of this lecture. First of all, you will learn about the concept of big data. Next, you will learn about the IoT/CPS systems which support a digitalized society. You will also develop an understanding of the data systems which support these systems. Finally, you will learn about the issues in utilizing big data. With that said, I would like to dive in right away.

First, since it is in the title of the lecture, I will explain what big data is. True to its name, big data refers to massive amounts of data, but just how massive are we talking about? The standard definition is shown here: big data is data so large that it is quite difficult to process using the typical computers and software of the time. Such massive amounts of data are referred to as big data.

So, where is data like this found? While we are usually not very aware of it, the world is overflowing with data. To give some examples of big data, the amount of data generated every minute is shown here. For example, the Google search engine performs 6 million searches per minute. On YouTube, a video streaming site you probably use regularly, 500 hours of video are posted per minute. Similarly, on the well-known social media site Facebook, 150,000 messages are posted per minute, and likewise, there are 350,000 messages posted per minute on the social media site X (formerly known as Twitter). As for the email we use every day, statistics indicate that there are 240 million emails sent per minute around the world. If such a massive amount of data is generated in one minute, then you can imagine what a huge amount of data is generated per day or per year.

This diagram shows the growth in digital data worldwide. This diagram is a little out of date, but the horizontal axis represents time, and the vertical axis represents the amount of data. As you can see, the amount of data generated is growing rapidly. The units on the vertical axis may be unfamiliar, but ZB stands for zettabytes, with 1 ZB equal to 1 trillion GB, so clearly this is an extremely large amount. Under the current circumstances, this huge amount of data is generated daily, and it is increasing at an accelerating rate.

One of the catalysts for generating such an enormous amount of data is the increasing digitalization in every sector. Digitalization refers to loading information from the real world onto computers so that it can be processed and analyzed on computers. That is to say, digitalization has made things which were never treated as data before into things that can be loaded onto a computer and processed. There are various important factors in the advance of digitalization, but the first is the reduction in the size and cost of things like computers and sensors. Another is the ability to transmit data via high-speed wide-area network infrastructure. This makes it possible, for example, to measure previously unmeasured temperature and humidity readings using the sensors in smartphones and the like, so there are now measurements everywhere there are people.
There are security cameras on every street corner, and if you include images taken with the cameras in the smartphones people carry, it is now possible to record videos and images anywhere. Smartphones are equipped with GPS sensors as well, so it is possible to accurately record their location. Also, while you usually may not give it much thought, the act of viewing the web on commonly used web browsers is itself subject to digitalization. We will look at these aspects in more detail later.

Now, let's look at some specific examples of utilizing data. The first is air conditioning in buildings and factories. Large numbers of environmental sensors are installed on the roofs of factories, and they are able to measure factors like the internal temperature and humidity accurately and comprehensively. This makes it possible to optimize the air conditioning layout and output, the airflow, and the placement of people and things inside. As a result, it improves the efficiency of air conditioning and reduces energy costs, while also helping to boost workplace productivity.

I am sure you regularly use smartphones, and this next example is about map apps: Google Maps, for example, or the iOS Maps app if you use an iPhone. You probably know that map apps can superimpose traffic congestion information on the map. Google and Apple are able to collect and analyze smartphone location information in real time to predict congestion conditions in real time, which makes these services possible. Specifically, if devices are moving above a certain speed on the map, you can tell that area is not congested, and conversely, if the devices are moving very slowly or nearly stopped, you can tell that congestion is occurring there. The older VICS was a similar system, which predicted congestion using probe cars and a relatively small number of sensors at intersections and on street corners. In comparison, the services by Google and Apple are able to use the very large amount of smartphone location information, giving them a higher density of information and an overwhelmingly larger coverage area, which enables them to make more accurate congestion predictions.

The next may be a somewhat surprising example, but it relates to viewing various websites with the web browsers we commonly use. Many websites are digitalizing the browsing behavior of users: for example, collecting data on what links they click, what pages they view, whether they move on to the next page and the course of actions leading up to it, their dwell time, and so on. If a page is displayed for a long time, you can tell the user is interested in that page, and if they quickly move on to the next page, you can tell they are not very interested. This data can be obtained by analyzing the web server access logs. The results are important clues in selecting content that will attract more viewers and fine-tuning the layout. This is called web content optimization.

Lastly, I will explain an example of data recommendation. Many of you have seen that shopping sites like Amazon will recommend products with a message saying that people who bought this product also bought X. These product recommendations are known to increase the site's sales significantly if the recommendations are pertinent. For this reason, it is extremely important to give accurate recommendations. In recent years, a technique called collaborative filtering has been used for this. In collaborative filtering, products purchased by a user are recommended to other users with similar purchasing behavior. Alternatively, items that tend to be purchased together with items a user has already bought are recommended to that user. Interestingly, it is possible to give accurate recommendations based solely on the site's purchasing records, without looking up the details of the product or the user at all. Because the recommendations are made by multiple users and products in collaboration, it is called collaborative filtering.
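To make the idea more concrete, here is a minimal sketch of item-based collaborative filtering, assuming a small, made-up table of purchase records; the product names and the choice of cosine similarity over purchase vectors are illustrative assumptions, not a description of any particular site's system.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase records: user -> set of purchased items
purchases = {
    "user1": {"coffee", "mug", "filter"},
    "user2": {"coffee", "filter"},
    "user3": {"coffee", "mug"},
    "user4": {"tea", "mug"},
}

# Build item -> set of users who bought it
buyers = defaultdict(set)
for user, items in purchases.items():
    for item in items:
        buyers[item].add(user)

def item_similarity(a, b):
    """Cosine similarity between two items, based on who bought them."""
    common = len(buyers[a] & buyers[b])
    if common == 0:
        return 0.0
    return common / (sqrt(len(buyers[a])) * sqrt(len(buyers[b])))

def recommend(user, top_n=3):
    """Score items the user has not bought by their similarity to items
    the user has bought ("people who bought this also bought ...")."""
    owned = purchases[user]
    scores = defaultdict(float)
    for candidate in buyers:
        if candidate in owned:
            continue
        for item in owned:
            scores[candidate] += item_similarity(candidate, item)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(recommend("user2"))  # suggests "mug", because coffee and filter buyers also tend to buy mugs
```

Note that the sketch never looks at what the products actually are; the recommendations come purely from the purchase records, which is exactly the point made above.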
There are many more examples of data utilization like these. In many sports, teams analyze the performance of the team and its athletes to devise strategies, make decisions on player positions, and design training plans. In literature, statistical analysis of the frequency with which terms occur in literary works and the like is used to evaluate things such as an author's style or the authenticity of a work. Even in the arts, details like the shape, coloring, and three-dimensional structure of a work of art are analyzed in the same way to characterize the artist's style or to determine the authenticity of the piece. In the field of biology, researchers analyze DNA data obtained from organisms (bioinformatics), as well as behavioral data from GPS sensors and camera sensors attached to animals, which is called biologging. There are many such cases in which data analysis utilizing this kind of digitalization is employed.

One thing that is extremely important when using big data is the loop shown here. Big data analysis is not something you do just once and are finished with. To start with, collecting big data requires systems for continuous acquisition and storage, so first you have to establish these. You then perform analysis on the collected data, but you need to understand that the big data analysis step itself is very difficult and takes a long time. As shown here, the big data analysis step is broken down into a series of processes. You decide which data to actually analyze from the collected big data, and then perform preprocessing before doing the actual analysis. Specifically, you clean up the formatting of the data, convert values, correct errors, and fill in any missing values. Here, the data is shaped through much time and effort. In fact, this preprocessing is said to make up as much as 80% of the total cost of analysis; that is how long it takes. Once the data has gone through preprocessing, you are finally ready to perform the analysis with machine learning and analytical algorithms. Once the results of the analytical algorithms are obtained, you may also need to visualize them, and to examine and interpret the results of the analysis. Then you determine whether the results are valid. In many cases, you cannot get the intended results in one go, so you have to go back to the data selection or preprocessing step and repeat this process, reworking things several times until you obtain valuable results. Only after going through this process can you discover new knowledge or value. Although you would expect that feeding the knowledge or value obtained back to the target organization or community would lead to improvements in that community or organization, in reality, you can only assess whether the community or organization was improved by continuously repeating this series of processes. In some cases, it is important to make continual adjustments to the overall process as appropriate, modifying the data acquired and the methods of evaluation as you go.
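As an illustration of the preprocessing step described above, here is a minimal sketch that cleans up a hypothetical CSV file of sensor readings; the file name, column names, and specific cleaning rules (deduplication, range checks, median imputation, unit conversion) are assumptions for illustration, and the pandas library is assumed to be available.

```python
import pandas as pd

# Hypothetical raw data: readings.csv with columns "timestamp", "temperature", "humidity"
df = pd.read_csv("readings.csv")

# Clean up formatting: parse timestamps, coerce non-numeric values to NaN
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df["temperature"] = pd.to_numeric(df["temperature"], errors="coerce")
df["humidity"] = pd.to_numeric(df["humidity"], errors="coerce")

# Correct obvious errors: drop duplicate rows and physically impossible humidity values
df = df.drop_duplicates()
df = df[df["humidity"].isna() | df["humidity"].between(0, 100)]

# Fill in missing values, here with the median of each column
df["temperature"] = df["temperature"].fillna(df["temperature"].median())
df["humidity"] = df["humidity"].fillna(df["humidity"].median())

# Convert values if the analysis requires it, e.g. Celsius to Fahrenheit
df["temperature_f"] = df["temperature"] * 9 / 5 + 32

print(df.describe())
```

Even in this toy example, most of the code is cleaning and shaping rather than analysis, which reflects the point that preprocessing dominates the cost of real projects.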
Finally, I will sum up this section. Big data is often characterized by the “5 Vs.” The first of these is “Volume,” which refers to the quantity of data and represents the huge amount of data involved. The next is “Velocity,” which refers to speed, and represents the high speed at which data is generated and transmitted over networks. The third V is “Variety,” which refers to diversity: it represents the wide variety of text, numbers, images, videos, and other media data that is generated, transmitted, and stored. After that is “Veracity,” which means accuracy. Big data sometimes includes inaccurate or ambiguous information, so this indicates how important it is to obtain accurate information. The last is “Value,” which represents how much new value the data can provide based on the previous four components.

So far, I have been explaining the concept of big data. Next, I will cover the data systems important for using big data. Data systems are necessary in order to collect, manage, and utilize large amounts of data. Here, I will explain the concepts of IoT (the Internet of Things) and CPS (Cyber-Physical Systems).

As the name says, IoT is the Internet of Things. This is a system in which the things surrounding us are given communication capabilities to enable them to transmit data obtained from sensors and so on over the Internet. Recently, in addition to home appliances like air conditioners and refrigerators, all kinds of things, even athletic shoes, have been equipped with this functionality. As of 2024, the total number of such devices is said to be 17 billion. This is a huge number, and it is growing steadily.

This diagram shows the system IoT uses to process data. Let's start from the bottom up. The lowest layer is IoT devices. IoT devices have sensors and I/O modules, so they can send and receive the data they acquire over networks. On the Internet, there are relay servers that collect the massive amount of data gathered by IoT and other devices. They run applications to process the data depending on its purpose of use. The data stored on the relay servers is then transmitted to big data analysis systems via networks; these days, these systems mostly exist on the cloud. After the final processing is performed there, the results are provided to various applications.

In recent years, there has been discussion of systems called Cyber-Physical Systems (CPS) in contexts similar to IoT systems. So, what are they? CPS are a further evolution of the IoT. They digitalize real-world phenomena using sensors and so on, and transmit the collected data via networks. The transmitted data is analyzed in cyberspace using big data processing, and the information and knowledge obtained as a result are fed back into the real world. By doing so, we seek to solve issues in various industries and communities. Cyber refers to the world of computers, while physical refers to the real world: data gathered from the physical world is analyzed in cyberspace, and the results obtained are then fed back into the physical world, hence the name. Cyberspace can be described as a mirror image reflecting a model of the real world, and because of this, it is also sometimes called a digital twin.
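As a rough illustration of the lowest two layers of this architecture, here is a minimal sketch of an IoT-style device periodically sending a sensor reading to a relay server. The endpoint URL, the JSON field names, and the use of the requests library are all hypothetical choices made for this example; real devices may use lighter protocols such as MQTT.

```python
import time
import random
import requests  # assumed to be installed; any HTTP client would do

RELAY_SERVER_URL = "http://relay.example.com/api/readings"  # hypothetical relay-server endpoint

def read_sensor():
    """Stand-in for reading a real temperature/humidity sensor."""
    return {"temperature": 20.0 + random.random() * 5, "humidity": 40.0 + random.random() * 20}

def main():
    while True:
        payload = {"device_id": "device-001", "timestamp": time.time(), **read_sensor()}
        try:
            # Send one reading to the relay server, which aggregates data from many
            # devices before it is forwarded to the analysis system on the cloud.
            requests.post(RELAY_SERVER_URL, json=payload, timeout=5)
        except requests.RequestException as err:
            print("send failed, will retry next cycle:", err)
        time.sleep(60)  # one reading per minute

if __name__ == "__main__":
    main()
```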
The figure here shows a prototype of a Cyber-Physical System. As you can see, the real world is at the very bottom, and from there, real-world information is digitalized by various sensors. In this system, the data obtained is stored in the system, and analysis is performed in order to understand real-world problems. The results are fed back into the real world, helping to improve government administration and the lives of citizens. As I said earlier, CPS are an evolution of IoT, so they can be understood as a type of IoT. While IoT focuses on data acquisition, systems which also consider data analysis and feedback to the real world can be called Cyber-Physical Systems (CPS).

Next, we will look at the data systems that support IoT and CPS. Let's start with some familiar devices. As I have already explained several times in this lecture, compact devices, sensors, and the like are the entry points for collecting data. We will start with smartphones. Smartphones are equipped with cameras and various sensors for GPS, temperature, acceleration, and so on. Because they run many applications, it is also possible to process data on the smartphone itself, and you can transmit the results to servers over the Internet as well. In addition to smartphones, smartwatches have also become popular recently. Smartwatches are also equipped with GPS and acceleration sensors, but the biggest difference from smartphones is that they can be worn at all times. This enables them to collect bio-signals like heart rate and to monitor this information continuously. As a result, they can monitor lifestyle habits and detect sudden illnesses and the like. Of course, big data analysis is also possible by transmitting the observed results to a server via the network. In addition to these, many compact embedded devices have been developed in recent years. For example, using programmable devices like the Raspberry Pi or Arduino microcontrollers makes it possible to collect data with sensors and process it. The data collected by these devices can be transmitted to servers via the network.

Networks, which we commonly use without thinking about them, also come in several varieties. The first is Bluetooth, a network accessed directly by the devices we use. It supports short-range communication between devices. Because there are various types of devices, it supports various profiles, and its features include low power consumption. For example, it is used to connect things like headphones and smartwatches, so many of you have probably heard of it. There are also networks set up in local environments like homes, workplaces, and schools. These are called local area networks, or LANs. For example, when you connect a PC or smartphone to a network via a LAN, you often use a wireless network called Wi-Fi. By connecting through a Wi-Fi router, you can connect to a broadband network, which is a wide area network. You can also directly connect a smartphone to a wireless broadband network. In this case, you are communicating with the Internet through a mobile communication network like 4G, LTE, or 5G. Among these, 5G in particular is the newest mobile communication standard, and its features include high bandwidth and low latency. Currently, its coverage area is expanding, and it can be accessed in many places.

Now, how is the big data collected via the Internet managed? Database systems are essential for managing and utilizing big data. Of the types of database systems, relational database systems are currently the most commonly used. In relational database systems, data is stored in tabular structures called relations. As you can see in this image, here are two examples of relations: user and message. Relations have multiple attributes; in the case of user, it has three attributes: ID, name, and E-mail. The same applies to message. Users can search the data stored in this way using SQL, a standard query language. For example, a query to show the messages posted by the user named Tsukuba Taro can be processed by an SQL query like the one in this diagram. In a relational database system, even if there is a huge amount of data, applying a special data structure called an index makes it possible to process queries quickly. Another very important point is that relational database systems support transactions. As an example of a transaction, think about a shopping site: when an order is received, you have to create the purchase history and update the inventory simultaneously. If there is a purchase history but the inventory has not decreased, there will be an inconsistency between the inventory and the orders. In a relational database system, such changes to multiple parts of the database which need to be performed together can be grouped into units called transactions. The system guarantees that no matter how complicated a transaction is, it will always be either executed or not executed in its entirety; there is no way that it can be only partially executed and cause an inconsistency. This makes it possible to manage complicated real-world data.
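Since the slide itself is not reproduced here, the following is a minimal sketch of what such a query and such a transaction might look like, using Python's built-in sqlite3 module. The table definitions, column names, and the exact SQL are assumptions modeled on the user/message and shopping-site examples above, not the query shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for illustration
cur = conn.cursor()

# Two relations modeled on the slide: user(id, name, email) and message(id, user_id, text)
cur.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
cur.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT)")
cur.execute("INSERT INTO user VALUES (1, 'Tsukuba Taro', 'taro@example.com')")
cur.execute("INSERT INTO message VALUES (1, 1, 'Hello, big data!')")
conn.commit()

# A query like the one described: show the messages posted by the user named Tsukuba Taro
cur.execute("""
    SELECT message.text
    FROM message JOIN user ON message.user_id = user.id
    WHERE user.name = 'Tsukuba Taro'
""")
print(cur.fetchall())

# A transaction: recording a purchase and decreasing the inventory must happen together.
cur.execute("CREATE TABLE inventory (product TEXT PRIMARY KEY, stock INTEGER)")
cur.execute("CREATE TABLE purchase (id INTEGER PRIMARY KEY, product TEXT)")
cur.execute("INSERT INTO inventory VALUES ('book', 10)")
conn.commit()

try:
    cur.execute("INSERT INTO purchase VALUES (1, 'book')")
    cur.execute("UPDATE inventory SET stock = stock - 1 WHERE product = 'book'")
    conn.commit()    # both changes become visible at once
except sqlite3.Error:
    conn.rollback()  # if anything fails, neither change is applied
```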
Relational databases are incredibly useful, but there are many kinds of big data that are ill-suited to them: for example, data which has a simple structure but must be read and written at very high speed. Relational databases are also not very good at storing data that does not have a fixed structure. To meet these diverse needs, other types of databases have been developed besides relational databases. These are collectively known as NoSQL databases, because they are non-SQL databases. For example, the one shown on the right is a key-value store, which is one of the main types of NoSQL database. A key-value store handles only pairs composed of a key, which is the search key, and the associated value, and it only supports searches for exact matches to a key. In exchange, this not only allows ultra-fast searching of massive data sets, it also supports writing large amounts of data. In addition, document stores for JSON, a type of loosely organized semi-structured data, are also widely used. One point to note about these NoSQL databases is that they either do not support transactions, or the support is quite limited. If complicated transaction processing like the kind I explained on the previous slide is necessary, then you need to be careful.

So, how is the big data stored in a database processed? You may have realized that an ordinary PC is not sufficient for processing big data. For example, consider 1 PB (petabyte) of data. Even if you read it with a storage device capable of reading 100 MB per second, it would take about 115 days to read all of the data. Practical processing is not possible with this kind of hardware. Therefore, parallel distributed processing is used to process big data. Specifically, large systems composed of many PCs, called cluster computers, are prepared. The data is broken down into smaller chunks in advance, and when the data is processed, those small chunks are read by the individual PCs and partially processed. The overall processing is performed by aggregating these partial results. Because the individual pieces of data are processed in parallel, it becomes possible to perform analysis of big data in a realistic time frame.
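Here is a minimal sketch of this chunk-and-aggregate style of processing on a single machine, using Python's multiprocessing module to stand in for the PCs of a cluster. The word-count task and the chunking scheme are illustrative assumptions, not a description of any specific framework.

```python
from multiprocessing import Pool
from collections import Counter

def process_chunk(lines):
    """Partial processing done by one worker (one 'PC' in the cluster):
    count word occurrences in its own chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def split_into_chunks(lines, n_chunks):
    """Break the data into smaller chunks in advance."""
    size = max(1, len(lines) // n_chunks)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

if __name__ == "__main__":
    # Stand-in for a huge dataset; in reality each chunk would live on a different machine.
    data = ["big data needs parallel processing", "parallel processing needs many machines"] * 1000
    chunks = split_into_chunks(data, n_chunks=4)

    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)  # each chunk is processed in parallel

    # Aggregate the partial results into the overall result.
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    print(total.most_common(3))
```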
Finally, let's look at the issues in utilizing big data. While there are various issues in the utilization of big data, these five are the main ones. The first is that data is monopolized by certain companies and organizations. The second is that the storage and analysis costs are immense. The third is that there is a shortage of staff able to utilize big data. The fourth is that data is collected for a particular purpose, which makes it difficult to use for other purposes. And the fifth covers issues of security, privacy, and so on. I will explain these issues in the following slides.

There are a limited number of organizations and companies able to collect and store big data, and the monopolization of data by these companies is a problem. This is why the concept of open data has attracted interest in recent years. In 2013, the G8 nations adopted the Open Data Charter. It holds that the current monopolization of data by governments and businesses represents a loss of opportunity for the people, and that promoting open data can give rise to general-purpose innovations which fulfill the needs of citizens. Open data is described in terms of the following five steps. It starts with releasing data publicly under an open license that allows anyone to use it, regardless of its format. The next step is to release the data in a machine-processable format: for example, releasing documents not only as scanned images, but also in formats like Word or Excel. The third step is to release data in openly accessible formats. An Excel file, for example, requires a program like Excel to open it, but releasing the data in an openly accessible format like CSV instead enables people who do not have Excel to access the data as well. The fourth step is to release the data in RDF format. The fifth and final step is to link the data in RDF format to other data. I will explain RDF data in detail on the next slide.

RDF, which I mentioned on the previous slide, is short for Resource Description Framework, and it refers both to the general-purpose data representation model established by the World Wide Web Consortium (W3C) and to the associated format. In RDF, all kinds of data are represented by triples made up of a subject, a predicate, and an object. For example, the fact that Bob is interested in the Mona Lisa can be represented by a triple in which "Bob" is the subject, "is interested in" is the predicate, and "the Mona Lisa" is the object. In the bottom diagram, the arrow between Bob and the Mona Lisa corresponds to this triple. In the same way, the facts that Bob is a friend of Alice, or that Bob's date of birth is July 14, 1990, are described by other triples. In this way, any information can be described with a graph model; this is sometimes called a knowledge graph. One of the key features of RDF is that, as you can see from this diagram, you can freely link to other data sets, so you can build a web of data. Statistical data from national and local governments, as well as various other information, is already available in RDF format.
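To make the triple representation concrete, here is a minimal sketch that stores the Bob facts above as subject-predicate-object triples in plain Python and answers a simple question by pattern matching; the extra Mona Lisa triple is an added illustrative fact, and real RDF data would normally use IRIs and a dedicated library or triple store rather than plain strings.

```python
# Each fact is one (subject, predicate, object) triple, as in the Bob example.
triples = [
    ("Bob", "is interested in", "the Mona Lisa"),
    ("Bob", "is a friend of", "Alice"),
    ("Bob", "was born on", "1990-07-14"),
    ("the Mona Lisa", "was created by", "Leonardo da Vinci"),
]

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# What is Bob interested in?
print(match(subject="Bob", predicate="is interested in"))

# Because triples can refer to the same resources, facts link together into a graph:
# following "the Mona Lisa" from the first triple leads to another fact about it.
for _, _, interest in match(subject="Bob", predicate="is interested in"):
    print(match(subject=interest, predicate="was created by"))
```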
The next issue is the cost required for storage and analysis. As we know from what we have learned so far, collecting big data requires large-scale data systems, and there are high costs for the storage and the cluster computers used to analyze it. Furthermore, personnel with specialist knowledge are required to maintain, manage, and analyze this data, and their numbers are limited. Only a limited number of organizations and companies have both of these things, so the monopolization of big data has become a problem.

In connection with this, the third point, the shortage of personnel capable of utilizing big data, is also a major issue. Discovering useful information by analyzing big data requires experience and an understanding of the problem domain, in addition to deep knowledge of statistical analysis, data mining, and AI. Personnel who possess this expertise are called data scientists. In addition, experts knowledgeable about computers, networks, and storage are required to operate the large-scale parallel computing systems used to maintain, manage, and analyze big data. These personnel are called infrastructure engineers. The number of people knowledgeable about infrastructure who also have training as data scientists is extremely limited, and they are incredibly valuable right now. How to train this kind of staff is a very important societal issue.

The utilization of data collected for one specific purpose is also a serious issue. Big data is intrinsically collected and stored for a specific purpose. At the same time, using it for applications other than its original purpose and linking or combining it with data it was not originally intended to be combined with is growing more important from the perspective of utilizing data resources. That said, caution is necessary in such cases. For example, using units or a level of granularity which differ from those of the originally collected data requires conversion. If something was originally tallied weekly in dozens, for example, and you want to use daily counts of single units instead, you will need to convert both the units and the time granularity. The accuracy of data may vary as well; in these cases, you discard data that does not meet the required level of accuracy, or a process to supplement the accuracy becomes necessary. There may also be data which lacks required entries, and in such cases, it is handled by supplementing the data with data from other sources, inferring the missing entries, and so on. In other cases as well, it is necessary to take appropriate measures on a case-by-case basis. A variety of research is being conducted on using data outside of its original purpose, and applying the results is expected to enable better utilization of data.

It should go without saying that security and privacy are the highest priorities in utilizing big data. At the same time, how to utilize data which contains personal information while maintaining the anonymity of the individuals is also a very important question. Research and development is being pursued on techniques for this purpose, such as homomorphic encryption and differential privacy. The development of legislation is also progressing. The laws associated with the utilization of big data vary from region to region, so it is important to understand the legal system of the target region before using the data. For example, there is the Amended Act on the Protection of Personal Information in Japan, while in the EU, the General Data Protection Regulation (GDPR) applies. It is important to develop a deeper understanding of these regulations.
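As a small taste of the techniques just mentioned, here is a minimal sketch of the basic idea behind differential privacy for a simple counting query: random noise drawn from a Laplace distribution is added to the true count, so that the published answer reveals very little about any single individual. The records, the query, and the epsilon value are illustrative assumptions.

```python
import random

def laplace_noise(scale):
    """Laplace(0, scale) noise, generated as the difference of two exponential samples."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon=1.0):
    """Answer 'how many records satisfy predicate?' with differential privacy.
    A counting query changes by at most 1 when one person is added or removed
    (sensitivity 1), so Laplace noise with scale 1/epsilon is sufficient."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon)

# Hypothetical personal records
people = [{"age": 34, "smoker": True}, {"age": 51, "smoker": False}, {"age": 29, "smoker": True}]
print(private_count(people, lambda p: p["smoker"], epsilon=0.5))  # noisy, privacy-preserving answer
```

Smaller epsilon values add more noise and give stronger privacy at the cost of accuracy; choosing this trade-off appropriately is part of what the techniques above study.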
In conclusion, I would like to sum up the contents of this lecture. In this lecture, we learned about the basic concept of big data, about the IoT and CPS systems which support a digitalized society, and about the data systems which in turn support big data and IoT/CPS. Lastly, we learned about the issues in utilizing big data. This concludes the lecture. Thank you for watching. —00:42:27