Advanced Data Management and Utilization: Big Data and IoT/CPS

00:00:06—

Hello, everyone. I am Toshiyuki Amagasa of the Center for Computational Sciences and the Artificial Intelligence Research Center at the University of Tsukuba. I would like to begin my data science video lecture (Advanced Data Management) on big data and IoT/CPS. Thank you for your time.

I will explain the aims of this lecture. First, you will learn about the concept of big data. Next, you will learn about the IoT/CPS systems which support a digitalized society. You will also develop an understanding of the data systems which support these systems. Finally, you will learn about the issues in utilizing big data.

With that said, I would like to dive in right away.

First, since it is in the title of the lecture, I will explain what big data is. True to its name, big data refers to massive amounts of data, but just how massive are we talking about? The standard definition is shown here: an amount of data so large that it is quite difficult to process using the typical computers and software of the time. Such massive amounts of data are referred to as big data. So, where is data like this found?

While we are usually not very aware of it, the world is overflowing with data. To give some examples of big data, the amount of data generated every minute is shown here. For example, the Google search engine performs 6 million searches per minute. On YouTube, a video streaming site you probably use regularly, 500 hours of video are posted per minute. Similarly, on the well-known social media site Facebook, 150,000 messages are posted per minute, and likewise, there are 350,000 messages posted per minute on the social media site X (formerly known as Twitter). As for the email we use every day, statistics indicate that 240 million emails are sent per minute around the world. If such a massive amount of data is generated in one minute, you can imagine what a huge amount is generated per day or per year.

This diagram shows the growth in digital data worldwide. The diagram is a little out of date, but the horizontal axis represents time, and the vertical axis represents the amount of data. As you can see, the amount of data generated is growing rapidly. The units on the vertical axis may be unfamiliar: ZB stands for zettabyte, and 1 ZB equals 1 trillion GB, so clearly this is an extremely large amount. Under the current circumstances, this huge amount of data is generated daily, and it is increasing at an accelerating rate.

One of the catalysts for generating such an enormous amount of data is the increasing digitalization in every sector. Digitalization refers to loading information from the real world onto computers so that it can be processed and analyzed there. That is to say, digitalization has made things which were never treated as data before into things that can be loaded onto a computer and processed. There are various important factors in the advance of digitalization, but the first is the reduction in the size and cost of things like computers and sensors.
Another is the ability to transmit data via high-speed wide-area network infrastructure.

This enables, for example, measuring previously unmeasured temperature and humidity readings using the sensors in smartphones and the like, so there are now measurements wherever there are people. There are security cameras on every street corner, and if you include images taken with the cameras in the smartphones people carry, it is now possible to record videos and images anywhere. Smartphones are equipped with GPS sensors as well, so it is possible to accurately record their location.

Also, while you usually may not give it much thought, the act of viewing the web on commonly used web browsers is itself subject to digitalization. We will look at these aspects in more detail later.

Now, let's look at some specific examples of utilizing data.

The first is air conditioning in buildings and factories. Large numbers of environmental sensors are installed on the roofs of factories, and they are able to measure factors like the internal temperature and humidity accurately and comprehensively. This makes it possible to optimize the air-conditioning layout and output, the airflow, and the placement of people and things inside. As a result, it improves the efficiency of air conditioning and reduces energy costs, while also helping to boost workplace productivity.

I am sure you regularly use smartphones, and this next example is about map apps: Google Maps, for example, or the iOS Maps app if you use an iPhone. You probably know that map apps can superimpose traffic congestion information on the map. Google and Apple are able to collect and analyze smartphone location information in real time to predict congestion conditions, which is what makes these services possible. Specifically, if devices are moving above a certain speed on the map, you can tell that area is not congested; conversely, if the devices are moving very slowly or are nearly stopped, you can tell that congestion is occurring there.

The old VICS was a similar system, but it predicted congestion using probe cars and a small number of sensors at intersections and street corners. In comparison, the services by Google and Apple are able to use a very large amount of smartphone location information, giving them a higher density of information and an overwhelmingly larger coverage area, which enables them to make more accurate congestion predictions.

The next may be a somewhat surprising example, but it relates to viewing various websites with the web browsers we commonly use. Many websites digitalize the browsing behavior of their users: what links they click, what pages they view, whether they move on to the next page and the course of actions leading up to it, their dwell time, and so on. If a page is displayed for a long time, you can tell the user is interested in that page; if they quickly move on to the next page, you can tell they are not very interested. This data can be obtained by analyzing web server access logs. The results are important clues for selecting content that will attract more viewers and for fine-tuning the layout. This is called web content optimization.

Lastly, I will explain an example of product recommendation. Many of you have seen that shopping sites like Amazon will recommend products with a message saying that people who bought this product also bought X. These product recommendations are known to increase a site's sales significantly if the recommendations are pertinent, so it is extremely important to give accurate recommendations. In recent years, sites have been using a technique called collaborative filtering. In collaborative filtering, products purchased by users are recommended to other users with similar purchasing behavior; alternatively, items are recommended to people who buy other items associated with the same purchasing behavior. Interestingly, it is possible to give accurate recommendations based only on the site's purchase records, without looking up the details of the products or the users at all. Because the recommendations arise from multiple users and products in collaboration, it is called collaborative filtering.
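To make this concrete, here is a minimal, illustrative sketch of user-based collaborative filtering in Python. The purchase data, similarity measure, and function names are hypothetical and not from the lecture; real systems work at a far larger scale.

```python
from collections import Counter

# Each user is represented by the set of items they have purchased.
purchases = {
    "alice": {"camera", "tripod", "sd_card"},
    "bob":   {"camera", "sd_card", "lens"},
    "carol": {"novel", "bookmark"},
}

def jaccard(a, b):
    # Similarity of two purchase sets: size of overlap over size of union.
    return len(a & b) / len(a | b)

def recommend(target, k=2):
    # Rank the other users by how similar their purchases are to the target's.
    neighbors = sorted((u for u in purchases if u != target),
                       key=lambda u: jaccard(purchases[target], purchases[u]),
                       reverse=True)[:k]
    scores = Counter()
    for u in neighbors:
        sim = jaccard(purchases[target], purchases[u])
        if sim == 0:
            continue  # skip users with no overlap at all
        for item in purchases[u] - purchases[target]:
            scores[item] += sim  # items from more similar users score higher
    return [item for item, _ in scores.most_common()]

print(recommend("alice"))  # ['lens']: bob buys similar things and bought a lens
```

Note that nothing here inspects what the products actually are; the recommendation comes purely from overlapping purchase records, exactly as described above.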
There are many more examples of this. In many sports, they analyze the performance of the team and its athletes to devise strategies, decide player positions, and design training plans. In literature, the frequency with which terms occur in literary works and the like is used as the basis for statistical analysis to evaluate things like an author's style or the authenticity of a work. Even in the arts, details like the shape, coloring, and three-dimensional structure of a work of art are analyzed in the same way to characterize the artist's style or determine the authenticity of the piece. In the field of biology, they analyze DNA data obtained from organisms (the foundation of bioinformatics), as well as behavioral data from GPS sensors and camera sensors attached to animals, which is called biologging. There are many cases like these in which data analysis utilizing this kind of digitalization is employed.

One thing that is extremely important when using big data is the loop shown here. Big data analysis is not something you do just once and are finished. To start with, collecting big data requires systems for continuous acquisition and storage, so first you have to establish these. Then you perform analysis on the collected data, but you need to understand that the big data analysis itself is very difficult at this stage and takes a long time. As shown here, the big data analysis step breaks down into a series of processes. You decide which data to actually analyze from the collected big data, and then perform preprocessing before doing the actual analysis. Specifically, you clean up the formatting of the data, convert values, correct errors, and fill in any missing values. Here, the data is shaped through much time and effort. In fact, this preprocessing is said to make up as much as 80% of the total cost of analysis; that is how much work it takes.
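As a small, hedged illustration of the preprocessing just described, here is a sketch using pandas. The column names, sentinel value, and correction rules are all hypothetical; real pipelines are far more involved.

```python
import pandas as pd

# Hypothetical raw sensor readings: mixed formats, an error value, a gap.
raw = pd.DataFrame({
    "time": ["2024-04-01 09:00", "2024-04-01 09:10",
             "2024-04-01 09:20", "2024-04-01 09:30"],
    "temp": ["21.5", "-999", None, "23.0"],   # strings, sentinel, missing
})

df = raw.copy()
df["time"] = pd.to_datetime(df["time"])           # clean up the formatting
df["temp"] = pd.to_numeric(df["temp"])            # convert values to numbers
df.loc[df["temp"] < -100, "temp"] = float("nan")  # correct an obvious error
df["temp"] = df["temp"].interpolate()             # fill in the missing values

print(df)  # temp becomes 21.5, 22.0, 22.5, 23.0
```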
Once the data has gone through preprocessing, you are finally ready to perform the analysis with machine learning and analytical algorithms. Once the results of the analytical algorithms are obtained, you may also need to visualize them, and then examine and interpret them. Then you determine whether the results are valid. In many cases, you cannot get the intended results in one go, so you have to go back to the data selection or preprocessing step and repeat the process, reworking things several times until you get valuable results. Only after going through this process can you discover new knowledge or valuable data.

Although you would expect feeding the knowledge or value obtained back to the target organization or community to lead to improvements there, in reality you can only assess whether the community or organization has improved by continuously repeating this series of processes. In some cases, it is important to keep adjusting the overall process as appropriate, continually revising the data acquired and the methods of evaluation.

Finally, I will sum up this section.

Big data is often characterized by the five Vs. The first of these is “Volume.” Volume refers to the quantity of data, representing the huge amount of data involved. The next is “Velocity,” which refers to the speed of transfer, representing the high speed at which data is generated and transmitted over networks. The third V is “Variety,” which refers to diversity: the wide variety of text, numbers, images, videos, and other media data that is generated, transmitted, and stored. After that is “Veracity,” which means accuracy. Big data sometimes includes inaccurate or ambiguous information, so this indicates how important it is to obtain accurate information. The last is “Value,” which represents how much new value the data can provide, building on the previous four components.

So far, I have been explaining the concept of big data. Next, I will cover the data systems important for using big data.

Data systems are necessary in order to collect, manage, and utilize large amounts of data. Here, I will explain the concepts of IoT (the Internet of Things) and CPS (Cyber-Physical Systems). As the name says, IoT is the Internet of things: a system in which the things around us are given communication capabilities so that they can transmit data obtained from sensors and the like over the Internet. Recently, in addition to home appliances like air conditioners and refrigerators, all kinds of things, even athletic shoes, have been equipped with this functionality. As of 2024, the total is said to be 17 billion devices. This is a huge number, and it is growing steadily.

This diagram shows the system IoT uses to process data. Let's start from the bottom up. The lowest layer is IoT devices. IoT devices have sensors and I/O modules, so they can send and receive the data they acquire over networks. On the Internet, there are relay servers that collect the massive amount of data gathered by IoT and other devices; they run applications that process the data depending on its purpose of use. The data stored on the relay servers is then transmitted via networks to big data analysis systems, which these days mostly exist in the cloud. After the final processing is performed there, the data is provided to various applications.
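As a minimal sketch of the lowest layer of this architecture, here is what an IoT device posting a sensor reading to a relay server might look like in Python. The endpoint URL, payload fields, and the read_sensor stand-in are assumptions for illustration, not part of the lecture.

```python
import json
import time
import urllib.request

# Hypothetical relay-server endpoint that accepts JSON sensor readings.
RELAY_URL = "http://relay.example.org/readings"

def read_sensor():
    # Stand-in for a real sensor driver (e.g., a temperature sensor wired
    # to a Raspberry Pi); here it just returns a fixed reading.
    return {"device_id": "dev-001", "temp_c": 21.5, "ts": time.time()}

def send_reading(reading):
    # POST the reading to the relay server as JSON.
    req = urllib.request.Request(
        RELAY_URL,
        data=json.dumps(reading).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    send_reading(read_sensor())
```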
In recent years, there has been discussion of systems called Cyber-Physical Systems (CPS) in contexts similar to IoT systems. So, what are they? CPS are a further evolution of the IoT. They digitalize real-world phenomena using sensors and so on, and transmit the collected data via networks. The transmitted data is analyzed in cyberspace using big data processing, and the information and knowledge obtained as a result are fed back into the real world. By doing so, we seek to solve issues in various industries and communities. “Cyber” refers to the world of computers, while “physical” refers to the real world: data gathered from the physical world is analyzed in cyberspace, and the results are fed back into the physical world, hence the name. Cyberspace can be described as a mirror image reflecting a model of the real world, and because of this it is also sometimes called a digital twin.

This is a prototype of a Cyber-Physical System. As you can see, the real world is at the very bottom, and from there, real-world information is digitalized by various sensors. In this system, the data obtained is stored, and analysis is performed in order to understand real-world problems. The results are fed back into the real world, helping to improve government administration and the lives of citizens.

Like I said earlier, CPS are an evolution of IoT, so they can be understood as a type of IoT. While IoT focuses on data acquisition, systems which also encompass data analysis and feedback to the real world can be called Cyber-Physical Systems.

Next, we will look at the data systems that support IoT and CPS. Let's start with some familiar devices.

As I have already explained several times in this lecture, compact devices, sensors, and the like are the entry points for collecting data. We will start with smartphones. Smartphones are equipped with cameras and various sensors for GPS, temperature, acceleration, and so on. Since they run many applications, it is also possible to process data on the smartphone itself, and you can transmit the results to servers over the Internet as well.

In addition to smartphones, smartwatches have also become popular recently. Smartwatches are likewise equipped with GPS and acceleration sensors, but the biggest difference from smartphones is that they can be worn at all times. This enables them to collect bio-signals like heart rate and monitor this information continuously. As a result, they can track lifestyle habits and detect sudden illnesses and the like. Of course, big data analysis is also possible by transmitting the observed results to a server over a network.
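As a toy illustration of such continuous monitoring, the following sketch flags heart-rate readings that deviate sharply from the recent average. The window size, tolerance, and readings are made up; real devices use far more careful signal processing.

```python
from collections import deque

def monitor(readings, window=5, tolerance=30):
    # Keep a sliding window of recent readings and flag large deviations.
    recent = deque(maxlen=window)
    for bpm in readings:
        if recent and abs(bpm - sum(recent) / len(recent)) > tolerance:
            print(f"alert: {bpm} bpm deviates from the recent average")
        recent.append(bpm)

# Simulated stream: resting readings with one sudden spike.
monitor([62, 64, 61, 63, 130, 65])  # alerts on 130
```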
In addition to these, many compact embedded devices have been developed in recent years. For example, programmable devices like the Raspberry Pi or Arduino microcontrollers make it possible to collect data with sensors and process it, and the data collected by these devices can be transmitted to servers over a network.

Networks, which we commonly use without thinking about them, also come in several varieties. The first is Bluetooth, a network accessed directly by the devices we use. It supports short-range communication between devices; because there are various types of devices, it supports various profiles, and its features include low power consumption. For example, it is used to connect things like headphones and smartwatches, so many of you have probably heard of it.

There are also networks set up in local environments like homes, workplaces, and schools. These are called local area networks, or LANs. For example, when you connect a PC or smartphone to a network via a LAN, you often use a wireless network called Wi-Fi. By connecting through a Wi-Fi router, you can then connect to a broadband network, which is a wide-area network.

You can also connect a smartphone directly to a wireless broadband network. In this case, you are communicating with the Internet through a mobile communication network like 4G, LTE, or 5G. Among these, 5G in particular is the newest mobile communication network, and its features include high bandwidth and low latency. Its coverage area is currently expanding, and it can be accessed in many places.

Now, how is the big data collected via the Internet managed? Database systems are essential for managing and utilizing big data. Among the types of database systems, relational database systems are currently the most commonly used.

In relational database systems, data is stored in tabular structures called relations. As you can see in this image, there are two example relations here: user and message. Relations have multiple attributes; the user relation has three attributes: ID, name, and e-mail, and the same goes for message. Users can search the data stored in this way using SQL, a standard query language. For example, a query to show the messages posted by the user named Tsukuba Taro can be processed by an SQL query like the one in this diagram.
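The diagram itself is not reproduced here, but such a query might look like the following, sketched with Python's built-in sqlite3 module. The message relation's columns and the sample rows are assumptions; only the user attributes (ID, name, e-mail) come from the slide.

```python
import sqlite3

# In-memory sketch of the user and message relations from the slide.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE user(id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE message(id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
    INSERT INTO user VALUES (1, 'Tsukuba Taro', 'taro@example.org');
    INSERT INTO message VALUES (1, 1, 'Hello from Tsukuba!');
""")

# The kind of query the lecture describes: messages by Tsukuba Taro.
rows = con.execute("""
    SELECT m.text
    FROM message m JOIN user u ON m.user_id = u.id
    WHERE u.name = 'Tsukuba Taro'
""").fetchall()
print(rows)  # [('Hello from Tsukuba!',)]
```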
In a relational database system, even if there is a huge amount of data, applying a special data structure called an index makes it possible to process queries quickly.

Another very important point is that relational systems support transactions. As an example of a transaction, think of a shopping site: when an order is received, you have to create the purchase history and update the inventory simultaneously. If there is a purchase history but the inventory has not decreased, there will be an inconsistency between the inventory and the orders. In a relational database system, such updates to multiple parts of the database which need to be performed together can be handled as single units called transactions. The system guarantees that no matter how complicated a transaction is, it will always be either executed in its entirety or not executed at all; it can never be only partially executed and cause an inconsistency. This makes it possible to manage complicated real-world data.

Relational databases are incredibly useful, but there are many kinds of big data that are ill-suited to them: for example, data which has a simple structure but must be read and written quickly. They are also not very good at storing data that does not have a fixed structure. To meet these diverse needs, other types of databases were developed besides relational databases. These are collectively known as NoSQL databases, because they are non-SQL databases.

For example, shown on the right is a key-value store, one of the main types of NoSQL database. A key-value store handles only pairs composed of a key, which is the search key, and its associated value, and it supports only exact-match searches on a key. In exchange, it not only allows ultra-fast searching of massive data sets, it also supports writing large amounts of data.

In addition, document stores for JSON, a more loosely organized kind of semi-structured data, are also widely used. One point to note about these NoSQL databases is that they either do not support transactions or support them only in a limited way. If complicated transaction processing like the kind I explained on the previous slide is necessary, you need to be careful.

So, how is the big data stored in a database processed? You may have realized that an ordinary PC is not sufficient for processing big data. For example, consider 1 PB (petabyte) of data. Even with a storage device capable of reading 100 MB per second, it would take about 115 days to read all of the data, so practical processing is not possible with this kind of hardware. Therefore, parallel distributed processing is used to process big data. Specifically, massive computers composed of multiple PCs, called cluster computers, are prepared. The data is broken down into smaller chunks in advance, and when the data is processed, those small chunks are read and partially processed by each PC. The overall result is obtained by aggregating these partial results. Because the individual pieces of data are processed in parallel, it becomes possible to analyze big data in a realistic time frame.
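Here is a toy sketch of that divide-and-aggregate pattern, using local worker processes in place of a real cluster. The data and chunk size are made up, and an actual system would distribute the chunks across many machines.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker processes one small chunk of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))  # stand-in for a huge data set
    # Break the data down into smaller chunks in advance.
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    with Pool() as pool:
        partials = pool.map(partial_sum, chunks)  # process chunks in parallel
    print(sum(partials))  # aggregate the partial results: 499999500000
```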
Finally, let's look at the issues in utilizing big data.

While there are various issues in the utilization of big data, these five are the main ones. The first is that data is monopolized by certain companies and organizations. Second, the storage and analysis costs are immense. Third, there is a shortage of personnel able to utilize big data. Fourth, there is the utilization of data which is specialized for a particular purpose. And finally, there are issues of security, privacy, and so on. I will explain these issues in the following slides.

There are a limited number of organizations and companies able to collect and store big data, and the monopolization of data by these companies is a problem. This is why the concept of open data has attracted interest in recent years. In 2013, the G8 nations adopted the Open Data Charter. It holds that the current monopolization of data by governments and businesses represents a loss of opportunity for the people, and that promoting open data can give rise to general-purpose innovations which fulfill the needs of citizens.

Open data has the following five steps. It starts with releasing data publicly under an open license that allows anyone to use it, regardless of format. The next step is to release the data in a machine-processable format: for example, releasing documents not only as scanned images but also in formats like Word or Excel. The third step is to release data in openly accessible formats. An Excel file, for example, requires a program like Excel to open it, but releasing the data in an openly accessible format like CSV instead enables people who do not have Excel to access it as well. The fourth step is to release the data in RDF format, and the fifth and final step is to link the data in RDF format to other data. I will explain RDF in detail on the next slide.

RDF, which I mentioned on the previous slide, is short for Resource Description Framework, and it refers both to the general-purpose data representation model established by the World Wide Web Consortium (W3C) and to the associated format. In RDF, all kinds of data are represented by triples made up of a subject, a predicate, and an object. For example, the fact that Bob is interested in the Mona Lisa can be represented by a triple in which “Bob” is the subject, “is interested in” is the predicate, and “the Mona Lisa” is the object. In the bottom diagram, the arrow between Bob and the Mona Lisa corresponds to this triple. In the same way, the fact that Bob is friends with Alice, or that Bob's date of birth is July 14, 1990, and so on, are described by different triples. In this way, any information can be described with a graph model. This is sometimes called a knowledge graph.
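To make the triple model concrete, here is a small illustrative sketch that stores the facts above as (subject, predicate, object) tuples and queries them with wildcards. Plain Python tuples stand in for a real RDF store, which would be queried with SPARQL; the Leonardo da Vinci triple is an extra example, not from the slide.

```python
# The facts from the slide as (subject, predicate, object) triples.
triples = [
    ("Bob", "is interested in", "the Mona Lisa"),
    ("Bob", "is a friend of", "Alice"),
    ("Bob", "was born on", "1990-07-14"),
    ("the Mona Lisa", "was created by", "Leonardo da Vinci"),
]

def match(triples, s=None, p=None, o=None):
    # Return every triple matching the given pattern (None is a wildcard).
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(triples, s="Bob"))            # everything we know about Bob
print(match(triples, o="the Mona Lisa"))  # every triple pointing at the Mona Lisa
```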
One of the key features of RDF is that, as you can see from this diagram, you can freely post links to other data sets, which makes it possible to build a web of data. Statistical data from national and local governments, as well as various other information, is already available in RDF format.

The next issue is the cost required for storage and analysis. As we know from what we have learned so far, collecting big data requires large-scale data systems, and the storage and cluster computers used to analyze it are very costly. Furthermore, personnel with specialist knowledge are required to maintain, manage, and analyze this data, and their numbers are limited. Only a limited number of organizations and companies have both of these things, so the monopolization of big data has become a problem.

In connection with this, the third point, the shortage of personnel capable of utilizing big data, is also a major issue. Discovering useful information by analyzing big data requires experience and an understanding of the problem domain, in addition to deep knowledge of statistical analysis, data mining, and AI. Personnel who possess this expertise are called data scientists. In addition, experts knowledgeable about computers, networks, and storage are needed to operate the large-scale parallel computing systems used to maintain, manage, and analyze big data; such personnel are called infrastructure engineers. The number of people knowledgeable about infrastructure who also have training as data scientists is extremely limited, and they are incredibly valuable right now. How to train such personnel is a very important societal issue.

The utilization of data specialized for a particular purpose is also a serious issue. Big data is intrinsically collected and stored for a specific purpose. At the same time, using it for applications other than its original purpose and linking or combining it with data it was never intended to be combined with is growing more important from the perspective of utilizing data resources. That said, caution is necessary in such cases. For example, using units or a level of granularity which differs from that of the originally collected data requires conversion: if something was originally tallied weekly in dozens, and you want daily counts in single units instead, you will need to convert both the units and the time granularity. The accuracy of the data may vary as well; in such cases, you either discard data that does not meet the required level of accuracy, or a process to supplement the accuracy becomes necessary. There may also be data which lacks required entries, which is handled by supplementing the data from other sources, inferring the missing entries, and so on. In every case, it is necessary to take appropriate measures case by case. A variety of research is being conducted on using data outside its intended purpose, and applying the results is expected to enable better utilization of data.

It should go without saying that security and privacy are the highest priorities in utilizing big data. At the same time, how to utilize data which contains personal information while maintaining the anonymity of the individuals is also a very important question. Research and development on techniques for this purpose, known as homomorphic encryption and differential privacy, is being pursued.

The development of legislation is also progressing. The laws associated with the utilization of big data vary from region to region, so it is important to understand the legal system of the target region before using the data. For example, there is the Amended Act on the Protection of Personal Information in Japan, while in the EU, the General Data Protection Regulation (GDPR) applies. It is important to develop a deeper understanding of these regulations.

In conclusion, I would like to sum up the contents of this lecture. In this lecture, we learned about the basic concept of big data, the IoT and CPS systems which support a digitalized society, and the data systems which in turn support big data and IoT/CPS. Lastly, we learned about the issues in utilizing big data. This concludes the lecture. Thank you for watching.

—00:42:27