Ivica Siladić - 13/05/2021 | 4 min read
Telematics Big Data
The law in China mandates the use of GPS tracking for all heavy trucks. Heavy truck owners must purchase a GPS tracking device from a certified telematics service provider, regardless of whether they will use any accompanying telematics services. At the same time, telematics service providers must forward the acquired GPS tracking data to the appropriate government agency.
There are about 12 million heavy trucks in China. To streamline the operation, the government mandates specific data transfer protocols, both between GPS tracking devices and service providers and between service providers and the government agency. The device-to-provider protocol is named JT 808, while the provider-to-government protocol is named JT 809. Both protocols run over TCP/IP.
An interesting side effect of the mandated JT 808/809 protocols is that virtually all GPS tracking devices in China speak the same wire protocol (JT 808), making it easy for customers to switch between telematics providers while keeping their tracking devices. This is quite different from the tracking business in other countries, where telematics providers try to lock in customers with proprietary devices or proprietary protocols, keeping the barrier to switching service providers much higher.
Another interesting fact is that in China, third-party companies may purchase GPS tracking data from many telematics providers and readily use it, because all sources speak the same JT 809 protocol for server-to-server communication. In other words, becoming a Vehicle Data Hub in China is simple from a technical point of view. China has no GDPR-like rules, so there are almost no business or technical obstacles for anyone to jump in, purchase the tracking data, and do whatever they want with it.
So, what do Chinese companies do with the GPS tracking data from heavy trucks? Many compelling things, of course, but one utterly unexpected use case came from our partner in China a few years ago - a financial institution (a bank) wanted to measure the frequency of heavy truck visits to a factory area. The catch is that the trucks visiting the factory do not belong to the factory itself. Instead, the factory hires trucks for a transport task, and the same truck may never be hired by the same factory again. To figure out how many trucks visited the factory, say, last Monday, one has to analyze the journeys of all 12 million heavy trucks and come up with an answer.
But why did the bank want this information in the first place? Well, the level of business trust among companies in China is not quite as high as in, for example, the Western world. This lack of trust makes lending money very expensive because loan interest rates must cover a relatively high rate of fraud. For that reason, our bank was looking for a more objective measure of a factory's business health. They came up with the idea of indirectly measuring the volume of goods leaving the factory by simply counting the trucks going in and out of it! A simple, unexpected, but brilliant idea!
Before we proceed with the description of the solution, let's do some basic math. A device in a truck sends approximately 100,000 GPS probes yearly. All 12 million trucks would then send 12 million x 100,000 = 1.2 trillion GPS probes per year. If we simply store each of these GPS probes in a database, we end up with a table with a massive 1.2 trillion rows! To extract the data our bank needs, we have to run a query that finds two consecutive GPS probes of a truck, one inside the factory's polygon and the other outside of it (or vice versa). That is a classical spatial join between a table of GPS probes and a polygon.
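To make the scale concrete, here is a minimal sketch of what such a raw probe table could look like. The table and column names are hypothetical illustrations, not Mireo's actual schema:

```sql
-- Hypothetical raw-probe table; names and types are illustrative only.
-- At ~100,000 probes per truck per year and 12 million trucks, this
-- table grows by roughly 1.2 trillion rows every year.
create table gps_probes (
    vid bigint,           -- vehicle (truck) identifier
    t   timestamp,        -- probe timestamp
    lon double precision, -- longitude in degrees
    lat double precision  -- latitude in degrees
);
```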
From the technical point of view, spatial joins are generally among the most expensive operations. In this case, the database has to perform potentially billions of point-in-polygon tests, which are dramatically slow. Pair that with the "consecutive rows" selection, which is unnatural for relational databases (it has to be implemented as an SQL window operation with a sort), and we end up with a query that would take forever to execute.
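To see why, here is a rough sketch of the naive approach in generic PostGIS-flavored SQL, using the hypothetical gps_probes table above and an assumed one-row factory(poly) table holding the factory polygon. It pairs each probe with its predecessor via a window function and tests both against the polygon:

```sql
-- Naive boundary-crossing detection (PostGIS-flavored sketch).
-- LAG() forces a full sort of all probes per vehicle, and every probe
-- is tested with ST_Within: billions of point-in-polygon calls.
with ordered as (
    select
        vid, t, lon, lat,
        lag(lon) over w as prev_lon,
        lag(lat) over w as prev_lat
    from gps_probes
    window w as (partition by vid order by t)
)
select
    o.vid, o.t,
    -- If the current probe is inside, the vehicle just entered.
    case when ST_Within(ST_Point(o.lon, o.lat), f.poly)
         then 'Entry' else 'Exit' end as ev
from ordered o
join factory f
  on ST_Within(ST_Point(o.lon, o.lat), f.poly)
     <> ST_Within(ST_Point(o.prev_lon, o.prev_lat), f.poly)
where o.prev_lon is not null;
```

Even with a spatial index on the polygon side, the per-vehicle sort and the sheer row count dominate the runtime.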
Without going into the technical details, let's only point out that, unfortunately, it is simply not possible to throw a "bigger hammer" at the problem: no off-the-shelf analytical backend would produce the answer to the bank's question within a reasonable time. Spatiotemporal data are a different breed, and no Spark/Hadoop setup or magic cloud solution comes to the rescue. That effectively led us to build Mireo SpaceTime, a data storage/analysis system explicitly designed to cope with massive spatiotemporal datasets.
So, let's demonstrate the solution to the problem using the Mireo SpaceTime framework. The framework first processes GPS probes (points) into map-matched continuous trajectories, making the analysis of vehicle movements much easier and more precise. Then, SQL coupled with a proprietary multidimensional index executes the query in well under one second.
More precisely, the following Mireo SpaceTime SQL query finds all vehicles in the database whose trajectory (a trip) crosses the boundary of a fixed polygon. The query determines whether each crossing was an entry into or an exit from the polygon, and it also finds the time of the event. The query almost directly answers our bank's question, where "almost" means that we need some post-aggregation/grouping of the results to obtain the required truck visit frequencies - a sketch of that step follows the results below.
```sql
with fences as (
    -- The factory geofence; the coordinates are in an integer-normalized
    -- spherical Mercator projection (see the note below the query).
    select id, poly from (values
        (0, ST_GeomFromText('POLYGON ((
            6920696 38210240, 6919936 38209480, 6920176 38207712,
            6921568 38207680, 6923016 38208320, 6923216 38209432,
            6923176 38210256, 6922368 38210728, 6920696 38210240
        ))'))
    ) f(id, poly)
)
select
    vid, t[0] as t,
    -- A segment starting inside the polygon means the vehicle is leaving it.
    if(ST_Within(ST_Point(x[0], y[0]), poly), 'Exit', 'Entry') as ev
from st.segments
join fences on ST_Crosses(ST_Line(x[0], y[0], x[1], y[1]), fences.poly)
where vid between 1300000 and 1300499 and
    t[1] > ts('2020-07-01', 'Europe/Zagreb') and
    t[0] < ts('2021-01-01', 'Europe/Zagreb')
order by 2
```
The polygon boundary in the query is expressed in an integer-normalized standard spherical Mercator projection (i.e., a linearly transformed Web Mercator projection) of the corresponding longitude and latitude values. We use projected coordinates because geometry operations are much faster in Euclidean space than in spherical space. Note also that the polygon in the query is actually in Zagreb, Croatia, and that the time filter spans six months in CET - we use these values for illustration purposes only. Apart from these values, the production query is exactly the same.
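For illustration, here is a plain-SQL sketch of the standard spherical (Web) Mercator forward projection, in PostgreSQL-flavored syntax. The additional linear rescaling that produces Mireo's integer-normalized coordinates is proprietary, so its constants are not reproduced here:

```sql
-- Standard spherical Mercator forward projection, Earth radius R = 6378137 m:
--   x = R * lon_rad,  y = R * ln(tan(pi/4 + lat_rad/2))
-- Mireo additionally applies a linear transform to get integer coordinates;
-- that normalization is omitted in this sketch.
select
    radians(lon) * 6378137                         as x_merc,
    ln(tan(pi() / 4 + radians(lat) / 2)) * 6378137 as y_merc
from (values (15.97, 45.81)) as p(lon, lat);       -- roughly Zagreb
```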
The output of the query looks like this:
| Vehicle | Date/Time | Event |
|---|---|---|
| #1300172 | 23/07 04:23 | Entry |
| #1300172 | 23/07 04:38 | Exit |
| #1300164 | 01/08 05:42 | Entry |
| #1300164 | 01/08 05:43 | Exit |
| #1300164 | 01/08 05:43 | Entry |
| #1300164 | 01/08 07:48 | Exit |
| #1300108 | 11/08 11:55 | Entry |
| #1300108 | 11/08 11:56 | Exit |
| #1300108 | 11/08 12:08 | Entry |
| ... | ... | ... |
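For completeness, the post-aggregation step we omitted could look roughly like this in generic SQL, assuming the crossing events above are available as a table crossings(vid, t, ev) (a hypothetical name); date_trunc is PostgreSQL syntax:

```sql
-- Sketch of the omitted aggregation: distinct trucks entering per day.
select
    date_trunc('day', t) as day,
    count(distinct vid)  as trucks_visited
from crossings
where ev = 'Entry'
group by 1
order by 1;
```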
In a sense, answering the bank's original question took us on a trip to the Moon and back. Once we became aware of the amount of data we had to process and the costly spatial operation we could not avoid, we learned that mainstream database tools are unsuitable for almost any telematics data analysis. That was the main reason we built Mireo SpaceTime, an infinitely scalable spatiotemporal database.