What’s new in GTFS-realtime v2.0

Sep 25, 2017

British author Arthur C. Clarke, co-writer of the 2001: A Space Odyssey screenplay, said:

Any sufficiently advanced technology is indistinguishable from magic.

100 years ago, the first streetcars were rolling out in San Francisco. If you had approached a transit rider and handed them an iPhone that showed not only where the streetcar was on a map in real time, but also a prediction of when it would arrive, it surely would have blown their mind.

San Francisco streetcar opening approximately 100 years ago (Photo credit - SFMTA Photography Department and Archive, https://tinyurl.com/ycq5asge)

Fast forward to today — smartphones are a little less magical, but transit riders still appreciate the benefits of real-time transit information (RTI), including:

Shorter perceived wait time¹ ²
Shorter actual wait time¹
Lower learning curve for new riders³
Increased ridership⁴ ⁷
Increased feeling of safety (e.g., at night)⁵ ⁶

Quality of RTI is also important — in one study⁸, 9% of riders said they took the bus less often due to errors in RTI they experienced.

TL;DR: GTFS-realtime v2.0 will help transit agencies improve the quality of their RTI. Before we dive into the details, though, we need to quickly recap some of the basics of how RTI gets from the bus into the palm of your hand.

Where does real-time data come from?

Typical real-time information flow from a transit agency to a mobile app

Mobile transit apps get RTI directly from transit agencies, which typically generate it with an automatic vehicle location (AVL) system. In the early days of AVL, the exchange format for RTI varied widely: each AVL vendor had its own Application Programming Interface (API) design. Over the past few years, agencies have begun to standardize on the General Transit Feed Specification (GTFS)-realtime format, a companion to the wildly popular GTFS format. TransitFeeds.com lists over 50 transit agencies with public GTFS-realtime feeds.

GTFS-realtime describes how to exchange information about Trip Updates (arrival and departure predictions), Vehicle Positions (yup, the position of the vehicle), and Service Alerts (human-readable descriptions of disruptions to the network).

What’s wrong with GTFS-realtime v1.0?

Having a de facto standard for RTI is great — it lets app developers focus on creating awesome new features instead of wrangling data.

However, as more transit agencies and app developers started using GTFS-realtime, they noticed something peculiar — almost all of the GTFS-realtime fields were optional. To be exact, of the 63 GTFS-realtime data fields, only 7 were required — about 11%.

The overwhelming number of optional fields makes it very simple for AVL system implementers to roll out a GTFS-realtime feed that is officially compliant with GTFS-realtime v1.0 — they can leave most of the values blank. However, this makes life difficult for everyone when people start consuming that information — some critical information may be missing. Transit riders get angry when an app gives them bad data, and this reflects poorly on the app developer and transit agency, and causes problems with the AVL vendor due to unmet data quality expectations.

Let’s look at an example — here’s a fully compliant GTFS-realtime v1.0 feed for a vehicle position:

header {
  gtfs_realtime_version: "1.0"
}
entity {
  id: "d131dd02"
  vehicle {
    position {
      latitude: 28.04265
      longitude: -82.45945
    }
  }
}

We’re missing critical information:

When was this position calculated? Is it one minute old? One day?
What route or trip is this vehicle currently serving? At best we can guess which route this serves by comparing it against route geometries, but that’s pretty ugly.
How do we describe the vehicle to a transit rider? Surely d131dd02 isn’t a valid bus number…or is it?
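These gaps can be checked mechanically. Here is a minimal Python sketch, assuming the feed entity has already been decoded into a plain dictionary that mirrors the example above. The function name missing_rider_info and the dictionary layout are hypothetical and not part of any GTFS-realtime library; only the field names (timestamp, trip, vehicle, label) come from the specification.

```python
def missing_rider_info(entity):
    """Return the rider-facing details that a vehicle entity lacks.

    `entity` is a hypothetical plain-dict mirror of a decoded
    GTFS-realtime FeedEntity, used here only for illustration.
    """
    vehicle_position = entity.get("vehicle", {})
    missing = []
    # Without a timestamp we cannot tell how old the position is.
    if "timestamp" not in vehicle_position:
        missing.append("timestamp")
    # Without a trip descriptor we cannot tell which route or trip is served.
    if "trip" not in vehicle_position:
        missing.append("trip")
    # Without a vehicle label we have no rider-friendly name for the bus.
    if "label" not in vehicle_position.get("vehicle", {}):
        missing.append("vehicle.label")
    return missing


# The v1.0-compliant example from above, rewritten as a dict.
example = {
    "id": "d131dd02",
    "vehicle": {"position": {"latitude": 28.04265, "longitude": -82.45945}},
}
print(missing_rider_info(example))
```

For the example feed, all three details are reported as missing, even though the feed is valid under v1.0.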

A second example — when providing arrival predictions in GTFS-realtime v1.0, the stop_sequence field is optional:

trip {
  trip_id: "277725"
}
stop_time_update {
  arrival {
    delay: 900  # 15 minutes
  }
  stop_id: "A"
}

This works fine with stop_id alone for most routes. However, a missing stop_sequence value creates problems when you have a route with a loop that visits a stop more than once.

A route with a loop that visits Stop A twice

Is the bus delayed for 15 minutes the first time it arrives at Stop A, or the second? This makes a huge difference, especially for large loops with many stops in between. Riders waiting to board at Stop B will be very annoyed if we tell them they have time to grab an extra coffee (because we think the 15 minute delay is in the 1st half of the loop) and then they miss the bus when it arrives on-time (because the delay was actually in the 2nd half of the loop). We need the stop_sequence field in this case to correctly interpret the prediction.
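The ambiguity is easy to reproduce in code. The following Python sketch assumes a made-up schedule for one trip on the loop route (the stop list, including Stop C, and the function name matching_visits are hypothetical) and shows how many scheduled visits a prediction could refer to, with and without stop_sequence.

```python
# Hypothetical schedule for one trip on a loop route: (stop_sequence, stop_id).
# Stop A is visited twice, as in the figure above.
LOOP_TRIP_STOPS = [(1, "A"), (2, "B"), (3, "C"), (4, "A")]


def matching_visits(stops, stop_id, stop_sequence=None):
    """Return the scheduled visits that a stop_time_update could refer to."""
    return [
        (seq, sid)
        for seq, sid in stops
        if sid == stop_id and (stop_sequence is None or seq == stop_sequence)
    ]


# stop_id alone matches both visits to Stop A, so the delay is ambiguous.
print(matching_visits(LOOP_TRIP_STOPS, "A"))
# Adding stop_sequence pins the prediction to exactly one visit.
print(matching_visits(LOOP_TRIP_STOPS, "A", stop_sequence=4))
```

A consumer that gets two matches has to guess which visit is delayed. With stop_sequence there is exactly one match and nothing to guess.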

Why so many optional fields?

So, why are there so many optional and so few required fields in GTFS-realtime v1.0?

It’s related to the format used to exchange GTFS-realtime data: Protocol Buffers. Protocol Buffers are an extremely compact way to represent information in a binary format. Instead of the feed data being formatted as Unicode text characters (the way I’ve shown GTFS-realtime data earlier in this article), each of which takes at least 1 byte (8 bits), it can be compressed into a smaller representation of 0s and 1s.

The space savings of Protocol Buffers add up quickly, especially when you consider that this information is updated on the order of seconds. Here’s a single response from the Massachusetts Bay Transportation Authority (MBTA)’s GTFS-realtime Trip Updates feed: the binary Protocol Buffer version is under 1 MB, and the plain text version is more than 6 times bigger at just over 5.5 MB.

The Protocol Buffer version of MBTA’s GTFS-realtime feed is over six times smaller than the plain text version
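To see where the savings come from, here is a small Python sketch that hand-encodes a single field, the delay of 900 seconds from the earlier example, using the Protocol Buffers varint wire encoding. It is an illustration only: the helper names are hypothetical, it handles non-negative integers only, and it assumes delay is field number 1 of the StopTimeEvent message, as in the gtfs-realtime.proto file. Real feeds should be encoded with generated Protocol Buffer code, not by hand.

```python
def encode_varint(value):
    """Encode a non-negative integer as a Protocol Buffers base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F  # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(low_bits)  # last byte: continuation bit stays clear
            return bytes(out)


def encode_varint_field(field_number, value):
    """Encode one varint field: a tag (field number plus wire type 0), then the value."""
    tag = (field_number << 3) | 0  # wire type 0 means "varint"
    return encode_varint(tag) + encode_varint(value)


# Assumption: delay is field number 1 in StopTimeEvent (see gtfs-realtime.proto).
binary = encode_varint_field(1, 900)
text = "delay: 900"
print(binary.hex(), len(binary), "bytes in binary vs", len(text.encode("utf-8")), "bytes as text")
```

The field name never appears in the binary form. It is replaced by a one-byte tag, and the number itself takes two bytes instead of three text characters. Whole-feed ratios depend on the mix of fields, so this single field is not representative of the MBTA numbers above.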

While binary formats are extremely space-efficient, it can be a real pain to write code that processes them. Protocol Buffers solve this problem by generating this code for you. First, you create a .proto file that describes the data elements you want to exchange (in our case, the official gtfs-realtime.proto). Then, you can feed this .proto file into the open-source Protocol Buffer tools to auto-generate the code that compresses the information to a binary format (used in the Transit Agency AVL server) and the code to extract it from a binary format (used in the App Developer’s server).

The Protocol Buffer compiler auto-generates code to exchange binary GTFS-realtime messages

To make things even more convenient, Google has already done this step for you and created a ready-to-use gtfs-realtime-bindings library that supports easily exchanging GTFS-realtime messages in the programming languages Java, .NET, JavaScript / Node.js, PHP, Python, Ruby, and Golang.

So, what does all this have to do with optional vs. required GTFS-realtime fields?

Well, the original GTFS-realtime v1.0 documentation included a Cardinality field with the values of required and optional…and repeated??? What the heck does repeated mean?

It turns out that this Cardinality documentation was copied from the gtfs-realtime.proto file and doesn’t have anything to do with public transportation. Protocol Buffer Cardinality simply defines whether or not software parsing the binary message expects a field to exist; it has no direct mapping to GTFS or transit-specific logic (repeated, in case you’re wondering, is Protocol Buffer lingo for a list of zero or more elements). This becomes a problem in GTFS-realtime because many software engineers choose not to label any Protocol Buffer fields as Required because of forwards-compatibility issues with Protocol Buffer implementations (see the Protocol Buffer docs “Required is Forever” and this GTFS-realtime Google Group discussion for nitty-gritty details). As a result, nearly all fields in GTFS-realtime v1.0 are shown as Optional, even if that field is necessary for a transit app to show proper real-time information to a transit rider.

The solution — GTFS-realtime v2.0

To fix this confusion, the GTFS-realtime community created a new version of the format that defines the semantic requirements and cardinality of RTI. In other words, GTFS-realtime v2.0 now defines which fields should be required based on domain-specific (transit) logic (detailed proposal here). These definitions are completely independent of the Protocol Buffer format and would still apply even if a different format were used for GTFS-realtime data.

In GTFS-realtime v2.0, each field now has a Required column that can contain the following values:

Required: This field must be provided by a GTFS-realtime feed producer.
Conditionally required: This field is required under certain conditions, which are outlined in the field Description. Outside of these conditions, the field is optional.
Optional: This field is optional and is not required to be implemented by producers. However, if the data is available in the underlying automatic vehicle location systems (e.g., VehiclePosition timestamp) it is recommended that producers provide these optional fields when possible.

The Cardinality column now represents the number of elements that can be provided for a particular field — One or Many (e.g., a list of predictions applying to more than one stop within a trip).

Below is a snapshot of what the GTFS-realtime FeedHeader looks like in GTFS-realtime v1.0 (top), and GTFS-realtime v2.0 (bottom):

GTFS-realtime v2.0 defines semantic requirements in new “Required” and “Cardinality” fields

You can see the new Required column, which now makes critical fields like the header timestamp mandatory. This helps address the previously discussed problem of determining the age of a vehicle position (an aside: please also include individual timestamps for each vehicle! It’s important!).

The StopTimeUpdate message, which contains the information about arrival and departure predictions, is a good illustration of the new Conditionally required value:

stop_sequence is now “Conditionally required” for loop routes

Remember the problem with the ambiguous arrival prediction for the loop route? In GTFS-realtime v2.0, stop_sequence is Conditionally required, and one of the required cases outlined in the Description is the loop.
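The three requirement levels lend themselves to a simple rule check. The Python sketch below is one possible way to model them, not how the specification or any validator implements them. The names Required, FIELD_RULES, is_violation, and visits_stop_more_than_once are hypothetical, and the rule table covers only the three fields discussed in this article.

```python
from collections import Counter
from enum import Enum


class Required(Enum):
    REQUIRED = "Required"
    CONDITIONALLY_REQUIRED = "Conditionally required"
    OPTIONAL = "Optional"


# Only the fields discussed in this article; not the full specification.
FIELD_RULES = {
    "FeedHeader.timestamp": Required.REQUIRED,
    "StopTimeUpdate.stop_sequence": Required.CONDITIONALLY_REQUIRED,
    "VehiclePosition.timestamp": Required.OPTIONAL,
}


def is_violation(level, present, condition_applies=False):
    """Return True when a missing field breaks its requirement level."""
    if present:
        return False
    if level is Required.REQUIRED:
        return True
    if level is Required.CONDITIONALLY_REQUIRED:
        # Missing is only an error when the documented condition holds.
        return condition_applies
    return False  # Optional fields may always be omitted.


def visits_stop_more_than_once(trip_stop_ids, stop_id):
    """The loop-route condition: the trip serves this stop at least twice."""
    return Counter(trip_stop_ids)[stop_id] > 1


# Hypothetical loop trip that serves Stop A twice.
loop_trip = ["A", "B", "C", "A"]
rule = FIELD_RULES["StopTimeUpdate.stop_sequence"]
for stop_id in ("A", "B"):
    looped = visits_stop_more_than_once(loop_trip, stop_id)
    print(stop_id, is_violation(rule, present=False, condition_applies=looped))
```

An update for Stop A without stop_sequence is flagged because the trip visits it twice, while the same omission for Stop B is allowed. Keeping the condition separate from the requirement level mirrors how the field Description spells out when a Conditionally required field becomes mandatory.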

What’s next?

GTFS-realtime v2.0 provides much-needed guidance to producers (transit agencies and AVL vendors) and consumers (app developers and transit riders) on the various conditions under which certain fields are required. Now, validation tools can flag errors in GTFS-realtime v2.0 data based on the new use cases (we updated the open-source GTFS-realtime Validator tool to detect v2.0 errors a few weeks ago). This will lead to shorter software development and QA cycles for GTFS-realtime producers, lower deployment costs, and, perhaps most importantly, better quality information for transit riders.

So we’re done, right? All problems are solved? Well, it’s never that easy :). We haven’t gotten into troubleshooting pure prediction errors — when it says the bus was going to arrive in 5 minutes…15 minutes ago. As mentioned earlier, accuracy of predictions is very important to riders, and right now there isn’t even an agreed-upon definition of how accuracy and precision should be measured. We can tackle that next.

Oh, and don’t forget about your GTFS data! Real-time information is only as good as the GTFS data on which it’s built. Check out the article below for a short 5-minute read on the GTFS Best Practices initiative:

GTFS Best Practices now available!


Acknowledgements

Our work at the Center for Urban Transportation Research (CUTR) at the University of South Florida (USF) on the GTFS-realtime v2.0 proposal to define field semantic requirements and cardinality, and on the development of the open-source GTFS-realtime Validator, has been funded by the National Institute for Transportation and Communities (NITC). The contents of this article reflect the views of the authors, who are solely responsible for the facts and the accuracy of the material and information presented herein.

References

1. Kari Edison Watkins, Brian Ferris, Alan Borning, G. Scott Rutherford, and David Layton (2011), “Where Is My Bus? Impact of mobile real-time information on the perceived and actual wait time of transit riders,” Transportation Research Part A: Policy and Practice, Vol. 45 pp. 839–848.
2. Candace Brakewood, Sean Barbeau, Kari Watkins (2014). “An experiment evaluating the impacts of real-time transit information on bus riders in Tampa, Florida”, Transportation Research Part A: Policy and Practice, Volume 69, November 2014, Pages 409–422, ISSN 0965–8564, http://dx.doi.org/10.1016/j.tra.2014.09.003.
3. C. Cluett, S. Bregman, and J. Richman (2003). “Customer Preferences for Transit ATIS,” Federal Transit Administration. Available at http://ntl.bts.gov/lib/jpodocs/repts_te/13935/13935.pdf
4. Lei Tang and Piyushimita Thakuriah (2012), “Ridership effects of real-time bus information system: A case study in the City of Chicago,” Transportation Research Part C: Emerging Technologies, Vol. 22 pp. 146–161.
5. Brian Ferris, Kari Watkins, and Alan Borning (2010), “OneBusAway: results from providing real-time arrival information for public transit,” in Proceedings of the 28th International CHI Conference on Human Factors in Computing Systems, Atlanta, Georgia, USA, pp. 1807–1816.
6. A. Gooze, K. Watkins, and A. Borning (2013), “Benefits of Real-Time Information and the Impacts of Data Accuracy on the Rider Experience,” in Transportation Research Board 92nd Annual Meeting, Washington, D.C., January 13, 2013.
7. Brakewood, Macfarlane and Watkins (2015). The Impact of Real-Time Information on Bus Ridership in New York City. Transportation Research Part C: Emerging Technologies, Volume 53, pp. 59–7
8. A. Gooze, K. Watkins, and A. Borning (2013), “Benefits of Real-Time Information and the Impacts of Data Accuracy on the Rider Experience,” in Transportation Research Board 92nd Annual Meeting, Washington, D.C., January 13, 2013.