Entity relationships with real-time data
Table of contents
- A bit of context
- ~~Two~~ Three birds, one stone?
- Design and proof of concept
- Lock in
- Benchmarking
- Real-time data binding
- The result
Nowadays we have tools like GraphQL and DataConnect which make it so much easier to manage and collect complex relationships between data types stored in a database.
This is great for when you have a client which needs an entity and its relationship tree from the server in a single request... but what about when you need access to the data in real-time?
A pretty standard solution for real-time subscriptions to entities in a database would be to stand up a streaming database, a WebSocket connection, and a query for the client machine to subscribe to. But there are issues with this approach: I'd essentially be locked into potentially massive responses sent back from the streaming server on every update to any node in the entire set of entities my front end needs. I wasn't satisfied with that limitation, and I wouldn't have time to optimize a streaming database solution enough to avoid it. So here I'll go over my process of building a client-driven data collection tool that avoids exactly that, while keeping it as easy to use and reusable as possible.
📖 A bit of context
The project which drove me down this path is an internal UI tool to build a timeline of tasks and their geographic locations (which we’ll simply call an itinerary). In the operational context, this itinerary would be sent to some field agent to complete. Think early days of Uber Eats or Doordash when an employee at a dispatch center would be tasked with organizing the set of requirements for an employee in the field to complete a delivery. This itinerary is ultimately sent to the field agent via a mobile app to execute.
This is built on Vue and a NoSQL database: Firebase Realtime Database (RTDB).
In our system, the Itinerary data type alone is not particularly complex in structure. It's the series of relationships connected to it that makes the query complicated. An itinerary has relationships to separate nodes for data points like the waypoints on the itinerary, the user assigned to perform the tasks, that user's real-time location, the set of tasks and their statuses, orders, instructions provided by third parties, etc., all of which may be variable.
To handle the collection of all entities established in the relationships of an itinerary, the existing solution for fetching all related data was built on a Google Cloud Function (serverless, like Lambda on AWS) with a manually defined list of child relationships to be fetched in series...
For example:
interface GetItineraryDataRequest {
...
includeShipmentsDataInWaypointActionData?: boolean;
includeTasksDataInItineraryData?: boolean;
includeWaypointsDataInItineraryData?: boolean;
...
}
Not a particularly elegant solution... You can imagine the very linear approach to collecting an entity's relationships this would take, and the issues it poses for development scalability: the code for each conditional inclusion is manually written to perform the fetch and bundle that entity into the final response object. On top of that, any new relationship added to a data type requires a new data fetch to be written.
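To make the problem concrete, here is a hypothetical sketch of that linear, flag-driven pattern. The names, the in-memory `db`, and `fetchNode` are all illustrative stand-ins, not the real cloud function code:

```typescript
type Entity = Record<string, unknown>;

// Illustrative in-memory stand-in for the database.
const db: Record<string, Record<string, Entity>> = {
  itineraries: { itin_1: { id: 'itin_1', waypoints: ['wp_1'], tasks: ['t_1'] } },
  waypoints: { wp_1: { id: 'wp_1' } },
  tasks: { t_1: { id: 't_1' } },
};

// Stand-in for a database read; the real function awaited RTDB calls in series.
const fetchNode = async (ref: string, id: string): Promise<Entity | undefined> =>
  db[ref]?.[id];

interface GetItineraryDataRequest {
  itineraryID: string;
  includeTasksDataInItineraryData?: boolean;
  includeWaypointsDataInItineraryData?: boolean;
}

// Each flag maps to hand-written fetch-and-bundle code, executed one branch
// after another -- adding a relationship means writing another branch.
async function getItineraryData(req: GetItineraryDataRequest) {
  const itinerary = await fetchNode('itineraries', req.itineraryID);
  if (!itinerary) return undefined;
  const response: Entity = { ...itinerary };

  if (req.includeWaypointsDataInItineraryData) {
    const waypointsData: Entity[] = [];
    for (const id of (itinerary.waypoints as string[]) ?? []) {
      // Sequential awaits: each fetch blocks the next.
      const wp = await fetchNode('waypoints', id);
      if (wp) waypointsData.push(wp);
    }
    response.waypointsData = waypointsData;
  }
  if (req.includeTasksDataInItineraryData) {
    const tasksData: Entity[] = [];
    for (const id of (itinerary.tasks as string[]) ?? []) {
      const task = await fetchNode('tasks', id);
      if (task) tasksData.push(task);
    }
    response.tasksData = tasksData;
  }
  return response;
}
```

Every conditional is bespoke, and every nested level waits on the one before it.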
Upon some rough benchmarking, there were cases of massive relationship trees (500+ nodes) taking the cloud function over 60 seconds to collect and hand back to the client. There were several reasons for this underperformance, including cold start times, under-provisioned server-side resources, and the sheer volume of data being transferred. It didn't help that the function collected data linearly, consuming resources for every function call and executing each nesting level in series. Not only was this inefficient, it also lacked any ability to provide real-time data access.
Separately, up until this point, an itinerary (at least via any UI) was entirely immutable once committed to the database. This rigidity was a pressing limitation, since the requirements of an itinerary were constantly changing for reasons outside the dispatch center's control.
Ultimately, the dispatch team requested the ability to make reconcilable edits to an itinerary, and this opened an opportunity to make some sweeping improvements.
tl;dr
- Project uses Firebase Realtime Database for complex itineraries with numerous relationships
- Previous system of data fetching was slow and inefficient to both execute and maintain
- Existing solution: slow cloud function taking up to 60 seconds for large data fetches
- Current limitation: itineraries are immutable once created, access is slow, no real-time data
- Needs: efficient data fetching and editable itinerary system
🕊️ ~~Two~~ Three birds, one stone?
This had me thinking...
Maybe our tech had abstracted too far from the core functionality of Firebase's Realtime Database. It's sold as a real-time data store, but our implementation used cloud functions to build a system that wasn't much different from your typical REST API. In defense of that architectural decision: we were at a much earlier stage of the project, the goal was to implement quickly, and we wanted to offload the verification of auth access on data requests from Firebase RTDB's confusing rule syntax into more readable TS code.
But RTDB is WICKED FAST (I'm glad the little things still excite me). On a client-side project with an RTDB connection configured, a WebSocket connection is swiftly established and application data is streamed from server to client in what feels like an instant. An object of reasonable size (~100kb) can be downloaded in under 40ms. This was enough to convince me that, given a large set of nodes to subscribe to, the request time would be far superior to the cloud function approach.
So, I started thinking about how and in what ways leveraging this real-time system would solve each of the three issues.
🔎 Real-time synchronization of data
✅ It’s in the name!
🔎 Editable itineraries
✅ Changes can simply be committed continuously, allowing for edits to be made in a real-time context
🔎 Speedy and robust access to data
▢ Here’s where a thoughtful design strategy could come into play… I want to do more than just migrate the backend collection logic to client-side code
🌱 Design and proof of concept
Not only did the specific problem drive me to rethink the data fetch methodology that the cloud function serving this itinerary data was using, but I envisioned a solution that could be more declarative and reusable. After all, using the WebSocket connection Firebase gives us is a valuable system that the entire team should take advantage of!
So I devised a plan to build a type-safe, recursive utility function (I settled on the name fetchWithRelationships, which we'll refer to as FwR for short). My goal was for FwR to fetch any entity, any of its child/sibling relationships, any of those entities' relationships, and so on, until all requested nodes are fetched or bound.
I’m no stranger to recursion, so my initial focus was to define a proof of concept strictly via declaring the type definitions I’d need.
Design an intuitive utility interface experience
Let’s consider a very simple example of the data structures we might be working with:
interface Itinerary {
id: string;
assignedUserID: string;
startTime: number; // unix timestamp (ms)
waypoints: string[]; // array of waypoint IDs
}
interface Waypoint {
id: string;
location: [number, number]; // latitude, longitude
actions: string[]; // array of action IDs
}
interface Action {
id: string;
type: 'pickup' | 'dropoff' | 'task';
orderID: string;
status: 'pending' | 'in_progress' | 'completed';
}
interface Order {
id: string;
date: number; // unix timestamp (ms)
}
interface User {
id: string;
name: string;
}
And a representation of the NoSQL database might look like this:
enum DbRef {
ITINERARIES = 'itineraries',
WAYPOINTS = 'waypoints',
ACTIONS = 'actions',
ORDERS = 'orders',
USERS = 'users',
}
interface Database {
[DbRef.ITINERARIES]: {
[key: string]: Itinerary;
};
[DbRef.WAYPOINTS]: {
[key: string]: Waypoint;
};
[DbRef.ACTIONS]: {
[key: string]: Action;
};
[DbRef.ORDERS]: {
[key: string]: Order;
};
[DbRef.USERS]: {
[key: string]: User;
};
}
Given that a developer using the FwR utility is familiar with the project's data structures, I wanted to design the request structure to mirror the schema of the request target as closely as possible.
You’ll notice that in the type definitions, the relationships between entity types are defined as either a string or an array of strings; these are ID values of the related entity or entities. Up until now, this relationship definition existed only in the developer's understanding. For example, to collect the list of waypoints for an itinerary, I would rely on assumptions or documentation around the Itinerary data structure to know to fetch the related waypoints from DbRef.WAYPOINTS. There is no systematic indication of which DbRef the related entity can actually be found in.
So to get this to work, the configuration for FwR will need all relationships for each type structure explicitly defined. These relationships could be represented as a subset of the type’s keys. For example, in the case of the Itinerary data type, the set of keys which are allowed to be defined as a relationship is a subset of the keys of the Itinerary interface as shown below:
type ItineraryRelationshipKey = keyof Pick<Itinerary,
'assignedUserID'
| 'waypoints'
>;
And the explicit definition of the DbRef for each relationship is defined as a record:
export const ItineraryRelationshipKeys: Record<ItineraryRelationshipKey, DbRef> = {
assignedUserID: DbRef.USERS,
waypoints: DbRef.WAYPOINTS,
};
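Applying the same pattern to the rest of the example types yields a single lookup table the utility can use to resolve, for any DbRef, which keys are relationships and where the related entities live. This is a sketch under my own assumptions about the config shape (the real project's table may differ); Order and User define no relationships, so they are simply omitted:

```typescript
enum DbRef {
  ITINERARIES = 'itineraries',
  WAYPOINTS = 'waypoints',
  ACTIONS = 'actions',
  ORDERS = 'orders',
  USERS = 'users',
}

interface Itinerary { id: string; assignedUserID: string; startTime: number; waypoints: string[]; }
interface Waypoint { id: string; location: [number, number]; actions: string[]; }
interface Action { id: string; type: 'pickup' | 'dropoff' | 'task'; orderID: string; status: 'pending' | 'in_progress' | 'completed'; }

// Each type's relationship keys, as subsets of its own keys.
type ItineraryRelationshipKey = keyof Pick<Itinerary, 'assignedUserID' | 'waypoints'>;
type WaypointRelationshipKey = keyof Pick<Waypoint, 'actions'>;
type ActionRelationshipKey = keyof Pick<Action, 'orderID'>;

const ItineraryRelationshipKeys: Record<ItineraryRelationshipKey, DbRef> = {
  assignedUserID: DbRef.USERS,
  waypoints: DbRef.WAYPOINTS,
};
const WaypointRelationshipKeys: Record<WaypointRelationshipKey, DbRef> = {
  actions: DbRef.ACTIONS,
};
const ActionRelationshipKeys: Record<ActionRelationshipKey, DbRef> = {
  orderID: DbRef.ORDERS,
};

// One table mapping each DbRef to its relationship-key definitions.
const dbRefsWithRelationshipKeys = {
  [DbRef.ITINERARIES]: ItineraryRelationshipKeys,
  [DbRef.WAYPOINTS]: WaypointRelationshipKeys,
  [DbRef.ACTIONS]: ActionRelationshipKeys,
} as const;
```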
Once this explicit definition was set up for all of our data types, the potential for what this can enable was especially exciting!
My goal was to allow an FwR request to be made in a format like the following:
const itinerary = await fetchWithRelationships<Itinerary>(
DbRef.ITINERARIES,
'itinerary_123', {
assignedUserID: true,
waypoints: true,
}
);
And the intention would be that the relationship keys set to true would be collected in addition to the target entity being fetched.
But we need to take this a step further, no? I not only need to fetch the itinerary and its relationships, but the relationships of those entities, and so on.
A full-fledged recursive definition of the request is shown below:
const itinerary = await fetchWithRelationships<Itinerary>(
DbRef.ITINERARIES,
'itinerary_123', {
assignedUserID: true,
waypoints: {
actions: {
orderID: true,
}
}
}
);
The recursive definition of the request body is what would make this utility flexible and declarative.
Type safety is a crucial part of this utility, for both the developer to be restricted to explicitly defined relationship keys and also to provide IntelliSense recommendations for what the developer is actually able to fetch.
Through blood, sweat, and tears, I put together a working type validation system for the request body that ensured accuracy of key names (even recursively!).
// Recursive request type modified to enforce strict property checks.
export type NestedRequestWithRelationships<ParentType> =
  ParentType extends { ref: infer Ref }
    ? Ref extends RelationshipKeySupportedDbRefs
      ? {
          [SelfKey in keyof typeof dbRefsWithRelationshipKeys[Ref]]?: true
            | NestedRequestWithRelationships<
                typeof dbRefsWithRelationshipKeys[Ref][SelfKey]
              >;
        }
      : never
    : never; // Disallow properties not defined in dbRefsWithRelationshipKeys.

export type RequestWithRelationships<RootRef extends RelationshipKeySupportedDbRefs> = {
  [RootKey in keyof typeof dbRefsWithRelationshipKeys[RootRef]]?: true
    | NestedRequestWithRelationships<
        // This points to the expected structure of the nested object.
        typeof dbRefsWithRelationshipKeys[RootRef][RootKey]
      >;
};
Up until this point, I’d spent roughly 12 hours conceptualizing and devising the type definitions for the FwR utility, and this was the proof I needed to know to invest in building this out.
🛠️ Lock in
From here, I had a firm grasp of the conceptual strategy. At runtime, the utility would have access to all the information needed to perform the chain of data requests:
1. Take in a target database reference (DbRef) and target entity ID.
2. Recursively walk through the request body.
3. Perform the data requests for child nodes concurrently. If additional child nodes are requested, go back to step 1.
4. Collect all entities into a response object represented simply and consistently as a subset of the database.
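The core loop can be sketched as follows. This is a minimal, untyped-request version running against an in-memory "database" instead of RTDB; the names (`relMap`, `fetchNode`) and shapes are illustrative assumptions, not the production utility, but it demonstrates the concurrent recursion and the subset-of-the-database response shape:

```typescript
type Entity = Record<string, unknown>;
type Db = Record<string, Record<string, Entity>>;
// Request body: key -> true (just fetch it) or a nested request for its children.
type RelRequest = { [key: string]: true | RelRequest };

// Illustrative in-memory database.
const db: Db = {
  itineraries: { itinerary_123: { id: 'itinerary_123', assignedUserID: 'u_1', waypoints: ['wp_1'] } },
  waypoints: { wp_1: { id: 'wp_1', actions: ['a_1'] } },
  actions: { a_1: { id: 'a_1', orderID: 'o_1' } },
  orders: { o_1: { id: 'o_1', date: 0 } },
  users: { u_1: { id: 'u_1', name: 'Dana' } },
};

// Maps each ref's relationship keys to the ref their IDs point into.
const relMap: Record<string, Record<string, string>> = {
  itineraries: { assignedUserID: 'users', waypoints: 'waypoints' },
  waypoints: { actions: 'actions' },
  actions: { orderID: 'orders' },
};

const fetchNode = async (ref: string, id: string): Promise<Entity | undefined> => db[ref]?.[id];

// Steps 1-4: fetch the target, walk the request body, recurse concurrently,
// and collect everything into one subset-of-the-database response object.
async function fetchWithRelationships(
  ref: string,
  id: string,
  request: RelRequest = {},
  result: Db = {},
): Promise<Db> {
  const entity = await fetchNode(ref, id);
  if (!entity) return result;
  (result[ref] ??= {})[id] = entity;

  await Promise.all(
    Object.entries(request).map(async ([key, nested]) => {
      const childRef = relMap[ref]?.[key];
      if (!childRef) return;
      const raw = entity[key];
      const ids = Array.isArray(raw) ? (raw as string[]) : [raw as string];
      // Child fetches run concurrently; nested requests recurse (back to step 1).
      await Promise.all(
        ids.map((childID) =>
          fetchWithRelationships(childRef, childID, nested === true ? {} : nested, result),
        ),
      );
    }),
  );
  return result;
}
```

Because every level fans out through `Promise.all`, siblings at each depth are fetched in parallel rather than in series.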
This is only the process for a single fetch operation, but serves as the core of the structure. At this point, I had enough to test the request time and the functional integrity of the utility.
📈 Benchmarking
I performed some rough benchmarking of the FwR utility against the existing system of cloud functions.
| Test Itinerary Collection Size | FwR (Avg of 3 tests) | Cloud Function (Avg of 3 tests) |
| --- | --- | --- |
| 1MB (51 entities) | 1.390s | 16.313s |
| 4MB (264 entities) | 2.090s | 42.829s |
| 10MB (630 entities) | 6.200s | 67.532s |
These were exciting findings! The FwR utility completes data fetch operations significantly faster than the cloud function strategy. With the method proven, I could begin to implement the binding system.
🔄 Real-time data binding
When performing FwR with a binding instruction for real-time synchronization, many more considerations come into play:
- Some callback must be provided to handle the distribution of the updates to data to any active listeners.
- The request instruction body and callbacks would need to persist in JS memory for the lifecycle of the subscription. This was handled with a simple map stored at the global context of the utility code.
- In the event of a database response handled as a bind response, there is a possibility of new child nodes being added to the response object. The utility would need to execute the entire fetch chain again to ensure all nodes are collected and up to date.
- There is a high possibility that a single entity node (at any level of any relationship tree) is requested via multiple bind requests in the same context but sourced by different parent requests. For that reason, the utility limits each unique database path to a single listener while allowing for and storing multiple callback definitions! Instead of performing another fetch operation, later subscribers' callbacks are handed a cached value to use.
- As for the write callbacks: responses are generally applied to the store modules that house the data for the UI context. Writing each response one at a time can severely bottleneck the JS thread and cause the UI to become unresponsive. To mitigate this, a batch write operation is performed on a 50ms interval.
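The batching mitigation can be sketched like this. The `BatchWriter` name and shape are my own illustration of the approach, not the project's actual code: incoming updates are coalesced per path and applied to the store in one pass per 50ms tick, rather than one store write per database response:

```typescript
type StoreWrite = { path: string; value: unknown };

class BatchWriter {
  private queue = new Map<string, unknown>();
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(
    // Called once per tick with all pending writes, e.g. a store-module commit.
    private applyBatch: (writes: StoreWrite[]) => void,
    private intervalMs = 50,
  ) {}

  // Later writes to the same path overwrite earlier queued values,
  // so only the freshest value reaches the store.
  enqueue(path: string, value: unknown): void {
    this.queue.set(path, value);
    this.timer ??= setInterval(() => this.flush(), this.intervalMs);
  }

  flush(): void {
    if (this.queue.size === 0) {
      // Nothing pending: stop ticking until the next enqueue.
      if (this.timer) { clearInterval(this.timer); this.timer = null; }
      return;
    }
    const writes = [...this.queue.entries()].map(([path, value]) => ({ path, value }));
    this.queue.clear();
    this.applyBatch(writes); // one store commit per tick, not per response
  }
}
```

Bind callbacks would call `enqueue` for each response, and the UI only pays for one store commit per interval.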
🎯 The result
What came out of this was a powerful utility that declaratively fetches an entity relationship map of any complexity in a fraction of the time of any of our previous systems, dynamically allows for real-time synchronization, and provides a simple experience for developers to use to implement fast data retrieval for any context.
This portion of the project took roughly a week to develop, and my own experience integrating FwR into the project was seamless; it easily enabled the exact goal we had in mind: real-time editing of itinerary data.
I left one consideration out, though. The processing load had suddenly been off-loaded from the cloud to the client. As stated earlier, one of the only client-side performance considerations so far was the number of subsequent write operations potentially bottlenecking the JS thread. But CPU load aside, the effect on the client's memory consumption was a bit of a concern.
In my next blog post, I’ll be going over my investigation around memory consumption and how I optimized the integration with store libraries.
- Pumposh