Most FHIR Servers are Unusable in Production
FHIR Proof: Call to Action for FHIR Solutions to Prove their Usability
Here is the ammo you need to ask the important questions about your FHIR solution.
An in-depth analysis of common FHIR offerings and their performance against real-world scenarios, along with a call to action for better product transparency
Disclaimer: I am employed at Google Cloud as a Solution Architect.
**This story has been edited as of January 29th, 2022 to include the benchmarking Smile CDR published after my call to action. And while I disagree with their “numbers often don’t even matter” bit, since we are talking about patients here, not cars and phones, I am still delighted they published despite their hesitations. I wrote this article to make one thing clear: we cannot operate on reputation alone, as some of the largest, most trusted organizations are failing to meet expectations. So, in the name of transparency, I want to thank the entire Smile CDR team. What an incredible reminder that we can push ourselves forward as an industry, together. Next up, crucible battles?**
After the recent publication of Google Cloud’s Healthcare API Benchmarking White Paper, I did some digging into what other benchmarking exists for FHIR servers. It turns out this was a harder task than expected. My findings uncovered dismal scale and performance numbers, varying widely across vendors. I also found an overall lack of transparency and documentation for these products, which made me wonder: enterprises make decisions based on scale, performance, and cost, so where are those numbers for FHIR offerings? How do decision makers know what questions to ask, or what expectations to set for their solutions, if there is no documentation to guide them?
So here is my attempt to give enterprises a guide: what questions to ask, what expectations to set, and what your organization may need from a FHIR solution.
My Research and Call to Action: This blog uses public information, most of it from product documentation or anecdotal reports from users and threads about the various products. This means it is in no way a complete analysis. In fact, I ask anyone who can provide public documentation that materially contributes to this analysis to please email it to girlonfhir AT google.com, and I will happily update this blog.
FHIR will only become the interoperability standard if it can be used in real-world use cases. Every database row is a life, and every extra second could make a difference for someone. At the end of the day, this is why you should care about scale and performance. These aren't just industry buzzwords; these are real lives and patient outcomes.
So here are some real world scenarios for perspective…
Why Care about Scale?
In a previous blog I discussed the issues with having multiple FHIR servers.
There are multiple complications in trying to use multiple databases, FHIR servers, and/or federated servers. The heart of the issue is referential integrity, synchronization and consistency, and, most importantly, authentication, user access, and security. Thus, an organization can only (reasonably) use a single FHIR offering across its population.
If an organization sees ten million patients a year, it will create around five billion FHIR resources a year (1 patient = ~500 FHIR resources). With data retention of around three years, the organization will need storage that can support fifteen billion resources, or ~30 TB of data.
A FHIR server should be able to scale to terabytes of data and store billions of resources.
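For readers who want the math made explicit, here is a minimal sizing sketch. The ~2 KB average resource size is an assumption implied by the ~30 TB figure above, not a published number.

```python
# Back-of-the-envelope FHIR storage sizing for the scenario above.
patients_per_year = 10_000_000
resources_per_patient = 500   # ~500 FHIR resources per patient
retention_years = 3
bytes_per_resource = 2_000    # ~2 KB average (assumed for sizing only)

total_resources = patients_per_year * resources_per_patient * retention_years
total_tb = total_resources * bytes_per_resource / 1e12

print(f"{total_resources:,} resources")    # 15,000,000,000 resources
print(f"~{total_tb:.0f} TB of FHIR data")  # ~30 TB
```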
Why Care about Performance?
In the same blog I set up a real-life scenario of streaming healthcare data from electronic health records (EHRs) via HL7v2 feeds.
If a large system sends around 8 million HL7v2 messages a day, there will be ~500,000 messages during a peak hour. At around 15 FHIR resources created per message, the organization is ingesting ~2,000 FHIR resources per second streaming in from its EHR.*
*This is the architecture Google’s Healthcare Data Engine is solving for
A FHIR server should be able to ingest/create thousands of resources per second.
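Again, the arithmetic behind that requirement, as a quick sketch:

```python
# Peak streaming ingest rate for the HL7v2 scenario above.
peak_hour_messages = 500_000   # ~500k HL7v2 messages in a peak hour
resources_per_message = 15     # rough HL7v2-to-FHIR conversion ratio

peak_rate = peak_hour_messages * resources_per_message / 3_600
print(f"~{peak_rate:,.0f} FHIR resources/second")  # ~2,083
```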
The Options
Scale — storage size supported by the service
For scale, I used the largest documented implementation of each product. If that information was unavailable, I used the limits stated on product pages.
Performance — speed of the service
For performance, there are many, many API operations that could be tested for various performance needs. I focused on the creation or ingestion of FHIR data (by individual resources, FHIR bundles, or in bulk) and the throughput of those options.
Note: I believe search may be the most indicative measure of performance at scale; however, because vendors have not published documentation on this kind of benchmarking, I was unable to provide comparisons at this time.
Type of Server
There are two very different classes of products: unmanaged versus managed services. You can do some googling on the differences between the options; from a scale and performance point of view, though, an unmanaged service's capacity is directly related to the amount of infrastructure provisioned, while a managed service is heavily reliant on the service provider.
Managed Services
Google Cloud
Scale
As stated in the white paper, Google was tested using a single FHIR data store with 50 million patient records (26 billion resources), which equals about 60 TB of data.
Performance
Bulk Import — ingested 10,038 resources per second (95th percentile)
Execute Bundle — response time of 18 milliseconds per resource, or ~55 resources/second* (95th percentile), from a single client thread
*Can be parallelized to achieve a higher resources-per-second rate; a minimal sketch of this approach follows
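To illustrate, here is a minimal sketch (not Google's documented method) of fanning transaction bundles out across client threads. The base URL, bundle contents, and thread count are hypothetical placeholders; any FHIR server that accepts transaction Bundles POSTed to its base endpoint should behave similarly.

```python
# A sketch of parallelizing FHIR transaction bundles across client threads.
from concurrent.futures import ThreadPoolExecutor

import requests

FHIR_BASE = "https://fhir.example.com/fhir"  # hypothetical FHIR base URL

def execute_bundle(bundle: dict) -> int:
    """POST one transaction Bundle to the server's base endpoint."""
    resp = requests.post(
        FHIR_BASE,
        json=bundle,
        headers={"Content-Type": "application/fhir+json"},
    )
    resp.raise_for_status()
    return len(bundle.get("entry", []))

def execute_in_parallel(bundles: list[dict], threads: int = 16) -> int:
    """Fan bundles out across worker threads; return total resources submitted."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(execute_bundle, bundles))
```

If each thread sustains ~55 resources/second, 16 threads would in theory approach ~880 resources/second, though real-world throughput depends on server-side quotas and contention.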
Overall
Meets enterprise requirements for size and throughput.
Azure
Scale
The documentation states that the maximum available RUs for the FHIR API is 1 million, allowing for about 25 TB of data assuming 40 RUs per GB.
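A quick sketch of that storage ceiling, using the figures above:

```python
# Storage ceiling implied by Azure's documented RU limits.
max_request_units = 1_000_000   # documented maximum RUs for the FHIR API
ru_per_gb = 40                  # RU-to-storage ratio cited above

max_storage_tb = max_request_units / ru_per_gb / 1_000  # GB -> TB
print(f"~{max_storage_tb:.0f} TB")  # ~25 TB
```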
Performance
Bulk Import — not supported (the user has to use the FHIR Loader)
*FHIR Loader — 500 resources per second at the default 10,000 RU/s
- This could be improved by adding more RU/s to the FHIR server (at additional cost), and it is unclear what the performance improvement would look like
- Any parallel operations would bottleneck this
Execute Bundle — Users have documented around 120k resources per hour, or ~33 resources/second
*Note: The FHIR Loader parallelizes execute-bundle calls to "bulk ingest" data. This means that instead of a single long-running operation, it runs many operations all at once, creating additional cost and throttling the API. (For comparison, Google's bulk FHIR ingest is a free long-running operation and 20x faster at the default configuration.)
Overall
Azure's FHIR Healthcare API may meet enterprise requirements for size; however, it will likely be throttled by streaming or API performance needs. An interesting note on Azure's offering is that storage and API operations all depend on request units. This means that if the request units are being utilized for, say, an ingest operation, users must provision additional units (read: costly) to support their other use cases like read, search, and storage.
AWS
Scale
The documentation states that the maximum is 500,000 resources per database. Based on their documented 2 KB per resource, AWS can support 500,000 × 2 KB ≈ 1 GB, or 0.001 TB.
Performance
The product documentation is unclear, and there is too little documented enterprise usage for reliable threads or community discussions to exist.
That being said, I do think import performance is irrelevant when the data limits are this low.
Overall
It is unclear if AWS HealthLake could meet either enterprise scale or performance needs.
Unmanaged Services
This option is the traditional way of viewing infrastructure: it requires the customer to provision everything themselves. Most of the information on why to choose managed versus unmanaged is core to the cloud vendors' positioning and can be found in any number of marketing documents.
For FHIR specifically, there is a lot of work that needs to be done on the server side to horizontally scale the server and database; the most difficult task, however, can be creating, managing, and optimizing indexes. An unmanaged enterprise FHIR server likely requires an additional team of database administrators and infrastructure experts to ensure the solution can meet scale and uptime requirements.
Performance Implications
With unmanaged services, scale and performance are directly tied to the database and infrastructure being used. If a SQL server is used, it will likely not scale to the same size as a NoSQL or graph database. Similarly, performance will be correlated with how much infrastructure the server is provisioned on.
For some perspective, IBM recently wrote a blog on their performance testing (way to go, IBM!).
HAPI*
I do not discuss the scale of HAPI in this blog, as IBM outperformed HAPI in all three of the performance tests that were conducted.
- HAPI is the backbone of Smile CDR.
Smile CDR
Smile CDR has since published its performance testing, and I have included it as an update to this post (1/29/2022). Thank you, Smile CDR, for heeding my call, even if you roasted benchmarking in the process :)! I have moved the product down to the unmanaged section as well.
Scale
This is the second-largest scale test I have seen published to date (see above). Smile CDR was tested with 1 million patient records (1 billion resources), which equals about 2.2 TB of FHIR data. I suspect this scale was chosen for the time and simplicity of the test; a user could configure additional infrastructure, and the author makes it clear the system can be scaled to greater loads.
The maximum capacity of the infrastructure SMILE chose (Amazon’s Aurora PostgreSQL) is 128 TB of data.
** One thing I would love to see from Smile CDR is how indexes affect the storage numbers. The blog mentions they turn OFF some default search parameters for customers, as those parameters increase storage space. This is a major point in favor of managed services that can provide full functionality out of the box without reducing performance or increasing cost and storage.
Performance
Bulk Import — after some digging, it appears that Smile CDR does not support bulk import of raw FHIR data. It does support bulk import of CSVs to be transformed to FHIR, as documented here.
Because of this, and similar to Azure's method, Smile CDR suggests parallelizing the bulk load in a bundle format.
Execute Bundle — the import performed via execute bundle is impressive at 11,716 resources per second; however, this is across 100 upload threads. To compare it with the other numbers provided above, that averages to about 117 resources per second per thread. That is still the best execute-bundle figure calculated in this blog; per thread, however, it is still significantly slower than what one can achieve with bulk import functionality.
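For transparency, the per-thread normalization used above:

```python
# Normalizing Smile CDR's aggregate bundle throughput to a per-thread rate
# so it can be compared with the single-thread numbers above.
aggregate_rate = 11_716   # resources/second across all upload threads
upload_threads = 100

per_thread_rate = aggregate_rate / upload_threads
print(f"~{per_thread_rate:.0f} resources/second/thread")  # ~117
```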
Cost
It wouldn’t be a true enterprise comparison without a cost comparison!
Based on the tests run, Smile CDR used 4-6 EC2 instances throughout the test. That has a runtime cost of around $484/month at the maximum usage tested.
Amazon’s Runtime: $194.40 ($0.27/hour), Azure’s Runtime: $292.00 ($0.40/hour), Google’s: Free
The storage and database usage costs fall on Amazon's Aurora, and with 2 TB of data and 64 ACUs (their tested maximum) the storage cost is around $3,000/month. This ACU model is similar to Azure's RU/s model, where ACUs need to be scaled up with operations.
This is one drawback of not supporting bulk import; by contrast, both AWS's HealthLake and GCP's FHIR API support free import.
All of this comes with an asterisk: I do not know the cost of licensing for Smile CDR, and that should be considered as well when doing a true analysis.
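For anyone who wants to reproduce the runtime math, a quick sketch; note that the month length (~720-730 hours) is an assumption and explains small differences from the figures above.

```python
# Rough monthly runtime cost from the hourly rates cited above.
hourly_rates = {
    "AWS runtime": 0.27,
    "Azure runtime": 0.40,
    "Google Cloud runtime": 0.00,
}
HOURS_PER_MONTH = 720  # assumed month length

for service, rate in hourly_rates.items():
    print(f"{service}: ${rate * HOURS_PER_MONTH:,.2f}/month")
```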
Overall
One sticking point, compared to the other offerings above, is the need to set up infrastructure as opposed to the simple "click to deploy" models that the cloud vendors offer. The level of fine-tuning someone is willing to do impacts usability; for example, infrastructure was reconfigured and scaled up/down manually throughout the scale test. This can be both a positive and a negative, as I elaborated above, but it does make the offering more difficult to compare directly with offerings that do not need infrastructure configuration.
IBM
Scale
The scale of open-source, unmanaged servers is tied to the database they are implemented on; the scale of that database will be the limiting factor for size. The blog does not discuss the size of the implementation tests; however, the authors used PostgreSQL databases. A single PostgreSQL database can technically grow quite large, but in order to reach tens of terabytes of scale, users would likely need to shard the system across many servers.
Performance
Performance is tied to how much infrastructure is provisioned. The largest infrastructure tested in the blog was 100 threads on a Kubernetes cluster, where IBM performed ingest at 750 resources/second.
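As a rough comparison against the streaming requirement derived earlier (~2,000 resources/second at peak), a quick sketch:

```python
# Comparing IBM's tested ingest rate with the peak streaming requirement
# derived in the HL7v2 scenario earlier in this post.
tested_rate = 750       # resources/second at 100 threads
required_rate = 2_000   # approximate peak streaming requirement

print(f"~{required_rate / tested_rate:.1f}x more throughput needed")  # ~2.7x
```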
Overall
An enterprise could likely provision enough infrastructure to meet its size and performance needs; however, the system would be difficult to manage and would require constant tuning and upkeep.
Honorable Mentions
IBM's search functionality was impressive in this exercise. However, even though latency was low (quick to return), the authors did not test many operations running at once. A single server used as an operational store should be assumed to face many requests hitting the API simultaneously, and the solution needs to be tested for that.
Limits
This blog covers latency at an extremely high level; I cannot account for networking specifics such as on-premises versus cloud, or the many networking options enterprise clouds provide. These limitations should be considered when making a decision regarding FHIR solutions.
This blog also does not get into the specifics of search performance. Search is one of the most important functionalities for a FHIR server to support at scale, and users should be requiring search benchmarking from solutions. In particular, there are documented issues with things like pagination tokens in search (such as with HAPI: https://github.com/hapifhir/hapi-fhir/issues/536) that need to be considered when making a decision for your organization.
Conclusion
Takeaways:
- Almost all servers reviewed (except Google Cloud's) could likely not handle streaming EHR data into their servers
- Azure should consider supporting bulk FHIR ingest, as it currently relies on an open-source utility that bottlenecks this workflow (and costs more)
  - Google is 20x faster than Azure at the default configuration for ingest
- AWS would need to clarify its scale capabilities before performance testing can be considered
- Where is Smile CDR and its documentation? (Now answered: see the 1/29/2022 update above)
- Comparing open-source/unmanaged systems to managed systems in this context is extremely difficult, as they are apples to oranges
  - I applaud IBM for benchmarking their system and releasing the results
I hope this blog brings to light the need for clarity, documentation, and performance testing, and that it encourages organizations to be transparent about their limitations, and thus about the use cases their products are capable of supporting.
The storage and performance implications translate into differences in people's lives. I hope this surfaces the questions decision makers need to be asking of, and the expectations they need to be setting for, their FHIR solutions.
Author’s Notes
Let me be real for a second, and transparent about thoughts I cannot validate with data. FHIR is meant to be an interoperability standard. What this means, beyond just storing and surfacing data, is the connection layer. FHIR is building an ecosystem: it wants to be the ecosystem for connecting data to models, applications, and ultimately patients.
What this really means is that the performance requirements I can calculate today (thousands of API hits per second) will rise dramatically going forward. If FHIR becomes what I hope it will, we will need scale and performance far beyond what is outlined here.
Enterprises should be prepared for that future.