- Dec 14, 2022
Project Norsu: Replacing JCR in Magnolia with an (ele)fantastic new content store
The Magnolia DXP uses JCR as its underlying storage solution, but this design choice brings complexity that makes it harder to ensure performance and scalability in the cloud.
So, the Magnolia development team set out to build a better solution and kicked off a project that became known as Project Norsu. Aleksandr (Sasha) Pchelintcev, Engineering Manager at Magnolia, spoke about Project Norsu at the Magnolia DevDays in October 2022, and it was one of the most popular topics of the conference.
That’s why I asked him to have a follow-up conversation about Norsu. In this interview, he will explain why the project is important and answer some of the audience’s questions.
Sandra: Hi Sasha, let’s dive right in. What is Norsu?
Sasha: So, if you're asking about the word: it means ‘elephant’ in Finnish. When I was looking for a name we had already decided to use a Postgres database as the JCR content storage successor. So why not use something associated with elephants? As I live in Finland, I chose ‘norsu’.
Sandra: And what is Project Norsu?
Sasha: It is a thin Java API and SDK on top of a Postgres database schema. As you know, Magnolia has been using JCR and Jackrabbit as its content store. But there has been a certain demand to replace it with something different.
After doing our research we settled on building a solution using PostgreSQL. The primary goal of the project is to provide Magnolia developers with a simple API to store their content in a better way.
Sandra: You say there was a ‘certain demand’ to replace JCR. Why?
Sasha: JCR is a rather quirky technology. It's very mature, and in my personal opinion, it's quite brilliant. But there are many reasons to replace it.
It's actually pretty complicated to manage JCR-based solutions for various reasons such as the index being stored separately, the problems with scalability, and the in-memory nature of Jackrabbit. We had to deal with infrastructure concerns, management concerns, performance concerns, and even programming concerns.
Even though you can overcome all of these problems, in the cloud they are amplified. Instead of fighting our way through with JCR, we decided to build a simpler solution based on technologies that are better suited for the cloud. So, in a nutshell, that's why.
Sandra: Ok, so there were several reasons why you wanted to replace JCR. Why did you decide to use a Postgres database?
Sasha: We've done the research and considered several candidates for the successor of the content store. We chose Postgres because it ticks all the boxes.
Of course, there were other options for unstructured or flexible content management. We thought of a NoSQL solution. But if you dig a little deeper into what NoSQL databases do, you realize that they're niche solutions. You have to know exactly what kind of data you're working with to realize the full benefits of a NoSQL solution and we wanted to keep our content flexible.
NoSQL databases don't come with joins and typically don't come with ACID transactions, and indexing NoSQL data is very specific. So, a NoSQL database is actually quite a raw database that offloads a lot of tasks onto developers. Many things would need to be done at the application layer, which requires a certain domain knowledge.
We needed to make sure that the database could store, index, and manage documents easily without limiting us to a niche solution. Postgres does that for us. It is open source and very powerful. It offers relational capabilities such as constraints, integrity, and ACID transactions, and at the same time has a lot of NoSQL features such as support for complex structures like JSON. And it lets us combine them through features that are pretty unique, even in the relational database landscape, such as support for hierarchical data types.
It allowed us to build an API that does what JCR used to do for Magnolia, but based on a database that is widely used in the industry these days and booming in terms of adoption and community. It is also supported out of the box by all cloud providers. So, basically, we went with the trend while at the same time being pragmatic.
Sandra: Ok. So that’s why Norsu is built on Postgres. I wonder if you are using Postgres out of the box and then developed the API on top of it?
Sasha: Yes. We try to stay within the standard. We don't enforce a specific version or distribution. At the moment we use an AWS-managed solution because we use AWS in some other projects.
Norsu deploys a standard Postgres database with its schema specification. It includes some database-level API functions that let you operate that schema. The Java API sits on top of it.
We actually use some funky technology that lets us generate the bindings. We take the API on the database level and use this code-generation tool to produce the respective Java functions that you can use from within your code.
Then we add an extra API level to sugarcoat those functions and make it easier for developers to operate them. Currently, the API operates with JDBC, so you need a connection pool to use it. Maybe in the future, we'll consider other protocols to separate the SDK from the database.
Sandra: If you were to summarize the benefits of Norsu over JCR, what would be your top 5?
Sasha: Norsu is easier to operate. It is as easy to deploy Norsu as it is to deploy a Postgres database in a container. Everything is stored together, including the actual data and indexes. It’s also easy to back up and restore, manipulate, and inspect. So you can just fire up your admin tool and work with it. And then you just get the Java API on top of it.
Norsu is more scalable, especially when it comes to read performance. Postgres can scale out of the box with standby replicas and cloud providers alleviate concerns even more by offering managed solutions such as AWS Aurora. That gives you a cluster that can scale with demand. So it's pretty robust compared to Jackrabbit.
Norsu has custom APIs that we built with Magnolia’s requirements in mind. We didn’t just take a generic solution; we made the API a first-class citizen. We started with known use cases and the API can evolve with Magnolia. So, our SDK is slimmer and easier to work with and evolve than that of Jackrabbit.
Norsu has better indexing capabilities, which I believe is a big benefit. As you know, Jackrabbit comes with Lucene. And as I already mentioned, you have to store the Lucene index somewhere separate from the persistent storage. While Lucene is a great index, with Norsu, everything is stored in one place, and it's much, much easier to operate and more performant.
Norsu is more open. It exposes the content and lets you work with standard tools. This opens up very interesting doors to integrations. Let's say, for example, you take the content that is stored in Postgres and feed it to an indexing engine like Elastic. I think that with Jackrabbit that wouldn't be easily possible. An open solution allows you to be more integrated with the rest of the technology world.
Sandra: During your presentation at the Magnolia DevDays you spoke specifically about nodes, which is an important concept in Norsu. What are nodes and why did you develop them?
Sasha: Any person familiar with JCR knows about nodes as building blocks for content. Norsu adopts this concept, although slightly differently.
The main difference is that in JCR a node is a container of a list of properties. Thus, a JCR node represents flat content. If you want to build a complex piece of content, such as a web page, you need to nest a bunch of nodes, like areas and components, into a hierarchy.
From a JCR perspective, those are individual nodes; from a Magnolia perspective, they belong to the single entity of the page. This creates a mental and technical burden.
Norsu, in turn, treats a node as a single entity, no matter how complex it is under the hood. In Norsu, a page is represented by just one node, backed by a complex nested content structure. Components and areas lose their individuality.
Sandra: How are these nodes represented in Postgres?
Sasha: From a database standpoint, a node is a record with an ID, like a UUID in JCR, a version identifier, a JSON object, and a pointer to the previous version. This simple setup enables tracking of the node’s entire history.
The core intention is that Norsu acts like an append-only storage where you push your content so you can browse the history of changes and go back to a specific version.
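The append-only idea described above can be illustrated with a minimal Java sketch. It is not Norsu's actual API or schema: the names `NodeVersion`, `NodeStore`, `push`, `current`, and `history` are hypothetical, and an in-memory map stands in for the Postgres table. Each push adds a new version that points back at the one it replaces, so nothing is overwritten.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

class AppendOnlyNodes {

    // One immutable version of a node. previousVersion is null for the first version.
    record NodeVersion(UUID nodeId, long version, String contentJson, Long previousVersion) {}

    // Hypothetical in-memory stand-in for the database table: versions are only appended.
    static class NodeStore {
        private final Map<UUID, List<NodeVersion>> versionsByNode = new HashMap<>();

        // Appends a new version that points back at the version it replaces.
        NodeVersion push(UUID nodeId, String contentJson) {
            List<NodeVersion> history =
                    versionsByNode.computeIfAbsent(nodeId, id -> new ArrayList<>());
            Long previous = history.isEmpty() ? null : history.get(history.size() - 1).version();
            long next = previous == null ? 1L : previous + 1;
            NodeVersion created = new NodeVersion(nodeId, next, contentJson, previous);
            history.add(created);
            return created;
        }

        // Returns the latest version, if the node exists.
        Optional<NodeVersion> current(UUID nodeId) {
            List<NodeVersion> history = versionsByNode.getOrDefault(nodeId, List.of());
            return history.isEmpty()
                    ? Optional.empty()
                    : Optional.of(history.get(history.size() - 1));
        }

        // Returns every version ever pushed, oldest first.
        List<NodeVersion> history(UUID nodeId) {
            return List.copyOf(versionsByNode.getOrDefault(nodeId, List.of()));
        }
    }

    public static void main(String[] args) {
        NodeStore store = new NodeStore();
        UUID page = UUID.randomUUID();
        store.push(page, "{\"title\":\"Home\"}");
        store.push(page, "{\"title\":\"Home page\"}");
        // Walk the history to see every change made to the node.
        store.history(page).forEach(System.out::println);
    }
}
```

Going back to a specific version in this sketch means reading an older entry from `history` and pushing its content again as a new version, which keeps the history intact.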
Sandra: Cool. Let’s address a few questions from your presentation at DevDays. The first question refers to sessions in JCR vs. Norsu. How does Norsu handle transactions across sessions?
Sasha: The JCR repository uses the concept of a session. You can obtain a session – Magnolia follows the so-called session-per-request pattern – to do a bunch of changes to different nodes and then merge your session back. This is a very convenient pattern, but it comes with a complexity cost: sessions tend to get stale, and concurrency becomes a concern.
If you look at the typical flow of content editing in Magnolia, you only work with a few entities at the same time. When you're working with Content Apps, for example, you choose the entry that you want to modify, click edit, and change some properties and maybe some nested parts. It’s the same with pages etc. This is also how your brain works.
The fact that you do bulk modification stems mostly from the way JCR presents complex structures as we discussed before.
So, how about thinking of your page and its complex content as one?
There's only one ID at the top regardless of what happens under the hood. Maybe it's complex; maybe it's nested; maybe it’s branching; it's still the same content. And whenever you update this content, it's a transaction.
The typical flow is that you fetch all complex content, work with it as a JSON object in your Java code, and then you just push a new version of that content back to Norsu. There is your transaction.
So instead of updating a whole bunch of nodes in one transaction, you update the complex content of a single node in one transaction. And that is very easy to handle with a relational database, such as Postgres.
Since we keep version IDs, it's also possible to use optimistic locking. For example, if I want to update my content from version x to version y, the database can easily detect that version x is no longer the current version because version z is.
As this update would interfere with someone else's changes, Magnolia can fire up a conflict notification. For what it's worth, I think it's more robust than it used to be.
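The version check described above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not Norsu's implementation: `ContentStore`, `Versioned`, and `ConflictException` are hypothetical names, and a `ConcurrentHashMap` stands in for the database. An update names the version it was based on, and it is rejected if that version is no longer current.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class OptimisticLockingSketch {

    // The current state of a node: its version number and its JSON content.
    record Versioned(long version, String contentJson) {}

    // Raised when an update was based on a version that is no longer current.
    static class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    // Hypothetical in-memory stand-in for the content store.
    static class ContentStore {
        private final ConcurrentMap<String, Versioned> current = new ConcurrentHashMap<>();

        Versioned create(String nodeId, String contentJson) {
            Versioned first = new Versioned(1, contentJson);
            if (current.putIfAbsent(nodeId, first) != null) {
                throw new IllegalStateException("node already exists: " + nodeId);
            }
            return first;
        }

        // Writes a new version only if expectedVersion is still the current one.
        Versioned update(String nodeId, long expectedVersion, String newContentJson) {
            Versioned existing = current.get(nodeId);
            if (existing == null) {
                throw new IllegalArgumentException("unknown node: " + nodeId);
            }
            if (existing.version() != expectedVersion) {
                throw new ConflictException("expected version " + expectedVersion
                        + " but current version is " + existing.version());
            }
            Versioned next = new Versioned(expectedVersion + 1, newContentJson);
            // replace() is an atomic compare-and-set, so a concurrent writer cannot slip in.
            if (!current.replace(nodeId, existing, next)) {
                throw new ConflictException("node " + nodeId + " changed concurrently");
            }
            return next;
        }
    }

    public static void main(String[] args) {
        ContentStore store = new ContentStore();
        store.create("page-1", "{\"title\":\"Home\"}");
        store.update("page-1", 1, "{\"title\":\"Home page\"}");
        try {
            // A second editor still holds version 1 and tries to save.
            store.update("page-1", 1, "{\"title\":\"Start\"}");
        } catch (ConflictException e) {
            System.out.println("Conflict: " + e.getMessage());
        }
    }
}
```

In a relational database the same check is commonly expressed as a conditional write inside a transaction. The caller that receives the conflict can then reload the current version and decide how to proceed.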
Sandra: Okay. As a user of Magnolia myself, that’s really how I’d want it to be.
So, coming to another question from your session: will there be a migration path from JCR to Norsu?
Sasha: At the moment, Norsu is targeted at new SaaS deployments, and we don’t offer a migration path out of the box.
We do already have tooling for simple use cases that can reach a certain level of automation. We will have to revisit this once Norsu becomes available for our PaaS or the self-hosted deployment model.
Sandra: The next question from your session relates to indexing content, which you've already spoken about briefly. The question is: Can you index content for Elasticsearch?
Sasha: Yes, I had an offline conversation about this topic. The company is building a commerce solution with Magnolia, and they wanted to have an advanced search.
They want to take content from Norsu and index it in Elasticsearch so that their application can benefit from that index. Unlike with JCR, there is no conceptual blocker that would stop them from connecting to the Postgres database from an Elasticsearch cluster. Modern cloud providers such as Azure or AWS provide the means to connect relational databases with other services through serverless functions.
Sandra: What type of users would actually have to know about Magnolia’s storage solution?
Sasha: Primarily developers. They have to work with the new APIs with different capabilities. If they are already on Magnolia, their solutions would probably need to be re-engineered.
For authors who interact via Magnolia apps, such as the pages or Content Type apps, the underlying storage is transparent. The backend is different, but otherwise, the functionality and the user experience are the same.
Sandra: Frontend developers would still be using the Delivery API, right?
Sasha: Precisely. The process is (almost) transparent to developers that build headless solutions on top of Magnolia. Consuming complex content will be somewhat different, and definitions will change a little bit, but the output of Delivery API will be very similar.
We also want to provide backward compatibility for the Delivery API’s output so that it looks like the output from JCR and our users won’t have to change their frontend code. The team that maintains our frontend helpers, the libraries that bind Magnolia to different frontend frameworks, did not have to change them much to get them to work with Norsu.
But the hardcore backend developers that build extensions to Magnolia using JCR APIs will have to adapt.
Sandra: Another question from your session also relates to the Delivery API: how does GraphQL fit with Norsu?
Sasha: Typically, GraphQL does not talk directly to a database. Rather than making direct database queries, it sits on top of an application layer and aggregates the responses from various storages and APIs via so-called data fetchers. That's what makes it so powerful.
To provide this functionality, you need to implement a GraphQL server, for example, using a GraphQL Java framework. The Java code connects the data fetchers to your backend, and the Norsu API fits just as well as JCR in this scenario. The GraphQL Java server fetches your data via Norsu’s Java API, selects the slice of content that the GraphQL query asks for, and provides the output in the same format as with Jackrabbit.
It would be interesting to have this super performant implementation that translates a GraphQL query directly into a SQL query and executes it on a database level without the middleman, eliminating any performance overhead. Theoretically, it’s possible, but it might contradict the whole purpose of GraphQL in terms of type safety. Let’s see.
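The data-fetcher flow described in this answer, fetching a node's content and returning only the slice a query asks for, can be sketched in plain Java. The sketch uses no GraphQL library and no Norsu API: `select` is a hypothetical helper, the content is modelled as a nested `Map`, and the selection map is an assumed stand-in for a parsed GraphQL selection set, where a leaf field maps to `true` and a nested field maps to another selection.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DataFetcherSketch {

    // Keeps only the requested fields. A nested Map in the selection selects sub-fields.
    @SuppressWarnings("unchecked")
    static Map<String, Object> select(Map<String, Object> content, Map<String, Object> selection) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> requested : selection.entrySet()) {
            Object value = content.get(requested.getKey());
            Object subSelection = requested.getValue();
            if (value instanceof Map && subSelection instanceof Map) {
                // Recurse into nested content with the nested selection.
                value = select((Map<String, Object>) value, (Map<String, Object>) subSelection);
            }
            if (value != null) {
                result.put(requested.getKey(), value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for content fetched from the store as one nested structure.
        Map<String, Object> page = Map.of(
                "title", "Home",
                "internalNote", "draft",
                "hero", Map.of("headline", "Welcome", "image", "hero.jpg"));

        // Stand-in for the query: { title hero { headline } }
        Map<String, Object> selection = Map.of(
                "title", true,
                "hero", Map.of("headline", true));

        System.out.println(select(page, selection));
    }
}
```

A real GraphQL server would additionally validate the query against a typed schema before any data is fetched, which is the type-safety aspect mentioned above.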
Sandra: Interesting thought. Thanks.
Quick context switch to digital asset management (DAM): will Norsu store Magnolia-hosted assets, such as images?
Sasha: Our prior experience with Jackrabbit shows that it’s a good idea to separate binaries from content, and we believe that there are better ways to store assets. Norsu is meant to store structured web content and will probably also act as a store for asset metadata. So, it will host all information about assets while the actual binaries are hosted elsewhere, for example, in S3.
Sandra: Okay, great. Let’s address a final question: what are the next steps in Project Norsu?
Sasha: Norsu will be the primary content store for the Magnolia SaaS. It is also tightly coupled with how Magnolia will evolve in the future as we gain more insights.
It's important to understand that getting rid of JCR and improving Magnolia's performance and scalability is not our only concern. We are also thinking about our existing customers and how to make their transition as smooth as possible. This complicates the task significantly. So stay tuned.
Sandra: Thank you, Sasha.
Sasha: Thank you very much.
If you’re interested in learning more about Project Norsu, watch Sasha’s DevDays talk here: