Some time ago, while reviewing a publication on cloud computing by the Bankinter Innovation Foundation, I came across a great example of what can be achieved by decoupling the repository from the exploitation layer:
“The New York Times decided to make all articles published between 1851 and 1922 available to the public through scanned images of the original documents.
Initially, the images were dynamically composed into a PDF upon user request. However, as traffic to their website increased, dynamic PDF composition was no longer an adequate way to deliver the information.
For this reason, the newspaper leveraged cloud computing through Amazon services, storing 4 terabytes of images in Amazon S3 and processing these images with a program developed in-house on the Amazon EC2 platform.
By using 100 instances of the Amazon EC2 service, the newspaper generated the PDF files for all the articles in just 24 hours. These files were then stored in Amazon S3 and are now available to the public through their website.”
(Original commentary on the NYT blog about the TimesMachine service)
In our opinion, it’s essential to always find the right balance between the dynamic and static parts of a website.
Just as we allow videos or images to be downloaded from content delivery networks (CDNs) or specialized servers, we can store static content in our web server farm and “shift” part of the “dynamism” to pagination and navigation by pre-generating, for instance, the pages of a newsletter.
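As a rough illustration of this idea (a minimal sketch, not Ximdex's actual machinery; the article data, page size, and output directory are invented for the example), the following Python script pre-generates the paginated pages of a newsletter as flat HTML files at publish time, so that afterwards the web servers only have to serve static content:

```python
# Minimal sketch: pre-generate paginated newsletter pages at publish time.
# Article data, page size and output directory are hypothetical placeholders.
from pathlib import Path

PAGE_SIZE = 10
OUTPUT_DIR = Path("public/newsletter")

def render_page(articles, page_number, total_pages):
    """Render one static HTML page with simple previous/next navigation."""
    items = "\n".join(f"<li>{a['title']}</li>" for a in articles)
    prev_link = f'<a href="page-{page_number - 1}.html">Previous</a>' if page_number > 1 else ""
    next_link = f'<a href="page-{page_number + 1}.html">Next</a>' if page_number < total_pages else ""
    return f"<html><body><ul>{items}</ul><nav>{prev_link} {next_link}</nav></body></html>"

def publish(articles):
    """Write every page once; from then on the site serves only flat files."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    pages = [articles[i:i + PAGE_SIZE] for i in range(0, len(articles), PAGE_SIZE)]
    for number, chunk in enumerate(pages, start=1):
        html = render_page(chunk, number, len(pages))
        (OUTPUT_DIR / f"page-{number}.html").write_text(html, encoding="utf-8")

if __name__ == "__main__":
    sample = [{"title": f"Article {n}"} for n in range(1, 101)]  # stand-in content
    publish(sample)
```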
The questions we always ask ourselves when approaching a project are:
- How frequently is this content updated?
- What level of user interaction is required?
If the answer is that the content is updated once every few minutes and interaction is provided through other elements (tag clouds, related news, etc.), then that specific type of content is a perfect candidate for static publication, either by embedding it in a dynamic page or by pre-generating all possible navigation combinations (mind the combinatorial complexity of the problem, although we have always been glad we chose this path).
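To give a sense of the complexity mentioned above, here is a purely illustrative Python sketch (the sections, tags, volumes, and URL scheme are hypothetical) that enumerates the navigation combinations a generator would have to pre-render for a simple section/tag hierarchy:

```python
# Illustrative only: count the static pages implied by pre-generating
# every navigation combination (section x tag x page). The facet values
# and volumes are hypothetical; in a real portal they would come from the CMS.
from itertools import product
from math import ceil

sections = ["economy", "sports", "culture"]
tags = ["opinion", "interview", "analysis", "breaking"]
articles_per_combo = 120   # assumed average volume per section/tag pair
page_size = 10

combinations = list(product(sections, tags))
pages_per_combo = ceil(articles_per_combo / page_size)

# In a real generator each path would be rendered and written at publish time.
paths = [
    f"/{section}/{tag}/page-{page}.html"
    for section, tag in combinations
    for page in range(1, pages_per_combo + 1)
]

print(f"{len(combinations)} combinations -> {len(paths)} static pages to pre-generate")
```

Even with a handful of sections and tags, the number of pre-generated pages grows multiplicatively, which is exactly the complexity the paragraph above warns about.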
The advantages of this approach are several:
- Greater content security: Fewer potential points of failure, since no database requests are required.
- Increased portal scalability: For some clients, we have managed to reduce the number of necessary servers by 50%.
- Smaller and less specialized work teams: We have developed comparable portals with an order of magnitude less effort (measured in man-months).
- Better content reuse: Because the content is better “encapsulated,” it is much more natural to reuse than when it is served through dynamic elements. This advantage is further enhanced by several Ximdex features (symbolic links between documents, XML includes via ximlets, the XML for Active Publishing (XAP) documents defined by Ximdex, etc.)
The disadvantage is that the page’s computation is brought forward to the moment the content is published rather than the moment it is visited. However, the page is computed only once for millions of potential views, instead of being generated over and over again or retrieved from an intermediate cache. As the approach followed by the NYT to avoid overloading its servers with the same information shows, this is also the de facto strategy of nearly all mass-media portals with significant traffic volumes.
Note: The decoupled publishing method pioneered by Ximdex has become a trend since 2016 in the form of so-called headless CMSs.