Data Infrastructure for Culture Change

What data?

For the past year, Harmony Labs has been working with commercial partners on data sharing agreements across a broad spectrum of media platforms. As a 501(c)(3) organization, bound to work for the public good, we have adopted a set of principles and practices around our use of data: we gather only data that is useful to our mission; we anonymize any data that contains personally identifiable information (PII) at ingest; we maintain robust security with regard to laws and regulations, limited access, and encryption; and we actively work with our partners and in-house data science team to uphold the highest possible standards of scientific integrity, clearly communicating methods, assumptions, and practices.

With these principles and practices in mind, we’ve amassed over 50 terabytes of media content to support our Narrative Observatory work, including content scraped from websites and APIs; TV, web, and mobile consumption data from opt-in panels; song lyrics; and closed captions for nearly all U.S. broadcast television going back to 2014. Integrating so much heterogeneous data into a single accessible, reliable platform presents unique value and some thorny engineering challenges.

How do we make this data useful?

The first step when working with data at this scale is to determine how we’ll receive it from our partners. Some partners have established export processes, while others have never shared their data before and require a more collaborative process. Transfer mechanisms include S3 uploads, direct PostgreSQL connections, live updates streamed from RabbitMQ, and SFTP transfers.

With the data in our hands, we can then get to work sifting through sample data and documentation. This is a joint effort between our engineering and data science teams. As is typical with proprietary data, our partners’ teams have optimized their data storage architecture to fit their own use cases. Partners without externally facing APIs have rarely designed their storage to be easily accessible to outside organizations, and rightly so.

The biggest challenge we face when integrating new data is maximizing its usefulness to data scientists and our partners while maintaining our data ethics principles and practices. We like to think that no one has quite as valuable a network of partners as we do, and we’re confident no one is as dedicated to conscientiously selecting these data, coordinating them in a way that increases their collective value, and making them accessible to our internal teams and our research partners in a safe and secure manner.

What do we do with the data?

There is a saying in the world of software engineering: “death by 1,000 integrations.” As a software architecture adds integrations with external dependencies or data sources, the process for implementing those integrations needs to be as streamlined and uniform as possible to ensure correctness, robustness, and timely availability. Typical software teams have external integrations for features like user authentication, text messaging, or software monitoring. In most cases those are constrained, separate integrations that barely interact. Our software ecosystem is the opposite: we generate very little data internally except as derivative analysis products.

Knowing that external integrations will always be our primary use case, we’ve tried to optimize workflows and technologies around the following categories: security, correctness, and shared infrastructure systems.

A maxim we follow strictly is the principle of least privilege. Harmony Labs’ employees, partners, and software are given access only to the specific resources and permissions necessary to complete their pre-specified goals. Even our applications and supporting software can interact only with specific services and data, which prevents unintentional cross-contamination and data spillage. This commitment to security and access controls, along with annual independent security audits (the most recent of which was successfully completed in December 2020), increases the confidence of the partners who graciously share data with us.
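As a rough illustration of least privilege, the sketch below denies every request by default and allows one only when an explicit grant matches. The `Grant` and `Principal` types, the `is_allowed` helper, and the resource names are all hypothetical and invented for this example; they do not describe Harmony Labs’ actual access-control system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Grant:
    """One explicitly granted permission (hypothetical model)."""
    action: str           # e.g. "read" or "write"
    resource_prefix: str  # e.g. "captions/"


@dataclass(frozen=True)
class Principal:
    """A person, partner, or application with an explicit set of grants."""
    name: str
    grants: Tuple[Grant, ...] = ()


def is_allowed(principal: Principal, action: str, resource: str) -> bool:
    """Deny by default; allow only when some grant matches the request."""
    return any(
        g.action == action and resource.startswith(g.resource_prefix)
        for g in principal.grants
    )


# Hypothetical ingest application that may only write caption data.
ingest_app = Principal("caption-ingest", (Grant("write", "captions/"),))

print(is_allowed(ingest_app, "write", "captions/2020/01.json"))  # True
print(is_allowed(ingest_app, "read", "captions/2020/01.json"))   # False
print(is_allowed(ingest_app, "write", "panel/tv.csv"))           # False
```

The important design choice is the default: a principal with no grants can do nothing, so access has to be added deliberately rather than removed after the fact.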

Data is useless if it’s not correct and, because we are making statements about media content and consumption, the correctness of our data is absolutely critical for our partners. We don’t stop at simply running our extract, transform, load (ETL) processes and hoping for the best. We’re also building periodic audit processes to account for partner data that is new or may have changed, and for issues that may have arisen during the initial ETL.
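One way such a periodic audit could work is to fingerprint each record on both sides and compare the results. The sketch below is only an assumption about the general shape of an audit: the `audit` and `record_fingerprint` functions, the `id` key, and the sample records are hypothetical, and a real audit over terabytes would run against the databases rather than in-memory lists.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Return a stable hash of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit(source: list, loaded: list, key: str = "id") -> dict:
    """Compare partner records with loaded records, matched on `key`."""
    src = {r[key]: record_fingerprint(r) for r in source}
    dst = {r[key]: record_fingerprint(r) for r in loaded}
    return {
        # Present at the partner but absent from our store (new or dropped).
        "missing": sorted(src.keys() - dst.keys()),
        # Present in our store but no longer at the partner.
        "unexpected": sorted(dst.keys() - src.keys()),
        # Present on both sides with different content.
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }


# Hypothetical sample data.
source = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
loaded = [{"id": 1, "text": "a"}, {"id": 2, "text": "B"}, {"id": 4, "text": "d"}]

report = audit(source, loaded)
print(report)  # {'missing': [3], 'unexpected': [4], 'changed': [2]}
```

Each bucket in the report maps to a situation the audit is meant to catch: new partner data, data that has changed since ingest, and records that went wrong during the initial load.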

Shared infrastructure systems, or the standardization of compute, storage, and analysis infrastructure, allow us to continue active development while maintaining existing applications. All our production software runs on Kubernetes and uses managed services from AWS wherever possible. We limit the software we maintain to our necessary applications and to open source software that enables better abstractions, monitoring, or maintenance. We plan and complete deliberate software and service upgrades regularly to prevent urgent interruptions to long-running projects. And we use application templates and a single, shared method for our CI/CD processes. Together, these practices make data access and linking more repeatable and predictable for both software engineers and the users of the data.

What’s next?

To get here, our engineering team has overcome significant challenges with source data: recombining terabytes of text snippets that are delivered to us line by line and out of order, handling multiple schema and format changes over a partner’s historical backlog, and responding to a partner’s request to completely rewrite our ingest app to pull data from Elasticsearch instead of PostgreSQL!
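To give a sense of the first of those challenges, here is a minimal sketch of recombining out-of-order lines into whole documents. It assumes each snippet arrives as a `(doc_id, line_no, text)` tuple; that layout, the `recombine` function, and the sample data are hypothetical. It also works entirely in memory, so at terabyte scale the same grouping and sorting would have to happen in a database or a distributed job.

```python
from collections import defaultdict


def recombine(snippets):
    """Group (doc_id, line_no, text) tuples by document and join in line order."""
    by_doc = defaultdict(list)
    for doc_id, line_no, text in snippets:
        by_doc[doc_id].append((line_no, text))
    return {
        # Sort on line number only, then stitch the text back together.
        doc_id: " ".join(text for _, text in sorted(lines, key=lambda p: p[0]))
        for doc_id, lines in by_doc.items()
    }


# Hypothetical caption lines arriving out of order and interleaved.
snippets = [
    ("ep1", 2, "and welcome back"),
    ("ep2", 1, "breaking news"),
    ("ep1", 1, "good evening"),
    ("ep1", 3, "to the show"),
]

documents = recombine(snippets)
print(documents["ep1"])  # good evening and welcome back to the show
print(documents["ep2"])  # breaking news
```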

We’re confident that the systems and processes we’ve designed will continue to meet new sets of engineering challenges while ensuring that our infrastructure is increasingly secure, efficient, and, most importantly, accessible to those most ready to use these tools to effect real culture change. In the months ahead we’re working to expand our corpus of data to include more media types and audience dimensions, while continuing to innovate technically. Follow this space for updates on our progress or get in touch if you have any feedback.

by Harmony Labs

Email us at hello@harmonylabs.org

©2025 HARMONYLABS