With the Narrative Observatory project we’re harnessing powerful industry relationships and an academic research network to develop data infrastructure purpose-built to identify and track narratives and story opportunities, and to learn about audiences across platforms. We use the data that flow through culture — TV, films, news, social media, and music — to tell us where our audiences are; what kinds of stories they consume when they’re not thinking about social issues; what narratives they do encounter about issues like racial justice, climate change, or gender equity; and when a specific story represents a narrative opportunity to reach and resonate with them.
Under the hood is some fancy data dancing and advanced software engineering that makes everything go. This includes proprietary and hard-to-get data sets; a data infrastructure that prioritizes privacy, security, and flexibility; industry-grade processing power; and a growing library of analytic tools and predictive models. The system we’ve designed is intended to maximize the impact of our small (but mighty!) engineering team so that we can focus on our mission and continue to build and link our best-in-class data holdings.
For the past year, Harmony Labs has been working with commercial partners on data sharing agreements across a broad spectrum of media platforms. As a 501(c)3 organization, bound to work for the public good, we have adopted a set of principles and practices around our use of data: we gather only data that is useful to our mission; we anonymize any data that contains personally identifiable information (PII) at ingest; we maintain robust security through compliance with applicable laws and regulations, limited access, and encryption; and we work actively with our partners and in-house data science team to uphold the highest possible standards of scientific integrity, clearly communicating our methods, assumptions, and practices.
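To make the second of those principles concrete, here is a minimal sketch of what anonymizing PII at ingest can look like. The field names and the keyed-hash approach are illustrative assumptions, not Harmony Labs’ actual implementation; a keyed hash is one common way to let the same person be linked across records without storing anything reversible to the raw value.

```python
import hmac
import hashlib

# Fields treated as PII are illustrative; a real pipeline would take
# this list from each partner's data-sharing agreement.
PII_FIELDS = {"email", "device_id", "ip_address"}

def anonymize(record: dict, secret_key: bytes) -> dict:
    """Replace PII values with keyed hashes at ingest.

    HMAC-SHA256 with a private key is deterministic, so the same
    person links across records, but the raw value is never stored.
    """
    out = {}
    for field, value in record.items():
        if field in PII_FIELDS and value is not None:
            digest = hmac.new(secret_key, str(value).encode(), hashlib.sha256)
            out[field] = digest.hexdigest()
        else:
            out[field] = value
    return out

record = {"email": "viewer@example.com", "show": "evening_news"}
clean = anonymize(record, secret_key=b"rotate-me-regularly")
assert clean["show"] == "evening_news"       # non-PII passes through
assert clean["email"] != record["email"]     # PII never stored raw
```

In practice the key would live in a secrets manager and be rotated; the point of the sketch is only that anonymization happens before anything is written to storage.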
With these principles and practices in mind, we’ve amassed over 50 terabytes of media content to support our Narrative Observatory work, including content scraped from websites and APIs; TV, web, and mobile consumption data from opt-in panels; song lyrics; and closed captions for nearly all U.S. broadcast television going back to 2014. Integrating so much heterogeneous data into a single accessible, reliable platform presents unique value and some thorny engineering challenges.
The first step when working with data at this scale is to determine how we’ll receive it from our partners. Some partners have established export processes; others have never shared their data before and require a more collaborative approach. Transfer mechanisms include S3 uploads, direct PostgreSQL connections, live updates streamed from RabbitMQ, and SFTP transfers.
With the data in our hands, we can then get to work sifting through sample data and documentation, a joint effort between our engineering and data science teams. As is typical with proprietary data, our partners have optimized their storage architectures for their own internal use cases. For partners that don’t have externally-facing APIs, it’s rare, and understandably so, that much consideration has gone into making their data easily accessible to external partners.
The biggest challenge we face when integrating new data is how to maximize its usefulness to data scientists and our partners while maintaining our data ethics principles and practices. We like to think that no one has a network of partners quite as valuable as ours, and we’re confident no one is as dedicated to conscientiously selecting these data, coordinating them in ways that increase their collective value, and making them accessible to our internal teams and our research partners safely and securely.
There is a saying in the world of software engineering: “death by 1,000 integrations.” As a software architecture adds integrations with external dependencies or data sources, the process for implementing those integrations needs to be as streamlined and uniform as possible to ensure correctness, robustness, and timely availability. Typical software teams have external integrations for features like user authentication, text messaging, or software monitoring; in most cases those are constrained, separate integrations that barely interact. Our software ecosystem is the opposite case: we generate very little data internally except as derivative analysis products.
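One common way to keep many integrations streamlined and similar is to put every external source behind the same small interface, so adding a new partner means writing a subclass rather than designing a new pipeline. The sketch below illustrates the pattern; the class and method names are our own invention, not Harmony Labs’ actual code.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List

class Integration(ABC):
    """Common shape every external data integration follows.

    Each new source implements fetch() (and optionally transform()),
    while orchestration, monitoring, and error handling live once in
    the shared run() path. All names here are illustrative.
    """

    @abstractmethod
    def fetch(self) -> Iterator[dict]:
        """Pull raw records from the partner's transfer mechanism."""

    def transform(self, record: dict) -> dict:
        """Normalize a record; subclasses override as needed."""
        return record

    def run(self) -> List[dict]:
        return [self.transform(r) for r in self.fetch()]

class CaptionFeed(Integration):
    """Toy example: a feed of closed-caption snippets."""

    def fetch(self) -> Iterator[dict]:
        # Stands in for a real S3/Postgres/RabbitMQ/SFTP client.
        yield {"text": "GOOD EVENING", "line": 1}
        yield {"text": "TONIGHT'S TOP STORY", "line": 2}

    def transform(self, record: dict) -> dict:
        return {**record, "text": record["text"].lower()}

docs = CaptionFeed().run()
assert docs[0]["text"] == "good evening"
```

The payoff of this shape is that cross-cutting concerns — retries, logging, audits — are written once against `Integration` rather than per source.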
Knowing that external integrations will always be our primary use case, we’ve tried to optimize workflows and technologies around the following categories: security, correctness, and shared infrastructure systems.
A maxim we follow strictly is the principle of least privilege. All of Harmony Labs’ employees, partners, and software are only given access to the specific set of resources and permissions necessary to complete their pre-specified goals. Even our applications and supporting software are only able to interact with specific services and data, preventing unintentional cross-contamination and spillage vectors. This commitment to security and access controls, along with annual independent security audits — the most recent of which was successfully completed in December 2020 — increases the confidence of our partners who graciously share data with us.
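In code terms, least privilege means every person and service starts from zero access and is granted only explicitly named permissions, with everything else denied. Here is a minimal, deny-by-default sketch; the principal and resource names are invented for illustration and this is not Harmony Labs’ actual access-control system.

```python
class AccessPolicy:
    """Deny-by-default permission check: the core of least privilege.

    Every principal (person or service) starts with no access and is
    granted only the (resource, action) pairs it needs to complete its
    pre-specified goals. All names below are illustrative.
    """

    def __init__(self) -> None:
        self._grants: set = set()

    def grant(self, principal: str, resource: str, action: str) -> None:
        self._grants.add((principal, resource, action))

    def allows(self, principal: str, resource: str, action: str) -> bool:
        # No wildcards, no implicit inheritance: anything not
        # explicitly granted is denied.
        return (principal, resource, action) in self._grants

policy = AccessPolicy()
policy.grant("etl-captions", "captions-bucket", "read")

assert policy.allows("etl-captions", "captions-bucket", "read")
# The same service cannot touch another partner's data, which is what
# prevents the cross-contamination and spillage vectors noted above.
assert not policy.allows("etl-captions", "panel-data", "read")
```

In production this role is typically played by cloud IAM policies and Kubernetes RBAC rather than application code, but the deny-by-default logic is the same.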
Data is useless if it isn’t correct, and because we are making statements about media content and consumption, the correctness of our data is absolutely critical for our partners. But we don’t stop at simply running our extract, transform, load (ETL) processes and hoping for the best. We’re also building periodic audit processes to account for partner data that is new or may have changed, and for issues that may have arisen during the initial ETL.
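One simple form such an audit can take is comparing record counts and order-independent checksums between what a partner says they sent and what actually landed after ETL. The sketch below is an assumption about how this could work, not Harmony Labs’ actual audit code.

```python
import hashlib

def partition_checksum(records: list) -> str:
    """Order-independent checksum of a data partition.

    Hash each record, then XOR the digests together so that delivery
    order doesn't matter: the same set of records always produces the
    same checksum.
    """
    acc = bytes(32)
    for record in records:
        canonical = repr(sorted(record.items())).encode()
        digest = hashlib.sha256(canonical).digest()
        acc = bytes(a ^ b for a, b in zip(acc, digest))
    return acc.hex()

def audit(source: list, ingested: list) -> list:
    """Return a list of discrepancies between source and warehouse."""
    problems = []
    if len(source) != len(ingested):
        problems.append(f"row count: {len(source)} vs {len(ingested)}")
    if partition_checksum(source) != partition_checksum(ingested):
        problems.append("checksum mismatch")
    return problems

rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
assert audit(rows, list(reversed(rows))) == []   # order doesn't matter
assert audit(rows, rows[:1]) != []               # dropped row is caught
```

Run periodically per partition (per partner, per day), a check like this catches both silently changed partner data and records lost during the initial ETL.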
Shared infrastructure systems, or the standardization of compute, storage, and analysis infrastructure, allow us to continue active development while maintaining existing applications. All our production software runs on Kubernetes and uses managed services from AWS wherever possible. We limit the software we maintain to our necessary applications plus open source tools that enable better abstraction, monitoring, and maintenance. We plan and complete deliberate software and service upgrades regularly to prevent urgent interruptions to long-running projects, and we use application templates and a single, shared method for our CI/CD processes. Together, these practices make data access and linking more repeatable and predictable for both software engineers and the users of the data.
In the past year this data infrastructure has powered the initial prototype for a Narrative Observatory and enabled public-facing data analysis to help organizations better understand their audiences, inform storytelling about poverty and economic mobility in the U.S., and identify story opportunities for advocates of healthy AI. The system we’ve designed is also flexible enough to accommodate our growing research network.
To get here, our engineering team has overcome significant challenges with source data: recombining terabytes of text snippets delivered to us line by line and out of order, handling multiple schema and format changes across a partner’s historical backlog, and even completely rewriting our ingest app, at a partner’s request, to pull data from Elasticsearch instead of PostgreSQL!
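At its core, the first of those challenges — reassembling out-of-order, line-level snippets — amounts to grouping by document and sorting by line number. The field names below are illustrative assumptions, and at terabyte scale this would be a distributed sort rather than an in-memory dictionary, but the logic is the same.

```python
from collections import defaultdict

def reassemble(snippets: list) -> dict:
    """Rebuild full documents from line-level snippets that may arrive
    in any order.

    Each snippet is assumed to carry the id of the document it belongs
    to and its line number within that document (field names are
    illustrative). At real scale this grouping would be a distributed
    sort, not an in-memory dict.
    """
    by_doc = defaultdict(list)
    for s in snippets:
        by_doc[s["doc_id"]].append((s["line_no"], s["text"]))

    # Sorting the (line_no, text) pairs restores original line order.
    return {
        doc_id: "\n".join(text for _, text in sorted(lines))
        for doc_id, lines in by_doc.items()
    }

snippets = [
    {"doc_id": "cap-1", "line_no": 2, "text": "top story tonight"},
    {"doc_id": "cap-1", "line_no": 1, "text": "good evening"},
]
assert reassemble(snippets)["cap-1"] == "good evening\ntop story tonight"
```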
We’re confident that the systems and processes we’ve designed will continue to meet new engineering challenges while ensuring that our infrastructure is increasingly secure, efficient, and, most importantly, accessible to those most ready to use these tools to effect real culture change. In the months ahead we’re working to expand our corpus of data to include more media types and audience dimensions, while continuing to innovate technically. Watch this space for updates on our progress, or get in touch if you have any feedback.