1. Data Harvesters Overview
The Data Harvesters ingest data from external sources into the system in near real time. Sources include sports data feeds, social media updates, game state information, and more. Each harvester fetches data, transforms it, and dispatches it to other system components.
Each feed is treated as an isolated, independent entity that defines its own transformation logic and the type of data it ingests. This keeps the architecture flexible and scalable, because each feed can evolve independently as its source and requirements change.
The harvesters are designed for scalability. Each one is implemented as a class that can either run as a background task in a container app or be invoked by a trigger in a function app, so each harvester can be scaled independently according to data volume, feed requirements, and infrastructure needs.
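As a rough illustration of the "harvester as a class" idea, the sketch below shows one possible shape: a base class with a fetch/transform/dispatch cycle that can be called once (trigger style) or in a loop (background task style). All class, method, and field names here are hypothetical and are not taken from the actual codebase; the feed call is replaced by a hard-coded record.

```python
import abc
import time
from typing import Any, Callable, Iterable


class Harvester(abc.ABC):
    """Base class for one feed: fetch, transform, dispatch (names are hypothetical)."""

    def __init__(self, dispatch: Callable[[dict], None]) -> None:
        # dispatch is whatever sends a command onward, e.g. a queue sender.
        self._dispatch = dispatch

    @abc.abstractmethod
    def fetch(self) -> Iterable[Any]:
        """Pull the next batch of raw records from the external source."""

    @abc.abstractmethod
    def transform(self, raw: Any) -> dict:
        """Map one raw record to the internal command format."""

    def run_once(self) -> int:
        """Process one batch; suitable for a trigger-style invocation."""
        count = 0
        for raw in self.fetch():
            self._dispatch(self.transform(raw))
            count += 1
        return count

    def run_forever(self, interval_s: float, should_stop: Callable[[], bool]) -> None:
        """Poll in a loop; suitable for a long-running background task."""
        while not should_stop():
            self.run_once()
            time.sleep(interval_s)


class ScoreFeedHarvester(Harvester):
    """Example feed that owns its own transformation logic."""

    def fetch(self) -> Iterable[Any]:
        # Stand-in for a real call to the data provider.
        return [{"game": "G1", "home": 2, "away": 1}]

    def transform(self, raw: Any) -> dict:
        return {"type": "UpdateGameScore", "gameId": raw["game"],
                "score": [raw["home"], raw["away"]]}


sent: list[dict] = []
harvester = ScoreFeedHarvester(dispatch=sent.append)
processed = harvester.run_once()
```

Because the feed-specific logic lives only in `fetch` and `transform`, the same class could be hosted in either runtime without changes to the feed itself.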
2. Data Flow
The data flow within the harvesters follows the event sourcing pattern:
- Ingestion: The Data Harvesters connect to external data sources and continuously pull data from them in near real time.
- Command Generation: Once the data is retrieved, the harvester maps this external data into a command format. These commands represent the relevant actions or events that need to be captured.
- Service Bus Queue: The generated commands are placed onto an Azure Service Bus Queue. This decouples the harvesting process from the rest of the system and ensures that commands can be processed independently by other system components.
- Storage: Raw data, in the form of blobs or content, is stored in Azure Data Lake Storage (ADLS) Gen2 for further processing, analysis, or archiving.
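The four steps above can be sketched end to end with in-memory stand-ins. This is a simplified illustration, not the production implementation: the `deque` stands in for the Azure Service Bus queue, the dictionary stands in for ADLS Gen2, and the payload shape, command fields, and path format are assumptions made for the example.

```python
import json
from collections import deque

command_queue = deque()  # stand-in for the Service Bus queue
raw_store = {}           # stand-in for ADLS Gen2, keyed by blob path


def ingest() -> list:
    """Step 1: pretend to pull raw payloads from an external source."""
    return [b'{"game": "G1", "home": 2, "away": 1}']


def to_command(payload: bytes) -> dict:
    """Step 2: map the external shape to an internal command."""
    data = json.loads(payload)
    return {"type": "UpdateGameScore", "gameId": data["game"],
            "home": data["home"], "away": data["away"]}


def harvest(feed: str) -> None:
    for index, payload in enumerate(ingest()):
        command_queue.append(json.dumps(to_command(payload)))  # step 3: enqueue
        raw_store[f"{feed}/{index}.json"] = payload            # step 4: keep raw


harvest("scores")
```

Keeping the untouched payload alongside the derived command is what makes later reprocessing possible if the mapping changes.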
3. Data Harvester Components
Each type of harvester has a dedicated component in the system. These include:
3.1 Data Harvesters (Container Apps)
- These are responsible for connecting to external feeds (e.g., sports statistics providers, game state data sources).
- Each harvester is designed to work with a single feed and independently handle data ingestion.
- The harvester is implemented as a class that runs as a background task within a container app, so it can be scaled out as needed.
- After ingesting data, the harvester generates commands and places them on the Service Bus queue for further processing.
3.2 Generic Feed Ingester (Function App)
- A function app provides a generic ingester that maps external feed data to our internal command structure.
- It uses predefined mappings to transform incoming data into commands that the system can process.
- This is useful for handling feeds that don't require complex transformations but still need to be normalized into a consistent format.
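One way to picture a "predefined mapping" is as plain data that names, for each internal command field, where to find the value in the external payload. The sketch below assumes a dotted-path convention and a mapping layout that are purely illustrative; the real mapping format is not described in this document.

```python
from typing import Any

# Hypothetical mapping: internal field name -> dotted path in the external payload.
SCORE_MAPPING = {
    "command": "UpdateGameScore",
    "fields": {
        "gameId": "match.id",
        "homeScore": "match.score.home",
        "awayScore": "match.score.away",
    },
}


def lookup(payload: dict, dotted_path: str) -> Any:
    """Walk a nested dict along a dotted path such as 'match.score.home'."""
    value: Any = payload
    for key in dotted_path.split("."):
        value = value[key]  # raises KeyError if the feed omits a field
    return value


def apply_mapping(mapping: dict, payload: dict) -> dict:
    """Build a command from a payload using only the declarative mapping."""
    command = {"type": mapping["command"]}
    for target, source in mapping["fields"].items():
        command[target] = lookup(payload, source)
    return command


external = {"match": {"id": "G1", "score": {"home": 2, "away": 1}}}
command = apply_mapping(SCORE_MAPPING, external)
```

With this kind of design, onboarding a simple feed would mean adding a mapping rather than writing a new harvester class.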
3.3 Manual Game State Ingester (Function App)
- Similar to the generic feed ingester but focused on manual data entries, particularly for game states.
- This ingester provides an API for manually entering or updating game-related information (e.g., game status, player performance).
- Each update produces a command, which is processed downstream.
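A manual entry point would typically validate the submitted body before turning it into a command. The handler below is a minimal sketch of that idea; the field names, the set of allowed statuses, and the `SetGameStatus` command type are assumptions for illustration only.

```python
ALLOWED_STATUSES = {"scheduled", "in_progress", "finished"}  # assumed values


def handle_manual_update(body: dict) -> dict:
    """Validate a manually entered game state and return the command to enqueue."""
    game_id = body.get("gameId")
    status = body.get("status")
    if not game_id:
        raise ValueError("gameId is required")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    # Tag the origin so downstream consumers can tell manual edits from feed data.
    return {"type": "SetGameStatus", "gameId": game_id,
            "status": status, "source": "manual"}


command = handle_manual_update({"gameId": "G1", "status": "finished"})
```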
3.4 Data Feed Monitor (Container Apps)
- This component is responsible for monitoring and validating the data flow from various feeds.
- It ensures that data is consistently ingested without interruption, alerting the team if any issues arise.
- The monitor can scale independently to handle increasing feed volumes.
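One simple check a feed monitor could run is a staleness test: flag any feed whose most recent message is older than an allowed period of silence. The sketch below assumes the monitor already tracks a last-seen timestamp per feed; the function name, feed names, and threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_feeds(last_seen: dict[str, datetime], max_silence: timedelta,
                now: datetime) -> list[str]:
    """Return the feeds whose latest message is older than max_silence."""
    return sorted(feed for feed, seen in last_seen.items()
                  if now - seen > max_silence)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "scores": now - timedelta(seconds=20),
    "social": now - timedelta(minutes=10),
}
alerts = stale_feeds(last_seen, max_silence=timedelta(minutes=2), now=now)
```

In practice the threshold would likely differ per feed, since a live score feed and a slow internal feed have very different expected message rates.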
3.5 Social Harvesters (Container Apps)
- Social media platforms provide dynamic, fast-moving data, such as game reactions, player updates, and fan sentiments.
- The social harvesters are responsible for ingesting content from platforms like Twitter, Instagram, or Facebook, transforming that content into commands, and storing it as raw content in ADLS for later processing.
- These components are implemented as container apps that can scale based on traffic and the frequency of social media updates.
3.6 Internal Feed Harvester (Container Apps)
- This component is designed to scale on demand and handle internal data sources within the organization.
- It could handle internal event streams, operational logs, or any proprietary data that needs to be ingested into the system.
4. Event Sourcing and Command Generation
- Each feed is the "master" of its own data, meaning that the harvester for that feed is responsible for mapping the raw data into an internal command format.
- The mapping should take into account any transformations necessary to convert the external data into a system-friendly format.
- Commands represent business actions, like "update game score," "new player performance data," or "new fan interaction." These commands are serialized and placed into the service bus queue.
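To make "serialized and placed into the queue" more concrete, the sketch below shows one possible command envelope and a JSON round trip. The detailed command structure is covered elsewhere in the documentation, so every field here (`command_id`, `created_at`, and so on) is an assumption for illustration.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Command:
    """Hypothetical envelope around one business action."""
    type: str
    feed: str
    payload: dict
    command_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


def serialize(command: Command) -> bytes:
    """Encode the command as UTF-8 JSON, the form a queue message body could take."""
    return json.dumps(asdict(command), sort_keys=True).encode("utf-8")


def deserialize(body: bytes) -> Command:
    """Rebuild the command on the consumer side."""
    return Command(**json.loads(body))


original = Command(type="UpdateGameScore", feed="scores",
                   payload={"gameId": "G1", "home": 2, "away": 1})
round_tripped = deserialize(serialize(original))
```

Carrying an identifier and a timestamp in the envelope is a common choice because it lets consumers deduplicate and order commands, though whether this system does so is not stated here.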
5. Data Storage
- All raw data, both structured and unstructured, is stored in Azure Data Lake Storage (ADLS) Gen2.
- This serves as the long-term storage for all feed data, allowing for easy access, analysis, and potential reprocessing.
- The data can be stored in two primary formats:
  - Raw Content Blobs: For social media or other unstructured content.
  - Raw Data Blobs: For structured data, such as game scores or player statistics.
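Reprocessing is easier when blobs are laid out predictably. The sketch below shows one hypothetical path convention that separates the two formats and partitions by feed and UTC date; the actual container and folder layout in ADLS is not specified in this document.

```python
from datetime import datetime, timezone


def blob_path(kind: str, feed: str, received_at: datetime, name: str) -> str:
    """Build a date-partitioned path; kind is 'raw-content' or 'raw-data'."""
    if kind not in ("raw-content", "raw-data"):
        raise ValueError(f"unknown kind: {kind}")
    day = received_at.astimezone(timezone.utc).strftime("%Y/%m/%d")
    return f"{kind}/{feed}/{day}/{name}"


received = datetime(2024, 3, 9, 18, 30, tzinfo=timezone.utc)
score_path = blob_path("raw-data", "scores", received, "G1.json")
post_path = blob_path("raw-content", "social", received, "post-42.json")
# score_path -> "raw-data/scores/2024/03/09/G1.json"
```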
6. Scalability and Reliability
- Each harvester scales independently, whether it runs as a background task in a container app or as a function app invocation.
- A harvester can be scaled up or down based on the volume of its data feed, which keeps resource use efficient and ingestion close to real time as the system grows.
- The Service Bus queue acts as a buffer: it absorbs bursts from the harvesters so that downstream components can process commands at a steady rate even under high load.
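The buffering effect of the queue can be illustrated with a small simulation: producers deliver commands in bursts, while the consumer handles at most a fixed number per tick. This is a toy model with an in-memory `deque`, not a model of Azure Service Bus itself; the arrival pattern and capacity are made-up numbers.

```python
from collections import deque


def simulate(arrivals: list[int], capacity: int) -> tuple[list[int], list[int]]:
    """Return (commands processed per tick, backlog after each tick)."""
    queue = deque()
    processed, backlog = [], []
    for tick, count in enumerate(arrivals):
        queue.extend((tick, i) for i in range(count))  # producer burst
        done = 0
        while queue and done < capacity:                # consumer, bounded rate
            queue.popleft()
            done += 1
        processed.append(done)
        backlog.append(len(queue))
    return processed, backlog


processed, backlog = simulate([10, 0, 0, 5, 0, 0], capacity=3)
# processed -> [3, 3, 3, 3, 3, 0]
# backlog   -> [7, 4, 1, 3, 0, 0]
```

The consumer sees a steady three commands per tick while the backlog rises and drains, which is the trade-off of buffering: throughput is smoothed at the cost of some added latency during bursts.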
7. Next Steps
In the next sections of the documentation, we will cover:
- The detailed structure of each command generated by the harvesters.
- How the service bus queue interacts with downstream processing components.
- Security measures for managing access to external data sources and ADLS storage.
- Monitoring and alerting mechanisms for detecting issues in data ingestion.