Integrating Data Streaming with AI and GenAI

Experts explore businesses' challenges and considerations when integrating data streaming with AI strategies.

Reading Time: 6 min  



At MIT SMR Connections we explore the latest trends in leadership, managing technology, and digital transformation.

    [Source photo: Krishna Prasad/Fast Company Middle East]

    Data is everywhere.

    From financial transactions, social media interactions, and supply chain logs to simple things like the number of steps you take or the series you binge-watch, an immense amount of valuable data is generated every second.

    How one uses this deluge of data – Big Data – determines the success or failure of a decision, an invention, or even a business strategy. 

    For instance, to offer personalized recommendations, Netflix considers your viewing habits, ratings, and preferences. To understand and process the behavior and patterns of its nearly 270 million global subscribers, the OTT platform leverages a big data architecture. It uses tools like Apache Kafka for real-time data streaming, Apache Flink for stream processing, and Amazon Web Services (AWS) for cloud infrastructure. For batch processing and data analysis, it employs Apache Hadoop and Apache Spark.

    By harnessing the power of big data in real-time, Netflix can make data-driven decisions regarding content acquisition, original content production, and targeted marketing campaigns. 
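    The streaming pattern described above can be sketched without any real infrastructure. The toy example below simulates, in pure Python, the roles that Kafka and a stream processor play: a producer appends viewing events to an in-memory "topic," and a consumer drains it and aggregates per-title view counts of the kind a recommendation pipeline might consume. All names and event fields are invented for illustration; this is not Netflix's actual implementation.

```python
from collections import Counter, deque

def produce(events, topic):
    """Stand-in for a Kafka producer: append events to a 'topic' (a deque)."""
    for event in events:
        topic.append(event)

def consume_and_count(topic):
    """Stand-in for a stream processor: drain the topic and aggregate
    per-title view counts, as a recommendation feature feed might."""
    counts = Counter()
    while topic:
        counts[topic.popleft()["title"]] += 1
    return counts

topic = deque()  # in-memory substitute for a Kafka topic
produce(
    [
        {"user": 1, "title": "Dark"},
        {"user": 2, "title": "Dark"},
        {"user": 1, "title": "Ozark"},
    ],
    topic,
)
counts = consume_and_count(topic)
print(counts)  # Counter({'Dark': 2, 'Ozark': 1})
```

    In a real deployment the deque would be a durable, partitioned Kafka topic and the consumer a Flink job, but the produce/consume contract is the same.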

    Today, data streaming is emerging as the fundamental infrastructure for the modern AI stack. However, no technology comes without its own set of challenges. MIT SMR Middle East spoke with three industry experts from Qatar to understand the roadblocks to integrating data streaming, how to maintain the accuracy and relevance of AI models, and the steps businesses can take to mitigate potential biases within streaming data and ensure fair AI outcomes.

    Jassim Moideen, Lead Data Scientist, TOD, beIN Media Group, says, “Businesses integrating AI frameworks based on dynamic, real-time data face a labyrinth of technical complexities. This necessitates the creation of robust infrastructures capable of managing vast data streams. Adopting cutting-edge models such as decoder-style transformers and Large Language Models (LLMs) emphasizes the critical need for a strategic approach to efficiently manage computational expenses in terms of both time and space.”

    Muhammad Zuhair Qadir, Data Scientist and AI/ML Product Manager at Kahramaa, highlights “infrastructure readiness, data governance and adaptability” as three key roadblocks to integrating data streaming with AI strategies.

    Dr. Ahmed Riadh B M Rebai, Senior Information Technology Program Manager, Public Works Authority ‘Ashghal’, recommends a four-step strategy to overcome these challenges:

    1. First and foremost, the right mindset toward data access and data sharing, to facilitate data understanding, categorization, data responsibility and permissions, and then flow ingestion. Data release is always constrained by its value to the organization (how to monetize it) and by organization- and country-level privacy rules (avoiding violations of users’ privacy). Without a clear data-sharing strategy at the country level and within the organization, we won’t be able to unlock the data’s potential and enable and promote the local AI ecosystem.
    2. Second, the skills and resources required to embark on such a journey. Developing a strong academic and experiential learning curriculum for AI and data engineering is mandatory for building local capacity. Strategies to attract top AI talent from around the world (such as fast-track residence permits, an AI golden visa, and a tech-intensive investment ecosystem) will also facilitate innovation, alongside policies that encourage integrating locally developed resources and augmenting in-house AI use cases with international expertise and solutions.
    3. Infrastructure readiness and a cloud strategy (data-free or hybrid cloud models) must be in place to further democratize the development of new data-driven and AI-augmented solutions. Without a clear understanding of data sources and a transparent cloud strategy, we won’t be able to shift data streams methodically and build the required analytics and business intelligence. A “fit-to-cloud” exercise needs to take place to avoid blind lift-and-shift activities, taking into consideration multiple data attributes, e.g., size, quality, latency, and availability.
    4. Creating centers of excellence within the company, re-engineering business processes, and ideating new concepts and solutions based on a clear digital transformation roadmap, while involving the business at every step; then building MVPs around these ideas, testing them, eliminating some, and growing others into full-fledged products through an agile, incremental approach. These are the essential ingredients of a successful implementation of AI-powered, data-driven solutions, and of ensuring their acceptance, adoption, and use at the enterprise level.
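    The “fit-to-cloud” exercise in step 3 can be imagined as a scoring pass over data-set attributes. The sketch below is purely illustrative: the attribute names, weights, and threshold are invented for the example and are not part of any stated methodology.

```python
def fit_to_cloud(dataset):
    """Toy scoring rule: large, latency-tolerant, low-sensitivity data sets
    score as cloud candidates; weights and threshold are arbitrary."""
    score = 0
    score += 1 if dataset["size_gb"] > 100 else 0         # big data benefits from elastic storage
    score += 1 if dataset["latency_ms_ok"] >= 100 else 0  # tolerant of network round-trips
    score += 1 if dataset["sensitivity"] == "low" else 0  # fewer residency constraints
    return "cloud" if score >= 2 else "on-premises"

datasets = [
    {"name": "clickstream", "size_gb": 500, "latency_ms_ok": 200, "sensitivity": "low"},
    {"name": "citizen-records", "size_gb": 80, "latency_ms_ok": 20, "sensitivity": "high"},
]
plan = {d["name"]: fit_to_cloud(d) for d in datasets}
print(plan)  # {'clickstream': 'cloud', 'citizen-records': 'on-premises'}
```

    A real assessment would weigh many more attributes (quality, availability, regulatory constraints) and involve business stakeholders, but the point is the same: classify each data stream deliberately rather than lifting and shifting blindly.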


    In a streaming environment, data drift is another major concern. A change in data distribution over time can degrade model performance and cause misinterpretations, rendering models less accurate or even irrelevant.

    To ensure the ongoing accuracy of AI models, experts say, businesses must embrace adaptive continuous monitoring and learning techniques. “Techniques such as dynamic model updating with new data inputs become crucial in mitigating the impact of data drift,” adds Moideen.
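    The “dynamic model updating” Moideen describes can be illustrated with a minimal online learner. In the sketch below (pure Python, with all parameters invented for the example), a single-feature linear model is updated one sample at a time via stochastic gradient descent, so when the underlying relationship drifts from y = 2x to y = 3x, the model tracks the change instead of staying frozen at its original fit.

```python
class OnlineLinearModel:
    """Single-feature linear model updated one sample at a time via
    stochastic gradient descent, so it can track drift in the data."""

    def __init__(self, lr=0.1):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        # One SGD step on squared error for a single streamed sample.
        err = self.predict(x) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err

model = OnlineLinearModel()
xs = (0.5, 1.0, 1.5, 2.0)
for _ in range(300):            # phase 1: data follows y = 2x
    for x in xs:
        model.update(x, 2 * x)
for _ in range(300):            # phase 2: the relationship drifts to y = 3x
    for x in xs:
        model.update(x, 3 * x)
print(f"w after drift: {model.w:.2f}")  # w tracks toward 3.0
```

    A batch-trained model fitted once on phase-1 data would keep predicting y = 2x after the drift; the incremental update is what keeps the model current.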

    Below, Qadir elaborates on practices that have proven effective in dealing with data drift: 


    1. Continuous Monitoring: Implementing real-time monitoring systems that can detect shifts in data distribution as soon as they occur is vital. These systems should be capable of alerting data scientists or relevant stakeholders when potential data drift is detected.
    2. Automated Retraining Pipelines: Once data drift is identified, having an automated retraining pipeline for your models can be immensely beneficial. This pipeline should be triggered based on specific drift detection metrics and thresholds, ensuring that models are updated with new data in a timely manner.
    3. Regular Model Evaluation: Even in the absence of explicit drift detection, periodically evaluating models against recent data can uncover subtle or gradual drifts. This evaluation should compare key performance metrics against historical baselines.
    4. Data Versioning: Keeping track of different versions of datasets used for training and validation helps in understanding how data changes over time. Data versioning is also crucial for reproducibility and for auditing the model’s performance.
    5. Windowing Techniques: For streaming data, using windowing techniques (such as sliding or rolling windows) to continuously update the data set for model training and validation can help in adapting to recent data trends and patterns.
    6. Adaptive Learning Models: Using adaptive learning models that are inherently designed to evolve with incoming data can be more resilient to data drift. These models adjust their parameters in real-time or near-real-time to adapt to new patterns in the data.
    7. Diverse Data Sources: Incorporating diverse data sources can mitigate the risk of drift in any single source. It’s crucial to continuously monitor and maintain each data source’s quality and relevance.
    8. Human-in-the-Loop (HITL) Approach: Involving domain experts in the loop for periodic review of model predictions and inputs can provide valuable insights. These experts can identify nuanced shifts that automated systems might miss.
    9. Feature Engineering: Regularly updating and revising the feature set used for model training can help in capturing new patterns and mitigating the impact of irrelevant or outdated features.
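    Several of these practices (continuous monitoring, retraining triggers, and windowing) can be combined in one small sketch. The pure-Python example below is a deliberate simplification: it keeps a reference sample and a sliding window of recent values, compares their means, and flags when retraining should be triggered. A production monitor would use a proper statistical test (e.g., Kolmogorov–Smirnov or the Population Stability Index); the threshold here is arbitrary.

```python
from collections import deque
from statistics import mean, stdev

class DriftMonitor:
    """Flags drift when the recent window's mean moves more than
    `threshold` reference standard deviations from the reference mean."""

    def __init__(self, reference, window_size=50, threshold=2.0):
        self.ref_mean = mean(reference)
        self.ref_std = stdev(reference)
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value):
        """Add one streamed value; return True if retraining should trigger."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent data yet
        shift = abs(mean(self.window) - self.ref_mean) / self.ref_std
        return shift > self.threshold

# Reference data centred near 0; the live stream later drifts upward.
reference = [i % 7 - 3 for i in range(100)]   # values in [-3, 3]
monitor = DriftMonitor(reference)
stream = [i % 7 - 3 for i in range(60)] + [i % 7 + 2 for i in range(60)]
first_alert = next((i for i, v in enumerate(stream) if monitor.observe(v)), None)
print(first_alert is not None)  # True: the shift is caught after the drift begins
```

    In practice, the alert would kick off the automated retraining pipeline (practice 2), and the window size would be tuned to balance sensitivity against false alarms.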



    Challenges: technical complexities, infrastructure readiness, data governance, adaptability
    Considerations: mindset for data access and sharing, skills and resources, infrastructure readiness and cloud strategy
    Fairness and bias: ethics and national public policy, transparency and accountability, algorithm explainability
    Data streaming pipeline considerations: data integrity, flexibility, scalability, security, latency, integration challenges, data governance model


    With data at this volume, it becomes even more difficult for businesses to mitigate potential biases within streaming data and ensure fair AI outcomes. Doing so requires a multifaceted approach, including leveraging diverse training datasets, implementing bias detection algorithms, and conducting regular audits by teams of diverse backgrounds, highlights Moideen. Qadir stresses algorithm explainability and simplicity, and Dr Rebai calls for a clear national public policy on AI use.

    “AI will inherit any biases consecrated in data, and mechanisms are required to guarantee outputs consistent with societal norms. Transparency and accountability require AI models to be explainable,” Dr Rebai adds.

    Talking about Qatar, he says the country needs to draft and pass an AI law to regulate the field. Such a law would cover many related aspects: ‘high-risk’ AI uses that would be banned (those dealing with people’s fundamental rights, such as in healthcare, education, and policing); greater awareness when interacting with an AI system; more transparency from AI companies, including better data governance, human oversight, and assessments of how these systems affect people’s rights; and perhaps even a right for citizens to complain if they have been harmed by an AI.

    Experts believe flexibility, scalability, and security are paramount considerations for building robust data streaming pipelines. Enterprises must also focus on data integrity, latency, and integration challenges.

    “Develop your enterprise data governance model first,” says Dr Rebai. Without data steward identification, data matching tools, data lineage automation, golden data classification, data quality mechanisms, master data management, and data privacy in place first, we will never be able to govern the AI models, monitor the data used for training and modification, and, in consequence, judge how ‘accurate’ and ‘responsible’ the generated AI solutions will be.

    He explains that organizations will then be able to categorize their data streams and monitor and govern the effectiveness and purity of the generated machine-learning models through the connectors (the data preparation layer is crucial and directly impacts the resulting solutions).

    The experts were part of the MIT SMR Connections thought leaders briefing “Making Sense of the AI & Gen AI Trend with Data Streaming,” powered by Confluent.



