This is the second part of a three-part AI-Core Insights series. Click here for Part 1, “The Foundation Model: Open Source or Not?”
In Part 1 of this three-part blog series, we explored a practical approach to foundation models (FMs), both open source and closed source, and how to determine, from a deployment perspective, which underlying model is most effective at solving the intended use case.
Let’s simplify the seemingly endless infrastructure required to bring a product to life on top of a compute-intensive underlying model. There are two well-discussed problem statements:
- Fine-tuning costs: fine-tuning requires lots of data and GPUs with enough vRAM and memory to host large models. This is what builds a moat around differentiated fine-tuning or prompt engineering.
- Inference costs: a small cost per call, but compounded by the number of inference calls. This cost is incurred regardless of whether you fine-tune.
Simply put, returns and investments must go hand in hand. However, initially this may require a huge sunk cost. So what do you focus on?
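To make the trade-off concrete, here is a minimal back-of-envelope sketch in Python. Every number in it (GPU rates, call volumes, pricing) is a hypothetical placeholder, not a real quote; the point is the shape of the math, not the figures.

```python
# Hypothetical back-of-envelope model: one-off fine-tuning (sunk) cost
# vs. recurring per-call inference cost. All numbers are placeholders.

GPU_HOURLY_RATE = 3.50       # assumed $/hour for a large training GPU
FINETUNE_GPUS = 8            # assumed GPU count for the fine-tuning job
FINETUNE_HOURS = 72          # assumed wall-clock training time

INFERENCE_COST_PER_CALL = 0.002  # assumed fully loaded $/inference call
REVENUE_PER_CALL = 0.005         # assumed revenue per call

sunk_cost = GPU_HOURLY_RATE * FINETUNE_GPUS * FINETUNE_HOURS

def monthly_margin(calls_per_month: int) -> float:
    """Gross margin from inference traffic, before the sunk cost."""
    return calls_per_month * (REVENUE_PER_CALL - INFERENCE_COST_PER_CALL)

# How many paid calls until the fine-tuning investment is recovered?
breakeven_calls = sunk_cost / (REVENUE_PER_CALL - INFERENCE_COST_PER_CALL)

print(f"Sunk fine-tuning cost: ${sunk_cost:,.0f}")
print(f"Break-even at ~{breakeven_calls:,.0f} inference calls")
print(f"Margin at 1M calls/month: ${monthly_margin(1_000_000):,.0f}")
```

The compounding inference term is what dominates once traffic grows, which is why the rest of this post looks at where both the sunk cost and the per-call cost can be reduced.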
FM Startup Infrastructure Dilemma
If you have a fine-tuning pipeline, it will look something like this:
- Data preprocessing and labeling: You have a large pool of datasets and do some preprocessing (cleaning, resizing, background removal, etc.), which needs only a small GPU. You might then label the data with a smaller model, again on a small GPU.
- Fine-tuning: Once you start fine-tuning your model, you’ll need a massive GPU; the A100 is the famous (and expensive) choice. You load the large model and fine-tune it on your specialized data, hoping for no hardware failures along the way. If one does occur, hopefully your checkpoints are recent (saving them takes time). If the job fails and a checkpoint exists, you recover as much of the fine-tuning as you can, but depending on how stale your checkpoints are, you will still lose hours of work. A minimal checkpointing sketch appears after this list.
- Serving and inference: After this, you serve the model for inference. The model is still huge, so you host it in the cloud and incur an inference cost per query. If you want a super-optimal configuration, you will debate between an A10 and an A100. Spinning the GPU fully up and down causes cold-start issues; keeping GPUs running racks up huge GPU costs (i.e., investment) without users paying for them (i.e., return).
Note: Without fine-tuning, the pipeline has no preprocessing element, but you still have to think about serving infrastructure.
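To make the checkpointing concern in the fine-tuning step concrete, here is a minimal PyTorch-style sketch (this is generic checkpointing, not the Nebula service discussed later). The model, data, and paths are hypothetical stand-ins; the pattern is: save periodically, and on restart resume from the latest checkpoint so a hardware failure only costs the steps since the last save.

```python
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoints/latest.pt"  # hypothetical checkpoint location

model = nn.Linear(512, 512)          # stand-in for a large model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

start_step = 0
if os.path.exists(CKPT_PATH):
    # Resume after a failure: reload model/optimizer state so we only
    # lose the steps since the last checkpoint, not the whole run.
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 10_000):
    batch = torch.randn(32, 512)       # stand-in for real training data
    loss = model(batch).pow(2).mean()  # stand-in loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 500 == 0:
        os.makedirs("checkpoints", exist_ok=True)
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            CKPT_PATH,
        )
```

The cadence (500 steps here) is exactly the trade-off described above: checkpoint too rarely and a failure costs hours; checkpoint too often and the I/O overhead itself eats training time.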
The biggest decisions in the sunk-cost debate come down to: what counts as infrastructure, and what do you build versus borrow?
- A) Borrow: treat infrastructure as someone else’s problem, focus on your core product, and get infrastructure from a provider; or B) Build: build the components in-house, investing time and money upfront to find and solve the problems yourself?
- A) Consolidate in one location and save the costs associated with ingress/egress across regions and zones; or B) distribute across different sources to diversify points of failure, spreading across zones or regions at the risk of latency issues?
The trend we see in growing startups is to focus on core product differentiation and commoditize the rest. Infrastructure can be a complex overhead that keeps you away from monetizable problem statements, or it can be a big power plant, with bits and pieces that can be easily scaled with a single click as you grow.
Beyond Compute: The Role of Platforms and Accelerating Inference
There’s a saying I’ve heard in the startup community: “You can’t throw a GPU at every problem.” Optimization is, generally speaking, a problem that cannot be fully solved in hardware. Beyond the important role of platform and runtime software, other factors such as model compression, quantization, inference acceleration, and checkpointing come into play.
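As one concrete example of the quantization lever, here is a minimal sketch using ONNX Runtime’s dynamic quantization utility, assuming you already have a model exported to ONNX (an export sketch follows the next paragraph). The file names are hypothetical.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Convert FP32 weights to INT8 on disk. Dynamic quantization shrinks the
# model file and can speed up CPU inference; the usual price is a small
# accuracy drop that should be validated on your own eval set.
quantize_dynamic(
    model_input="model_fp32.onnx",   # hypothetical exported model
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```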
Given the big picture, the role of optimization and acceleration is quickly becoming central. Runtime accelerators like ONNX Runtime enable up to 1.4x faster inference, and rapid checkpointing like Nebula helps training jobs recover from hardware failures, saving the most important resource: time. On top of this, simple techniques such as autoscaling and workload triggers let you spin the fleet of GPUs sitting idle between bursts of inference requests down to the lowest possible scale.
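As a sketch of that runtime-acceleration path, here is how a PyTorch model might be exported to ONNX and served with ONNX Runtime. The tiny model is a hypothetical stand-in for a fine-tuned FM, and the speedup you actually see is workload-dependent.

```python
import torch
import torch.nn as nn
import onnxruntime as ort

# Hypothetical stand-in for a fine-tuned model; a real FM is exported
# the same way, just with far more parameters.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
model.eval()

dummy_input = torch.randn(1, 512)
torch.onnx.export(
    model,
    dummy_input,
    "model_fp32.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},  # allow variable batch size
)

# Serve with ONNX Runtime; on a GPU box you would request
# "CUDAExecutionProvider" instead of the CPU provider.
session = ort.InferenceSession(
    "model_fp32.onnx", providers=["CPUExecutionProvider"]
)
outputs = session.run(None, {"input": dummy_input.numpy()})
print(outputs[0].shape)  # (1, 512)
```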
At the roundtables we host for startups, sometimes the simplest questions are the ones that burn the most cash: how do you balance serving your customers in the short term, on the most efficient hardware at the right scale, with managing growth in the long term, scaling up and down as demand changes?
Summary
When productizing underlying models with training and inference at scale, we must consider the role of platform and inference acceleration alongside the role of infrastructure. ONNX Runtime and Nebula are just two of those considerations; there are many others. Ultimately, startups face the challenge of efficiently serving customers in the short term while managing growth and scale in the long term.
Sign up today for the Microsoft for Startups Founders Hub for more tips on bringing AI to your startup and getting started building industry-leading AI infrastructure.