What is an AI architecture?
AI architecture refers to the set of technical choices that shape an artificial intelligence project: the model used, the access mode (API or local deployment), the adaptation strategy (prompting, RAG, fine-tuning), data management and the orchestration of components. A well-designed architecture determines the performance, cost and scalability of the project.
For Swiss SMEs, these choices are critical. A poor technical start can cost months of delay and tens of thousands of francs. This guide presents proven architectural patterns, selection criteria and mistakes to avoid.
The three fundamental architectural patterns
Pattern 1: API Wrapper
The simplest pattern is to call an LLM via API (GPT-4o, Claude, Gemini) and build your business logic around it. No infrastructure to manage, no GPUs to provision. Cost is proportional to usage (pay-per-token).
When to use it: rapid prototyping, moderate volumes (under 10,000 requests/day), generic use cases (summary, classification, content generation).
Typical architecture: your application calls the provider's API, sends a prompt enriched with business context, receives the response and processes it. A smart cache (Redis, Upstash) reduces costs by avoiding redundant calls.
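That flow can be sketched in a few lines. This is a minimal sketch, not a production implementation: `call_llm` is a hypothetical stand-in for a provider SDK call, and a plain dict stands in for Redis or Upstash.

```python
import hashlib

# Hypothetical stand-in for a provider API call (OpenAI, Anthropic, ...).
# In production this would be the provider's SDK; here it just echoes.
def call_llm(prompt: str) -> str:
    return f"response to: {prompt[:40]}"

cache: dict[str, str] = {}  # Redis or Upstash in production

def answer(question: str, business_context: str) -> str:
    # Enrich the prompt with business context before calling the model.
    prompt = f"Context:\n{business_context}\n\nQuestion: {question}"
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in cache:               # cache hit: no API cost
        return cache[key]
    response = call_llm(prompt)    # cache miss: pay per token
    cache[key] = response
    return response

first = answer("What are your opening hours?", "Shop open 9-18 Mon-Fri.")
second = answer("What are your opening hours?", "Shop open 9-18 Mon-Fri.")
assert first == second and len(cache) == 1  # second call served from cache
```

Note that this exact-match cache only helps with identical prompts; a semantic cache (covered in the scalability section) goes further by matching similar questions.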
Key advantage: time-to-market of a few days. An SME can have a working prototype within a week.
Limitation: dependency on a single provider. From the start, plan an abstraction layer (Provider Pattern) so you can switch models without rewriting your code.
Pattern 2: RAG (Retrieval-Augmented Generation)
RAG has become the standard for enterprise AI projects. Instead of stuffing everything into the prompt, you combine a semantic search engine with an LLM.
The three-step process:
- Indexing: your documents (PDFs, web pages, internal databases) are split into chunks and converted into embeddings (numerical vectors).
- Search: when a user asks a question, the system identifies the most relevant chunks by vector similarity.
- Generation: the LLM receives the question AND the relevant chunks, then generates a contextualised, sourced response.
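The three steps can be sketched in a few lines of Python. The `embed` function here is a deliberately crude stand-in (bag-of-words counts with a small stopword list) for a real embedding model such as text-embedding-3 or BGE-M3; only the pipeline shape is the point.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "to", "for", "do", "i", "have", "how", "our"}

# Crude stand-in for an embedding model: bag-of-words counts.
def embed(text: str) -> Counter:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Indexing: split documents into chunks and embed them.
chunks = [
    "Our return policy allows refunds within 30 days.",
    "The warranty covers manufacturing defects for 2 years.",
    "Shipping to Switzerland takes 2 to 4 business days.",
]
index = [(c, embed(c)) for c in chunks]

# 2. Search: rank chunks by vector similarity to the question.
question = "How long do I have to return a product for a refund?"
q_vec = embed(question)
best_chunk, _ = max(index, key=lambda item: cosine(q_vec, item[1]))

# 3. Generation: the LLM receives question AND retrieved chunks (stubbed here).
prompt = f"Answer using this source:\n{best_chunk}\n\nQuestion: {question}"

assert "30 days" in best_chunk  # retrieval found the relevant chunk
```

In a real system, step 1 runs offline against a vector database, and step 3 is an actual LLM call; the structure stays the same.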
When to use it: when the LLM must answer based on your business data (internal documentation, knowledge base, product catalogue, legal corpus).
Key technical choices:
- Embedding models: OpenAI text-embedding-3, Cohere Embed v3, open-source models (BGE-M3, E5-large-v2). For multilingual (French/German/English), favour BGE-M3 or Cohere.
- Vector databases: Pinecone, Weaviate, Qdrant, or pgvector on PostgreSQL. The latter offers a good compromise for SMEs: no additional service, controlled costs, sufficient performance up to several million documents.
Key advantage: contextualised responses, reduced hallucinations, verifiable sources.
Limitation: RAG quality depends directly on the quality of indexed data. Outdated or poorly structured documents produce mediocre answers.
Pattern 3: Autonomous agents
AI agents represent the next layer of complexity. An agent is an LLM capable of using tools (web search, calculations, API calls, database queries) and planning a sequence of actions to accomplish a complex task.
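A minimal sketch of the agent loop, with the LLM's planning step replaced by a scripted plan and the tools stubbed out. Frameworks such as LangGraph or CrewAI implement the same loop, but with a real model choosing each next step from the observations so far.

```python
# Stubbed tools; in production these would hit a database, an API, etc.
def search_catalogue(query: str) -> str:
    return "Product X: CHF 249.-, in stock"

def compute_discount(price: float, pct: float) -> float:
    return round(price * (1 - pct / 100), 2)

TOOLS = {"search": search_catalogue, "discount": compute_discount}

# A scripted plan standing in for LLM-generated steps.
plan = [
    ("search", ("Product X",)),
    ("discount", (249.0, 10.0)),
]

observations = []
for tool_name, args in plan:          # the agent executes one tool per step,
    result = TOOLS[tool_name](*args)  # feeding each observation back to the LLM
    observations.append(result)

assert observations[1] == 224.1  # 10% off CHF 249.-
```

Each iteration of that loop is at least one LLM call in a real agent, which is exactly why token costs climb so quickly with this pattern.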
When to use them: tasks requiring multiple steps (research, analysis, decision, action), integration with existing systems (CRM, ERP), automation of complex workflows.
Orchestration frameworks: LangGraph, CrewAI, the OpenAI Agents SDK, or Anthropic's Claude Agent SDK. The choice depends on your technical ecosystem and the level of control you want.
Key advantage: end-to-end automation of full business processes. Next-generation AI chatbots often rely on this pattern to offer contextual, multi-step interactions.
Limitation: increased complexity, difficult debugging, high token costs (an agent can make 10 to 50 LLM calls for a single task). Reserve for high value-added use cases.
LLM selection criteria
Choosing a model is not a one-off decision (you can and should revisit it as the market evolves), but it does set the bounds on performance and cost. Here are the criteria to evaluate:
| Criterion | Key questions |
|---|---|
| Performance on your use case | Test 2-3 models on 50 real examples from your domain. Generic benchmarks are not enough. |
| Cost per request | Calculate the average cost per user interaction, not just the price per token. |
| Latency | A customer-facing chatbot demands a response in under 2 seconds. Batch processing tolerates 30 seconds. |
| Context window | For RAG with long documents, a 128K+ token window is an advantage. |
| Multilingual support | In Switzerland, support for French, German and English is often essential. |
| Compliance and hosting | Where is data processed? What contractual commitments on non-retention? |
In 2026, the reference models for enterprise projects are Claude (Anthropic), GPT-4o (OpenAI) and Gemini (Google). For local deployments requiring data sovereignty, Llama (Meta), Mistral and Qwen offer competitive performance.
Real costs of an AI architecture in Switzerland
Costs fall into four categories:
1. Initial development costs
- Simple API Wrapper: CHF 5'000.– to CHF 15'000.–
- Complete RAG with interface: CHF 15'000.– to CHF 40'000.–
- Multi-agent system: CHF 40'000.– to CHF 100'000.–
2. Recurring API costs
For an application with 1,000 active users per month:
- API Wrapper: CHF 200.– to CHF 800.– per month
- RAG (embeddings + generation): CHF 300.– to CHF 1'500.– per month
- Agents: CHF 1'000.– to CHF 5'000.– per month
3. Infrastructure costs
- Cloud solution (Supabase, Vercel, AWS): CHF 50.– to CHF 500.– per month
- Local deployment with GPU: CHF 2'000.– to CHF 10'000.– per month (hardware amortisation included)
4. Maintenance costs
Plan 15 to 20% of the initial development cost per year for maintenance, model updates and system evolution.
The key point: start with the least expensive option (API Wrapper) and scale up complexity only when business value is proven.
Scalability patterns
Horizontal scaling
When the number of users grows, the architecture must follow. The recommended pattern:
- Asynchronous queue (Redis Queue, BullMQ) to absorb load spikes without losing requests
- Semantic cache for recurring questions: if a similar question has already been asked, serve the cached answer rather than calling the LLM again
- Smart rate limiting per user and per request type
Vertical scaling (quality)
Improving response quality without changing the model:
- Re-ranking: after vector search, a re-ranking model (Cohere Rerank, cross-encoder) reorders results by actual relevance.
- Adaptive chunking: adjust chunk size by document type (paragraphs for narrative text, sections for technical documentation).
- Feedback loop: collect user feedback and use it to refine the system continuously.
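Adaptive chunking can be sketched like this; the document types and splitting rules below are illustrative assumptions, not a universal recipe:

```python
# Chunk size and boundaries depend on the document type.
def chunk(text: str, doc_type: str) -> list[str]:
    if doc_type == "narrative":
        # Narrative text: one chunk per paragraph.
        parts = text.split("\n\n")
    elif doc_type == "technical":
        # Technical docs: one chunk per section heading.
        parts = [s for s in text.split("## ") if s]
    else:
        parts = [text]
    return [p.strip() for p in parts if p.strip()]

narrative = "First paragraph.\n\nSecond paragraph."
technical = "## Install\nRun the installer.\n## Configure\nEdit config.yaml."

assert chunk(narrative, "narrative") == ["First paragraph.", "Second paragraph."]
assert len(chunk(technical, "technical")) == 2
```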
Provider Pattern
Build an abstraction layer that isolates your business code from the LLM provider. When a new, better-performing or cheaper model arrives, you switch providers without touching the rest of the application. This pattern is necessary: the LLM market evolves every quarter.
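A minimal sketch of the pattern in Python; the class and method names are illustrative, not taken from any real SDK:

```python
from typing import Protocol

# Business code depends only on this interface, never on a vendor SDK.
class LLMProvider(Protocol):
    def complete(self, prompt: str) -> str: ...

class AnthropicProvider:
    def complete(self, prompt: str) -> str:
        # Would call the Anthropic SDK here; stubbed for the sketch.
        return f"[claude] {prompt}"

class OpenAIProvider:
    def complete(self, prompt: str) -> str:
        # Would call the OpenAI SDK here; stubbed for the sketch.
        return f"[gpt] {prompt}"

def summarise(provider: LLMProvider, text: str) -> str:
    # Business logic only ever sees the abstract interface.
    return provider.complete(f"Summarise: {text}")

# Switching providers is a one-line change; no business code is touched.
a = summarise(AnthropicProvider(), "quarterly report")
b = summarise(OpenAIProvider(), "quarterly report")
assert a.startswith("[claude]") and b.startswith("[gpt]")
```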
The 5 most common mistakes
1. Starting with fine-tuning
Fine-tuning is rarely necessary as a first step. In 80% of cases, a good RAG with well-built prompts yields equivalent results at a fraction of the cost. Reserve fine-tuning for situations where the model must adopt a very specific style or master technical vocabulary that RAG cannot cover.
2. Ignoring data quality
A RAG system fed with outdated, badly formatted or contradictory documents will produce mediocre answers. Cleaning and structuring data often represent 50 to 60% of the total effort of an AI project. This is not wasted effort; it is the most profitable investment in the project.
3. Neglecting evaluation
Without quality metrics (relevance, faithfulness, completeness), it is impossible to measure progress. Set up an evaluation framework from the start: a reference set of 50 to 100 questions and answers, evaluated automatically and manually.
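A minimal evaluation harness might look like this. Scoring by expected-keyword overlap is a crude stand-in for LLM-as-judge or human review, but the structure (a reference set in, an aggregate score out) is the part that matters; the example also shows how brittle keyword matching penalises a correct answer phrased differently:

```python
# Reference set: in practice 50-100 cases drawn from real usage.
reference_set = [
    {"question": "Refund deadline?", "expected": ["30 days"]},
    {"question": "Warranty length?", "expected": ["2 years"]},
]

def system_under_test(question: str) -> str:
    # Stand-in for the real RAG pipeline being evaluated.
    answers = {
        "Refund deadline?": "You can request a refund within 30 days.",
        "Warranty length?": "The warranty runs for 24 months.",
    }
    return answers[question]

def evaluate() -> float:
    hits = 0
    for case in reference_set:
        answer = system_under_test(case["question"])
        if all(kw in answer for kw in case["expected"]):
            hits += 1
    return hits / len(reference_set)

score = evaluate()
assert score == 0.5  # "24 months" is correct but misses the "2 years" keyword
```

Run this automatically on every prompt or chunking change, and track the score over time alongside periodic manual review.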
4. Underestimating API costs in production
Token costs grow rapidly with volume. A prototype that costs CHF 50.– per month can reach CHF 2'000.– per month in production. Build a realistic forecast based on the number of users, frequency of use and average request size.
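A back-of-the-envelope forecast can be built from those three variables. All prices and volumes below are illustrative assumptions, not quotes from any provider:

```python
def monthly_cost_chf(users, requests_per_user_per_day,
                     tokens_in, tokens_out,
                     price_in_per_mtok, price_out_per_mtok):
    # Token prices are quoted per million tokens (input and output differ).
    requests = users * requests_per_user_per_day * 30
    cost = requests * (tokens_in * price_in_per_mtok
                       + tokens_out * price_out_per_mtok) / 1_000_000
    return round(cost, 2)

# Prototype: 20 test users, 5 requests/day, RAG-sized prompts.
proto = monthly_cost_chf(20, 5, 1500, 400, 3.0, 15.0)
# Production: 1000 users, identical usage profile.
prod = monthly_cost_chf(1000, 5, 1500, 400, 3.0, 15.0)

assert prod == proto * 50  # cost scales linearly with user count
```

With these assumed figures the prototype costs about CHF 31.50 per month and production about CHF 1'575.–, which is exactly the kind of jump this section warns about.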
5. Building a monolith
A monolithic AI system is hard to evolve and debug. Adopt a modular architecture: the search engine, the LLM, the cache, the user interface and the evaluation layer are independent components. If one component must change, the others are not affected.
The recommended progressive strategy
For a Swiss SME starting an AI project, here is the pragmatic roadmap:
- Weeks 1-2: Proof of Concept with an API (Claude or GPT-4o) on a precise, measurable use case. Tools such as Claude Code accelerate this phase. Budget: CHF 5'000.– to CHF 10'000.–.
- Months 1-2: RAG prototype with your business data, systematic quality evaluation, user testing. Budget: CHF 10'000.– to CHF 25'000.–.
- Months 3-4: Production rollout with monitoring, feedback loop, continuous improvement of prompts and chunking.
- Months 6+: Evaluation of fine-tuning or local deployment if volumes justify it. Adding agents if complex workflows must be automated.
This progressive approach minimises risk and validates business value before investing heavily. It is exactly the approach we apply in our Express AI PoC.
Summary
- Three patterns structure AI projects: API Wrapper, RAG and Agents. Start with the simplest.
- RAG is the standard for enterprise projects: contextualised answers, verifiable sources, controlled costs.
- The Provider Pattern protects your investment by making your architecture independent of the LLM provider.
- Data quality and continuous evaluation matter more than model choice.
- Budget realistically: an AI project in Switzerland starts at CHF 5'000.– for a PoC and CHF 15'000.– to CHF 40'000.– for a RAG production rollout.
- Contact MCVA Consulting to structure your AI project with a scalable architecture suited to your context.
Frequently asked questions
Should you fine-tune a model or use RAG?
In the vast majority of cases, RAG is the right first choice. Fine-tuning is relevant in two specific situations: when the model must adopt a very particular communication style (legal tone, ultra-specialised medical vocabulary), or when query volumes are so high that RAG cost (embeddings + context tokens) exceeds that of a fine-tuned model. In practice, fewer than 20% of enterprise AI projects require fine-tuning. Always start with a well-built RAG, measure its limits, then assess whether fine-tuning brings an improvement that justifies its cost (CHF 5'000.– to CHF 20'000.– for quality fine-tuning, plus maintenance).
What is the budget for an AI architecture in Switzerland?
A realistic Proof of Concept lies between CHF 5'000.– and CHF 15'000.–, including development and testing. A full production deployment with RAG, user interface and monitoring costs between CHF 15'000.– and CHF 40'000.–. Recurring costs (API, hosting, maintenance) range from CHF 500.– to CHF 3'000.– per month depending on usage volume. ROI is measured in productivity gains or additional revenue: an AI assistant that saves 2 hours per day for 10 employees represents savings of CHF 8'000.– to CHF 15'000.– per month in Switzerland. ROI is generally reached in 3 to 6 months for well-targeted projects.
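The ROI figure above can be checked with simple arithmetic; the hourly cost and project cost below are assumptions chosen to fall inside the ranges stated in this answer:

```python
employees = 10
hours_saved_per_day = 2
working_days = 21          # approximate working days per month
hourly_cost_chf = 30.0     # assumed fully loaded hourly cost

monthly_saving = employees * hours_saved_per_day * working_days * hourly_cost_chf
project_cost = 40_000      # top of the RAG production range cited above

payback_months = project_cost / monthly_saving

assert monthly_saving == 12600.0        # within the CHF 8'000-15'000 range
assert 3 <= payback_months <= 4         # consistent with the 3-6 month claim
```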
How do you guarantee data sovereignty?
Three approaches, in order of increasing constraint. First, contractual commitments: Anthropic, OpenAI and Google offer non-retention clauses for enterprise clients (Business or Enterprise plans). Second, deployment on European infrastructure: services such as Azure OpenAI allow you to host the model in a datacentre in Switzerland or Europe; data does not leave the jurisdiction. Third, local deployment of open-source models (Llama, Mistral): data stays entirely on your infrastructure. This option is more expensive in hardware but offers full control. For companies subject to the FADP or sectoral regulations (finance, health), a formal risk analysis is recommended before choosing the architecture.
Got an AI project and not sure where to start? Contact MCVA Consulting for a free technical diagnosis. We help you choose the architecture suited to your use case, your budget and your regulatory constraints. Discover our Express AI PoC to move from concept to prototype in two weeks.
Related articles
Next-generation AI chatbots: far more than automated FAQs
RAG chatbots and AI agents are no longer simple FAQ bots. A guide to modern architectures for Swiss SMEs.
FiscalDoc: how I replaced CHF 1'400.–/year of fiscal SaaS with an AI running on my Mac
An experiment: replacing a CHF 1'400.–/year fiscal SaaS subscription with a local application powered by an open-source AI. Total sovereignty, zero subscription.
Claude Code: a hands-on review of AI-assisted development
Claude Code is reshaping software development. A hands-on review of its strengths, limits and best practices to get the most out of it.