It started innocently enough: adding Kubernetes support to Stakpak. Just another feature, right? Fast forward six weeks and we were knee-deep in refactoring, hunting for workarounds, and out of coffee. What was meant to be a simple task turned into a full-blown rewrite. The more we dug into the system, the more it felt like patching a sinking ship with duct tape. Error handling was a mess, performance was dragging, the database was… well, let’s just say it wasn’t winning any awards, and every time a user imported broken code, the service went down for everyone. If adding one provisioner took six weeks, rewriting the entire backend would be worth it to spare us that pain again and again. Sometimes you have to hit the reset button to move forward.
We’re on a mission to bring joy back to software development—yes, actual joy, not the “Praise the gods it just works, let’s move on” kind. We started with the most eyebrow-raising work developers face today: the kind that steals your weekends and makes you question your career choices. The kind of work only 3.3% of developers dare to take on as a job—DevOps and infrastructure work!
Let’s face it, most of us have felt completely adrift in the endless ocean of tools, navigating long debugging sessions and trial-and-error marathons after joining a new team (or starting a personal project). If you’re lucky, you might find the shore before AWS networking configurations convince you to set the project aside for something less frustrating. But no more!
Introducing Stakpak, the first AI-powered DevOps IDE designed to make DevOps and Infrastructure as Code (IaC) FUN. No need to be a wizard in Terraform, OpenTofu, Kubernetes, Dockerfiles, or any other tool in the ever-growing list. Stakpak simplifies everything and constantly evolves to empower developers to build and maintain their own customized, production-ready infrastructure!
What does Stakpak do *so far*?
- Generate and modify your infrastructure with AI (No IaC degree required).
- Guide you to the right documentation (so you can stop pretending you understand cloud forums/documentation).
- Visualize your code (because seeing is believing—or at least debugging).
- Scan your IaC for security threats (getting the security team off your shoulder).
- Help you sleep soundly on weekends (actual results may vary).
Why did we decide to rewrite the backend?
To make DevOps truly enjoyable, we decided it was time to rebuild Stakpak’s backend. Initially, our backend was built for a narrow use case, Terraform/OpenTofu, which served our MVP perfectly. However, when we set out to support Kubernetes and other Infrastructure-as-Code (IaC) tools, we ran into significant limitations in the existing architecture. These challenges made it clear that a complete rewrite was necessary to scale and evolve Stakpak to meet the growing demands of our users. Here’s the story behind why and how we made this transition.
Motivation Behind the Rewrite
The Old Stack:
- Database: Neo4j
- API Layer: GraphQL
- Programming Languages: JavaScript/TypeScript
- Backend Framework: Node.js
- GraphQL Server: Apollo Server
Why We Hit the Reset Button
It took us six weeks to add Kubernetes support, and to make matters worse, the server would crash whenever users imported broken code, requiring a restart each time. One failed session could block everyone else, turning our system into a digital game of “who’s next?”—and frankly, we were tired of playing. That’s when we decided, this is it—we’re rewriting Stakpak.
While this architecture worked initially, several critical issues emerged:
- Schema Constraints and Database Limitations:
- The Neo4j GraphQL schema has significant restrictions; for example, we couldn’t enforce multi-key (composite) unique constraints.
- Tight coupling between database and API layer, e.g. all fields are exposed by default, forcing us to explicitly mark private fields and manually enforce access control restrictions.
- The data layer was not abstracted, meaning any changes to how we stored data impacted the entire API. Since the database and API were tightly coupled, even small adjustments to the data storage method would require updates across all API endpoints, making the system harder to change.
- Our graph database was great for representing relationships, but it couldn’t handle other aspects of our business logic that required relational constraints (e.g. Users and Organizations). We needed a multi-model database that could handle all the other types of data we needed to store.
- GraphQL was used to simplify editing resource graphs, but we quickly found that most of our update operations were too complex, requiring custom resolvers. This made GraphQL less effective for our use case.
- Convoluted Data Processing:
- We had to convert configurations/code to JSON for analysis, which made everything unnecessarily complicated and painfully slow. On top of that, this approach wasn’t universal—some languages couldn’t be converted to JSON (e.g. Rego policies), and that extra transformation layer just made it slower!
- This resulted in overly complex validation and error-handling mechanisms that were challenging to maintain and scale.
- Performance Bottlenecks:
- Inefficient processing of large-scale data, e.g. we had a lot of CPU-bound tasks blocking the Node.js event loop.
- Performance limitations when handling high-traffic scenarios, where a single session could consume excessive CPU resources, causing delays or crashes for other sessions.
- The API was prone to failure, where a single error could bring down the entire system. This not only made the backend unstable but also added complexity to maintenance, making it hard to ensure consistent performance.
- Complex Frontend:
- Our front end was overly complicated, with a lot of business logic handled through complex state machines and in-browser code processing and validation. While this made the Stakpak IDE very responsive, it also made it difficult to support new technologies or build other clients on top of the Stakpak API like VSCode extensions, since we had to replicate this complex logic across clients.
- Architectural Complexity and Inflexibility:
- Our data model and parsing tools were closely tied to Terraform, making it challenging to support other configuration languages. This lack of flexibility resulted in a rigid system that couldn’t easily adapt to new technologies or the many other DevOps tools available.
- Difficulties in Implementing Robust Error Handling:
For example, TypeScript doesn’t always surface potential errors, so some errors are never explicitly handled. Implicit error propagation compounds this: errors aren’t always passed along correctly, leading to inconsistent behavior across the system.
import { GraphQLError } from "graphql";

interface Flow {
  id: string;
  name: string;
}

async function getFlow(flowId: string): Promise<Flow> {
  // Simulates a database call that might throw an error
  if (!flowId) throw new Error("Flow ID is required");
  return { id: flowId, name: "Flow 1" };
}

async function resolveFlow(
  _: unknown,
  args: { flowId: string }
): Promise<{ success: boolean; flow: Flow }> {
  try {
    const flow = await getFlow(args.flowId);
    // Further processing
    return { success: true, flow };
  } catch (e) {
    // Return a GraphQL-friendly error
    throw new GraphQLError("Failed to fetch flow data");
  }
}
- Implicit Error Sources: Functions like getFlow don’t specify the errors they might throw in their type signatures, leaving callers unaware of what errors could occur or how to handle them properly.
- Error Swallowing: The catch block wraps errors into a generic GraphQLError without retaining the original error details, making debugging significantly harder.
- Error Propagation Complexity: The resolver assumes all errors should be turned into GraphQLError, but some errors—like database connection issues—require a different handling approach. This lack of distinction adds unnecessary complexity and makes it difficult to manage errors effectively.
Why Not Just Refactor?
When we looked at the challenges in our existing stack, we faced a crucial decision: should we refactor the current system or go all in with a complete rewrite? While refactoring was tempting, we quickly realized that the deep-rooted limitations in our architecture made it almost impossible to achieve the flexibility and scalability we needed. Here’s why we decided a complete rewrite was the only way forward.
1. Architectural Considerations
- The current stack relied heavily on custom resolvers for Neo4j/GraphQL, and we needed a more flexible database.
- Migrating the database would require us to completely rebuild the entire API.
- We needed a more flexible interface to support different provisioners beyond Terraform and accommodate other client types, such as VSCode extensions and CLI tools.
2. Language and Architectural Goals
- We realized that incremental refactoring wouldn’t solve the core limitations of our technology stack.
3. Strategic Objectives for the New Architecture
- Decouple the data layer from the API layer.
- Implement more flexible code parsing and validation.
- Create a more modular and extensible codebase.
- Improve performance for faster data processing, because developers value responsiveness and slow tools are a quick way to lose their attention.
- Stick with a monolith for now: our team is small, we want to move fast, and we don’t want the number of services to exceed the number of devs. Keeping it simple means fewer things to juggle and more time for what really matters: shipping awesome stuff!
- Support multiple DevOps tools (aka provisioners) to expand Stakpak beyond Terraform, as we had always intended.
- Shift core functionality, such as validation, from the frontend to the backend. This simplifies the frontend’s responsibility, allowing it to focus solely on delivering a seamless UX.
Why Rust?
Rust stood out as the ideal choice for our needs, offering a powerful mix of performance, reliability, and flexibility:
- Universal Language Parsing: Rust’s support for tree-sitter makes it a perfect fit for handling all supported languages effortlessly.
- Error Reliability: With Rust’s strong type system and safety guarantees, we have end-to-end typed errors, helping us minimize bugs and runtime errors.
- Blazing Performance: Rust shines when working with large files, handling them with speed and efficiency. This also makes it perfect for analyzing source code.
- Rapid Development: It lets us iterate quickly without sacrificing code quality or performance—a win-win for productivity and maintainability.
- Extendability: Rust’s mature generics make it easy to build scalable support for multiple provisioners and LLM providers (Stakpak uses more than 4 different AI models), future-proofing our work.
- Explicitness: Rust’s explicitness and multi-threading allowed us to make sure failures are isolated and errors are handled properly.
- Fun: Rust has some functional elements like enum types and pattern matching, which helps us write MUCH less code and have fun doing so (fun is seldom appreciated nowadays when coding).
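To make the “end-to-end typed errors” point concrete, here is a minimal sketch, with hypothetical names (`AppError`, `get_flow`), of the flow-fetching logic from the earlier TypeScript example rewritten so that every failure mode is visible in the function signature:

```rust
use std::fmt;

// Every failure mode is an explicit variant the caller must handle.
#[derive(Debug, PartialEq)]
pub enum AppError {
    NotFound(String),
    InvalidInput(String),
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::NotFound(id) => write!(f, "flow `{}` not found", id),
            AppError::InvalidInput(msg) => write!(f, "invalid input: {}", msg),
        }
    }
}

#[derive(Debug, PartialEq)]
pub struct Flow {
    pub id: String,
    pub name: String,
}

// Unlike the TypeScript version, the signature declares exactly which
// errors can occur; callers cannot silently ignore them.
pub fn get_flow(flow_id: &str) -> Result<Flow, AppError> {
    if flow_id.is_empty() {
        return Err(AppError::InvalidInput("flow id is required".into()));
    }
    Ok(Flow {
        id: flow_id.to_string(),
        name: "Flow 1".to_string(),
    })
}
```

Pattern matching on the `Result` then forces each call site to decide what a `NotFound` versus an `InvalidInput` means for the response, which is exactly the distinction the old resolver collapsed into one generic `GraphQLError`.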
Other Languages We Considered
While evaluating alternatives, TypeScript with Effect, Elixir, and Go each stood out but fell short for our needs.
- TypeScript + Effect: This would have let us keep our existing stack while adding strong end-to-end typed errors and more functional elements to TypeScript. But it came at a cost: we’d still have to rewrite everything in the new Effect tongue.
- Elixir: Offers excellent concurrency and rapid development through Phoenix, but lacks native tree-sitter support and is super slow with CPU-bound tasks. We experimented with writing native extensions in Rust to speed things up, but this added extra maintenance overhead.
- Go: A strong contender with a mature ecosystem, simple syntax, and good performance, but the limited language features, the lack of enums and mature generics, and manual error management made it less suited to our use case. We would have had to write a lot of code or rely heavily on code generation to build generic data processing and LLM generation pipelines.
Despite our extensive experience with Go (we took 2 products to production with Go), Rust’s combination of functional elements, efficiency, and modern features proved to be a better fit for our goals.
Building the New Backend
The New Stack
- Primary Language: Rust
- Web Framework: Axum
- Database: EdgeDB
- API Design: RESTful
Architectural Decisions and Implementation
Decoupled Provisioner Architecture
To support new configuration languages quickly, we designed an interface for provisioner-specific logic. This architecture:
- Keeps Provisioner Functionality Isolated: Ensures clean separation from core system components, reducing potential ripple effects.
- Minimizes Core Impact: Core systems remain stable and unaffected by provisioner changes.
- Extensible: Adding or modifying provisioners is straightforward and low-risk.
trait Provisioner {
    fn parse(&self, code: &str) -> Result<Vec<Config>, AppError>;
    fn validate(&self, config: &Config) -> Result<(), AppError>;
    // The rest of the interface methods
}

struct TerraformProvisioner;
struct KubernetesProvisioner;

impl Provisioner for TerraformProvisioner {
    // Implementation for Terraform-specific analysis
}

impl Provisioner for KubernetesProvisioner {
    // Implementation for Kubernetes-specific analysis
}
Tolerating Corrupt Input & Broken Configurations
We designed our system to handle imperfect input gracefully, allowing users to import all their configurations—even if they’re broken. Instead of demanding 100% valid input, we focus on helping users identify and fix issues seamlessly.
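As an illustration of the idea (a toy, not Stakpak’s actual parser), a tolerant parse returns everything it understood plus diagnostics for what it didn’t, instead of failing the whole import:

```rust
// A parse that never fails outright: it keeps whatever it could
// understand and reports the rest as diagnostics.
#[derive(Debug, PartialEq)]
pub struct ParseOutcome {
    pub parsed: Vec<String>,      // keys we understood
    pub diagnostics: Vec<String>, // problems reported instead of crashing
}

pub fn parse_tolerant(source: &str) -> ParseOutcome {
    let mut outcome = ParseOutcome {
        parsed: Vec::new(),
        diagnostics: Vec::new(),
    };
    for (i, raw) in source.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        // Toy rule: a valid entry is `key: value`; anything else is
        // broken input we note rather than aborting the import.
        match line.split_once(':') {
            Some((key, value))
                if !key.trim().is_empty() && !value.trim().is_empty() =>
            {
                outcome.parsed.push(key.trim().to_string());
            }
            _ => outcome
                .diagnostics
                .push(format!("line {}: unparseable `{}`", i + 1, line)),
        }
    }
    outcome
}
```

The important property is that one broken line produces one diagnostic, never a crashed session for everyone else.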
Embracing the Monolith
We deliberately keep things simple, resisting the urge to split into microservices too soon. A monolithic architecture reduces complexity, ensuring our team can focus on delivering value without being bogged down by unnecessary communication or infra overhead.
Minimizing Datastore Complexity
Initially, managing text embeddings required dedicated vector databases like Weaviate or Qdrant. Today, with vector storage becoming a standard feature in most databases, we’ve simplified our architecture by minimizing the number of datastores we rely on, keeping things simple.
Shift Complexity to the Backend
We decided to do this to simplify the frontend and make it easier to build other kinds of clients on top of the API. For instance, core validation functionality was offloaded from the frontend to the backend via sockets (like Replit). This shift reduced frontend complexity and ensured validation logic is maintained in one place, enhancing both maintainability and scalability.
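A toy sketch of the principle, with hypothetical names and a deliberately trivial rule: one backend validator serves every client, so no frontend has to re-implement it. A real implementation would run schema-aware checks and stream results over a socket:

```rust
use std::collections::HashSet;

// A diagnostic the backend returns to any client (web IDE, VSCode
// extension, CLI) instead of each client validating locally.
#[derive(Debug, PartialEq)]
pub struct Diagnostic {
    pub line: usize,
    pub message: String,
}

pub fn validate_document(source: &str) -> Vec<Diagnostic> {
    // Toy schema: the only keys our pretend language allows.
    let known: HashSet<&str> = ["region", "name"].into_iter().collect();
    let mut diagnostics = Vec::new();
    for (i, line) in source.lines().enumerate() {
        if let Some((key, _value)) = line.split_once('=') {
            if !known.contains(key.trim()) {
                diagnostics.push(Diagnostic {
                    line: i + 1,
                    message: format!("unknown key `{}`", key.trim()),
                });
            }
        }
    }
    diagnostics
}
```

Because validation lives behind one function, every client renders the same diagnostics and a rules change ships once, on the backend.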
Navigating Challenges and Making Trade-offs
Building a robust and extensible system isn’t smooth sailing—it’s more like fixing a plane mid-flight. Each challenge comes with its own set of surprises, and we had to get creative with solutions while making trade-offs that wouldn’t leave us regretting our life choices (again).
Challenge 1: LLM Toolchain Integration
The Rust ecosystem lacked robust support for integrating LLM toolchains. We tackled this by designing an abstract provider-agnostic interface for AI model providers, similar to a LangChain-style API (but much simpler, it works only for our use-cases). Additionally, we used AI-driven code generation to quickly scaffold lightweight SDKs based on the API documentation of each provider, speeding up the process. The Rust linter Clippy, the Rust compiler, and a batch of unit tests made this process reliable.
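Here is a minimal sketch of what such a provider-agnostic interface can look like (names are hypothetical; the real interface covers far more than a single completion call):

```rust
// One trait abstracts over every model provider, so generation
// pipelines are written once against this interface.
pub trait CompletionProvider {
    fn provider_name(&self) -> &'static str;
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

// Stand-in provider; a real one would wrap an SDK or HTTP client.
pub struct EchoProvider;

impl CompletionProvider for EchoProvider {
    fn provider_name(&self) -> &'static str { "echo" }
    fn complete(&self, prompt: &str) -> Result<String, String> {
        // A real implementation would call the provider's API here.
        Ok(format!("echo: {}", prompt))
    }
}

// Core pipeline code depends only on the trait, so swapping models
// or providers never touches this function.
pub fn generate(
    provider: &dyn CompletionProvider,
    prompt: &str,
) -> Result<String, String> {
    provider
        .complete(prompt)
        .map_err(|e| format!("{} failed: {}", provider.provider_name(), e))
}
```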
Challenge 2: Database Mocking
Testing traits in Rust isn’t straightforward, and some of our tests required the ability to mock the database. We built a custom mock database—a lightweight implementation designed specifically for testing—giving us the flexibility to simulate real database interactions without the complexity of a full-fledged database (praise generics and interfaces).
pub struct MockDatabase {
    responses: RefCell<HashMap<String, Vec<Value>>>,
}

impl MockDatabase {
    pub fn new() -> Self {
        let mut responses: HashMap<String, Vec<Value>> = HashMap::new();
        responses.insert(String::from("query_json"), vec![]);
        responses.insert(String::from("query_single_json"), vec![]);
        responses.insert(String::from("query_required_single_json"), vec![]);
        Self {
            responses: RefCell::new(responses),
        }
    }

    pub fn add_response(&self, method: DatabaseClientMethod, response: Value) {
        let mut responses = self.responses.borrow_mut();
        responses
            .get_mut(method.as_str())
            .expect("Object not found")
            .push(response);
    }
}

impl DatabaseClient for MockDatabase {
    async fn transaction<B, F, T>(&self, mut body: B) -> Result<T, AppError>
    where
        B: FnMut(Option<Transaction>) -> F,
        F: Future<Output = Result<T, AppError>>,
    {
        // ...
    }
}
Trade-offs
- While Rust excels in low-level operations like text manipulation and analysis, its community support for integrations with AI toolchains and LLMs isn’t as great as Python or JavaScript, making some tasks more challenging.
- Our CI pipelines also took a hit with long build and test times, especially since compiling Rust code can be very VERY slow (Go spoiled us). We faced some issues with build arguments for Apple Silicon that slowed things down even more.
- Additionally, Rust has a steep learning curve. While we haven’t yet fully unlocked Jedi-level mastery, we’re making progress. The language design is rich, requiring time and practice to master, but we’re steadily getting there.
Results and Insights
The rewrite of Stakpak’s backend delivered significant improvements across performance, reliability, and operational simplicity. Here’s what changed:
Epic Upgrades
- 900x Faster Processing: Large codebases now process in record time, reducing wait times and boosting developer productivity. And if you need more convincing, just take a look at the magic happening above!
- Expanded Platform Support: Whether you’re Team Terraform, rocking OpenTofu, or dabbling in something exotic, Stakpak’s got you covered. Since the migration, we’ve rolled out support for GitHub Actions and Dockerfiles, and we can auto-magically handle whatever you need next. Just say the word!
- Better Error Handling and Stability: The system gracefully handles edge cases and broken configurations, minimizing disruptions and downtime.
What’s Not Changing in Stakpak
Stakpak is still all about helping developers build and manage their own production-ready infrastructure—easy, customizable, and stress-free (well, mostly).
We’re just getting started, and we’d love to hear what you think. Got ideas, questions, or random memes? Join us on Discord and let’s figure it out together—no bs, we promise!