The quest for efficiency and enhanced productivity in software development is perpetual. Last week our leadership team (Igor, Misha, and Stacey) dove deep into the rapidly evolving world of AI-powered coding assistants.
With our list now up to 25 tools, we really want to pick one and just go for it. The challenge is that the tools are still in their adolescent stage: quickly evolving, but not quite mature enough to earn our allegiance. That alone is causing thrash. In the attempt to become more productive, we are spending a lot of time switching tools and evaluating their new capabilities. On top of that, the LLMs that power them are not steady-state; they have their good days and bad days, adding another layer of unpredictability to a software pipeline.
That said, our goal of getting to 90% code generation still remains.
Mapping the Terrain: Types of AI Coding Tools
We started grouping code generation assistants into categories, recognizing they broadly fall into three buckets:
- Pure Chat Generators: Tools like Vercel's v0, Lovable, and Bolt take natural language prompts and generate code, often UI components. While useful for initial generation, they typically require downloading the code for manual integration and modification. Our team has found success using these specifically for generating UIs.
- Context-Aware IDE Assistants: This category includes tools like GitHub Copilot, Cursor, Augment Code, Windsurf, and Cline. These often function as plugins or forks of existing IDEs (like VS Code). They aim to understand your existing codebase, allowing for more integrated code generation, modification, and analysis. The effectiveness of their code indexing and context comprehension seems to be a key differentiator. For our purposes, unless a tool can index our code, we shouldn't spend any more time on it.
- Autonomous Agents: Tools like OpenHands and Devika (and the much-discussed Devin) attempt a more autonomous approach. You give them a task, and they try to figure out the steps, potentially generating code, analyzing files, and even suggesting PRs. Our initial tests with OpenHands showed promise in analyzing code but sometimes generated incomplete solutions requiring significant manual follow-up. These often run locally, which can be resource-intensive. In general, these are less mature than context-aware IDE assistants, so we are going to pause here and recheck when a major advancement comes along.
Putting Tools to the Test: A Live Coding Showdown
Unfortunately, our recording of the conversation did not save, so there is no video this week. We were also cautious about what we could share: we ran several scenarios against our live PromptOwl code base, including a couple of feature reviews and a security evaluation, and given the nature of those tests we shouldn't be sharing too many visuals.
That said, we can share our observations.
Our Ad Hoc Bake-Off for Context-Aware IDE Assistants
Igor fired up Windsurf, while Misha ran Cline and Augment Code (both VS Code extensions). Igor also had Cursor available for comparison.
The initial test was simple: Describe what this codebase does. Windsurf provided a concise, accurate summary relatively quickly. Augment also performed well after taking some time indexing the code. The group noted that many of these tools share a similar UI structure—file explorer, code editor, chat panel—but the quality of the chat interaction and code analysis varies significantly.
We then attempted a more complex task: adding a new feature to our monitoring tab. The goal was to add an icon to each row of a chat history table; clicking the icon should open a modal window that loads the conversation chain and lets the user chat about the displayed history, reusing existing chat components from the codebase.
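For reference, here is a rough sketch of the shape we were asking for. The component and prop names below (ChatHistoryTable, ChatPanel, Modal) are hypothetical stand-ins, not the actual PromptOwl components:

```tsx
// Hypothetical sketch of the requested feature; names are illustrative only.
import { useState, type ReactNode } from "react";

type Conversation = { id: string; title: string };

// Assume these already exist in the codebase and are reused as-is.
declare function ChatPanel(props: { conversationId: string }): JSX.Element;
declare function Modal(props: { onClose: () => void; children: ReactNode }): JSX.Element;

export function ChatHistoryTable({ rows }: { rows: Conversation[] }) {
  const [openId, setOpenId] = useState<string | null>(null);

  return (
    <>
      <table>
        <tbody>
          {rows.map((row) => (
            <tr key={row.id}>
              <td>{row.title}</td>
              <td>
                {/* New icon button: opens a modal to chat about this conversation */}
                <button aria-label="Discuss conversation" onClick={() => setOpenId(row.id)}>
                  💬
                </button>
              </td>
            </tr>
          ))}
        </tbody>
      </table>

      {openId && (
        <Modal onClose={() => setOpenId(null)}>
          {/* Reuse the existing chat component, seeded with the selected history */}
          <ChatPanel conversationId={openId} />
        </Modal>
      )}
    </>
  );
}
```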
This is where the differences became more apparent:
- Prompting Strategy: Misha, using Augment, adopted a multi-step, conversational prompting approach, treating the AI like a junior developer—providing context, asking clarifying questions, and breaking down the request. Igor, using Windsurf, tried a more direct, albeit detailed, initial prompt. Both agreed that breaking tasks down and doing them one at a time is often more effective. The smaller you can go, the better the fidelity of your result.
- Code Generation & Modification: Windsurf managed to generate the initial code to add the button and the modal structure relatively quickly and correctly, modifying the existing files appropriately. Augment, despite the careful prompting, ran into some issues, generating errors and getting stuck in loops that required manual intervention and even switching to a different LLM to debug the generated code. Misha noted a cache issue might have also contributed to the problems with Augment initially.
- The Stubborn LLM: Misha observed that LLMs can be stubborn. If they latch onto an incorrect approach, they might keep trying it unless explicitly redirected with very specific instructions. Getting stuck in error loops seems common, requiring the developer to guide the tool out of it. This is a skill in itself.
- Review is Crucial: Generated code always needs careful review. Igor mentioned spending 15 minutes getting an initial Dockerfile generated by an AI, but then 6-8 hours manually fixing the remaining 5% to make it work exactly as intended (fully local dependencies). The AI kept defaulting to cloud databases despite prompts for local ones.
By the end of the session, both Igor and Misha felt Windsurf had performed better, and we are going to bring it to the team for further exploration. We also plan a bake-off this week so we can measure and demonstrate what works best for us.
The Bigger Picture: Productivity, Skills, and Security
Does using these tools actually save us time?
Igor felt he saved some time on the Docker task, but not nearly as much as hoped, given the extensive debugging required. Misha cited VC reports claiming productivity gains of 200-400%. The reality likely lies somewhere in between and depends heavily on the task complexity, the tool's capability, and the developer's skill level and knowledge of the code base.
We also believe that reducing file size would help with indexing as well as with generation speed. We are going to audit the code base, attempt to reduce all files to roughly 300 lines of code, and see if that speeds up generation times.
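To kick off that audit, a small script along the lines below can flag the files over the threshold. The `src` root and extension list are assumptions for illustration, not a description of our actual repo layout:

```ts
// audit-file-sizes.ts: list source files over a line-count threshold.
// Run with: npx ts-node audit-file-sizes.ts
import { readdirSync, readFileSync, statSync } from "fs";
import { join, extname } from "path";

const ROOT = "src";          // assumed source root
const MAX_LINES = 300;       // our target file size
const EXTENSIONS = new Set([".ts", ".tsx", ".js", ".jsx"]);

// Recursively collect all file paths under a directory.
function walk(dir: string): string[] {
  return readdirSync(dir).flatMap((name) => {
    const full = join(dir, name);
    return statSync(full).isDirectory() ? walk(full) : [full];
  });
}

const offenders = walk(ROOT)
  .filter((file) => EXTENSIONS.has(extname(file)))
  .map((file) => ({ file, lines: readFileSync(file, "utf8").split("\n").length }))
  .filter(({ lines }) => lines > MAX_LINES)
  .sort((a, b) => b.lines - a.lines);

for (const { file, lines } of offenders) {
  console.log(`${lines}\t${file}`);
}
console.log(`${offenders.length} files over ${MAX_LINES} lines`);
```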
A significant concern raised was the potential impact on skill development, especially for junior engineers. If developers rely heavily on AI generation without deeply understanding the underlying code, will they build the necessary expertise to debug complex issues or design robust systems? Reading and understanding AI-generated code requires its own set of skills.
We also touched upon security. An initial prompt asking an AI tool to perform a security audit quickly identified potentially critical (and sensitive) issues in our code. Nothing crazy, but a great list of housekeeping items to tidy up our security profile. Those made it right into our backlog. We were very pleased with that exercise.
While powerful, this also highlighted the risk of exposing vulnerabilities if such audits or the resulting code were shared publicly. Ultimately, this led to a decision to explore security improvements but be cautious about discussing specific findings openly.
Next Steps on the Path
Our exploration is far from over. Based on this session, we plan to:
- Focus on Windsurf: Get the team experimenting primarily with Windsurf next week, given its promising performance in our tests.
- Explore Refactoring & Testing: Use AI tools to help refactor our largest files, breaking them down to about 300 lines, which might improve AI performance. We also want to spend some time driving towards more of a Test-Driven Development (TDD) pattern to help AI tools validate their results early (see the sketch after this list).
- Refine Prompting: Continue practicing and refining our prompting strategies, treating the AI as a capable but sometimes literal-minded pair programmer that needs clear context and guidance.
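On the TDD idea, the intent is to hand the assistant a failing test up front so it has a concrete target to validate against. A minimal sketch, assuming a Vitest-style runner; the function and module names here are hypothetical:

```ts
// conversationChain.test.ts: written before the implementation exists.
// The assistant is then asked to write code that makes these tests pass.
import { describe, it, expect } from "vitest";
import { buildConversationChain } from "./conversationChain";

describe("buildConversationChain", () => {
  it("orders messages oldest-first by timestamp", () => {
    const chain = buildConversationChain([
      { id: "b", timestamp: 2, text: "second" },
      { id: "a", timestamp: 1, text: "first" },
    ]);
    expect(chain.map((m) => m.id)).toEqual(["a", "b"]);
  });

  it("returns an empty chain for no messages", () => {
    expect(buildConversationChain([])).toEqual([]);
  });
});
```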
The journey into AI-assisted development is fascinating, challenging, and full of potential. While no tool is a magic bullet, the right assistant, wielded effectively, could significantly reshape our workflow. We’ll continue sharing our findings as we navigate this evolving landscape.
