I’m working on something in a similar direction and would appreciate feedback from people who’ve built or operated this kind of thing at scale.
The idea is to hook into Bitbucket PR webhooks so that whenever a PR is raised on any repo, Jenkins spins up an isolated job that acts as an automated code reviewer. That job would pull the base branch and the feature branch, compute the diff, and use that as input for an AI-based review step. The prompt would ask the reviewer to behave like a senior engineer or architect, follow common industry review standards, and return structured feedback - explicitly separating must-have issues from nice-to-have improvements.
The output would be generated as markdown and posted back to the PR, either as a comment or some attached artifact, so it’s visible alongside human review. The intent isn’t to replace human reviewers, but to catch obvious issues early and reduce review load.
What I’m unsure about is whether diff-only context is actually sufficient for meaningful reviews, or if this becomes misleading without deeper repo and architectural awareness. I’m also concerned about failure modes - for example, noisy or overconfident comments, review fatigue, or teams starting to trust automated feedback more than they should.
If you’ve tried something like this with Bitbucket/Jenkins, or think this is fundamentally a bad idea, I’d really like to hear why. I’m especially interested in practical lessons.
We have coding agents heavily coupled with many aspects of the company's RnD cycle. About 1k devs.
Yes, you definitely need the project's context to have valuable generations. Different teams here have different context and model steering, according to their needs.
For example, specific aspects of the company's architecture is supplied in the context. While much of the rest (architecture, codebases, internal docs, quarterly goals) are available as RAG.
It can become noisy and create more needless review work. Also, only experts in their field find value in the generations. If a junior relies on it blindly, the result is subpar and doesn't work.
The idea is to hook into Bitbucket PR webhooks so that whenever a PR is raised on any repo, Jenkins spins up an isolated job that acts as an automated code reviewer. That job would pull the base branch and the feature branch, compute the diff, and use that as input for an AI-based review step. The prompt would ask the reviewer to behave like a senior engineer or architect, follow common industry review standards, and return structured feedback - explicitly separating must-have issues from nice-to-have improvements.
The output would be generated as markdown and posted back to the PR, either as a comment or some attached artifact, so it’s visible alongside human review. The intent isn’t to replace human reviewers, but to catch obvious issues early and reduce review load.
What I’m unsure about is whether diff-only context is actually sufficient for meaningful reviews, or if this becomes misleading without deeper repo and architectural awareness. I’m also concerned about failure modes - for example, noisy or overconfident comments, review fatigue, or teams starting to trust automated feedback more than they should.
If you’ve tried something like this with Bitbucket/Jenkins, or think this is fundamentally a bad idea, I’d really like to hear why. I’m especially interested in practical lessons.
Yes, you definitely need the project's context to have valuable generations. Different teams here have different context and model steering, according to their needs. For example, specific aspects of the company's architecture is supplied in the context. While much of the rest (architecture, codebases, internal docs, quarterly goals) are available as RAG.
It can become noisy and create more needless review work. Also, only experts in their field find value in the generations. If a junior relies on it blindly, the result is subpar and doesn't work.