Using GPT Driver Across XML and Compose UI in Android Test Automation
- Christian Schiller
- July 10, 2025
- 14 min read
The Hybrid UI Testing Problem (XML + Compose = Brittle Tests)
Mixing legacy XML-based layouts with new Jetpack Compose screens in one Android app can wreak havoc on UI tests. QA teams often find their end-to-end tests becoming brittle when an app uses both UI frameworks. For example, an Espresso test might click through an XML-based login screen only to stall when the next screen is Compose-based – the testing framework can’t “see” the Compose elements and the test fails. This hybrid approach creates potential conflicts in automation: locators don’t line up, UI hierarchies differ, and one framework’s tools don’t seamlessly handle the other’s components. In short, hybrid XML+Compose UIs tend to confuse traditional test frameworks, leading to flaky tests and duplicated effort.
Why Do Mixed UIs Confuse Test Frameworks?
Several factors make hybrid UIs challenging for Appium/Espresso-based testing:
Different Locator Strategies: XML views have resource IDs and are easily found with Espresso’s onView(withId(...)) or Appium’s resource-id and accessibility-id locators. Compose UI elements aren’t XML Views; they are located through semantics (test tags, content descriptions, visible text). This means the same “Login” button might be referenced by R.id.login_button in XML but by Modifier.testTag("loginButton") or its text label in Compose. Traditional tests must handle both, doubling the locator definitions (see the sketch after this list).
Divergent UI Hierarchies: Espresso is built around the View hierarchy and can directly query View IDs. Compose builds its own UI tree; Espresso can’t query inside a Compose tree unless you use the Compose testing APIs. In fact, if an app launches an XML-based Activity that later sets Compose content, Espresso cannot directly find those Compose nodes. One Stack Overflow answer notes that for a hybrid XML + Compose app where the launch Activity uses XML and a later screen is Compose, Espresso + ComposeTestRule cannot fully traverse the flow – you’re basically forced to use UiAutomator for true end-to-end tests. This is because Espresso doesn’t “see” Compose components in the view hierarchy, whereas lower-level tools like UiAutomator (which Appium uses under the hood) operate on the accessibility layer and can see them (usually as basic accessibility nodes).
Asynchronous Rendering and Timing: Compose renders its UI on its own composition and frame clock rather than through the classic View invalidation path. Espresso’s synchronization mechanisms (idling resources) don’t automatically cover Compose unless you use the Compose testing library (with an AndroidComposeTestRule). In a hybrid test, coordinating the two can be tricky – e.g. the Compose content might not be ready when Espresso proceeds, causing flaky failures. Compose’s testing framework has its own idling and needs the test to control the Compose content or use createAndroidComposeRule, neither of which is trivial to combine with standard Espresso tests.
Accessibility and Tagging Differences: By default, Compose UI elements are only accessible via their visible text or content descriptions to tools like UiAutomator/Appium. If a Compose component has no text or content description (e.g. an icon with just a testTag), Appium won’t see the testTag unless the app enabled the testTagsAsResourceId semantics flag in Compose. In contrast, XML views typically have a resource ID or content description for automation. This means that without extra setup, certain Compose elements may be invisible to Appium/Espresso, leading to “element not found” errors.
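To make the locator mismatch concrete, here is a minimal sketch of the same “tap Login” step written once per framework – the resource ID, test tag, and text are illustrative assumptions, not code from a specific app:
// The same "tap Login" step, defined once per UI framework (Kotlin; ID and tag are illustrative)
// XML screen – Espresso locates the View by its resource ID:
onView(withId(R.id.login_button)).perform(click())
// Compose screen – the Compose testing API locates the node via its semantics:
composeTestRule.onNodeWithTag("loginButton").performClick()
// ...which only matches if the app added Modifier.testTag("loginButton"); otherwise the test
// has to fall back to something like composeTestRule.onNodeWithText("Login").performClick()
Keeping both definitions in sync for every flow that crosses old and new screens is exactly the duplicated effort described above.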
The result of all these differences is that a unified test can break simply because the UI technology switches under the hood. One part of the app might load slower or present a slightly different hierarchy, causing the test to not find an element or to interact at the wrong time. These issues tend to surface in staging environments and nightly CI runs – e.g. a test passes on an XML-based flow, but the Compose-based variant of that flow fails due to a timing issue or missing locator, causing flaky CI results.
Traditional Workarounds for Mixed UI Automation
How have teams coped so far? In practice, QA engineers have tried several approaches to handle apps with both XML and Compose UIs – each with pros and cons:
Maintaining Dual Locators or Test Logic: One common workaround is to write separate code paths for each UI type. For instance, a test might check, “If the new Compose screen is present, use composeTestRule.onNodeWithTag("X"), else use onView(withId(...)).” This conditional logic ensures coverage, but it bloats the test code and doubles maintenance: any update requires changing both branches, and it’s easy for one to rot (see the sketch after this list). It’s a fragile, hard-to-scale solution.
Using UiAutomator for Everything: Some teams default to a lower-level approach (e.g. using UiAutomator via Appium) to interact with both XML and Compose through the accessibility layer. This does unify the locator strategy (everything is a generic accessibility node), and UiAutomator can see Compose elements by their text or content-desc when Espresso can’t. However, this comes at a cost: less specificity and more flakiness. Without careful usage of testTagsAsResourceId in Compose, tests might rely on visible text which can change or be non-unique. Also, UiAutomator is generally slower than Espresso and lacks the rich assertions and synchronization Espresso/ComposeTest provide. Essentially, you trade speed and robustness for compatibility.
Duplicating Tests or Using Separate Suites: Some organizations write two sets of tests – one using Espresso for the legacy XML screens and another using the Compose testing framework for new screens. For example, an Espresso test covers the old login flow, and a separate Compose test covers the new signup flow. This avoids hybrid complexity in a single test, but duplicates a lot of effort. It also doesn’t truly simulate a user journey that crosses from an XML screen to a Compose screen, so end-to-end coverage suffers unless you use a higher-level tool.
Increasing Waits and Retries: To tackle flaky loads or mismatched timing, teams often add manual waits, retries, or polling for elements. For instance, after navigating to a Compose screen, a test might sleep for a couple seconds or loop until a “magic text” appears. While this can band-aid some timing issues, it makes tests slower and can mask real problems. Moreover, picking an arbitrary wait time is brittle – too short and tests still fail intermittently; too long and your pipelines slow down unnecessarily.
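Put together, the first and last workarounds often end up looking something like the sketch below – the tag, resource ID, and sleep duration are hypothetical, not taken from a specific project:
// Conditional locator logic plus a manual wait band-aid (Kotlin; names and timings are hypothetical)
fun tapSubmit() {
    // Branch on whichever UI variant happens to be live
    val composeMatches = composeTestRule.onAllNodesWithTag("submitButton").fetchSemanticsNodes()
    if (composeMatches.isNotEmpty()) {
        composeTestRule.onNodeWithTag("submitButton").performClick() // new Compose screen
    } else {
        onView(withId(R.id.submit_button)).perform(click()) // legacy XML screen
    }
    // Arbitrary sleep to let the next (Compose) screen settle – too short is flaky, too long slows CI
    Thread.sleep(2000)
}
Every UI change now has to be reflected in two branches, which is why this pattern tends to rot quickly.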
Each of these approaches is a compromise. They add complexity or maintenance overhead, and they don’t fully solve the problem of hybrid UI automation. QA leads have been eager for a more elegant solution – one that abstracts away whether a screen is XML or Compose and just lets the test interact with it seamlessly.
GPT-Driver’s Unified Approach to Hybrid UI Testing
How can GPT-Driver help, and what is it exactly?
GPT-Driver in a Nutshell: GPT-Driver is an AI-driven mobile test automation tool that works with existing frameworks (Espresso, Appium, XCUITest) but lets you write tests in plain language or via a simple SDK. In essence, you describe the test steps in natural language, and GPT-Driver executes them on a device. Under the hood, it translates those steps into actions on the app, using the appropriate engine (for Android it can plug into Espresso or Appium’s UIAutomator, for example). This means you can write a high-level test case without worrying about how it finds the button or text field – GPT-Driver’s AI and abstraction layer handle that.
No-Code Instructions & Low-Code SDK: GPT-Driver supports a no-code approach (writing steps like “Tap on the Login button” in a studio or config) and a low-code SDK where you call an API in Python, Java, etc. The key is that these instructions are platform-agnostic. You don’t say “find view by ID X” or “find node by testTag Y” – you just say what a user would do (e.g. “Enter the username Alice and press the Login button”). GPT-Driver’s engine interprets this and figures out the right locator strategy for the current screen.
Unified Locator Resolution: GPT-Driver was designed to abstract away differences between UI implementations. For a given instruction, it employs multiple strategies behind the scenes to locate elements robustly:
It can use existing element identifiers or text just like Espresso/Appium would, but it has an AI-based fallback if those identifiers change or aren’t readily available. For instance, if a button’s resource ID changed or a Compose element has no resource ID, GPT-Driver might fall back to using the button’s label text or even its relative position on screen.
It treats Compose and XML uniformly. If you tell GPT-Driver “tap the Submit button”, you don’t need to specify whether it’s onView(withText("Submit")) or composeRule.onNodeWithText("Submit") – GPT-Driver will search the UI hierarchy and accessibility tree for a matching Submit element. If it’s a standard Button with that text, Appium’s accessibility lookup will find it. If it’s a Compose button with a content description or text, that too is exposed to the accessibility layer and will be found. And if it’s only identifiable by a Compose testTag, GPT-Driver’s Espresso integration could handle it (assuming the Compose test integration is configured). Essentially, the same test step works on both UI types without changes.
Self-Healing and Fallbacks: A big advantage of GPT-Driver’s AI layer is self-healing locators. Whereas a normal Espresso test would hard-code a locator (and crash if it’s wrong), GPT-Driver will try alternatives if the first attempt doesn’t match. For example, if a step says “Tap the Profile icon” and initially it tries to find an element with accessibility label “Profile” and doesn’t find it, the AI might look for an image or button that semantically looks like a profile icon. This dramatically reduces brittleness when UI implementations change. As the MobileBoost docs note, GPT-Driver can use element IDs or text like other frameworks, with the added advantage of auto-correcting if those identifiers or texts change. It’s like having a tester that adapts to minor UI updates.
Handling Async and Flakiness: GPT-Driver’s design also includes smart waiting and context awareness. Because it operates at a higher level, it knows to wait for screens to load or certain conditions to be true before proceeding (you can also explicitly instruct waits in natural language). Moreover, the AI agent can handle unexpected pop-ups or minor layout changes without failing the test. For instance, if a random “Cookie consent” dialog appears (a common flakiness in staging), GPT-Driver might automatically detect it and dismiss it to proceed – something a rigid script would choke on. This kind of resilience is built-in, making tests more stable across real devices and configurations.
How GPT-Driver Bridges XML and Compose: In practice, GPT-Driver leverages the underlying automation frameworks to interact with the app. On Android, it can use Espresso for certain operations or Appium (UiAutomator2). With Espresso, GPT-Driver’s SDK for Compose can directly interface with Compose UI elements (their docs have separate setup for “View/XML based apps” and “Jetpack Compose based apps”). With Appium, GPT-Driver operates via the accessibility tree – meaning it will see both XML views and Compose elements (via descriptors) as accessibility nodes. (Developers should ensure important Compose nodes have content descriptions or enabled test tags as resource-ids for best results, but GPT-Driver’s fallback means even if not, it may use visual context or text to find what it needs.) The end result: the tester doesn’t need to know or care what UI toolkit is under the hood. You write the scenario once, and GPT-Driver’s abstraction layer interacts with the app seamlessly, whether a screen is old-school XML or the latest Compose.
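On the app side, exposing Compose elements cleanly to accessibility-based drivers (UiAutomator, Appium, and GPT-Driver when it works through that layer) is a small change. Below is a minimal sketch assuming a hypothetical Home screen of your own – the composable, tags, and strings are illustrative:
// Making Compose nodes visible to accessibility-based tools (Kotlin; names, tags, and strings are illustrative)
@OptIn(ExperimentalComposeUiApi::class)
@Composable
fun HomeScreen() {
    Scaffold(
        // Promote testTags to resource-ids so UiAutomator/Appium can match them like View IDs
        modifier = Modifier.semantics { testTagsAsResourceId = true }
    ) { padding ->
        Column(Modifier.padding(padding)) {
            Text("Welcome", modifier = Modifier.testTag("welcomeMessage"))
            Icon(
                imageVector = Icons.Default.Person,
                // The icon has no text, so give it a contentDescription for accessibility-based lookups
                contentDescription = "Profile",
                modifier = Modifier.testTag("profileIcon")
            )
        }
    }
}
With the flag enabled, a testTag such as "welcomeMessage" surfaces as the node’s resource-id in the accessibility tree, so accessibility-driven tools can match it much like an android:id on an XML view.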
Best Practices for Mixed UI Testing in CI/CD
Testing hybrid UI apps can still be done with traditional tools, but requires discipline. Here are some practical recommendations for QA teams, both with and without GPT-Driver, especially when running in continuous integration (Bitrise, Jenkins, GitHub Actions, etc.):
Use Unique Identifiers for All UI Elements: Ensure that every interactive element in both XML and Compose screens has a unique identifier for testing. In XML, that means android:id and meaningful content descriptions for accessibility. In Jetpack Compose, use Modifier.testTag("...") for elements without text, and set contentDescription for images or icons. If you plan to use Appium/UiAutomator in your tests, consider enabling Compose’s testTagsAsResourceId on your top-level composable (as per the Android docs and the Compose snippet above) so those testTags become accessible resource IDs at runtime. This will let you find Compose nodes by an ID string, similarly to XML views – reducing flaky searches.
Coordinate Synchronization: In a hybrid app, the default Espresso sync might not wait for Compose content. If you stick to Espresso tests, you may need to introduce idling resources or explicit waits when transitioning between XML and Compose portions. For example, wait for a known text from the Compose screen to appear before asserting or interacting. Compose testing provides ComposeTestRule.waitUntil, which can be used if you integrate it (see the sketch after this list). On the other hand, GPT-Driver’s engine often handles waits implicitly (it won’t try an action until the UI shows something that matches the instruction), but you can also write steps like “Wait until the Welcome screen is displayed” to be explicit. In any case, syncing with the app’s state is crucial for stable runs in CI.
Leverage Robust Tools in CI: If not using GPT-Driver, decide on a single framework for end-to-end tests in CI to avoid tool fragmentation. Many teams running hybrid UI tests in device clouds choose Appium (which uses UiAutomator) because it can cover both UI types in one flow (as noted, Espresso alone can’t do a full E2E across XML->Compose). Appium with UiAutomator will rely on accessibility attributes (text, desc) – so double down on providing those in your UI code. The downside is slower tests; mitigate this by running on powerful executors or using smaller test shards. If using Espresso/Compose tests separately, you can still automate both on CI (e.g. run the Espresso suite then the ComposeTest suite), but you won’t have one seamless user journey in a single test. Consider your goals and possibly use a combination (Espresso for pure view tests, ComposeTest for pure compose tests, and a few Appium-based smoke tests that cover critical end-to-end flows).
Integrate GPT-Driver into CI Pipelines: If you adopt GPT-Driver, treat it like any other test runner in your CI. You can upload your app build to the GPT-Driver service or use their SDK in a script that runs as a CI step. For example, on GitHub Actions or Bitrise, you might add a step to build your APK and then call GPT-Driver’s API to run the tests, or use their provided action/script to do so. The GPT-Driver documentation provides CI/CD examples for GitHub, Bitrise, Jenkins, etc., which show how to upload the app and trigger tests, then retrieve results. The key is that GPT-Driver can run tests on real or virtual devices in the cloud and report back pass/fail status. Because the tests are more resilient to minor UI changes and timing issues, you’ll likely see fewer flaky failures in your nightly runs. As a precaution, you should still run on a stable device farm and possibly retry tests on failure, but many GPT-Driver users report drastically reduced flaky test counts.
Monitor and Tune in Staging: Hybrid apps often have feature flags or gradually migrated screens. Make sure to test both versions (XML and Compose) if they might both be live. With GPT-Driver, the same test might automatically work on both, but it’s good to validate on both types of screens (you can e.g. force the old version in a staging build and run the test, then force the new version and run again). In staging or nightly environments, enable verbose logging or recording. GPT-Driver, for instance, records live test runs for review – use these recordings to troubleshoot any odd behavior. Even if GPT-Driver “worked around” an issue, you want to know if, say, a Compose element took 10 seconds to load or a fallback was needed, as that might indicate a performance problem in the app.
Keep CI Fast and Reliable: Whichever approach, optimize what you can for speed. Disable animations on test devices (this applies to both Espresso and GPT-Driver runs). Use release builds for testing (UI tests don’t need debug overhead). Run tests in parallel on multiple devices if possible (e.g. Firebase Test Lab or multiple Bitrise workflow instances) to cut overall time. GPT-Driver tests can sometimes run a bit slower than raw Espresso (due to the AI reasoning), but because they can replace many lines of code and avoid certain waits by intelligently proceeding, the difference is often small. And the pay-off is fewer false failures. On Jenkins or self-hosted runners, make sure to allocate enough memory/CPU for emulators or use cloud devices for consistency.
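Referring back to the synchronization point above, here is a minimal sketch of explicit waiting with the Compose testing APIs, assuming a single-activity hybrid app whose Compose Home content is hosted in the activity launched by the rule – the activity name, text, and timeout are illustrative:
// Waiting for Compose content after XML-driven navigation (Kotlin; names and timeout are illustrative)
@get:Rule
val composeTestRule = createAndroidComposeRule<LoginActivity>()

@Test
fun waitsForComposeHomeContent() {
    // ... Espresso steps on the XML login screen (type credentials, tap login) ...
    // Block until the Compose "Welcome" text has been composed, or fail after 5 seconds
    composeTestRule.waitUntil(timeoutMillis = 5_000) {
        composeTestRule.onAllNodesWithText("Welcome").fetchSemanticsNodes().isNotEmpty()
    }
    composeTestRule.onNodeWithText("Welcome").assertIsDisplayed()
}
This is more robust than a fixed sleep, but it is still extra plumbing that exists only because the UI toolkit changes mid-flow.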
By following these practices, teams can ensure that even a mix of Compose and XML UI can be automated with minimal flakes, whether using traditional tools or an AI-powered solution.
Example: One Test Flow – Traditional vs. GPT-Driver
Let’s walk through a simple login flow that spans an XML screen and a Compose screen, showing how you’d implement it with Espresso and then how GPT-Driver simplifies it.
Suppose we have a Login Activity built with XML (with fields and a login button), and after logging in, the app navigates to a Home screen built with Jetpack Compose (showing a “Welcome” message). Our test will enter credentials and verify the Welcome message.
Traditional Espresso/UIAutomator approach: We start the test with Espresso for the login screen. Once we submit and the Compose-based Home screen loads, we cannot use Espresso to check the Compose UI directly (Espresso can’t find Compose nodes in this scenario). We’ll use UiAutomator (via UiDevice) to look for the “Welcome” text on the screen.
// Espresso + UiAutomator hybrid test example (Kotlin)
@Test
fun loginAndSeeWelcome() {
    // 1. Interact with the XML-based Login screen using Espresso
    onView(withId(R.id.email_field)).perform(typeText("user@example.com"))
    onView(withId(R.id.password_field)).perform(typeText("password123"))
    onView(withId(R.id.login_button)).perform(click())
    // 2. The Home screen is Compose UI. Use UiAutomator to verify a Compose element.
    val device = UiDevice.getInstance(getInstrumentation())
    // Wait up to 5s for an element with the text "Welcome" to appear
    device.wait(Until.hasObject(By.text("Welcome")), 5000)
    val welcomeLabel = device.findObject(By.text("Welcome"))
    assertTrue(welcomeLabel != null && welcomeLabel.text == "Welcome")
}
In the above, we had to drop down to the lower-level UiDevice API to find the Compose UI’s “Welcome” text. We rely on the text node because that Compose element might not have any resource ID. This is workable, but if the text is dynamic or translated, the test could break. It also adds complexity (mixing Espresso and UiAutomator in one test). Note that without the device.wait(...) call, the test might have failed if the Compose content wasn’t immediately ready. This illustrates the hoops testers jump through for hybrid UIs.
GPT-Driver approach: Now, consider the same test written using GPT-Driver’s Python SDK (for example). The test doesn’t need to specify which framework to use – we just describe the steps. GPT Driver will handle each action on the appropriate UI element, whether XML or Compose.
# GPT-Driver test example (Python pseudocode)
from gptdriver_client import GptDriver
gptd = GptDriver(api_key="YOUR_API_KEY", platform="android", device_name="Pixel 5", platform_version="13.0")
# 1. Enter email and password, and tap login (GPT Driver finds the fields and button regardless of UI type)
gptd.execute("Enter 'user@example.com' into the email field")
gptd.execute("Type 'password123' into the password field")
gptd.execute("Tap the Login button")
# 2. Assert that we see the Welcome screen (GPT-Driver will look for a "Welcome" indicator on the new screen)
gptd.assert_condition("The Welcome screen is displayed")
Notice how the GPT-Driver code is declarative and high-level. We didn’t have to switch context or call different APIs when the app moved from XML to Compose. The instruction “Tap the Login button” would work whether that button is an android.widget.Button with the text “Login” or a Compose Button with a contentDescription “Login” – GPT-Driver will identify the intended element through the UI hierarchy or even by using its semantic understanding of the screen. Similarly, the final assert_condition in plain English might internally check that a label “Welcome” is visible or some other indicator of the Home screen, but we didn’t have to script that manually.
The result is a single test flow that’s easier to read and maintain. There’s no branching logic for Compose vs XML. If tomorrow the login screen is also rewritten in Compose, the GPT-Driver test likely still works (it will find the fields by their labels or accessibility text). If the “Welcome” text changes to “Hello”, a traditional test would fail an exact match assertion, but GPT-Driver’s assertion could be written more flexibly (or the AI might even catch the change if instructed generally to look for a welcome screen). This illustrates how GPT-Driver abstracts away the hybrid UI complexity while standard frameworks require extra code to handle it.
(In practice, GPT-Driver runs the above steps on a device or emulator. Under the hood it might use Espresso to type into fields if accessible, or Appium to tap the button – but these details are managed by GPT-Driver’s engine. The QA engineer only worries about the test logic.)
Closing Takeaways
For teams dealing with a mix of XML and Jetpack Compose UIs, traditional test automation can indeed run into conflicts – mainly due to locator mismatches and tooling gaps between the UI frameworks. These issues make tests fragile and time-consuming to upkeep. Solutions like duplicating tests or using UIAutomator hacks can work, but at the cost of complexity and reliability.
GPT-Driver offers a modern approach by abstracting UI differences with AI-powered context awareness. It allows one test description to work across old and new UI implementations, resolving elements in a unified way and healing itself when things change. This leads to more stable tests in CI pipelines and lets teams focus on testing the app’s behavior rather than babysitting the test code. As reported by early adopters, it can significantly cut down flaky failures and maintenance effort – for example, Duolingo’s QA team was able to reduce manual regression testing by 70% after adopting GPT-Driver.
When evaluating your testing stack, consider how well it handles hybrid app scenarios. If your Android app is mid-transition to Compose (a reality for many in 2025), ensure your test strategy won’t break at the seams. You might invest in better in-house frameworks, or explore AI-driven tools like GPT-Driver to handle the heavy lifting. The key lesson is that robust automation should be resilient to UI changes. Testing tools are evolving to meet this need by working at a higher level of abstraction.
In summary, using GPT-Driver across XML and Compose UIs can eliminate the usual conflicts seen in hybrid UI testing. It empowers QA engineers to write one set of tests for the whole app without worrying about the underlying UI toolkit. Ultimately, this means faster test development, more reliable CI results (on Bitrise, Jenkins, GitHub Actions, or whichever platform), and confidence that as your app’s UI evolves, your tests will keep up with far less effort. For engineering and QA leads, that means fewer headaches and more time focused on building quality features rather than fixing broken tests.
Answering the question directly: Yes, there are inherent conflicts when using traditional frameworks on apps with both XML and Compose UIs – locators and timing issues can cause flaky tests. GPT-Driver mitigates these conflicts by providing a unified automation layer that handles both UI types seamlessly, so the same test can run on hybrid screens without brittle workarounds.