
How to Run Mobile Tests on Specific High-Priority Devices in CI Pipelines

  • Christian Schiller
  • Sept 11
  • 10 min read

Imagine your continuous integration (CI) pipeline automatically testing your app on the Samsung Galaxy S24 Ultra and iPhone 13 – exactly the high-priority devices your customers use. This is not only possible; it’s increasingly essential for mobile QA. Modern mobile ecosystems are highly fragmented, with thousands of device models and OS versions in use. Given limited time and resources, QA teams must focus on a subset of top devices to catch critical bugs. These flagship models (like the Galaxy S24 Ultra and iPhone 13) often represent a large portion of your user base, so ensuring they pass all tests in CI helps prevent high-impact regressions. Because real devices are expensive to provision and maintain, real-device testing in practice concentrates on a short list of high-priority models. In short, targeting specific devices in your CI pipeline is both feasible and vital for coverage and stability.


Why Targeting Specific Devices Matters


Device Fragmentation: The mobile market spans an overwhelming array of hardware and OS combinations. Android alone had over 24,000 unique devices in circulation by 2024, while iOS had ~1,800 models. It’s impossible to test everything. Instead, teams prioritize a handful of devices that cover the majority of their users. For example, focusing on the top 10 devices can cover about 80% of user traffic. By running tests on models like the S24 Ultra (latest Galaxy flagship) and iPhone 13 (still widely used iPhone), you maximize real-world coverage with minimal devices.


Real-World Coverage: Different devices can expose different issues. Manufacturers often customize Android (e.g. Samsung’s One UI) in ways that affect app behavior. A layout bug or crash might appear on a Galaxy phone but not on a Google Pixel, or on iOS 16 but not iOS 17. Testing on real high-priority hardware catches these device-specific bugs early. It also verifies performance on actual devices – e.g. ensuring the app runs smoothly on the S24 Ultra’s high-res screen and that iPhone 13 (with its older chip) handles new features. This protects the user experience for your largest customer segments.


CI Pipeline Stability: Incorporating specific devices into CI provides a safety net before release. By automatically running critical tests on your top devices each build, you ensure that if a change breaks the app on, say, iPhone 13, the pipeline flags it immediately. This prevents shipping a regression that affects thousands of users. Moreover, using consistent target devices can reduce flakiness in CI – you aren’t dealing with unpredictable environments each run. As long as those devices (or their equivalents in a device cloud) are available, your tests have a stable environment. In practice, teams often maintain a “golden device” list for CI smoke tests (e.g. latest iPhone, latest Samsung) to balance speed and coverage.


Traditional Approaches to Device Targeting (and Their Limits)


Historically, targeting a specific phone in test automation required a lot of manual setup. In frameworks like Appium or Espresso, you must hard-code the device identifier or model in your test capabilities or config. For example, an Appium test can specify a particular Android device by its udid (unique device ID). This means someone has to find and update that device ID in the test config for each target device. If multiple devices are connected or available, a misconfiguration can even direct the test to the wrong one.
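
To make this concrete, here is a minimal sketch of that hard-coded style using the Appium Python client. The serial number, app path, and server URL are placeholders for your own environment, and the exact options API varies between client versions.

```python
# Minimal sketch: hard-coding one target device with the Appium Python client.
# The serial, app path, and server URL below are placeholders.
from appium import webdriver
from appium.options.android import UiAutomator2Options

options = UiAutomator2Options()
options.device_name = "Galaxy S24 Ultra"   # informational label
options.udid = "R5CX12ABCDE"               # placeholder serial from `adb devices`
options.platform_version = "14"
options.app = "/builds/app-release.apk"    # placeholder path to the build under test

# If this exact udid is offline or busy, session creation fails,
# which is exactly the brittleness described above.
driver = webdriver.Remote("http://127.0.0.1:4723", options=options)
# ... run test steps ...
driver.quit()
```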

Common tactics include:


  • Hard-Coding Device Names/IDs: Simple but brittle. You set your desired device (e.g. “iPhone 13 – iOS 16” or a specific device ID) in the CI job or test script. This works as long as that exact device is accessible. The drawback is that if the device is offline, busy, or its ID changes (in clouds or after resets), the tests fail or run on a different device. It’s also static – every time you want to change the target (say to a new model or OS), you must manually edit the config.


  • Custom Scripts for Device Selection: Some teams write scripts to dynamically pick an available device from a pool (a minimal sketch of this pattern follows this list). For instance, a script might query a local device lab or cloud for a “Galaxy S24 Ultra” and allocate it to the test. While this adds flexibility, it’s error-prone. You have to handle cases where the device isn’t free, manage concurrency, and update the script for new models. These scripts can break if device names or APIs change, leading to maintenance headaches.


  • Device Cloud Tagging/Pools: Device cloud services (AWS Device Farm, BrowserStack, etc.) allow grouping devices or using filters. You might create a “HighPriority” device pool containing your target models. In CI, you then run tests against that pool. This abstracts the low-level IDs – if one device is busy, the cloud might use another from the pool. It improves availability but not determinism: the test could end up on a slightly different model if not carefully constrained. Also, setting up and updating pools is a manual process. If you truly need that exact model and OS, you often still end up specifying it by name in the cloud config.
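
For illustration, here is a minimal sketch of the custom-script approach from the list above: it shells out to adb to find a connected device whose model matches a target, then hands that serial to the test run. The model code and the fail-fast behavior are illustrative assumptions; a production script would also handle busy devices, retries, and cloud APIs.

```python
# Sketch: pick a connected Android device by model name via adb.
import subprocess
import sys

TARGET_MODEL = "SM-S928B"  # example model code for a Galaxy S24 Ultra

def connected_serials():
    # `adb devices` prints a header line, then one "serial\tstate" line per device.
    out = subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout
    return [line.split("\t")[0]
            for line in out.splitlines()[1:]
            if line.strip().endswith("device")]  # skip offline/unauthorized entries

def model_of(serial):
    out = subprocess.run(
        ["adb", "-s", serial, "shell", "getprop", "ro.product.model"],
        capture_output=True, text=True)
    return out.stdout.strip()

serial = next((s for s in connected_serials() if model_of(s) == TARGET_MODEL), None)
if serial is None:
    sys.exit(f"No connected device matching {TARGET_MODEL}; failing the CI job.")
print(f"Running tests on {serial}")  # pass this serial as the Appium udid
```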


Pros and Cons: Traditional approaches do let you run on specific hardware, but with notable downsides. Hard-coded configs are fragile (one small change in the environment can derail the test). Scripting and tagging solutions add complexity and still require constant curation of device lists. In fast-paced CI/CD environments, these manual steps can slow down iteration and lead to “mis-targeting” – e.g. your test unknowingly ran on a similar device but not the one with the known bug, so the bug slipped through. Or the test failed because the device was unavailable, wasting a CI cycle. In summary, while you can target devices with legacy frameworks, it’s a brittle process that doesn’t scale well as devices and teams change.


GPT Driver’s AI-Enhanced Approach to Device Targeting


Modern AI-driven testing solutions like GPT Driver aim to simplify this problem. GPT Driver provides a no-code/low-code platform specifically built to integrate with device clouds and labs, allowing teams to target exact models and OS versions with minimal effort. How does it differ from the traditional methods?


  • Natural Language Test Definition: Instead of writing code or fiddling with device IDs, you describe tests in plain English. For example, a QA could write, “On an iPhone 13, log in with valid credentials and verify the welcome message,” as a test step. The GPT Driver platform understands the intent and handles the device setup behind the scenes. This means anyone on the team can specify the target device in an intuitive way, without deep Appium knowledge. Duolingo’s QA team, for instance, was able to write test steps in natural language (“Tap on the profile tab icon…”) and watch them execute on a virtual device, with hardly any coding.


  • Easy Device Configuration: GPT Driver’s interface includes straightforward device selection settings – you can pick the exact phone model, OS version, and even locale from a dropdown, no scripting needed. Under the hood, GPT Driver integrates with various device providers (cloud services or on-premise device labs). Once you select, say, Samsung Galaxy S24 Ultra – Android 14, the platform takes care of provisioning that device for the test run. It can launch the appropriate emulator/simulator or reserve a real device from the cloud automatically. This integration saves you from dealing with raw device IDs or constantly updating capabilities. (Of course, you’ll need access to the device in some form – GPT Driver works with both virtual devices and real physical devices as targets.)


  • Deterministic Execution with AI Assist: A concern with any AI-driven tool is consistency – tests should produce the same result every run. GPT Driver was built to ensure deterministic test execution despite using AI. It uses techniques like zero-temperature prompts (ensuring the AI doesn’t inject randomness) and versioned prompts so the same inputs yield the same actions each time. In practice, GPT Driver combines traditional command-based steps with AI reasoning. It will use a reliable command (like an Appium-style locator tap) whenever possible, and only fall back to AI guidance if something unexpected occurs (e.g. a pop-up appears). This hybrid approach gives you consistency on known paths and flexibility on new or changed screens. The result is that tests on specific devices run repeatably and with low flakiness. In fact, the platform’s visual+LLM engine was shown to reduce false failures (“false alarms”) compared to purely scripted tests – a big win for CI stability.


  • Seamless CI Integration: GPT Driver is designed to plug into CI/CD pipelines easily. You can trigger tests via its CLI or API as part of your build process. Because device targeting is handled by configuration, your pipeline doesn’t need complex logic per device. You simply specify which test suite to run and which device configuration to use. For example, you might have a pipeline step that calls: “Run SmokeTests on Galaxy_S24_Ultra_Android_14 and iPhone_13_iOS_16.” The AI agent will acquire those devices (via the integrated cloud or lab), deploy the app build, and execute the tests. If those devices are unavailable, GPT Driver can queue the run or, if configured, fall back to an equivalent device, so the pipeline doesn’t simply break. From the CI perspective, you get a pass/fail result for each device’s run, along with rich reporting (screenshots, logs, etc.) for debugging. This abstraction means QA leads can assign tests to high-priority devices with a simple setting, and trust the platform to manage the device orchestration reliably.
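
As an illustration of what such a pipeline step might look like, here is a hypothetical sketch in Python. The endpoint URL, payload fields, and environment variables are invented placeholders, not GPT Driver’s documented API – consult the product docs for the real CLI or API interface.

```python
# Hypothetical sketch of triggering device-targeted runs from CI.
# Endpoint, payload fields, and env vars are illustrative placeholders only.
import os
import requests

API_URL = "https://api.example-gptdriver.test/v1/runs"  # placeholder URL
DEVICE_PROFILES = ["Galaxy_S24_Ultra_Android_14", "iPhone_13_iOS_16"]

for profile in DEVICE_PROFILES:
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['GPTD_API_KEY']}"},  # placeholder env var
        json={
            "suite": "SmokeTests",            # which test suite to run
            "device_profile": profile,        # which device configuration to use
            "build_url": os.environ["APP_BUILD_URL"],  # placeholder: the CI build artifact
        },
        timeout=30,
    )
    resp.raise_for_status()  # fail the pipeline step if the trigger itself fails
    print(f"Triggered SmokeTests on {profile}: run id {resp.json().get('id')}")
```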


Best Practices for Prioritizing Devices in CI


Even with the right tooling, you should apply strategic thinking to which devices to prioritize and how to structure your pipeline:


  1. Identify High-Priority Devices: Use data to drive this. Check your app’s analytics or market research to see which devices (models and OS versions) dominate your user base. High-priority might include the latest flagship phones (e.g. Galaxy S24 Ultra), popular older models (e.g. iPhone 13 or an older Samsung A-series if that’s big in your market), and any device/OS known to have caused issues in the past. Focus on a handful – perhaps the top 5 devices that cover a large percentage of users.

  2. Designate Critical Test Suites: Not every test needs to run on every high-priority device on each commit – that would be slow. Instead, create a smoke test or critical path suite (covering login, signup, checkout, etc.). These are fast, essential tests that you will run on all your must-cover devices in CI. This way, if something fundamental breaks on a top device, you catch it immediately. Less critical or more exhaustive tests can run on a broader device matrix less frequently (e.g. nightly).

  3. Leverage Parallelism for Coverage: Modern CI and device clouds allow parallel test execution. Take advantage of that. For example, run your smoke tests in parallel on an iPhone and an Android device to get results faster (see the sketch after this list). GPT Driver supports parallel execution, which helps keep pipeline times reasonable even when targeting multiple devices. Similarly, you can run different groups of tests on different devices concurrently (one job on iPhone 13, another on S24 Ultra, others on emulators or secondary devices). This balances speed and coverage.

  4. Use Device Pools/Profiles for Flexibility: If using a device cloud directly, maintain device pools for easy switching. With GPT Driver, maintain device configuration profiles – e.g. one profile for “Latest iPhone (iOS 17)” that you update as new models come out (this pattern is also shown in the sketch after this list). This way you can adjust the actual device model in one place, and all tests using that profile will automatically target the new device. Regularly revisit which devices are high-priority (market share can shift every few months, and new OS versions roll out). Update your CI device list at least quarterly to stay aligned with reality.

  5. Monitor and Refine: Treat device targeting as a living strategy. Monitor test results across devices – if a certain model is consistently passing without issues, but another secondary device starts showing unique bugs, you might re-prioritize. Also keep an eye on test stability. If one device is yielding flaky results (maybe due to hardware quirks or lab issues), consider swapping it out or adding self-healing logic. The goal is to ensure your CI gives reliable feedback. High-priority tests should be signal, not noise, so you can trust failures as real problems.
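
To make points 3 and 4 concrete, here is a small sketch: device profiles live in one central mapping (updated in one place when a new model ships), and a thread pool fans the smoke suite out across profiles in parallel. The run_smoke_suite helper is a hypothetical stand-in for whatever actually executes your tests (an Appium session, a cloud API call, or a GPT Driver trigger).

```python
# Sketch: central device profiles plus parallel fan-out across them.
from concurrent.futures import ThreadPoolExecutor

# Update a profile here once, and every pipeline that uses it follows along.
DEVICE_PROFILES = {
    "latest_iphone": {"model": "iPhone 13", "os": "iOS 16"},
    "latest_samsung": {"model": "Galaxy S24 Ultra", "os": "Android 14"},
}

def run_smoke_suite(name, profile):
    # Hypothetical stand-in: start a session on `profile`, run the smoke
    # tests, and return True on pass. Replace with your real runner.
    print(f"[{name}] running smoke suite on {profile['model']} ({profile['os']})")
    return True

with ThreadPoolExecutor(max_workers=len(DEVICE_PROFILES)) as pool:
    futures = {name: pool.submit(run_smoke_suite, name, profile)
               for name, profile in DEVICE_PROFILES.items()}
    results = {name: f.result() for name, f in futures.items()}

# Fail the CI job if any high-priority device failed.
if not all(results.values()):
    raise SystemExit(f"Smoke failures on: {[n for n, ok in results.items() if not ok]}")
```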


Example: Smoke Tests on Top Devices, Broader Regression in Parallel


To illustrate, let’s say we have an e-commerce app. Analytics show most of our users are on iPhone 13 (iOS 16) and Samsung Galaxy S24 Ultra (Android 14), with a long tail on other devices. We define the checkout flow and login flow as critical. In our CI pipeline, we configure GPT Driver to run the “Critical Smoke Suite” on those two devices every build. When a developer opens a pull request, the CI triggers these tests. GPT Driver spins up an iPhone 13 and a Galaxy S24 Ultra in the cloud, deploys the app, and runs the smoke tests in parallel. Suppose a new bug causes a checkout crash on iOS – the iPhone 13 test fails and immediately flags the PR. Meanwhile, the Android device passes. The developer sees the failure report (with video and logs from the iPhone run) and can fix the issue before merging.


At the same time, the pipeline (or a nightly job) can kick off a broader regression suite on a range of devices. Using the device cloud integration, GPT Driver might run the full test suite across, say, 10 devices: those two high-priority ones plus others like Pixel phones, more iPhone models, tablets, etc. These can run concurrently without blocking the merge. They might surface less critical bugs (e.g. a layout issue on a Pixel 6). The QA team reviews those results separately. This two-tier approach ensures fast feedback on must-pass devices and comprehensive coverage overall. The high-priority device tests act as gatekeepers for quality, while the wider tests maintain confidence that the app isn’t broken on other devices either.


In practice, teams using GPT Driver have found it much easier to orchestrate such setups. You assign tests to device profiles in the tool’s dashboard, and the AI takes care of execution across the board. The visual AI capabilities also mean if a pop-up only appears on the Samsung device (say due to an OS-specific permission), the test can handle it gracefully instead of failing – reducing flaky failures. This all contributes to faster, more reliable CI runs on real devices.


Key Takeaways


Running tests on specific high-priority mobile devices in CI is not only possible – it’s become a QA best practice for robust releases. Focusing on crucial devices like the Galaxy S24 Ultra and iPhone 13 addresses the reality of device fragmentation by covering the largest user segments with real hardware. Traditionally, engineers achieved this with manual configurations, device IDs, and custom scripts, which worked but were labor-intensive and brittle. We’ve seen how mis-targeting or rigid setups in those approaches can lead to wasted cycles or missed bugs.


AI-enhanced solutions like GPT Driver significantly simplify targeting exact devices. By allowing test authors to describe scenarios in natural language and handling the device selection under the hood, GPT Driver removes much of the overhead. It ensures tests run deterministically and consistently on the chosen models, and it integrates with CI/CD so teams can seamlessly include real device testing in their pipelines. The combination of plain-English test cases with behind-the-scenes command execution offers the best of both worlds: ease of use and reliable execution.


For QA leads and senior engineers, the path forward is clear. Identify your high-impact devices, automate testing on them in CI, and leverage modern tools to reduce the maintenance burden. This approach will catch critical issues on the devices your users care about most, without slowing down your development cycle. Whether through GPT Driver’s no-code platform or a well-tuned traditional setup, investing in targeted device testing pays off in higher confidence and quality. In summary, yes – you can and should run tests on specific devices like the S24 Ultra and iPhone 13 in CI, and doing so will help ensure that “works on my machine” truly means works on my users’ devices too.

 
 