diff --git a/content/lab-deployment-plan.md b/content/lab-deployment-plan.md
new file mode 100644
index 0000000000000000000000000000000000000000..d11d2afd3716d6a7789d83293a1d5adf70659f82
--- /dev/null
+++ b/content/lab-deployment-plan.md
@@ -0,0 +1,70 @@
+---
+title: Collabora Lava Lab device deployment plan
+---
+
+## In stock, ready for deployment
+### Ampere emag servers
+* Quantity: 4
+* Ready to go in next batch
+* 1 can go in straight away; it just needs some DHCP testing so that the correct EFI GRUB is sent to it
+* 3 others need some firmware flashing as well.
+* [T34634](https://phabricator.collabora.com/T34634)
+
+## In stock, awaiting dependencies
+### Chromebook Tomato (cherry) Acer
+* Quantity: 12
+* Ready to be deployed when new dispatchers are set up
+* [T36522](https://phabricator.collabora.com/T36522)
+
+### Renegade elite (rk3399)
+* Quantity: 5
+* Just needs a tweak to the docs to point to the right firmware, so that any units we add or re-flash match the way the current ones are set up
+* [T38110](https://phabricator.collabora.com/T38110)
+
+### Rock 5B
+* Quantity: NA
+* A couple could go in, but they are of a different spec. Lower priority
+
+### Ampere Mt jade
+* Quantity: 1(?)
+* Awaiting confirmation we can use it in the Lab – the unit we have was pre-release
+
+## With engineer for integration
+### Chromebook Berknip (zork) HP
+* Quantity: 6
+* One on staging so Laura can work on depthcharge
+* [T40291](https://phabricator.collabora.com/T40291)
+
+### Chromebook Dewatt (guybrush) Acer
+* Quantity: 12
+* In bring-up with Laura/Lucas
+* [T39039](https://phabricator.collabora.com/T39039)
+
+### Apertis potential new renesas
+* Quantity: 5
+* Apertis working on roadmap for lab deployment
+
+### Chromebook Kaisa (puff)
+* Quantity: 12
+* On its way to Laura for her to work on depthcharge
+* [T40243](https://phabricator.collabora.com/T40243)
+
+### Chromebook Volmar (brya) Acer
+* Quantity: 12
+* On its way to Laura for her to work on depthcharge
+* [T40244](https://phabricator.collabora.com/T40244)
+
+## Awaiting stock
+### Chromebook arcada
+* Quantity: NA
+* Not with us yet. Nick working on customs invoice with Google
+
+### Chromebook Volteer
+* Quantity: 5 or 10
+* Not with us yet. Nick working on customs invoice with Google
+* Mesa would like some more of these before the split, so we are in communication with Google to source more.
+* [T39591](https://phabricator.collabora.com/T39591)
+
+## Potential but unknown
+### TI AM62xx ?? 5 – 10
+* Quantity: 5 - 10
diff --git a/content/roadmap.md b/content/roadmap.md
new file mode 100644
index 0000000000000000000000000000000000000000..f770d5dd62628354ab17d9f20395b0f00d19d2a2
--- /dev/null
+++ b/content/roadmap.md
@@ -0,0 +1,133 @@
+---
+title: LAVA team roadmap
+---
+
+## LAVA Development
+### Internals (T31327)
+#### Review internal/external LAVA server-worker API
+* Find differences between internal/external flow
+* Verify if it can be unified
+* Reduce LAVA code base by reusing common components
+
+#### Improve job logs
+* Lower occurrences of "Listened connection for namespace '%s' for up to %ds" message (T37051)
+* Consider `\r` as a valid line end marker when monitoring the DUT's console (T37054)
+  - Issue reported upstream: https://git.lavasoftware.org/lava/lava/-/issues/561
+* Allow keeping escape control characters (T37055)
+  - Both items above resolved by: https://gitlab.collabora.com/lava/lava/-/merge_requests/120 (to be upstreamed)
+
+#### Traffic reduction (T32184)
+* Main goal is to provide new, more efficient ways of handling logs
+* Keep in mind to document any dropped solution proposal
+* Start by mimicking Open Build Service log handling
+
+#### Benchmarks (T32182)
+* Ping upstream for review, update demo-related branches across all relevant repositories (less than half day)
+* Add benchmarks for frequently used API endpoints (less than quarter day)
+* Enable benchmarking pipeline at least in the internal GitLab (less than half day)
+* Extend benchmarking scenarios (for generated database and tests)
+* Review bottlenecks found by benchmarks (preferably with solution proposals)
+* Submit a blog post with rationale and implementation details
+
+### Option for disabling viewinggroups
+* [LAVA MR 1942](https://git.lavasoftware.org/lava/lava/-/merge_requests/1942)
+* Awaiting approval or decline by LAVA Team
+
+### Revise stats collection in the database
+* Review index usage and look for little-used ones – drop them from Django or from Postgres
+
+### Postgres Vacuum
+* Periodic stall check; Kubernetes provides support for long-running crontabs
+
+### DB Use cases
+* Which package should it be put in? lava-dev, lava-debug? (latter does not exist yet)
+
+### Job output compression
+* Currently timing out - do binary chop on compression period
+
+## LAVA CI
+* Results comparison using internal pytest-benchmark mechanism
+
+## Security
+### Codebase review
+* Run as GitLab runners?
+#### Automated scanning:
+* [Verifying Django generated HTML](https://github.com/peterbe/django-html-validator)
+* [Finding security flaws in Python](https://pypi.org/project/bandit/)
+* [Being fixed by LAVA team](https://git.lavasoftware.org/lava/lava/-/issues/584)
+* [Python code quality checker](https://github.com/PyCQA/pyflakes)
+
+
+## System administration
+### Resource issues
+* What if someone is unavailable - how do we mitigate - create a plan
+
+### Alerting for predictable defects
+* If support services are unavailable or are about to become unavailable, alert and remedy.
+
+### Storing and extracting metadata: Loki, Prometheus/Victoria/Mimir
+* Kubernetes only stores 10MB of data – with large logs, we lose data. Develop a mitigation strategy.
+* Sometimes Loki loses connection after upgrade. Investigate underlying causes
+
+### Postgres optimization
+* Use Unix sockets instead of TCP, outline comparison
+* Find out what performance benefit, if any, would result
+
+### Dispatcher version synchronisation
+* Plan a move to lavapeur and automated upgrades
+
+### Device controllers
+#### Fleet management
+#### Conserver, PDU control etc, etc.
+* Analyse actual reasons for issue occurrences
+
+#### Align deployment
+* Docker image alignment with upstream
+
+#### Consider Prometheus alternatives
+* Investigate and produce a plan if a suitable alternative is found
+
+## Monitoring
+### Revisit db index usage
+* [How often is it updated?](https://monitoring.core.collabora.dev/d/IDWko4VVk/postgresql-stats)
+* Replace ratio value with cache misses
+* Add Grafana alerts for potential defects
+
+## LAVA Lab device integration and deployment
+* See deployment roadmap in GitLab
+
+## Operator's perspective
+### Hardware management
+
+#### Configuration and fleet management (controller boards)
+* Unify configuration management to use Ansible, e.g. for device configuration changes rollout (T21468)
+* Move DUT controlling utilities (pdudaemon, conserver, etc.) from dispatcher to external [Target Managers](https://elinux.org/Test_Glossary)
+
+#### Operator's routines
+* Provide a list of _known failures_ (e.g. pending external support) to prevent ignoring new alerts
+* Add a _"blame hardware"_ CronJob for issues resolved by reseating connections
+
+### Administration and integration
+#### Monitoring cloud-friendliness (T32181)
+* Check which tracing solution (Sentry, Jaeger, etc.) fits best with current setup
+* Provide minimal working setup for initial testing and change verification
+* Add tracing service to the deployment
+
+#### Investigate available storage solutions
+* Take into account products other than Kubernetes volumes
+* Compare benefit-to-cost ratios
+* Keep in mind storage size reduction efforts (outdated jobs, job artifacts removal)
+
+#### Component upgrades: Synchronize dispatcher version with server
+* Determine how the dispatcher version is exposed and when upgrade should be enforced (half day)
+* Verify if upstream approach with host daemon can be reused or improved (half day - a day)
+* Verify dispatcher upgrade mechanism with Kubernetes-based server (half day)
+
+#### Component upgrades: Extend component version management
+* Set up mirror repository with a CI job triggered by a new tag (less than half day)
+* Rebase staging branch on the new release assuming no merge conflicts - to be reviewed manually (half day)
+* Determine which components might need version pinning/manual upgrades (if any)
+
+#### Batch processing
+* Parse job output from [lava-gitlab-runner](https://gitlab.collabora.com/lava/lava-gitlab-runner)
+