Posts: 187
Threads: 22
Joined: Dec 2020
Reputation:
1
Hi!
I wondered why Tilengine didn't have vertical raster effects or per pixel effects while the Commodore Amiga can do stuff each 8 pixels, or less in demos.
Well, I try to create my own graphics engine in Nim with not only horizontal raster effects, but also vertical and both at once.
Aaaand well... It hurts performances a lot. It now makes sense why that kind of feature isn't implemented.
But I wonder about another thing, how do you do to reach such high performances to draw at least a simple tilemap? I hardly go under 7ms per frame drawing for one layer and a 384x216 screen.
I also read Tilengine's code to understand how rendering is done, maybe I didn't understood well, but it seems you draw each scanlines each times as the number of layers. is it that?
Posts: 673
Threads: 33
Joined: Jan 1970
Reputation:
13
Hi!
The Commodore Amiga has a totally different architecture compared to consoles and arcade boards, the ones that inspired Tilengine design.
The Amiga doesn't have a hardware tilemap/tileset engine (character generator), but instead a planar framebuffer assisted by a hardware blitter -a unit that copies variable size chunks of graphics from RAM to the framebuffer-. It's also assisted by the copper, an additional coprocessor that can execute very simple programs in parallel to the main CPU. These basic programs are called "copper lists" and allow to write registes of the other units -blitter, color palettes, sprites, scroll, etc- in sync with the retrace beam of the monitor. In fact the copper can only execute two types of instructions: write registers and wait until the retrace beam is at a given x,y position. So the Amiga doesn't have a per-pixel interrupt, but a dedicated coprocessor that can block execution until the beam reaches a given position, wth a granularity of 4 pixels wide.
This is in contrast with console/arcade systems -and Tilengine too- where an horizontal interrupt is generated after each scanline has been completed, and a vertical one when the whole frame has been completed. These interrupts require the main CPU to attend them whenever they trigger, pausing whatever the CPU is doing at that moment. There's no such thing as "per-pixel interrupt".
The rendering loop in tilengine composites each scanline, from top to bottom, drawing sequentially whatever happens to be at that scanline, and then calls the raster callback. Real hardware composites from left to right synchronised with the retrace beam. For each pixel the hardware must take the layers, sprites, priorities, blending, etc, combine all and ouptut the pixel. This is done in dedicated hardware. On the other hand Tilengine doesn't work per-pixel but per-item. It does multiple passes per scanline, overdrawing layers and sprites as needed. This allows a much higher performance, but the end result is the same than traditional hardware. In actual console/arcade hardware you cannot tune video register in the middle of a scanline.
Hope to have clarified a bit these architectures!
Posts: 187
Threads: 22
Joined: Dec 2020
Reputation:
1
08-19-2023, 05:16 AM
(This post was last modified: 08-19-2023, 05:24 AM by System64.)
(08-19-2023, 04:56 AM)megamarc Wrote: Hi!
The Commodore Amiga has a totally different architecture compared to consoles and arcade boards, the ones that inspired Tilengine design.
The Amiga doesn't have a hardware tilemap/tileset engine (character generator), but instead a planar framebuffer assisted by a hardware blitter -a unit that copies variable size chunks of graphics from RAM to the framebuffer-. It's also assisted by the copper, an additional coprocessor that can execute very simple programs in parallel to the main CPU. These basic programs are called "copper lists" and allow to write registes of the other units -blitter, color palettes, sprites, scroll, etc- in sync with the retrace beam of the monitor. In fact the copper can only execute two types of instructions: write registers and wait until the retrace beam is at a given x,y position. So the Amiga doesn't have a per-pixel interrupt, but a dedicated coprocessor that can block execution until the beam reaches a given position, wth a granularity of 4 pixels wide.
This is in contrast with console/arcade systems -and Tilengine too- where an horizontal interrupt is generated after each scanline has been completed, and a vertical one when the whole frame has been completed. These interrupts require the main CPU to attend them whenever they trigger, pausing whatever the CPU is doing at that moment. There's no such thing as "per-pixel interrupt".
The rendering loop in tilengine composites each scanline, from top to bottom, drawing sequentially whatever happens to be at that scanline, and then calls the raster callback. Real hardware composites from left to right synchronised with the retrace beam. For each pixel the hardware must take the layers, sprites, priorities, blending, etc, combine all and ouptut the pixel. This is done in dedicated hardware. On the other hand Tilengine doesn't work per-pixel but per-item. It does multiple passes per scanline, overdrawing layers and sprites as needed. This allows a much higher performance, but the end result is the same than traditional hardware. In actual console/arcade hardware you cannot tune video register in the middle of a scanline.
Hope to have clarified a bit these architectures!
This is pretty interesting! I thought you work per pixel instead of per scanline, I thought calculating everything for one pixel and then move to the next one and once the scanline is fully drawn, you execute the callback, then move to the next scaline and redo the process per pixel.
I thought overdrawing was a bad thing, since of, if in this case we have 3 layers and a resolution of 256x224, you would process 172.032 pixels. This is why I thought I should compute one pixel at a time instead of one scanline at a time.
Edit : How do you store the tileset pictures in memory? do you store it in a big byte array and then store one line, then another one, and then another one and so one? Or do you store the first line of the actual tile, then the next one, ... until the end of the tile and then move to the next one? Or even another method?
Posts: 673
Threads: 33
Joined: Jan 1970
Reputation:
13
Hi!
There are two things that kill performance on a CPU: branching (i.e. conditionals) and scattered memory fetches. Writing a pixel -that is, writing a memory location- is way more cheaper. So it's much more efficient to do a bit of overdraw but reduce the number of branching and memory reads. Working per-pixel requires lots of test conditions and reads from differents sources until the final pixel value is determined. On the other hand, once the start addresses of the items are determined, the pixels of the item are outputted in a row with very little branching, and with high cache coherence: as the source pixels are adjacent in memory, chances of being already cached are much higher than doing scattered reads. This is at the cost of a bit of overdraw, but as you have found yourself, this is much more efficient in terms of performance.
I store the scanlines of tilesets sequentially:
- 1st row of tile 0
- 2nd row of tile 0
- ...
- nth row of tile 0
- 1st row of tile 1
- 2nd row of tile 1
- ...
- nth row of tile 1
Posts: 187
Threads: 22
Joined: Dec 2020
Reputation:
1
08-20-2023, 08:32 AM
(This post was last modified: 08-20-2023, 10:58 AM by System64.)
(08-19-2023, 04:28 PM)megamarc Wrote: Hi!
There are two things that kill performance on a CPU: branching (i.e. conditionals) and scattered memory fetches. Writing a pixel -that is, writing a memory location- is way more cheaper. So it's much more efficient to do a bit of overdraw but reduce the number of branching and memory reads. Working per-pixel requires lots of test conditions and reads from differents sources until the final pixel value is determined. On the other hand, once the start addresses of the items are determined, the pixels of the item are outputted in a row with very little branching, and with high cache coherence: as the source pixels are adjacent in memory, chances of being already cached are much higher than doing scattered reads. This is at the cost of a bit of overdraw, but as you have found yourself, this is much more efficient in terms of performance.
I store the scanlines of tilesets sequentially:
- 1st row of tile 0
- 2nd row of tile 0
- ...
- nth row of tile 0
- 1st row of tile 1
- 2nd row of tile 1
- ...
- nth row of tile 1
You are right! This is way more efficient!
However, performances might suffer when introducing scaling and affine transformations, especially on multiple layers at once.
Also with rotated tiles, since you don't access contigous data.
However, vertical and horizontal flips are fine, but if statements are required
Edit : Scaling seems pretty easy : You just need to multiply the coordinates of the pixel you are drawing by a factor (one factor for X, and another one for Y)
Posts: 673
Threads: 33
Joined: Jan 1970
Reputation:
13
Hi!
Basic scaling (Neo-Geo style) is quite cheap, if fact zoomed-in layers have more pixel throughput than regular layers, because as each tile covers more screen space, fewer calls are required to fetch the next tile. Affine transformations (SNES-style) are the opposite: as the source and destination scanning are not parallel, every single pixel must fetch its parent tile, and what position occupies inside the tile. Thus these kind of layers offer the worst performance.
90º rotated tiles and sprites may have more cache misses than regular ones, but the overall performance is somewhat similar and it's a rarely used feature. Horizontal and vertical flips are free. When a tile is fetched, some checks and assignments must be done to determine scanning pointers (accounting for flips and 901 rotations), but once they're setup, all the pixels of the tile are output straight. By this rule, 16x16 tiles offer better performance than 8x8.
However if you check the benchmarks, even a Raspberry Pi 3 has enough power to run a standard game at 60 fps. Let's say you render a 16:9 240p game: 400x240 at 60 fps, that's 5.7 MPixels/s. Pixel throughput on a Pi 3 of regular layers and sprites is between 50 - 60 MPixels/s, that is 10x than needed. That leaves plenty of room to overdraw multiple scroll planes and sprites, blending, scaling, etc. Affine layers on a Pi 3 is about 8 Mpixels/s, enough to be used smoothly but don't overdraw a lot -the SNES had just one layer on Affine mode in mode 7-. And we're talking about a Pi 3, that is the humblest thing i have on my hands. The Pi 4 doubles that performance, and any low-end PC has even more power.
Posts: 187
Threads: 22
Joined: Dec 2020
Reputation:
1
(08-23-2023, 02:54 PM)megamarc Wrote: Hi!
Basic scaling (Neo-Geo style) is quite cheap, if fact zoomed-in layers have more pixel throughput than regular layers, because as each tile covers more screen space, fewer calls are required to fetch the next tile. Affine transformations (SNES-style) are the opposite: as the source and destination scanning are not parallel, every single pixel must fetch its parent tile, and what position occupies inside the tile. Thus these kind of layers offer the worst performance.
90º rotated tiles and sprites may have more cache misses than regular ones, but the overall performance is somewhat similar and it's a rarely used feature. Horizontal and vertical flips are free. When a tile is fetched, some checks and assignments must be done to determine scanning pointers (accounting for flips and 901 rotations), but once they're setup, all the pixels of the tile are output straight. By this rule, 16x16 tiles offer better performance than 8x8.
However if you check the benchmarks, even a Raspberry Pi 3 has enough power to run a standard game at 60 fps. Let's say you render a 16:9 240p game: 400x240 at 60 fps, that's 5.7 MPixels/s. Pixel throughput on a Pi 3 of regular layers and sprites is between 50 - 60 MPixels/s, that is 10x than needed. That leaves plenty of room to overdraw multiple scroll planes and sprites, blending, scaling, etc. Affine layers on a Pi 3 is about 8 Mpixels/s, enough to be used smoothly but don't overdraw a lot -the SNES had just one layer on Affine mode in mode 7-. And we're talking about a Pi 3, that is the humblest thing i have on my hands. The Pi 4 doubles that performance, and any low-end PC has even more power.
Oh alright, so it's pretty fast!
I tried to do a scanline rendering algorithm and I am around 4ms per frame, at 1000 FPS
This is the algorithm I wrote :
Code: proc blendColor(layer: Layer, colorSrc: var ColorRGBX, colorResult: color.Color) =
case layer.blend:
of NONE:
colorSrc = cast[ColorRGBX](colorResult)
of ADD:
colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) + colorResult)
of SUB:
colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) - colorResult)
of MOD:
colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) mod colorResult)
of MIX25:
colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) | colorResult)
of MIX50:
colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) || colorResult)
of MIX75:
colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) ||| colorResult)
of OR:
colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) or colorResult)
of XOR:
colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) xor colorResult)
of AND:
colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) and colorResult)
proc paintScanlineBitmap(layer: Layer, linePtr: ptr ColorRGBX, width: int, line: int) =
if(layer.bitmap == nil): return
let bm = layer.bitmap
var linePtr = linePtr
for i in 0..<width:
let
idX = cast[uint](i + layer.position.x) mod cast[uint](bm.width)
idY = cast[uint](line + layer.position.y) mod cast[uint](bm.height)
let colIndex = bm[cast[int](idX), cast[int](idY)]
if colIndex != 0:
let color = bm.palette[cast[int](colIndex)]
layer.blendColor(linePtr[], color)
# amigafy(linePtr[])
linePtr[].a = 255
linePtr = cast[ptr ColorRGBX](cast[uint64](linePtr) + cast[uint64](sizeof(ColorRGBX)))
proc paintScanlineTilemap(layer: Layer, linePtr: ptr ColorRGBX, width: int, line: int) =
if(layer.tilemap == nil): return
var linePtr = linePtr
let
tmap = layer.tilemap
for x in 0..<width:
let idX = cast[uint](x + layer.position.x) mod cast[uint](tmap.widthPixels)
let idY = cast[uint](line + layer.position.y) mod cast[uint](tmap.heightPixels)
let tileX = idX div tmap.tileWidth.uint
let tileY = idY div tmap.tileHeight.uint
let tile = tmap[tileX.int, tileY.int]
let t = tile.uint32
if(tile.index == 0 or tile.masked):
let xAddr = x.addr
xAddr[].inc(tmap.tileWidth - 1)
linePtr = cast[ptr ColorRGBX](cast[uint64](linePtr) + cast[uint64](sizeof(ColorRGBX)))
continue
let
tileset = tmap.tilesets[tile.tileset]
var
pixX = idX.int mod tileset.tileWidth
pixY = idY.int mod tileset.tileHeight
# if(tile.rotate):
# let temp = pixY
# pixY = pixX
# pixX = temp
# pixY = min(pixX, tileset.tileHeight - 1)
# pixX = min(temp, tileset.tileWidth - 1)
# pixY = tileset.tileHeight - 1 - pixY
# pixX = tileset.tileWidth - 1 - pixX
if(tile.flipV): pixY = tileset.tileHeight - 1 - pixY
if(tile.flipH): pixX = tileset.tileWidth - 1 - pixX
let
tOffset = tile.index.int * tileset.tileWidth * tileset.tileHeight
pOffset = pixY * tileset.tileWidth + pixX
colIndex = tileset[tOffset + pOffset]
# echo tile.tileset
if(colIndex.int >= tileset.palette.size):
echo tileset
if colIndex != 0:
let color = tileset.palette[colIndex.int]
layer.blendColor(linePtr[], color)
linePtr[].a = 255
# window.screen.pix.data[index + x] = cast[ColorRGBX](color)
# window.screen.pix.data[index + x].a = 255
linePtr = cast[ptr ColorRGBX](cast[uint64](linePtr) + cast[uint64](sizeof(ColorRGBX)))
proc renderScanline(window: Window, line: int) {.inline.} =
# echo line
let index = line * context.width
var myPtr = (window.screen.pix.data[index].addr)
# Draw background color
for i in 0..<context.width:
myPtr[] = cast[ColorRGBX](context.backgroundColor)
myPtr[].a = 255
myPtr = cast[ptr ColorRGBX](cast[uint64](myPtr) + cast[uint64](sizeof(ColorRGBX)))
myPtr = (window.screen.pix.data[index].addr)
# Draw layers
for l in context.layers:
case l.layerType:
of LAYER_BITMAP:
l.paintScanlineBitmap(myPtr, context.width, line)
of LAYER_TILEMAP:
l.paintScanlineTilemap(myPtr, context.width, line)
return
proc renderScreen(window: Window) {.inline.} =
for j in 0..<context.height:
if(context.lineCallback != nil): context.lineCallback(j)
GC_disableMarkAndSweep()
window.renderScanline(j)
GC_enableMarkAndSweep()
I think it's fairly simple, but I wonder if I can optimize further, knowing most of my objects are reference types
Code: e.lineCallback = (
proc(line: int, myPtr = nil.pointer): void =
e.layer(0).y = int(sin((TAU * line.float + offset)/32) * 4)
e.layer(1).x = int(sin((TAU * line.float + offset)/32) * 3)
if((line and 1) == 0):
e.layer(0).x = int(sin((TAU * line.float + offset)/32) * 2 + x)
else:
e.layer(0).x = int(-sin((TAU * line.float + offset)/32) * 2 + x)
e.backgroundColor = Color(r: line.byte, g: 255, b: 255 - line.byte)
)
I also implemented a raster callback, however, I'm not sure if the userdata pointer is an useful feature. I took inspiration from SDL2's audio callback that allows you to pass data with a pointer.
Posts: 673
Threads: 33
Joined: Jan 1970
Reputation:
13
Hi!
Sorry, I can't give advice on reimplementing Tilengine in another language I don't know. I can however add new features based on suggestions, or fix existing bugs
Posts: 187
Threads: 22
Joined: Dec 2020
Reputation:
1
08-26-2023, 05:43 AM
(This post was last modified: 08-26-2023, 11:05 AM by System64.)
(08-26-2023, 01:11 AM)megamarc Wrote: Hi!
Sorry, I can't give advice on reimplementing Tilengine in another language I don't know. I can however add new features based on suggestions, or fix existing bugs
Oh alright, sorry then.
I saw you fixed the black screen bug, I will give it a try. Thanks for fixing that!
I also have another question, what happens if you enable scaling and affine on a layer at the same time?
Edit : Feature idea : Being able to get number of entries of a palette
Posts: 673
Threads: 33
Joined: Jan 1970
Reputation:
13
Hi!
I want to apologize if I sounded rude for not providing support on other languages, but that's the truth. I know that Nim transpiles to C (and Javascript), but I don't know about its architecture or optimizations. So I can't really tell which construct is more performant. As a general rule is good strategy to avoid conditionals as much as possible, and replace them with sequentials, or function pointers. Tilengine makes extensive use of function pointers, and I thought about adding some more on critical places that should increase performance.
For example replace this:
Code: loop while condition:
if a:
do_treatment_a()
else:
do_treatment_b()
with this:
Code: if a:
loop while condition:
do_treatment_a()
else:
loop while condition:
do_treatment_b()
Yes, I've fixed the initial black screen in non-vsync mode. I have yet to upload the builds to itch.io, but you can do a pull from GitHub and build it from sources. Let me know if it gets fixed for you.
You cannot enable scaling and affine on the same layer, as it can only have one state at a given time. The new mode just replaces the old one.
Feature accepted about getting the number of entries on a palette :-) this number is already available internally, just need to add a public function to retrieve it.
|