Vertical callback or per-pixel callback: I understand / How does Tilengine work?
#1
Hi!

I wondered why Tilengine doesn't have vertical raster effects or per-pixel effects, while the Commodore Amiga can change things every 8 pixels, or even less in demos.
Well, I'm trying to create my own graphics engine in Nim with not only horizontal raster effects, but also vertical ones, and both at once.
Aaaand well... it hurts performance a lot. It now makes sense why that kind of feature isn't implemented.

But I wonder about another thing: how do you reach such high performance just drawing a simple tilemap? I can hardly get under 7 ms per frame drawing one layer at a 384x216 resolution.

I also read Tilengine's code to understand how rendering is done. Maybe I didn't understand it well, but it seems you draw each scanline once per layer. Is that right?
Reply
#2
Hi!

The Commodore Amiga has a totally different architecture compared to the consoles and arcade boards that inspired Tilengine's design.

The Amiga doesn't have a hardware tilemap/tileset engine (character generator), but instead a planar framebuffer assisted by a hardware blitter -a unit that copies variable-size chunks of graphics from RAM to the framebuffer-. It's also assisted by the copper, an additional coprocessor that can execute very simple programs in parallel with the main CPU. These basic programs are called "copper lists" and allow writing the registers of the other units -blitter, color palettes, sprites, scroll, etc.- in sync with the retrace beam of the monitor. In fact the copper can only execute two types of instructions: write a register, and wait until the retrace beam reaches a given x,y position. So the Amiga doesn't have a per-pixel interrupt, but a dedicated coprocessor that can block execution until the beam reaches a given position, with a granularity of 4 pixels.
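
As a toy illustration of what a copper list is -just a sequence of WAIT and MOVE entries that the coprocessor walks in sync with the beam- here is a small Nim model (purely illustrative; the register offsets and values are made up, this is not real Amiga code):
Code:
# Toy model of an Amiga copper list: WAIT/MOVE entries walked by the copper, not the CPU.
type
  CopperOp = enum copWait, copMove
  CopperInstr = object
    case op: CopperOp
    of copWait:
      waitX, waitY: int    # beam position to wait for
    of copMove:
      reg: uint16          # custom-chip register offset to write
      value: uint16        # value to write into it

let copperList = @[
  CopperInstr(op: copWait, waitX: 0, waitY: 100),            # wait until scanline 100
  CopperInstr(op: copMove, reg: 0x0180'u16, value: 0x0F00),  # then write a color register
  CopperInstr(op: copWait, waitX: 0, waitY: 150),
  CopperInstr(op: copMove, reg: 0x0180'u16, value: 0x000F)
]
# The copper runs such a list on its own every frame, so the main CPU is never interrupted.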

This is in contrast with console/arcade systems -and Tilengine too- where a horizontal interrupt is generated after each scanline has been completed, and a vertical one when the whole frame has been completed. These interrupts require the main CPU to service them whenever they trigger, pausing whatever the CPU is doing at that moment. There's no such thing as a "per-pixel interrupt".

The rendering loop in Tilengine composites each scanline, from top to bottom, drawing sequentially whatever happens to be at that scanline, and then calls the raster callback. Real hardware composites from left to right, synchronised with the retrace beam. For each pixel the hardware must take the layers, sprites, priorities, blending, etc., combine them all and output the pixel. This is done in dedicated hardware. Tilengine, on the other hand, doesn't work per pixel but per item: it does multiple passes per scanline, overdrawing layers and sprites as needed. This allows much higher performance, but the end result is the same as with traditional hardware. On actual console/arcade hardware you cannot tune video registers in the middle of a scanline.
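
A minimal Nim sketch of that per-scanline, per-item order (a simplified model of the description above, not the actual Tilengine source):
Code:
# Simplified model of the per-scanline, per-item compositing order.
type
  Item = object
    name: string                            # stands in for a layer or a sprite

var
  layers = @[Item(name: "background"), Item(name: "foreground")]
  sprites = @[Item(name: "player")]
  rasterCallback: proc (line: int) = nil    # user hook: change scroll, palettes, ...

proc drawItemScanline(item: Item, line: int) =
  discard  # the real engine outputs every pixel of `item` on `line`, overdrawing what's below

proc renderFrame(height: int) =
  for line in 0 ..< height:                 # top to bottom
    for layer in layers:                    # one pass per layer
      drawItemScanline(layer, line)
    for sprite in sprites:                  # then one pass per sprite
      drawItemScanline(sprite, line)
    if not rasterCallback.isNil:            # per-scanline hook, like a horizontal interrupt
      rasterCallback(line)

renderFrame(216)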

Hope this clarifies these architectures a bit!
Reply
#3
(08-19-2023, 04:56 AM)megamarc Wrote: The Commodore Amiga has a totally different architecture compared to the consoles and arcade boards that inspired Tilengine's design. [...]

This is pretty interesting! I thought you worked per pixel instead of per scanline: calculate everything for one pixel, move to the next one, and once the scanline is fully drawn, execute the callback, then move to the next scanline and redo the process pixel by pixel.
I thought overdrawing was a bad thing, since with 3 layers and a resolution of 256x224 you would process 172,032 pixels per frame. This is why I thought I should compute one pixel at a time instead of one scanline at a time.

Edit: How do you store the tileset pictures in memory? Do you store them in one big byte array, line by line across the whole image? Or do you store the first line of the current tile, then the next one, ... until the end of the tile, and then move on to the next tile? Or some other method?
Reply
#4
Hi!

There are two things that kill performance on a CPU: branching (i.e. conditionals) and scattered memory fetches. Writing a pixel -that is, writing a memory location- is much cheaper. So it's more efficient to do a bit of overdraw but reduce the amount of branching and memory reads. Working per pixel requires lots of test conditions and reads from different sources until the final pixel value is determined. On the other hand, once the start addresses of the items are determined, the pixels of an item are output in a row with very little branching and with high cache coherence: as the source pixels are adjacent in memory, the chances of them already being cached are much higher than with scattered reads. This comes at the cost of a bit of overdraw, but as you have found yourself, it is much more efficient in terms of performance.
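
A toy contrast of the two approaches in Nim (illustrative only, with a simplified one-row-per-tile layout; this is not Tilengine code):
Code:
# Both procs assume dst.len == tiles.len * tileW and one row of pixels stored per tile.

# Per pixel: the tile lookup and the transparency test are repeated for every pixel.
proc drawLinePerPixel(dst: var openArray[uint8], tiles: seq[int],
                      tilePixels: seq[uint8], tileW: int) =
  for x in 0 ..< dst.len:
    let tileIndex = tiles[x div tileW]                 # lookup per pixel
    let c = tilePixels[tileIndex * tileW + (x mod tileW)]
    if c != 0:                                         # branch per pixel
      dst[x] = c

# Per item: resolve the tile once, then copy its whole row of adjacent pixels,
# accepting a bit of overdraw in exchange for tight, cache-friendly loops.
proc drawLinePerItem(dst: var openArray[uint8], tiles: seq[int],
                     tilePixels: seq[uint8], tileW: int) =
  var x = 0
  for tileIndex in tiles:                              # lookup once per tile
    copyMem(addr dst[x], unsafeAddr tilePixels[tileIndex * tileW], tileW)
    x += tileW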


I store the scanlines of tilesets sequentially:
  • 1st row of tile 0
  • 2nd row of tile 0
  • ...
  • nth row of tile 0
  • 1st row of tile 1
  • 2nd row of tile 1
  • ...
  • nth row of tile 1
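
With that layout, finding one row of one tile is a single multiply-add (a small sketch, assuming 8-bit indexed pixels; the names are illustrative):
Code:
# Rows of each tile stored back to back: byte offset of `row` inside tile `tileIndex`.
proc tileRowOffset(tileIndex, row, tileWidth, tileHeight: int): int =
  (tileIndex * tileHeight + row) * tileWidth

# Example with 8x8 tiles: row 3 of tile 5 starts at byte (5*8 + 3) * 8 = 344.
echo tileRowOffset(5, 3, 8, 8)   # prints 344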
Reply
#5
(08-19-2023, 04:28 PM)megamarc Wrote: There are two things that kill performance on a CPU: branching (i.e. conditionals) and scattered memory fetches. [...]

You are right! This is way more efficient!
However, performance might suffer when introducing scaling and affine transformations, especially on multiple layers at once.
Also with rotated tiles, since you don't access contiguous data.

Vertical and horizontal flips are fine though, even if they require some if statements.

Edit: Scaling seems pretty easy: for each pixel you draw, you just map its coordinates back into the source image by a factor (one factor for X and another one for Y).
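
A minimal sketch of that for one scanline, using nearest-neighbour sampling and a fixed-point step (names and details are illustrative, not taken from either engine):
Code:
# Nearest-neighbour horizontal scaling of one scanline.
# A 16.16 fixed-point source step avoids a division per pixel.
proc scaleScanline(dst: var openArray[uint8], src: openArray[uint8], scaleX: float) =
  let step = int(65536.0 / scaleX)    # source advance per destination pixel
  var srcPos = 0                      # fixed-point position in the source line
  for x in 0 ..< dst.len:
    dst[x] = src[min(srcPos shr 16, src.high)]
    srcPos += step

var line = newSeq[uint8](16)
scaleScanline(line, @[1'u8, 2, 3, 4, 5, 6, 7, 8], 2.0)   # 8 source pixels drawn 2x wide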
Reply
#6
Hi!

Basic scaling (Neo-Geo style) is quite cheap; in fact zoomed-in layers have more pixel throughput than regular layers, because as each tile covers more screen space, fewer calls are required to fetch the next tile. Affine transformations (SNES-style) are the opposite: as the source and destination scanning are not parallel, every single pixel must fetch its parent tile and the position it occupies inside the tile. Thus this kind of layer offers the worst performance.
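
In rough Nim, the per-pixel work of an affine layer looks something like this (a simplified sketch with made-up names, not the actual implementation):
Code:
import std/math

# Every destination pixel computes its own source position, then fetches its parent tile.
proc affineScanline(dst: var openArray[uint8], tilemap: seq[seq[int]],
                    tilePixels: seq[uint8], tileSize, line: int,
                    angle, scale: float) =
  let
    mapW = tilemap[0].len * tileSize
    mapH = tilemap.len * tileSize
    ca = cos(angle) / scale
    sa = sin(angle) / scale
  for x in 0 ..< dst.len:
    let
      sx = floorMod(int(float(x) * ca - float(line) * sa), mapW)   # source x, wrapped
      sy = floorMod(int(float(x) * sa + float(line) * ca), mapH)   # source y, wrapped
      tile = tilemap[sy div tileSize][sx div tileSize]             # tilemap fetch per pixel
      offset = (tile * tileSize + sy mod tileSize) * tileSize + sx mod tileSize
    dst[x] = tilePixels[offset]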

90º rotated tiles and sprites may have more cache misses than regular ones, but the overall performance is fairly similar and it's a rarely used feature. Horizontal and vertical flips are free: when a tile is fetched, some checks and assignments must be done to set up the scanning pointers (accounting for flips and 90º rotations), but once they're set up, all the pixels of the tile are output straight. By this rule, 16x16 tiles offer better performance than 8x8 ones.

However, if you check the benchmarks, even a Raspberry Pi 3 has enough power to run a standard game at 60 fps. Let's say you render a 16:9 240p game: 400x240 at 60 fps, that's about 5.8 MPixels/s. Pixel throughput on a Pi 3 for regular layers and sprites is between 50 and 60 MPixels/s, about 10x what's needed. That leaves plenty of room to overdraw multiple scroll planes and sprites, blending, scaling, etc. Affine layers on a Pi 3 run at about 8 MPixels/s, enough to be used smoothly as long as you don't overdraw a lot -the SNES had just one affine layer in Mode 7-. And we're talking about a Pi 3, which is the humblest thing I have on my hands. The Pi 4 doubles that performance, and any low-end PC has even more power.
Reply
#7
(08-23-2023, 02:54 PM)megamarc Wrote: Basic scaling (Neo-Geo style) is quite cheap; in fact zoomed-in layers have more pixel throughput than regular layers [...]

Oh alright, so it's pretty fast!
I tried to write a scanline rendering algorithm and I'm around 4 ms per frame, at 1000 FPS.

This is the algorithm I wrote:
Code:
proc blendColor(layer: Layer, colorSrc: var ColorRGBX, colorResult: color.Color) =
    case layer.blend:
    of NONE:
        colorSrc = cast[ColorRGBX](colorResult)
    of ADD:
        colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) + colorResult)
    of SUB:
        colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) - colorResult)
    of MOD:
        colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) mod colorResult)
    of MIX25:
        colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) | colorResult)
    of MIX50:
        colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) || colorResult)
    of MIX75:
        colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) ||| colorResult)
    of OR:
        colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) or colorResult)
    of XOR:
        colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) xor colorResult)
    of AND:
        colorSrc = cast[ColorRGBX](cast[color.Color](colorSrc) and colorResult)

proc paintScanlineBitmap(layer: Layer, linePtr: ptr ColorRGBX, width: int, line: int) =
    if(layer.bitmap == nil): return
    let bm = layer.bitmap
    var linePtr = linePtr
    for i in 0..<width:
        let
            idX = cast[uint](i + layer.position.x) mod cast[uint](bm.width)
            idY = cast[uint](line + layer.position.y) mod cast[uint](bm.height)

        let colIndex = bm[cast[int](idX), cast[int](idY)]
        if colIndex != 0:
            let color = bm.palette[cast[int](colIndex)]
            layer.blendColor(linePtr[], color)
            # amigafy(linePtr[])
            linePtr[].a = 255
        linePtr = cast[ptr ColorRGBX](cast[uint64](linePtr) + cast[uint64](sizeof(ColorRGBX)))

proc paintScanlineTilemap(layer: Layer, linePtr: ptr ColorRGBX, width: int, line: int) =
    if(layer.tilemap == nil): return
    var linePtr = linePtr
    let
        tmap = layer.tilemap
    for x in 0..<width:
        let idX = cast[uint](x + layer.position.x) mod cast[uint](tmap.widthPixels)
        let idY = cast[uint](line + layer.position.y) mod cast[uint](tmap.heightPixels)
        let tileX = idX div tmap.tileWidth.uint
        let tileY = idY div tmap.tileHeight.uint

        let tile = tmap[tileX.int, tileY.int]
        let t = tile.uint32
        if tile.index == 0 or tile.masked:
            # Transparent or masked tile: just skip this pixel.
            # (Skipping the whole tile at once would need a while loop, since the
            # counter of a Nim for loop cannot be advanced from inside the body.)
            linePtr = cast[ptr ColorRGBX](cast[uint64](linePtr) + cast[uint64](sizeof(ColorRGBX)))
            continue
        let
            tileset = tmap.tilesets[tile.tileset]
        var
            pixX = idX.int mod tileset.tileWidth
            pixY = idY.int mod tileset.tileHeight
       
        # if(tile.rotate):
        #    let temp = pixY
        #    pixY = pixX
        #    pixX = temp
            # pixY = min(pixX, tileset.tileHeight - 1)
            # pixX = min(temp, tileset.tileWidth - 1)
            # pixY = tileset.tileHeight - 1 - pixY
            # pixX = tileset.tileWidth - 1 - pixX
           
        if(tile.flipV): pixY = tileset.tileHeight - 1 - pixY
        if(tile.flipH): pixX = tileset.tileWidth - 1 - pixX
        let
            tOffset = tile.index.int * tileset.tileWidth * tileset.tileHeight
            pOffset = pixY * tileset.tileWidth + pixX

            colIndex = tileset[tOffset + pOffset]

        # echo tile.tileset
        if(colIndex.int >= tileset.palette.size):
            echo tileset

        if colIndex != 0:

            let color = tileset.palette[colIndex.int]

            layer.blendColor(linePtr[], color)
            linePtr[].a = 255

        # window.screen.pix.data[index + x] = cast[ColorRGBX](color)
        # window.screen.pix.data[index + x].a = 255
        linePtr = cast[ptr ColorRGBX](cast[uint64](linePtr) + cast[uint64](sizeof(ColorRGBX)))



proc renderScanline(window: Window, line: int) {.inline.} =
    # echo line
    let index = line * context.width
    var myPtr = (window.screen.pix.data[index].addr)

    # Draw background color
    for i in 0..<context.width:
        myPtr[] = cast[ColorRGBX](context.backgroundColor)
        myPtr[].a = 255
        myPtr = cast[ptr ColorRGBX](cast[uint64](myPtr) + cast[uint64](sizeof(ColorRGBX)))
       
    myPtr = (window.screen.pix.data[index].addr)
    # Draw layers
    for l in context.layers:
        case l.layerType:
        of LAYER_BITMAP:
            l.paintScanlineBitmap(myPtr, context.width, line)
        of LAYER_TILEMAP:
            l.paintScanlineTilemap(myPtr, context.width, line)

    return
       

proc renderScreen(window: Window) {.inline.} =
    for j in 0..<context.height:
        if(context.lineCallback != nil): context.lineCallback(j)
        GC_disableMarkAndSweep()
        window.renderScanline(j)
        GC_enableMarkAndSweep()

I think it's fairly simple, but I wonder if I can optimize further, knowing most of my objects are reference types
Code:
e.lineCallback = (
    proc(line: int, myPtr: pointer = nil) =
      e.layer(0).y = int(sin((TAU * line.float + offset)/32) * 4)
      e.layer(1).x = int(sin((TAU * line.float + offset)/32) * 3)
      if((line and 1) == 0):
        e.layer(0).x = int(sin((TAU * line.float + offset)/32) * 2 + x)
      else:
        e.layer(0).x = int(-sin((TAU * line.float + offset)/32) * 2 + x)
      e.backgroundColor = Color(r: line.byte, g: 255, b: 255 - line.byte)

      )
I also implemented a raster callback; however, I'm not sure if the userdata pointer is a useful feature. I took inspiration from SDL2's audio callback, which allows you to pass data with a pointer.
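
A minimal sketch of that pattern -a callback plus an opaque user pointer, in the style of SDL2's audio callback- with made-up names:
Code:
# The engine stores the callback together with an opaque pointer supplied by the user,
# and passes that pointer back on every call, so the callback needs no globals.
type
  LineCallback = proc (line: int, userData: pointer) {.nimcall.}
  EngineState = object          # whatever state the callback wants to reach
    scrollBase: int

var
  lineCallback: LineCallback
  lineCallbackData: pointer

proc setLineCallback(cb: LineCallback, userData: pointer) =
  lineCallback = cb
  lineCallbackData = userData

proc myCallback(line: int, userData: pointer) =
  let state = cast[ptr EngineState](userData)
  state.scrollBase = line div 8   # use the passed-in state directly

var state = EngineState(scrollBase: 0)
setLineCallback(myCallback, addr state)
# inside the render loop the engine would call: lineCallback(line, lineCallbackData)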
Reply
#8
Hi!

Sorry, I can't give advice on reimplementing Tilengine in another language I don't know. I can, however, add new features based on suggestions, or fix existing bugs.
Reply
#9
(08-26-2023, 01:11 AM)megamarc Wrote: Sorry, I can't give advice on reimplementing Tilengine in another language I don't know. [...]

Oh alright, sorry then.
I saw you fixed the black screen bug, I will give it a try. Thanks for fixing that!

I also have another question, what happens if you enable scaling and affine on a layer at the same time?

Edit: Feature idea: being able to get the number of entries of a palette.
Reply
#10
Hi!

I want to apologize if I sounded rude for not providing support for other languages, but that's the truth. I know that Nim transpiles to C (and JavaScript), but I don't know about its architecture or optimizations, so I can't really tell which construct is more performant. As a general rule, it's a good strategy to avoid conditionals as much as possible and replace them with sequential code or function pointers. Tilengine makes extensive use of function pointers, and I've thought about adding some more in critical places, which should increase performance.

For example replace this:

Code:
while condition:
  if a:
    do_treatment_a()
  else:
    do_treatment_b()

with this:

Code:
if a:
  while condition:
    do_treatment_a()
else:
  while condition:
    do_treatment_b()
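
Applied to a blend routine like the blendColor case statement above, the function-pointer idea could look roughly like this (a sketch with simplified one-channel pixels and made-up names):
Code:
# Resolve the blend proc once per layer, instead of a case/if for every pixel.
type
  BlendProc = proc (dst: var uint8, src: uint8) {.nimcall.}
  ToyLayer = object
    blend: BlendProc                 # chosen once, when the blend mode is set

proc blendNone(dst: var uint8, src: uint8) = dst = src
proc blendAdd(dst: var uint8, src: uint8) =
  dst = uint8(min(int(dst) + int(src), 255))    # saturating add on a single channel

proc drawSpan(layer: ToyLayer, dst: var openArray[uint8], src: openArray[uint8]) =
  for i in 0 ..< dst.len:
    layer.blend(dst[i], src[i])      # one indirect call, no blend-mode branching per pixel

var dstLine = newSeq[uint8](8)
let srcLine = @[10'u8, 20, 30, 40, 50, 60, 250, 255]
let layer = ToyLayer(blend: blendAdd)
drawSpan(layer, dstLine, srcLine)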

Yes, I've fixed the initial black screen in non-vsync mode. I have yet to upload the builds to itch.io, but you can do a pull from GitHub and build it from sources. Let me know if it gets fixed for you.

You cannot enable scaling and affine on the same layer, as it can only have one state at a given time. The new mode just replaces the old one.

Feature accepted about getting the number of entries of a palette :-) This number is already available internally; I just need to add a public function to retrieve it.
Reply

